Abstract
Multimodal large language models (MLLMs) provide new opportunities for blind and low vision (BLV) people to access visual information in their daily lives. However, these models often produce errors that are difficult to detect without sight, posing safety and social risks in scenarios from medication identification to outfit selection. While BLV MLLM users use creative workarounds such as cross-checking between tools and consulting sighted individuals, these approaches are often time-consuming and impractical. We explore how systematically surfacing variations across multiple MLLM responses can support BLV users to detect unreliable information without visually inspecting the image. We contribute a design space for eliciting and presenting variations in MLLM descriptions, a prototype system implementing three variation presentation styles, and findings from a user study with 15 BLV participants. Our results demonstrate that presenting variations significantly increases users’ ability to identify unreliable claims (by 4.9x using our approach compared to single descriptions) and significantly decreases perceived reliability of MLLM responses. 14 of 15 participants preferred seeing variations of MLLM responses over a single description, and all expressed interest in using our system for tasks from understanding a tornado’s path to posting an image on social media.
Design and Pipeline
| Dimension | Alternatives | | | |
|---|---|---|---|---|
| Elicitation of Variants | `Trials` | `Prompts` | `Models` | Images |
| Comparison Support | `List` | `Variation-aware description` | `Variation summary` | |
| Comparison Granularity | Words | `Atomic facts` | Sentences | Responses |
| Support Indicator | None | `Language` | `Percentage` | `Source` |
| Provenance Indicator | None | `Trials` | Prompts | `Models` |
| Modality | `Text` | Sound | Visualization | Haptics |
Design space for surfacing MLLM variations. Alternatives marked with backticks indicate the choices implemented in our prototype system.
We elicit variants by running multiple trials (repeated queries), using different prompts, and querying multiple models. We then compare MLLM-generated descriptions at the granularity of atomic facts (individual descriptive claims) and present variations using a list format, a variation-aware description, and a separate variation summary.
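As a rough sketch of the elicitation stage, the prototype's variant collection can be thought of as iterating over every (model, prompt, trial) combination. The model names, prompts, and trial count below are illustrative, and `query_fn` is a hypothetical stand-in for the real MLLM API calls:

```python
from itertools import product
from typing import Callable

# Illustrative settings; the actual models, prompts, and trial count
# used in the prototype may differ.
MODELS = ["GPT", "Gemini", "Claude"]
PROMPTS = [
    "Describe this image in detail.",
    "List everything visible in this image.",
]
N_TRIALS = 3  # repeated queries per (model, prompt) pair

def elicit_variants(image_path: str,
                    query_fn: Callable[[str, str, str], str]) -> list[dict]:
    """Collect one description per (model, prompt, trial) combination.

    query_fn(model, prompt, image_path) wraps whichever MLLM API is used.
    """
    variants = []
    for model, prompt in product(MODELS, PROMPTS):
        for trial in range(N_TRIALS):
            variants.append({
                "model": model,
                "prompt": prompt,
                "trial": trial,
                "text": query_fn(model, prompt, image_path),
            })
    return variants  # len(MODELS) * len(PROMPTS) * N_TRIALS descriptions
```

Each collected description is then decomposed into atomic facts before the comparison step.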
Support Indicators
We provide three support indicators (language, percentage, and source) that convey how many responses, and from which models, agree with each claim.
| None | Language | Percentage | Source |
|---|---|---|---|
| There are two white chairs on the left and a grey sofa on the right. At the center there is a white coffee table with a marble or glass or wood top and a gold base. There is a built-in shelf on the back wall with decorative items, like books and a television. | There are two white (well-supported) chairs on the left and a grey sofa on the right. At the center there is a white coffee table with a marble (moderately supported) or glass (poorly supported) or wood (very little support) top and a gold base. There is a built-in shelf on the back wall with decorative items, like books (moderately supported) and a television (moderately supported). | There are two white (100%) chairs on the left and a grey sofa on the right. At the center there is a white coffee table with a marble (56%) or glass (33%) or wood (11%) top and a gold base. There is a built-in shelf on the back wall with decorative items, like books (33%) and a television (33%). | There are two white (3 of 3 GPT, 3 of 3 Gemini, 3 of 3 Claude) chairs on the left and a grey sofa on the right. At the center there is a white coffee table with a marble (3 of 3 GPT, 2 of 3 Gemini) or glass (3 of 3 Claude) or wood (1 of 3 Gemini) top and a gold base. There is a built-in shelf on the back wall with decorative items, like books (3 of 3 GPT) and a television (3 of 3 Gemini). |
Variation-aware description without support indicator and with three variant support indicators (Language, Percentage, Source). Agreements (top, green), disagreements (middle, red), and unique mentions (bottom, blue) are highlighted.
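Assuming each response has already been decomposed into a set of atomic-fact strings, the percentage and source indicators above can be computed by counting, per fact, how many responses support it overall and per model. A minimal sketch (the dict shapes are hypothetical, not the prototype's actual data model):

```python
from collections import Counter, defaultdict

def support_indicators(variants: list[dict]) -> dict:
    """Compute Percentage and Source indicators for each atomic fact.

    variants: list of dicts with a 'model' name and a 'facts' set of
    atomic-fact strings extracted from that response.
    """
    total = len(variants)
    fact_count = Counter()                      # fact -> responses supporting it
    per_model = defaultdict(Counter)            # fact -> model -> supporting count
    model_trials = Counter(v["model"] for v in variants)  # responses per model
    for v in variants:
        for fact in v["facts"]:
            fact_count[fact] += 1
            per_model[fact][v["model"]] += 1
    indicators = {}
    for fact, n in fact_count.items():
        source = ", ".join(
            f"{per_model[fact][m]} of {model_trials[m]} {m}"
            for m in sorted(per_model[fact])
        )
        indicators[fact] = {
            "percentage": round(100 * n / total),  # e.g. 5 of 9 -> 56
            "source": source,                      # e.g. "3 of 3 GPT, 2 of 3 Gemini"
        }
    return indicators
```

For example, a fact supported by all 3 GPT trials and 2 of 3 Gemini trials out of 9 total responses yields the 56% shown in the table.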
Results
Surfacing variations in MLLM responses significantly increased the number of unreliable claims identified: 4.9x with our variation summary and 4.2x with the list, compared to presenting a single description. Variations also significantly decreased the perceived reliability of MLLM responses, from 5.78 of 7 for a single description to 4.76 for a list of descriptions and 3.93 for our variation summary.
Right: average perceived reliability rating (1 = not reliable at all, 7 = most reliable) overall and in each image category.
Error bars represent a 95% confidence interval. We applied the Friedman test followed by pairwise Wilcoxon signed-rank tests with Bonferroni correction. Significance is marked as * p < 0.05, ** p < 0.01, and *** p < 0.001.
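The analysis above can be reproduced on paired per-participant ratings with SciPy: an omnibus Friedman test across the three presentation conditions, then pairwise Wilcoxon signed-rank tests with Bonferroni correction. A sketch with illustrative ratings (the actual study data would be substituted):

```python
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

def compare_conditions(ratings: dict[str, list[float]]) -> dict:
    """Friedman omnibus test plus Bonferroni-corrected pairwise Wilcoxon tests.

    ratings: condition name -> paired per-participant ratings
    (same participant order in every list).
    """
    names = list(ratings)
    _, p_omnibus = friedmanchisquare(*ratings.values())
    pairs = list(combinations(names, 2))
    pairwise = {}
    for a, b in pairs:
        _, p = wilcoxon(ratings[a], ratings[b])
        pairwise[(a, b)] = min(1.0, p * len(pairs))  # Bonferroni correction
    return {"friedman_p": p_omnibus, "pairwise_p": pairwise}
```

With 15 participants and three conditions, three pairwise tests are run, so each raw Wilcoxon p-value is multiplied by 3.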
11 of 15 participants ranked our variation summary (ours) as their favorite option, with 9 of 15 ranking it as their second favorite, while only 5 participants ranked the list or single description as their first or second favorite, indicating strong support for our aggregated variation approaches.
Right: participants’ preference rankings for support indicators (1 = most preferred, 4 = least preferred).
All BLV participants wanted to use our variation surfacing prototype in the future for a variety of purposes, from high-stakes scenarios such as assessing the path of an incoming tornado to obtaining subjective critiques of social media posts.
BibTeX
@inproceedings{10.1145/3663547.3746393,
  author = {Chen, Meng and Iyer, Akhil and Pavel, Amy},
  title = {Surfacing Variations to Calibrate Perceived Reliability of MLLM-generated Image Descriptions},
  year = {2025},
  isbn = {9798400706769},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3663547.3746393},
  doi = {10.1145/3663547.3746393},
  booktitle = {Proceedings of the 27th International ACM SIGACCESS Conference on Computers and Accessibility},
  articleno = {66},
  numpages = {17},
  keywords = {Accessibility, Image Descriptions, Multimodal Large Language Models, Variations, AI Errors, Trust},
  location = {Denver, CO, USA},
  series = {ASSETS '25}
}
Acknowledgments
This work was supported in part by a Notre Dame-IBM Technology Ethics Lab Award.