Surfacing Variations to Calibrate Perceived Reliability of MLLM-Generated Image Descriptions

ASSETS 2025
Overview of our approach. (A) Input image and prompt: a photo of a home interior wall with framed art, a chair holding a laundry basket, and a small floral-covered side table; prompt: "Describe the room setting. Does this wall setting look okay?" (B) Raw image descriptions from three MLLMs (GPT-4o, Gemini-1.5-Pro, Claude-3.7-Sonnet). (C) The variation-aware description aggregates all model outputs into a hierarchical markdown document, with major variations highlighted in indigo: the room is a living space (bedroom or den); the walls are soft green or gray and textured; the main furniture is a bed, armchair, or loveseat; other noted items include the laundry basket on a chair and a small side table; wall decor includes a larger artwork and a red-flower print; subjective opinions split between cohesive/cozy and cluttered. (D) The variation summary further surfaces key agreements (living space, small table, framed decor), disagreements (main furniture type; decor cohesion), and unique mentions across models (GPT: patterned fabric; Gemini: hanging tassel; Claude: traditional, vintage style).

Abstract

Multimodal large language models (MLLMs) provide new opportunities for blind and low vision (BLV) people to access visual information in their daily lives. However, these models often produce errors that are difficult to detect without sight, posing safety and social risks in scenarios from medication identification to outfit selection. While BLV MLLM users use creative workarounds such as cross-checking between tools and consulting sighted individuals, these approaches are often time-consuming and impractical. We explore how systematically surfacing variations across multiple MLLM responses can support BLV users to detect unreliable information without visually inspecting the image. We contribute a design space for eliciting and presenting variations in MLLM descriptions, a prototype system implementing three variation presentation styles, and findings from a user study with 15 BLV participants. Our results demonstrate that presenting variations significantly increases users’ ability to identify unreliable claims (by 4.9x using our approach compared to single descriptions) and significantly decreases perceived reliability of MLLM responses. 14 of 15 participants preferred seeing variations of MLLM responses over a single description, and all expressed interest in using our system for tasks from understanding a tornado’s path to posting an image on social media.

Design and Pipeline

Elicitation of Variants: Trials, Prompts, Models, Images
Comparison Support: List, Variation-aware description, Variation summary
Comparison Granularity: Words, Atomic facts, Sentences, Responses
Support Indicator: None, Percentage, Language, Source
Provenance Indicator: None, Trials, Prompts, Models
Modality: Text, Sound, Visualization, Haptics

Design space for surfacing MLLM variations. Our prototype implements trials, prompts, and models for elicitation; all three comparison supports; atomic-fact granularity; the language, percentage, and source support indicators (with none as a baseline); and the text modality.

We elicit variants by running multiple trials (repeated queries), using different prompts, and querying multiple models. We then compare MLLM-generated descriptions at the granularity of atomic facts (individual descriptive claims) and present variations using a list format, a variation-aware description, and a separate variation summary.
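To illustrate the comparison step, here is a minimal sketch of grouping atomic facts into agreements, disagreements, and unique mentions. The `summarize_variations` helper and its input format are our illustrative assumptions; naive exact-string matching stands in for the semantic matching a real pipeline would need to align paraphrased claims.

```python
from collections import defaultdict

def summarize_variations(facts_by_model):
    """Group atomic facts into agreements (all models), disagreements
    (some but not all models), and unique mentions (one model), given
    per-model lists of extracted facts. Exact-string matching is a
    stand-in for semantic fact alignment."""
    support = defaultdict(set)
    for model, facts in facts_by_model.items():
        for fact in facts:
            support[fact].add(model)
    n_models = len(facts_by_model)
    agreements = [f for f, m in support.items() if len(m) == n_models]
    disagreements = [f for f, m in support.items() if 1 < len(m) < n_models]
    unique = [f for f, m in support.items() if len(m) == 1]
    return agreements, disagreements, unique

facts = {
    "GPT": ["living space", "small table", "patterned fabric"],
    "Gemini": ["living space", "small table", "hanging tassel"],
    "Claude": ["living space", "small table", "vintage style"],
}
agree, disagree, unique = summarize_variations(facts)
```

With these inputs, "living space" and "small table" land in agreements and the three model-specific observations land in unique mentions, mirroring the variation summary in the teaser figure.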

Support Indicators

We provide three support indicators (language, percentage, and source) that show how many responses, and from which models, support each claim.

None: There are two white chairs on the left and a grey sofa on the right. At the center there is a white coffee table with a marble or glass or wood top and a gold base. There is a built-in shelf on the back wall with decorative items, like books and a television.

Language: There are two white (well-supported) chairs on the left and a grey sofa on the right. At the center there is a white coffee table with a marble (moderately supported) or glass (poorly supported) or wood (very little support) top and a gold base. There is a built-in shelf on the back wall with decorative items, like books (moderately supported) and a television (moderately supported).

Percentage: There are two white (100%) chairs on the left and a grey sofa on the right. At the center there is a white coffee table with a marble (56%) or glass (33%) or wood (11%) top and a gold base. There is a built-in shelf on the back wall with decorative items, like books (33%) and a television (33%).

Source: There are two white (3 of 3 GPT, 3 of 3 Gemini, 3 of 3 Claude) chairs on the left and a grey sofa on the right. At the center there is a white coffee table with a marble (3 of 3 GPT, 2 of 3 Gemini) or glass (3 of 3 Claude) or wood (1 of 3 Gemini) top and a gold base. There is a built-in shelf on the back wall with decorative items, like books (3 of 3 GPT) and a television (3 of 3 Gemini).

Variation-aware description without support indicator and with three variant support indicators (Language, Percentage, Source). Agreements (top, green), disagreements (middle, red), and unique mentions (bottom, blue) are highlighted.
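As a sketch of how such indicators could be rendered from per-model agreement counts, the snippet below reproduces the three formats for the "marble top" claim (3 of 3 GPT trials plus 2 of 3 Gemini trials = 5 of 9 responses ≈ 56%). The `render_support` helper, its count format, and the language-label thresholds are our illustrative assumptions, not the system's exact implementation.

```python
def render_support(counts, trials_per_model=3, n_models=3, style="percentage"):
    """Render a support indicator for one claim from per-model agreement
    counts, e.g. {"GPT": 3, "Gemini": 2}. Language-label thresholds are
    illustrative guesses, not the paper's exact cutoffs."""
    total = trials_per_model * n_models  # e.g. 3 models x 3 trials = 9 responses
    pct = round(100 * sum(counts.values()) / total)
    if style == "percentage":
        return f"{pct}%"
    if style == "source":
        # one "k of n Model" clause per model that supported the claim
        return ", ".join(f"{n} of {trials_per_model} {m}" for m, n in counts.items())
    # language: map percentage bands to hedging phrases
    if pct == 100:
        return "well-supported"
    if pct >= 50:
        return "moderately supported"
    if pct >= 25:
        return "poorly supported"
    return "very little support"
```

For example, `render_support({"GPT": 3, "Gemini": 2})` yields "56%", and the source style yields "3 of 3 GPT, 2 of 3 Gemini", matching the marble-top claim above.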

Results

Surfacing variations in MLLM responses significantly increased the number of unreliable claims participants identified: 4.9x more with our variation-aware presentations and 4.2x more with the list, compared to a single description. Variations also significantly decreased the perceived reliability of MLLM responses, from 5.78 of 7 for a single description to 4.76 for a list of descriptions and 3.93 for ours.

Two bar charts compare three presentation styles—Single (red), List (yellow), and Ours (blue)—on two metrics: the number of identified unreliable claims (left) and perceived reliability (right). The left chart shows that Ours led to significantly more unreliable claims being identified across all categories (Overall, Model Limitation, Image Quality, and Subjectivity). The right chart shows Single descriptions were rated highest in perceived reliability, followed by List, with Ours rated lowest in several categories. Statistical significance is indicated with asterisks (* p < 0.05, ** p < 0.01, *** p < 0.001).
Left: average identified unreliable claims reported by participants overall and in each image category.
Right: average perceived reliability rating (1 = not reliable at all, 7 = most reliable) overall and in each image category.
Error bars represent a 95% confidence interval. We applied the Friedman test followed by pairwise Wilcoxon signed-rank tests with Bonferroni correction. Significance is marked as * p < 0.05, ** p < 0.01, and *** p < 0.001.
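For reference, the omnibus Friedman statistic used in this analysis can be computed directly from within-participant ranks. Below is a pure-Python sketch with illustrative data rather than the study's actual ratings; it averages ranks for ties but omits the tie-correction term, so it simplifies what a statistics package computes. Significant results would then be followed up with pairwise Wilcoxon signed-rank tests, comparing each p-value against a Bonferroni-adjusted alpha.

```python
def friedman_statistic(ratings):
    """Friedman chi-square statistic for a table of ratings:
    one row per participant, one column per condition.
    Ties get average ranks; no tie-correction term."""
    n = len(ratings)      # participants
    k = len(ratings[0])   # conditions
    rank_sums = [0.0] * k
    for row in ratings:
        order = sorted(range(k), key=lambda j: row[j])  # ascending by rating
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1  # extend run of tied ratings
            avg = (i + j) / 2 + 1  # average of tied rank positions (1-based)
            for t in range(i, j + 1):
                ranks[order[t]] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    # chi^2_F = 12 / (n k (k+1)) * sum(R_j^2) - 3 n (k+1)
    return 12 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3 * n * (k + 1)
```

With perfectly consistent rankings across three participants and three conditions, the statistic reaches its maximum of 6.0 for k = 3, n = 3; identical ratings within each participant yield 0.0.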

11 of 15 participants ranked our variation summary as their favorite option, and 9 of 15 ranked the variation-aware description as their second favorite; only 5 participants placed the list or single description as their first or second favorite, indicating strong support for our aggregated variation approaches.

Two horizontal stacked bar charts. Left: titled Preference for Presentation Style, compares user rankings (1 = most preferred, 4 = least preferred) across four styles: Variation Summary, Variation-Aware Description, List of Multiple Descriptions, and Single Description. Variation Summary received the most Rank 1 votes (11), followed by Variation-Aware Description (9). Single Description received the most Rank 4 votes (7), indicating it was the least preferred overall. Each bar is color-coded by rank. Right: titled Preference for Support Indicator, compares user rankings (1 = most preferred, 4 = least preferred) for four indicators: Source, Percentage, None, and Language. Source received the most Rank 1 and Rank 2 votes combined (5 and 7), indicating a strong preference. Language received the most Rank 4 votes (7), making it the least preferred overall. Each bar is divided into four color-coded segments representing ranks 1 through 4.
Left: Participants’ preference for variation presentation styles and the single description.
Right: Participants’ preference for support indicators (1 = most preferred, 4 = least preferred).

All BLV participants wanted to use our variation surfacing prototype in the future for a variety of purposes from high-stakes scenarios such as assessing the path of an incoming tornado to obtaining subjective critiques for social media posts.

BibTeX

@inproceedings{10.1145/3663547.3746393,
author = {Chen, Meng and Iyer, Akhil and Pavel, Amy},
title = {Surfacing Variations to Calibrate Perceived Reliability of MLLM-generated Image Descriptions},
year = {2025},
isbn = {9798400706769},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3663547.3746393},
doi = {10.1145/3663547.3746393},
abstract = {Multimodal large language models (MLLMs) provide new opportunities for blind and low vision (BLV) people to access visual information in their daily lives. However, these models often produce errors that are difficult to detect without sight, posing safety and social risks in scenarios from medication identification to outfit selection. While BLV MLLM users use creative workarounds such as cross-checking between tools and consulting sighted individuals, these approaches are often time-consuming and impractical. We explore how systematically surfacing variations across multiple MLLM responses can support BLV users to detect unreliable information without visually inspecting the image. We contribute a design space for eliciting and presenting variations in MLLM descriptions, a prototype system implementing three variation presentation styles, and findings from a user study with 15 BLV participants. Our results demonstrate that presenting variations significantly increases users’ ability to identify unreliable claims (by 4.9x using our approach compared to single descriptions) and significantly decreases perceived reliability of MLLM responses. 14 of 15 participants preferred seeing variations of MLLM responses over a single description, and all expressed interest in using our system for tasks from understanding a tornado’s path to posting an image on social media.},
booktitle = {Proceedings of the 27th International ACM SIGACCESS Conference on Computers and Accessibility},
articleno = {66},
numpages = {17},
keywords = {Accessibility, Image Descriptions, Multimodal Large Language Models, Variations, AI Errors, Trust},
location = {Denver, CO, USA},
series = {ASSETS '25}
}

Acknowledgments

This work was supported in part by a Notre Dame-IBM Technology Ethics Lab Award.