AI-based burn image assessment: Reliability and clinical error patterns of multimodal large language models in a repeated-inference study

Accurate assessment of burn depth and total body surface area (TBSA) is critical for clinical decision-making; however, it remains subjective and prone to interobserver variability. Multimodal large language models (MLLMs) are increasingly encountered in clinical contexts, but whether these systems can reliably assess burn images remains unclear. We evaluated four MLLMs (GPT-5.4 Pro, Grok 4.1, Gemini 3.1 Pro, and Claude Opus 4.6) on 50 clinical burn photographs using a repeated-inference design with five independent runs per model. Burn depth classification was assessed in numeric and text-based formats, alongside ordinal TBSA estimation. Performance varied across the models, with burn depth accuracy ranging from 34.0 ± 6.5% to 76.4 ± 6.8% and TBSA accuracy from 32.8 ± 9.4% to 68.4 ± 3.3%. Inter-run reliability (Fleiss’ κ) ranged from slight (κ = 0.171) to almost perfect (κ = 0.916), demonstrating response variability not captured by single-query evaluations. Notably, no model combined high accuracy and high reliability, indicating a dissociation between performance and consistency. All models showed a tendency toward overestimation of burn depth, including assignment of fourth-degree burns despite their absence in the dataset. Error direction analysis revealed model-specific and task-dependent biases, including opposing patterns within the same model. Internal consistency between numeric and text classifications was near-perfect (99.6–100%), indicating format-invariant but systematically biased outputs. These findings demonstrate that MLLM performance is characterized by stochastic response instability invisible to single-query evaluations. Such inconsistency for identical inputs represents a fundamental limitation for workflows requiring consistent outputs across repeated evaluations.
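For readers unfamiliar with the inter-run reliability statistic, the following is a minimal sketch of how Fleiss’ κ can be computed when the "raters" are repeated runs of one model over the same images. It is not the authors' analysis code. The function names `fleiss_kappa` and `landis_koch`, the input layout (one list of run labels per image), and the use of the Landis and Koch bands to produce labels such as "slight" and "almost perfect" are assumptions made for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for ratings[i] = labels given to image i across repeated runs.

    Hypothetical helper: every image must have the same number of runs (>= 2).
    """
    n_items = len(ratings)
    if n_items == 0:
        raise ValueError("need at least one rated item")
    n_runs = len(ratings[0])
    if n_runs < 2 or any(len(r) != n_runs for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of runs")

    category_totals = Counter()
    agreement_sum = 0.0
    for labels in ratings:
        counts = Counter(labels)
        category_totals.update(counts)
        # Share of agreeing run pairs for this image.
        pair_agree = sum(c * c for c in counts.values()) - n_runs
        agreement_sum += pair_agree / (n_runs * (n_runs - 1))

    p_bar = agreement_sum / n_items  # mean observed agreement
    total = n_items * n_runs
    # Agreement expected by chance from the pooled category proportions.
    p_e = sum((c / total) ** 2 for c in category_totals.values())
    if p_e == 1.0:
        return float("nan")  # only one category was ever used: kappa undefined
    return (p_bar - p_e) / (1.0 - p_e)


def landis_koch(kappa):
    """Map kappa to the Landis and Koch descriptive bands (assumed convention)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy data, not from the study: three images, five runs each, full agreement.
toy_runs = [[1, 1, 1, 1, 1], [2, 2, 2, 2, 2], [3, 3, 3, 3, 3]]
print(landis_koch(fleiss_kappa(toy_runs)))
```

Because κ measures only agreement among runs, it is independent of whether the labels match the ground truth, which is why accuracy and reliability can dissociate as reported above.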

https://www.jprasurg.com/article/S1748-6815(26)00337-2/fulltext