AI-based burn image assessment: Reliability and clinical error patterns of multimodal large language models in a repeated-inference study

Accurate assessment of burn depth and total body surface area (TBSA) is critical for clinical decision-making; however, it remains subjective and prone to interobserver variability. Multimodal large language models (MLLMs) are increasingly encountered in clinical contexts, but whether these systems can reliably assess burn images remains unclear. We evaluated four MLLMs (GPT-5.4 Pro, Grok 4.1,…