Human-AI collaboration (H + AI) using large language models (LLMs) offers a promising approach to enhancing clinical reasoning, documentation, and interpretation tasks. Following PRISMA 2020 (PROSPERO registration: CRD420251068272), we systematically compared H + AI with human-only (H) workflows, searching four databases through June 28, 2025. Ten peer-reviewed studies met eligibility criteria, with three preprints informing sensitivity analyses only. Diagnostic/interpretation accuracy (k = 2) showed a positive trend for H + AI (Risk Ratio [RR] 1.59), but the estimate was imprecise and non-significant (95% CI 0.08 to 32.74), with the 95% prediction interval (PI) crossing the null. Composite diagnostic/management scores (k = 2) showed a statistically significant improvement (Mean Difference [MD] +4.88 percentage points, 95% CI +0.65 to +9.12), yet the PI (-31.65 to +41.42) indicates high real-world uncertainty. Time efficiency (k = 3) showed no overall difference (MD +0.4 min, 95% CI -4.18 to +4.97; I² = 70.1%). Documentation quality improved, but factual error rates remained high (~26-36%), undermining those gains. In three-arm settings, H + AI did not universally outperform AI-only. Evidence remains preliminary, highly uncertain, and context-dependent. We recommend preregistered, pragmatic, multicenter trials embedded in real workflows, with harmonized core outcomes that prioritize safety/error metrics and interfaces that surface uncertainty and support clinician verification.
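
As context for why the reported prediction intervals are so much wider than the corresponding confidence intervals: assuming the review used the conventional random-effects prediction interval (Higgins, Thompson, and Spiegelhalter, 2009), the PI adds the estimated between-study variance \(\hat{\tau}^2\) on top of the standard error of the pooled estimate. The formula below is that standard sketch, not a recomputation from the included studies:

\[
\text{95\% PI} = \hat{\mu} \pm t^{0.975}_{k-2}\,\sqrt{\hat{\tau}^2 + \widehat{\mathrm{SE}}(\hat{\mu})^{2}}
\]

With very few studies, \(\hat{\tau}^2\) and the t quantile are both estimated or evaluated imprecisely, so a pooled effect can reach statistical significance (CI excluding the null) while the PI remains extremely wide, as with the composite-score outcome above (MD +4.88; PI -31.65 to +41.42).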