Bayesian Uncertainty Auditing for Medical LLMs: Calibrated, Evidence-Grounded Triage Summaries with Selective Abstention
Medical Large Language Models (LLMs) can draft triage summaries rapidly, yet their clinical value is limited by overconfidence, unsupported claims, and weak patient-specific reliability. This study proposes Bayesian Uncertainty Auditing (BUA), a safety-oriented framework that couples Bayesian predictive approximations (ensembles or Monte Carlo (MC) sampling) with (i) probability calibration, (ii) risk–coverage selective prediction (abstention), and (iii) proposition-level evidence auditing to prevent high-confidence hallucinations. In a retrospective evaluation on 1,050 patient cases, BUA improved triage-tag discrimination to an area under the receiver operating characteristic curve (AUROC) of 0.89, reduced miscalibration from an expected calibration error (ECE) of 0.090 to 0.035, and lowered probabilistic error (Brier score) to 0.12. Under uncertainty-based abstention, BUA achieved 0.82 coverage at risk ≤ 0.10, demonstrating that safety can be increased without forcing full automation. Faithfulness auditing improved evidence support rates for summary propositions and reduced the high-confidence unsupported claim rate to 0.03 overall. Gains were most pronounced in difficult settings such as missing or pending key labs, where subgroup ECE dropped from 0.160 to 0.070 and coverage at risk ≤ 0.10 improved from 0.70 to operationally acceptable levels. In older patients (age ≥ 65), calibration improved from ECE = 0.120 to 0.055, while rare comorbidity patterns remained a stress test (post-audit ECE = 0.085), motivating stricter abstention thresholds and governance rules for out-of-distribution contexts. Collectively, these results show that Bayesian uncertainty, when paired with calibration, abstention, and evidence verification, yields triage summaries that are more reliable, more auditable, and safer for deployment than single-pass generation.
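The three headline quantities in the abstract (Brier score, ECE, and coverage at a risk ceiling) can be illustrated with a minimal sketch. This is not the study's implementation: the function names, the equal-width 10-bin binary ECE variant, and the toy inputs are all assumptions chosen for illustration, and the abstract does not state which binning scheme or uncertainty score was used.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binary ECE with equal-width bins (an assumed variant).

    For each bin, take the gap between the mean predicted probability and
    the observed positive rate, then average the gaps weighted by bin size.
    """
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        pos_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(mean_p - pos_rate)
    return ece


def coverage_at_risk(
    uncertainty: Sequence[float], errors: Sequence[int], max_risk: float
) -> float:
    """Largest fraction of cases that can be answered (most certain first)
    while the error rate among answered cases stays at or below max_risk.

    errors[i] is 1 if the prediction for case i was wrong, else 0.
    """
    order = sorted(range(len(errors)), key=lambda i: uncertainty[i])
    best = 0.0
    wrong = 0
    for kept, i in enumerate(order, start=1):
        wrong += errors[i]
        if wrong / kept <= max_risk:
            best = kept / len(errors)
    return best


if __name__ == "__main__":
    # Toy data, purely illustrative (not from the study).
    probs = [0.95, 0.85, 0.15, 0.05]
    labels = [1, 1, 0, 0]
    print("Brier:", brier_score(probs, labels))
    print("ECE:", expected_calibration_error(probs, labels))

    uncertainty = [0.1, 0.2, 0.3, 0.4, 0.5]
    errors = [0, 0, 0, 1, 1]
    print("Coverage at risk <= 0.10:", coverage_at_risk(uncertainty, errors, 0.10))
```

Under this reading, a figure such as "0.82 coverage at risk ≤ 0.10" would mean that the system answers 82% of cases and abstains on the rest, with at most 10% of the answered cases in error. How the study defines "risk" for triage tags is not specified in the abstract, so the 0/1 error indicator here is an assumption.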