EVALUATING LARGE LANGUAGE MODELS IN CARDIOLOGY: INSIGHTS FROM A STRUCTURED BENCHMARKING FRAMEWORK

Galeazzi Michele, D’Alessio Simone, Capodaglio Irene, Dottori Melissa, Corinaldesi Christian, Marini Marco, Pierri Michele Danilo, Di Eusanio Marco – Cardiac Surgery Unit, Lancisi Cardiovascular Center, Polytechnic University of Marche, Ancona, Italy

DIGITAL CARDIOLOGY – ARTIFICIAL INTELLIGENCE

Background: Conversational large language models (LLMs) are rapidly entering the medical domain, raising interest in their potential role as information-support tools in cardiology. However, objective, head-to-head evaluations of different LLMs in high-risk cardiovascular scenarios remain limited, particularly with respect to clinical accuracy and contextual adequacy.

Aim: To benchmark the performance of three widely used LLMs (ChatGPT, Claude, and Gemini) when responding to structured cardiology-related queries representative of real-world clinical information needs.

Methods: Seventy cardiology-related prompts were developed to reflect common pre-diagnostic and post-diagnostic scenarios and were framed for two user profiles: patients and general practitioners. Each prompt was submitted to all three models under standardized conditions. Responses were anonymized and independently assessed by three senior cardiologists blinded to model identity. Scientific accuracy, completeness, clarity, and coherence were rated on a 5-point Likert scale. Models were compared using non-parametric statistical tests, and inter-rater reliability and sensitivity analyses were performed to assess robustness.

Results: Significant differences emerged across all evaluated domains. ChatGPT consistently achieved higher scores for accuracy, completeness, clarity, and coherence, followed by Claude and Gemini. Agreement among expert reviewers was substantial. Across models, responses to pre-diagnostic and patient-oriented prompts were rated more favorably than those addressing post-diagnostic management or physician-level inquiries. Sensitivity analyses confirmed that results were not driven by individual evaluators.

Conclusions: General-purpose LLMs show heterogeneous performance in cardiology-related tasks, with ChatGPT demonstrating the most consistent outputs among the evaluated models. Nevertheless, none of the systems reached a level of reliability sufficient for autonomous clinical use. These findings support the use of LLMs as adjunctive tools for information delivery and education, while underscoring the necessity of human oversight, continuous validation, and domain-specific optimization in cardiovascular medicine.
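
The abstract does not specify how the sensitivity analysis was carried out. One common way to check that a ranking is not driven by a single evaluator is a leave-one-rater-out comparison, sketched below under stated assumptions. The model labels, rater labels, scores, and function names are all hypothetical and are not study data, and ranking by pooled mean score is an illustrative simplification rather than the non-parametric tests used in the study.

```python
from statistics import mean

# Hypothetical toy ratings (NOT study data): scores[model][rater] is a list of
# 1-5 Likert ratings, one per prompt, in the same prompt order for every rater.
scores = {
    "model_a": {"r1": [5, 4, 5, 4], "r2": [4, 4, 5, 5], "r3": [5, 5, 4, 4]},
    "model_b": {"r1": [4, 3, 4, 4], "r2": [3, 4, 4, 3], "r3": [4, 4, 3, 4]},
    "model_c": {"r1": [3, 3, 2, 4], "r2": [3, 2, 3, 3], "r3": [2, 3, 3, 3]},
}


def model_means(scores, exclude=None):
    """Mean rating per model, pooling all raters except `exclude`."""
    return {
        model: mean(
            s
            for rater, vals in by_rater.items()
            if rater != exclude
            for s in vals
        )
        for model, by_rater in scores.items()
    }


def ranking(means):
    """Model names ordered from highest to lowest mean rating."""
    return sorted(means, key=means.get, reverse=True)


def leave_one_rater_out(scores):
    """For each rater, report whether dropping them preserves the full ranking."""
    raters = sorted({r for by_rater in scores.values() for r in by_rater})
    full = ranking(model_means(scores))
    return {r: ranking(model_means(scores, exclude=r)) == full for r in raters}


full_order = ranking(model_means(scores))
stable = leave_one_rater_out(scores)
print("ranking with all raters:", full_order)
print("ranking unchanged when rater is dropped:", stable)
```

If every entry in `stable` is true, no single rater determines the ordering in this toy setting. A fuller analysis would repeat the same non-parametric comparison on each reduced rater set instead of comparing means.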

CONGRESS ABSTRACT
