Are AI chatbots giving you the wrong diagnosis? Study flags 50% error risk

Strong on answers, weak on reasoning: AI’s hidden risk in medical advice


With every leap in technology, a familiar debate resurfaces - will it help us, or will it harm us? Artificial Intelligence sits right at the centre of this dilemma. Once, a quick Google search was our go-to for health concerns - typing symptoms, scrolling through links, and often self-diagnosing. Today, that behaviour has evolved. Instead of searching, people are asking AI chatbots directly: What’s wrong with me? What do these symptoms mean? Should I be worried?

The shift comes with concerns too. A new study by researchers at Mass General Brigham, published in JAMA Network Open, highlights a critical gap in how these systems function. While AI models may appear accessible and provide easy-to-understand, detailed responses, they struggle with something fundamental to medicine - clinical reasoning.


At a time when people are turning to AI chatbots for quick health answers, the study raises urgent questions about safety, accuracy, and the expanding role of AI in healthcare. 

How was the study conducted? 

“Large language models (LLMs) are increasingly marketed for clinical use, yet their ability to replicate full-spectrum clinical reasoning remains uncertain,” the researchers note.

To better understand this, the study set out to evaluate how well AI systems perform across the entire clinical workflow, not just in isolated tasks. Unlike traditional evaluations that rely on multiple-choice questions, this research used a more realistic and layered approach.

In a cross-sectional study design, researchers assessed 21 off-the-shelf LLMs using standardised clinical vignettes from the January 2025 update of MSD Manual cases. These included some of the most advanced AI systems available today, such as GPT-5, Claude 4.5 Opus, Gemini 3.0 Flash and Pro, and Grok 4.

Each model was tested across sequential stages of clinical reasoning, starting from forming a differential diagnosis to recommending diagnostic tests, arriving at a final diagnosis, suggesting management plans, and answering miscellaneous reasoning questions.

To measure performance, researchers introduced a new benchmark - the Proportional Index of Medical Evaluation for LLMs (PrIME-LLM) score. This metric captures how balanced and accurate a model is across all stages of clinical reasoning, rather than just focusing on final answers.
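The article does not spell out how the PrIME-LLM score is calculated, but the idea of a metric that rewards both accuracy and balance across stages can be illustrated with a small sketch. The function name, weighting scheme, and stage values below are assumptions for illustration only, not the study's actual formula:

```python
# Hypothetical sketch of a stage-balanced accuracy index, loosely inspired by the
# article's description of PrIME-LLM ("how balanced and accurate a model is across
# all stages of clinical reasoning"). The formula is an assumption, not the study's.

from statistics import mean, pstdev

def stage_balanced_score(stage_accuracies: dict[str, float]) -> float:
    """Combine per-stage accuracies into one score that rewards both a high
    average and an even spread across stages (assumed formula)."""
    values = list(stage_accuracies.values())
    avg = mean(values)                # overall accuracy across all stages
    imbalance = pstdev(values)        # spread between strong and weak stages
    return max(0.0, avg - imbalance)  # penalise uneven performance

# Example (made-up numbers): a model that is strong on final diagnosis but weak
# on differential diagnosis is pulled down by the balance penalty.
example = {
    "differential_diagnosis": 0.15,
    "diagnostic_testing": 0.70,
    "final_diagnosis": 0.65,
    "management": 0.72,
    "miscellaneous_reasoning": 0.68,
}
print(round(stage_balanced_score(example), 2))
```

Under an assumed scheme like this, a model cannot reach a high score simply by nailing the final diagnosis; weak differential-diagnosis performance drags the overall index down, which is the behaviour the researchers say single-answer benchmarks miss.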

Strong final answers, weak diagnostic thinking

The results reveal a striking contrast. While AI models performed relatively well in arriving at final diagnoses, they struggled significantly in the earlier, and arguably more critical, stage of differential diagnosis.

LLMs were tested across 29 clinical scenarios, generating a total of 16,254 responses. The PrIME-LLM scores ranged from 0.64 for Gemini 1.5 Flash to 0.78 for Grok 4, indicating moderate overall performance. Models specifically optimised for reasoning performed better than standard ones, with GPT-based systems scoring among the highest.

However, a deeper look shows where things fall apart.

“Differential diagnosis was less accurate than diagnostic testing, while final diagnosis, management, and miscellaneous reasoning were more accurate,” the study found.

In fact, failure rates for differential diagnosis exceeded 80% across all models, with some reaching as high as 90–100%. In contrast, failure rates for final diagnosis were below 40%.

This suggests that while AI can often land on the correct answer, it does not reliably follow the logical pathway needed to get there.

The researchers highlight a key difference between human clinicians and AI systems. “Clinicians preserve uncertainty and iteratively refine differential diagnoses, whereas LLMs collapse prematurely onto single answers.”

This tendency to jump to conclusions, without adequately considering alternatives, can be risky in medicine, where multiple conditions may present with similar symptoms. Missing these possibilities can lead to misdiagnosis or delayed treatment.

The study also notes that while multimodal inputs (such as images) improved performance in some cases, these gains were “limited and inconsistent.” This indicates that AI still struggles to handle the diverse and complex data that real-world clinical practice demands.

Importantly, an AI model can give a wrong answer while sounding entirely sure of it, especially in complex situations where even doctors need time and careful thought to be certain.

Limitations of the study

While the findings are significant, the researchers acknowledge several limitations that must be considered.

“Importantly, this study evaluates off-the-shelf LLMs without external augmentation to enable controlled, comparable benchmarking across model families.”

This means that the models were tested in their baseline form, without additional tools that are often used in real-world applications, such as web search, clinical guidelines, or advanced reasoning frameworks.

The study also notes that models were accessed through a mix of API-based and web-based interfaces, with optional features like search and enhanced reasoning turned off. This could influence performance outcomes.

Another limitation is the possibility of prior exposure. Since the clinical vignettes used in the study are publicly available, “prior exposure during model pretraining cannot be fully excluded.”

Additionally, the evaluation did not include model augmentations like retrieval-augmented generation, calculators, or agent-based tools, which could improve accuracy in practical settings. As a result, the findings reflect baseline reasoning ability rather than maximum potential performance.

The researchers also clarify that the PrIME-LLM framework is not intended to compare AI systems directly with human clinicians. “The present study was not designed to answer human comparison questions.”

However, the implications remain important.

“Most importantly, the findings of this study caution against vendor claims that general purpose, off-the-shelf LLMs are ready for patient-facing clinical use.”

The study warns that strong performance in final diagnosis tasks may create a misleading impression of reliability. In reality, persistent gaps in reasoning and uncertainty handling make these systems unsuitable for frontline medical decision-making.

“Marketing LLMs as diagnostic agents risks fostering false confidence precisely where they are least reliable.”

‘AI can analyse data, but cannot take responsibility’

Dr Rajiv Kovil, Head of Diabetology and Weight Loss Expert at Zandra Healthcare, said the comparison between AI and doctors, while popular, is fundamentally flawed.

He explained that medicine is not just about finding the right answer but about responsibility and decision-making. “AI versus doctors is a very attractive narrative, but it is fundamentally flawed because medicine, whether diagnosis or therapy, is about owning responsibility. And responsibility cannot be outsourced,” he said.

According to him, AI does have a clear and important role in healthcare, particularly in analysing large volumes of data and identifying patterns that humans may miss. “AI is exceptionally good at pattern recognition,” he noted, pointing to examples like subtle changes in diabetic retinopathy, ECG readings, radiology scans, or continuous glucose monitoring (CGM) data.

However, he emphasised that interpreting what those findings mean for a real patient requires clinical judgement. He drew a clear distinction between knowledge and clinical wisdom. “AI can give you a lot more intelligence from data, but it cannot give you clinical wisdom. These two are not interchangeable at all,” he said.

Dr Kovil further explained that clinical acumen goes beyond textbook knowledge and involves making decisions under uncertainty, something that comes with experience and responsibility.

He also highlighted how real-world medical decision-making differs from simulations or theoretical knowledge. Doctors often have to act in uncertain, high-stakes situations, where outcomes directly affect patients’ lives, something AI does not account for.

While acknowledging AI’s growing importance, he stressed that it should be seen as a support system rather than a replacement. “It is not ever going to be a replacement, but definitely a reinforcement,” he said.

In his final message, Dr Kovil cautioned patients against over-relying on AI for medical decisions. While AI may help detect diseases or provide preliminary insights, it cannot take responsibility for outcomes. “AI can detect diseases, there is no doubt about it. But only a doctor can understand the patient and take responsibility for what comes next,” he said.

This story was produced in collaboration with First Check, the health journalism vertical of DataLEADS.