A version of this article appeared on Nation.Africa.
Public health experts are raising alarms after a new audit of leading artificial intelligence platforms revealed a failure rate of nearly 50 percent when responding to common medical queries. The study, published in the journal BMJ Open, found that while these digital assistants sound increasingly authoritative, they frequently dispense advice that is incomplete, unscientific, or dangerous.
Researchers at the Lundquist Institute for Biomedical Innovation at Harbor-UCLA Medical Center conducted a rigorous "red teaming" exercise. They tested five of the most popular chatbots, including Google's Gemini, ChatGPT, Meta AI, DeepSeek, and Grok. The team used 50 different prompts across topics highly prone to misinformation, such as cancer treatments, vaccines, and nutrition.
The results showed that 49.6 percent of the total responses were problematic. About 30 percent of all responses were categorized as "somewhat problematic" for lacking necessary context, while nearly 20 percent were "highly problematic," containing outright falsehoods that could lead to patient harm.
One of the most troubling aspects of the findings was the tendency of these models to exhibit "sycophancy," a trait where the AI prioritizes agreeing with the user's leading questions over maintaining factual accuracy. For instance, when asked about unproven alternatives to chemotherapy, some bots suggested herbal remedies or specific diets as viable options rather than firmly directing users to clinical evidence.
The audit also exposed a significant lack of transparency in how these systems cite sources. Not a single chatbot managed to produce a fully accurate reference list across all attempts. Median completeness for scientific citations was only 40 percent, with many models inventing nonexistent papers or providing broken links to bolster their claims.
Dr. Nicholas Tiller, the lead author of the study, noted that these systems are statistically driven word predictors rather than reasoning engines. They do not weigh evidence or make value-based judgments. Instead, they infer patterns from vast datasets that include both peer-reviewed research and unreliable social media threads.
The complexity of the language used also presented a barrier. Researchers found that most AI health responses were written at a difficult college reading level, making it harder for the average person to discern scientific facts from convincing jargon. This "authoritative-sounding" tone often masks a lack of clinical judgment.
Performance varied slightly between platforms. Grok recorded the lowest overall scores for reliability. In contrast, Google's Gemini was noted for including caveats or disclaimers in 88 percent of its responses, though it still struggled with accuracy in specific areas like nutrition and athletic performance.
The study authors concluded that the deployment of these tools without public education or stricter oversight could amplify health misinformation. They urged users to treat AI as a conversational tool rather than a substitute for professional medical advice, especially when making decisions about life-altering treatments.