Beyond Hallucinations: Why Training AI to Reason Harder Is Making It Less Reliable

By Stuart Kerr, Technology Correspondent
Published: 12 September 2025 | Last updated: 9 May 2026
Follow @LiveAIWire on X
Author Bio: https://liveaiwire.com/p/to-liveaiwire-where-artificial.html

The AI Reasoning Trap: Why Smarter Means Less Reliable

AI hallucinations are no longer just an inconvenience. A landmark paper presented at ICLR 2026 in Rio de Janeiro has just demolished one of the central assumptions of modern AI development. The paper, titled The Reasoning Trap, reports that training language models to reason harder makes them hallucinate more, not less. That finding has significant consequences for anyone relying on AI tools in healthcare, law, finance, journalism, or everyday decision-making, and it reframes the question of what progress in AI actually looks like.

The numbers are stark. OpenAI’s o3 reasoning model hallucinates on 33 percent of queries on the PersonQA benchmark, more than double the 16 percent rate of its predecessor o1. OpenAI’s o4-mini reaches 48 percent on the same benchmark. These are not fringe models. They represent OpenAI’s most advanced publicly available reasoning systems, and on questions about real people they are confidently wrong between a third and nearly half of the time. The ICLR researchers found that reasoning reinforcement learning disproportionately collapses the parts of the neural network that track whether a tool or fact actually exists. The model still reasons. It just reasons confidently about things that are not there.

What Hallucinations Actually Are and Why They Matter

An AI hallucination is not a mistake in the way a human makes mistakes. It is a confident, fluent, grammatically perfect statement that is factually wrong. The model does not know it is wrong. It does not flag uncertainty. It presents the fabrication with the same assured tone it would use for a correct answer, which is precisely what makes it dangerous.

A broader MIT-Harvard collaboration found that while models excel at familiar domains, their ability to transfer reasoning across new contexts is unreliable. Projects like The Hallucinations Leaderboard have formalised the measurement of these failures, cataloguing what today’s models can and cannot do across more than a hundred benchmarks. The financial consequences are already enormous. Global losses tied to AI hallucinations reached 67.4 billion dollars in 2024 according to Forrester Research, and companies now spend roughly 14,200 dollars per enterprise employee per year on hallucination-related mitigation efforts.

In legal practice, the consequences have already reached the courts. In Mata v. Avianca, a New York lawyer was sanctioned for submitting a brief containing fabricated case citations generated by ChatGPT. In a 2026 UC San Diego study, AI-generated summaries hallucinated 60 percent of the time, influencing real consumer purchase decisions. These are not theoretical risks. They are documented, measured failures with real consequences.

The Progress That Is Being Made

The picture is not entirely bleak. Google’s Gemini 2.0 Flash recorded a hallucination rate of just 0.7 percent on Vectara’s grounded summarisation benchmark as of early 2025, making it the most factually consistent large language model tested to date. Four models now sit below the one percent threshold on that benchmark. Retrieval-Augmented Generation, or RAG, remains the most effective countermeasure available at scale, cutting hallucination rates by 71 percent when properly integrated according to 2026 research.
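For readers who want to see the idea behind RAG in concrete terms, here is a minimal sketch. It is an illustration under stated assumptions, not a production pipeline: the three-document corpus, the function names, and the word-overlap scoring are all hypothetical, and real systems typically retrieve with vector embeddings rather than shared words. The point is only the shape of the technique: fetch relevant source text first, then instruct the model to answer from that text alone.

```python
import re

# Hypothetical in-memory corpus standing in for a real document store.
DOCUMENTS = {
    "policy": "Refunds are available within 30 days of purchase with a receipt.",
    "shipping": "Standard shipping takes five business days within the country.",
    "warranty": "The warranty covers manufacturing defects for two years.",
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=1):
    """Return the k (doc_id, text) pairs sharing the most words with the query.

    Word overlap is a simplifying assumption; it stands in for the
    embedding-based similarity search a real RAG system would use.
    """
    query_words = tokenize(query)
    ranked = sorted(
        documents.items(),
        key=lambda item: len(query_words & tokenize(item[1])),
        reverse=True,
    )
    return ranked[:k]


def build_grounded_prompt(query, documents, k=1):
    """Assemble a prompt that restricts the model to the retrieved sources."""
    passages = retrieve(query, documents, k)
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n"
        f"Sources:\n{context}\n"
        f"Question: {query}"
    )


question = "How long does the warranty cover defects?"
top_match = retrieve(question, DOCUMENTS)[0][0]
prompt = build_grounded_prompt(question, DOCUMENTS)
print(prompt)
```

The resulting prompt would then be sent to whichever language model is in use. Because the answer is tied to quoted, labelled sources, a human reviewer can check each claim against the passage it came from.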

Techniques continue to improve. A 2025 Nature study confirmed that prompt-based mitigation reduces hallucinations by approximately 22 percentage points. Medical AI research demonstrated a 33 percent reduction using structured prompts. Cross-Layer Attention Probing trains a lightweight classifier on a model’s own internal activations to flag likely hallucinations in real time, even when external verification is not possible. The market for hallucination detection tools grew 318 percent between 2023 and 2025 as organisations invested heavily in solutions, and 76 percent of enterprises now run human-in-the-loop processes specifically to catch hallucinations before deployment.
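The probing idea can be illustrated with a toy example. The sketch below is not the published Cross-Layer Attention Probing method. It is a much simpler logistic-regression probe, and it runs on randomly generated vectors that stand in for a model's internal activations. All names, dimensions, and the assumption that hallucinated and grounded answers leave separable traces in the activations are hypothetical, chosen only to show what training a lightweight classifier on internal states looks like in practice.

```python
import math
import random


def make_synthetic_activations(n, dim, seed):
    """Generate (vector, label) pairs as a stand-in for real hidden states.

    Assumption for illustration: label 1 (hallucinated) vectors are centred
    at +1 in every dimension, label 0 (grounded) vectors at -1.
    """
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        centre = 1.0 if label == 1 else -1.0
        data.append(([rng.gauss(centre, 1.0) for _ in range(dim)], label))
    return data


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hallucination_score(weights, bias, vec):
    """Probe output: estimated probability that the activation is 'hallucinated'."""
    return sigmoid(sum(w * x for w, x in zip(weights, vec)) + bias)


def train_probe(data, dim, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for vec, label in data:
            error = hallucination_score(weights, bias, vec) - label
            weights = [w - lr * error * x for w, x in zip(weights, vec)]
            bias -= lr * error
    return weights, bias


DIM = 8
train_set = make_synthetic_activations(200, DIM, seed=0)
test_set = make_synthetic_activations(100, DIM, seed=1)
weights, bias = train_probe(train_set, DIM)

# Flag an example when the probe's score exceeds 0.5.
correct = sum(
    (hallucination_score(weights, bias, vec) > 0.5) == bool(label)
    for vec, label in test_set
)
accuracy = correct / len(test_set)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

On real models the activations would be captured during generation and the labels would come from human or automated fact-checking. How well such a probe separates true from fabricated statements on real data is an empirical question this toy example does not answer.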

As explored in Can Simpler AI Beat Deep Learning?, the lesson from climate science applies here too. Bigger and more complex does not always mean better. Sometimes the most reliable system is the one that sticks closest to what it actually knows.

The Benchmarking Problem

Part of the challenge is that hallucinations are difficult to measure consistently across domains. The MIT-Harvard research underscored a growing consensus that benchmarks must test generalisation, not just memorisation. A model that scores brilliantly on a standard benchmark may still fail badly when asked to apply its knowledge to a genuinely novel situation. Open-source models still show hallucination rates above 80 percent in some complex reasoning scenarios, and reasoning-focused models show higher error rates than their simpler predecessors, suggesting that the trade-off between reasoning depth and factual accuracy is a genuine architectural challenge rather than a temporary limitation.

The shallow benchmarking problem mirrors concerns raised across other domains. Just as Beyond Buzz: Why the AI Hype Cycle Is Over argued that AI product development needs to prioritise real value over headline performance, AI evaluation needs to prioritise real-world reliability over laboratory scores.

What This Means for How You Use AI

The practical implications are clear. For any high-stakes use of AI, human oversight remains essential. Verification of AI outputs against primary sources should be standard practice, not an optional extra. Structured prompting and RAG-based systems should be preferred over open-ended generation wherever accuracy matters. And users should treat AI confidence as a signal to verify rather than a signal to trust.

The goal is not to abandon AI reasoning capabilities. Dario Amodei suggested at a 2025 Anthropic developer event that on some factual tasks, frontier models may already hallucinate less often than human experts working under similar conditions. The goal is to understand where reasoning models excel and where they fail, and to build systems that deploy each capability in contexts appropriate to its reliability.

The AI industry has made extraordinary progress on what models can do. The defining challenge of the next phase is ensuring that what they do can actually be trusted. The Reasoning Trap paper is an uncomfortable reminder of how much work remains.

About the Author

Stuart Kerr is Technology Correspondent at LiveAIWire. He writes about artificial intelligence, ethics, and how technology is reshaping everyday life. Follow @LiveAIWire on X.
