The AGI Consistency Problem: Why Smarter Machines Still Stumble
Artificial General Intelligence (AGI) has long been the holy grail of AI research: a system capable of performing any intellectual task a human can. But despite headline-grabbing wins in elite competitions, AI still falters on problems most high school students can solve. According to experts, the gap is rooted in a fundamental flaw: inconsistency.
By Stuart Kerr, Technology Correspondent
Published: 14 August 2025
Last Updated: 14 August 2025
Contact: liveaiwire@gmail.com | Twitter: @LiveAIWire
Google DeepMind CEO Demis Hassabis recently told Business Insider that even the most advanced AI models can ace Olympiad-level math problems yet fail simpler high school exercises. He frames this as a problem of “consistency”: the ability to apply reasoning skills reliably across different levels of difficulty.
This mirrors coverage in the Times of India, which highlights how AI’s uneven performance raises concerns about its readiness for real-world deployment in sensitive sectors such as healthcare and law.
Beyond the Hype
In an interview with India Today, Hassabis noted that current AI benchmarks may be too narrow, rewarding exceptional results in niche domains without penalising failures on everyday problems. This creates a public perception that AGI is closer than it really is.
The academic community has been warning of this illusion for years. In “Why AI Is Harder Than We Think” (PDF), Melanie Mitchell argues that intelligence is deeply context-dependent and that today’s AI systems remain brittle when faced with novel situations.
The Policy Perspective
Another camp of researchers argues that the very pursuit of AGI as a single goal is flawed. The paper “Stop Treating ‘AGI’ as the North-Star Goal of AI Research” (PDF) contends that AI policy and funding should prioritise diverse, specialised systems that address specific human needs over monolithic “human-level” machines.
These arguments echo debates we have covered in our own reporting, such as our piece on the AI arms race among Google, Amazon and Meta, where corporate competition often sidelines nuanced discussion of capability limitations.
Building Reliability Into AI
Some developers are tackling the consistency problem head-on. The emerging MoR (Mixture of Reasoners) model architecture, explored in our feature “New AI model MoR succeeds transformers,” aims to combine specialised reasoning modules under a unified control system, potentially reducing erratic performance.
But while architectures evolve, so too must evaluation methods. Experts propose a mix of high-stakes testing and low-level diagnostics to ensure AI systems are not just brilliant but dependable.
A Measured March Toward AGI
It’s tempting to see breakthroughs as proof that AGI is within reach, but the reality is more complex. As Hassabis acknowledges, the technology’s unevenness could undermine trust if left unresolved.
For policymakers, the lesson is clear: fund innovation, but pair it with rigorous, broad-spectrum testing standards. For the public, the takeaway may be even simpler: don’t mistake flashes of genius for full human equivalence.
Our earlier analysis of Google’s decision on the EU AI code of practice reinforces this point: transparency and accountability must evolve alongside capability if AGI is to be both real and reliable.
About the Author
Stuart Kerr is the Technology Correspondent for LiveAIWire. He writes about artificial intelligence, ethics, and how technology is reshaping everyday life.