By Stuart Kerr, Technology Correspondent, LiveAIWire
There are approximately 7,000 languages currently spoken in the
world. The overwhelming majority of AI language and speech capability is
concentrated in fewer than 20, with English, Mandarin, Spanish, and a small
number of other high-resource languages commanding the vast bulk of training
data, research investment, and deployment. The implications of this
concentration are not trivial. Language is not merely a communication tool.
It is the primary medium through which culture, identity, and knowledge are
expressed and transmitted. AI systems that can only operate effectively in a
small number of dominant languages are not language-neutral tools. They are
instruments that amplify the global reach of some linguistic and cultural
frameworks while leaving others progressively further behind.
Research published on linguistic diversity in large language models has
systematically documented the performance gap between high-resource and
low-resource languages, finding that models trained primarily on English
data produce outputs in other languages that exhibit substantially higher
error rates, more frequent hallucinations, and weaker cultural
contextualisation than their English equivalents. The gap reflects the deeper
problem that language models trained on one linguistic corpus encode the
conceptual frameworks, default assumptions, and cultural context of that
corpus, which do not translate directly into other languages even when
surface-level translation is accurate.
The Dialect and Slang Problem
Within languages that AI systems do handle, the coverage is itself
uneven. Dialect, regional slang, and minority language varieties within
larger language families are systematically underrepresented in training data
relative to standardised written forms. The written corpus of English that
large language models are trained on is dominated by formal written prose,
online content from educated writers in major English-speaking markets, and
digital text produced by users with reliable internet access. The spoken
vernacular of regional communities, the generational slang of young people in
specific urban contexts, and the creole and pidgin varieties that are native
languages for millions of speakers all appear at much lower rates in the
training data than their actual linguistic importance warrants.
The consequences are practical. AI assistants that perform well
for standard written queries respond poorly to the natural spoken register of
users from regional or marginalised linguistic communities. Content
moderation systems trained on standard written language apply inconsistently
to the vernacular registers where harmful content actually circulates, with
documented higher error rates in both over-moderation and under-moderation of
vernacular content. These are not minor calibration issues. They represent
systematic service inequality that tracks closely with existing social
inequalities, as our analysis of AI’s specific struggles with regional
British dialects and slang found in a UK context that is representative of
the global pattern.
Efforts to Address the Gap
Several significant initiatives are working to expand AI language
coverage beyond high-resource varieties. The Masakhane project, a
community-driven effort to develop natural language processing for African
languages, has produced training datasets and model evaluations for more than
60 African languages, demonstrating that the technical barriers to AI
capability in low-resource languages are addressable with appropriate
investment and community involvement. Meta’s No Language Left Behind project
has developed translation models covering more than 200 languages, including
many with limited prior AI coverage, though performance disparities between
high-resource and low-resource languages remain significant even within that
expanded coverage.
These projects represent meaningful progress, but the structural
incentive problem remains. Investment in AI language capability follows
commercial opportunity, which is concentrated in languages with large
populations of users with purchasing power. Languages spoken by smaller
populations, or by populations with limited consumer market significance,
will not attract commercial investment proportionate to their cultural and
communicative importance. Closing the language coverage gap at scale requires
regulatory intervention that sets minimum language performance standards for
AI systems used in public services, public investment in low-resource
language data infrastructure, or both.
What Universal AI Language Capability Would Actually Require
The aspiration of AI systems that work equally well across all
languages is technically possible in principle but practically distant given
current investment patterns. What meaningful progress requires is systematic
and well-resourced data collection in low-resource languages and dialects,
conducted in collaboration with linguistic communities rather than through
passive harvesting of available text. It requires evaluation frameworks that
measure performance across linguistic diversity rather than defaulting to
English benchmarks. And it requires governance frameworks for AI deployment
in multilingual contexts that specify minimum performance standards across
the language varieties of the populations being served.
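To make the evaluation point concrete, here is a minimal sketch of what reporting performance per language, rather than as a single English-centred score, could look like. Everything in it is an assumption for illustration: the function names, the 0.8 minimum threshold, and the toy records are hypothetical and do not come from any real benchmark or standard.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return accuracy per language.

    `records` is an iterable of (language_code, is_correct) pairs,
    a hypothetical format chosen for this sketch.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for lang, ok in records:
        totals[lang] += 1
        correct[lang] += int(ok)
    return {lang: correct[lang] / totals[lang] for lang in totals}


def coverage_report(records, min_accuracy=0.8):
    """Compare each language with the best-served one and with a
    minimum standard. The 0.8 default is an illustrative assumption."""
    scores = per_language_accuracy(records)
    best = max(scores.values())
    return {
        lang: {
            "accuracy": acc,
            "gap_to_best": best - acc,
            "meets_minimum": acc >= min_accuracy,
        }
        for lang, acc in scores.items()
    }


if __name__ == "__main__":
    # Made-up toy results, not real measurements.
    toy_records = (
        [("en", True)] * 9 + [("en", False)]
        + [("yo", True)] * 6 + [("yo", False)] * 4
    )
    for lang, row in coverage_report(toy_records).items():
        print(lang, row)
```

The design choice worth noting is that the report is keyed by language and never averages across languages, so a strong English result cannot mask a weak result elsewhere.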
As our coverage of how AI deployment creates unequal access across
populations found, the
populations with the most to gain from AI capability are frequently those
whose contexts and languages are least represented in AI development. The
algorithmic accent, the bias toward the linguistic varieties that dominate AI
training, is not an inevitable feature of the technology. It is a consequence
of investment priorities and governance choices that can be made differently,
and that will need to be if AI is to deliver on its promise of universal
capability rather than simply extending the reach of already dominant
linguistic and cultural frameworks.
The Cultural Stakes
The language coverage gap in AI systems has cultural consequences
that extend beyond the practical service delivery failures it causes. When AI
systems trained primarily on English-language data become the dominant
interface for information access, creative production, and professional
communication globally, they create systematic pressure toward the linguistic
and cultural frameworks encoded in that data. Communities whose languages are
not well served by AI tools face a practical choice between accepting
lower-quality AI assistance in their own language or switching to a
high-resource language to access better AI capability. That choice is not
neutral. It is a form of linguistic and cultural pressure that compounds the
existing dynamics of language shift and endangerment that linguists have been
documenting for decades.
The stakes of getting AI language diversity right are therefore
not only about service equity, though they are certainly about that. They are
about whether AI development reinforces the global dominance of a small
number of linguistic and cultural frameworks, or whether it creates a
genuinely pluralistic information environment in which all communities can
access and contribute to AI capability on terms consistent with their own
languages and cultural frameworks. As our analysis of how AI shapes the
information environments that communities inhabit found, the systems that
mediate access to information are never neutral. They
encode priorities, assumptions, and cultural frameworks that shape what is
findable, credible, and communicable. Language diversity in AI is not a niche
technical concern. It is a question about what kind of global information
environment AI development is building, and whose world that environment
reflects.
What Is Being Done and What Remains
Some progress is being made on linguistic diversity in AI systems.
As noted above, Meta’s No Language Left Behind initiative produced a
translation model covering more than 200 languages, including many with very
limited prior AI coverage. Google’s Pathways Language Model included more substantial
multilingual training than earlier large language models. The African
Languages Technology Initiative and similar regional programmes are building
datasets for languages that have been almost entirely absent from major AI
training corpora. These are meaningful steps, and they demonstrate that the
problem is tractable if the investment is made.
What remains is the structural imbalance between the pace of
capability development in high-resource languages and the pace of improvement
for the roughly 6,980 others. The gap is not closing at a rate that would produce
genuinely equitable AI language capability within a reasonable timeframe
under current investment patterns. Closing it requires a combination of
commercial investment where business cases exist, public funding for
languages where they do not, and the active involvement of speaker
communities in dataset creation and system evaluation rather than the
extraction of linguistic data without meaningful community
participation.
For related coverage, see our analysis of whether AI is erasing
linguistic diversity and our broader look at the deeper challenge of AI bias
across systems and populations.
About the Author
Stuart Kerr is the Technology Correspondent for LiveAIWire. He
writes about artificial intelligence, emerging technology, and the forces
reshaping work, business, and society.