Ask ChatGPT to describe a traditional wedding and it will almost certainly describe one with a white dress, a church or registry office, speeches at the reception, and a cake cutting. Ask the same question in Tagalog and you may get essentially the same answer, translated: still a Western ceremony, still Anglo-American presumptions. The AI has responded in a different language while thinking in the same one.
This is not a translation error. It is something more structural, and a growing body of research now describes it with enough precision to make the comfortable explanations hard to sustain.
The term “language modeling bias” was introduced in a 2024 paper in Ethics and Information Technology by Paula Helm and colleagues at universities in Germany and Italy. It names an unintentional design-level preference that leads language technology to favor some languages, dialects, and sociolects over others. The familiar finding that English dominates AI training data remains relevant: roughly 60% of online content is in English, and of the world’s 7,000 to 8,000 languages, fewer than 5% have any significant digital representation. But the paper’s argument goes further. As Helm and her co-authors document, when researchers and companies extend AI tools to underrepresented languages, they typically do so by translating or adapting existing English-centric systems rather than building from the target language’s own linguistic and cultural logic. The result is a system that speaks a different language while conceptually adhering to the dominant one’s presumptions.
The example they use to illustrate the digital language divide is strikingly specific. Kiswahili, spoken by roughly 80 million people in East Africa, has about as many Wikipedia pages as Breton, an endangered Celtic language of western France with perhaps 200,000 speakers. The gap between those figures and the real size of the two communities makes the crucial point: digital representation is not determined by how many people speak a language. It depends on colonial history, political and economic power, and which cultural groups had the institutional support and resources to build digital infrastructure. AI systems trained on this data do not merely inherit a language gap. They inherit an inequality and then reproduce it.
| Core Research Paper | “Diversity and Language Technology: How Language Modeling Bias Causes Epistemic Injustice” |
|---|---|
| Published In | Ethics and Information Technology, Volume 26, Article 8 (January 2024) |
| Authors | Paula Helm, Gábor Bella, Gertraud Koch, Fausto Giunchiglia |
| Key Concept Introduced | Language Modeling Bias — technology by design favors certain languages over others |
| Related Concept | Epistemic Injustice — marginalized language communities denied self-representation and knowledge production |
| Digital Language Divide Statistic | Less than 5% of the world’s 7,000–8,000 languages have significant digital representation |
| Online Content Statistic | Approximately 60% of online content is in English |
| Kiswahili vs. Breton Example | Kiswahili (~80 million speakers) has comparable Wikipedia coverage to Breton (~200,000 speakers) |
| AI Models with Western Bias | ChatGPT, Gemini, Claude (English-first training data) |
| AI Models Cited as Exceptions | Alibaba’s Qwen, China’s DeepSeek (large Chinese-language datasets) |
| Key Medium Article | “Why AI’s Language Bias Is More Than a Glitch! It’s a Global Inequality” — Josef Röyem (August 2025) |
| Dialect Bias Research | Hofmann et al. (2024) — LLM bias toward African American English |
| Alternative Framework Proposed | LiveLanguage Initiative — co-design approach for language technology |
| Harvard Kennedy School Source | HKS Misinformation Review — language models developed in authoritarian countries and bias concerns (September 2025) |
| arXiv Study | “Framing Political Bias in Multilingual LLMs Across Languages” (January 2026) |
| ScienceDirect Study | “Diagnosing the Bias Iceberg in Large Language Models” — Xiang (2026) |

The problem also reaches dialects and registers within major languages. Research shows that large language models are more likely to evaluate text written in African American English harshly than semantically identical text in Standard American English, and the same pattern appears in regional varieties of Arabic and Spanish. Speech recognition software routinely performs worse on non-standard accents and dialects, making everyday technology less useful for speakers who never used the “standard” form in the first place. These are not edge cases. They describe hundreds of millions of people for whom technology meant to serve everyone works less effectively, less accurately, and with less regard for the full range of what their language actually conveys.
The January 2026 arXiv paper on political bias in multilingual large language models extends the concern into an explicitly geopolitical dimension. It notes that bias in language models is shaping real-world outputs such as news summaries and headlines, reinforcing dominant ideologies in ways that are nearly imperceptible to users without the cultural knowledge to recognize the distortion. A model trained primarily on Western English-language news will describe the same event differently in English and in Arabic, not because the translation is wrong, but because the underlying framing from which the model generates its response was never neutral to begin with.
The AI firms building these systems are not oblivious to the issue. Google, Microsoft, and Meta have launched a variety of multilingual initiatives, and open-source projects have gathered local-language corpora in historically underserved communities. But Helm and her colleagues are skeptical of what they call the “argument of size,” the notion that collecting more data in underrepresented languages will inevitably close the gap. More data processed through an Anglo-centric methodology still yields a system blind to the cultural specificities encoded in those languages, the concepts that resist translation, and the ways of knowing that do not fit the models currently in use. Scaling an existing architecture into a new language is not the same as building a system that can think in that language’s terms.
Reading this research gives me the impression that the AI sector is treating linguistic diversity as a representation issue when, in reality, it is an epistemological one. The question goes beyond a model’s proficiency in Hausa, Tagalog, or Kiswahili. The question is whether the model can think in the ways made possible by those languages when it speaks them, and whether the knowledge systems created by those communities will have any significant role in the AI future that is being built primarily without them.
