Lost in Translation

01 — The Problem

AI Was Designed for English Speakers

There are approximately 7,000 recorded languages in the world, yet most major large language models were built on datasets that are overwhelmingly English. Common Crawl, the dominant training dataset used by AI developers, is nearly half English by volume. The languages spoken by the majority of the world's population account for only a fraction of what is represented.

The Stanford HAI white paper Mind the Language Gap identifies two compounding problems. First, low-resource languages lack the volume of digital text needed to train reliable models. Second, the data that does exist is often of poor quality, drawn largely from religious texts and legal documents rather than everyday vernacular. Languages with non-Latin scripts face additional barriers to digitization. Bengali has 300 million speakers and Swahili 200 million, yet both are classified as "low-resource" in AI research.

Two-thirds of the world's languages are spoken in Africa and Asia, yet these are the regions with the least AI infrastructure.

"Most major LLMs are predominantly trained using English data and are not attuned to Global South contexts. As a result, they underperform for many non-English languages." — Stanford HAI / The Asia Foundation, Mind the Language Gap, 2024

03 — The Numbers

7,000+

Languages spoken on Earth

~30

Classified "high-resource" in AI

46%

of LLM training data is English

300M

Bengali speakers — low-resource in AI

200M

Swahili speakers — also low-resource

2/3

of Earth's languages are in Africa and Asia — the regions with least AI infrastructure

A language is classified as "high-resource" in AI research based on how much digital text exists in that language. English dominates because decades of internet content, journalism, books, and government records have been digitized in English. Languages that lack that digital footprint are classified as low-resource regardless of their speaker count.

04 — Policy Connection

The Gap Is Growing

A 2025 cross-country study found that nations where low-resource languages are dominant use AI tools at less than half the per-capita rate of other countries. After controlling for income levels and internet access, language alone accounts for roughly 20% lower AI adoption. The researchers found no evidence that this gap is narrowing.

As AI becomes embedded in healthcare delivery, legal systems, financial platforms, and government services, communities that had no input into how these systems were built will still be subject to them. The decisions that produce this gap, like where to site data centers, which languages to include in training data, and how to allocate government subsidies, are policy decisions. They are made by governments, corporations, and international bodies. They can be made differently.

Africa · Founded 2019
Masakhane
Grassroots NLP research building language technology for African languages, by Africans. 2,000+ researchers across 30 countries.

South Asia · IIT Madras
AI4Bharat
Open-source models and datasets covering 22 official Indian languages including Bengali, Tamil, and Marathi.

Global · Cohere for AI
Aya
A multilingual dataset covering 101 languages, built by native speaker volunteers to address the quality gap in low-resource training data.

05 — Process

How This Was Built

The visualization began as an R script using ggplot2 and sf, with geographic data sourced from Natural Earth GeoJSON files. The resulting maps felt too static, so to add interactivity I moved to JavaScript with D3.js and TopoJSON. D3 handles the map projection, color scaling, and country boundaries; TopoJSON provides the compressed geographic data. The result is a visualization that lets viewers switch between the two datasets, data center counts and linguistic diversity, and watch the map invert: the countries that dominate in infrastructure rank low in language diversity, and vice versa.
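The dataset switch described above can be sketched without any libraries. This is a rough illustration, not the project's code: it stands in for D3's color scaling with a hand-written linear scale and RGB interpolation. The region codes `A`, `B`, `C`, their values, the two endpoint colors, and the function names are all placeholder assumptions.

```javascript
// Placeholder values for three made-up regions; NOT the project's real data.
const datasets = {
  dataCenters: { A: 100, B: 10, C: 0 },
  linguisticDiversity: { A: 0.2, B: 0.6, C: 1.0 },
};

// Assumed endpoint colors: pale grey for the lowest value, deep blue for the highest.
const LOW_COLOR = [240, 240, 240];
const HIGH_COLOR = [20, 60, 140];

// Return a function that maps a value onto [0, 1] across the dataset's range.
function makeLinearScale(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const span = max - min;
  return (v) => (span === 0 ? 0 : (v - min) / span);
}

// Blend two RGB triples; t = 0 gives `from`, t = 1 gives `to`.
function interpolateRgb(from, to, t) {
  const mix = (i) => Math.round(from[i] + (to[i] - from[i]) * t);
  return `rgb(${mix(0)}, ${mix(1)}, ${mix(2)})`;
}

// Compute a fill color per region for whichever dataset is currently selected.
function colorsFor(datasetName) {
  const data = datasets[datasetName];
  const scale = makeLinearScale(Object.values(data));
  const fills = {};
  for (const [region, value] of Object.entries(data)) {
    fills[region] = interpolateRgb(LOW_COLOR, HIGH_COLOR, scale(value));
  }
  return fills;
}

// Switching datasets recolors the same regions, which is what makes the map "invert".
console.log(colorsFor("dataCenters"));
console.log(colorsFor("linguisticDiversity"));
```

With these placeholder numbers, region `A` gets the darkest fill under `dataCenters` and the palest under `linguisticDiversity`, while region `C` does the opposite. In a D3 build, the scale and interpolation would more likely come from the library's own scale modules than from hand-written helpers like these.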

Data sources: data center counts from datacentermap.com; Linguistic Diversity Index values from Greenberg's published index as maintained by Ethnologue and UNESCO.
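For readers unfamiliar with the index, Greenberg's basic diversity measure is the probability that two people picked at random from a population have different mother tongues: one minus the sum of the squared population shares of each language. The sketch below only illustrates that formula. The project takes its values from published sources rather than computing them, and the function name and sample speaker counts are assumptions.

```javascript
// Greenberg-style diversity: 1 - sum(p_i^2), where p_i is the share of the
// population speaking language i as a mother tongue.
// `speakerCounts` is a hypothetical input: one speaker count per language.
function linguisticDiversityIndex(speakerCounts) {
  const total = speakerCounts.reduce((sum, n) => sum + n, 0);
  if (total <= 0) {
    throw new RangeError("speakerCounts must sum to a positive number");
  }
  const sumOfSquaredShares = speakerCounts.reduce(
    (sum, n) => sum + (n / total) ** 2,
    0
  );
  return 1 - sumOfSquaredShares;
}

// Made-up populations, for illustration only.
console.log(linguisticDiversityIndex([100]));            // one language: no diversity
console.log(linguisticDiversityIndex([50, 50]));         // two equal languages
console.log(linguisticDiversityIndex([25, 25, 25, 25])); // four equal languages
```

The index is 0 when everyone shares one language and approaches 1 as the population splits evenly across more languages, which is why countries with many small language communities score highest.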

06 — Artist Statement

Pete Ngwa

This project began with a question posed by AI researcher Yejin Choi: why is artificial intelligence centralized through the English language? The answer, as this project argues, is infrastructural. It's a direct consequence of where data centers are built, who builds them, and which communities are considered worth building for.

The language tabs at the top of this page offer the site in English, Spanish, Swahili, Yoruba, and Bengali. Spanish and English represent the two dominant languages in global AI development. Swahili, Yoruba, and Bengali were chosen as a representative sample of the many languages that are spoken by hundreds of millions of people yet remain classified as low-resource in AI research. They represent the gap this project seeks to point out. The translations themselves are admittedly clumsy: I am not fluent in these languages and relied heavily on Google Translate and AI assistance, an intentional choice that furthers my argument.

Pete Ngwa · Data and the Art of Policy and Planning — SPIA 4464
Virginia Tech · Moss Arts Center Exhibition · April 2026

07 — Data Sources

Transparent Documentation

All data used in this project is publicly available. Linguistic Diversity Index values reflect published academic research. Data center counts are approximate figures from the most comprehensive public inventory available.

Pava, J.N. et al. Mind the Language Gap. Stanford HAI / The Asia Foundation, 2024.
Datacentermap.com — Global data center inventory, accessed April 2025.
Greenberg, J.H. "The measurement of linguistic diversity." Language, 32(1), 1956. Values via Ethnologue (SIL International) and UNESCO.
Common Crawl language distribution — documented by researchers at Allen AI, Meta, and others.
Masakhane Research Foundation. masakhane.io
AI4Bharat, IIT Madras. ai4bharat.iitm.ac.in
Aya, Cohere for AI. cohere.com/research/aya
Choi, Yejin. On AI centralization and the English language divide. Public lecture, 2025.
Visualization: Globe.gl v2.30, TopoJSON v3, Natural Earth / world-atlas.

