Can we use the rise and fall of languages as a lagging indicator of the rise and fall of countries and empires?
Indeed, as Ray Dalio wrote in his upcoming book The Changing World Order:
“A reserve currency’s usage, like a language’s usage, lags the fundamental reasons for using it by many years because the usage of currency is not easy to change.”
Koine Greek and Latin stayed Europe’s lingua francas for centuries, even after the fall of the Hellenistic civilization and the Roman Empire. Fifty years after decolonization, English, French, and Portuguese are still the vehicular languages of Sub-Saharan Africa. The USSR collapsed in 1991, yet Russian continues to be the common language of post-Soviet states. There are many similar examples.
But it’s hard to measure language usage. You can’t rely on official languages because speakers don’t follow their Constitution. For instance, Irish Gaelic is the national and first official language of the Republic of Ireland, yet fewer than 5% of Irish people use it daily. There’s also a difference between first language (L1) and second language (L2). Modern Standard Arabic has no native speakers but 274 million total speakers: which number should you choose? And how fluent are these L2 speakers? Last, context matters a lot. Consider a Turkish immigrant working in a tech startup in Berlin; they may speak Turkish at home, English at work, and German in their day-to-day life. Which language would they declare in a census? Besides, censuses don’t cover the whole world and are only done every 5 to 10 years. They can also be biased: in some places, speakers of minority languages may feel ashamed (e.g., regional languages in France 100 years ago) or face persecution (e.g., the Rohingya language in Burma), so they may not answer truthfully in government surveys.
That’s why I used Wikipedia’s data. Wikipedia is among the top 20 most visited websites in each country, except for Mainland China, where it’s banned. We use Wikipedia for work, at home, and in our day-to-day lives. And we only use it in a language we understand well. If someone speaks different languages, they may read several Wikipedia editions, which is reflected in the page view statistics.
So here’s the map of the most common languages used to read Wikipedia. You can find the code on GitHub, and I uploaded images on Wikimedia. Only languages that are predominant in at least four territories are shown:
At first glance, nothing surprising. It shows the heritage of previous empires and looks similar to other “civilization maps” with the Sinosphere, the Anglosphere, Protestant Europe, Catholic Europe, the Russian world, the Arab World, the Indosphere, Sub-Saharan Africa, and Latin America.
However, this map was slightly different in 2015 (there’s no data before): English was the most used language in most of Eastern and Northern Europe and the Balkans; Russian was strong in the Caucasus; French & English were standard in the Islamic world. Not any more. National languages replaced English, Russian, or French as the most common to read Wikipedia in 14 countries: Iran, Armenia, Georgia, Azerbaijan, Romania, Bulgaria, Greece, Serbia, Norway, Latvia, Algeria, Mauritania, Sudan, and Oman.
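The comparison between two snapshots can be sketched in a few lines of Python. This is only an illustration of the idea, not the code from the GitHub repository: the function names `top_language` and `switched` are hypothetical, and the page view numbers are invented for the example.

```python
from collections import defaultdict


def top_language(views):
    """Return {country: language edition with the most page views}.

    `views` maps (country, language) pairs to page view counts.
    """
    per_country = defaultdict(dict)
    for (country, lang), count in views.items():
        per_country[country][lang] = count
    return {c: max(langs, key=langs.get) for c, langs in per_country.items()}


def switched(before, after):
    """Return {country: (old top language, new top language)} for changes."""
    top_before = top_language(before)
    top_after = top_language(after)
    return {
        country: (top_before[country], top_after[country])
        for country in top_before
        if country in top_after and top_before[country] != top_after[country]
    }


# Invented shares (in % of page views), for illustration only.
views_2015 = {
    ("Georgia", "ru"): 40, ("Georgia", "ka"): 35, ("Georgia", "en"): 25,
    ("Ireland", "en"): 95, ("Ireland", "ga"): 5,
}
views_now = {
    ("Georgia", "ru"): 30, ("Georgia", "ka"): 45, ("Georgia", "en"): 25,
    ("Ireland", "en"): 94, ("Ireland", "ga"): 6,
}

changes = switched(views_2015, views_now)
print(changes)
```

With these made-up numbers, only Georgia is reported as having switched, from Russian (`ru`) to Georgian (`ka`); Ireland stays on English in both snapshots.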
Even in small states with high or very high English proficiency, such as Norway or Latvia, users prefer to use Wikipedia in their native language, even though there are 60 times more articles on the English Wikipedia than on the Latvian one.
I didn’t expect English to lose ground to Greek or Serbian! It seems that as soon as a local Wikipedia reaches a certain threshold, users prefer it. It probably answers 80% of their questions, and the cost of switching to English, even for fluent readers, is higher than the benefit of reading a more detailed page in English. I estimate this threshold at around 500,000 articles: the number of pages on the English Wikipedia in 2005, when it became the most popular reference website on the Internet and was deemed as good as the Encyclopedia Britannica. You only need a small group of language activists dedicated to promoting their language and creating content to reach this threshold. For instance, there are just 340 active editors on the Georgian Wikipedia, enough to make it the number one edition in the nation, ahead of English and Russian. And actually, 77% of the English Wikipedia content is written by 1% of editors!
Also, in some developing countries, only the French-, Russian-, or English-speaking elite had access to the Internet, and therefore to Wikipedia. As the Internet becomes more accessible, the rest of the population, which is not as comfortable in colonial languages, turns to Wikipedia in its mother tongue.
That’s why it’s so important to translate products and talk to customers in their native language.
More than the decline of English, it’s, as Niall Ferguson would say, the rise of the “Rest”: non-Western countries and their languages. Here’s the breakdown of editors (instead of page views) on Wikipedia by language, from its launch in 2001 to today. Western languages keep declining, while “Others” (in white) keep increasing. [I don’t know what happened on the German Wikipedia (de) around 2005.] The growth of Arabic (ar), Persian/Farsi (fa), and Chinese/Mandarin (zh) is impressive:
If the trend continues, I expect Estonia, Morocco, Albania, Croatia, Slovenia, Denmark, Moldova, Uzbekistan, Bosnia and Herzegovina, Tunisia, Mongolia, Bahrain, Afghanistan, Ukraine, Bangladesh, Cambodia, and Malaysia to switch to a local language by 2030, to the detriment of English, French, and Russian.
[April 2021 update: Estonia switched from English to Estonian, Morocco from French to Arabic, Afghanistan from English to Farsi. Norway, Laos, and Romania switched back to English. Source.]
There are three major exceptions:
Even though Arabic varieties are among the most spoken languages globally, only the Modern Standard Arabic edition is popular. The Wikipedia in Egyptian Arabic has 1 million articles—often of poor quality or created by bots—but few people read it, and the Moroccan Arabic Wikipedia has just been founded.
In Sub-Saharan Africa, African languages spoken by millions of people (Hausa, Somali, Zulu, Nigerian Pidgin, etc.) are still far behind English, French, and Portuguese. Local languages exceed 5% of the readership only in Tanzania (Swahili) and Ethiopia (Amharic).
Although Hindustani (Hindi/Urdu) has almost 1 billion speakers, the Hindi and Urdu Wikipedias’ market shares are less than 10% in India and Pakistan. Other languages in the region are in a similar situation (Marathi, Telugu, Tamil, Punjabi, Sinhalese, etc.). However, the Hindi Wikipedia has grown from 2% of visits in India in January 2016 to 8% today. Over the same period, the traffic coming from India to all Wikipedias increased by 45%. This traffic mostly went to the English Wikipedia and compensated for the decline of the English Wikipedia in the rest of the world.
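To see what the Indian figures imply in absolute terms, here is a rough back-of-the-envelope calculation. It assumes that the 2% and 8% shares are measured against all Wikipedia traffic from India and that the 45% growth covers the same period; the variable names are mine.

```python
# Figures quoted in the text above.
hindi_share_2016 = 0.02   # Hindi share of Indian visits, January 2016
hindi_share_now = 0.08    # Hindi share of Indian visits today
traffic_growth = 1.45     # total Indian traffic grew by 45%

# Absolute growth = (new share * new total) / (old share * old total).
hindi_growth = hindi_share_now * traffic_growth / hindi_share_2016
other_growth = (1 - hindi_share_now) * traffic_growth / (1 - hindi_share_2016)

print(f"Hindi views multiplied by about {hindi_growth:.1f}")
print(f"Non-Hindi views multiplied by about {other_growth:.2f}")
```

Under these assumptions, Hindi views were multiplied by about 5.8, while views of all other editions combined still grew by roughly 36%. The calculation cannot isolate English from the other non-Hindi editions, since their individual shares are not given here.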
Despite its relative decline, the English Wikipedia grows in absolute terms and represents almost 50% of page views. So even though English is decreasing in many countries, it’s still first by far at the global level. The second most-read language, Japanese, has 7 times fewer views! And no language is growing fast enough to overtake English as the world language in the next decade. But the rise of emerging countries and their languages points to a multipolar internet. A “Global Village,” yes, but with divided communities, centered around their language, and only interacting with each other to do business… or war: digital tribalism? A situation similar to what happens in real life in many parts of the world where different ethnic groups live next to each other in relative peace but don’t mix, except at the local bazaar. Far from the dream of an interconnected humanity, speaking the same language, “Globish,” and sharing the same values. This fragmentation of the web is already obvious with the Chinese Internet, largely separated from the rest of the world. The threats to ban Western platforms in Russia, the Indian ban of Chinese apps, and the recent calls to sanction Twitter in India and Turkey will accelerate this decoupling.
What do you think?
Antoine
Interesting insights, Antoine. How do you think the development of NLP applications will influence the "internet languages"? I assume countries whose language is not popular and whose national market is small will either rely on language-agnostic apps or adapt to the dominant market language.
When I look at the editor statistics, I feel that there might be a phenomenon like a "first phase", in which a lot of articles you would expect in an encyclopedia are missing, and a "second phase", in which every such article already exists and most of the editing is either 1. improving these articles or 2. writing articles on specialized topics that would never have been covered in another encyclopedia (like articles on individual movies, towns and so on). In this "second phase" there is therefore less editing activity.
The German peak we see could then correspond to the German Wikipedia's first phase, which it completed so fast that it quickly reached its "second phase", in which there is less editing than in Spanish for instance (maybe only in relative terms but not in absolute terms?), which still has to finish its first phase. It seems that Japanese follows the same pattern as German (probably because Japan and Germany were the biggest economies in 2005, along with the US). But French and Spanish seem to have been slow, so maybe their first phase spans many years.