Language On Wikipedia: How The Website Deals With The World’s Many Languages
Wikipedia is one of the greatest accomplishments of this generation. It’s not without its issues, but the website’s goal of bringing the world’s knowledge together is noble. There is, of course, one thing that stands in the way of all information being able to seamlessly merge: language. The plethora of languages in the world is a fantastic thing, but language on Wikipedia causes the website to be fragmented.
Because of the multilingualism of the website, there are a lot of fascinating phenomena. We looked into some of the most interesting aspects of language on Wikipedia, because it turns out Wikipedia has a lot of language mysteries.
How Many Languages Are On Wikipedia?
Since its launch in 2001, Wikipedia has hosted 306 different language versions. Compared to the roughly 7,000 languages spoken in the world, this number can seem small, but considering that just 23 languages are spoken by more than half the world’s population, these 306 cover a broad swath of humanity.
Adding a new Wikipedia is not an automatic process. While the website has a reputation for letting anyone do anything, that’s not technically true. To create a new language edition, you have to submit a proposal to the Wikimedia Foundation. It decides, among other things, what counts as its own language and what’s just a dialect (though that division is often murky).
There generally has to be content and community to make the case for a language to be added. If not, the Wikipedia might be shut down. In its history, Wikipedia has shut down 10 language editions, including Choctaw Wikipedia, Afar Wikipedia and Ndonga Wikipedia. That leaves 296 of the original 306 Wikipedias still going today.
The decision-making process for languages can be a tough one, especially when deciding which languages are worth the website’s limited resources. Klingon, a constructed language made for Star Trek, was on Wikipedia for a while before being taken down in 2005. A year later, it was moved to Wikia (also known as Fandom), a website that runs on the same open-source wiki software as Wikipedia. Wikipedia does allow constructed languages like Esperanto and Volapük, but not one specifically designed for a fictional alien race.
What Are The Biggest Languages On Wikipedia?
There are a few different ways to measure how big a Wikipedia is, but if you’re going off of how many articles a language has, English is the clear champion. The top 10, however, has some outliers.
Biggest Wikipedias By Number Of Articles
- English (5.97 million)
- Cebuano (5.38 million)
- Swedish (3.75 million)
- German (2.36 million)
- French (2.15 million)
- Dutch (1.98 million)
- Russian (1.58 million)
- Italian (1.56 million)
- Spanish (1.56 million)
- Polish (1.37 million)
Most of this list is probably not far off from what you would expect. If you look at the most-used languages on the internet, it’s pretty obvious that English would be on top. There is a lack of Asian languages in the top 10, but they make appearances just a little bit further down the list. But there’s one language that might catch your eye: Cebuano.
The Cebuano language is one of the many languages spoken in the Philippines, and it has almost 16 million speakers (probably more now, because the latest count was in 2005). It’s the language of the Visayas, an archipelago in the central part of the country. While it’s not a tiny language, it’s definitely weird that it should have almost as many articles as English. It’s even weirder when you look at the top 10 Wikipedias based on how many users they have.
Biggest Wikipedias By Users
- English (37.55 million)
- Spanish (5.61 million)
- French (3.60 million)
- German (3.31 million)
- Chinese (2.84 million)
- Russian (2.63 million)
- Portuguese (2.32 million)
- Italian (1.89 million)
- Arabic (1.74 million)
- Japanese (1.55 million)
Where does Cebuano rank? It comes in at 69th, with only about 62,000 users. Cebuano having that many articles, then, is a wild anomaly. Fortunately, there’s a known explanation.
Josh Lim, an active member of the Filipino Wikipedia community, explained the reason for the Cebuano article anomaly in a 2016 Quora post. In late 2006, there was a huge increase in Cebuano articles, and it became the most prominent language used by Wikipedia contributors from the Philippines. This didn’t make sense: most people in the country were writing in English at the time, as it’s one of the official languages, and Cebuano isn’t even the largest non-English language in the country. That would be Tagalog.
The Wikipedians realized that the influx was caused by one user, who created a bot to translate some 50,000 articles about French communes into Cebuano. This caused a war between Philippine-language Wikipedias, because suddenly they all wanted to have the highest article counts. Waray-Waray, another language of the Philippines, also ranks highly on number of articles for this reason.
But the biggest contributor to sheer numbers of articles was Sverker Johansson. He created LSJbot, which machine-translated huge numbers of articles. Before it got to Cebuano, LSJbot did translation work with other languages, including Swedish. Not coincidentally, Swedish comes in third despite having a relatively small Wikipedia community.
The bot’s creator originally said his goal was to create more material for human Wikipedians to edit and improve upon. However, there are now so many articles that it’s overwhelming for the editors. Also, the articles tend to have little to do with what people who speak Cebuano would actually want to know, and the machine translation is not exactly easy to read. It may hold the title of second-largest Wikipedia (by article count), but Cebuano Wikipedia is less helpful to actual Cebuano readers than its size suggests.
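To put the Cebuano anomaly into numbers, here is a rough Python sketch that divides article counts by user counts for the languages that have both figures quoted above. The figures are copied from the two lists, plus the “about 62,000 users” mentioned for Cebuano. The variable and function names are made up for illustration, and Swedish, Dutch and Polish are left out because the text gives no user counts for them.

```python
# Article counts in millions, copied from the "By Number Of Articles" list.
ARTICLES_M = {
    "English": 5.97,
    "Cebuano": 5.38,
    "German": 2.36,
    "French": 2.15,
    "Russian": 1.58,
    "Italian": 1.56,
    "Spanish": 1.56,
}

# User counts in millions, copied from the "By Users" list.
# Cebuano's value is the "about 62,000" quoted in the text.
USERS_M = {
    "English": 37.55,
    "Cebuano": 0.062,
    "German": 3.31,
    "French": 3.60,
    "Russian": 2.63,
    "Italian": 1.89,
    "Spanish": 5.61,
}


def articles_per_user(language: str) -> float:
    """Return how many articles exist per user for one language."""
    return ARTICLES_M[language] / USERS_M[language]


# Only compare languages that appear in both tables.
ratios = {lang: articles_per_user(lang) for lang in ARTICLES_M if lang in USERS_M}

# Sort from the most to the fewest articles per user.
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)

for lang, ratio in ranked:
    print(f"{lang:<8} {ratio:6.2f} articles per user")
```

With these figures, Cebuano works out to roughly 87 articles per user, while every other language in the comparison has fewer than one.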
What Are The Most-Translated Wikipedia Pages?
For the most part, Wikipedia articles are not translated directly from one language to another. There are clear exceptions to that, as the Cebuano example shows, but an article can exist in multiple languages and be completely different in each. You can often find out a lot more about a Spanish-specific topic if you read the Spanish Wikipedia articles, for example. This makes Wikipedia a fascinating language-learning resource.
Finding out what topics exist in the most language editions can provide a peek into what topics have the most universal appeal. Wikipedia fortunately keeps a record of which articles have the most “interwikis,” which are links to the same topic in another language.
Wikipedia Articles With The Most Interwikis
- Japan (295 interwikis)
- Finland (295 interwikis)
- Turkey (290 interwikis)
- Russia (289 interwikis)
- The United States (287 interwikis)
- Chile (283 interwikis)
- Norway (283 interwikis)
- Italy (282 interwikis)
- China (282 interwikis)
- Wikipedia (280 interwikis)
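One way to read these counts is as coverage. The Python sketch below converts each interwiki count into the number of active Wikipedias that lack the article. It rests on two assumptions that the list itself does not confirm: that an interwiki count excludes the edition it was counted from, and that the 296 active Wikipedias mentioned earlier is the right total for when the counts were taken. The names are illustrative only.

```python
# Active language editions, per the article: 306 created minus 10 shut down.
ACTIVE_EDITIONS = 296

# Interwiki counts copied from the list above.
INTERWIKIS = {
    "Japan": 295,
    "Finland": 295,
    "Turkey": 290,
    "Russia": 289,
    "The United States": 287,
    "Chile": 283,
    "Norway": 283,
    "Italy": 282,
    "China": 282,
    "Wikipedia": 280,
}


def editions_with_article(interwikis: int) -> int:
    """Assumption: interwikis link to *other* editions, so add the home edition."""
    return interwikis + 1


def editions_missing(interwikis: int, total: int = ACTIVE_EDITIONS) -> int:
    """Editions without the article; clamped at zero in case the totals disagree."""
    return max(total - editions_with_article(interwikis), 0)


missing = {title: editions_missing(count) for title, count in INTERWIKIS.items()}

for title, gap in missing.items():
    print(f"{title:<18} missing from {gap:>2} editions")
```

Under those assumptions, Japan and Finland appear in every active edition, while the article on Wikipedia itself is missing from 15 of them.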
While it’s fun to think about the fact that some Wikipedias don’t have a Wikipedia entry for Wikipedia, the top articles are hardly surprising. Countries in fact dominate most of the top of the list, along with other geographic locations. The first human you stumble upon isn’t until 53rd place, where U.S. President Ronald Reagan sits with 249 interwikis. The next humans are Jesus (245 interwikis), Michael Jackson (234 interwikis), Barack Obama (232 interwikis), Leonardo da Vinci (215 interwikis) and Corbin Bleu (200 interwikis).
If that last name seems a bit jarring, that’s because it is. It makes almost no sense that Corbin Bleu, of High School Musical fame, is one of the most talked-about celebrities in the world. Not to be too rude to him, but he’s hardly the best-known living actor.
This conundrum has caught the eye of several different people over the years, especially because of a study done by the MIT Media Lab that tried to use interwikis to quantify celebrity fame. When the study ranked Corbin Bleu near the top, it set off some alarm bells.
Fortunately, Reddit was on the case. In a post from early 2019, user u/b0b10b1aws1awb10g presented the story to the r/UnsolvedMysteries subreddit. Within hours, commenters had traced the Corbin Bleu articles largely to a single Wikipedia author, Zimmer610. The user who discovered Zimmer610, u/Lithide, looked through the edit histories of countless Corbin Bleu articles. They found that the articles almost all come from the same place (Riyadh, Saudi Arabia) and that many of them, though not all, were written by this one account.
Some of the languages that have Corbin Bleu articles are perhaps the most suspicious. There is really no reason for him to appear in the Old English or Nahuatl Wikipedias: Old English is no longer spoken, and Nahuatl, though still spoken in Mexico, is hardly an obvious audience for a High School Musical actor. Most of the articles are also not especially well written, which makes sense if someone is just trying to get Corbin Bleu into as many languages as possible.
Despite the investigative work of Reddit, there are still no answers as to why this person is propagating the gospel of Bleu. It could just be an inside joke, or maybe they’re a hyperpolyglot Corbin Bleu superfan. Either way, it’s an example of the good and the bad of Wikipedia.
It’s useful to allow people to contribute their knowledge to the world, but you never know when someone will write hundreds of articles as part of an inside joke. Editors do their best to keep people from vandalizing pages and writing untrue things. But with millions of articles, the site is so big that policing it can feel like fighting the current. At its very best, Wikipedia is an imperfect website striving toward knowledge. Once we accept that, we can wade through the Corbin Bleus to get at a treasure trove of info. But also check the sources.