My friend SJ Klein and I spent a chunk of yesterday evening talking about Wikimedia’s language issues. SJ is a wikipedian and a language enthusiast – a polyglot; I am embarrassingly monolingual (and, lately, have been having difficulty spelling common names in my own native language. Sorry, Michael…). That key difference aside, we’re both interested in how generative media projects on the web can include speakers of as many languages as possible.
My regular readers know that I’m obsessed with the question of how the Internet will change as an additional billion users join the network. It’s a safe assumption that many of these billion users will not read English… and will not create content in English. Recent statistics from Technorati suggest that more blogposts currently exist in Japanese than in English; my research suggests that there might be even more blogposts in Chinese than in Japanese.
Wikipedia gives an interesting introduction to some of the potentials and challenges of a massively multilingual internet. To fulfill Jimmy Wales’s vision of a free encyclopedia for everyone in the world, in their own native language, Wikipedia needs to do one of two things (or a combination of the two): create a comprehensive encyclopedia in one language and translate it into multiple languages, or create comprehensive encyclopedias in every target language.
Wikipedia’s doing a little of each, with an emphasis on the second strategy – there are now ten huge encyclopedias (100,000+ articles), 29 big encyclopedias (10,000+ articles) and hundreds of smaller wikipedias. While translation between wikipedias takes place – and the Simple English wikipedia exists, in part, so it can be translated and serve as the starting point for a new Wikipedia – the global Wikipedia community is engaged in the creation of hundreds of encyclopedias, not just one.
But not all wikipedias are growing at the same rate. Some wikipedias are surprisingly large, given how few people are native speakers of the language. Others are surprisingly small given how widely spoken the language is.
I took a close look at this question today. Of the most widely spoken native languages in the world – languages with over a million native speakers – several are well represented by very large (100,000+ article) wikipedias: Spanish, English, Portuguese, Japanese, French and German. Some are represented by smaller, growing wikipedias: Chinese, Russian, Arabic. And three have very small wikipedias: Hindi, Bengali and Punjabi. The Punjabi wikipedia has 42 entries – the language is the first language for 104 million speakers.
Putting together a very rough metric, I calculated the number of wikipedia articles per million native speakers of the language (WA/MS) for languages with over 30 million speakers. The leader in the set is Polish, with a 233,740 article wikipedia and 46 million native speakers, a WA/MS of 5081.3. The German and English speakers have strong showings as well, with WA/MS of 3925 and 3656 respectively. (If we extended beyond the 30 most spoken languages in the world, the Scandinavians begin displaying their strength – Swedish weighs in with 18,041 articles per million speakers of the language. And the Icelanders have created a 10,059 article Wikipedia, despite the fact that the language has fewer than 300,000 native speakers. Were we to consider languages with no native speakers – Esperanto, Ido, Interlingua – we’d encounter division by zero errors… but discover that Esperanto has 43,687 entries in its wikipedia.)
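For anyone who wants to check the arithmetic, here is a minimal sketch of the WA/MS calculation in Python. The function name wa_ms and the decision to return None for a language with no native speakers are my own assumptions for illustration; the article and speaker counts are the ones quoted in this post.

```python
from typing import Optional


def wa_ms(articles: int, native_speakers_millions: float) -> Optional[float]:
    """Wikipedia articles per million native speakers (WA/MS).

    Returns None when a language has no native speakers, so constructed
    languages such as Esperanto do not trigger a division by zero.
    (Returning None is an illustrative choice, not part of the metric.)
    """
    if native_speakers_millions <= 0:
        return None
    return articles / native_speakers_millions


# (article count, native speakers in millions), as quoted in this post
languages = {
    "Polish": (233_740, 46),
    "Punjabi": (42, 104),
    "Esperanto": (43_687, 0),
}

for name, (articles, speakers) in languages.items():
    score = wa_ms(articles, speakers)
    label = "undefined" if score is None else f"{score:.1f}"
    print(f"{name}: {label}")
```

Polish comes out at 5081.3, matching the figure above, while Punjabi lands at roughly 0.4 articles per million speakers and Esperanto is reported as undefined rather than crashing.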
Of the ten languages that score lowest on this metric, eight (Punjabi, Oriya, Hindi, Gujarati, Bengali, Urdu, Malayalam and Tamil) are Indian languages. (So are the next three – Kannada, Telugu and Marathi.) The other two are Southeast Asian – Burmese and Javanese.
I strongly suspect that the slow growth of these wikipedias is not a function of their geography – Amharic, with 27 million native speakers and 312 articles, would place between Bengali and Urdu in terms of WA/MS if I extended the calculations beyond the top 30 languages. It’s a function of the digital divide. Swedish may only have 8.8 million native speakers, but the majority of them have Internet access. Net penetration in India is much lower (perhaps 5%)… and it’s extremely low and heavily restricted in Burma, which helps explain the size of that wikipedia.
Still, even with only 5% internet penetration, India has an estimated 50 million Internet users – more than any nation other than the US, China or Japan. That raises a complicating factor, one that SJ and I argued about at length last night: what language does a multilingual person choose to write in?
I’ve talked about this question at some length with multilingual blogger friends. Many of the bloggers I know who speak Arabic fluently choose to blog in English. Their audience isn’t their countrymen – it’s an international audience. Furthermore, many of their countrymen (and women!) who are online are also bilingual – the ability to read and write in English is closely correlated with high levels of education, high incomes and internet access. The same is likely true in India – popular bloggers in India appear to be blogging primarily in English.
But wikipedia’s different, isn’t it? The goal is to create a body of knowledge useful for the wider world – surely Punjabi speakers want to ensure that people who speak only Punjabi have useful content when they come onto the Internet?
Well, yes. But they may also want to influence perception and opinion on topics important to them by creating articles on political figures, important issues, issues of national or regional pride. And it makes sense to contribute to the wikipedia which has a broad audience and, therefore, a maximum chance of being read and influencing opinion. Which likely argues that it makes more sense to edit the English wikipedia – with a huge audience – than the Punjabi wikipedia. There’s also a critical mass issue – until the Punjabi wikipedia hits a certain size – 1,000 articles? 10,000 articles? – it’s a project, rather than a resource.
If I understand SJ correctly, he’d like to see wikipedians editing articles in their native language, and another group of translators making sure those unique articles are translated into additional languages. I wonder if this will work – our experience with Global Voices is that it’s harder to get people to translate than it is to get them to write. But SJ’s proposed method would be critically important for projects like the One Laptop per Child project – children learn to read better in their native language. Having a wikipedia in Kannada or Burmese makes it easier to use the computer as a teaching tool. And it means that language is less of a barrier to some of the next billion users having access to critical content.
But it won’t be easy, I suspect. I hope that the Boston Wikimania conference can include some conversations with Swedish and Polish wikipedians so we can find out how those language communities have generated such interest in wikipedia. At the last Global Voices summit, participants from around the world listened intently to Ory Okolloh as she explained how the Kenyan blog community had become so robust – I hope wikipedians in successful communities can help along the language groups that are struggling at present.
Some rough data on the WA/MS index… By the way, the largest language not to have a wikipedia appears to be Madura, spoken by 8-14 million people in Indonesia. The most widely spoken language which doesn’t have a Wikipedia entry in the English wikipedia: Maninka, spoken by 3.3 million people in eastern Mali and Guinea.