Your language or mine? (Part 2)

August 9, 2006 / August 15, 2006

Much of the conversation I read online about Wikipedia seems to be focused on the radical, audacious idea that an encyclopedia written by amateurs could rival the quality and comprehensiveness of encyclopedias written by professionals.

I’d like to suggest that this is by no means the most audacious aspect of the project Jimmy Wales has taken on.

Guest-blogging on Larry Lessig’s site a year ago, Jimmy wrote, “The goal of Wikipedia (and the core goal of the Wikimedia Foundation) is to create and provide a freely licensed and high quality encyclopedia to every single person on the planet in his or her own language.”

It’s that last clause that’s the radical one. It implies a massive data dissemination effort – either the distribution of billions of print copies of an encyclopedia, or participation in a huge digital divide project like the One Laptop Per Child effort. And it implies a translation and localization effort on a scale that boggles the mind.

Jimmy constrains the problem somewhat:

I will define a reasonable degree of success as follows, while recognizing that it does leave out a handful of people around the world who only speak rare languages: this problem will be solved when Wikipedia versions with at least 250,000 articles exists in every language which has at least 1,000,000 speakers and significant efforts exist for even very small languages. There are many local languages which are spoken by people who also speak a more common international language – both facts are relevant.

Ethnologue, a leading resource for language study, lists 6,912 known, living languages. 94% of the people in the world speak one of 347 languages which have one million or more speakers.

By contrast, Wikipedias with 250,000 articles currently exist only in four languages: English, German, French and Polish. (The Japanese wikipedia is 242,000 as I’m writing this post.) Of languages with 100 million or more native speakers, the Chinese, Hindi, Spanish, Bahasa Indonesia/Bahasa Malaysia, Arabic, Portuguese, Bengali, Russian, Japanese and Punjabi wikipedias still need work – the Punjabi wikipedia is apparently up to 50 articles, eight more than the last time I wrote about it.

Jimmy had an interesting proposal at Wikimania about addressing this problem: paid coordinators who help recruit contributors and build these new Wikipedias. I hope whoever these new coordinators are, they’ll have a chance to learn from Ndesanjo Macha, who is both the father of the Kiswahili blogosphere and one of the key movers behind the Kiswahili wikipedia, which recently crossed the thousand article mark. At one of the sessions I moderated at Wikimania, Ndesanjo told us that building the Kiswahili wikipedia has involved extensive evangelization, leveraging offline and online social networks, strong-arming bloggers into writing articles, publishing articles on Wikipedia in Tanzanian newspapers, and persuading Kiswahili teachers in the US to make writing articles for the Wikipedia a class project.

Talking with Ndesanjo and other multi-lingual Wikipedians, I became aware of an interesting debate within the Wikipedia community. In trying to achieve Jimmy’s dream of a free encyclopedia for everyone in their own language, is the goal to create a single, coherent encyclopedia that can be translated into many different languages? Or to help every language community around the world create their own encyclopedia which will have somewhere from a little to a lot of overlap with another encyclopedia?

No one was brave, anglocentric or foolish enough to suggest that the solution to Wikipedia language problems was to start translating the English wikipedia into as many languages as possible… though I mentioned the Simple English Wikipedia, which is designed to help people learn English and as a source of simply-worded articles which could be translated into other languages. (This earned me a tongue-lashing from my friend Alek Tarkowski, who pointed out that speakers of other languages weren’t stupid, just uneducated in that language…) But some Wikipedians suggested that much of the translation problem could be tackled by finding the ur-version of articles and translating them into different languages: if the French version of the article on cheese was the definitive cheese article, the English wikipedia article on cheese should be a translation of the French article.

Searching Wikipedia for information on 18th-century encyclopedias and the idea of an encyclopedia as a summary of human knowledge, I found a real-world example of this suggestion. The English Wikipedia article on the 18th-century, Diderot-edited Encyclopédie appears to be substantially based on the 1911 Encyclopædia Britannica article on Encyclopaedias. Challenging the neutrality of the article, Wikipedian Hardouin notes that the article “tries to portray the Encyclopédie as essentially an English work pirated by evil Frenchmen using dubious legal proceedings to dispossess innocent English editors” – an understandable bias in a hundred-year-old British encyclopedia, but perhaps less forgivable now – and suggests a translation of the French article on the Encyclopédie as an alternative.

While this may be the right way to resolve a debate about the Encyclopédie, it’s unlikely to solve some other cross-language arguments. Which article is the ur-article to translate on Jerusalem? The Hebrew, the Arabic or the English? (Even if you don’t read all three languages, the choice of images on each of the three articles is an interesting contrast.) Raising the Jerusalem article in one of the sessions I was moderating, one Wikipedian suggested that the point of NPOV – neutral point of view – was that it should enable creation of a factual article satisfactory to the Arabic, Hebrew and English authors. It could present – but not assert – opinions held by Christians, Muslims and Jews, but would be sufficiently neutral as to satisfy all audiences. Whether or not such a compromise is possible, it raises other questions – how do debates about an article take place between speakers of different languages? Do we decide the ur-language and then debate in that language? Is this fair to an author who is weaker in the language of debate than her native tongue?

Ndesanjo suggests another possibility – if we consider Wikipedia to be a project to “decolonize” cyberspace, as he does, it makes more sense to consider each language’s encyclopedia independent, with its own priorities, standards and processes. In some languages, the priority might be to create a widely usable reference quickly, which might focus on translating a lot of articles from a convenient encyclopedia, like the English wikipedia. Or it might be to document aspects of the culture associated with the language likely to be undocumented in other languages. Ndesanjo gives the example of the Kiswahili wikipedia article on Mbege, a beer made from fermented millet and bananas. Mbege has merited a two-sentence stub in the English wikipedia, but it’s an important part of Tanzanian culture Ndesanjo and collaborators want to ensure is preserved in cyberspace… which they’ve done with a much longer article.

The difference between the Mbege article in English and Kiswahili suggests that it might be worth searching for language-specific articles, articles that exist (or exist as full entries) only in smaller Wikipedias. Of the 1000 articles in the Kiswahili wikipedia, how many have no satisfactory parallel in the English wikipedia? How about for a large Wikipedia, like Polish? Are 10 of the 272,000 articles unique to the Polish edition, or 10,000?

(It wouldn’t be all that difficult to conduct this experiment, since Wikipedia articles link to versions of that article in other languages. Spider the wikipedia for a target language, and follow the links to English versions of each article. When those links aren’t present, or when they link to a much shorter version of an article, flag that article as linguistically unique. It may also not be necessary to go through all this trouble – some wikipedias appear to feature their “unique” articles more often as featured articles of the day than articles that are derivative of other wikipedias – this data could be mined as well.)
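The flagging step of that experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spidering is taken as already done, the `Article` record, the `flag_unique` function and the `ratio` threshold are hypothetical names chosen here, and the word counts in the sample are invented. Raw word counts are also a crude proxy, since languages differ in how many words they need to say the same thing.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """One article in the target-language wikipedia (hypothetical record)."""
    title: str
    word_count: int
    # Language code -> word count of the article reached via the interlanguage link.
    interlanguage: dict = field(default_factory=dict)


def flag_unique(articles, reference_lang="en", ratio=3.0):
    """Flag articles with no counterpart in reference_lang, or a much shorter one.

    `ratio` is an arbitrary threshold: an article is flagged when it is at
    least `ratio` times as long as its counterpart.
    """
    flagged = []
    for article in articles:
        ref_words = article.interlanguage.get(reference_lang)
        if ref_words is None:
            flagged.append((article.title, "no counterpart"))
        elif article.word_count >= ratio * ref_words:
            flagged.append((article.title, "counterpart much shorter"))
    return flagged


# Invented sample data standing in for a crawl of a small wikipedia.
sample = [
    Article("Mbege", 600, {"en": 30}),
    Article("Tanzania", 400, {"en": 9000}),
    Article("Kijiji", 250, {}),
]

for title, reason in flag_unique(sample):
    print(f"{title}: {reason}")
```

A length threshold will miss short but rich articles and over-flag verbose ones, so the output is better treated as a list of candidates for a bilingual reader to review than as a verdict.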

Finding and translating the articles that are linguistically unique would have the effect of strengthening large wikipedias, like the English wikipedia, as well as calling attention to the original work being done in building smaller wikipedias. A Serbian contributor – bilingual between English and Serbian – noted that he rarely writes for the English wikipedia because so much already exists in the English version. Identifying and translating the unique articles in the Serbian wikipedia might balance these content flows.

This also opens an intriguing possibility for potentially controversial topics, like Jerusalem: the English language article on Jerusalem might include not only links to the Arabic and Hebrew versions, but to English translations of the Arabic and Hebrew versions, letting English readers see how the subject is covered in other languages. (“English” here is a placeholder – I think this would be interesting to try in any language where you can find translators capable of the language pairing.) There are lots of practical problems – you need to retranslate as the other articles grow, you need to find ways to present the translations that don’t confuse a casual user, and ways to deal with the combinatorial explosion of languages. (347 languages with more than a million speakers implies the need for a Polish – Punjabi translator, who may or may not exist. And it suggests 347!/2 translations of each article, which is a number that breaks most earthly calculators…)
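The scale of that explosion is worth a quick check. The sketch below assumes the 347-language figure from Ethnologue and counts language pairs directly; the variable names are illustrative. Counted this way, the totals come out far smaller than the factorial figure, which grows much faster than the number of pairings does.

```python
import math

LANGUAGES = 347  # languages with one million or more speakers, per the Ethnologue figure

# Each article needs a version in every other language.
others_per_article = LANGUAGES - 1

# Translator pairings, if one bilingual translator covers both directions.
unordered_pairs = math.comb(LANGUAGES, 2)

# Translation directions, if Polish -> Punjabi and Punjabi -> Polish count separately.
directed_pairs = math.perm(LANGUAGES, 2)

# The factorial figure, for comparison.
factorial_figure = math.factorial(LANGUAGES) // 2

print(f"other languages per article: {others_per_article}")
print(f"unordered language pairs:    {unordered_pairs:,}")
print(f"directed language pairs:     {directed_pairs:,}")
print(f"digits in 347!/2:            {len(str(factorial_figure))}")
```

On this count the problem is about 60,000 translator pairings, or about 120,000 translation directions: still daunting, but a number an ordinary calculator can hold, whereas 347!/2 runs to more than 700 digits.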

As Wales and the rest of the Wikipedia community start addressing the immense problem of producing free encyclopedias in 347 languages, it’s worth asking: “Are we writing one encyclopedia and translating it, or writing 347 encyclopedias and translating when necessary?” Phrasing that question to Wikipedians, some expressed confident opinions… which contradicted each other. I’m hoping I can provoke Jimmy to offer a more definitive statement on the approach Wikipedia is taking… or invite a larger discussion on a topic that I think is critical to the success of the project.

In thinking about language and encyclopedias, I found myself reading more about the Rosetta Project, which is attempting to document all the world’s languages in an online database. They’ve got word lists for roughly 3,000 languages, perhaps half the world’s living languages. One of the more audacious parts of their project involves creating a nickel disc microengraved with 15,000 pages of text which could serve as a Rosetta stone for future generations trying to decipher long-dead languages. Oh, and they’re launching a copy into space to rendezvous with the Wirtanen Comet in 2011. You know, an off-site backup.

TOTH to Brewster Kahle, for letting me know about the project.

This post is intended as a sequel to my May post, “Your Language or Mine?”

14 thoughts on “Your language or mine? (Part 2)”

quinn August 10, 2006 at 1:10 am

I don’t see why it has to be an ‘or’ question. Why can’t there be a post on Cheese, and a post on Cheese Translated from French Wikipedia? They can be just one hyperlink away. Why can’t you have the goal of 347 Wikipedias, and all of them containing translations of the other 346?
Hans Suter August 10, 2006 at 1:45 am

For the owner of a Knaur’s Konversationslexicon of 1936 there is another problem, too.
If I look up in this Knaur’s the word “Nationalsozialismus” I have the following “Der Nationalsozialismus erstrebt Bildung eines selbstbewussten, völkischen Nationalstaates….” (N. aims at building a self-conscious, popular nation state (?)…). When you look up the same word in a US dictionary of 1936 what will you find and how would this play today?
Martin Benjamin August 10, 2006 at 11:15 am

Quite an interesting article. I’m posting the link to the AfrophoneWikis discussion, and hope that readers with a particular interest in the topic join the group at http://groups.yahoo.com/group/afrophonewikis
Kasper Souren August 10, 2006 at 11:57 am

Regarding your last question, in my opinion there is no doubt about it: We are writing many encyclopedias, and translating when wanted.

When machine translation is one day both good and free enough, we can start the other, different project. Call it Ultimate Wikipedia or WikipediaZ, which will be about the creation of one encyclopedia in a meta language that can automatically be translated into other languages.

For now, I hope that the Swahili Wikipedia can maintain the current growth rate, and will also be an impetus for the other Afrophone Wikipedias.
Sabine Cretella August 10, 2006 at 12:12 pm

Well, I am here thanks to Martin Benjamin and I must say: this article is very profound and touches many themes I myself often consider … I am a bureaucrat of the Neapolitan Wikipedia, so a regional language mainly used on the Italian territory. We (the regional-language Wikipedias) face very similar problems – it is hard to create articles, and often there are just two or three of us working on projects. But there are many ways to co-operate, because much of the data you can find in an encyclopaedia (such as statistical data etc.) is very similar. Lately there was that help request on the wikipedia-list to add Geocodes to the German wikipedia … well: we could add Geocodes, but why do this only for one Wikipedia if it could be used for all? I wrote about that in my blog. The addition of geocodes would be valuable for all Wikipedias. The same is valid when it comes to other data.

Another point you mention is: translating Wikipedia. Well, that works up to a certain point. Things are often perceived differently in different countries. Even having NPOV articles you will find that people find certain parts more relevant than others. Anyway: yes it makes a lot of sense to translate articles, because once they are there people will start to work on them and make them truly localised texts.

We are also talking about tools that help localisation. What we must be aware of is that in many countries people don’t have internet access, but they could do work offline. This means that on those wikipedias we could create some kind of structured workflow where people work offline and another group of people that has internet access takes care of the updates. Articles that people are working on offline should be locked for that period and we should know who is working on them. Then people can get an offline version or a version to be installed on a local server where they can easily access it etc.

Well this is a theme where we could go on talking for a very long time. Strategies and co-operation are needed and we must build our networks in such a way that we find each other. If all of us co-operate, and in particular the small Wikipedias co-operate, we will get great contents quite fast and interest in our projects will grow.

I would like to conclude this post with three other posts of mine about Wikipedia:
– Making endangered languages fun
– Languages … languages … languages … and contents
– Translating contents for small Wikipedias

Let’s work together on contents creation for all languages :-)

Best, Sabine
GerardM August 10, 2006 at 12:23 pm

Ethnologue maintains the ISO-639-3 list. It currently has 7602 languages. This list is not exhaustive; the Neapolitan language for instance is a candidate for splitting into at least two separate languages.

From my perspective, you need to localise MediaWiki (the software that runs Wikipedia) in as many languages as possible. This is done best by making sure that “local” organisations adopt MediaWiki for their own purposes. When they do, it will be a big stimulus for the creation of a Wikipedia in that language. It will also make this language available in http://wiktionaryz.org.

Localisation and local adoption are the first step towards enabling people to be personally involved on the Internet. When more people and organisations adopt MediaWiki, more information may come Wikipedia’s way.

Thanks,
GerardM
Martin Benjamin August 10, 2006 at 2:07 pm

Some other factors to consider when talking about translating articles in wikis:

An article is written in Language 1, and then translated into 100 other languages. Then the article is changed in Lang 1. Would there be an expectation that the 100 translators rush back and re-translate the edits? At what frequency?

An article is written in Language 1, then translated into Language 2. The Lang 2 article is subsequently edited. Does that edit get translated back into Lang 1? Does the Lang 2 edit get translated into Lang 99 – meaning that you need translators and a system of vigilance between each pair, not just between each language and Lang 1? If Lang 2 and Lang 99 are edited in different forks, at what point, if ever, do they get reconciled?

And how often will the process of direct translation be worthwhile? Yes, translations of the Hebrew and Arabic entries for Jerusalem would be informative, but the Quechua article about Jerusalem might not bring much to the party – whereas I would be quite interested in reading a translation of the Quechua article about Machu Picchu, much more than the Arabic or Hebrew on that topic. However, it is quite possible that the automated flagging system that Ethan proposes would not pick up Jerusalem as an article of particular interest to translate from Hebrew and Arabic (the entries in the English or German or Quechua wikipedias might be just as long), or might overlook the Quechua Machu Picchu article because, though richly informative, it turns out to be a shorter text than articles written by especially verbose Italians or Japanese who have passed through as tourists.

All of this suggests that the guiding factor in choosing articles for translation ought to be human intelligence. Spiders and bots can be helpful, but many of the jewels will emerge when a bilingual editor comes across an article in one language and decides that a direct translation would be valuable.

The development of a system for producing and locating translations would be quite helpful (and also complicated, because you would need to include versioning). However, at no point should there be an *expectation* that each article will be translated 347!/2 times – that should be a possibility within the infrastructure, but not a goal.
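One way to picture the versioning such a system would need is to record, for each translation, which revision of the source article it was made from, and to compare that against the source's latest revision. The sketch below is a minimal illustration, not a description of any existing Wikipedia tool: the `Translation` record, the `stale_translations` function and the revision numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    """A translated article and the source revision it was made from."""
    target_lang: str
    source_lang: str
    source_revision: int


def stale_translations(current_revisions, translations):
    """Return translations whose source article has changed since translation.

    `current_revisions` maps a language code to the latest revision number
    of the article in that language.
    """
    stale = []
    for t in translations:
        latest = current_revisions.get(t.source_lang, t.source_revision)
        if latest > t.source_revision:
            stale.append(t)
    return stale


# Invented revision numbers for one article across three language editions.
current = {"en": 120, "sw": 45}
translations = [
    Translation("sw", "en", 120),  # made from the current English revision
    Translation("pl", "en", 100),  # English has moved on since this was made
    Translation("en", "sw", 45),   # made from the current Kiswahili revision
]

for t in stale_translations(current, translations):
    print(f"{t.source_lang} -> {t.target_lang} is out of date")
```

A record like this only reports that a translation has fallen behind; whether and when anyone retranslates remains a human decision, which fits the point that translation should be a possibility in the infrastructure and not an expectation.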

On a related note, there should also be a system for rooting out errors that get propagated from one Wikipedia to the next. Just today, I came across an error that I’d fixed and elaborated on in an English article, and then noticed that the same error was replicated in a dozen other languages. I knew enough to fix the initial error in most of the other pages (the derivation of Chikungunya disease was said to be Swahili, when in fact it came from Makonde), but could not do much more than stick a comment in the various language edit pages asking people to go to the English page and update their languages accordingly. It would be nice to have a way of flagging differences between Lang 2 and Lang 99 – perhaps simply an indication next to the article link that gives comparative word counts or paragraph counts of the associated articles, to help suss out the gems.

Off topic, here’s an explanation of when to use Swahili vs. Kiswahili: http://wapurl.co.uk/?1FHKQDF
Don Osborn August 10, 2006 at 8:12 pm

Thanks Ethan, a very interesting article. (Thanks also to Martin for mentioning it on AfrophoneWikis.)

We may not have to be prescriptive on the approach to translations. Generally I’d agree with Ndesanjo that each Wikipedia can be independent – that’s really the beauty of it as I think we all agree. Of course, there are some items or even types of info that might best be translated pretty much literally (not always, but most of the time if only for mathematical reasons, from the Wikipedias with high numbers of entries to those with lower). But mostly the potential within each language community for elaborating the viewpoints from its particular cultural-linguistic perspective, with the particular turns of phrase and perhaps unique vocabulary of the language, is what makes having a Swahili or Manding or Punjabi etc. Wikipedia so attractive and really helpful to all.

In the case of contested subjects and different languages, it may be helpful to sort out the two issues as much as possible – the language and the content/narratives of the contested issues. The ideal I suppose is a masterful NPOV article that begs translation; the worst is unfair articles in different languages with no connection. When you have a Jerusalem/AlQods issue, a Falklands/Malvinas, etc. (the names even are different!) these cases probably will take up a lot of time, effort and creativity (including cross translations). But happily only a relative few such issues need such sustained attention, and most others however important, do not (kind of a 90-10 or 80-20 situation).

I’m writing this in several sections at several sittings and noting now Martin’s posting – I would agree with him on his conclusion that “the guiding factor in choosing articles for translation ought to be human intelligence.” Some things can be automated/rules-across-the-board while others are best handled as artisanal/situation-specific. The translations question falls under the latter.

It is also true that not everything in x language has to be in y and vice versa. It’s great when it is, but often, especially for minority language speakers, they are by necessity multilingual and can access material in different languages. It may make sense to focus on developing unique material in their languages first. On the other hand I will say that where I think translation gets really interesting is when there are direct translations between languages with little historical contact – translations that do not pass through an intermediate European language, and so facilitate a direct exchange of knowledge between those peoples.

Different heading… I’ve been looking again at David Crystal’s _The Language Revolution_, and in the chapter on the internet he mentions the matter of a “critical mass” of material on the web, at which point it can develop a “vibrant cyber-life” (p. 90). Kasper in a posting to AfrophoneWikis mentions the threshold of 1000 articles. Maybe one goal with the development of small wikis should be to attain critical mass, measured perhaps as 1000 articles. How to get there – what combinations of original and translated articles etc. – is one question that comes to mind.

Gerard raised the topic of Ethnologue’s classification of languages, and the ISO-639-3 standard that gives 3-letter codes to the languages on that basis. This raises another whole level of complexities. First, I think some caution is necessary. Ethnologue is a unique and extremely valuable resource, but the issue of how to distinguish among languages is one that it is clearly on one side of: it “splits” into languages what are felt to have certain kinds of differences. For many localization purposes, we might prefer to “join” tongues (sorry, in English that sounds funny, but “tongue” as a vague generic term lets me avoid “language” and “dialect” which are defined variously) that are close enough for a certain level of intercomprehension, and probably a common set of ICT terminology.

In the case of Wikipedia, this sort of issue will come up a lot with regard to African languages. You already see some discussion about this with regard to Akan, Twi Ashanti, Twi Akuapem, Fanti, etc. Akan is a language but also the name of a cluster including the others. There are slightly different orthographies for these very closely related tongues and an effort to develop a unified orthography. So what to do in Wikipedia? ISO-639-[any #] is not a good guide, unfortunately. What sort of guidelines and framework for development of an Akanophone Wiki-space should be adopted? Or should it be laissez faire? Or…? (These are among the sorts of questions I expect will come up on AfrophoneWikis.)

So if they do split Neapolitan in ISO-639-3, that may be fully justified from a particular methodology and linguistic set of criteria, but it might also be another example of where this coding will not fit with the needs of communication and dissemination of information on Wikipedia. I don’t have any knowledge of this tongue, so pardon me if I’m off.

(There is another language reference worth mentioning – the Linguasphere Register – and several other ISO-639 sets that will permit grouping and accounting for interrelationships, but that’s all another matter …)

Great discussion. Hope this is of interest too.

Don Osborn
Ethan August 11, 2006 at 11:40 am

Thanks, friends. Exploring this issue is helping me discover just how robust and complex the efforts to create Wikipedias in different languages are. I suspect there’s some disconnect between the communities engaged with these issues and those focused primarily on one of the large-language wikipedias. I’m interested to hear folks promoting the one-wikipedia or the meta-wikipedia strategy as well as folks arguing the multiple wikipedias idea. Don, thanks for the observation on Jerusalem/Al-Quds – the fact that we can’t agree on words seems like an interesting argument for the difficulty of NPOV across language.
quixote August 11, 2006 at 4:16 pm

Fascinating post. I’m with the folks who don’t see why it needs to be an “or” issue. Translations between Polish and Punjabi will probably hardly ever be needed. If they are, they can be done as needed. Framing the problem as 347!/2 makes it look much huger than it really is.

However, the issue of keeping translations updated really is a thorny one. This’ll probably only be solved with much better machine translating than what we have now.

I’d like to weigh in on the NPOV issue. Science is always supposed to be done from a neutral point of view, and as a scientist, I’ve been steeped in that tradition. However, try as you will, there is no getting away from your own point of view. There are plenty of philosophy and history of science papers making that point with great depth and precision. The more effective solution to the POV problem is a) to state one’s own assumptions as clearly as possible, and b) to present various POVs side by side. Instead of striving for unattainable neutrality, it would be better to have a policy that major differences are all given a voice. (I know. That kicks the problem up a level because someone has to decide what is “major.” However, that isn’t fundamentally different from having someone decide what is “neutral.” There is no approach that can expunge the human element….)
tarkowski August 11, 2006 at 6:29 pm

Just wanted to add that problems with multi-lingualism further complicate the issue of placing Wikipedia in OLPC laptops (the initial problem seeming to be that the Wikipedia will become frozen at the upload point). The limited number of language versions with many articles might make the Wikipedia not a very useful resource in many countries.
AfroVoltaire August 12, 2006 at 10:40 am

Sir Ethan, bro, I am amazed at the amount of info you are able to share with us, and your command of very varied topics. You will have to share your secret with me, and also how you are able to do all this, and lead your own life, because I am just amazed.

This said, about Wikipedia, as a multi-lingual contributor (English, French, Lingala and Spanish) to wikipedia, I try to systematically ensure that any article I start, or contribute to, in one language has an equivalent, or at the very least a stub, in the others. I see this as necessary (especially in Lingala) to reaching that critical mass Don Osborn was talking about. Why do I do it? For a very selfish purpose: I would like my poorly educated, non-European-language-speaking Congolese Grandma to have access to the Internet, and its knowledge database, before she leaves this world. In broader terms, there are 45% of people in both Congos that are not French-literate, but are generally literate enough in Lingala and/or Kikongo to read the bible, for example. So Wikipedia (and other sites) in these languages could be a crucial empowering tool, despite the fact that the official and school language of these countries is French.

That is why I see Wikipedia as a REVOLUTION!
Cheers!
Zemed Mersha Alemu September 26, 2006 at 9:32 pm

As I was trying to satisfy my curiosity about the blogosphere, questioning its origin as a cyberspace offshoot, my eye was caught by your interesting article; I read it all and tried to comprehend it, and hope I did get the whole message. I am perplexed but not in despair. My original language is Amharic, which also has its own alphabet. I attended and read the Bible in Amharic, which was written before the KJV was even thought of. Now I am colonized by English and now even BLOG B L O G!! Thank you and keep the faith. Just as a P.S.: the word coffee originated from Kaffa, a southern province of the “current” Ethiopia. Currently Ethiopia is divided almost to the point of becoming an economically unviable, forgotten, landlocked country. You are apt to say, “So what?” Yet Ethiopia’s coffee is international. Globalblog Africa should start a fierce campaign. Please help this cause. Thank you again.
Sincerely
Zemed M.Alemu
