If you want to know what people around the world are thinking and feeling, you need help from a translator. Recent events in Iran are a reminder that the internet and citizen media aren’t enough to give us access to events throughout the world – we need tools and strategies for bridging language gaps as well, or we limit ourselves to only the voices we can understand.
For those of us who think the Internet is a powerful tool for international understanding, language is a challenge we need to confront, a complex set of problems we need to address. I just had the chance to join a small band of people dedicated to solving these problems, joining in the Open Translation Tools summit, held this week in Amsterdam. I came away hopeful, sobered by the size and complexity of the problems, but thrilled that such a smart, creative and global group was willing to take on these challenges.
The internet has been polyglot since early days, but the rise of read/write technologies has brought issues of linguistic diversity to the fore. In our experience with Global Voices, we saw lots of people blogging in English as a second language until there were lots of their fellow speakers online… then we saw lots more bloggers in local languages. Once you’ve got an audience that speaks your language, it makes sense to blog, twitter or otherwise publish in that language. It’s extremely difficult to accurately estimate how many people are blogging in Chinese – figures from companies like Spinn3r or Technorati aren’t counting most of the China-hosted blogging platforms. The number is somewhere between enormous and freaking huge, and people who want to know what Chinese netizens are thinking better hope we figure out how to clone Roland Soong sometime soon. (Roland and the EastSouthWestNorth blog are so important to English/Chinese dialog that I know of several folks who refer to plans for massive Chinese/English translation as “the distributed Roland Soong problem”.)
Other languages are moving online as a way to ensure their survival in a digital age. The 27,000+ articles in the Lëtzebuergesch Wikipedia don’t reflect the size of the language (spoken by roughly 390,000 people in Luxembourg) but the passion of that community to ensure the language exists in the 21st century. While Jay Walker may predict the rise of English as the globe’s second language, I’m predicting that the internet will make it easier to document, share and keep alive the world’s linguistic diversity. (They’re not incompatible ideas, BTW, though I still think Jay’s overstating the trend.)
In other words, every single day, there’s more content online in languages you don’t speak, and you can read a smaller percentage of the internet. It’s not just a matter of learning Chinese, though that would be a great first step. We’re seeing content in Tagalog, in Malagasy, in Hindi, and it’s not clear how we’re going to read, index, search, amplify and understand all of it.
The folks at the Open Translation Tools summit (OTT09) have been working on this problem for a long time. Allen Gunn – “Gunner” to anyone who knows him – characterized the participants as toolbuilders, translators, and publishers. But the common ground is that the people represented at the gathering are pioneers, people who’ve pushed the boundaries to ensure that languages can be present online, and that we can translate between them.
Some of the folks in the crowd, like Javier Solá, can claim credit for bringing whole languages online. (That Solá, a Spaniard, can claim that credit for Khmer is its own wonderful story.) Dwayne Bailey, who’s done excellent work bringing African languages online through his project, translate.org.za, reminded the crowd of the painstaking steps necessary to bring a language online: one or more fonts to represent the character set, a keyboard map to allow text entry, appropriate Unicode representations, support for the language within software like OpenOffice, the creation of utilities like spellcheckers. Internationalization is now part of virtually any open source project, but it still tends to be an afterthought, and several groups at the summit were focused on the painstaking work necessary to bring Indian, Central Asian and African languages online for the first time.
Thanks in part to the Global Voices tendency to occupy other people’s conferences – we don’t have an office, so we simply send a dozen people to cool conferences and hold our meetings before or after – publishers were probably the best represented group at the meeting. Many of the projects I most admire were represented, including Meedan, which bridges between Arabic and English speakers via translation, and Yeeyan, which translates English-language content into Chinese. It’s interesting to see the different models emerging around social translation. Meedan translates everything, first with machine translation, and then with volunteer human translators, to make English/Arabic conversation seamless. Yeeyan invites readers to suggest English-language content they think Chinese readers would benefit from reading – Jiamin Zhao, who leads their Beijing team, says this hasn’t been very popular with their users, and that much of the translation happens around large, established projects like the translation of The Guardian. And Global Voices just lets anything go – each language team gets to pick what content they want to translate and what tools they want to use.
Some of the publishers are toolbuilders as well. Ed Zad showed off dotsub’s lovely platform for subtitling and translating online video. While dotsub hosts thousands of subtitled videos, many of us know it better as the toolkit underlying TED’s ambitious open translation project. This model of hosting subtitled and translated videos for third parties is a major part of dotsub’s business model – Ed showed us subtitled videos from the US Army, allowing the Army to meet legal obligations to make all their content available to the hearing impaired, at lower cost, since dotsub’s tools are far more efficient than other technologies available.
Meedan offers a beautiful set of tools to allow volunteer translators to turn machine translations into more readable, human translations, and is working closely with Brian McConnell’s WorldWide Lexicon, which focuses on giving publishers a great deal of control over how their site is translated while embracing the model of social translation. I was excited to get a peek at Traduxio, which is focusing on translating cultural texts, like Balzac and Chekhov, and building complex translation memories in the process.
One of the central questions at the meeting was whether toolbuilders were building the right tools for translators to use. A number of projects focused on building open source translation memories. These are tools that keep track of how a translator has rendered a particular word or phrase in the past and prompt her with past translations in a new document. Many professional translators use Trados, which is apparently one of those tools that’s industry standard, though not well-loved. (One of the odd quirks of the translation industry, Ed Zad tells us, is that translation clients own the contents of these translation memories, not the translators.) It’s not clear whether social translation projects are really using translation memories. We’ve talked about the subject a great deal within Global Voices, but none of our translation teams is using one… perhaps because they’re not aware of open source ones available, perhaps because few of those open source ones are very good, or perhaps because it’s not how they’re used to working. Jiamin from Yeeyan made the same confession – perhaps because we’re working with volunteers who are translating, rather than translators who are volunteering their time, there’s not much push from within our communities for translation memory tools.
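The core idea behind a translation memory is simple enough to sketch: store past (source, translation) pairs, and when a new sentence arrives, offer the stored translation whose source is the closest fuzzy match. Here’s a minimal illustration in Python – the sentences, the class, and the 0.7 match threshold are all invented for this example, and real tools like Trados do far more sophisticated segment matching:

```python
from difflib import SequenceMatcher

class TranslationMemory:
    """A toy translation memory: past segment pairs plus fuzzy lookup."""

    def __init__(self):
        self.pairs = []  # list of (source, translation) segments

    def add(self, source, translation):
        self.pairs.append((source, translation))

    def suggest(self, sentence, threshold=0.7):
        # Find the stored source segment most similar to the new sentence;
        # only offer its translation if similarity clears the threshold.
        best_score, best_translation = 0.0, None
        for source, translation in self.pairs:
            score = SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
            if score > best_score:
                best_score, best_translation = score, translation
        if best_score >= threshold:
            return best_translation, best_score
        return None, best_score

tm = TranslationMemory()
tm.add("The meeting starts at noon.", "La réunion commence à midi.")
tm.add("Thanks for your help.", "Merci de votre aide.")

# A near-duplicate sentence triggers the stored French translation
# as a suggestion for the translator to adapt.
suggestion, score = tm.suggest("The meeting starts at ten.")
print(suggestion)
```

The point of the threshold is that a translation memory should stay silent on genuinely new material rather than offer misleading matches – which is also why these tools help most on repetitive material (manuals, legal text) and least on the conversational prose volunteers tend to translate.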
There might be more traction for tools that helped with translation workflow. Professional translators tend to be closely project-managed, and work in teams, with a translator, an editor and a proofreader. Most of the social translation models use less complex systems – an editor usually reviews a translated text in a Global Voices community, for instance, but the system isn’t as formalized. And there seemed to be great demand for tools that matched potential readers of texts with translators, systems that could allow readers to flag a text they wanted to read in another language or show translators potential readership for a particular text. I moderated a session on “demand” which generated a wide range of ideas, from seeking data from Google Translate on what documents were most requested by users to creating Firefox plugins that automatically translated texts and allowed readers to request human-translated versions. My Global Voices comrades were exploring a set of ideas about rewarding translators, with recognition, with karma ratings that might translate into professional translation work, with micropayments for translations – all these ideas require new tools and working methods.
Google wasn’t present at the conference, but was the unspoken presence in almost every session. While there was widespread agreement that Google’s machine translation tools were far from perfect – and sometimes farcically bad – they’ve been getting lots better and some participants wondered whether we should be putting the effort into building new social translation systems if they’re going to obviate all our work in a few years. Personally, I think it’s a bad mistake to stop work because we think Google might be working on the same issues.
The languages where Google is good are ones where we’ve got huge corpora – sets of documents that exist in two or more languages, which have been “aligned” by algorithms so that it’s possible to see how one phrase has been translated into another. A corpus like the Europarl corpus – which contains millions of aligned sentences in eleven languages, taken from human translations of European parliament proceedings – can make it fairly easy to build these tools… though one wonders if they’re better at translating bureaucratic memos than casual conversations. (Another major corpus, the Acquis Communautaire, offers the whole body of EU law in 23 languages. Sounds like a blast to read.) These statistical machine translation methods get stronger as we get more aligned documents available.
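To see why aligned sentences are such valuable raw material, consider a deliberately tiny sketch: given a few English/German sentence pairs, simply counting which target words co-occur with each source word already yields crude translation evidence. The sentence pairs below are invented, and real statistical MT systems use far more sophisticated alignment models than this co-occurrence count:

```python
from collections import Counter, defaultdict

# Toy "aligned corpus": each pair is a human-translated sentence.
aligned = [
    ("the house is small", "das haus ist klein"),
    ("the house is big", "das haus ist gross"),
    ("the book is small", "das buch ist klein"),
]

# Count how often each target word appears alongside each source word,
# and how often each target word appears overall.
cooc = defaultdict(Counter)
tgt_count = Counter()
for src, tgt in aligned:
    for t in tgt.split():
        tgt_count[t] += 1
    for s in src.split():
        for t in tgt.split():
            cooc[s][t] += 1

def likely_translation(word):
    # Score candidates by co-occurrence rate: of all the times the target
    # word appears, what fraction is alongside this source word? Break
    # ties in favor of more frequently co-occurring words.
    cands = cooc[word]
    if not cands:
        return None
    return max(cands, key=lambda t: (cands[t] / tgt_count[t], cands[t]))

print(likely_translation("house"))  # "haus"
print(likely_translation("small"))  # "klein"
```

Even this trivial count pulls “haus” out as the likely translation of “house”, because the two words rise and fall together across sentence pairs. The weakness is equally visible: with only three sentences the evidence is paper-thin, which is exactly why these methods improve as the corpus of aligned documents grows.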
But some languages don’t have large corpora available – I don’t know where we’re going to find a large set of English/Malagasy translations, for instance. In these cases, rule-based machine translation might work better – one of our participants, who studies rule-based systems, argues that they’ve proved their utility in translating between closely related languages like Spanish and Catalan. They parse sentences into parts of speech, or into more complex intermediate representations, then translate word by word, restructuring the sentences into grammatically correct forms. Our friend pointed to a study he’d helped conduct which saw these rule-based systems doubling the efficiency of human translators from 3000 words a day to 6000 words, in closely-related languages.
My sense is that the most exciting potential in the near future may be to use social translation to create corpora that could benefit statistical machine translation. That probably means ensuring that Google – admired and feared at gatherings like this one – has a seat at the table in a future discussion.
It’s a long path from the discussions in Amsterdam to a system that allows me to stumble upon a blogpost in Persian and request (and perhaps offer a bounty for) a translation. But those conversations have to start somewhere, and it was a pleasure to have a ringside seat for them in Amsterdam.
One of the projects taking place around the OTT summit is a “book sprint”, a five-day project to write a book that outlines the state of the art in open source translation systems. If that sounds crazy… well, it is, but not as nuts as you think. My friend Tomas Krag pioneered the model a few years back with a brilliant book on wireless networking in the developing world, and it’s been adopted by the fine folks at FLOSS Manuals. I’ll link when the book is available… which should be about three days from now!
You can read notes on each of the sessions on the OTT wiki – it’s a great summary of the discussions that took place.