My Heart's in Accra

Ethan Zuckerman's musings on Africa, international development
and hacking the media.

06/30/2009 (5:29 pm)

The Open Translation Manual

Filed under: Geekery,Global Voices,ideas,Media ::

In a post last week about the Open Translation Tools summit in Amsterdam, I mentioned a “book sprint” that was working to put together a book on Open Translation.
Well, they did it. It was released today, and it’s a damned fine piece of work. (I say that independent of the fact that they used my Polyglot Internet essay as the introduction to the book!)

In five days, a team led by the indefatigable Adam Hyde put together the definitive starting point for people who want to learn what Open Translation is, what tools open translation communities use, what models are working for translation communities, and what the unsolved problems are in the field. The book includes case studies of notable translation communities, including Global Voices, Meedan and Wikipedia, as well as extensive lists of tools useful for localization and translation. It’s available, for free, as both a website and a printable PDF. It will also be published as a paper book, and will continue to evolve as a project you can register for and contribute to. (It’s licensed under the GPL version 2.)

As with earlier book sprints, the project demonstrates that it’s possible to make a good stab at a guide to a field of work if you’ve got the right people willing to assemble in a room for five days. The first book sprint was instigated by my dear friend Tomas Krag, who got sick of spending all his time on the road in developing nations teaching people about wireless networking. He knew he’d never write a book by himself, so he held a book sprint, based on the idea of a code sprint, at the annual gathering of the developing world wireless community. Participants spent a long, difficult day arguing over the structure of the book, then went to their respective corners to write, edit, repurpose and recycle content from around the web into a comprehensive guide. The model worked well enough that Adam Hyde from FLOSS Manuals adopted it and has used it as a strategy for building new books around conferences.

I’m off to the Aspen Ideas Festival tomorrow, which looks exciting, celebrity-studded, and worth my careful blogging. But I seriously doubt that a team of smart and crazy people will get a useful book out of it, at least not in five days.

06/30/2009 (12:03 pm)

links for 2009-06-30

Filed under: del.icio.us links ::

06/29/2009 (12:03 pm)

links for 2009-06-29

Filed under: del.icio.us links ::

06/27/2009 (12:02 pm)

links for 2009-06-27

Filed under: del.icio.us links ::

06/26/2009 (4:54 pm)

Notes and reflections from the Open Translation Tools Summit 2009

If you want to know what people around the world are thinking and feeling, you need help from a translator. Recent events in Iran are a reminder that the internet and citizen media aren’t enough to give us access to events throughout the world – we need tools and strategies for bridging language gaps as well, or we limit ourselves to only the voices we can understand.

For those of us who think the Internet is a powerful tool for international understanding, language is a challenge we need to confront, a complex set of problems we need to address. I just had the chance to join a small band of people dedicated to solving these problems, joining in the Open Translation Tools summit, held this week in Amsterdam. I came away hopeful, sobered by the size and complexity of the problems, but thrilled that such a smart, creative and global group was willing to take on these challenges.

The internet has been polyglot since early days, but the rise of read/write technologies has brought issues of linguistic diversity to the fore. In our experience with Global Voices, we saw lots of people blogging in English as a second language until there were lots of their fellow speakers online… then we saw lots more bloggers in local languages. Once you’ve got an audience that speaks your language, it makes sense to blog, twitter or otherwise publish in that language. It’s extremely difficult to accurately estimate how many people are blogging in Chinese – figures from companies like Spinn3r or Technorati aren’t counting most of the China-hosted blogging platforms. The number is somewhere between enormous and freaking huge, and people who want to know what Chinese netizens are thinking had better hope we figure out how to clone Roland Soong sometime soon. (Roland and the EastSouthWestNorth blog are so important to English/Chinese dialog that I know of several folks who refer to plans for massive Chinese/English translation as “the distributed Roland Soong problem”.)

Other languages are moving online as a way to ensure their survival in a digital age. The 27,000+ articles in the Lëtzebuergesch wikipedia don’t reflect the size of the language (spoken by roughly 390,000 people in Luxembourg) but the passion of that community to ensure the language exists in the 21st century. While Jay Walker may predict the rise of English as the globe’s second language, I’m predicting that the internet will make it easier to document, share and keep alive the world’s linguistic diversity. (They’re not incompatible ideas, BTW, though I still think Jay’s overstating the trend.)

In other words, every single day, there’s more content online in languages you don’t speak, and you can read a smaller percentage of the internet. It’s not just a matter of learning Chinese, though that would be a great first step. We’re seeing content in Tagalog, in Malagasy, in Hindi, and it’s not clear how we’re going to read, index, search, amplify and understand all of it.

The folks at the Open Translation Tools summit (OTT09) have been working on this problem for a long time. Allen Gunn – “Gunner” to anyone who knows him – characterized the participants as toolbuilders, translators, and publishers. But the common ground is that the people represented at the gathering are pioneers, people who’ve pushed the boundaries to ensure that languages can be present online, and that we can translate between them.

Some of the folks in the crowd, like Javier Solá, can claim credit for bringing whole languages online. (That Solá, a Spaniard, can claim that credit for Khmer is its own wonderful story.) Dwayne Bailey, who’s done excellent work bringing African languages online through his project, translate.org.za, reminded the crowd of the painstaking steps necessary to bring a language online: one or more fonts to represent the character set, a keyboard map to allow text entry, appropriate Unicode representations, support for the language within software like OpenOffice, the creation of utilities like spellcheckers. Internationalization is now part of virtually any open source project, but it still tends to be an afterthought, and several groups at the summit were focused on the painstaking work necessary to bring Indian, Central Asian and African languages online for the first time.

Thanks in part to the Global Voices tendency to occupy other people’s conferences – we don’t have an office, so we simply send a dozen people to cool conferences and hold our meetings before or after – publishers were probably the best represented group at the meeting. Many of the projects I most admire were represented, including Meedan, which bridges between Arabic and English speakers via translation, and Yeeyan, which translates English-language content into Chinese. It’s interesting to see the different models emerging around social translation. Meedan translates everything, first with machine translation, and then with volunteer human translators, to make English/Arabic conversation seamless. Yeeyan invites readers to suggest English-language content they think Chinese readers would benefit from reading – Jiamin Zhao, who leads their Beijing team, says this hasn’t been very popular with their users, and that much of the translation happens around large, established projects like the translation of The Guardian. And Global Voices just lets anything go – each language team gets to pick what content they want to translate and what tools they want to use.

Some of the publishers are toolbuilders as well. Ed Zad showed off dotsub’s lovely platform for subtitling and translating online video. While dotsub hosts thousands of subtitled videos, many of us know it better as the toolkit underlying TED’s ambitious open translation project. This model of hosting subtitled and translated videos for third parties is a major part of dotsub’s business model – Ed shows us subtitled videos from the US Army, allowing the Army to meet legal obligations to make all their content available to the hearing impaired, at lower costs as dotsub’s tools are far more efficient than other technologies available.

Meedan offers a beautiful set of tools to allow volunteer translators to turn machine translations into more readable, human translations, and is working closely with Brian McConnell’s WorldWide Lexicon, which focuses on giving publishers a great deal of control over how their site is translated while embracing the model of social translation. I was excited to get a peek at Traduxio, which is focusing on translating cultural texts, like Balzac and Tchekhov, and building complex translation memories in the process.

One of the central questions at the meeting was whether toolbuilders were building the right tools for translators to use. A number of projects focused on building open source translation memories. These are tools that keep track of how a translator has rendered a particular word or phrase in the past and prompt her with past translations in a new document. Many professional translators use Trados, which is apparently one of those tools that’s the industry standard without being well-loved. (One of the odd quirks of the translation industry, Ed Zad tells us, is that translation clients own the contents of these translation memories, not the translators.) It’s not clear whether social translation projects are really using translation memories. We’ve talked about the subject a great deal within Global Voices, but none of our translation teams is using one… perhaps because they’re not aware of open source ones available, perhaps because few of those open source ones are very good, or perhaps because it’s not how they’re used to working. Jiamin from Yeeyan made the same confession – perhaps because we’re working with volunteers who are translating, rather than translators who are volunteering their time, there’s not much push from within our communities for translation memory tools.
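For readers who haven’t seen one, here’s a minimal sketch of the translation memory idea in Python. It isn’t how Trados or any of the open source tools work internally; the class name, method names and the 0.8 similarity cutoff are all my own assumptions, and it uses only the standard library’s difflib for fuzzy matching.

```python
import difflib


class TranslationMemory:
    """Toy translation memory (hypothetical, for illustration only)."""

    def __init__(self):
        # Maps a source-language segment to the translations used for it so far.
        self._segments = {}

    def remember(self, source, target):
        """Record that `source` was rendered as `target`."""
        translations = self._segments.setdefault(source, [])
        if target not in translations:
            translations.append(target)

    def suggest(self, source, cutoff=0.8):
        """Return (past_segment, past_translation) pairs for similar segments.

        `cutoff` is an assumed similarity threshold between 0 and 1.
        """
        matches = difflib.get_close_matches(
            source, list(self._segments), n=3, cutoff=cutoff
        )
        return [(m, t) for m in matches for t in self._segments[m]]


tm = TranslationMemory()
tm.remember("Good morning", "Bonjour")
# A slightly different segment in a new document still prompts the old rendering.
print(tm.suggest("Good morning!"))
```

A real tool would segment documents into sentences, store the memory on disk and rank suggestions by match quality, but the core loop is the same: remember each rendering, then prompt the translator when something similar shows up again.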

There might be more traction for tools that helped with translation workflow. Professional translators tend to be closely project-managed, and work in teams, with a translator, an editor and a proofreader. Most of the social translation models use less complex systems – an editor usually reviews a translated text in a Global Voices community, for instance, but the system isn’t as formalized. And there seemed to be great demand for tools that matched potential readers of texts with translators, systems that could allow readers to flag a text they wanted to read in another language or show translators potential readership for a particular text. I moderated a session on “demand” which generated a wide range of ideas, from seeking data from Google Translate on what documents were most requested by users to creating Firefox plugins that automatically translated texts and allowed readers to request human-translated versions. My Global Voices comrades were exploring a set of ideas about rewarding translators, with recognition, with karma ratings that might translate into professional translation work, with micropayments for translations – all these ideas require new tools and working methods.

Google wasn’t present at the conference, but was the unspoken presence in almost every session. While there was widespread agreement that Google’s machine translation tools were far from perfect – and sometimes farcically bad – they’ve been getting lots better and some participants wondered whether we should be putting the effort into building new social translation systems if they’re going to obviate all our work in a few years. Personally, I think it’s a bad mistake to stop work because we think Google might be working on the same issues.

The languages where Google is good are ones where we’ve got huge corpora – sets of documents that exist in two or more languages, which have been “aligned” by algorithms so that it’s possible to see how one phrase has been translated into another. A corpus like the Europarl Corpora – which contains millions of aligned sentences in eleven languages, taken from human translations of European parliament proceedings – can make it fairly easy to build these tools… though one wonders if they’re better at translating bureaucratic memos than casual conversations. (Another major corpus, the Acquis Communautaire, offers the whole body of EU law in 23 languages. Sounds like a blast to read.) These statistical machine translation methods get stronger as we get more aligned documents available.

But some languages don’t have large corpora available – I don’t know where we’re going to find a large set of English/Malagasy translations, for instance. In these cases, rule-based machine translation might work better – one of our participants, who studies rule-based systems, argues that they’ve proved their utility in translating between closely related languages like Spanish and Catalan. They parse sentences into parts of speech, or into more complex intermediate representations, then translate word by word, restructuring the sentences into grammatically correct forms. Our friend pointed to a study he’d helped conduct which saw these rule-based systems doubling the efficiency of human translators from 3000 words a day to 6000 words, in closely-related languages.

My sense is that the most exciting potential in the near future may be to use social translation to create corpora that could benefit statistical machine translation. That probably means ensuring that Google – admired and feared at gatherings like this one – has a seat at the table in a future discussion.

It’s a long path from the discussions in Amsterdam to a system that allows me to stumble upon a blogpost in Persian and request (and perhaps offer a bounty for) a translation. But those conversations have to start somewhere, and it was a pleasure to have a ringside seat for them in Amsterdam.


One of the projects taking place around the OTT summit is a “book sprint”, a five-day project to write a book that outlines the state of the art in open source translation systems. If that sounds crazy… well, it is, but not as nuts as you think. My friend Tomas Krag pioneered the model a few years back with a brilliant book on wireless networking in the developing world, and it’s been adopted by the fine folks at FLOSS Manuals. I’ll link when the book is available… which should be about three days from now!


You can read notes on each of the sessions on the OTT wiki – it’s a great summary of the discussions that took place.

06/26/2009 (12:03 pm)

links for 2009-06-26

Filed under: del.icio.us links ::

06/25/2009 (10:47 pm)

Twitter and the news cycle, perfect together

Filed under: Africa ::

It’s nice to be listened to. I guess. Maybe. Though I now find myself wondering whether I wouldn’t be better off shutting up.

I saw the first reports of Michael Jackson’s death on Twitter around 6pm. I ran a little script I threw together some weeks ago called “twitcent” to see just how many tweets would share the news. Twitcent takes advantage of the fact that Twitter gives a unique, sequential ID to each tweet to estimate the intensity of posting around certain terms. It retrieves a page of 100 search results for a particular search term – say “Michael Jackson” – and looks at the ID numbers of the first and last tweets listed. Take the difference of those numbers, and you get how many tweets were posted between search result #1 and #100. Divide, and you’ve got a percentage of tweets on the system in a discrete, small interval mentioning the term.
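The arithmetic is simple enough to sketch in a few lines of Python. This is not the actual twitcent script, just a minimal reconstruction of the estimate described above; the function name and the sample IDs are made up, and it assumes the tweet IDs for one page of search results have already been fetched somehow.

```python
def estimate_share(result_ids):
    """Estimate what percentage of all tweets matched a search term.

    `result_ids` holds the tweet IDs from one page of search results.
    Assumes IDs are handed out sequentially across the whole system, so the
    gap between the newest and oldest ID approximates the total number of
    tweets posted in that interval.
    """
    if len(result_ids) < 2:
        raise ValueError("need at least two search results")
    span = max(result_ids) - min(result_ids)
    if span <= 0:
        raise ValueError("tweet IDs must be distinct")
    # Matching tweets divided by all tweets in the interval, as a percentage.
    return 100.0 * len(result_ids) / span


# Hypothetical IDs: three matching tweets spread across 1,000 sequential IDs.
print(estimate_share([2000, 2500, 3000]))
```

With a full page of 100 results, a span of about 670 IDs between the first and last result would correspond to roughly 15% of all tweets.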

Is it accurate? I dunno. If my assumptions are right, it should be – if Twitter’s not always numbering sequentially, or if some large percent of tweets on the system are unsearchable, less so. Anyway, I ran several search terms through the engine and saw something I’d never seen before – search terms registering in double digit percentages, and the term “Michael Jackson” appearing in 13 – 20% of the tweets.

So I tweeted the following: “My twitter search script sees roughly 15% of all posts on Twitter mentioning Michael Jackson. Never saw Iran or swine flu reach over 5%” And then I went to make dinner.

When I got back online this evening, the tweet had been quoted in Wired News, the New York Times Bits blog, Washington Post’s mocoNews, and in the San Jose Mercury News.

Geez, think these guys read each other much? I’m flattered, I think. But worried that I’m now going to be quoted for the next several days as an “expert” on Michael Jackson twittering, especially as the NYTimes piece identifies me as a Berkman Center researcher.

Of course, by the time I’d gotten back online, the initial fervor had died down – here’s what my script turns up now:

2.152 % Michael Jackson
2.634 % jackson
2.242 % michael
0.312 % micheal
1.596 % MJ
0.119 % #MichaelJackson

That’s a lot of tweets, but now in the neighborhood of a busy swineflu day or the heart of Twitter’s interest in the Iran protests. What was interesting to me was the way the information flashed across Twitter, briefly bringing on the failwhale for some users – with one in seven or so tweets mentioning the death, it’s interesting to wonder whether people saw themselves as spreading the news, or as simply expressing shock, surprise, or their personal reaction. (And yes, I tweeted an update that the term was now down to roughly 3%. That one hasn’t gotten retweeted…)

What’s really interesting to me is the extent to which news reporters seem to have chosen Twitter as the go-to source for reactions to news events. It makes sense – there’s a premium in the news business on speed, on having a story faster than anyone else does, so the need for the quick quote makes Google hours too slow to help you. And the 140 character limit guarantees that whoever you quote will be pithy and limited to a single soundbite.

This, in turn, also increases the chance that you’ll be wrong. A proper quote from me would probably have been something like: “The search string ‘Michael Jackson’ is getting intense interest on Twitter at the moment, showing up in between 13-20% of tweets. It’s unlikely this level of intensity will continue through the night, but at the moment, it exceeds the intensity I’ve seen on Twitter during slower-breaking stories like #swineflu, #pman and #IranElection.” That, unfortunately, is 337 characters – far too long for anyone to read anymore. And a clarification in the form of a blogpost? That’s so 2006.

06/23/2009 (12:04 pm)

links for 2009-06-23

Filed under: del.icio.us links ::

06/18/2009 (10:44 am)

Chris Csikszentmihályi and a complex vision of citizen media

Filed under: Blogs and bloggers,ideas,Media ::

Chris Csikszentmihályi opens the morning’s session at MIT’s Knight News Challenge conference with an overview of his view of the world – “It’s my view from MIT – MIT wouldn’t endorse it, they’ve been quite specific about that,” he quips, a reference to the university’s unfortunate decision not to grant him tenure. Chris is now focusing on managing the Center for Future Civic Media, and outlines one of the most exciting projects, ExtrAct. The project calls attention to the process of natural gas extraction via fracturing, a process that exposes millions of rural Americans to incredibly toxic chemicals. ExtrAct tries not just to document the practices of fracturing, but to help rural, poor, highly disconnected people organize, get media attention and fight some of the harmful effects of these practices.

What do we, as a society want, Chris wonders. A free and just society. Journalism, openness and transparency and democracy have all emerged as means to that end. Technology, leveraged correctly, can sometimes be a means to that end. Sometimes technology is the enemy of a free and just society. Alan Kay famously said, “the best way to predict the future is to invent it.” Some scholars have suggested that tools control what we can do. Yochai Benkler proposes that it’s not just about the tools, but about how we use them. Bruno Latour suggests that “technology is society made durable.”

Last night’s talk, Chris summarizes, was the “rending of garments” about the death of the daily newspaper. He points out that newspapers put another group out of work, “people so dedicated to their work that they took oaths of celibacy.” (He resists the inevitable geek puns.) The press put the monks out of work. But technology isn’t evenly distributed – head to a city in the developing world and you’ll find scribes, often organized around the post office so they can help illiterate people write letters. (I’ve seen scribes in cybercafes in Kigali…) The Media Lab, Chris tells us, makes its money from fear, taking funds from sponsors who are slowly going out of business, like the recording industry. The implication, I think, is that documenting these changes – and demonstrating their inevitability – is a useful service for helping corporations accept and cope with this change.

To frame the ideas of user innovation and open source software, Chris shows us how “diff” and patching works – the ability to compare two files on a computer system and send the changes between the two. This is the fundamental idea behind the improvability of open source software, and underlies versioning systems like Subversion and Mercurial.
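If you’ve never seen the idea in action, here’s a small sketch using Python’s standard difflib module rather than the Unix diff and patch tools themselves. The function names and the sample text are my own, purely for illustration: one side computes a line-by-line delta between two versions of a file, and the other side rebuilds either version from that delta.

```python
import difflib


def make_delta(old_lines, new_lines):
    """Compute a line-by-line delta between two versions of a text."""
    return list(difflib.ndiff(old_lines, new_lines))


def apply_delta(delta, which=2):
    """Rebuild version 1 (old) or version 2 (new) from an ndiff delta."""
    return list(difflib.restore(delta, which))


old = [
    "The press put the monks out of work.\n",
    "Technology is evenly distributed.\n",
]
new = [
    "The press put the monks out of work.\n",
    "Technology isn't evenly distributed.\n",
    "Scribes still work near the post office.\n",
]

delta = make_delta(old, new)
# Lines starting with '-' were removed, '+' were added, '  ' are unchanged.
print("".join(delta))
```

Version control systems store and exchange changes in their own formats, but the principle is the same: ship the difference, not the whole file.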

User-driven innovation, as described by Eric von Hippel, involves motives other than making a profit – users who improve products often just want a specific functionality available to the world. They don’t need to sell it, just to have it be usable. Open source projects are political spaces – they’re like community organizing projects. They need to be optimized to allow lightweight participation and contribution. He shows the structure of Linux versus Mozilla – as Mozilla moved from a commercial product into a community one, the structure had to change so that people could add code without having to learn about thousands of dependencies.

What tools allow uprisings to take place? Chris is interested in SMS and its role in organizing protests in places like the Philippines. “Governments would love it if these tools weren’t around” – that’s why they shut down SMS during elections. But other tools end up being useful, even if they’re less obvious. Planespotting websites allowed researchers to break the CIA torture flights story – the data was never intended to study torture, but it proved useful for another, critical purpose. This leads Chris to emphasize the importance of laws and practices that ensure an open and free press in a digital age. This might mean supporting OpenStreetMap instead of Google Maps, so the maps are reusable and reproducible. It might mean supporting edge figures like Richard Stallman – whom Chris analogizes to Reverend Elijah Lovejoy, killed in the early 1800s for his support in print for abolition.

Chris closes his talk with remarks on Jean-Jacques Rousseau, who wrote not just political philosophy but “bodice-ripper novels”. These novels allowed individuals to “live in the skin of others”, experience the empathy that comes from living for a while as a servant or a noble. The daily paper, he believes, can give a sense of community empathy, the ability to live another’s experience through storytelling. That’s something we need to preserve and cultivate as we move into a digital future.

06/18/2009 (9:23 am)

Iran, citizen media and media attention

It’s been an interesting few days for people who study social media. As the protests over election results have continued in Iran, and Iranian authorities have prevented most mainstream journalists from reporting on events, there’s been a great deal of focus on social media tools, which have become very important for sharing events on the ground in Iran with audiences around the world. I, like many of my friends at the Berkman Center and Global Voices, have spent much of the past two days on the phone with reporters, fielding questions about:

- Whether social media is enabling, causing or otherwise driving the protests in Iran
- How Iranian users are managing to access the internet despite widespread filtering
- The ethics (and practice) of distributed denial of service attacks as a form of information warfare
- Whether such online activities are unprecedented

Rather than tell you what I and colleagues have been saying to reporters, I’ll point you to one of the better stories, by Anne-Marie Corley in MIT’s Technology Review – she interviews several of my Berkman and Open Net Initiative colleagues and outlines the argument many of us are making:

- Social media is probably more important as a tool to share the protests with the rest of the world than it is as an organizing tool on the ground.
- Iranians have been accessing social networking sites and blogging platforms despite years of filtering – there’s a cadre of folks who understand how to get around these blocks and are probably teaching others.
- Because so many Iranians use social media tools – often to talk about topics other than politics – they’re a “latent community” that can come to life and have political influence when events on the ground dictate.

Gaurav Mishra rounds up dozens of blog and MSM articles and offers an excellent overview of arguments around these questions (with a strong dose of his own interpretation, much of which I share.) He references Evgeny Morozov, who’s got a thorough denunciation of DDOS as a strategy for protest, correctly pointing out that it mostly functions to make participants feel better about themselves by giving them a way to feel involved with the protests. Unfortunately, unlike positive online gestures of solidarity (retweeting reports from Iran, turning Twitter or Facebook pictures green), this one does little more than piss off sysadmins, help Iranian authorities make the case that forces outside Iran are “attacking the country”, and encourage user-driven censorship as a response to unwanted speech.

So, given the wealth of commentary on the questions above by folks smarter than me, let me weigh in on some of the questions I haven’t heard asked.

Biases and social media – One of the reasons MSM outlets are so focused on social media is that they’re not able to deploy reporters to cover these protests. In some cases, the majority of reporting from the ground is coming from social media. It’s worth asking what the biases might be in amplifying those social media reports. Ahmedinejad’s supporters tend to be poorer, more rural, less educated and more likely to speak Farsi than Mousavi’s supporters – a picture of the protests via social media runs the danger of overstating Mousavi support or minimizing Ahmedinejad support. We’ve been trying to counterbalance this a bit at Global Voices – Hamid Tehrani, our Iran editor, did a brief roundup last night of bloggers supporting Ahmedinejad. It’s worth noting that the posts he quotes are all in Farsi: language may well be a barrier that is influencing coverage as well, if voices for reform are easily quoted in English and voices for the status quo are in Farsi.

My friend and colleague David Sasaki reminded GV editors that bloggers had predicted a Rafsanjani victory in 2005, and suffered their “Howard Dean” moment when it became clear that their candidate had little support outside the most liberal bloggers. That’s a very different situation than what’s happening now – the hundreds of thousands of people in the streets point to profound support for Mousavi – but reminds us that the online voices from Iran, especially the English-speaking ones, probably aren’t representative of mainstream opinion.

An Iran story, not a social media story – Iran is one of the countries American and British media pay closest attention to. The use of social media for protest – especially to promote a protest to international audiences – is far from unique. But because there’s such strong media focus on Iran, and such interest in the use of social media for protest, this is a perfect storm for interest in this topic.

I’ve been asking some of the reporters I’ve spoken with where they were on other recent social media and protest stories. Citizen media has emerged as one of the key spaces for journalism in Fiji in the wake of a coup government that’s censoring mainstream media. It’s been a key source of information in Madagascar as that country’s suffered through a violent change of government. (One reporter who I mentioned this to remarked that Madagascar was “just a speck of an island somewhere”. That speck is twice the size of Great Britain and has the population of Australia…) In Guatemala, online media publicized the assassination of a lawyer by forces close to the president… and government authorities began arresting people for twittering the story to amplify it. These weren’t huge stories for most newspapers – the Iran story is huge not because of the social media aspect, but because protests in Iran are a huge story independent of citizen media.

Flock – I’ve written at some length about homophily, the tendency of birds of a feather to flock together. Turns out that reporters flock, too. It’s somewhat amazing to me the extent to which reporters from really good newspapers are all asking the same questions. I’m glad that people are taking a close look at the phenomenon of social media in the Iranian protests – it’s an important, fascinating and worthwhile topic. But there’s a lot of topics out there, and I wonder whether we benefit from a thousand well-researched stories on this phenomenon rather than a hundred, and nine hundred other stories.
