Update: The interface to search the PeopleFinder database is up. According to our friends at Salesforce.com, there are now 87,000 records in the database, all entered by hand.
Jeff Jarvis, who’s done an excellent job of blogging various Katrina recovery efforts, sees an opportunity for a dialog about reactions to future natural (or, god forbid, manmade) disasters – he’s calling the idea Recovery 2.0.
I think Jeff’s on the right track here, although I think we’re probably at Recovery 0.2a in software terms rather than 2.0 – we’re a long way away from a 1.0 response from the web community that we could all be happy with. I hope folks will take time to document the work they did to help out with Katrina, and that we’ll keep developing and refining these tools after the immediate need has passed. Unfortunately we all understand that we’ll eventually face another disaster of one sort or another.
In that spirit, I wanted to offer some reflections on the small part of the Katrina PeopleFinder project I’ve been involved with. My basic conclusion: we got an amazing amount done in a very short time with very, very bad tools. If we’re lucky enough to get the same sort of response from the web community, and we take the time to build some better tools, we’ll be able to tackle huge data entry challenges the next time around.
Timeline, as I saw it. Apologies to folks who are misidentified, or not identified.
- Friday afternoon, David Geilhufe starts organizing geeks to start “screen scraping” databases and bulletin boards with information about hurricane survivors. Some time that evening, David and others develop PFIF – the PeopleFinder Interchange Format, a spec and XML format for missing and found person information.
- Saturday morning, David sends an email to some of the “usual suspects” in the activist technology world, asking for assistance in organizing a part of the PeopleFinder project: manual entry of data from “unstructured” sources, like bulletin boards, blog comments, etc. Other teams are working on importing data from structured databases and building the database where all this data will live – Zach Rosen from CivicspaceLabs is leading the structured data entry team.
I find Jon Lebkowsky in the #globalvoices channel on irc.freenode.net – we commandeer the channel as headquarters for the project. Jon agrees to take on the human element of the project – volunteer management; I take on the technical part – breaking bulletin boards into chunks and assigning them to users.
- We set up a wiki on the GlobalVoices Wiki and start assigning chunks of databases in a truly brain-dead, stupid fashion. After a few hours, we move the wiki to Katrinahelp.info, to clear up namespace confusion.
We rapidly figure out that assigning people a page of bulletin board results isn’t going to work, as the posts on each page change as new posts are added to the system. A pair of Craigslist geeks solve the problem on their site, by creating HTML pages with the contents of 25 Craigslist posts on each page – they place them on a constant URL so we can index the pages easily for the wiki.
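As a rough illustration of the chunking idea (this is not the Craigslist code, which isn't shown here), the sketch below splits a list of posts into fixed groups of 25 and gives each group a stable filename. The names `chunk_posts`, `render_chunk` and the `chunk-0001.html` naming scheme are assumptions made for the example.

```python
from html import escape


def chunk_posts(posts, size=25):
    """Split a list of post texts into consecutive groups of `size` posts."""
    return [posts[i:i + size] for i in range(0, len(posts), size)]


def render_chunk(index, chunk):
    """Return a (filename, html) pair for one chunk.

    The filename depends only on the chunk index, so the page keeps the
    same address even as new posts arrive on the original board.
    """
    items = "\n".join(f"<li>{escape(text)}</li>" for text in chunk)
    filename = f"chunk-{index:04d}.html"  # hypothetical naming scheme
    return filename, f"<html><body><ol>\n{items}\n</ol></body></html>"
```

Because each page is a snapshot, a volunteer who claims chunk 1 sees the same 25 posts an hour later, which is exactly what the live bulletin board pages could not guarantee.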
Nate Kurz comes up with a clever hack to index posts on bulletin boards that use sequential post IDs. I write an ugly perl script using his hack to generate assignment pages that have links to bulletin board posts.
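The original script was Perl and isn't reproduced here. The sketch below only shows the general shape of the idea under some assumptions: the board exposes posts at a URL containing a sequential ID, and the `url_template`, the function name and the page titles are all hypothetical. It emits MediaWiki-style bullet lists of external links, 25 per assignment page.

```python
def wiki_assignment_pages(url_template, first_id, last_id, per_page=25):
    """Build (title, wikitext) pairs covering post IDs first_id..last_id.

    url_template must contain a {post_id} placeholder (an assumption about
    how the board addresses individual posts).
    """
    pages = []
    for start in range(first_id, last_id + 1, per_page):
        end = min(start + per_page - 1, last_id)
        lines = [
            f"* [{url_template.format(post_id=i)} Post {i}]"
            for i in range(start, end + 1)
        ]
        pages.append((f"Posts {start}-{end}", "\n".join(lines)))
    return pages
```

For example, `wiki_assignment_pages("http://board.example/post?id={post_id}", 1000, 1059)` yields three pages, the last one holding the ten leftover posts.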
- Over Saturday night, a few volunteers check in and start entering data, primarily from the Craigslist pages. Sunday morning, we post links to several new data sets, using the technique Nate and I have developed. A small cadre of volunteers starts entering data… and promoting the data entry effort on their blogs.
A-List blogs start promoting the effort and we’re quickly swamped by volunteers. The wiki slows to a crawl, and we’ve got countless edit conflicts as new wiki users discover what happens when two people try to edit the same page at once.
The database used to collect information from volunteers is crashing under the load – its load average is between 35 and 50, perhaps ten times what it should be. There’s a hasty decision made to take down the database and stop data entry until we can put data into a more robust database.
- During the data entry downtime, the team running the wiki reconfigures it to handle a greater load. Nick Branstator, a developer who’s already helped develop Air America Radio’s Katrina Voicemail for VoodooVox, comes over to my house and we start developing tools to scrape bulletin boards that don’t have sequentially numbered posts. He scrapes two large boards before heading home after dinner.
Building on Nick’s model, Steven Skoczen, Intelliseek’s Matt Hurst and other programmers scrape another dozen bulletin boards overnight. By Monday morning, we’ve got thousands of 25-post chunks ready to be entered into the database.
- The new database, hosted by Salesforce.com, is up by 10pm EDT Sunday night, and volunteers enter data through the night. By 4am, there are 7,000 records. When I log on at 8am on Labor Day, there are 12,000.
Volunteers pour in through Labor Day and by 9pm, we’ve reached the 50,000 record mark. I also realize I’ve reached the burnout point and tell David that I need to hand off my part of the project. David quickly finds volunteers to take over my role, including Paul Schreiber, Deborah Finn and others. A little more than 48 hours after clocking in to the project, I’ve clocked out.
With absolutely no figures to back up this statement, I’m guessing we’d readied about 90% of the known bulletin board posts for assignment by midnight last night. The vast majority of the data entry work is done… but the PR machine is just kicking into gear. As of 6am, the PeopleFinder Volunteer page on the wiki is the tenth-most linked page according to Daypop, and folks on the team are starting to get phone calls from the press.
The project’s not done – more data will keep coming in as more refugees get online access and can post information about their whereabouts. And the key part of the system – an interface to the data in the database – is still missing. But a group of loosely organized people did an amazing job of tackling a huge data entry problem in roughly 36 hours.
People want to help.
None of us were prepared for the volunteer turnout – indeed, the willingness of people to help us out brought our system to its knees more than once. Midday Sunday, I recognized the tags of most of the people claiming chunks from the wiki – many were friends from my LiveJournal community. By the time the database melted down, people I knew were in the minority.
Basically, hundreds of people saw the requests for help on BoingBoing, Metafilter or elsewhere and pitched in. In many cases, it was the first time a volunteer had encountered a wiki… but people coped with the new technology remarkably well.
I got dozens of emails thanking me for an opportunity to help out. I suspect a huge number of people were sitting at home in front of the TV this weekend, feeling helpless, and were grateful for something they could do, above and beyond writing a check, that made them feel hopeful.
Sometimes code is the solution. Sometimes 2,000 loosely organized people are the solution.
I got a dozen emails or blog comments from people asking – basically – why we were being luddites and having people enter data into forms instead of writing scripts to do the data entry automatically. I responded to some of these by asking people to look at five of the entries on a bulletin board and get back to me if they still thought scripts were a good idea.
Here’s the problem. A typical message board post looked something like this:
My father, Joe, was working in New Orleans and hadn’t evacuated – he was living in Jefferson Parish. We don’t know if he’s okay. Please call me or Mom in Houston – Lisa Brown, Houston, TX.
To parse that post automatically, a script needs to figure out that “My father, Joe” is probably named “Joe Brown” and that “We don’t know if he’s okay” means he should be marked as “missing” in the database. While it’s very simple for a human to draw those conclusions, programming a computer to draw them is a major artificial intelligence challenge.
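To make the difficulty concrete, here is a deliberately naive sketch, not anything the project actually ran. It treats any two adjacent capitalized words as a person's name, a heuristic invented for this example, and applies it to the sample post above.

```python
import re

POST = (
    "My father, Joe, was working in New Orleans and hadn't evacuated - "
    "he was living in Jefferson Parish. We don't know if he's okay. "
    "Please call me or Mom in Houston - Lisa Brown, Houston, TX."
)

# Naive heuristic (an assumption for illustration): a "name" is any pair of
# adjacent capitalized words separated by a single space.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def guess_names(text):
    """Return every substring that looks like 'Firstname Lastname'."""
    return NAME_PATTERN.findall(text)
```

On this post the heuristic returns two place names and the person who wrote the message. The missing man, "Joe Brown", never appears as a string in the text at all, so no pattern match can find him; a reader has to infer the surname from the daughter's signature.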
Computer programmers are naturally inclined to solve problems with code. That’s because we’re lazy – not lazy in the bad, won’t-get-out-of-bed sense of the word, but in the good, avoid-boring-repetitive-tasks-at-all-costs type of lazy. This is usually a good thing – most people don’t like boring, repetitive work, and it costs money to hire people to do even the most mind-numbing jobs.
But when 2,000 people show up and ask for something to do, it’s a great idea to take advantage of their generosity. Estimating that it took roughly two minutes to enter each name into the database, volunteers donated roughly 2,250 hours of time over the past 48 hours to do data entry. That’s an $11,600 in-kind contribution, valuing people’s time at US minimum wage.
Could a talented programmer solve the unstructured data parsing problem in 120 hours at $100 an hour? Possibly. Probably not. And 1,999 other people wouldn’t have had the chance to help out and feel good about doing their part.
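The back-of-the-envelope numbers can be checked in a few lines. The hours, minutes per record and programmer rate come from the text above; the $5.15 hourly figure is an assumption (the US federal minimum wage at the time), since the post doesn't state the rate it used.

```python
MINUTES_PER_RECORD = 2       # estimate from the post
VOLUNTEER_HOURS = 2250       # estimate from the post
MINIMUM_WAGE = 5.15          # assumption: US federal minimum wage in 2005
PROGRAMMER_HOURS = 120       # from the post
PROGRAMMER_RATE = 100        # dollars per hour, from the post


def implied_records(hours, minutes_per_record):
    """Number of records that fit in the given volunteer hours."""
    return int(hours * 60 / minutes_per_record)


def in_kind_value(hours, hourly_wage):
    """Dollar value of donated time, rounded to the nearest $100."""
    return round(hours * hourly_wage, -2)


records = implied_records(VOLUNTEER_HOURS, MINUTES_PER_RECORD)
volunteer_value = in_kind_value(VOLUNTEER_HOURS, MINIMUM_WAGE)
programmer_cost = PROGRAMMER_HOURS * PROGRAMMER_RATE
```

Under these assumptions the 2,250 hours correspond to about 67,500 records, the donated time rounds to $11,600, and the hypothetical programmer would cost $12,000, so the two options are priced about the same.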
Simple tools work surprisingly well.
Wikis are spam-prone, hard for beginners to use, subject to arcane problems (edit conflicts) and make it too easy to create long, complex, unreadable pages (some as bad as my blog posts…).
Despite those flaws, they work surprisingly well as ad hoc workflow management systems.
In a perfect world, I would sit down with a couple of good developers and develop a workflow management system for the next time we need to get a thousand volunteers together to enter some data. It would have a simple, web-based interface that logged users in, assigned them a task, nagged them via email until they completed it, and gave administrators a comprehensive view of what was and wasn’t assigned.
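A minimal in-memory sketch of the core of such a system might look like the following. Nothing like this was built for the project; the class and method names, the four-hour nag threshold, and the idea of keying tasks by chunk URL are all assumptions, and a real version would still need logins, a web front end and actual email delivery.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Task:
    chunk_url: str
    assignee: Optional[str] = None
    assigned_at: Optional[datetime] = None
    done: bool = False


class AssignmentBoard:
    def __init__(self, chunk_urls):
        self.tasks = [Task(url) for url in chunk_urls]

    def assign(self, user, now):
        """Hand the next unassigned chunk to `user`, or None if none remain."""
        for task in self.tasks:
            if task.assignee is None:
                task.assignee, task.assigned_at = user, now
                return task
        return None

    def complete(self, chunk_url):
        """Mark the chunk with this URL as finished."""
        for task in self.tasks:
            if task.chunk_url == chunk_url:
                task.done = True

    def needs_nagging(self, now, after=timedelta(hours=4)):
        """Assigned but unfinished tasks older than `after` (reminder candidates)."""
        return [
            t for t in self.tasks
            if t.assignee and not t.done and now - t.assigned_at > after
        ]

    def summary(self):
        """Administrator view: counts of unassigned, in-progress and done chunks."""
        unassigned = sum(t.assignee is None for t in self.tasks)
        done = sum(t.done for t in self.tasks)
        return {
            "unassigned": unassigned,
            "in_progress": len(self.tasks) - unassigned - done,
            "done": done,
        }
```

Because each chunk is handed to exactly one person by the system, this design sidesteps the edit conflicts that come from volunteers claiming work by editing a shared page.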
But I don’t think it’s a burning need, because MediaWiki – once it was tuned to handle the load – worked pretty damned well. Some key things:
- Assignment pages need to be small. Huge ones lead to edit conflicts. Lots of small pages are better than a few big ones.
- Wikis where you need to log in to edit might turn some users away, but they do a nice job of preventing spam and make it very easy to track users who are having problems.
- It’s a good idea to put someone – or multiple people – in charge of wiki gardening early in a project.
It’s not just your tools that need to be robust. You’re dependent on everyone else’s tools as well.
As we turned hundreds of volunteers onto message boards to read and index posts, those boards – predictably – crashed. Wondering why so many users were accessing posts by post ID, two board sysadmins changed their indexing scheme to block that sort of access. That broke our data entry process midstream and forced numerous volunteers to abandon their work.
We made two big mistakes here. One was that we didn’t properly respect the sysadmins running these message boards. We should have let them know what we were planning to do and throttled our volunteer force so that we didn’t swamp their systems.
More critically, my post ID hack is a stupid way to solve this problem – a well-written scraper handles it better, because then the data lives on servers the volunteer team controls, not servers that are overloaded with people posting missing people information. Next time, lots of scrapers, no URL hacks.
Many of the people who are working on chunks of the PeopleFinder project are people who’ve known each other for years – sometimes in person, sometimes virtually – and trust each other a great deal. Most of the people I reached out to for help on coding problems are people I’ve known and worked with for over a decade. Many of the first volunteers who started entering data into the system – and who debugged our first data entry problems – are part of an extended LiveJournal community.
Basically, when net people try to solve a problem, they bring their posse with them. For me, one of the lessons of the weekend was discovering what a powerful force my posse can be, and how effective the network of posses around the net can be.
Around 4pm yesterday I realized two things: 1) despite the fact that we’d entered 80% of the existing data, data was still going to be generated and we might be entering data for another month to come; 2) I’m scheduled to give two talks in the next ten days and to submit an academic paper. Next time I get involved with one of these projects, I’m going to rope someone in early in the process so that I can hand my tasks off to her when, inevitably, I have to return to normal life. She, in turn, would find someone to shadow her, and so on.
Because you’re going to burn out, use generic email addresses that can be redirected. Roughly a thousand volunteers around the world now have my gmail address and I’m going to be redirecting email for weeks to come, because I’m an idiot. Don’t be an idiot – set up email@example.com before you do anything else.
We’re all going to learn more as we share the lessons of the online relief effort over the next couple of months – I hope these notes are useful to someone else as they think about how to build the set of tools we need to cope better with the next emergency. And a thank you, from the bottom of my heart, to everyone who pitched in on this project, whether you wrote code, entered data or promoted it. During a dark time, it’s a wonderful reminder just how many people want to do the right thing and lend a hand.