Confession: I am, at best, a mediocre computer programmer. I spent most of the computer science courses I took in college sitting in the back of the room, working on breaking the encryption scheme that protected the college’s copy of PageMaker so I could have a copy to use for my graphic design business. At Tripod, my utility was not as a coder, but as someone who could run interference between the businesspeople and the techies.
But my current set of research interests has forced me to buy a pile of O’Reilly books, fire up BBEdit and bang out piles of inelegant but functional Perl. The experience is a humbling one. I know a couple of master programmers and have read enough of their code to know how inexpert my own is. I often have the experience of knowing there’s a better way to do something, but lacking the programming chops to execute it correctly.
I was feeling roughly as stupid as I usually do as I attempted to write a program this morning that scraped headlines from CNN’s website. My intent was to write a CNN version of the program I wrote a few weeks ago which scrapes headlines from the New York Times and checks Technorati to see whether or not they’ve been blogged. Unlike the NYT, which has a text-only version with predictable layout, CNN’s HTML is ugly, unpredictable and nigh-unscrapeable. As the regular expression I was using to match URLs and headlines grew to fill an entire computer screen, I found myself thinking: “Geez, I wish sites would just put their content up in predictable XML formats so that I could just search for a tag that said ‘Headline’ and get the current headlines.”
And then the voice of Dave Winer spoke to me, and said, “Uh, you mean like an RSS feed?” (I’m serious. It sounded like he was in the next room. Yes, I’m sober, and not listening to any of Dave’s podcasts.)
Uh, yeah, Dave. Like that.
So I’m now writing feedreaders for BBC, the Washington Post and the Guardian, which, along with the New York Times, will give me data on four of the five most blogged mainstream news sources. I’m somehow unsurprised that CNN doesn’t have its own feed, though people braver than I are running scrapers that turn CNN’s pages into RSS. (I may write a tool that uses one of these feeds, but at the moment I’m so annoyed with CNN, I’m not going to bother.)
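For anyone wondering why a feed beats a screen-filling regular expression, here is a minimal sketch of the headline-grabbing part of a feedreader. It is written in Python rather than Perl, and it is not the code behind this project: the sample feed, the example.com URLs and the function name extract_headlines are all made up for illustration. It assumes a plain RSS 2.0 feed where each story is an <item> with a <title> and a <link>.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed for illustration; a real reader would fetch this over HTTP.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <item>
      <title>First headline</title>
      <link>http://example.com/stories/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/stories/2</link>
    </item>
  </channel>
</rss>"""


def extract_headlines(rss_text):
    """Return a list of (headline, url) pairs, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    headlines = []
    # Every story in an RSS 2.0 feed lives in its own <item> element.
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        if title:  # skip items that carry no headline
            headlines.append((title, link))
    return headlines


if __name__ == "__main__":
    for title, link in extract_headlines(SAMPLE_RSS):
        print(title, link)
```

The whole job is "find every item, read its title and link", with no guessing about page layout. One caveat: feeds in the older RSS 1.0 (RDF) flavor put their elements in an XML namespace, so this sketch would need namespace-aware lookups to read those.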
If all goes well, I’ll have a page up in a couple of days with daily results from these sources. That assumes the voice of Dave doesn’t come back. My current top priority: a new tinfoil hat.