What Google Coop Search Doesn’t Do Well
Update: the Google custom search team has retuned their code and fixed most of the problems I outlined here. Very impressive. Here’s a new post on Google’s tweaking of their engine and my gratitude to them for solving this problem.
Like several thousand other geeks out there, I’ve spent a good chunk of this week playing with the latest toy from Google, Co-op Search. The idea behind this new search tool is an excellent one: let users make their own specialized vertical search engines, showing either results only from a selected subset of sites, or prioritizing the results from those sites while searching the whole catalog. The service has all sorts of geeky bells and whistles – you can upload an OPML file to create a catalog, you can weight sites as being good or bad matches for certain terms, you can wrap the whole thing in AJAX and produce your own pretty, customized results.
My friend Nathan pointed me to the tool in response to a question I’d asked him: how do I let users of Global Voices search the thousands of blogs we’ve pointed to in our 18-month existence?
Basically, there are two ways to approach this problem. One is to build your own search engine – decide what sites you want to spider, index them with a tool like KinoSearch and put a CGI interface on your site to let users search. (You can also buy search-in-a-box from companies like Google – the principle is the same: you’re building a custom index of sites you think are important.)
The other approach is to take the output of an existing search engine and filter it, looking only at the sites you’re interested in. Savvy Google users know how to do a search with the “site:” attribute – “ghana site:ethanzuckerman.com” gives you the 309 blog posts on my site that have mentioned Ghana. Yahoo!’s search API lets you restrict a search to as many as thirty different domains, a very powerful feature which the folks behind Rollyo – a company that urges you to “roll your own search engine” – have used as the technical backbone of their company.
But these options don’t work well when you want to give your users the ability to search on thousands of blogs.
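To make the scaling problem concrete, here is a rough sketch of the "filter an existing engine" approach. It assumes a cap of thirty sites per query (the Yahoo! figure mentioned above) and that sites can be combined with OR inside one query string. The function name and the example.org domains are made up for illustration, and this is not any real search API.

```python
def site_restricted_queries(term, domains, max_sites_per_query=30):
    """Split a long domain list into several site-restricted query strings.

    max_sites_per_query is an assumed per-query cap, not a documented limit.
    """
    queries = []
    for start in range(0, len(domains), max_sites_per_query):
        batch = domains[start:start + max_sites_per_query]
        site_clause = " OR ".join(f"site:{d}" for d in batch)
        queries.append(f"{term} ({site_clause})")
    return queries


# Hypothetical list of 3000 blog domains.
domains = [f"blog{i}.example.org" for i in range(3000)]
queries = site_restricted_queries("ghana", domains)
print(len(queries))  # 100 separate queries for a single user search
```

Under those assumptions, one search across 3000 blogs turns into 100 requests whose results still have to be merged and reranked, which is why this approach stops being practical long before you reach thousands of sites.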
Enter Google Coop Search. You can design a search engine that searches across up to 5000 different domains, orders of magnitude more than Yahoo! allowed you to search. (Some good reviews of Coop Search, especially if you’re looking for a more positive review than this one…)
Fantastic – I fired one up immediately, dropped in OPML files from about half of the Global Voices regional editors and had, within half an hour, a search engine that searches almost 3000 global weblogs.
Unfortunately, it doesn’t search them very well. More specifically, the precision is high, but the recall sucks. (Information retrieval systems are usually measured in terms of how well they perform on these metrics. “Precision” means “how relevant to your query were the results you got?” “Recall” means “how complete were your results, out of all available relevant documents?”)
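For anyone who prefers code to definitions, here is a minimal sketch of the two metrics. The document IDs and counts are invented purely to show the shape of the problem: a handful of mostly good results, drawn from a much larger pool of relevant pages.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: what fraction of what we got back was relevant?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of everything relevant did we get back?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: 100 relevant documents exist, the engine returns 3,
# and 2 of those 3 are relevant.
relevant_docs = [f"doc{i}" for i in range(100)]
returned_docs = ["doc0", "doc1", "unrelated"]
precision, recall = precision_recall(returned_docs, relevant_docs)
print(precision, recall)
```

With these made-up numbers, two thirds of the returned results are relevant, but only 2% of the relevant documents were found.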
Search for “ghana” on our little search engine – you get three results: one from Koranteng’s Toli, one from Timbuktu Chronicles and one from my blog. The results from Koranteng and Emeka are good matches for the search – the one from my blog is curiously bad. But what’s really weird is how few there are – as we saw above, a “site:ethanzuckerman.com” search for “ghana” gives you 309 results. You’ll get 234 on Koranteng’s site and 212 on Emeka’s TChron site – so why aren’t we getting 750+ results from our engine?
A little poking solves the mystery pretty quickly. Google Coop Search works by searching against the main Google search catalog, retrieving 1000 results and filtering them against the sites you’ve included in your catalog. This makes sense, computationally – these searches are fast, almost as fast as normal Google searches. Rather than conducting 3000 “site:” searches and collating and reranking the results, Google is sacrificing recall, getting 1000 results and discarding those not in your set of chosen sites, which requires one call to the index and a really big regular expression match.
Search for “Ghana” on Google, preferably with the number of results per page set to 100. After 300 or so results, you’ll find the Koranteng post our little search engine calls up; at about result 600, you’ll find the Timbuktu Chronicles post on Wireless Ghana. (The result on my site is around number 900, which Google won’t let me see with an ordinary search.)
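Here is a small sketch of the behaviour I'm inferring: take the top 1000 results from the main index, then throw away anything that isn't on one of the chosen sites. This is my guess at the mechanism, not Google's actual code. I use a set lookup on the hostname rather than one big regular expression, and the URLs, rank positions and function name are all made up for illustration.

```python
from urllib.parse import urlparse


def filter_top_results(ranked_urls, allowed_domains, cutoff=1000):
    """Keep only results from allowed domains among the first `cutoff` hits."""
    allowed = {d.lower() for d in allowed_domains}
    kept = []
    for url in ranked_urls[:cutoff]:
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host in allowed:
            kept.append(url)
    return kept


# Hypothetical ranking: 2000 results from popular sites, with pages from
# one chosen blog sitting at ranks 300, 600, 900 and 1500.
ranked = [f"http://popular{i}.example.com/page" for i in range(2000)]
for pos in (300, 600, 900, 1500):
    ranked[pos] = f"http://www.myblog.example.org/post{pos}"

hits = filter_top_results(ranked, ["myblog.example.org"])
print(len(hits))  # 3: the page at rank 1500 never makes it past the cutoff
```

The filter is cheap because it needs only one pass over one result list, but anything ranked below the cutoff is invisible no matter how many matching pages the chosen sites contain.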
In other words, the little engine I’ve built is useful only if the sites I’ve chosen are relatively high ranking and authoritative sites on the topics I’m searching on. If I make a search engine of sumo commentary sites and search for “Asashoryu”, the results will be quite good, as those sites probably have several dozen pages that are top matches for the big man. Try it on our engine and you get four results (three from my site…). Alternatively, pick topics where our bloggers are relatively authoritative, and you’ll get better results – try “blogger block”, for instance, and you’ll get 35 sites, either on the blocking of Blogger.com in some countries, or the dreaded disease that seems to strike some bloggers (though not me, so far…).
This doesn’t mean that Coop Search is broken – just that it’s broken for my purposes. Folks will develop lots of interesting search engines, I suspect, using sets of sites that are consistently good matches for the terms they encourage people to search, like my sumo example. But Coop Search isn’t a good solution for authoritative searches on a large set of relatively unpopular blogs, unless one or more of those blogs happen to be very authoritative on the terms you choose. (I could also solve this problem almost immediately by telling Google not to search only my 3000 blogs, but just to prioritize them in the index. But that wasn’t the goal of my experiment.)
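The difference between restricting and prioritizing can be sketched in a few lines. This is only a naive model of "prioritize my sites" (move them to the front, keep everything else), not a description of how Google actually weights results; the URLs and function name are hypothetical.

```python
def prioritize(ranked_urls, is_preferred):
    """Move preferred results to the front without dropping anything.

    sorted() is stable, so results keep their original relative order
    within the preferred group and within the rest.
    """
    return sorted(ranked_urls, key=lambda url: not is_preferred(url))


ranked = [
    "http://a.example.com/1",
    "http://myblog.example.org/1",
    "http://b.example.com/1",
    "http://myblog.example.org/2",
]
reordered = prioritize(ranked, lambda url: "myblog.example.org" in url)
print(reordered)
```

Nothing is discarded in this mode, so the result count stays the same; the chosen sites simply float to the top of whatever the main index returned.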
I’d originally thought that Google might be using Coop Search as a way to identify collections of URLs they might want to spider more deeply – for instance, if I identify 20 great sumo sites, Google might want to visit them more often, or increase their relevancy for searches on sumo. And perhaps they’ll figure out a way to do this without opening themselves up to a huge new vector for spammers to promote their sites. But I suspect the truth is that they saw a way to leap ahead of Yahoo! (destroying Rollyo in the process) and offer a tool that’s going to be great fun for 80% of the people who use it. Unfortunately, the 20% of us who are trying to use Coop Search so we don’t need to go buy our own Google Search Appliance are probably still out of luck.

October 28th, 2006 at 8:35 pm
[...] Ethan Zuckerman has an interesting post critical of Google Co-op, which Don mentioned in class. The search feature allows users to narrow the universe of indexed sites to a range of sites dealing with particular areas. Zuckerman says the results the customized searches provides are very good in returning relevant results, but fails at providing a complete picture of all of the available documents. [...]
October 30th, 2006 at 12:04 pm
Hey Ethan,
I work for the ONE Campaign and I’ve been tasked with looking for blogs that talk about extreme poverty and AIDS. Our Online Organizer here, Ginny Simmons, wants our blog (www.theONEblog.org) to be better connected with, and to help build up, the poverty blogosphere.
We recently put out a new ONE TV Spot. It’d be great if you wanted to post it on your blog. You can find code for the ad here: http://www.youtube.com/watch?v=0G3bNxStYBI&eurl=
The rest of our ONE videos are here: http://www.youtube.com/theonecampaign.
We’d be really interested in hearing from you. Please shoot an email back to me (one@one.org) with any questions, ideas or thoughts.
Thank you so much for all your good work.
Meagan McManus
The ONE Team
November 2nd, 2006 at 8:00 am
Eph News…
This is a shout out to all the technically sophisticated Ephs out there, people like DeWitt Clinton ‘98, Evan Miller ‘06, Ethan Zuckerman ‘93, Eric Smith ‘99 and Stephen O’Grady ‘97. We want to create an “Eph News” feed with……
November 5th, 2006 at 11:40 pm
[...] I was busy with other things, so I’m just now getting around to checking out Google Custom Search Engine (GCSE). I find I’m a bit disappointed after reading where Ethan Zuckerman explains how GCSE is lacking: A little poking solves the mystery pretty quickly. Google Coop Search works by searching against the main Google search catalog, retrieving 1000 results and filtering them against the sites you’ve included in your catalog. This makes sense, computationally – these searches are fast, almost as fast as normal Google searches. Rather than conducting 3000 “site:” searches and collating and reranking the results, Google is sacrificing recall, getting 1000 results and discarding those not in your set of chosen sites, which requires one call to the index and a really big regular expression match. [...]
November 6th, 2006 at 6:41 pm
[...] About a week ago, I wrote a blog post about my experiments with Google Coop Search, complaining that the engine I’d built to search Global Voices blogs retrieved very few results – three for a search on “Ghana”, for instance, leading me to the conclusion that the product was doing little more than retrieving the top 1000 results from Google’s main catalog and searching for sites in the subset of sites I’d included in the engine. [...]
November 12th, 2006 at 11:29 pm
Hello,
Searching among 3000 blogs is great but the only main problem I see is that the content is too old, 1 or 2 days… not like Google Blogsearch.
And searching fresh posts in blogs is what people want.
I made a French Custom Blog Search
http://www.google.com/coop/cse?cx=006257974143066747032%3Ark6-j5ph12e
searching among the 200 most popular French blogs (France and Quebec) when I realized this problem of freshness.
Will Google have the answer to this?
thanks,
Vince.
November 14th, 2006 at 8:57 am
[...] The article from …My heart’s in Accra » What Google Coop Search Doesn’t Do Well has, in addition to the Google Co-op review, information about competing products. [...]
November 19th, 2006 at 11:57 am
My Google Co-op example:
You Search for unprotected live webcam streams found through a variety of clever search techniques done with the Google Co-op custom search engine tool.
http://www.camhacker.com
December 8th, 2006 at 6:20 pm
[...] We’ve just added a very cool feature to Global Voices – the ability to search through 4,800 of the weblogs our editors consult most frequently in putting together their roundup of blogs from around the world. The backend for this search technology is Google’s Co-op search, which I complained about in this blog post, then raved about in this post – in between the two posts, Google used some of our feedback to tune their algorithms and produce results that are much better suited to blogs than to the sites they’d used as their initial testbed. [...]
January 3rd, 2007 at 3:47 pm
[...] I recently announced the release of my own CSE — SweetSearch — that is a comprehensive and authoritative search engine for all topics related to the semantic Web and Web 2.0. Like Ethan Zuckerman who published his experience in creating a CSE for Ghana in late October, I too have had some issues. Ethan’s first post was entitled, “What Google Coop Search Doesn’t Do Well,” posted on October 27. Yet, by November 6, the Google Co-op team had responded sufficiently that Ethan was able to post a thankful update, “Google Fixes My Custom Search Problems.” I’m hoping some of my own issues get a similarly quick response. [...]
January 26th, 2007 at 12:19 pm
[...] Last week, Ethan Zuckerman wrote a great article explaining why Custom Search Engines would be so useful for communities such as the Global Voices network. He built a CSE to search over 3000 blogs from across the world. This is exactly the kind of application we built our platform for — not just because of the scale of his search engine, but also the cause it serves and its collaborative approach. Unfortunately, as he explains in his article, queries on his search engine for some of the terms he’s interested in didn’t work very well. [...]
July 21st, 2008 at 1:50 pm
[...] ideas? DeWitt’s involvement in Open Search is clearly relevant while Ethan’s recent work is too complex for me to follow, much less implement. [...]