My current favorite tool on the web is Overture’s Keyword Selector Tool. (Many thanks to Ben Edelman and David Pennock for independently introducing me to it and ruining my productivity for the month.) Designed to help purchasers of ad keywords select appropriately targeted search terms, it returns a set of “associated terms” for any term you offer. Search for “red sox” and you discover more precise searches for “red sox curse”, “red sox ticket” and “red sox nation”… not to mention 10,068 searches last month for “red sox suck”. (Fellow members of Red Sox Nation will be pleased to note that there were 30,527 searches for “yankees suck” in the same month. And while there were 8,031 searches for “yankee hater”, there were only 619 for “red sox hater”.)
So, while Overture may have built the KST to assist their customers, it appears to be custom designed for internet sociologists, so we can see what the web really thinks about a given search term. I think of it as Freudian psychotherapy for the web – we ask the Internet to free associate and attempt to make sense of the results.
What do web users think about Brazil? That’s an easy one: we want to visit, and we’d like to see some naked women. (No word on whether that’s why we want to visit.) The top 20 search terms associated with “Brazil” include “travel”, “tour”, “hotel”, “vacation”, “carnival”, “visa” and “beaches”… and also “girl”, “woman”, “sex”, “porn”, as well as “Mike”, who appears to be visiting Brazilian beaches with a camera, meeting lots of naked women.
The Canadians, on the other hand, can keep their clothes on. We want their drugs – “pharmacy” and “drug” both rank in the top five search terms – and we’re thinking about moving there. “canada immigration” is the 12th associated search term, followed closely by “jobs in canada”. (The keyword selector tool offers some help there as well – if you can’t get a job in a Canadian pharmacy, consider Sears or Wal Mart, both of which rank in the top 20 associated search terms.) (The Overture data set is updated monthly, so these results reflect October searches – I look forward to seeing how these figures change in November.)
Having fun yet? I was, so I built a little tool to automate querying of the KST. The program, which I christened OverCluster (less because it produces clusters of search terms from Overture and more because it sounds like the sort of company that got VC funding in the late 1990s) accepts a list of search terms, calculates how each subsidiary term compares to the main term (i.e., what percent of people searched for “green bay packers” rather than “green bay”), and creates clusters of subsidiary terms. For instance, when you feed OverCluster a list of search terms representing the 187 nations I routinely monitor media for, you discover that nine of them are often queried with the subsidiary term “safari” – the resulting “safari cluster” is Botswana, Namibia, South Africa, Tanzania, Kenya, Zambia, Zimbabwe, Uganda and Malawi.
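For the curious, the clustering step described above can be sketched in a few lines. This is a minimal Python illustration, not the actual Perl code: the function names, the `depth` and `min_size` parameters, the shape of the input data and the sample counts are all assumptions made up for the example, and it skips the part where the tool queries Overture.

```python
from collections import defaultdict


def subsidiary(main, phrase):
    """Return the words of `phrase` that are not part of the main term.

    Hypothetical helper: "green bay packers" with main term "green bay"
    yields "packers".
    """
    main_words = main.lower().split()
    rest = [w for w in phrase.lower().split() if w not in main_words]
    return " ".join(rest)


def build_clusters(results, depth=20, min_size=2):
    """Group main terms by the subsidiary terms they share.

    `results` is an assumed input shape: it maps each main term to a pair
    (main_count, related), where `related` is a list of (phrase, count)
    tuples sorted by count, highest first. Only the top `depth` phrases
    are considered, and clusters with fewer than `min_size` members are
    dropped.
    """
    clusters = defaultdict(list)
    for main, (main_count, related) in results.items():
        for phrase, count in related[:depth]:
            sub = subsidiary(main, phrase)
            if not sub or main_count == 0:
                continue
            # Searches for the longer phrase per 100 searches for the main term.
            share = 100.0 * count / main_count
            clusters[sub].append((main, count, share))
    # Within each cluster, list members by share, highest first.
    return {
        sub: sorted(members, key=lambda m: m[2], reverse=True)
        for sub, members in clusters.items()
        if len(members) >= min_size
    }


# Placeholder counts, invented for illustration only (not Overture data).
sample = {
    "kenya": (1000, [("kenya safari", 120), ("kenya news", 80)]),
    "tanzania": (500, [("tanzania safari", 90), ("tanzania map", 40)]),
    "belarus": (800, [("belarus tractor", 160)]),
}

clusters = build_clusters(sample)
for sub, members in clusters.items():
    for main, count, share in members:
        print(f"{sub}: {main} ({count} searches, {share:.1f} per 100)")
```

With the placeholder data, only “safari” survives as a cluster, because it is the only subsidiary term shared by at least two main terms. Keeping both the raw count and the per-100 share for each member makes it possible to rank a cluster either way.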
Looking to share the fun with y’all, I’ve run three data sets – my usual 187 nations; the 50 US states and 13 Canadian provinces and territories; and the 87 cities in the US and Canada with populations over 250,000. I ran each twice, once looking at the top 40 subsidiary terms, once looking at only the top 20. The results are here:
I’m also releasing the source code for OverCluster under the GNU General Public License. It has its own project page on a Berkman Center website, along with the three search term files I’ve used and all sorts of caveats, warnings and apologies. The quick version of those warnings – it’s buggy research code, and you need to be comfortable with customizing Perl and installing Perl modules from CPAN to have a prayer of using it successfully. (For my friends who’ve requested that I release the GAP tools under GPL – they’re coming. This was an experiment in building a tool for release under GPL – OverCluster is an order of magnitude simpler than GAP, so I thought I’d learn by releasing it first.)
For those of you not rushing to download my new toy, allow me to share some of my favorite results thus far:
We’re not especially interested in what’s currently going on in Madagascar (madagascar news: 408 searches). But the not-yet-released Ben Stiller/Chris Rock animated film is already gathering interest (madagascar movie: 2520 searches). And Birkenstock’s Madagascar sandals netted 2944 searches – and three of the top five subsidiary term slots. We hear they’re good for stepping on Madagascar’s hissing cockroaches (522 searches, 11th subsidiary term).
For those who thought Belarus just produced wacky neo-Stalinist dictators – it produces tractors, too! Proudly manufactured by Minsk Tractor Works (MTZ), Belarus tractors come in 30 different models, and are the subject of 1593 web searches, Belarus’s #2 secondary term. Belarus is one of seven nations in the adoption cluster (13, if you search 40 matches deep instead of 20), alongside Kazakhstan, Guatemala, Ukraine, Ethiopia, Azerbaijan and Moldova. Belarus is also part of the bride “near cluster”, along with Moldova, Russia, Ukraine and Latvia (where the search is for “latvia and bride”).
When I visited Rwanda in 2002, government ministers were trying to “rebrand” the country, hoping people would start to associate the nation with mountain gorillas, tea, and pyrethrum, an insecticide produced from chrysanthemum flowers. So far, it’s not working. “Genocide” (1st), “genocide in” (2nd), “genocide picture” (5th), “massacre” (7th), “genocide 1994” (9th), “1994” (12th), and “picture of civil war” (13th) dominate Rwanda searches, as does PBS’s Ghosts of Rwanda documentary (“ghost of” (6th), “ghost of pbs” (10th)). “Gorilla”, “tea” and “pyrethrum” don’t make the list.
“Genocide” is a small cluster – Rwanda, Sudan, Bosnia, Burundi, Armenia, Somalia and Cambodia. It’s the same size as “civil war” – Sudan, Sierra Leone, Somalia, Rwanda, Liberia, El Salvador and Ivory Coast – but smaller than “war” – Viet Nam, Iraq, Bosnia, Afghanistan, Yugoslavia, Somalia, Rwanda, Sudan, Sierra Leone, North Korea, Liberia and Ivory Coast – and “war in” (eight nations, all mentioned under “war”). The war in Ivory Coast wasn’t the subject of a whole lot of interest last month – 39 searches for “Ivory Coast war” and 36 for “Ivory Coast civil war”. Let’s see what bombing some Frenchmen does for search engine traffic, shall we?
On the subject of war – the second most popular secondary term for “Poland”? “you forgot poland”. But it’s not just President Bush who can put a nation on the map. Secondary results #2 and #3 for “Burma” (which I’d been using instead of “Myanmar”, but will now switch to the nation’s official name) – “shave”, and “mission of”. Nice to know that 1019 people a month are searching for my favorite once-defunct, now thriving punk band.
The only possible way to end a post titled “The Freudian Web” is with a close look at “sex”. (“mother” doesn’t appear in any of our clusters.) With 40 nations, it’s one of our larger clusters. “iran sex” ranks highest in percentage terms – for every 100 searches for “iran”, there are roughly 7 for “iran sex”. But Japan is a close second in percentage terms, and first by a landslide in absolute terms, with 21,358 searches last month for “Japan sex”. Japan finishes second to Brazil in percentage terms on searches for “girl”, beating out Ukraine, which places third. Ukraine is ranked first in searches for “woman” – 8.29 searches for “Ukraine woman” for every 100 searches for “Ukraine”. But Brazil is first in absolute terms – 6,873 searches for “Brazil woman”. The clusters for “escort”, “miss” and “porn” do little to clear up the confusion. Clearly, we as a web are torn.
If you think of any clever sets of terms you’d like to feed to OverCluster, or if you get any interesting results running the script yourself, please let me know. Stand up and be clustered!