Understanding how Google.cn filters

The anti-censorship community has been hard at work today trying to figure out just how Google’s new Chinese search engine prevents access to controversial content. Nart Villeneuve, pretty much the smartest guy out there working on Internet censorship, offered a post yesterday with some early insights into how sites are being blocked.

One of the observations Nart made: Google.cn is working from a blacklist of URLs, possibly provided by Chinese authorities, possibly generated by following traffic to search results and adding domains that the firewall consistently blocks. This blacklist includes activist sites, news sites, homepage hosting sites and forum sites.

With some knowledge of Google hackery, you can make some guesses about just how Google’s removing results from searches. Go to google.cn and enter a search for “site:hrw.org”. This should return results from Human Rights Watch’s website. Instead, Google.cn yields a page that tells us that no pages can be found to match our query, and includes a prominent notice that there are results which are not displayed, due to local laws and policies. Try again with a search for “inurl:hrw.org”, which should display all pages that include the string “hrw.org” in their URLs. Google.cn returns a page that lists 58,400 results… and provides links to only two of them, neither of which is a page on the hrw.org site. Again, the page ends with a prominent notice about the censorship of results. This suggests that there are many pages in the catalog with hrw.org URLs, but that somewhere between retrieving results from the catalog and presenting them to the user, Google is checking against a blacklist and eliminating most results.
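If you want to run the same pair of probes against other domains, the queries are easy to generate. Here's a minimal Python sketch. The `/search?q=` URL pattern and the helper name `probe_urls` are my assumptions for illustration, and the code only builds the URLs; it doesn't fetch anything.

```python
from urllib.parse import urlencode

# Assumed search endpoint; adjust if the real URL pattern differs.
SEARCH_BASE = "http://www.google.cn/search"


def probe_urls(domain):
    """Return the site: and inurl: query URLs for one domain (hypothetical helper)."""
    return {
        operator: SEARCH_BASE + "?" + urlencode({"q": f"{operator}:{domain}"})
        for operator in ("site", "inurl")
    }


if __name__ == "__main__":
    # Print both probe URLs for the domain discussed above.
    for operator, url in probe_urls("hrw.org").items():
        print(operator, url)
```

Comparing what the two URLs return for the same domain is what exposes the gap between the count of matching pages and the links actually shown.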

This bit of hackery complicates the result I got last night searching for “太石村” (Taishi Village). I noted the absence of Wikipedia results and concluded that this was because Wikipedia wasn’t indexed by Google.cn. The actual situation turns out to be far more interesting.

Search for “site:wikipedia.org” on Google.cn and you get 17 million results, including links to the English, Vietnamese, Spanish and Chinese-language Wikipedias. Search for “inurl:wikipedia.org” and you get 17,700,000. This makes sense: roughly 700,000 URLs outside the wikipedia.org domain might include the string “wikipedia.org”, especially if they’re mirroring Wikipedia content.

Now search for “site:wikipedia.org 太石村”, that is, Taishi on a wikipedia.org page. You get an error page that tells you that no results are available and that you should use fewer, or more common, search terms. Ditto for “inurl:wikipedia.org 太石村”. But here’s what’s really interesting: even if you perform the searches with the leftmost radio button (Search the Web) selected, the search that actually gets executed has the rightmost radio button (Search Chinese Web pages) selected.

This suggests that for some controversial keywords, Google.cn forces a search against a catalog of pages hosted in China rather than a search against the whole web. Some quick experiments:

法轮功 (falun gong) – Forces a Chinese page search, returns 866,000 results
法轮功 inurl:wikipedia.org – Forces a Chinese page search, returns no results, error page

太石村 (taishi) – Forces a Chinese page search, returns 11,400 results
太石村 inurl:wikipedia.org – Forces a Chinese page search, returns no results, error page

西藏 (tibet) – Allows a full web search, returns 17,100,000 results beginning with tibetonline.net
西藏 inurl:wikipedia.org – Allows a full web search, returns 2,400 results including Chinese-language Wikipedia pages hosted on Wikipedia’s US servers

民主 (democracy) also gives you full-web results. “falun gong” written in English forces a Chinese page search. “taishi” written in English doesn’t force a Chinese page search, though the Chinese string does.
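To keep these experiments straight as the list grows, it helps to record them as data. Here's a rough Python sketch that stores the observations above and groups the plain keywords by which search scope they triggered. The tuple layout and the function name `group_by_scope` are my own choices for illustration; a count of `None` means I didn't note the number, and 0 stands for the "no results" error page.

```python
from collections import defaultdict

# (query, forced_chinese_only, result_count), taken from the experiments above.
OBSERVATIONS = [
    ("法轮功", True, 866_000),
    ("法轮功 inurl:wikipedia.org", True, 0),
    ("太石村", True, 11_400),
    ("太石村 inurl:wikipedia.org", True, 0),
    ("西藏", False, 17_100_000),
    ("西藏 inurl:wikipedia.org", False, 2_400),
    ("民主", False, None),
    ("falun gong", True, None),
    ("taishi", False, None),
]


def group_by_scope(observations):
    """Group bare keywords (no search operators) by the scope observed."""
    groups = defaultdict(list)
    for query, forced, _count in observations:
        if "inurl:" in query or "site:" in query:
            continue  # skip operator probes; keep only plain keywords
        scope = "chinese_pages_only" if forced else "full_web"
        groups[scope].append(query)
    return dict(groups)


if __name__ == "__main__":
    for scope, keywords in group_by_scope(OBSERVATIONS).items():
        print(scope, keywords)
```

Adding a new experiment is then just one more tuple in the list.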

Basically, it looks like two things are going on here: certain sites are simply so controversial that Google.cn won’t offer links to them. inurl: searches reveal that the pages exist, but the results won’t let you see them, and site: searches give you the same result as if you had searched for a nonexistent domain. (There’s a slight difference: search for a nonexistent domain and you don’t get the message that certain results may be removed.)

Use a particularly controversial keyword (falun gong, taishi – though not tibet or democracy) and you’re forced into a search only of pages hosted in China… generally pages approved by the government. (Search for “falun gong” on Google.cn for an example of the sorts of “impartial” content this turns up…)
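To make the two inferred mechanisms concrete, here's a toy model in Python. To be clear, this is a guess at the observable behavior, not Google's implementation: the function names, the shape of the index, and the rule that filtering happens after counting are all assumptions of mine, based only on the search results described above.

```python
from urllib.parse import urlparse


def is_blacklisted(url, blacklist):
    """True if the URL's host is a blacklisted domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in blacklist)


def toy_search(keyword, index, blacklist, trigger_keywords):
    """Toy model of the inferred behavior; every rule here is an assumption.

    index: list of dicts with "url", "text" and "hosted_in_china" keys.
    Returns (reported_count, visible_urls, scope).
    """
    # Mechanism 2: a trigger keyword narrows the catalog to China-hosted pages.
    forced = keyword.lower() in trigger_keywords
    scope = "chinese_pages_only" if forced else "full_web"
    catalog = [p for p in index if p["hosted_in_china"]] if forced else index

    matches = [p for p in catalog if keyword.lower() in p["text"].lower()]

    # Mechanism 1: blacklisted hosts are dropped after retrieval, so the
    # reported count can exceed the number of links actually shown.
    visible = [p["url"] for p in matches
               if not is_blacklisted(p["url"], blacklist)]
    return len(matches), visible, scope


if __name__ == "__main__":
    # Invented sample pages, purely for illustration.
    sample_index = [
        {"url": "http://hrw.org/china", "text": "human rights report",
         "hosted_in_china": False},
        {"url": "http://example.org/a", "text": "human rights news",
         "hosted_in_china": False},
        {"url": "http://example.cn/b", "text": "falun gong statement",
         "hosted_in_china": True},
    ]
    print(toy_search("human rights", sample_index, {"hrw.org"}, {"falun gong"}))
    print(toy_search("falun gong", sample_index, {"hrw.org"}, {"falun gong"}))
```

The point of the model is only that the two behaviors are independent: one hides links from specific domains, the other swaps the catalog being searched.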

If any of my Chinese-speaking readers (including Rebecca :-) would like to collaborate, I’d be very interested in testing a larger list of words in Chinese and English to see which ones trigger this “Chinese pages only” behavior. It would provide an interesting map of which topics are merely controversial and which are completely off limits.

7 Responses to Understanding how Google.cn filters

  3. Patrick Hall says:

    An interesting additional data point:

    Paul Boutin blogged about how the Chinese Google filter doesn’t detect misspellings.

    Searching for misspelled terms such as tianenmen will turn up tanks, not temples.

    I can see version two already: “Did you mean ‘Tiananmen’? Oh, well, we wouldn’t have given you the results you wanted anyway.”

  7. It is incredible that even countries have censorship. People should be free to pursue whatever they want.

