My colleagues at Berkman and in the larger Open Net Initiative have been busy this weekend working on new strategies to test Google’s new Chinese search engine, Google.cn. Rebecca MacKinnon gave us a list of 80+ Chinese keywords to test on google.com and google.cn – I spent part of Saturday watching the Bruins lose and comparing results between the two search engines.
Because I don’t speak Chinese, my comparison of results is generally restricted to two observations: Does one engine give meaningfully more results than another engine? And is the search on Google.cn performed against the whole set of webpages? Or is it rewritten as a search of only pages hosted in China?
I’m especially fascinated by this latter phenomenon. What topics are so out of bounds that Google and/or the Chinese government won’t allow pages outside of China to be listed as a search result?
For an English-language example, it’s instructive to look at a pair of searches. Search for “falun gong” on Google.com and you’ll get 2.2 million results, leading with falundafa.org, a multilingual site dedicated to the promotion of Falun Gong. Perform the same search on google.cn, and you’ll receive 12,500 results, leading with “Truth of Falun Gong”, a site that calls Falun Gong a harmful cult and explains why it’s banned.
Most interesting for our purposes, if we look at a screenshot of this search, the rightmost radio box under the Google search box has been checked – this indicates that the search results come from a search of Chinese-hosted pages only. Check the leftmost box – search all pages – and repeat the search. You’ll get the same result, and the rightmost box will be selected. Certain keywords force Google.cn to perform searches only against Chinese pages, probably to ensure that Chinese users don’t encounter pages hosted outside of China on highly controversial topics.
Working from Rebecca’s list of keywords, I was surprised to discover how few search terms triggered this behavior. Of the 82 terms I tested, only 27 generated significantly different result counts between google.com and google.cn. (For my purposes, I flagged any difference of 25% or greater as “significant”. There are going to be minor differences between google.com and google.cn on most terms, as google.cn blocks results from sites like geocities.com, which host many millions of pages.) Only 16 generated the “forced” result that I find most interesting – a mandatory search against pages hosted in China.
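For anyone who wants to apply the same 25% rule to their own counts, here is a minimal sketch in Python. The post doesn't say exactly how the percentage is measured, so this assumes the gap is taken relative to the larger of the two counts; the function names are hypothetical, and the second pair of counts is made up for illustration.

```python
def relative_difference(count_a: int, count_b: int) -> float:
    """Gap between two result counts as a fraction of the larger count.

    Assumption: the difference is measured against the larger count.
    """
    larger = max(count_a, count_b)
    if larger == 0:
        return 0.0  # no results on either engine: treat as no difference
    return abs(count_a - count_b) / larger


def is_significant(com_count: int, cn_count: int, threshold: float = 0.25) -> bool:
    """Flag a term when the two engines differ by the threshold or more."""
    return relative_difference(com_count, cn_count) >= threshold


# The "falun gong" counts quoted above: 2.2 million vs. 12,500.
print(is_significant(2_200_000, 12_500))  # True

# Hypothetical counts that differ by only 10%.
print(is_significant(1_000, 900))  # False
```

Measuring against the larger count keeps the result between 0 and 1 whichever engine returns more, which matters here because some terms have more results on google.cn than on google.com.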
So what are Chinese censors concerned about? It’s not sex. Rebecca’s list includes 20 sexually oriented keywords (suggesting that if Rebecca were to curse you out in Mandarin, it would leave a mark…). None trigger the Chinese-only page search. The vast majority have similar results on the .com and .cn sites – the Chinese words for “bra”, “make love” and “butthole” have significantly more results on Google.cn, while “penis”, “condom” and “big penis” have significantly more on Google.com.
Falun Gong, on the other hand, is clearly a sensitive topic. Searches for “Dafa”, “disciple”, “truth righteousness endurance”, and Li Hongzhi (the founder of the Falun Gong movement) all have fewer results on .cn and force a search of Chinese-hosted pages. “Falun” – 法轮 – is particularly fascinating. There are three times as many results on google.cn as on google.com – possibly because the engine points to many anti-Falun Gong pages approved of by the government.
(Interestingly, two alternate spellings of Li Hongzhi’s name don’t give forced Chinese searches. This is consistent with a result Paul Boutin found – misspelled search terms don’t appear to be blocked on google.cn – spell “Tiananmen” poorly and you’ll get images of tanks and defiance. I’d argue this is further evidence that Google is attempting to follow the letter of the restrictions presented by the Chinese authorities, but not the spirit of them… but that’s nothing but pure speculation, based on my sense that Google engineers are easily talented enough to program misspellings into their filters.)
Searches for “independence” are interesting as well. “Taiwan independence” and “Mongolian independence” don’t have major disparities between the two engines and don’t trigger a forced search. One spelling of “Taiwan Independence” has almost twice as many results on google.cn as on google.com. Tibet and Xinjiang independence, by contrast, do show major disparities, both forcing Chinese-only searches and yielding fewer than half of the pages the Google.com searches do.
Some other words that force a search of sites hosted in China: “6 4” (the date of the Tiananmen crackdown), “Tiananmen”, “violent action”, “immolate”, “Dalai” (as in “Lama”), “communist dogs”, “taishi village” (site of recent democracy protests), and Liu Xiaobo (dissident university professor).
Others have major disparities, but don’t force a search of Chinese-hosted pages. 三個代表 – “three represents” – Jiang Zemin’s political philosophy – and “eight immortals” – figures of Chinese legend – both have fewer results on google.cn than on google.com. On google.cn, “three represents” gets only a tiny fraction of the matches it gets on google.com. “Brainwash” similarly has fewer results on google.cn, but doesn’t force a Chinese-only search. “Referendum”, on the other hand, has many more results on google.cn than on google.com.
What I find most interesting is that 50+ of the terms Rebecca suggested don’t trigger any meaningful differences between the two engines. They include “human rights”, “Michael Anti” (blogger whose site was removed by MSN), “mafia” and “military police”.
Two notes. First, rechecking results I first obtained Friday night suggests that the search catalogs on Google.cn are being tuned in real time. When I reran a few of these terms today, many that had similar result counts on Google.cn and Google.com now have many more results on Google.com.
Second, if you’d like to test these results for yourself, my friends Nart Villeneuve and Boris Anthony have built a handy tool that performs searches on google.com and google.cn simultaneously and compares results – please do check my work and see if these observations from the other night still hold true. (The page includes all the keywords Rebecca provided…)