One of the best things about being an academic (or even, like me, a pseudo-academic) is how easy it’s become to collaborate with people you don’t know well… or at all.
I posted some data a few weeks back about the use of the #pman twitter tag to organize, report on and debate a set of anti-communist protests in Chisinau, Moldova. Two weeks later, a post on social media guru Beth Kanter’s blog sent me over to Michael Edwards’s post about studying the use of Twitter tags during the NTEN nonprofit technology conference.
What Edwards did was scrape Twitter looking for mentions of #09ntc, the tag used to report on the technology conference. He looked closely at the use of the @ tag, which in Twitter syntax is generally used to address another individual or to credit a comment to the individual. Looking at @ tags turns a set of tweets into a directed graph – when I tweet “Hey @cshirky, did you see that brilliant post by @kanter”, I’ve created links to two other nodes in the graph. (You’re right to note that those links probably shouldn’t be considered to be equivalent – I’m directing Clay to look at Beth, which could imply different relationships to both parties. And it certainly implies a difference in information – I’m telling Clay that Beth has information he might be interested in… and there’s no guarantee that the inverse is true. But those are the challenges of analyzing graphs and not parsing content.)
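To make the graph-building step concrete, here’s a minimal sketch of turning tweets into directed edges. The tweet texts and the edge-counting scheme are illustrative assumptions, not Edwards’s actual scraping code:

```python
import re
from collections import defaultdict

MENTION_RE = re.compile(r"@(\w+)")

def build_mention_graph(tweets):
    """Turn (author, text) tweet pairs into a directed edge list.

    Each @mention becomes a directed edge author -> mentioned user,
    weighted by how often the author mentions that user.
    """
    edges = defaultdict(int)  # (src, dst) -> mention count
    for author, text in tweets:
        for target in MENTION_RE.findall(text):
            if target.lower() != author.lower():  # skip self-mentions
                edges[(author, target)] += 1
    return dict(edges)

# The example tweet from the paragraph above, plus an invented reply:
tweets = [
    ("ethanz", "Hey @cshirky, did you see that brilliant post by @kanter"),
    ("cshirky", "@ethanz thanks, reading it now"),
]
print(build_mention_graph(tweets))
```

Note that, as the paragraph observes, this treats every @mention as an equivalent link – the graph records that I pointed at Clay and at Beth, but not why.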
Directed graphs are pretty common in social network analysis, and there are good tools developed to enable their analysis. The web is a directed graph, with hyperlinks as edges connecting pages as nodes – some of our favorite tools, including most search engines, are based around analysis of these graphs. So Mike took an algorithm called HITS – developed by Jon Kleinberg at Cornell and used in the ask.com search engine – to look at the #09ntc tags. HITS identifies two types of nodes – hubs and authorities. Authorities are nodes in the graph – pages in the web, twitter users in the set of twitterers – likely to be authoritative on a particular topic. Hubs are nodes that have a high chance of pointing to authoritative pages.
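The mutual reinforcement at the heart of HITS – good hubs point at good authorities, and good authorities are pointed at by good hubs – can be sketched as a few lines of power iteration. This is a toy version for illustration, with made-up usernames, not a production implementation:

```python
def hits(edges, iters=50):
    """Toy power-iteration HITS over a directed edge list of (src, dst) pairs.

    Returns (hubs, authorities) score dicts, following Kleinberg's
    mutual-reinforcement idea: a node's authority score sums the hub
    scores of nodes pointing at it, and vice versa.
    """
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        # Authority: sum of hub scores of everyone pointing at you.
        auth = {n: sum(hub[s] for s, d in edges if d == n) for n in nodes}
        # Hub: sum of authority scores of everyone you point at.
        hub = {n: sum(auth[d] for s, d in edges if s == n) for n in nodes}
        # Normalize so scores don't grow without bound.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth

# Three invented users all point at a keynote speaker who never tweets back:
edges = [("alice", "keynote"), ("bob", "keynote"),
         ("carol", "keynote"), ("alice", "bob")]
hub, auth = hits(edges)
# "keynote" comes out as the top authority despite posting nothing itself.
print(max(auth, key=auth.get))
```

That last point previews the Shirky result below: you can rank as an authority purely by being pointed at, without ever using the tag yourself.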
Running HITS on the #09ntc data shows lots of people pointing to Clay Shirky’s keynote. That’s not because Clay is especially active in the local Twitter community, Edwards speculates – Clay doesn’t actually use the #09ntc tag, so “he is, for the purposes of this conference, only a source of information, not a reporter of it.” The leading hubs are Kanter, Rachel Ann Yes and Steve MacLaughlin – Kanter and Yes are also high-ranking authorities, which suggests that while MacLaughlin is doing a lot of reporting, Kanter and Yes are both reporting and responding to a lot of tweets, and are tightly integrated into the conversations taking place.
I thought Edwards’s analysis was a badass little piece of work, so I dropped him a note and asked whether he’d like to look at the data I’d scraped on #pman. He told me that the #pman work had gotten him thinking about scraping Twitter, and offered to run the data through his scripts. And a couple of days later, with no meetings, no grant applications, no travel between New York and the Berkshires, we found ourselves looking at an intriguing data set – the HITS ranking of the 1979 users of the #pman tag.
Looking at the “authorities” the HITS algorithm found in the #pman data, I noticed something interesting and strange. The two most “authoritative” sources – mixman2009 and mediamtv – were two figures I’d noticed using the #pman tag not in support of the Moldova protests, but to critique, argue with and sometimes mock protesters. In the sense of search engine “authority”, they’re the last people you’d want to point to as “authoritative” voices on the #pman tag. But setting aside the names and thinking about the function the two had within the network, the analysis makes sense. mixman2009 and mediamtv said a lot of provocative things, and other users of the tag felt compelled to respond to them… frequently, and repeatedly. Since a tweet that reads “@mixman2009 is an idiot. Ignore everything he says” has the same weight in a directed graph as “read @mixman2009 to understand what’s going on in moldova”, the HITS algorithm turns out to be very sensitive to identifying people starting flamewars, not just those speaking authoritatively.
(From an email from Mike to me, responding to some of my questions about the analysis: “The fact that mixman2009 is an authority makes sense if he posted provocative tweets that the pro-protest tweeters replied to. Posting whatever he did caught the eye of other people who had also linked to other high-ranking ‘authorities’. It kind of seems like a high-volume, highly visible flame war between two sides might quickly create hubs and authorities in Twitter.”)
Many of the people who’ve got high authority scores were using their twitter feeds to support the protesters, and several of these folks have high hub scores as well – it’s possible that some combination of hub and authority score might help us identify the most “important” nodes in a Twitter conversation on a particular tag… and they also might not. One result that I found very interesting – of the eight twitter users with the highest authority scores (putting aside mixman2009 and mediamtv for the moment), none were involved with the conversation on the actual day of the Moldovan protests. Most of the “authoritative” voices are commenters who join the conversation two or more days later, to cheer on the protesters. The voices who appear to be reporting from Chisinau don’t show up as authorities in the set, despite the fact that by a human definition of “authority”, they’re the eyewitnesses we’d expect to be most authoritative. We’re looking more closely at the data to try to figure out whether there’s a way to identify these individuals (other than my method of scraping all tag mentions and looking through early posts to see who appears to be reporting from an event rather than commenting on it, which is hardly scalable or believable).
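One simple form that “some combination of hub and authority score” could take is a weighted blend. The linear mix and the `alpha` weight here are purely my assumptions for illustration – the post only speculates that such a combination might help:

```python
def combined_rank(hubs, authorities, alpha=0.5):
    """Rank users by a weighted blend of hub and authority scores.

    alpha = 1.0 ranks purely by hub score, alpha = 0.0 purely by
    authority; anything in between is a hypothesis to test against
    human judgments of who mattered in the conversation.
    """
    users = set(hubs) | set(authorities)
    score = {
        u: alpha * hubs.get(u, 0.0) + (1 - alpha) * authorities.get(u, 0.0)
        for u in users
    }
    return sorted(score, key=score.get, reverse=True)

# Invented scores: "a" is mostly a hub, "b" mostly an authority.
print(combined_rank({"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}))
```

Finding an `alpha` (or a smarter combination) that surfaces the eyewitnesses rather than the late-arriving cheerleaders is exactly the open problem described above.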
Which is to say that graph analysis looks promising and interesting, but is hardly a silver bullet for analyzing these sorts of conversations. I suspect that doing this sort of graph analysis well is going to require playing with some hard problems, like sentiment analysis, and perhaps finding some way to parse the grammar of tweets to better understand who’s pointing to whom and why when they use the @ sign.
Mike’s work is based around the NetworkX framework, a very powerful set of libraries in Python for analyzing graphs. He’s continuing to play with networks around different Twitter conversations and is documenting results on his blog – if there are graph theory geeks out there, especially experts on the HITS algorithm, he and I would both love to get your help thinking through some interesting research questions.
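For anyone who wants to play along, NetworkX makes this kind of analysis very short. A minimal sketch, with an invented toy graph rather than Mike’s actual data, might look like:

```python
import networkx as nx

# A toy @-mention network (made-up users); edges point from the
# tweeter to the user they mention.
G = nx.DiGraph()
G.add_edges_from([
    ("alice", "keynoter"), ("bob", "keynoter"),
    ("carol", "keynoter"), ("alice", "bob"),
])

# nx.hits returns two dicts: hub scores and authority scores.
hubs, authorities = nx.hits(G)

# The user everyone mentions should come out as the top authority.
print(max(authorities, key=authorities.get))
```

From there it’s a few more lines to sort the score dicts and eyeball who the algorithm thinks matters on a given tag.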
I’m of two minds about spending time analyzing conversation dynamics in Twitter. Part of me wants to make the case that Twitter is a pretty small community, representing pretty sophisticated users in comparison to other online media tools. (Focusing on the use of #hashtags and of @directed messages restricts that set even further, to more sophisticated users.) Is it valid to make generalizations about the spread of ideas in online networks based on analysis of a small, specific subset?
On the other hand, Twitter’s starting to have some real power in influencing mainstream media’s interest in some topics – stories like #pman and #amazonfail appear to have crossed into mainstream attention in part through Twitter, and smart activists are looking for ways to generate sufficient buzz on Twitter, Facebook, or other closely watched social media services as ways of “breaking” stories.
Beyond that, Twitter is a fantastic environment for researchers, because it allows us to get comprehensive data. It’s virtually impossible to answer the question of how many newspaper stories or blogposts in the past month mentioned Swine Flu – Google News tracks about 14,000 news sources, but that’s far from all the possible sources we could track, and none of the blog search engines are especially comprehensive. It is possible to get an answer to the question on Twitter – it takes some work, but the tools I wrote a couple of weeks ago should make it possible to get a precise count.
This sort of comprehensive data lets us ask some different types of questions about how ideas spread in a medium. One of the questions I’d most like to answer is “How successful are bloggers/citizen journalists/twitterers in introducing new stories to mainstream media?” In other words, how often do we see memorable events like “Rathergate” or “Trent Lott at Strom Thurmond’s birthday” emerge from citizen media and gain traction in mainstream news sources? Without an ability to follow hundreds of thousands of sources – what we’re trying to do with Media Cloud – any answer to this question is going to miss lots of failed attempts to get traction for a story.
We might be able to try something different in a closed Twitter universe. Grab every single #hashtag over the course of a month. How many of these are used once, or used by a small group of people, in comparison to #hashtags that gain wide usage? Are #hashtags introduced by people with lots of followers more popular? Does it matter how many #hashtags you try to introduce? How often you post? How viral are these media, really?
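The first tally that experiment needs – how many distinct people used each tag, and how often – is straightforward once you have the tweets. A sketch under the assumption that the corpus arrives as (user, text) pairs:

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")

def hashtag_spread(tweets):
    """For each hashtag, count (distinct users, total mentions).

    This is the raw tally the month-of-hashtags experiment would need:
    tags used once by one person versus tags that gain wide usage.
    """
    users = {}          # tag -> set of users who used it
    counts = Counter()  # tag -> total mentions
    for user, text in tweets:
        for tag in HASHTAG_RE.findall(text.lower()):
            users.setdefault(tag, set()).add(user)
            counts[tag] += 1
    return {tag: (len(users[tag]), counts[tag]) for tag in counts}

# Invented sample: one tag that spreads, one that never does.
sample = [("a", "#pman rocks"), ("b", "#pman indeed"),
          ("a", "#pman again"), ("c", "#onetime")]
print(hashtag_spread(sample))
```

Joining those tallies against follower counts and posting frequency would let us start answering the virality questions above – given, as the next paragraph notes, access to all the tweets.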
It’s not an experiment I can run yet – it would require access to all tweets, not just the search engine scraping I’m doing – but it seems like it would be feasible. As for whether learning how viral #hashtags are in Twitter and whether that tells us anything about the relationship between blogs and newspapers, I don’t have a good answer, but it’s a fun question to think about.