David Weinberger’s new book “Too Big To Know” (#2B2K – be sure to pick book titles that make good hashtags…) launched last night at Harvard Law School with a talk entitled “Unsettling Knowledge”. If you know David’s work, it’s obvious that the title is a pun. And David’s new book is a wonderfully unsettling piece – it challenges our notion of what knowledge is, and introduces the uncomfortable question of how we navigate this new space.
Knowledge as we know it is coming apart, David tells us. The bastions of knowledge, the physical emblems of knowledge, like encyclopedias, newspapers and libraries are undergoing radical transformation. We know we’re heading into a future that’s deeply different, though we don’t know quite how. The manifestations of knowledge are at risk, and all it took was the touch of a hyperlink.
How did these institutions fall apart so quickly? It’s an impossible question to answer, but he offers one path through the thicket. He starts with a famous quote from Daniel Patrick Moynihan, who tells us “Everyone is entitled to his own opinion, not his own facts.” This is the promise of knowledge: that if we all get together and have an honest conversation, we can eventually come to an agreement. There is knowledge and it can bring us together.
We tend to assume that knowledge gives us an accurate picture of the world, built up bit by bit, fact by fact. In acquiring knowledge, we nail down each piece with certainty. And we see knowledge as a product of filtering and winnowing – we move from perception to true perception, from a mob of opinion to true belief. Knowledge is about finding gold within the flux.
We’ve always had to filter, based on the fact that the world is way bigger than what fits in our skulls. There’s too much to know (quoting Ann Blair’s book “Too Much to Know”) and the world is too big to know.
Traditionally, we’ve handled this by breaking off a brain-sized chunk of the world and getting an expert to understand it. Once we’ve got that expert, we can stop asking questions: we simply ask the expert. Experts, and the credentials that create them, are stopping points. They’re points beyond which we don’t need to look any further.
But that’s how knowledge works on paper. Books, for all their magnificence, are a disconnected medium. They are contained within covers, they are shelved apart, they don’t naturally connect to one another. The author’s job is to put everything she knows on a topic between two covers. The arguments move in sequence, from the beginning to the conclusion. And because the book is an essentially limited medium, good writers ruthlessly cast things aside, deciding what to put in the book and what to exclude. Books are born of long-form arguments, moving us forward step by step, brick by brick.
Links are a new form of punctuation. They give you a means of continuing. In the print world, to follow a footnote in a book, you need to get on a bus and go to the library. That’s why we don’t generally follow footnotes. But now we can jump from one book to the next. It’s a magic map – touch a place on the map and you go there.
The internet is an environment that’s all about connection and our knowledge is picking up properties of the medium. Knowledge in this space is characterized by the fact that it’s “too much, messy, unsettled, and unstructured”.
Clay Shirky suggests that there’s no such thing as information overload, only filter failure. This is a very modern response to an older question. Futurist Alvin Toffler warned us about information overload, popularizing the phrase. It’s an extension of the idea of sensory overload, the idea that too much input could overwhelm and paralyze you. This is based on the faulty assumption that brains are information processing machines, and that we can overwhelm and crash them.
This line of thinking led marketers to conclude that choosing between 16 brands would be overwhelming to an American housewife and that fewer choices needed to be offered. But we’re now headed to a point where there’s an exabyte of genomic information available, and that number doesn’t lead us to paralysis, but to fascination. We’ve redefined the term “information overload” through how we use it.
We’re less overwhelmed because we’re learning different ways to filter. When we filtered in the print world, we did so in a way that prevented us from seeing the dregs. We saw only the books that our local library chose to buy, and only the books the publisher chose to print. The manuscripts filtered out of that process were invisible, impossible to retrieve through ordinary means.
Now, in a digital age, we filter forward, not filter out. All that information – some of it very low quality – is out there somewhere on the internet. We could curate and try to delete the stuff that’s wrong, hurtful, harmful or hateful. But it’s expensive to exclude information and cheaper to include everything. When you curate, you’re making decisions about what is interesting to your users, and no one can accurately predict what might be useful to a researcher in the future. Filter out all the gossip and crap from new media and you harm the scholar who wants to study celebrity behavior. You couldn’t have predicted the high level of interest in notes from a committee meeting in Wasilla, Alaska in 2008 until Sarah Palin became a public figure.
The web has worked by developing tools that include all content and filter when we retrieve it. As recently as a decade ago, information retrieval experts told us that ordinary users would never use tools this complicated. But now we use them every day, because we have to. And we’re seeing much better tools, like Shelflife, the tool Harvard’s Library Lab has created to allow users to browse the vast set of information in Harvard’s library systems.
We don’t just have a lot of information – the information is very messy. We like order – David shows a slide of zoological specimens, beetles mounted on pins – and we’re very good at establishing it. We understand where everything fits in a tree of species, based on similarities and differences. To know where a species fit into this tree was to know how the world worked – not to know was to be adrift.
In the physical world, manifestations of information can only be sorted one way at a time. You might want to sort your CDs by artist, while your partner might want them sorted by genre, but there’s only one possible way they can be stacked on the shelf, because no two things can be in the same place at the same time. In a digital age, we simply make playlists. We end up with a mess of information, but it’s a rich and fertile mess.
Figuring out where things fit in the natural order of things was an essential piece of being human. Human beings saw ourselves as “the knowers”. But there are multiple orders and multiple ways of categorizing, through tags, playlists and other ways to sort information. Messiness is an essential feature of how we scale meaning. But, David warns, we still tend to think of knowledge in the ways we did when books had to sit in a single place on the shelf, when knowledge had a single possible right form, rather than multiple forms.
Knowledge is too big, messy and wildly unsettled, just like the internet. “For every fact on the internet, there is an equal and opposite fact.” David warns that there is nothing we all agree on – you can find someone willing to argue that 2+2 is not 4 (and, indeed, a quick Google search shows this to be true.) We don’t agree about anything, and David warns, we never will. “This doesn’t mean there are no facts – but it does mean that people are going to insist on being wrong.”
What this persistence of disagreement means is that the promise of knowledge Moynihan offers – that we can agree on a set of facts and then argue our opinions – is not going to be fulfilled. As it turns out, we don’t even know whether Moynihan actually said “everyone is entitled to his own opinion, not his own facts”, or whether that’s the exact wording he used.
The good news is that we’re rapidly developing ways of dealing with difference and disagreement. YouTube has a crummy commenting system, as is well documented and well established. David shows us a thread of comments on a recent Batman movie trailer. Somewhere deep in this comment thread is an impassioned argument about circumcision. It would have been great if YouTube supported forking of conversations. Forking is a powerful way to deal with disagreement. It’s very hard to do in the real world without social consequences – if we decide to move away from the dinner party to our own table where we talk about circumcision, it makes people uncomfortable – but it’s very easy to do this on the web.
In the 19th century, it was very challenging to classify the platypus. There was one space in a taxonomy for warm-blooded animals, and another for animals that produce eggs. Scientists thought the platypus must be a hoax, because it didn’t fit within existing categories. Even when presented with a specimen from Tasmania with eggs intact, they dismissed the platypus as a “hoax”, something that didn’t fit within existing categories.
Now we can solve problems of overly rigid taxonomies by using linked namespaces. We can create a database of names, and a database of taxonomies. We can deal with the platypus and the water mole, and map scientific and colloquial names onto different possible structures. “Pick your name, pick your taxonomy and get on with your life. So what if we disagree? Yay for difference!”
David is actually quite concerned about difference, and just how much difference we can tolerate and still interact and function. He acknowledges that there’s a human tendency towards homophily, flocking together in groups united by race, gender, belief, socioeconomic status, etc. This can lead to a serious challenge to public discourse – echo chambers that can solidify beliefs, making them more extreme and polarized. But David worries that posing issues this way relies on an unquestioned assumption: that conversations are between people who disagree deeply and are looking for solutions and common ground by trying to get to the facts. This analysis misses the social role of conversation. We need so much context and so much agreement to even have a conversation. “To have a good conversation, you need to have 99% similarity and 1% difference.” He suggests that some of the work Yochai Benkler and I have been doing may help us find productive paths towards including difference, but reminds us that the high level of disagreement and the difficulty of finding common ground is likely a core feature of the internet and knowledge in an internet age.
Finally, knowledge in this new paradigm is unstructured. We’re used to the idea that knowledge has a basic structure. We have grown used to long form arguments that take us from A to Z, and we’re particularly fond of arguments that take us from A to Z in an orderly path, where Z is an unexpected place to end up. “This is a magnificent form of thought, but the long form argument is losing its preeminence.”
We might think of Darwin as a leading proponent of the long form argument. And his argument certainly led somewhere unfamiliar. But he wouldn’t have analyzed data for years and released a massive book if he were working today. He would publish online. And even if he didn’t, the conversation about his work would be based online. Whether or not we imagine Darwin tweeting from The Beagle, the web is where the thinking about and reacting to Darwin’s work would take place, and collectively, it would have more value than Darwin’s long form work taken alone. Moving forward, we will not just see these long form works, but the webs that precede and follow them.
Michael Nielsen has recently written about scholarly community reaction to results at CERN that offer evidence for faster than light neutrinos. As these results came in, they were posted to arXiv.org, a journal preprint site. They stirred up a firestorm of interest and reactions. Some of those reactions are brilliant, some are stupid and wrong. But that welter of discussion is where knowledge is – it’s taking place outside of printed peer review journals.
Darwin spent seven years studying and dissecting barnacles before working on The Origin of Species. His two volume work on barnacles includes countless facts, and his hard work to discover and pin them down was an act of nobility. But science doesn’t work quite like that anymore. We work with clouds of data about genetics, astronomy, and other topics. These data clouds are fundamentally different than facts. When data.gov released sets of government information, they didn’t clean or normalize it ahead of time – they released raw data. They concluded that it was better to put the data out there than to constrain themselves to information that was consistent and known, for the simple reason that this constraint would have slowed them down badly. Darwin would not have agreed – he spent seven years on one fact.
There’s value in getting the data out quickly, David argues. It may be the one approach that’s scalable – releasing raw data and letting individuals and groups clean, analyze and share what they find. Peer review scientific journals don’t scale, but perhaps peer-to-peer peer review might. We’re seeing growth in the Open Access journal field, particularly in repository spaces where data is released rather than peer reviewed.
One way we can start making sense of these new data sets is through the magic of linked data, a format suggested by Tim Berners-Lee, father of the web. We organize information in triples:
The platypus | lives in | Tasmania
Water moles | lay | eggs
When we link triples to a central reference, we can resolve our platypi to water moles and link our triples together. Facts, which used to look like bricks, now look like links.
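To make the idea concrete, here’s a minimal sketch of triples and name resolution in Python – the vocabulary and the SAME_AS table are invented for illustration, and real linked data systems use RDF vocabularies and shared URIs rather than Python dicts:

```python
# A toy version of the linked-data idea. Triples are (subject,
# predicate, object); a shared name table lets "water mole" and
# "platypus" resolve to the same canonical entity, so independently
# written facts link up.

SAME_AS = {
    "water mole": "platypus",   # colloquial name -> canonical name
    "platypus": "platypus",
}

triples = [
    ("platypus", "lives in", "Tasmania"),
    ("water mole", "lays", "eggs"),
]

def resolve(name):
    """Map any known name onto its canonical entity."""
    return SAME_AS.get(name, name)

# Normalize subjects so facts about the same animal connect.
linked = [(resolve(s), p, o) for (s, p, o) in triples]

print([t for t in linked if t[0] == "platypus"])
# [('platypus', 'lives in', 'Tasmania'), ('platypus', 'lays', 'eggs')]
```

The point of the sketch is the last line: two facts recorded under different names end up attached to the same entity once the names are linked.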
David closes by returning to his original question: why were old knowledge systems so fragile? These systems assumed knowledge was bounded, settled, orderly and proceeded step by step. But that’s not what knowledge feels like in the age of the internet. It feels unbounded, overwhelming, unsettled, messy, linked and governed by our interests. And those properties are the properties of what it means to be human in the world.
“Networked knowledge may or may not be truer about the world, but it is truer about knowing… This crazy approach to knowledge feels familiar to us, because it’s how we tend to know.” He closes with an observation that’s both hopeful and unsettling: “What we have in common is a shared world about which we disagree, not a common knowledge we share and can collectively come to.”
I’ve followed David’s work for a long time, and had the pleasure of watching him work through the ideas behind this book – David and I are both part of a group at Berkman that helps colleagues explore book-length projects. While I’m familiar with this line of David’s thought, it was exciting and unsettling to hear him work through these ideas covering the whole arc of the book. I think this may be the most unsettling and radical book David’s put forth. On the one hand, it’s not a surprise that people will disagree on any conceivable fact. But David’s suggestion that we give up on achieving an impossible consensus and proceed with the hard work of getting on with our lives strikes me as challenging and liberating, a very different path than I hear from most activists and advocates. I’m enjoying wrestling with the ideas David puts forth both in this talk and in the book and hope lots of readers will take up the challenge as well.
Beth Kolko is the sort of academic who follows her muse from one fascinating topic to another. Colin Maclay traces some of her past work from a doctorate in English through research on use of technology in the developing world, through her current research on human-centered design and engineering at the University of Washington. For the past couple of years, Beth has been focused on research for a book on hackers and makers. This is a project that comes from her daily life, where she’s spent the last six years participating in hacking and making events in the Seattle area – she’s now considering the implications of hacking for academia and larger questions of how the DIY movement could impact civic engagement and educational reform.
There are three major areas her talk – titled “Hackademia” – focuses on. She’s interested in how hackers, makers and students, especially undergrad students, can work as innovators. She’s starting to identify patterns within non-expert communities that allow hackers and makers to innovate. And she’s interested in how we “make more of this ‘stuff’” – as society and as educators, how do we scaffold and maximize these contributions?
The key to understanding hacking and making, she suggests, is imagination: looking at people as creative problem-solvers. While there’s lots of research on how corporate and university researchers solve problems, there’s less research on how people without credentials solve problems. She’s specifically interested in rulebreakers, people who either break the rules of the academy or laws to innovate. Rulebreaking, she argues, is a type of power play: it’s a way of fighting against the cultural and economic power of “being technical”, finding ways to be technical outside of an existing ruleset.
The people Beth studies are functional, rather than accredited, engineers. She confesses, “I don’t really care about formal STEM (science, technology, engineering and math) education – okay, I care a little. But there are lots of studies on getting people to work in those fields. Instead, I’m trying to get people to be STEM literate and facile.”
Beth tells us about an experiment in group learning she participated in. A group is given a task – from three feet away, collaboratively find a way for the group to touch each card in a set of cards in order. While it’s a simple task, the challenge is to execute it collaboratively, and she reports that her group took a long time to discuss what ways would be sufficiently participatory, while another group never completed the task. When we’re faced with new sets of rules, we are forced to think through tacit assumptions that define our behavior, bringing those internalized constraints to the surface.
She tells us about an independent inventor in Detroit, who created a novel flash heating process for steel. It saves energy, and makes steel that’s 7% stronger than steel made through conventional processes. While his research was independent and uncredentialed, it’s now being analyzed within metallurgy schools to verify the success of the process. One of the people verifying observes that “steel is a mature science”. We tend to assume that all that could be done has been done, but that’s not true.
For an example that’s even further from the academic community, she points us to a YouTube video of a fun parlor trick – removing a cork from a wine bottle without harming cork or bottle. The key is to insert a plastic bag, snare the cork, partially inflate the bag and then pull the apparatus out. An auto mechanic – Jorge Odon – was watching YouTube videos in his native Argentina, and thought this was a cool trick. He wondered if it would work for babies. And it does – the Odon device is now in trials as part of birth kits for the developing world.
There’s innovation from hacking as well. She points to wardriving, a technique developed to compromise networks, which now is part of business processes to ensure corporate networks are locked down. And she suggests that password testing tools have emerged almost exclusively from the hacking community. Security techniques designed to compromise networks become part of standard business practices.
Some of Beth’s recent work has focused on non-expert innovation from students, specifically work on a low-cost portable ultrasound kit. A colleague at the University of Washington working in radiology reached out to Beth for help with user interfaces for ultrasound systems used by midwives in Kampala, Uganda. The goal of the project was to train midwives to identify the three conditions that most contribute to maternal mortality and send affected women to hospitals, rather than giving birth at home.
As Beth and her students worked on the project, they discovered that one major problem was that midwives were trained for 2-6 weeks, while ultrasound readers in the US train for two years before being certified. Even the technicians who train for two years don’t use all the functions of a commercial ultrasound machine – in US ultrasound practice, the complex machines are heavily marked with signs created by the technicians warning not to use certain buttons or to use only certain ranges of frequencies.
Can we make this technology simpler for technicians with less training? This makes sense, as the Ugandan technicians are only trying to diagnose three conditions. The solution Beth and her team found was to move back to older, cheaper ultrasound wands, marry those wands with simple netbooks, and then focus on making the user interface as easy as possible.
Through ethnography with midwives and mothers, they discovered that the use of ultrasound is utterly different in Uganda than in US clinical practice. In the US, the technician can pass any ambiguous results to a support structure of doctors. Midwives in Uganda are generally all on their own – they need to give answers to mothers directly. So she and her students built a help system for the ultrasound device that was a learning system about maternal health, not just a manual for the tool.
“Not understanding the boundaries of the problem space allows innovation – including a help and learning system into the product was something my students did not know was prohibited.”
Beth’s insights in this field come from studying creativity around technology in the developing world, as well as US hackerspaces, makerspaces, hacker cons, and makerfaires. Extrapolating from both types of sites, she observes three characteristics:
- The importance of actual space in bringing communities together
- Systems of apprenticeship or scaffolded learning, including workshops that show people what they need to know to join a community
- Contests and other systems for building reputations, like the “black badges” issued to winners of capture the flag contests at Defcon, or the badges people win on instructables.com
She’s interested in the possible overlaps between university research, industry labs and independent researchers. Her goal is not to map the actual Venn diagram of the space, but to understand how independent researchers work in this space. She believes that independent researchers are particularly important for building disruptive technology. Academics have a disincentive to build highly disruptive systems – they’re hard to get academic funding for, and hard for PhD students to pitch dissertations around. It’s hard to disrupt in the corporate community, especially when disruptive tech is cheaper, as those sorts of innovations tend not to fit within existing sales structures. Independent researchers may be immune to these restrictions and especially capable of pushing forward disruptive innovations.
The structural constraints suggest that independent researchers may not be able to do fundamental research – it’s hard to investigate the deep structure of matter without strong funding. What independent researchers excel at is technological remix. She shows photos of makers building a panoramic camera designed to take photos from near space. There’s not much novel tech development involved with the project, but lots of remix of existing photographic technology.
Beth’s “Hackademia” project has attempted to learn from these general observations. She invited six undergraduate students to meet regularly in a physical space, equipped with desks and chairs and salvaged gear to hack with, including Arduino controllers. She asks the students to learn and keep track of how they learn. She offers no formal instruction, but lots of pointers to places her students can find learning materials.
One of the projects the Hackademia team took on was assembling a MakerBot, a 3D printer that comes as a kit. Very seasoned engineers have been able to assemble the product in seven hours – her team took it slowly, over a period of weeks. But they got it together, and developed some intense technical skills in the process. One student, who had been worried about touching any pieces of the kit for fear of breaking them, found herself some weeks later slapping Beth’s hand when she tried to assemble something for her. This student had thought of herself as “non-technical”, Beth tells us. “But that notion of technical and non-technical broke down for them.”
Why Hackademia? Because there are few mechanisms at the university to allow non-science students to gain technical skills. It’s very hard for someone not on an engineering track to learn how to solder. But Beth’s work isn’t designed to create more professional engineers – it’s to get people to functional technical literacy. “We’re creating functional engineers one blinky LED at a time.”
Interventions like Hackademia, Beth hopes, can address at least six issues:
- self-efficacy – considering yourself capable of engaging in technical acts
- material technical practice – gaining concrete technical skills
- identity formation – identifying personally and socially as a technically competent person
- conception – understanding the scope and practice of technical knowledge
- motivation – articulating possible future selves
- social capital and sustainable participation – understanding how to seek out expert knowledge when necessary
On this last point, Seattle is a particularly sustainable place to build this sort of intervention, as it’s filled with hacker spaces and expert communities who can support this form of experimentation.
Beth’s new effort is Shiftlabs, an engineering and manufacturing company that works only with hackers. The company focuses on the engineering of low cost devices in the global health space, using R&D from independent researchers. Why a company and not a book? Beth explains that she’d never intended for this space to be the main locus of her research – it’s the product of taking a close look at something she’s become fascinated with in her personal life that’s turned into an academic and professional focus.
Daniel Castro of the Information Technology and Innovation Foundation recently published a paper supporting the Stop Online Piracy Act (SOPA) currently being debated in Congress. In that report, he claims that research performed by us supports the domain name system (DNS) filtering mechanisms mandated by SOPA. This claim is a distortion of our work. We disagree with the use of our study to make the point that DNS-based Internet filtering works and that we should therefore use it as a means of stopping websites from distributing copyrighted content. The data we collected answer a completely different set of questions in a completely different context.
Among other provisions that seek to control the sharing of copyrighted material on the Internet, SOPA, if enacted, would call upon the U.S. government to require that Internet service providers remove from their DNS servers the names of any sites that either infringe copyright directly or merely “facilitate” copyright infringement. So, for example, the government could require that ISPs remove the name “twitter.com” from their DNS servers if twitter.com was not being sufficiently aggressive in preventing its users from tweeting information about places to download copyrighted materials. This practice is known as DNS filtering. DNS filtering is one of the most common modes of Internet-based censorship. As we and our collaborators in the OpenNet Initiative have shown over the past decade, practices of this sort are used extensively in autocratic countries, including China and Iran, to prevent access to a range of sites offensive to the governments of those countries.
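To make the mechanism concrete, here’s a toy sketch of what DNS filtering amounts to – purely illustrative, not how any real ISP resolver is built, and the blocked hostname and addresses are invented stand-ins:

```python
# A filtering resolver consults a blocklist before answering, so a
# blocked name simply fails to resolve even though the site itself
# is still online at the same address.

BLOCKLIST = {"example-infringing-site.com"}      # hypothetical entry

FAKE_DNS_RECORDS = {                             # stand-in for real lookups
    "example-infringing-site.com": "203.0.113.7",
    "twitter.com": "203.0.113.8",
}

def filtered_lookup(hostname):
    if hostname in BLOCKLIST:
        raise LookupError(f"NXDOMAIN: {hostname}")  # the name "disappears"
    return FAKE_DNS_RECORDS[hostname]

print(filtered_lookup("twitter.com"))              # resolves normally
# filtered_lookup("example-infringing-site.com")   # would raise LookupError
```

Note that the site itself is untouched – only the name lookup fails – which is why users can route around this kind of block simply by pointing at an unfiltered DNS server. That gap between blocking a name and blocking content is central to the circumvention argument below.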
Opponents of SOPA have argued that the DNS filtering, even though it will have a number of harmful effects on the technical and political structure of the Internet, will not be effective in preventing users from accessing the blocked sites. Mr. Castro cites our research as evidence that SOPA’s mandate to filter DNS will be effective. He quotes our finding that at most 3% of users in certain countries that substantially filter the Internet use circumvention tools and asserts that “presumably the desire for access to essential political, historical, and cultural information is at least equal to, if not significantly stronger than, the desire to watch a movie without paying for it. Yet only a small fraction of Internet users employ circumvention tools to access blocked information, in part because many users simply lack the skills or desire to find, learn and use these tools.”
In our report, we looked at three sets of censorship circumvention tools: complex, client-based tools like Tor; paid VPNs; and web proxies. We estimated usage of those three classes of tools, drawing on reports from the client tool developers, a survey of VPN operators, and Google Analytics data for the web proxy tools. Counting all three classes of tools, we estimated as many as 19 million users a month of circumvention tools. Given the large number of users in China, Iran, Saudi Arabia and other states where filtering is endemic, this represents a fairly small percentage of internet users in those countries; 19 million people represents about 3% of the users in countries where internet filtering is pervasive. We actually believe that 3% figure is high, as some of the tools we study are used by users in open societies to evade corporate or university firewalls, not just to evade government censorship.
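As a back-of-envelope check on that arithmetic (the per-tool split below is invented for illustration; only the ~19 million monthly total and the ~3% share come from our study):

```python
# Rough reconstruction of the estimate above. The breakdown across
# tool classes is hypothetical; the totals match the figures cited.

client_tools = 11_000_000   # hypothetical split across the three classes
vpns         =  7_000_000
web_proxies  =  1_000_000
total_users  = client_tools + vpns + web_proxies    # ~19M users/month

# ~19M being ~3% implies roughly 630M internet users in countries
# where filtering is pervasive.
users_in_filtered_countries = 630_000_000
print(f"{total_users / users_in_filtered_countries:.1%}")   # ~3.0%
```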
We stand behind the findings in our study (with reservations that we detail in the paper), but we disagree with the way that Mr. Castro applies our findings to the SOPA debate. His presumption that people will work as hard or harder to access political content than they do to access entertainment content deeply misunderstands how and why most people use the internet. Far more users in open societies use the Internet for entertainment than for political purposes; it is unreasonable to assume different behaviors in closed societies. Our research offers the depressing conclusion that comparatively few users are seeking blocked political information and suggests that the governments most successful in blocking political content ensure that entertainment and social media content is widely available online precisely because users get much more upset about blocking the ability to watch movies than they do about blocking specific pieces of political content.
Rather than comparing usage of circumvention tools in closed societies to predict the activities of a given userbase, Mr. Castro would do better to consider the massive userbase of tools like BitTorrent clients, which would make for a far cleaner analogy to the problem at hand. Likewise, the long line of very popular peer-to-peer sharing tools that have been incrementally designed to circumvent the technical and political measures used to prevent sharing copyrighted materials are a stronger analogy than our study of users in authoritarian regimes seeking to access political content.
Second, our research has consistently shown that those who really wish to evade Internet filters can do so with relatively little effort. The problem is that these activities can be very dangerous in certain regimes. Even though our research shows that relatively few people in autocratic countries use circumvention tools, this does not mean that circumvention tools are not crucial to the dissident communities in those countries. 19 million people is not large in relation to the population of the Internet, but in absolute terms it is still a lot of people who have freer access to the Internet through the tools. We personally know many people in autocratic countries for whom these tools provide a crucial (though not perfect) layer of security for their activist work. Those people would be at much greater risk than they already are without access to the tools, but in addition to mandating DNS filtering, SOPA would make many circumvention tools illegal. The single biggest funder of circumvention tools has been and remains the U.S. government, precisely because of the role the tools play in online activism. It would be highly counter-productive for the U.S. government to both fund and outlaw the same set of tools.
Finally, our decade-long study of Internet filtering and circumvention has documented the many problems associated with Internet filtering, not its overall effectiveness. DNS filtering is by necessity either overbroad or underbroad; it either blocks too much or too little. Content on the Internet changes its place and nature rapidly, and DNS filtering is ineffective when it comes to keeping up with it. Worse, especially from a First Amendment perspective, DNS filtering ends up blocking access to enormous amounts of perfectly lawful information. We strongly resist the claim that our research, and that of our collaborators, makes the case in favor of DNS-based Internet filtering.
Mr. Castro’s report may be found here, with the reference to our work on p. 8.
The study that is being misused by Mr. Castro is here.
The findings of our decade-long studies are documented in three books, published by MIT Press and available freely online in their entirety.
- Rob Faris, John Palfrey, Hal Roberts, Jill York, and Ethan Zuckerman
This summer, Sasha, Lorrie and I started brainstorming the sorts of events we wanted to host at the Center for Civic Media this fall. The first I put on the calendar was a session on “mapping civic media”, a chance to catch up with some of my favorite people who are working to study, understand and visualize how ideas move through the complicated ecosystem of professional and participatory media.
To represent the research being done in the space, we invited Hal Roberts, my collaborator on Media Cloud (and on a wide range of other research), Erhardt Graeff from the Web Ecology project, and Gilad Lotan, VP of R&D for internet analytics firm BetaWorks. On Wednesday night, I asked them to share some of the recent work they’ve been doing, understanding the structure of the US and Russian blogosphere, analyzing the influence networks in Twitter during the early Arab Spring events and understanding the social and political dynamics of hashtags. They didn’t disappoint, and I suspect our video of the session (which we’ll post soon) will be one of the more popular pieces of media we put together this fall. In the meantime, here are my notes, constrained by the fact that I was moderating the panel and so couldn’t lean back and enjoy the presentations the way I otherwise might have.
Hal Roberts is a fellow at the Berkman Center for Internet and Society, where he’s produced great swaths of research on internet filtering, surveillance, threats to freedom of speech, and the basic architecture of the internet. (That he’s written some of these papers with me reflects more on his generosity than on my wisdom.) He’s the lead architect of Media Cloud, the system we’re building at the Berkman Center and at Center for Civic Media to “ask and answer quantitative questions about the mediasphere in more systematic ways.” As Hal explains, media researchers “have been writing one-off scripts and systems to mine data in haphazard ways.” Media Cloud is an attempt to streamline that process, creating a collection of 30,000 blogs and mainstream media sources in English and Russian. “Our goal is to get as much media as possible, so we can ask our own questions and also let others ask questions of our duct tape and bubblegum system.”
Hal’s map of clusters in popular US blogs. An interactive version of this map is available here.
Much of Hal’s work has focused on using the content of media – rather than the structure of its hyperlinks – to map and cluster the mediasphere. He shows us a map of US blogs that cluster into three main areas – news and political blogs, technology blogs and what he calls “the love cluster”. This last cluster is so named because it’s filled with people talking about what they love. Subclusters include knitters, quilters, fans of recipes and photography. The technology cluster breaks down into a Google camp, an iPhone camp and a camp discussing Android Apps. Hal’s visualization shows the words most used in the sources within a cluster, which helps us understand what these clusters are talking about. The Google cluster features words like “SEO, webmaster, facebook, chrome” and others, suggesting the cluster is substantively about Google and its technology projects.
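For readers curious what content-based clustering looks like in practice, here’s a minimal sketch using off-the-shelf tools – this is not Media Cloud’s actual pipeline, and the toy “sources” are invented:

```python
# Cluster media sources by the words they use: represent each source
# as a TF-IDF vector of its published text, then group similar vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# One "document" per source: all the text it published (toy examples).
sources = {
    "gadget-blog":   "SEO webmaster facebook chrome android app review",
    "knitting-blog": "yarn pattern knit purl sweater wool recipe love",
    "politics-blog": "senate vote bill campaign election policy debate",
}

X = TfidfVectorizer().fit_transform(sources.values())
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for name, label in zip(sources, labels):
    print(name, "-> cluster", label)
```

The real system does this over tens of thousands of sources and also surfaces the characteristic words per cluster, which is where labels like “the love cluster” come from.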
While we might expect the politics and news cluster to divide evenly into left and right-wing camps, it doesn’t. Study the link structure of the left and the right, as Glance and Adamic and later Eszter Hargittai have, and it’s clear that like links to like. But Hal’s research shows that the left and right use very similar language and talk about many of the same topics. This is a novel finding: it’s not that the left and right are talking about entirely different topics – instead they’re arguing over a common agenda, an agenda that’s well represented in mainstream media as well, which suggests the existence of subjects neither the right nor the left is talking about online.
Building on this finding, Hal and colleagues at Berkman looked at the Russian media sphere, to see if there was a similar overlap in coverage focus between mainstream media and blogs. “Newspapers and the television are subject to strong state control in Russia – we wanted to see if our analysis confirmed that, and whether the blogosphere was providing an alternative public sphere.”
The technique he and Bruce Etling used is “the polar map” – put the source you believe is most important at the center, and map other sources at distances that reflect their degree of similarity to it: the more similar a source’s language, the closer it sits to the center. The central dot is a summary of verbiage from Russian government ministry websites. Right next to it is the official government newspaper. TV stations cluster close to the center, while blogs cover a wide array of the space, including the edges of the map.
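Here’s a rough sketch of how one might construct such a polar map – an assumed reconstruction, not Etling and Roberts’s actual code, with invented sources and verbiage:

```python
# Place each source at a radius from the center proportional to how
# dissimilar its language is to the central source's language.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

center = "ministry official government policy announcement decree"
others = {
    "state-tv":  "government policy official broadcast announcement",
    "oppo-blog": "protest corruption election fraud opposition rally",
    "knit-blog": "yarn sweater pattern wool knitting needles",
}

X = TfidfVectorizer().fit_transform([center] + list(others.values()))
sims = cosine_similarity(X[0], X[1:]).ravel()

for name, sim in zip(others, sims):
    print(f"{name}: radius {1.0 - sim:.2f}")   # dissimilar -> far out
```

In a sketch like this, the angle is free to encode anything else (topic, cluster membership); the radius carries the similarity signal.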
It’s possible that blogs are showing dissimilarities to the Kremlin agenda because they’re talking about knitting, not about politics. So a further analysis (the one mapped above) explicitly identified democratic opposition and ethno-nationalist blogs and looked at their placement on the map. There’s strong evidence of political conversations far from the government talking points in both the democratic opposition and in the far right nationalist blogosphere.
What’s particularly interesting about this finding is that we don’t see the same pattern in the US blogosphere. Make a polar map with the White House, or a similar proxy for a US government news agenda, at the center, and you’ll see a very different pattern. Some right wing American blogs flock quite closely to the White House talking points – mostly to critique them – while the left blogs and mainstream media generally don’t. However, when Hal and crew did an analysis of stories about Egypt, they saw a very different pattern than in looking at all stories published in these sources. They saw a tight cluster of US mainstream media and blogs – left and right – around the White House. The government, the media and bloggers left and right talked about Egypt using very similar language. In the Russian mediasphere, the pattern was utterly different – the democratic opposition was far from the Kremlin agenda, using the Egyptian protests to talk about potential revolution in Russia.
The ultimate goal of Media Cloud, Hal explains, is to both produce analysis like this, and to make it possible for other researchers to conduct this sort of analysis, without a first step of collecting months or years of data.
Erhardt Graeff is a good example of the sort of researcher Media Cloud would like to serve. He’s cofounder of the Web Ecology Project, which he describes as “a ragtag group of casual researchers that has now turned in a peer-reviewed publication”. That publication is the result of mapping part of the Twitter ecosystem during the Tunisian and Egyptian revolutions, and attempting to tackle some of the hard problems of mapping media ecosystems in the process.
The Web Ecology Project began life researching the Iranian elections and resulting protests, focusing on the #iranelection hashtag. With a simple manifesto around “reimagining internet studies”, the project tries to understand the “nature and behavior of actors” in media systems. That means considering not just the top users, or even just the registered users of a system like Twitter, but the audience for the media they create. “Each individual user on Twitter has their personal media ecosystem” of people they follow, influence, are followed by and influenced by.
This sort of research rapidly bumps into three hard problems, Erhardt explains:
- Did someone read a piece of information that was published? Or as he puts it, “Did the State Department actually read our report about #IranElection?” It’s very hard to tell. “We end up using proxies – you followed a link, but that doesn’t mean you read it.”
- Which piece of media influenced someone to access other media? “Which tweet convinced me to follow the new Maru video, Erhardt’s or MC Hammer’s?”
- How does the media ecosystem change day to day? Or, referencing a Web Ecology paper, “How many genitalia were on ChatRoulette today?” The answer can vary sharply day to day, raising tough problems around generating a usable sample.
The paper Erhardt published with Gilad and other Web Ecology Project members looks at the Twitter ecosystem around the protest movements in Tunisia and Egypt. By quantitatively searching for information flows, and qualitatively classifying different types of actors in that ecosystem, the research tries to untangle the puzzle of how (some) individuals used (one type of) social media in the context of a major protest.
To study the space, the team downloaded hundreds of thousands of tweets, representing roughly 40,000 users talking about Tunisia and 62,000 talking about Egypt. They used a “shingling” method of comparison to determine who was retweeting whom and sought out the longest retweet chains. They looked at the top 10% of these chains in terms of length to find the “really massive, complex flows” and grabbed a random 1/6th of that sample. That yielded 774 users talking about Tunisia, 888 talking about Egypt… and only 963 unique users, suggesting a large overlap between those two sets.
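Shingling itself is simple to sketch – the window size and threshold below are assumptions, since the paper’s exact parameters aren’t reproduced here:

```python
# Two tweets that share enough word n-grams ("shingles") are treated
# as one being a retweet or near-copy of the other.

def shingles(text, w=3):
    """Set of all w-word windows in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

t1 = "Protesters are filling Tahrir Square right now, incredible scenes"
t2 = "RT @reporter: Protesters are filling Tahrir Square right now, incredible scenes"

sim = jaccard(shingles(t1), shingles(t2))
print(f"{sim:.2f}")        # high overlap -> likely a link in a retweet chain
is_chain_link = sim > 0.5  # the threshold here is an assumption
```

Chaining these pairwise matches together over time yields the long retweet chains the team mined for their sample.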
Then Erhardt, Gilad and others started manually coding the participants in the chains. Categories included Mainstream Media (@AJEnglish, @nytimes), web news organizations (@HuffingtonPost), non-media organizations (@Wikileaks, @Vodaphone), bloggers, activists, digerati, political actors, celebrities, researchers, bots… and a too-broad unclassified category of “others”. This wasn’t an easy process – Erhardt describes a system in which researchers compared their codings to ensure a level of intercoder reliability, then had broader discussions on harder and harder edge cases. They used a leaderboard to track how many cases they’d each coded, and goaded those slow to participate into action.
The actors they classified are a very influential set of Twitter users. The average organization in their set has 4004 followers, the average individual 2340 (which is WAY more than the average user of the system). To examine influence with more subtlety than simply counting followers, Erhardt and his colleagues use retweets per tweet as an influence metric. What they conclude, in part, is that “mainstream media is a hit machine, as are digerati – what they have to say tends to be highly amplified.”
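The metric itself is trivial to compute – retweets received divided by tweets sent – something like this (toy numbers, not the paper’s):

```python
# Influence as amplification: how often does a tweet get retweeted?
actors = {
    "mainstream-media": {"tweets": 50,  "retweets_received": 4000},
    "blogger":          {"tweets": 200, "retweets_received": 1200},
}

for name, a in actors.items():
    print(name, round(a["retweets_received"] / a["tweets"], 1))
# mainstream-media 80.0  <- a "hit machine"
# blogger 6.0
```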
The bulk of the paper traces information flows started by specific people. In the case of Egypt, lots of information flows start from journalists, bloggers and activists, with bots as a lesser, but important, influence. In Tunisia, there were fewer flows started by journalists, more by bots and bloggers, and way fewer from activists. This may reflect the fact that the Tunisian story caught many journalists and activists by surprise – they were late to the story, and less significant as information sources than the bloggers who cover that space over time. By the time Egypt becomes a story, journalists realized the significance and were on the ground, providing original content on Twitter, as well as to their papers.
One of the most interesting aspects of the paper is an analysis of who retweets whom. It’s not surprising to hear that like retweets like – journalists retweet journalists, while bloggers retweet bloggers. Bloggers were much more likely to retweet journalists on the topic of Egypt than on Tunisia, possibly because MSM coverage of Egypt was so much more thorough than the superficial coverage of Tunisia.
While Gilad Lotan worked with Erhardt on the Tunisia and Egypt paper, his comments at Civic Media focused on the larger space of data analysis. “I work primarily on data – heaps and mounds of data,” he explains, for two different masters. Roughly half his work is for clients, media outlets who want to understand how to interact and engage with their audiences. The other half focuses on developing the math and algorithms to understand the social media space.
This work is increasingly important because “attention is the bottleneck in a world where the threshold to publishing is near zero.” If you want to be a successful brand or a viable social movement, understanding how people manage their attention is key: “It’s impossible to simply demand attention – you have to understand the dynamics of attention in the face of this bottleneck.”
Gilad references Alex Dragulescu’s work on digital portraits, pictures of people composed of the words they most tweet or share on social media. He’s interested not just in the individuals, but in the networks of people, showing us a visualization of tweets around Occupy Wall Street. Different networks take form in the space of minutes or hours as new news breaks – the network around a threatened shutdown of Zuccotti Park for a cleanup is utterly different than the network in July, when Adbusters was the leading actor in the space.
Lotan’s visualizations of Twitter conversations about Occupy in July and October 2011
Images like this, Lotan suggests, “are like images of earth from the moon. We knew what earth looked like, but we never saw it. We knew we lived in networks, but this is the first time we can envision it and see how it plays out.”
When we analyze huge data sets, we can start approaching answers to very difficult questions, like:
- What’s the audience of the New York Times versus Fox News?
- What type of content gains wider audiences through social media?
- What topics do certain outlets cover? What are their strengths, weaknesses and biases?
- How do audiences differ between different publications? How are they similar?
- How fast does news spread, and how does it break?
Much of media and communications research addresses these questions, though rarely directly – as Erhardt noted, we generally address these questions via proxies. But Lotan tells us, we can now ask and answer questions like, “How many Twitter users follow Justin Bieber and The Economist?” The answer, to a high degree of precision, is 46,000. It’s just shy of the number who follow The Economist and the New York Times, 54,000.
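A question like that reduces to set intersection once you have the full follower data – here’s a toy sketch, with invented follower lists standing in for data that would really come from the Twitter API:

```python
# How many users follow both accounts? Intersect the follower sets.
followers = {
    "justinbieber": {"u1", "u2", "u3", "u5"},
    "TheEconomist": {"u2", "u3", "u4"},
    "nytimes":      {"u3", "u4", "u5"},
}

bieber_econ = followers["justinbieber"] & followers["TheEconomist"]
econ_nyt    = followers["TheEconomist"] & followers["nytimes"]

print(len(bieber_econ))   # with the real data: ~46,000
print(len(econ_nyt))      # with the real data: ~54,000
```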
Lotan is able to research answers like this because his lab has access to the Twitter “firehose” (the stream of all public data posted to Twitter, moment to moment) and to the bit.ly firehose. This second information source allows Lotan to study what people are clicking on, not just what media they’re exposed to. He offers a LOLcat, where the feline in question is dressed in a chicken costume. “We can see the kitty in you, and the chicken you’re hiding behind.” What people share and what they click is very different, and Lotan is able to analyze both.
This data allowed Lotan to compare what audiences for four major news outlets were interested in, by measuring their clickstreams. Al Jazeera and The Economist, he tells us, are pretty much what you’d think. But Fox News watchers are fascinated by crime, murders, kidnappings and other dark news. This sort of insight may help networks understand and optimize for their audiences. Al Jazeera’s audience, he tells us, is very engaged, tweeting and sharing stories, while Fox’s audience reads a lot and shares very little.
Some of Lotan’s recent research is about algorithmic curation, specifically Twitter’s trending topics. Many observers of the Occupy movement have posited that Twitter is censoring tweets featuring the #occupywallstreet hashtag. Lotan acknowledges that the tag has been active, but suggests reasons why it’s never trended globally. Interest in the tag has grown steadily, and has a regular heartbeat, connected to who’s active on the east coast of the US. The tag has spiked at times, but remains invisible in part due to bad timing – a spike on October 1st was tiny in comparison to “#WhatYouShouldKnowAboutMe”, trending at the same time.
At this point, Lotan believes he’s partially reverse engineered the Trending Topics algorithm. The algorithm is very sensitive to the new, not to the slowly building. This raises the question: what does it mean to “get the math right”? Lotan observes, “Twitter doesn’t want to be a media outlet, but they made an algorithmic choice that makes them an editor.” He’s quick to point out that algorithmic curation is often very helpful – the Twitter algorithm is quite good at preventing spam attacks, which have a different signature than organic trends. So we see organic, fast-moving trends, even when they’re quite offensive. He points to #blamethemuslims, which started when a Muslim woman in the UK snarkily observed that Muslims would be blamed for the Norway terror attacks. That tweet died out quickly, but was revived by Americans who used the tag unironically, suggesting that we blame Muslims for lots of different things – that small bump, then massive spike is a fairly common organic pattern… and very different from the spam patterns he’s seen on Twitter.
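A crude sketch shows why a velocity-sensitive trending score behaves this way – the scoring rule below is a guess at the general shape, not Twitter’s actual algorithm:

```python
# Score volume relative to the recent baseline: sudden spikes score
# high, steady growth scores low, no matter how large it gets.

def trend_score(hourly_counts, window=6):
    """Ratio of the latest hour to the trailing average."""
    recent = hourly_counts[-1]
    baseline = sum(hourly_counts[-window - 1:-1]) / window
    return recent / max(baseline, 1)

steady_build = [100, 120, 140, 160, 180, 200, 220]  # #occupywallstreet-like
sudden_spike = [5, 5, 5, 5, 5, 5, 500]              # meme-hashtag-like

print(round(trend_score(steady_build), 2))  # ~1.47: big, but "old news"
print(round(trend_score(sudden_spike), 2))  # 100.0: small, but new -> trends
```

Under a rule shaped like this, a steadily growing tag with far more total volume can lose out to a brand-new meme, which is consistent with what Lotan saw on October 1st.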
When we analyze networks, Lotan suggests, we encounter a paradox that James Gleick addresses in his recent book on information: just because I’m one hop away from you in a social network doesn’t mean I can send you information and expect you to pay attention. In the real world, people who can bridge between conversations are rare, important and powerful. He closes his talk with the map of a Twitter conversation about an event in Israel where settlers were killed. There’s a large conversation in the Israeli twittersphere, a small conversation in the Palestinian community, and two or three bridge figures attempting to connect the conversations. (One is my wife, @velveteenrabbi.) Studying events like this one may help us, ultimately, determine who’s able to build bridges between these conversations.
I can’t wait for the video for this event to be put online – we’ll get it up as soon as possible and I’ll link to it once we do.
Beth Coleman presents some of her recent research on the protests in Tahrir Square, and a broader theory of how social networks and activism in the physical world work together, today at the Berkman Center. With her is Mike Ananny, her coauthor and a researcher in danah boyd’s lab at Microsoft Research. The presentation, “Tweeting the Revolution”, tries to understand how we read large data sets to understand located action. This is a timely topic because we’re seeing a rise in protest activity that’s been missing from the public sphere for a few decades. Coleman wants to know what we can understand about social media and people’s willingness to take an activist stance. One of the foci of her work is the idea of mediated copresence, which she sees as a major way of understanding the relationship between technology and public action.
Tahrir Square offers an opportunity to think through the relationship between three types of speech:
- Public speech, the broadcast of information to a broad audience
- Civic speech, speech within the networks of your located environment
- Poetic speech, speech about expressing needs and interests
What’s the effect of Twitter, SMS and other technologies in a space like Tahrir? They may be critical in understanding the sustainability of commitments to a movement beyond the initial phase of protest.
In his critiques of online activism in understanding the Arab Spring, Malcolm Gladwell has suggested that activism needs to include bodily presence, risk of harm or arrest, and developed organizational infrastructures. It’s worth asking those questions – does online participation matter? Do we need bodily presence for activism? Coleman and Ananny use the possibility of bodily risk – in this case, physical presence in Egypt – as a precondition for inclusion in her interview group. She cites Elaine Scarry’s work on body and pain, suggesting that when a body is in pain, there’s a loss of self, a loss of agency, and a loss of language. Pain cannot be articulated, and there’s the failure of “subject as a system”. So physical location in Egypt opens risk of incarceration and torture, and creates a category of potentially affected actors.
There’s lots of analysis of network collective action from at least two points of view: considering social media as an augmentation to traditional organizing tools, and considering network media as a form of command and control. There’s an open space for analysis around strategic and tactical engagement around located network media. We might think of social media as a way of facilitating co-presence, the way of being part of a phenomenon either in physical space or in a complementary virtual space. If we’re continually surrounded by Twitter, Facebook and SMS, which remind us of people’s presence even if we’re not interacting with them, how does this help us understand a move from onlooker to participant in collective action?
To understand copresence, we need to understand quotidian media engagement. 17% of Egyptians were online before the revolution, and 72% had mobile phones. Coleman notes that Kate Crawford, studying non-literate women in India, sees SMS use from people you wouldn’t expect to be able to use SMS. It’s worth being open to the notion that SMS could be a powerful tool for conveying a sense of presence to a very large swath of an Egyptian audience. Coleman suggests that we need to engage in careful consideration of the oral and the local to understand the cascade of strong and weak ties and their relationship to collective action.
She and Ananny propose a way of thinking through Egyptian positions towards the Tahrir protests. There were people who were present in Tahrir and those who weren’t. There were people engaged with the protests online and those who weren’t. We can create four categories of engagement by considering those categories in terms of binaries. This separates some figures from the discussion – individuals like Alaa Abdel Fatteh, who was deeply engaged online, but in South Africa for much of the protest. But it’s a useful structure in part because it forces you to consider the bottom quadrant, those who didn’t engage physically or online, and are therefore the hardest to study. Eszter Hargittai’s contribution to the work, Coleman notes, is to urge her to take that quadrant of nonparticipation seriously.
Interviews with participants quickly complicate and stretch the boundaries of these categories. An interview with a 20-something woman, upper middle class, who’s been using Ushahidi to map sexual harassment, shows Coleman that “on/off the square” may be too binary a distinction. In the wake of the media blackout on the 28th, she tells Coleman, she was motivated to go to the square because she didn’t want to be alone, she wanted to find other people, and she felt like the movement was moving from online to offline. But as she headed to the square, she felt a sense of risk and turned around. Her story calls into question whether you needed to be in Tahrir physically to be part of the revolution.
Coleman shows us a graph of Dima Khatib’s Twitter network rendered by Gilad Lotan. Based on the frameworks Coleman is suggesting, can we better understand who connects, who retweets and how information cascades? “How might the data trace of media engagement overlap with the human narrative?”
This matters, ultimately, because it influences how we might develop new tools. This past weekend, Coleman led a workshop with Juliana Rotich of Ushahidi, a platform for crisis mapping and management. “After the crisis, what are the tools for sustaining movements?”