Twitter users send about 500 million tweets a day, an endless fire hose of information about how people feel, what they're doing, what they know and where they are.
For epidemiologists and health officials, it's a potential gold mine of data, a possible way to track where disease is breaking out and how it spreads, as well as how best to help, but only if they can figure out how to find the useful signal amid all the noise. "The question is: How do you take these billions of messages, find the useful information and get it to people who can respond?" says Mark Dredze, an assistant professor of computer science at Johns Hopkins University.
That's a big question, one whose difficulty has pushed many researchers away from the idea of using Twitter data, which they say is too messy and uncontrolled compared with traditional methods of collecting health information, such as surveys and analyses of hospital visits. Others argue that once we learn to effectively harness the data, Twitter's very messiness (including the impulse to tweet how annoying your runny nose is) will make it an invaluable resource.
"It's like a pulse on the world, because people will just tweet whatever, whenever," says Christophe Giraud-Carrier, an associate professor of computer science at Brigham Young University, who studies what he and colleagues have dubbed "computational health science". "Poll answers are filtered by perception or memory; on Twitter, we're observing real behaviour" in real time, he says.
Using Twitter data has other advantages, Dredze says. For starters, it's faster. It can take the Centres for Disease Control and Prevention about two weeks to publish findings, he says. Those numbers can also be delayed by the fact that a sickness doesn't show up in statistics until someone goes to hospital or does something else that causes the ailment to be reported.
Twitter, on the other hand, might reflect it the first morning someone wakes up with a sore throat. Speed can be a big advantage when tracking epidemics and emerging diseases, Taha Kass-Hout, director of the centre's division of informatics solutions and operations, says. "An emerging disease from south-east Asia can be in a US backyard in 12 to 14, maybe 24, hours."
Twitter can also provide a more detailed picture of where disease is breaking out, since many tweets are tagged with their locations. That, coupled with faster data, could help keep hospitals and clinics from getting overwhelmed in the middle of an outbreak. Detailed, location-specific data can also identify clusters of non-communicable diseases - cardiovascular disease or type 2 diabetes, for example - allowing health officials to focus education efforts on the areas that need them most.
Twitter is also in increasingly wide use, including in countries that don't have effective public health tracking agencies.
Those advantages, coupled with the fact that researchers are getting better at tracking and analysing useful information, mean "consensus is forming in the public health and healthcare communities that we really need to pay attention to social media," Kass-Hout says. But he stresses that social media data is "a complementary tool, rather than a replacement" for more traditional methods of gathering information.
One goal of Dredze's research was to confirm how useful Twitter data could be by testing whether tweets about the flu could be filtered in such a way that they tracked with official flu rates.
In May 2011, Dredze and his colleagues were using a computer program to monitor mentions of the flu on Twitter. Suddenly, there was a huge spike in chatter. "It didn't make any sense to us," Dredze says. "The flu season was pretty much over." They drilled down and discovered that people were discussing the fact that Kobe Bryant of the Los Angeles Lakers had played a game while sick.
Dredze and his colleagues decided they needed a better algorithm, one that would let the program filter out tweets that aren't actually about people having the flu. Their system starts by searching for some key words (such as "flu," "fever" and certain brands of medicine) and screening out tweets that contain others: "Bieber" alongside "fever" is a good sign that someone isn't talking about having the flu, and so is a URL, since it probably means they're simply sharing an article. It then applies grammatical analysis to figure out whether someone actually has the flu or is just talking about it. (Is "flu" the subject or the object of the verb? Which verbs are used? Which pronouns?)
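For readers who want to see the shape of such a filter, here is a minimal sketch in Python. It is not the researchers' system: the word lists, the function names and the crude pronoun-and-verb check standing in for real grammatical analysis are all assumptions made for illustration.

```python
import re

# Hypothetical word lists, chosen for illustration only.
INCLUDE_WORDS = {"flu", "fever", "influenza"}
EXCLUDE_WORDS = {"bieber"}
FIRST_PERSON = {"i", "i'm", "i've", "my", "me"}
ILLNESS_VERBS = {"have", "having", "got", "getting", "caught", "feel"}

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def passes_keyword_filter(text):
    """Keep tweets with a flu key word; drop ones with a URL or an excluded word."""
    if URL_PATTERN.search(text):
        return False  # a link suggests the tweet is sharing an article
    words = set(tokenize(text))
    if words & EXCLUDE_WORDS:
        return False  # e.g. "Bieber fever"
    return bool(words & INCLUDE_WORDS)


def looks_like_own_illness(text):
    """Crude stand-in for grammatical analysis: a first-person word plus an illness verb."""
    words = set(tokenize(text))
    return bool(words & FIRST_PERSON) and bool(words & ILLNESS_VERBS)


def is_probable_flu_report(text):
    """Combine both stages of the sketch."""
    return passes_keyword_filter(text) and looks_like_own_illness(text)
```

A tweet such as "I think I have the flu" passes both stages, while one about a basketball player competing while sick passes the key word stage but fails the first-person check. A real system would need proper parsing to answer the subject, verb and pronoun questions reliably.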
They tested the system when reports of the latest US flu epidemic hit the media in January. The number of tweets mentioning the flu shot up, though most of them didn't reflect actual cases. But once Dredze and his team ran the tweets through their algorithm, the filtered results matched actual flu rates.
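The article does not say how that match was measured. One common way to compare two series, assumed here purely for illustration, is the Pearson correlation between weekly counts of filtered tweets and weekly official flu rates. The function name and its inputs below are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and each series' sum of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

A value near 1 would mean the filtered tweet counts rise and fall with the official figures; a value near 0 would mean they do not move together.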