Data-mining: A gold mine for Guardian readers

Simon Rogers is editor of the Guardian’s Datablog and Datastore, an online data resource that publishes hundreds of raw datasets and encourages its users to visualise and analyse them. He is also a news editor who works with the graphics team to visualise and interpret huge datasets. He was closely involved in the Guardian’s exercise to crowdsource 450,000 records of Members of Parliament’s expenses and in its coverage of the Afghanistan Wikileaks war logs. He has just been awarded the Oxford University Internet Institute’s award of “Best Internet Journalist” and was recently honored at the Knight Batten awards for journalistic innovation.

Simon Rogers explains some of The Guardian’s data-mining work at Strata 2012

Rogers wasn’t good at math in school. The only experience he’d had crunching numbers before joining The Guardian’s news desk in September 2001 was compiling a database for a marketing magazine right out of journalism school. He is now the editor of the best-known blog on data for the media (http://www.guardian.co.uk/datablog) and receives frequent invitations to talk about data-driven journalism all over the world. He flew to the U.S. from London to speak at the Strata conference on Tuesday. The event has drawn over 2,000 developers, business managers, data scientists and journalists.

The Guardian’s Datablog started in April 2009, just after Data.gov was launched in the U.S. and before Data.gov.uk was released in the U.K. Both government websites are part of a broader trend towards open data. “We were very lucky. The timing was very good for us. We were just at the beginning of this,” he said.

How did you acquire your data-related skills?

Doing things. I find that unless I am doing something with a certain tool I don’t use it that much. For instance, Google Refine is an amazing tool, but I haven’t had to use it that much so I don’t really know how to use it. Whereas something like Google Fusion Tables I use all the time, so I’m pretty good at it. So learning, for me, has been on the job. And these tools are changing all the time so there is no point in just going to use one because in a year’s time there will be something completely different.

So obviously you are not afraid of learning?

It’s part of the job, isn’t it? When I started working as a journalist, nobody had access to the Internet on their machines. I was the only one with a computer at my desk because I was the youngest person in the office. And now, can you imagine a reporter not using the Internet, not taking notes on their laptop? I think the tools, the nature of what we do as journalists is changing all the time.

What software do you use?

We’ve got a content management system which is in-house. At the Guardian we use that. Otherwise, most of the time I use Excel. Ninety percent of what I do is in Excel. I use Google’s Fusion Tables quite a lot. I use Tableau a bit, although I’m planning to do that a bit more because it’s a very quick way of producing … Fusion Tables are very good when you have a big data set and you want to map a lot of stuff. It’s perfect for that.

How big is your team?

There is really myself and I’ve got a researcher called Lisa Evans. She’s very good. And apart from that a part-time researcher called Amy who is a trainee journalist. And we’ve also got access to the development team at the Guardian, occasionally when we can. So, it’s not like we’re working completely on our own. For instance, we have just published a Tableau visualization. It was a bit complicated for me so I found a developer who I knew could do it. So often we’re stealing people from different bits of the organization, different organizations, too, who can help us out.

What do you mean exactly when you say ‘I have access to the development team’?

The Guardian has a team of developers and I might be able to get access to them. It’s occasional in the sense that they also have to maintain the site, build new bits of the Guardian website and so on. So, I have to persuade people that it’s the right thing to do. So, most of the time we have to do things on our own, I suppose.

What are your plans in regard to data-driven reporting?

My intention is really to make it part of what the Guardian does every day. We’re situated next to the newsroom. Frequency of publication to me is very, very important. I want to do things not only well but also often so that people get used to the idea that when they come to the Guardian they have the ultimate kind of data resource. There is data they can use. But also they can find data sources, they can find data mining news topics of the day and it’s all there for them, and available and open.

Are reporters at the Guardian afraid of data journalism?

I think they are less scared than they used to be, thanks to things like WikiLeaks. People realize that you can get stories out of data. But at the same time, there is a reluctance to challenge the numbers in the way you’d challenge the source you interviewed. Often journalists just believe the numbers, without questioning them, which is worrying. But I think the younger journalists who are coming up are very data-savvy and literate and interested, much more than the older journalists, so it will change. Yeah, we have competition now, in the way we didn’t have two years ago. It’s changing all the time. Next year somebody else might be speaking here instead of me.

What is the value of data journalism for you?

I think it’s allowed us to generate stories we would’ve never been able to do before and it’s opened up the government and it’s opened up the information, certainly, in the U.K. Whereas in the past somebody could lie about things, a government minister could lie about something and you wouldn’t have the tools to question that. Now you do, you can check it straight away. Mark Twain said: “A lie could be halfway around the world before the truth has got its boots on.” And now the truth can come and catch up and that’s really interesting.

Teresa Bouza, a senior correspondent for Spain’s EFE News Services, interviewed Guardian data guru Simon Rogers at the O’Reilly Strata Conference in Santa Clara, Calif. As a 2012 John S. Knight Fellow, Bouza is working on making open-source data mining tools more accessible.