(This article features Internews' data journalism work in Kenya and Afghanistan.)
By Marianne Bouchart
As a journalist, how do you go about accessing, verifying, and visualising datasets on crime and policing? What kind of ethical questions does that raise? How do you protect the victims? We gathered experts to find out.
Tom Meagher is deputy managing editor for The Marshall Project, which has been publishing some of the most compelling crime data journalism of the past few years. Their project Crime in Context won a Data Journalism Awards 2017 prize for its analysis of 40 years' worth of national and local crime data. The Next to Die has been tracking every execution in the US for the last two years in close to real time.
What makes working with crime or police data different from working with any other type of data?
Tom Meagher: Oh, where to begin? In the US, there are a few things that make criminal justice data a little more complicated than in most other beats. First, there’s a presumption of innocence for people accused of crimes until their case works its way through the court systems. So we want to be mindful of how the people our data represents are considered. Not everyone arrested is guilty, but with data it can be easy to overlook that key fact sometimes.
And more practically, in the US the data is so, so fragmented. There are 18,000+ police agencies and thousands of courts that all seem to keep their data in their own way (if they keep it at all). It makes it really challenging to carry out national analyses of how parts of the criminal justice system are operating. There are very few one-stop-shops for data.
Ciara McCarthy: I think, for us at The Counted at least, the main issue we set out to fix was that the data we wanted to analyse and investigate simply didn’t exist. There was no comprehensive or reliable information about how many people died in police custody in the US (although there is lots of available data, of varying reliability, about other pieces of the criminal justice system).
I think that a lot of criminal justice data […] might not be complete or accurate if it’s even been collected. And to echo Tom, that’s the other main issue: With no central body keeping track of the data we were looking at, it was hard to monitor thousands of different law enforcement agencies, all of which follow slightly different policies and standards for releasing information and communicating with reporters.
Ciara McCarthy is a journalist who worked on Guardian US’s The Counted project, often referred to as an industry benchmark. It counted the number of people killed by police and other law enforcement agencies in the US throughout 2015 and 2016 to monitor their demographics and to tell the stories of how they died.
Both of them joined us during a Slack discussion dedicated to crime and police data at the beginning of November. This article gathers the best tips and advice they dared to share.
Although the FBI ‘collects’ this data, it’s wildly inaccurate, and underestimates the true number of people who die in police custody by at least half. It’s optional for police departments to submit their information to the FBI, meaning that most don’t end up doing it.
So would your advice be to ‘build your own data’?
Ciara McCarthy: I think it depends! Once our team started reporting on this issue in particular, it was clear that, at least for deaths in custody, the information the federal government had would have resulted in deeply flawed analyses. But in other areas of the US criminal justice system, the data collected by the government is usable — I think it’s a matter of asking a lot of questions of an available data set before you get started and seeing whether you can make reliable analyses. And if you can’t, then yes! Build your own data.
Tom Meagher: It seems like at The Marshall Project, for nearly every significant investigative story we do, the data doesn’t exist. We have to build it ourselves. As an example, here’s a story I wrote about just a few of the really key criminal justice questions we can’t answer in the US because the data doesn’t exist.
As data is tough to get hold of, do you have tips on how or WHERE to find crime and police data?
Tom Meagher: When we’re approaching a story, we have to craft a new strategy every time. For Crime in Context, we had a trove of 40+ years of the federal Uniform Crime Reporting data, but then we had to go back and contact individual police agencies to fill in dozens and dozens of holes we identified.
Then we had to call 70+ police agencies to get them to release the previous year’s data (this was in August) because the FBI didn’t have it yet. We could flag missing records in the data or reports that were suspicious (how could they have -30 assaults in a month?) and had to report each of those out. My friend Steven Rich at the Washington Post likes to say ‘the phone is the most important tool for data journalism’.
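The kind of check Tom describes, flagging missing or impossible records so each one can be reported out, might look something like the sketch below. It is only an illustration: the agency figures, the `monthly_assaults` name and the `flag_records` helper are all invented here and are not The Marshall Project's actual code or data.

```python
from typing import Optional

# Hypothetical monthly assault counts for one agency. None marks a month
# the agency never reported. All numbers are invented for illustration.
monthly_assaults: dict[str, Optional[int]] = {
    "2015-01": 42,
    "2015-02": None,
    "2015-03": -30,
    "2015-04": 39,
}


def flag_records(counts: dict[str, Optional[int]]) -> dict[str, str]:
    """Return {month: reason} for every record worth a phone call."""
    flags = {}
    for month, value in counts.items():
        if value is None:
            flags[month] = "missing"
        elif value < 0:
            # A count of crimes can never be below zero.
            flags[month] = "negative count"
    return flags


flags = flag_records(monthly_assaults)
for month, reason in flags.items():
    print(f"{month}: {reason} - follow up with the agency")
```

The script only produces the list of questions. Answering them still means picking up the phone.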
Ciara McCarthy: For us at The Counted we basically went from agency to agency to request and ask for the data. Sometimes we had to request the information under public records law, and sometimes the information (or the basics, at least) was easily distributed. The Counted was a little different from some data analysis projects in that it was live: We added new cases of people killed by police to the database each day.
How do you verify data related to crime and the police, especially when victims come forward to denounce wrongdoing? Any tips or best practice on crowdsourcing for such projects, and establishing trust with sources?
Tom Meagher: We tend to rely on official court records — lawsuit filings, courtroom testimony, decisions — and on other journalists to help us vet information. Our executions project, The Next to Die, is a sort of journalistic crowdsourcing, where we work with reporters and editors in eight other news organisations to help us amass the information that goes into our database.
Ciara McCarthy: A few things I’d point out from our project: First, for us, when we couldn’t give a definitive answer, we noted it (see an example right here). I think part of the genius behind our very brilliant interactive journalists who built the database was they created one that could adapt to our reporting needs as we added to the database.
So if police said someone was armed with a knife, but witnesses said the person had dropped the knife before the shooting, we usually label that ‘disputed’ in our database, and then pursue additional information to try and get a clear answer. In cases of people killed by police, the first piece of information almost always comes from authorities, and that information may or may not be true. So if there are witnesses (often there aren’t) we’ll talk to them to see if they saw something different.
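One way to picture a record that can hold a ‘disputed’ label is sketched below. The `ArmedStatus` class and its field names are assumptions made for this example. The article does not describe how The Counted's database was actually structured.

```python
from dataclasses import dataclass, field


@dataclass
class ArmedStatus:
    """Hypothetical record keeping the police account next to other accounts."""

    police_account: str  # what authorities first said, e.g. "knife"
    other_accounts: list[str] = field(default_factory=list)  # witnesses, video, family

    def label(self) -> str:
        # If any other account contradicts the police, do not pick a side:
        # mark the case as disputed until further reporting settles it.
        if any(account != self.police_account for account in self.other_accounts):
            return "disputed"
        return self.police_account


case = ArmedStatus(police_account="knife", other_accounts=["unarmed"])
uncontested = ArmedStatus(police_account="knife")
print(case.label(), uncontested.label())
```

Keeping every account in the record, rather than overwriting the first one, means the label can change as new information comes in.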
Secondly, we considered The Counted to be a crowdsourced database, meaning that our readers could reach out and contact us with tips at any time. We had a ‘tip line’ of sorts on our website and we also got information from readers via Facebook, Twitter, and email. Most of the time, the people reaching out to us weren’t sources with sensitive or story-cracking information, but readers with questions about the project or people alerting us to new cases. Sometimes, though, family members of the deceased would reach out to dispute law enforcement’s characterisation of the incident, and when that happened we’d follow up on whatever information they gave us.
Have you ever been worried about the backlash or negative impact your projects could have?
Tom Meagher: We try to operate in a ‘no surprises’ manner. We go to great lengths to let our subjects know what’s coming out and to give them an opportunity to respond ahead of time. A big story my colleagues undertook on programmes that let people pay to stay in safer or nicer jails relied heavily on freedom of information requests and data compiled from more than 25 different police jurisdictions. If you look at the methodology, they describe how they did the analysis and how they took it to each of those police agencies a few weeks before publication to give them a chance to dispute or comment on the analysis.
As far as protecting sources from legal or physical harm, we’re very mindful of that. We go to great lengths to get our sources to go on the record, but if we think they’re potentially in jeopardy, we will allow them to be anonymous, provided we can vet their story independently. We don’t want to put anyone at risk of losing their jobs or of physical harm.
Ciara McCarthy: No one on our team personally encountered any threats or danger as a result of The Counted project as far as I know; I’d say the worst I personally encountered was a few mean tweets and a few terse phone calls with law enforcement officials who weren’t happy about the project. We also didn’t have a ton of anonymous sources whose identity we needed to protect (which I don’t think is something we expected starting out).
Most of the time, if witnesses or family members contradicted the police account, these (very brave) people did so pretty publicly. See, for example, this article telling the story of an American who filmed police violence. If there were cases where our reporters were working with anonymous sources, they were very cautious and made sure those who were providing information knew what publishing their accounts entailed.
Do you encounter difficulties in streamlining key definitions (for example ‘armed’ vs ‘unarmed’, or ‘Police custody’), especially when gathering data from multiple sources? How do you resolve these differences?
Tom Meagher: Oh yes, all the time. We find that different agencies or different states will often use the same words but have completely different meanings. In one state, for example, they may have a crime called ‘battery’ that in a different state would be labelled ‘assault’. We first try to make sure that we understand exactly what each term means to each source. We start with getting their data dictionary (or record layout or user’s manual) to see how they define it in print. Then we’ll follow up with interviews with agency personnel to confirm our understanding of the terms. Ultimately, we’ll often create our own categorization scheme that is hopefully more accessible to readers to describe each class of records we see in the data.
In the Pay to Stay story, we had 25+ agencies all using different terms to refer to a fairly arcane set of state statutes that you really needed a law degree to understand. With lots of reporting work, we were able to generally class them as types of crimes with colloquial names (Drugs, Driving Violations) that were still accurate to the legal definitions, muddled as they were. It ultimately made it easier for our readers to grasp the importance of the different types of crimes being reported on.
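A categorization scheme of this kind can be as simple as a lookup table built by hand after the reporting is done. The sketch below is hypothetical: the agency names, the source terms and the ‘Assault’ category are invented for illustration (only ‘Drugs’ and ‘Driving Violations’ come from the story), and this is not the scheme The Marshall Project used.

```python
# Hand-built lookup: (agency, term as the agency writes it) -> reader-friendly
# category. Each entry should be confirmed against the agency's data
# dictionary and an interview before it goes in.
CATEGORY_BY_AGENCY_TERM = {
    ("Agency A", "battery"): "Assault",
    ("Agency B", "assault"): "Assault",
    ("Agency A", "dui"): "Driving Violations",
    ("Agency B", "possession of controlled substance"): "Drugs",
}


def categorize(agency: str, term: str) -> str:
    """Map an agency-specific term to a shared category.

    Unknown terms are never guessed: they come back as "UNREVIEWED"
    so a reporter can check what the agency means by them.
    """
    key = (agency, term.strip().lower())
    return CATEGORY_BY_AGENCY_TERM.get(key, "UNREVIEWED")


print(categorize("Agency A", " Battery "))
print(categorize("Agency C", "theft"))
```

Keying on the agency as well as the term matters, because the same word can mean different things in different states.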
“Often in data reporting, it’s tempting to be lulled into thinking that the ‘official data’ that is provided to you is rational and sensible and ready to be analyzed or visualized. In reality, we find most of the time that it’s a complete mess that requires a lot of reporting before we can even think about analyzing it to inform our reporting.” Tom Meagher (The Marshall Project)
Ciara McCarthy: We ran into this issue A LOT while working on The Counted project, particularly when it came to defining whether the deceased was armed or unarmed, as you noted. As you can imagine, the law enforcement definition of someone who is armed might differ from what others would consider armed, or the police account might change over time. We ran into this a lot when police shot and killed someone who was driving a car; often, they would say, they opened fire because the person in question was using the car as a weapon. (We did a bigger piece on this here).
That’s obviously super tricky, because it’s difficult to corroborate without video or a witness. A good example of this issue is the case of Zachary Hammond, a teenager who was shot and killed in South Carolina in 2015. Police initially said he drove the car toward the officers, which is why one opened fire. Surveillance footage released later showed that Hammond was driving past the officer, and not directly at him.
So I don’t have an easy answer! Sometimes the only available info we had was from police, but we’d do our best to find other sources when the police account seemed questionable. Basically, it meant a lot of extra reporting and a lot of discussions among our team members.
What tips do you have on visualising crime and police data? How and why do you decide whether or not to show people’s name, photo, or personal information?
Ciara McCarthy: With The Counted, we had built this big database, and wanted people to be able to use it and explore it and learn from it. That’s a main reason why the database included photos, whenever possible: We really wanted to put a face on each person who had died, so we weren’t only focusing on the overall number of people who died.
As for personal information, we would include what was relevant; so, for example, if a person’s medical or mental health history might have impacted their interaction with authorities, we’d be sure to note that.
Tom Meagher’s tips:
- You want to give your data context.
- Avoid one-year comparisons.
- Set it against historical data as much as possible.
- As you visualize it, try to remember that every record in that database represents a person — someone who was injured or victimized or killed, or someone who has committed crimes.
- Try to use your visualization to emphasize their humanity as much as you can. Dots or jagged lines sometimes obscure the people they represent.
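To see why a one-year comparison can mislead, here is a small sketch that sets the same year against both the previous year and a longer baseline. The yearly counts are invented, and using the average of all earlier years as the baseline is just one simple choice made for this example.

```python
from statistics import mean

# Invented yearly crime counts for one city, for illustration only.
yearly_counts = {2010: 120, 2011: 110, 2012: 100, 2013: 95, 2014: 80, 2015: 92}


def one_year_change(counts: dict[int, int], year: int) -> float:
    """Percent change from the previous year alone."""
    previous = counts[year - 1]
    return (counts[year] - previous) / previous * 100


def change_vs_history(counts: dict[int, int], year: int) -> float:
    """Percent change from the average of all earlier years on record."""
    baseline = mean(value for y, value in counts.items() if y < year)
    return (counts[year] - baseline) / baseline * 100


print(f"vs. previous year: {one_year_change(yearly_counts, 2015):+.1f}%")
print(f"vs. earlier years: {change_vs_history(yearly_counts, 2015):+.1f}%")
```

With these made-up numbers, the one-year view shows a rise while the longer view shows the year sitting below its historical average, which is the context a single comparison hides.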
Is there one thing you wish someone had told you before you took on The Counted and the Next To Die projects?
Tom Meagher: Building your own databases for open-ended projects can be very fulfilling as a journalist. You’re filling a gap in the public’s understanding of an issue. It’s very worthy. But also keep in mind that you’re committing your news organization to an endless project.
Does the story merit your time and your colleagues’ time for the indefinite future? I’d argue that The Counted and the Next to Die do. But you don’t want to make the decision without understanding the costs and all the other reporting you won’t be able to do for the next few years because you’ll have to be updating your database.
Also, these can be very emotionally taxing subjects to report on. You’re spending your entire professional life (and much of your personal life) immersed in stories of violence, and trauma, and misery. Be sure to take care of yourself and give yourself emotional outlets.
What do you think could be done to improve things? Do we just need more comprehensive data from authorities compiled in a standardised way?
Tom Meagher: The division of powers between local and state and federal governments in the US makes it complicated. There’s realistically not going to ever be a single source for reliable data. What would be a vast improvement would be if more politicians and policymakers embraced the ideas of transparency and accountability, and accepted that better, smarter data will help them and the public understand our justice systems and make better decisions.
As journalists, we’d certainly benefit from that change in mindset, which is still too rare here.
Ciara McCarthy: It would be lovely to get more comprehensive data, but perhaps that’s just wishful thinking. I think getting data from a variety of sources and different types of data will help — comparing a database of media reports vs. official data, for example. That’s what my team is doing with our project, anyway.
More comprehensive data from authorities would be amazing, of course, but when that’s not an option I think building your project is a great public service for newsrooms to undertake. One of my favourite things about The Counted was that, on the surface, its mission and premise were pretty simple: The US government should know how many people are killed by police each year. We don’t, so let’s change that.
There’s obviously a ton of different reporting that can (and should!) be done on issues related to police violence, but one thing I really liked about our project was that, at the heart of it, we were saying that we can’t have this public policy discussion without reliable data. I think having this specific, and sometimes narrow, aim for big journalism projects can be really clarifying, and help you achieve impact.
How does it compare in other parts of the world?
Aun Qi Koh of Malaysiakini (Malaysia): I feel like it’s the opposite problem in Malaysia as the official data comes from just one source, the Interior Ministry/Royal Malaysian Police, but it’s not very detailed, and unfortunately we don’t have many other sources of data because there aren’t many checks and balances on the police.
Shree D N of Citizen Matters (India): India has the problem of under-reporting crime data. The National Crime Records Bureau is the official data source, but under-reporting is common. This article has some insights on the issue. The methodology used to record offences leads to under-reporting of rape, abduction and stalking.
During our November Slack discussion she shared with us great examples from Kenya, Afghanistan and Turkey:
“I think The Counted inspired so many other media outlets because they realized they could build their own databases using similar data collection techniques but getting away from official sources. The Kenya Nation Newsplex team used mostly media reports to compile its Deadly Force Database.
Pajhwok Afghan News maintains a database of terrorist attacks that is much more detailed than anything the government or international bodies maintain. It’s not too much work because they cover all terrorist attacks anyway so they just have to enter them into the database. And then they can generate monthly stories on trends in terrorism in Kabul and across Afghanistan without too much effort.
This paper on collaboration between civic tech and data journalists I think is also relevant. In Turkey, Dag Media works with a domestic violence NGO to track violence against women. The NGO builds the database and the journalists do the stories.”
To see the full discussion, check out previous ones and take part in future ones, join the Data Journalism Awards community on Slack!
Over the past six years, the Global Editors Network has organised the Data Journalism Awards competition to celebrate and credit outstanding work in the field of data-driven journalism worldwide. To see the full list of winners, read about the categories, or join the competition yourself, go to our website.