Data Journalism Resources

A friend asked me for some resources to help a team develop its data journalism capabilities. Here are the things I’ve pulled together. It’s by no means comprehensive, but it’s a start.

Workflow

General Resources

  • DataJournalism.com is a massive resource from the European Journalism Centre (EJC) and the Google News Initiative – the handbook especially. Lots of advice drawn from experience, plus plenty of articles.

Other people’s lists of resources

Tools

Commuting - a case study in collaboration

Commuting: most people do it, most people hate it. We already knew that men do more of the longer commutes, and our newest analysis, published at the beginning of September, showed that women are more likely to leave their job over a long commute.

Our article was picked up by the Telegraph, Reuters and the Evening Standard. Reporters from the BBC, analysts from the IFS and Ipsos MORI, and other statisticians commented on social media about our findings. Most notably, the then Minister for Women and Equalities, Amber Rudd, said:

“Women across the country struggle to find a balance between being a parent and their job.

“These statistics show how women are likely sacrificing a larger paypacket, and career growth, because they are doing the bulk of childcare and unpaid work – like taking care of elderly relatives and their home.”

It felt like the success was down to this being our most collaborative project yet.

The idea

In January, I was trying to get an idea off the ground around who does the same commute as you. As a member of the Digital Content team, I’m responsible for communicating our data to a broad audience of inquiring citizens. Our users are interested in how they fit into our statistics, so I thought we could show how their commute compared with others like them.

I started by emailing some people in our policy team, which led to another meeting with our Labour Market team. The meeting notes got circulated round to even more people.

Separately, Vahé in our Methodology division had been working on some analysis of gender differences in commute time, having found a gap in the literature regarding contributors to the gender pay gap. His analysis was aimed at policymakers and expert users, but the underlying data, with people’s places of residence and work, linked back to my idea.

Fortunately, Tom Williams from the Labour Market team, who had been involved in our initial meeting, was aware of both strands of work. We had worked with Tom before on our previous analysis of commuting.

Takeaway 1: Putting the right people together takes multiple attempts

We explored a few different avenues for pulling in additional information. I asked people who had worked on admin data whether we could get information about parents, but this wasn’t feasible. We had, however, seen the Data Science Campus talk about another project using a trip planner to calculate travel time.

Takeaway 2: Explore possible avenues early but close down ones that don’t look fruitful

Calculating commuting time

The Data Science Campus project, access to services, looked at how people could travel to different services, for example GPs, and whether there were black spots. For this project, they had created a tool called propeR which runs analysis on batches of locations.

I thought we could use this tool to calculate commute times instead of using commute distance. We needed a bit of help, and the people at the Data Science Campus, especially Michael Hodge, got us up and running with propeR and supplied the necessary public transport timetables. We adapted their open source code to run better for our task, which was throwing a million home–work postcode pairs at it.

Takeaway 3: Keep your networks wide

Although the trip planner was great, it was very fussy about the starting point – it had to be on a road – but we were using the lat-lon of each postcode’s centroid, which isn’t necessarily on a road. I was a bit stuck on how to fix this, so I reached out to the government data science community over Slack and asked for advice.

Someone from Hackney Council suggested using the Open Source Routing Machine (OSRM), which worked.
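
I won’t swear this is exactly what we ran, but snapping a point to the road network with OSRM’s nearest service looks something like the sketch below. The server URL and coordinates are illustrative, and in practice this sort of thing runs as a batch job rather than in the browser.

function snapToRoad(lon, lat) {
  // OSRM's nearest service returns the closest point(s) on the road network
  var url = 'https://router.project-osrm.org/nearest/v1/driving/' +
            lon + ',' + lat + '?number=1';
  return fetch(url)
    .then(function(response) { return response.json(); })
    .then(function(json) {
      // waypoints[0].location is the snapped [lon, lat]
      return json.waypoints[0].location;
    });
}

// e.g. snap a postcode centroid and log the road-snapped coordinates
snapToRoad(-3.9876, 51.6214).then(function(loc) { console.log(loc); });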

Takeaway 4: Seek help when needed

It took until April to get to a point where I could start crunching the big datasets, and to speed things up I took over other people’s computers. During this time, Vahé also presented the gender differences in commuting distance at the ONS Economic Forum event to get some feedback.

Bringing it together

Once the dataset of travel times had been pulled together, it was fairly quick to rerun the analysis using time instead of distance. This made our story more credible and more relevant to people.

We had regular catch-ups every few weeks until the publication. The core digital content team was made up of Rachel, our designer, Phil the data journalist, and me from the data visualisation side. We were joined by Roger Smith and Tom Williams from Labour Market and Vahé from Methodology.

Takeaway 5: Bring together the right mix of skills

Takeaway 6: Regular communication is key; have face-to-face meetings when you need them

There was lots of crossover of people’s expertise. I remember Roger commenting on some draft graphs, which made them a lot better. There’s always a risk of egos getting involved, but it was really apparent that people were focused on creating something that really worked for the user.

Takeaway 7: Put respect, equality and a common goal at the core of the team

We had hoped to launch the project with a media partner but unfortunately that didn’t work out.

Takeaway 8: Sometimes things don’t go the way you plan and you have to make the most of it

Transitioning on D3 annotations

After reading Sarah Reed’s article on why D3 is so hard to learn from bl.ocks, I thought I’d better write a blog post explaining a recent bl.ock I made to solve a problem I encountered in my recent project about the commuting gap.

I wanted to create a dot histogram of the distribution of travel times and add a marker with an annotation for “you”. This is what it looked like on the page.

Animation of annotation transitioning position and label

For the annotation, I use the d3-annotation library. I’ve written a tutorial for d3-annotation, so I’d recommend starting there if you’re unfamiliar with it.

Here’s the example of transitions using d3-annotation I posted as a bl.ock.

Let’s break down the code bit by bit. First we create an svg on the page.

 var svg = d3.select("body").append("svg")
      .attr("width", 960)
      .attr("height", 500);

Next we set up a linear scale for the x position.

 var x = d3.scaleLinear()
    .domain([0, 400])
    .rangeRound([0, 960]);

We create the annotations array for d3-annotation.

var annotations = [{
        note: {
          label: "00:00",
          title: "You"
        },
        x: x(0),
        y: 500/2,
        dy: 20,
        dx: 20,
        subject: {
          radius: 10,
          radiusPadding: 5
        },
        type: d3.annotationCalloutCircle
      }];

And then call d3.annotation() to create the annotations in the svg.

      var makeAnnotations = d3.annotation()
        .annotations(annotations);

      svg.append('g')
        .attr('class', 'annotation-group')
        .call(makeAnnotations);

Next we use a d3 transition but with a custom tween function.

d3.select('.annotation-group')
      .transition()
      .duration(4000)
      .tween('updateAnno', function() {
        var xTrans = d3.interpolateNumber(0, 200);
        var timeTrans = d3.interpolateDate(
            new Date("January 01, 2019 00:00:00"),
            new Date(new Date("January 01, 2019 00:00:00").getTime() + 200 * 60000));
        return function(t) {
          annotations[0].x = x(xTrans(t));
          annotations[0].note.label = d3.timeFormat("%H:%M")(timeTrans(t));
          makeAnnotations.annotations(annotations);
          makeAnnotations.update();
        };
      });

We need to set up some interpolators. D3 has built-in interpolators for commonly used formats (numbers, dates); see the d3-interpolate documentation for more info. We are going to use the number interpolator for the x position, going from 0 to 200. Let’s call this xTrans.

We use the date interpolator for time, calling this timeTrans, and we’re going from 00:00 to 03:20. For this we have to start at an arbitrary date at midnight and then add 200 minutes to that time.

new Date("January 01, 2019 00:00:00").getTime() gets the time in milliseconds since the Unix epoch; to calculate how many more milliseconds we need, we multiply 200 minutes * 60 seconds * 1000 milliseconds. Finally we make this a JavaScript Date object with new Date().

For the annotation, we set the position using the x property of the annotations object, and we can now make it change with the ticker t. Every time the transition updates, it updates t, and anything that relies on t updates with it. We also use the scale to transform from the data range to the svg range, giving annotations[0].x = x(xTrans(t));.

We can also change the label of the annotation. We use d3.timeFormat to turn the JavaScript Date into something more human-readable; here we’re using hours and minutes, d3.timeFormat("%H:%M"). Again we use our interpolator and the ticker t. Putting this together we get annotations[0].note.label = d3.timeFormat("%H:%M")(timeTrans(t)).

We then update the annotations with the new values using makeAnnotations.annotations(annotations) and redraw them on the svg with makeAnnotations.update().

And that’s it. You can make it data-driven, which is what I did in the interactive, by using variables for the end points of the interpolators and the duration of the transition.
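
To give a flavour, here’s a sketch of that parameterisation. The function name is mine, and it reuses the x scale, annotations and makeAnnotations from the snippets above.

// wrap the tween in a function taking the end point (in minutes)
// and the duration of the transition
function animateAnnotation(endMinutes, durationMs) {
  var start = new Date("January 01, 2019 00:00:00");
  var xTrans = d3.interpolateNumber(0, endMinutes);
  var timeTrans = d3.interpolateDate(start,
      new Date(start.getTime() + endMinutes * 60000));

  d3.select('.annotation-group')
    .transition()
    .duration(durationMs)
    .tween('updateAnno', function() {
      return function(t) {
        annotations[0].x = x(xTrans(t));
        annotations[0].note.label = d3.timeFormat("%H:%M")(timeTrans(t));
        makeAnnotations.annotations(annotations);
        makeAnnotations.update();
      };
    });
}

// the same animation as above: 200 minutes over 4 seconds
animateAnnotation(200, 4000);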

Running a population model in the browser

Over a year ago, I started work on a project looking at our ageing population, and specifically one measure called the Old Age Dependency Ratio (OADR). This measure compares the number of people above the retirement age with the number of people of working age (16 to retirement age).
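
As a rough sketch of the calculation (the function shape and the per-1,000 convention here are my own illustration):

// people over the pension age per 1,000 people of working age,
// where pop[age] holds the number of people at each single year of age
function oadr(pop, pensionAge) {
  var over = 0, working = 0;
  for (var age = 0; age < pop.length; age++) {
    if (age >= pensionAge) over += pop[age];
    else if (age >= 16) working += pop[age];
  }
  return 1000 * over / working;
}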

Although this measure has its limitations – people are joining the workforce later because of education, and working past their retirement age – it is still useful for international comparisons and still gives an indication of the financial implications of our population structure, for pensions and health care for example.

We wanted to explain a bit more about which factors drive how the OADR changes in the future, and why some variables have more of an effect than others.

We wanted an easier way for people to understand demography and modelling future populations, without having to understand all the technical details and calculations.

We also wanted to dispel misconceptions people may have had about migration and the magnitude of the effect it would have on the population.

Finally, we wanted to use an interface that brought elements of gamification to engage people.

At the end of June, this project was finally published - How would you support our ageing population?

The Excel beginning

My starting point was an Excel model that colleagues in the office had built using the National Population Projections variants. This spreadsheet calculated the population given some numbers for fertility, migration and mortality.

By using different variants you could see what your choices did to the population. The spreadsheet gave you a lot of combinations to play about with, and there was a custom option where you could enter your own numbers, but you had to enter something for every age and every year. And the results it gave were a table of the population at each age over each year. This isn’t something that’s intuitive, easy to interpret, or able to run straight in the browser.

Building it for the browser

Once I got my head round the calculations in the Excel sheet (using a lot of Trace Dependents), I had to think about what people would change, trying to get this down to one input for each variable. I decided to fix a target amount 25 years down the line for the three inputs (mortality, migration and fertility), with some linear interpolation along the way. I then scaled the single-year-of-age numbers to match the interpolated target for each year. I did this in Excel and showed it to the business area for them to approve my thinking.
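
In code terms, the idea was roughly this (names and structure are illustrative – at this stage the real work was still in Excel):

// interpolate linearly from the current value to the user's target
// 25 years out, then scale each single year of age to match that
// year's interpolated target
function scaledProjection(agesByYear, currentValue, targetValue) {
  return agesByYear.map(function(ages, year) {
    var t = Math.min(year / 25, 1); // hit the target at year 25, then hold
    var yearTarget = currentValue + (targetValue - currentValue) * t;
    var factor = yearTarget / currentValue;
    return ages.map(function(n) { return n * factor; });
  });
}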

Mortality was the difficult one. There are cohort effects which travel through age groups. Also, a measure like life expectancy at birth, while familiar to people, is not sensitive to changes at the higher ages – and it’s these higher ages we are interested in. In the end we settled on displaying the life expectancies associated with five mortality variants, with the slider snapping to these values.

Once they were happy with the process, I recreated the calculations in the browser. In Excel it’s hard to see the order of things, as it all just appears to run at once, but in JavaScript it’s much more linear. It turns out demography is actually quite simple: a lot of adding and subtracting for people migrating in, being born and dying, and looping over the years into the future.
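
A stripped-down sketch of that loop, with illustrative names and simplified inputs (the production model handles the details differently):

// pop[a] is the population aged a; survival[a] is the proportion
// surviving from age a to a+1; migration[a] is net migrants at age a;
// fertility[a] is births per woman aged a
function projectOneYear(pop, survival, migration, fertility) {
  var next = new Array(pop.length).fill(0);

  // births: apply age-specific fertility to women (~half the population)
  for (var a = 0; a < pop.length; a++) {
    next[0] += pop[a] * 0.5 * (fertility[a] || 0);
  }

  // everyone else ages a year: subtract deaths, add net migration
  for (var a = 0; a + 1 < pop.length; a++) {
    next[a + 1] = pop[a] * survival[a] + (migration[a] || 0);
  }
  return next;
}

// loop over the years into the future
function project(pop, years, rates) {
  for (var y = 0; y < years; y++) {
    pop = projectOneYear(pop, rates.survival, rates.migration, rates.fertility);
  }
  return pop;
}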

The interface

We wanted a simple way to interact with the controls, and I found d3-simple-slider, a library for making sliders. Initially we had four sliders going horizontally, but this took up quite a lot of space, so we went for four vertical ones, which looked like a mixing desk.
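
For reference, a minimal vertical slider with d3-simple-slider looks something like this. The values and the rerunModel callback are illustrative; our published sliders were far more customised.

// a vertical slider for net migration (values are made up)
var migrationSlider = d3.sliderLeft()
  .min(0)
  .max(500000)
  .height(200)
  .default(250000)
  .on('onchange', function(value) {
    rerunModel({ migration: value }); // hypothetical: rerun the projection
  });

d3.select('#migration-slider')
  .append('svg')
  .attr('width', 100)
  .attr('height', 260)
  .append('g')
  .attr('transform', 'translate(60, 30)')
  .call(migrationSlider);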

Sketch of design

The sliders took a bit of extra styling and messing around with the library, so they ended up quite customised, but I’m pleased with the result. For mobile, we show one horizontal slider at a time and store the results in hidden inputs.

We used tooltips to add extra information where something might otherwise be misinterpreted.

Displaying the results

The population model reruns every time you change any of the sliders or inputs, and updates the results you see on the page. Keeping the results in view was important, so you could see your changes happening in front of your eyes.

Other examples we’d seen included an FT interactive where you had to decide how to spend the BBC licence fee across the stations and channels.

We also wanted to give feedback on how people’s choices compared with today’s numbers and what they would mean for them. This is done in a couple of ways: there’s a text box that spells out what the numbers are, and lines on the sliders indicating 2017 levels.

Technical achievements

It’s the first d3.v5 project we’ve done in the team. I decided to do this because v5 uses promises, so you have more control over when a code block executes. This avoided doing calculations before the numbers from the previous step had loaded.
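
For example, the v5 loaders return promises, so dependent steps can be sequenced explicitly (the file names and runModel function here are made up):

// the model only runs once all of its inputs have loaded
Promise.all([
  d3.csv('data/base-population.csv'),
  d3.json('data/variants.json')
]).then(function(results) {
  var population = results[0];
  var variants = results[1];
  runModel(population, variants);
});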

I’ve also learnt a lot more about how to use bootstrap-grid properly, used Bluebird.js for promise compatibility in IE, and Tippy.js for tooltips.

What people did

We used Google Tag Manager to find out how people were using the tool and what they were inputting.

We looked at the OADR in 2042 that people were getting from their chosen inputs. The biggest peak comes from loading the interactive and people making small changes. The periodic jumps come from people changing the pension year. What’s interesting to note is the secondary peak around 300: this is people trying to match the 2016 level of the OADR. We put feedback into the results box, and it’s clear this element of gamification is helping.

We can also see that people like round numbers, as well as the ends of the scales. Here’s what users selected for migration, with peaks on the hundred thousands.

And here’s fertility. The peak around 1.75 is from when the model loads. There are big peaks around the round numbers (1, 2, 3) and at the ends of the scales.

Feedback

We are still collecting feedback about the tool, but here are two choice picks showing that we are meeting the aims of the project.

During the project, we found out there was another project looking at an alternative measure to the OADR. We decided to align the projects, which took a bit of extra time and effort and changed the way the article turned out, as we referenced each other. We also felt it diminished the impact we could have had with the media.

Overall, I’m happy with how the project turned out. People are using the tool and taking away the messages we wanted them to leave with. Now that we have a working population model that projects into the future, we can use it for more projects. If you’ve got any good ideas, let me know.

What we learnt from making a StatsBot

At the end of March, the ONS published an article looking at the types of jobs at risk from automation. We thought it would be ironic to let people query the results through a chatbot interface.

Other journalism organisations are using chatbots to explain complex topics. They’ve said that although not everyone engages with the chatbots, the people who do engage deeply.

I have to say that Jure did almost all the work on this project. Our first challenge was finding a chatbot library that didn’t involve any backend. In the end, we found chat-bubble, which runs completely off JavaScript.

Design was also a challenge. With the BBC’s chatbots, it’s basically multiple-choice buttons, and you get different answers depending on which button you choose. The data we had was about the risk of automation by occupation, and also about who was doing each job – for example by gender, region and age.

We designed a few questions to try to figure out all these characteristics through elimination, e.g. “What decade were you born in?”, with the answers being 60s, 70s, 80s, 90s and 00s.
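
As a sketch, the decade question in chat-bubble’s conversation format looks something like this (the keys and copy are illustrative, not our actual script):

// chat-bubble starts from a required "ice" conversation object;
// each answer points at the key of the next conversation step
var chatWindow = new Bubbles(document.getElementById('chat'), 'chatWindow');

chatWindow.talk({
  ice: {
    says: ['What decade were you born in?'],
    reply: [
      { question: '60s', answer: 'born60s' },
      { question: '70s', answer: 'born70s' },
      { question: '80s', answer: 'born80s' },
      { question: '90s', answer: 'born90s' },
      { question: '00s', answer: 'born00s' }
    ]
  },
  born80s: {
    says: ['Thanks! Now, which region do you live in?']
    // ...and so on for the other branches
  }
});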

What was more difficult was trying to work out occupation. You could ask people to navigate down the Standard Occupational Classification (SOC) tree to find the job they wanted info about, but we knew this would be clumsy.

In the end, we decided to let users type in an occupation and let the ONS occupation coding tool match the input to a SOC code. This introduced a text box for people to use in the chat interface, which wasn’t ideal, as on mobile the keyboard takes up a lot of space. But we felt this was better than navigating the SOC hierarchy.

Mobile screenshot

The article got widely picked up, and the BBC in particular pushed a lot of traffic to our site.

Our standard timeframe for considering metrics is a week. After a week we had almost 14 thousand unique visits, 11 thousand of which happened during the first day.

A third of the people who visited the site wanted to know the risk of automation for an occupation; 91% of those entered an occupation.

The top 5 occupations entered were

  1. Accountant
  2. Teacher
  3. Software Engineer/Developer
  4. Solicitor
  5. Doctor

11% of people who visited chose to read about general facts. Information about location, age and gender was less popular still, at 4%, 3% and 2% respectively.

One thing we debated a lot was the speed of the chat bubbles. Should we make them quicker, like the BBC’s, or slower, so that people actually read them rather than skim?