15 Jan 2021
The small multiple line chart template I developed last year uses tidy data but we are used to seeing data in a wide format as it makes it quicker to grasp what’s going on.
We have been copying and pasting data around to rearrange it to fit the format needed for the template and a colleague asked whether I could make something to do this quickly.
And here it is, the tidy data maker.
I wanted to learn Svelte as part of building this tool, and I used their REPL to do it. Here is the Svelte code.
To publish the tool, I downloaded the zip file, unzipped it in a new repo and turned on GitHub Pages. The only thing to note is that Svelte points its links at the site root rather than the local folder, so the page only works once you’ve changed them to look like this.
<link rel='stylesheet' href='./global.css'>
<link rel='stylesheet' href='./build/bundle.css'>
<script defer src='./build/bundle.js'></script>
Since publishing this tool, many people have pointed out that this functionality exists in Excel as unpivot which is something I didn’t know.
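For anyone unfamiliar with the reshaping involved, here is a minimal sketch of going from wide to tidy in Python. It is not the Svelte code behind the tool; the function name `wide_to_tidy`, the output column names `series` and `value`, and the sample figures are all made up for illustration.

```python
import csv
import io


def wide_to_tidy(wide_csv, id_column, variable_name="series", value_name="value"):
    """Unpivot a wide CSV string into tidy rows: one observation per row."""
    reader = csv.DictReader(io.StringIO(wide_csv))
    tidy_rows = []
    for row in reader:
        for column, cell in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row instead
            tidy_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: cell}
            )
    return tidy_rows


# Made-up wide data: one row per date, one column per area.
wide = "date,England,Wales\n2020-01,10,4\n2020-02,12,5\n"

for tidy_row in wide_to_tidy(wide, id_column="date"):
    print(tidy_row)
```

Each wide row becomes one tidy row per value column, so the two-row, two-area example above produces four rows.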
10 May 2020
On Thursday 7 May, the ONS published analysis comparing deaths involving COVID-19 by ethnicity. There’s an excellent summary on Twitter but the headline is that, when taking into account age and other socio-demographic factors such as deprivation, household composition, education, health and disability, some ethnic groups have a higher risk of a COVID-related death than those of white ethnicity. The full article goes into more detail, including some of the caveats, e.g. the strengths and weaknesses of using ethnicity data from the 2011 Census and not being able to use occupation.
How we presented odds ratios
What I’d like to do is talk about the visuals we produced for the article.
First, what we are visualising is called an odds ratio. This definition comes from the ONS analysis.
An odds ratio is a measure of the relative risk of an outcome in one population compared with a different population, where odds ratios greater than one indicate the outcome is more likely while less than one is less likely.
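To make the arithmetic concrete, here is a small sketch of an unadjusted odds ratio worked out from raw counts. The counts are invented, and the published figures are adjusted for other factors, which this sketch does not attempt.

```python
def odds(events, non_events):
    """Odds of an outcome: how many had it for each one who did not."""
    return events / non_events


def odds_ratio(group_events, group_non_events, ref_events, ref_non_events):
    """Odds in one group divided by odds in the comparison group."""
    return odds(group_events, group_non_events) / odds(ref_events, ref_non_events)


# Invented counts, purely for illustration:
# 30 of 1,000 people in the group of interest had the outcome,
# against 20 of 1,000 in the comparison group.
example = odds_ratio(30, 970, 20, 980)
print(round(example, 2))  # about 1.52, so the outcome is more likely in the first group

# A group compared with itself always gives exactly 1.
print(odds_ratio(20, 980, 20, 980))
```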
It’s quite a hard concept to understand but we try our best to guide people with a number of features in the graphic.
1) We make it really clear that the odds ratio refers to a comparison group. This is in bold and with a thick line.
2) We tell you what it means to be on one side of the line, rather than leaving it to the numbers and to knowing that a number higher than one means more likely.
3) We’ve plotted the odds ratio as a dot because we want its position relative to the comparison group to be noted. We’ve also plotted the confidence intervals as lines sticking out of the dot. These are again a statistical concept and may be confusing for people, but they are useful for interpreting the chart. If the confidence interval overlaps the line for the comparison group, it means we can’t be sure from the current data that the increase or decrease in risk is real. It may be there but we can’t tell at the moment. To be honest, we didn’t really explain this in the notes under the chart so there’s room for improvement.
4) We’ve changed the scales to factors of likelihood. We could have plotted the raw numbers but using phrases makes the chart more understandable.
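The reading rule in point 3 can be written down as a short check. This is only a sketch of how I read these charts, not code from the article; the function names and the example values are hypothetical.

```python
def interval_includes_reference(lower, upper, reference=1.0):
    """True when the confidence interval crosses the comparison-group line."""
    return lower <= reference <= upper


def describe(odds_ratio, lower, upper):
    """Turn an odds ratio and its confidence interval into a plain-English reading."""
    if interval_includes_reference(lower, upper):
        return "can't tell: the interval overlaps the comparison group"
    if odds_ratio > 1:
        return "more likely than the comparison group"
    return "less likely than the comparison group"


# Hypothetical values, not taken from the ONS analysis.
print(describe(1.9, 1.6, 2.2))   # interval sits entirely above 1
print(describe(1.2, 0.8, 1.7))   # interval crosses 1
print(describe(0.8, 0.6, 0.95))  # interval sits entirely below 1
```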
What others did
This story was covered in other publications including the Guardian, BBC, FT(£) and Daily Mail. I’ve picked these out in particular because they’ve chosen to rechart the data in their own style. The Daily Mail is an interesting example as they redid the numbers in their own style and also included a screenshot of our charts.
The BBC looked at the different models, showing how the odds ratio changed as we included different factors. This showed how much each factor contributed to reducing the risk. These charts are OK, but the one thing I would pick up on is that they don’t plot the bars relative to the comparison group. It’s not stated but the bars appear to start from 0. An odds ratio of 0 means that the event is impossible.
This also causes an issue when they chart the risks for women.
The Chinese ethnic group should actually be plotted as a bar between 0.8 and 1.
The Guardian do something similar, plotting bars from 0. They actually make a mistake when labelling the axis. They are plotting the odds ratio but label it as times more likely to die from COVID-19 compared to white. This would make an odds ratio of 1, which is equal likelihood to the comparison group, “one times more likely”. You could use this axis but you’d have to take one off every value, and you might struggle with odds ratios below 1.
Their second chart looks to display the different models on the same chart and allow a comparison between the two. We were thinking of doing something similar but it was felt that the confidence intervals were important to show here, which would make it quite messy. Without the confidence intervals you can’t tell if the difference between the models is significant, which in 3 of the cases it is not. This means that the difference between the models might be zero. It also doesn’t show the confidence intervals in relation to 1. The axis is also unclear as it doesn’t say that a likelihood of 1 is equal to the comparison group.
The FT also look to show the difference between the models. It’s not stated but this looks like it is for males. They have plotted bars from zero and use likelihood of dying without explaining that 1 is equal likelihood to the comparison group.
With all the bar charts, I feel plotting from zero is somewhat misleading. Because a bar at 1 still has length, you might think it represents an increase in likelihood, when in fact it means equal likelihood to the comparison group.
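As a rough sketch of the alternative, a bar can be drawn from the comparison group’s value of 1 rather than from 0, so that an odds ratio of 1 has no length at all. The function names are made up for illustration, and `times_more_likely` is simply the “take one off every value” adjustment mentioned above.

```python
def bar_span(odds_ratio, baseline=1.0):
    """Return (start, end) for a bar drawn from the baseline to the odds ratio."""
    return (min(baseline, odds_ratio), max(baseline, odds_ratio))


def times_more_likely(odds_ratio):
    """Value for an axis labelled 'times more likely': the odds ratio minus one."""
    return odds_ratio - 1


print(bar_span(0.8))  # a bar between 0.8 and 1
print(bar_span(1.0))  # zero length: equal likelihood to the comparison group
print(bar_span(1.9))  # a bar between 1 and 1.9

# Below 1 the "times more likely" value goes negative, which is where
# that axis label becomes awkward.
print(round(times_more_likely(0.8), 2))
```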
I don’t write to criticise the people behind the graphs as I know they are doing important jobs to get information out to a wider audience under time pressure. I also know that there is a lot of explanation that goes on around the chart either in the article or talked about with correspondents which helps explain the complicated concept of odds ratios.
I hope that explaining the thinking behind the designs of our charts, and why we think those choices make them clearer, will help others making odds ratio charts in the future.
25 Sep 2019
Commuting: most people do it, most people hate it. We know men do more of the longer commutes, and our newest analysis published at the beginning of September showed that women are more likely to leave their job over a long commute.
Our article was picked up by the Telegraph, Reuters and the Evening Standard. Reporters from the BBC, analysts from the IFS and Ipsos MORI, and other statisticians commented on social media about our findings. Most notably, the then Minister for Women and Equalities, Amber Rudd, said:
“Women across the country struggle to find a balance between being a parent and their job.
“These statistics show how women are likely sacrificing a larger paypacket, and career growth, because they are doing the bulk of childcare and unpaid work – like taking care of elderly relatives and their home.”
It felt like the success was down to this being our most collaborative project yet.
In January, I was trying to get an idea off the ground around who does the same commute as you. As a member of the Digital Content team, I’m responsible for communicating our data to a broad audience of inquiring citizens. Our users are interested in how they fit into our statistics, hence I thought we could show how their commute compared with others like them.
I started by emailing some people in our policy team, which led to another meeting with our Labour Market team. The meeting notes got circulated round to even more people.
Separately, Vahé in our Methodology division had been working on some analysis of gender differences in commute time, having found a gap in the literature regarding contributors to the gender pay gap. His analysis was aimed at policymakers and expert users, but the underlying data with people’s places of residence and work had links back to my idea.
Fortunately, Tom Williams from the Labour Market team, who had been involved in our initial meeting, was aware of both strands of work. We had worked with Tom before on the previous analysis on commuting.
Takeaway 1: Putting the right people together takes multiple attempts
We explored a few different avenues of how else we could pull in additional information. I asked people who had worked on admin data whether we could get information about parents but this wasn’t feasible. But we had seen the Data Science Campus talk about another project using a trip planner to calculate travel time.
Takeaway 2: Explore possible avenues early but close down ones that don’t look fruitful
Calculating commuting time
The Data Science Campus project, access to services, looked at how people could travel to different services, for example GPs, and whether there were black spots. For this project, they had created a tool called propeR which runs analysis on batches of locations.
I thought we could use this tool to calculate the commute times instead of using commute distance. We needed a bit of help and the people at the Data Science Campus, especially Michael Hodge, got us up and running with propeR and supplied the necessary public transport timetables. We adapted their open source code to run better for our task, which was throwing a million home-work postcode pairs at it.
Takeaway 3: Keep your networks wide
Although the trip planner was great, it was very fussy about the starting point – it had to be on a road – but we were using the lat-lon of the centroid of a postcode which isn’t necessarily on a road. I was a bit stuck on how to fix this, so I reached out to the government data science community over Slack and asked for some advice.
Someone from Hackney Council suggested using the Open Source Routing Machine (OSRM), which worked.
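I haven’t recorded exactly how we called it, but one way to do this kind of snapping is with OSRM’s `nearest` service, which returns the closest point on the road network to a given coordinate. The sketch below assumes an OSRM server running locally at `http://localhost:5000`; that address, the `driving` profile and the function names are all assumptions for illustration.

```python
import json
from urllib.request import urlopen

OSRM_SERVER = "http://localhost:5000"  # assumption: a locally hosted OSRM instance


def nearest_url(lon, lat, profile="driving", server=OSRM_SERVER):
    """Build a request URL for OSRM's nearest service (longitude comes first)."""
    return f"{server}/nearest/v1/{profile}/{lon},{lat}?number=1"


def snapped_point(response):
    """Pull the snapped (lon, lat) out of a decoded nearest-service response."""
    if response.get("code") != "Ok" or not response.get("waypoints"):
        raise ValueError("OSRM did not return a snapped point")
    lon, lat = response["waypoints"][0]["location"]
    return lon, lat


def snap_to_road(lon, lat):
    """Ask the server for the nearest point on the road network to a centroid."""
    with urlopen(nearest_url(lon, lat)) as reply:
        return snapped_point(json.load(reply))
```

With a server running, `snap_to_road(lon, lat)` could be called on each postcode centroid before passing the result to the trip planner.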
Takeaway 4: Seek help when needed
It took until April to get to a point where I could start crunching the big datasets, and to speed things up I took over other people’s computers. During this time, Vahé also presented the gender difference of commuting distance at the ONS Economic Forum event to get some feedback.
Bringing it together
Once the dataset of travel times had been pulled together, it was fairly quick to rerun the analysis using time instead of distance. This helped with making our story more credible and more relevant to people.
We had regular catch-ups every few weeks until the publication. The core digital content team was made up of Rachel, our designer, Phil the data journalist, and me from the data visualisation side. We were joined by Roger Smith and Tom Williams from Labour Market and Vahé from Methodology.
Takeaway 5: Bring together the right mix of skills
Takeaway 6: Regular communication is key; have face-to-face meetings when you need them
There was lots of crossover of people’s expertise. I remember Roger commenting on some draft graphs, which led to them being much better. There’s a risk of egos becoming involved, but it was really apparent that people were focused on creating something that really worked for the user.
Takeaway 7: Put respect, equality and a common goal at the core of the team
We had hoped to launch the project with a media partner but unfortunately that didn’t work out.
Takeaway 8: Sometimes things don’t go the way you plan and you have to make the most of it