Using GitHub Flat to automate making geoportal boundaries as topojson

Just a really quick note about a little resource I created to help with maps in the data visualisation team. The boundaries for loads of different areas are held on the ONS geoportal. To turn them into files that we can use for our map templates, we had to go through a few steps: renaming fields, dropping unnecessary fields to reduce the file size, and converting the result to the topojson file format.

I have a bash script that I used to run on the zip files I downloaded, and that was really quick. But then I saw GitHub Flat, which can automate tasks, and thought it would be great for regularly grabbing boundaries from the geoportal and converting them ready for our map templates.

The end result is the repository geoportal boundaries as topojson, which is an always up-to-date place to look for boundary files for our map templates.

Here’s an explanation of what it’s doing.

Step 1: Set up GitHub Flat

A flat.yml file instructs GitHub to run a workflow that grabs a .json file from the geoportal daily and saves it to the repo. This file lists all the boundaries available on the geoportal.

Step 2: Set up python for postprocessing

The workflow says to run a postprocess typescript file after the file has been downloaded. However, I’m not as familiar with how to process stuff in typescript, so I’m following an example where the typescript file loads up a python environment to do some analysis.

Step 3: Make the topojsons

The python script loops through all the areas available and ignores anything that’s not a boundary file, for example a lookup between different areas. It then strips out any unnecessary fields, renames the relevant fields and converts the result to topojson for our map templates. And that’s pretty much it: set it up once and let it run forever, hopefully.
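The actual script is written in python and isn't reproduced here. As a rough illustration of the "strip and rename fields" step only, here is a minimal JavaScript sketch. The field names (LAD21CD, AREACD and so on), the sample feature and the slimFeatures helper are all made up for the example, and the conversion to topojson is left out.

```javascript
// Hypothetical mapping from geoportal field names to the names a map template expects.
// These names are assumptions for illustration, not taken from the real script.
const RENAME = { LAD21CD: 'AREACD', LAD21NM: 'AREANM' };

// Return a copy of a GeoJSON FeatureCollection that keeps only the renamed fields.
function slimFeatures(collection, rename) {
  return {
    type: 'FeatureCollection',
    features: collection.features.map(function (feature) {
      const properties = {};
      Object.keys(rename).forEach(function (oldName) {
        if (oldName in feature.properties) {
          properties[rename[oldName]] = feature.properties[oldName];
        }
      });
      return { type: 'Feature', geometry: feature.geometry, properties: properties };
    })
  };
}

// Made-up sample feature with some fields we don't need.
const example = {
  type: 'FeatureCollection',
  features: [{
    type: 'Feature',
    geometry: { type: 'Point', coordinates: [0, 0] },
    properties: { OBJECTID: 1, LAD21CD: 'X00000001', LAD21NM: 'Exampleton', Shape__Area: 123.4 }
  }]
};

const slim = slimFeatures(example, RENAME);
console.log(slim.features[0].properties);
```

Dropping fields like this is where most of the file size saving on attributes comes from; the geometry is untouched at this stage.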

Collaborate idea cards

Collaborate cards gif

I’ve previously written about the digital content team but a lot has changed since then. The data visualisation team is now 17 people strong and we have 9 data journalists, about to be 10. We’ve split up our work into teams based around four functions rather than discipline, with the intention of allowing people to focus on particular aspects of work. The four functions are:

  1. Publish
  2. Improve
  3. Collaborate
  4. Innovate

I am co-leading the collaborate function and this takes over the old digital content team. Our purpose is to produce content specifically focussed on the citizen audience.

One task we often do is think about ideas for analysis. Often this can start from an interesting anecdote, and a conversation follows about what the ONS can say about it with statistics, or it can be a topic which we feel is newsworthy, relevant and timely. Sometimes these topics can come from the organisation itself, as it’s trying to bring insight about big issues that are affecting everyone, and it’s a case of breaking them down into smaller questions that we can go off and research.

There have been other great resources out there for assessing ideas. NZZ wrote about their scoring system and The Pudding wrote about their pitching process. But this wasn’t quite what we were looking for help with. It was more about the creative process of generating questions that would be lines of enquiry about a topic.

I thought about horizon scanning and all the futures work (driver mapping, scenarios, back-casting etc) but that’s more about thinking about a future we want and then planning the policy interventions to get us there. Other people have done horizon scanning already and written it up into trends and drivers (climate change, automation, ageing population, greater interconnectivity, shifting global economies) so it doesn’t really make sense to do it all again. Also, the ONS doesn’t really deal with policy, so although we could have a role in tracking the impact of interventions or how trends are showing up in our data, it wasn’t quite what we were looking for either.

After a long think I decided to try using cards to help with ideation to apply some constraints to encourage creativity. Some examples in this area include Brian Eno’s oblique strategies and from the futures world there is The Thing from the Future, or Forks in the Road where you are given a scenario and you have to imagine the impact on a thing or the world.

On the journalism side, I found journalism cards, and in the explainer post about the project, the creator looks at core aspects of journalism. The tool allows you to select which aspects you want to incorporate and then shuffle through them. Some aspects aren’t relevant to us (we don’t have to sell newspapers), but I feel this is the closest thing to what we needed so I thought I would adapt it.

I added a few more audiences to be more ONS specific (for example young adults or disabled), ONS cross cutting lenses (inequality by location, inequality between people e.g. gender or age, pandemic response, Brexit, access e.g. travel time, deprivation), ONS taxonomy/datasets (economy, business, trade, wellbeing, labour market, personal finances, housing, population, health, education, death), geographical level (national, subnational, neighbourhood) or timeline (in a few months, in a year, in 5 years, in 10 years, in 20 years, in 50 years).

I’ve put the tool up on GitHub, so you can try it out

I think we could use these as paths through topics. For example, if we take social care, we could spin the deck and come up with questions like: Who is a carer? Where is the future workforce for carers going to come from? When will we reach peak caring need? How much of our caring workforce is from abroad? What’s the average pay of a carer, is this too little, and how does this compare to other countries? How many people work full time and also have caring responsibilities, how many are part time, and did they go part time to help with care?
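Spinning the deck boils down to drawing one card from each category. The sketch below is not the code behind the tool, just a minimal illustration of the idea, using a handful of the categories listed earlier. The drawHand function and the deck names are my own labels for the example.

```javascript
// A few decks taken from the categories listed above (shortened for illustration).
const decks = {
  audience: ['young adults', 'disabled'],
  lens: ['inequality by location', 'pandemic response', 'Brexit'],
  geography: ['national', 'subnational', 'neighbourhood'],
  timeline: ['in a few months', 'in a year', 'in 5 years']
};

// Draw one card from every deck.
// `random` can be swapped out so that a draw is repeatable, e.g. in tests.
function drawHand(decks, random) {
  const rand = random || Math.random;
  const hand = {};
  Object.keys(decks).forEach(function (name) {
    const cards = decks[name];
    hand[name] = cards[Math.floor(rand() * cards.length)];
  });
  return hand;
}

console.log(drawHand(decks));
```

Each hand is a set of constraints (an audience, a lens, a geography and a timeline) to apply to whatever topic is on the table.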

Using nested data to create small multiples

In my previous post I talked about strategies for dealing with lots of data. Small multiples is a particularly powerful approach as it gives you a lot of flexibility: you can make a chart for each category you have. It’s also useful in situations like comparisons over time or comparisons against a national trend.

This post is a technical explanation about how to create small multiples using d3js. This is the approach that I use and it isn’t the only one. There are also different ways to create small multiples using different libraries (e.g. with Plot).

One big SVG or multiple SVGs?

In previous small multiple charts we made, we used one SVG and split it up into different parts. This made it trickier to set the external margin for the whole SVG and the internal margins for the individual charts. Now I prefer making multiple SVGs and getting them to line up with each other with CSS. I prefer it this way because it makes responsiveness slightly easier, as the SVGs just flow naturally. This does have disadvantages, for example annotations are harder because they have to be able to fit on one small chart.

Nesting data

To be able to take advantage of d3js’s looping features, we need to rearrange our data to fit the format we want. Specifically, we need to create some hierarchy in the data.

Let’s work through the example of figure 3 in the Deaths registered weekly bulletin which looks at the number of excess deaths by setting.

Number of excess deaths by place of occurrence, England and Wales, registered between 7 March 2020 and 27 August 2021

The data behind this chart would normally look something like this

| Date | Home | Hospital | Care home | Other |
|---|---|---|---|---|
| Week 11 | 145 | -303 | -60 | 30 |
| Week 12 | 286 | -270 | 6 | 53 |
| Week 13 | 427 | 341 | 264 | -22 |

Although this table is easy for humans to read and is a small file size, it’s not quite the right format for creating nested data. We need something a bit more like this

| Date | Setting | Value |
|---|---|---|
| Week 11 | Home | 145 |
| Week 11 | Hospital | -303 |
| Week 11 | Care home | -60 |
| Week 11 | Other | 30 |

This format is known as long format or tidy format, where each data point has a single row and each attribute of that data point is in its own column. This format also makes it easier when there are more attributes. For example, if we had these excess deaths split by sex, in the wide format we would have to either add extra columns (Home Male, Home Female) or duplicate tables with either merged cells or different tabs to differentiate between the two. In the tidy data format, we just add another column for sex.
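If your data arrives in the wide format, the reshaping can be done in a few lines of JavaScript. This is a minimal sketch rather than the method used for the bulletin; the toLong helper and its parameter names are assumptions made up for the example, and the numbers are copied from the first two rows of the wide table.

```javascript
// Rows as they would arrive from a wide .csv.
const wide = [
  { Date: 'Week 11', Home: 145, Hospital: -303, 'Care home': -60, Other: 30 },
  { Date: 'Week 12', Home: 286, Hospital: -270, 'Care home': 6, Other: 53 }
];

// Turn every column except the id column into its own row.
function toLong(rows, idColumn, nameColumn, valueColumn) {
  const long = [];
  rows.forEach(function (row) {
    Object.keys(row).forEach(function (column) {
      if (column === idColumn) return; // the id is repeated on every output row instead
      const out = {};
      out[idColumn] = row[idColumn];
      out[nameColumn] = column;
      out[valueColumn] = row[column];
      long.push(out);
    });
  });
  return long;
}

const graphic_data = toLong(wide, 'Date', 'Setting', 'Value');
console.log(graphic_data.length); // 8 rows: 2 weeks x 4 settings
console.log(graphic_data[1]);     // { Date: 'Week 11', Setting: 'Hospital', Value: -303 }
```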

Once we have data in this tidy or long format, we can nest it with d3.nest. With our example the code would look like

// d3.nest is part of D3 v5 and earlier
const nested = d3.nest()
  .key(function(d) { return d.Setting; }) // group the rows by their Setting column
  .entries(graphic_data);

where graphic_data would be the data we’ve read in from the .csv.

Once we look at nested we find it’s been transformed into an array of 4 objects, one for each setting. Inside each object, there are the properties key and values. Inside values is a subset of the rows of the table that relate to the setting.
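To make that structure concrete, here is a plain JavaScript sketch that produces the same array of key and values objects without d3. It may also be handy on D3 v6 and later, where d3.nest was removed in favour of d3.group and d3.groups, which return a different shape. The nestBy name is made up for this example.

```javascript
// Group rows by one property, returning [{ key, values }] in order of first appearance.
function nestBy(rows, keyName) {
  const groups = new Map();
  rows.forEach(function (row) {
    const key = String(row[keyName]); // keys are stored as strings, as d3.nest does
    if (!groups.has(key)) groups.set(key, { key: key, values: [] });
    groups.get(key).values.push(row);
  });
  return Array.from(groups.values());
}

// Tidy rows for two weeks, using the numbers from the tables above.
const rows = [
  { Date: 'Week 11', Setting: 'Home', Value: 145 },
  { Date: 'Week 11', Setting: 'Hospital', Value: -303 },
  { Date: 'Week 11', Setting: 'Care home', Value: -60 },
  { Date: 'Week 11', Setting: 'Other', Value: 30 },
  { Date: 'Week 12', Setting: 'Home', Value: 286 },
  { Date: 'Week 12', Setting: 'Hospital', Value: -270 },
  { Date: 'Week 12', Setting: 'Care home', Value: 6 },
  { Date: 'Week 12', Setting: 'Other', Value: 53 }
];

const nestedRows = nestBy(rows, 'Setting');
console.log(nestedRows.map(function (d) { return d.key; }));
console.log(nestedRows[0].values.length); // 2 rows for Home, one per week
```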

Gif showing nested data

Appending elements from nested data

Now we can use d3js to create our charts. We can use the standard .data().enter().append() pattern to first create our SVGs. Remember .data takes an array, .enter() is a selection for each new thing it needs to create that doesn’t already exist. Here we have 4 elements in the nested array, so it will create 4 SVGs.

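const svg = d3.select('#graphic')
  .selectAll('svg.chart')
  .data(nested) // one element of the nested array per setting
  .enter()
  .append('svg')
  .attr('class', 'chart'); // matches the 'svg.chart' selector above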

The size of the SVGs is determined by the width of the page and how many charts you want on each row. I’ve omitted all the stuff to do with margins for brevity but what’s important to know is that all the charts have the same margin. The margin on the left of each chart is big enough for the ticks but also acts as a bit of breathing space between the charts.
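As a rough sketch of the sizing arithmetic I left out: divide the page width by the number of charts per row, then subtract the shared margin to get the plotting area. The chartSize function and all the numbers here are hypothetical values for illustration, not the ones used in the template.

```javascript
// Hypothetical settings: charts per row at each screen size, and one margin shared by every chart.
const chart_every = { sm: 1, md: 2, lg: 2 };
const margin = { top: 20, right: 10, bottom: 30, left: 50 };

// Work out the outer SVG size and the inner plotting area for a single chart.
function chartSize(pageWidth, chartsPerRow, height, margin) {
  const outerWidth = Math.floor(pageWidth / chartsPerRow);
  return {
    outerWidth: outerWidth,
    outerHeight: height,
    innerWidth: outerWidth - margin.left - margin.right,
    innerHeight: height - margin.top - margin.bottom
  };
}

console.log(chartSize(700, chart_every.lg, 300, margin)); // two charts side by side
console.log(chartSize(320, chart_every.sm, 300, margin)); // one chart per row
```

Because every chart uses the same margin object, the plotting areas stay the same size and the scales can be shared across all the SVGs.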

When we want to create the bars for the chart, we can reference the data attached to each SVG. We know the data for the bars is inside the values property of each element of the nested array, so we bind that data to our rects.

const chart = svg.selectAll('rect.bars')
  .data(function(d) { return d.values; }) // here d is an element of the nested array, so this binds the array held in its values property
  .enter()
  .append('rect')
  .attr('class', 'bars'); // matches the 'rect.bars' selector above

You can now do data driven styling for your rects as each rectangle has the data bound to it.

When to call the axis generator

Here is the chart on mobile on the left and on desktop on the right.

Comparison of the chart on mobile compared to desktop

On mobile there is one chart per row and on desktop there are two charts per row. With exactly the same axes, you can easily compare the magnitude of different numbers. But since the axes are exactly the same, we don’t need to repeat them on every chart: we just want the chart on the left side of each row to have the axis ticks. On mobile there’s only one chart in each row, so you do need to have ticks on every chart.

To generate ticks only on the left axis, we can use d3’s each. This allows us to do something for each element individually. Here we are going through all the SVGs, using each to apply a function. The function looks at the index of each selected element, and if the remainder after dividing the index by the number of charts per row is zero, then we know it’s the left chart and we can call the axis generator with ticks. Otherwise we call the axis generator without ticks.

svg.append('g')
.attr('class', 'y axis')
.each(function(d, i) {
  if (i % config.chart_every[size] == 0) {
    d3.select(this)
      .call(yAxis.tickFormat(d3.format(",.0f")))
  }else{
    d3.select(this)
      .call(yAxis.tickFormat(""))
  }
})

You can also do something similar to just add the x axis label on the chart on the right.
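The row position checks can be pulled out into two small helpers. This is a sketch under my own naming (isFirstInRow and isLastInRow are not part of d3 or the template). Treating the very last chart as a row end is an assumption that covers rows that aren't full.

```javascript
// True when chart i starts a row, so it should carry the y axis tick labels.
function isFirstInRow(i, chartsPerRow) {
  return i % chartsPerRow === 0;
}

// True when chart i ends a row (or is the final chart), so it could carry the x axis label.
function isLastInRow(i, chartsPerRow, totalCharts) {
  return i % chartsPerRow === chartsPerRow - 1 || i === totalCharts - 1;
}

// Four charts, two per row, as on desktop.
[0, 1, 2, 3].forEach(function (i) {
  console.log(i, isFirstInRow(i, 2), isLastInRow(i, 2, 4));
});
```

Inside .each(function(d, i) { ... }) you would pass i, the number of charts per row for the current screen size, and the length of the nested array. With one chart per row, both helpers return true for every chart, which matches the mobile layout.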

Adapting this approach

If you’ve got a categorical y-axis with long labels, it’s going to make the left margin very big, which would look very clumsy if you tried to repeat it for the other charts. To get round this, I put the left axis into a separate SVG and then used the nested data approach as above to create the small multiples for the final chart.

Custom left axis small multiple

In the above chart, the data is nested on two levels, firstly by sex, and then by how the model adjusted for various factors. When binding the data you can create an svg for each sex, then a g for each model, and inside that you’ll have the data for each ethnicity.

conceptual model for nesting two levels
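With d3.nest you get two levels by chaining a second .key() call. To show the shape that comes out, here is a plain JavaScript sketch. The nestByKeys function, the column names and the placeholder rows are all invented for the example and are not the data behind the chart.

```javascript
// Nest rows by a list of keys, giving [{ key, values }] at every level.
// At the deepest level, values holds the original rows.
function nestByKeys(rows, keyNames) {
  if (keyNames.length === 0) return rows;
  const groups = new Map();
  rows.forEach(function (row) {
    const key = String(row[keyNames[0]]);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row);
  });
  return Array.from(groups, function (entry) {
    return { key: entry[0], values: nestByKeys(entry[1], keyNames.slice(1)) };
  });
}

// Placeholder rows: 2 sexes x 2 models x 2 ethnic groups, with dummy values.
const modelRows = [];
['Male', 'Female'].forEach(function (sex) {
  ['Model 1', 'Model 2'].forEach(function (model) {
    ['Group A', 'Group B'].forEach(function (ethnicity) {
      modelRows.push({ Sex: sex, Model: model, Ethnicity: ethnicity, Value: 1 });
    });
  });
});

const twoLevel = nestByKeys(modelRows, ['Sex', 'Model']);
console.log(twoLevel.length);                     // one entry per sex, bound to an svg
console.log(twoLevel[0].values.length);           // one entry per model, bound to a g
console.log(twoLevel[0].values[0].values.length); // one row per ethnicity, bound to the marks
```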

Hopefully this helps you get your head round using nested data to create small multiple charts. For more information about binding data and creating things based on data I highly recommend Peter Cook’s guide to data joins and the enter exit update pattern.

Alternatives to plotting lots of data on a page

Imagine you’ve got some data with a lot of series and data points over time. You’ve tried a line chart but there is just too much data. You want it to look good but what are your options for visualising it?

Option 1: Just plot the series that matter

It’s more work, but reducing the number of series you plot leaves you with just the most important data and a cleaner chart. This makes it easier to understand what’s going on at a glance. The hard part comes in choosing which series to plot.

From this

to this

Option 2: Use a heatmap

This will depend on your dataset and how many time series you have, but one option for plotting data with a time element is a heatmap, with time running left to right as this is the convention (in some societies). This is useful for showing general patterns, as you won’t be able to show exact values with a colour scale.

Here are some examples

Option 3: Small multiples

If what’s interesting is happening within individual series rather than between series, another option is to plot each series on its own mini chart, giving you small but multiple versions of the chart, one for each series. This avoids relying on colour alone to distinguish each series, which would be hard if all the series were on the same chart. What you do lose out on is resolution, so this solution is better when you want to look at overall patterns rather than the detail. You’ll also want to keep the axes as similar as possible, as you’ll naturally be comparing across charts, and different scales make it easier to misread the chart.

Option 4: Multiline

You could plot all the series on one chart, often in a neutral colour and use some way to highlight with a stronger colour a particular series of interest, for example through a dropdown

or hovering with the mouse

This puts the burden on the user to interact with the page to see what they are interested in or what is relevant to them. It also makes comparisons between series more difficult as you may have to select and remember.

All these solutions have positives and negatives to them and it comes down to what you are trying to say with your chart.

Tidydata Maker

The small multiple line chart template I developed last year uses tidy data but we are used to seeing data in a wide format as it makes it quicker to grasp what’s going on.

We have been copying and pasting data around to rearrange it to fit the format needed for the template and a colleague asked whether I could make something to do this quickly.

And here it is, the tidy data maker.

tidydatamaker
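The core of the conversion is small. The sketch below is not the svelte code, just a minimal JavaScript illustration of the same idea, assuming the data is pasted as tab-separated text with the first column as the identifier. The tidyFromPaste name is made up, and there is no handling of commas or quotes inside cells.

```javascript
// Convert pasted wide, tab-separated text into tidy comma-separated text.
function tidyFromPaste(text, nameColumn, valueColumn) {
  const lines = text.trim().split(/\r?\n/).map(function (line) {
    return line.split('\t');
  });
  const header = lines[0];
  const out = [[header[0], nameColumn, valueColumn].join(',')];
  lines.slice(1).forEach(function (cells) {
    // One output row for every value column in this input row.
    for (let c = 1; c < header.length; c++) {
      out.push([cells[0], header[c], cells[c]].join(','));
    }
  });
  return out.join('\n');
}

const pasted = 'Date\tHome\tHospital\nWeek 11\t145\t-303\nWeek 12\t286\t-270';
console.log(tidyFromPaste(pasted, 'Setting', 'Value'));
```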

I wanted to learn svelte as part of building this tool and use their REPL tool. Here is the svelte code.

To publish the tool, I downloaded the zip file, unzipped it in a new repo and turned on GitHub Pages. The only thing to note is that svelte puts links to the root rather than to local paths, so once you’ve changed them to look like this, it works.

<link rel='stylesheet' href='./global.css'>
<link rel='stylesheet' href='./build/bundle.css'>
<script defer src='./build/bundle.js'></script>

Since publishing this tool, many people have pointed out that this functionality exists in Excel as unpivot which is something I didn’t know.