So a critical discussion of Sankey plots floated across my feed on Bluesky recently, and one reply included an ugly example with the comment “Anybody who thought that this illustration enhanced clarity lives in an alternative reality”. I’ve included the actual chart at the bottom of this post. I agree it’s pretty unhelpful, but I thought I saw potential, and said so. This blog post is me seeing whether something can in fact be done with it.
The post that started the discussion was Emily Moin saying “Sankey diagrams are just as bad as pie charts, and also worse because conventional wisdom has not rejected them yet so I still have to look at them”. As it happens, I think that even pie charts have their (very limited) place so long as they are carefully chosen (not when there are too many categories, for example) and properly polished for the audience. So I guess I am being consistent in thinking that Sankey charts can also be useful.
Here’s what I think is wrong with the original graphic (reproduced later in this post), which appears to be tracking the progression of patients experiencing symptoms of varying severity over a period of six weeks:
You’ll need to go down a bit in the post to see that original graphic; the combined effect of these problems is indeed one of complexity and clutter. I think the last point, the severity nodes swapping places vertically, is the most important.
I had a go at improving this and came up with a couple of alternatives, using David Sjoberg’s ggsankey R package. Here’s a Sankey plot version:
… and here’s an alluvial plot version. Alluvial plots are similar to Sankey plots but have no spaces between the nodes, which means that in this case you can read the nodes at each week vertically, much like a stacked bar chart:
Here’s the original graphic for comparison:
I’m pretty confident that either the Sankey or the alluvial plot is a definite improvement. Both give a better sense of the average severity in each week and of the overall trend (towards more blue, low severity cases), while still showing people moving in multiple directions (sometimes upwards) from each severity-week combination. So I think I’ve addressed the main points listed above.
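As an aside, if you just want the mechanics of drawing these charts without the data wrangling in the code further down, here’s a minimal sketch using ggsankey’s make_long() helper on a built-in dataset (mtcars is just a stand-in here, not the patient data). The point is that swapping between the Sankey and alluvial versions is only a matter of changing which geom and theme you use:
# minimal sketch of the ggsankey workflow on a stand-in dataset. make_long()
# reshapes the data into the x / next_x / node / next_node format the geoms
# expect, with one row per unit of flow:
library(ggplot2)
library(ggsankey)

toy <- mtcars |>
  make_long(cyl, vs, am, gear)

ggplot(toy, aes(x = x, next_x = next_x, node = node,
                next_node = next_node, fill = factor(node))) +
  geom_sankey(alpha = 0.8) +   # or geom_alluvial() for the no-gaps version
  theme_sankey() +
  theme(legend.position = "none")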
I’ve left all the code to the end because most of it was about putting together by hand a dataset that resembles the one in the original flow chart, and then calibrating it in R to fix the problems in my hand-made version. Those problems included the total number of people changing from week to week, and the number of people entering a particular severity state in one week not matching the number exiting it. Once that is dealt with, drawing the actual plot is a relatively simple chunk of ggplot2 and ggsankey code.
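To give a flavour of the calibration (a toy sketch with made-up numbers, not the actual data): within each week the eyeballed flow values won’t sum to a consistent total, so you divide them by their within-week sum. The loop in the full script alternates that adjustment with the matching one for inflows and outflows at each node.
# toy sketch of the rescaling idea, with made-up numbers (not the real data):
# divide each week's flows by their sum so every week adds up to the same
# total. The loop in the full script repeats this, alternating with the
# equivalent adjustment so that flows out of a node match the flows into it.
library(dplyr)

toy_flows <- tibble(
  week_from = c(0, 0, 0, 1, 1),
  value     = c(0.3, 0.5, 0.4, 0.7, 0.2)
)

toy_flows |>
  group_by(week_from) |>
  mutate(value = value / sum(value)) |>
  ungroup()

Here’s the full script: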
library(tidyverse)
library(janitor)
library(glue)
library(RColorBrewer)
remotes::install_github("davidsjoberg/ggsankey")
library(ggsankey) # one fairly straightforward approach to sankey charts / flow diagrams
# read in some data. This was very crudely hand-entered, with some
# rough visual judgements, from a chart I saw on the internet and
# whose origin I don't know. So treat it as made-up example data:
d <- read_csv("https://raw.githubusercontent.com/ellisp/blog-source/refs/heads/master/data/complicated-sankey-data.csv",
col_types = "ccccd",
# the "NA" values in the original should stay as character strings, not actual NA:
na = "missing") |>
clean_names()
#------------tidying up data---------------
# we have some adjustments to deal with because of having made up data
# An extra bunch of rows of data that are needed by the Sankey function
# to make the week 6 nodes show up:
extras <- d |>
filter(week_to == "6") |>
mutate(
week_from = "6",
week_to = NA,
severity_from = severity_to)
#' Convenience relabelling function for turning week numbers into a factor:
weekf <- function(x){
  x <- case_when(
    x == 0 ~ "Week zero",
    x == 1 ~ "Week one",
    x == 3 ~ "Week three",
    x == 6 ~ "Week six"
  )
  factor(x, levels = c("Week zero", "Week one", "Week three", "Week six"))
}
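# e.g. weekf(c(0, 3)) returns a factor with values "Week zero" and "Week three"
# (and all four week levels)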
# going to start by treating all flow widths as proportions
total_people <- 1
# add in the extra data rows to show the final week of nodes,
# and relabel the weeks:
d2 <- d |>
rbind(extras) |>
mutate(week_from = weekf(week_from),
week_to = weekf(week_to))
# there should be the same total number of people each week,
# and the same number of people leaving each "node" (a severity-week
# combination) as arrived at it on the flow from the last week.
# we have a little iterative process to clean this up. If we had
# real data, none of this would be necessary; this is basically
# because I made up data with some rough visual judgements:
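# (this alternating rescaling is essentially a small iterative-proportional-
# fitting style adjustment; the manual check after the loop is there to see
# that it has settled down after five passes)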
for(i in 1:5){
# scale the data so the population stays the same week by week
# (adds up to total_people, which is 1, so are proportions)
d2 <- d2 |>
group_by(week_from) |>
mutate(value = value / sum(value) * total_people) |>
group_by(week_to) |>
mutate(value = value / sum(value) * total_people) |>
ungroup()
# how many people arrived at each node (week-severity combination)?
tot_arrived <- d2 |>
group_by(week_from, severity_from) |>
mutate(arrived_sev_from = sum(value)) |>
group_by(week_to, severity_to) |>
mutate(arrived_sev_to = sum(value)) |>
ungroup() |>
distinct(week_to, severity_to, arrived_sev_to)
# scale the data leaving the node to match what came in:
d2 <- d2 |>
left_join(tot_arrived, by = c("week_from" = "week_to",
"severity_from" = "severity_to")) |>
group_by(week_from, severity_from) |>
mutate(value = if_else(is.na(arrived_sev_to), value,
value / sum(value) * unique(arrived_sev_to)) ) |>
select(-arrived_sev_to) |>
ungroup()
}
# manual check - these should all be basically the same numbers
filter(tot_arrived, week_to == "Week one" & severity_to == 4)
filter(d2, week_to == "Week one" & severity_to == 4) |> summarise(sum(value))
filter(d2, week_from == "Week one" & severity_from == 4) |> summarise(sum(value))
#--------------draw plot-------------
# palette that is colourblind-ok and shows sequence. This
# actually wasn't too bad in the original, but it got lost
# in the vertical shuffling of all the severity nodes:
pal <- c("grey", brewer.pal(7, "RdYlBu")[7:1])
names(pal) <- c("NA", 1:7)
# Draw the actual chart. First, the base of chart, common to both:
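# The round()/uncount() step turns the proportions into roughly 1000 rows,
# one per (pseudo) person, since the ggsankey geoms work off one row per
# unit of flow: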
p0 <- d2 |>
mutate(value = round(value * 1000)) |>
uncount(weights = value) |>
mutate(severity_from = factor(severity_from, levels = c("NA", 1:7)),
severity_to = factor(severity_to, levels = c("NA", 1:7))) |>
ggplot(aes(x = week_from,
next_x = week_to,
node = severity_from,
next_node = severity_to,
fill = severity_from,
label = severity_from)) +
# the default has a lot of white space between the y axis and the data,
# so reduce the x axis expansion to get rid of it
scale_x_discrete(expand = c(0.05, 0)) +
scale_fill_manual(values = pal) +
labs(subtitle = "Chart is still cluttered, but decreasing severity over time is apparent.
To achieve this, vertical sequencing is mapped to severity, and repetitive labels have been moved into the axis guides.",
x = "",
caption = "Data has been hand-synthesised to be close to an original plot of unknown provenance.")
# Sankey plot:
p1 <- p0 +
geom_sankey(alpha = 0.8) +
geom_sankey_label() +
theme_sankey(base_family = "Roboto") +
theme(legend.position = "none",
plot.title = element_text(family = "Sarala")) +
labs(title = "Severity of an unknown disease shown in a Sankey chart")
# Alluvial plot:
p2 <- p0 +
geom_alluvial(alpha = 0.8) +
geom_alluvial_label() +
theme_alluvial(base_family = "Roboto") +
theme(legend.position = "none",
plot.title = element_text(family = "Sarala")) +
labs(title = "Severity of an unknown disease shown in an alluvial chart",
y = "Number of people")
print(p1) # Sankey plot
print(p2) # alluvial plot