We all need to pivot data at some point, so these are just some notes
for my own benefit really, because gather
and spread
are no longer
in favour within tidyr
.
I tended to only ever need gather
, and nearly always relied on the
same key and value names, so it was an easy function for me to use.
I have discovered that pivot_longer
and pivot_wider
are much more
flexible, they just take a little bit more thinking about.
For example, my old approach has now changed along these lines
# old way with `gather` df %>% mutate(row = row_number()) %>% gather('column', 'source', -row, -N) # key = column, value = source, retain row and N # further transforms # new way with pivot_longer df %>% mutate(row = row_number()) %>% pivot_longer(!c(row , N), names_to = 'column', values_to = 'source') # further transforms
However, what I really want to do is show how to replicate much of the
tidyr
pivot functionality with data.table
.
Once again, this is not intended to be in-depth.
I have simply used the tidyr
help file code, and tried to replicate it
with data.table
.
I’d be interested in improvements to my data.table code.
Let’s pivot!
Note – in all examples, I’ll create a copy of the data set as a
data.table using setDT(copy(source_data))
Also, I intended to use code folding to show the datasets and results, but that’s gone horribly wrong, so you can run the code yourself.
You only need:
library(tidyr) library(data.table)
I’m using the base pipe for simplicity.
Using the built-in relig_income
dataset:
Code comparison
relig_income |> pivot_longer(!religion, # keep religion as a column names_to = "income", # desired name for new column values_to = "count") # what data goes into the new column? melt(DT, id.vars = "religion", variable.name = "income", value.name = "count", variable.factor = FALSE) # added to keep output consistent with tidyr
With data.table, you can often get away with only supplying either
measure.vars
or id.vars
, and nothing else, and it does a pretty
great job of guessing what to do.
Obviously it’s better to be specific, but worth bearing in mind.
Now, a slightly more complex case where columns have common prefix, and
missings are structural so should be dropped, using the billboard
dataset
Code comparison:
billboard |> pivot_longer( cols = starts_with("wk"), names_to = "week", names_prefix = "wk", values_to = "rank", values_drop_na = TRUE ) melt(DT, measure.vars = patterns("^wk"), variable.name = "week", value.name = "rank", na.rm = TRUE)
who |> pivot_longer( cols = new_sp_m014:newrel_f65, names_to = c("diagnosis", "gender", "age"), names_pattern = "new_?(.*)_(.)(.*)", # Whoa values_to = "count" ) DT[,melt(.SD, measure.vars = 5:60, value.name = "count") ][,variable := gsub("new_?","", variable) ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE) ][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE) ][, age := tstrsplit(temp, "[mf]+", fixed = FALSE, keep = 2) ][, !c("variable","temp") ][,c(1:4,6:8,5)][]
OK – we’re going to need to break this one down. There’s a lot of magic
going on with that names_pattern
call in pivot_longer
, where various
parts of the tidyverse are working in the background.
I’m not going to dissect pivot_longer
, because calling debugonce
show you that various methods are involved, which is harder to debug.
Instead, I will build up the data.table solution.
Firstly, note that this time round, I am using melt
inside of DT
,
rather than melt(DT)
DT[,melt(.SD, measure.vars = 5:60, value.name = "count") ][]
This returns a new variable
column, which needs splitting up into more
columns.
First, we need to get rid of that new_
, using gsub
to replace it
with nothing at all:
DT[,melt(.SD, measure.vars = 5:60, value.name = "count") ][,variable := gsub("new_?","", variable) ][]
Now I want the diagnosis part, so I create a diagnosis column, and a “temp” column for the remaining bits, which I’ll split in the next passes
DT[,melt(.SD, measure.vars = 5:60, value.name = "count") ][,variable := gsub("new_?","", variable) ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE) ][]
Now to split out the gender
column, from the new temp
column. This
is going to be either an “m” or an “f”.
DT[,melt(.SD, measure.vars = 5:60, value.name = "count") ][,variable := gsub("new_?","", variable) ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE) ][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE) ][]
Then retrieve the age
, again from the temp
column. This generates
two vectors, I only want the second one, hence the keep = 2
DT[,melt(.SD, measure.vars = 5:60, value.name = "count") ][,variable := gsub("new_?","", variable) ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE) ][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE) ][, age := tstrsplit(temp, "[mf]+", fixed = FALSE, keep = 2) ][]
The last two lines discard the variable
and temp
columns, and sort
the columns into the same order as that returned by pivot_longer
DT[,melt(.SD, measure.vars = 5:60, value.name = "count") ][,variable := gsub("new_?","", variable) ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE) ][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE) ][, age := tstrsplit(temp, "[mf]+", fixed = FALSE, keep = 2) ][, !c("variable","temp") ][,c(1:4,6:8,5)][]
Giving us these results:
DT[,melt(.SD, measure.vars = 5:60, value.name = "count") ][,variable := gsub("new_?","", variable) ][,c("diagnosis", "temp") := tstrsplit(variable,"_", fixed = TRUE) ][, gender := tstrsplit(temp, "[^mf]+", fixed = FALSE) ][, age := tstrsplit(temp, "[mf]+", fixed = FALSE, keep = 2) ][, !c("variable","temp") ][,c(1:4,6:8,5)][] ## country iso2 iso3 year diagnosis gender age count ## 1: Afghanistan AF AFG 1980 sp m 014 NA ## 2: Afghanistan AF AFG 1981 sp m 014 NA ## 3: Afghanistan AF AFG 1982 sp m 014 NA ## 4: Afghanistan AF AFG 1983 sp m 014 NA ## 5: Afghanistan AF AFG 1984 sp m 014 NA ## --- ## 405436: Zimbabwe ZW ZWE 2009 rel f 65 NA ## 405437: Zimbabwe ZW ZWE 2010 rel f 65 NA ## 405438: Zimbabwe ZW ZWE 2011 rel f 65 NA ## 405439: Zimbabwe ZW ZWE 2012 rel f 65 NA ## 405440: Zimbabwe ZW ZWE 2013 rel f 65 725
Using the anscombe
dataset
Code comparison:
anscombe |> pivot_longer( everything(), cols_vary = "slowest", names_to = c(".value", "set"), names_pattern = "(.)(.)" ) DT[,melt(.SD, variable.name = "set", value.name = c("x","y"), variable.factor = FALSE, measure.vars = patterns("^x","^y"))]
I have to be honest, dcast
is not a term that makes me think
long to wide
, but you can’t have everything.
We’re using this fish_encounters
dataset for this bit. It’s about time
fish made an appearance on this blog. Probably.
fish_encounters ## # A tibble: 114 × 3 ## fish station seen ## <fct> <fct> <int> ## 1 4842 Release 1 ## 2 4842 I80_1 1 ## 3 4842 Lisbon 1 ## 4 4842 Rstr 1 ## 5 4842 Base_TD 1 ## 6 4842 BCE 1 ## 7 4842 BCW 1 ## 8 4842 BCE2 1 ## 9 4842 BCW2 1 ## 10 4842 MAE 1 ## # … with 104 more rows
Code comparison:
fish_encounters |> pivot_wider(names_from = station, values_from = seen) dcast(DT, fish ~ station, value.var = "seen")
Technically, we can avoid stating that value.var = "seen"
, but we do
then get a large informational message in the console, which is always
disconcerting. (If data.table tells you something, you’re well advised
to act on it, because the help is scarily accurate)
Let’s get rid of those NA’s.
fish_encounters |> pivot_wider(names_from = station, values_from = seen, values_fill = 0) dcast(DT, fish ~ station, value.var = "seen", fill = 0)
Rather than have the DT
inside dcast
, we can use .SD
and have
dcast
inside DT
, which is helpful for further chaining. The same
applies to melt
, as you’ll have seen above.
DT[, dcast(.SD, fish ~ station, value.var = "seen", fill = 0)]
Using the us_rent_income
data
Comparison:
us_rent_income |> pivot_wider( names_from = variable, values_from = c(estimate, moe) ) dcast(DT, GEOID + NAME ~ variable, value.var = c("estimate","moe")) # alternatively we can pass `...` to indicate all other unspecified columns # E.G. : # dcast(DT, ... ~ variable, value.var = c("estimate","moe"))
us_rent_income |> pivot_wider( names_from = variable, names_sep = ".", values_from = c(estimate, moe) ) dcast(DT, GEOID + NAME ~ variable, value.var = c("estimate","moe"), sep = ".") ## alternatively DT[, dcast(.SD, GEOID + NAME ~ variable, value.var = c("estimate","moe"), sep = ".")]
From the tidyr help: “You can control whether names_from
values vary
fastest or slowest relative to the values_from
column names using
names_vary
”.
us_rent_income |> pivot_wider( names_from = variable, values_from = c(estimate, moe), names_vary = "slowest" ) |> names() ## [1] "GEOID" "NAME" "estimate_income" "moe_income" ## [5] "estimate_rent" "moe_rent"
To mimic this in data.table, we specify the final column order
DT[, dcast(.SD, GEOID + NAME ~ variable, value.var = c("estimate","moe")) ][,c(1:3,5,4,6)] |> names() ## [1] "GEOID" "NAME" "estimate_income" "moe_income" ## [5] "estimate_rent" "moe_rent"
The fastest
option is exactly the same as the original
us_rent_income |> pivot_wider( names_from = variable, values_from = c(estimate, moe), names_vary = "fastest" ) |> names() ## [1] "GEOID" "NAME" "estimate_income" "estimate_rent" ## [5] "moe_income" "moe_rent" DT[, dcast(.SD, GEOID + NAME ~ variable, value.var = c("estimate","moe"))] |> names() ## [1] "GEOID" "NAME" "estimate_income" "estimate_rent" ## [5] "moe_income" "moe_rent"
values_fn
Using the warpbreaks
dataset:
warpbreaks %>% pivot_wider( names_from = wool, values_from = breaks, values_fn = mean ) dcast(DT, tension ~ wool, value.var = "breaks", fun = mean) warpbreaks %>% pivot_wider( names_from = wool, values_from = breaks, values_fn = mean ) ## # A tibble: 3 × 3 ## tension A B ## <fct> <dbl> <dbl> ## 1 L 44.6 28.2 ## 2 M 24 28.8 ## 3 H 24.6 18.8 dcast(DT, tension ~ wool, value.var = "breaks", fun = mean) ## tension A B ## 1: L 44.55556 28.22222 ## 2: M 24.00000 28.77778 ## 3: H 24.55556 18.77778
Or, do it inside DT:
DT[,dcast(.SD, tension ~ wool, value.var = "breaks", fun = mean)]
I hope this is useful to others.
If nothing else, please at least give me credit for not using the Ross
Geller GIF.