In this post I’m using a small extract from the SIMD2020 dataset to figure out what the special operator .I
does.
Files and code are on github if you’re interested
# files and code : https://github.com/johnmackintosh/DT_dot_I library(data.table) DT <- fread("highdata.csv") lookup <- fread("https://raw.githubusercontent.com/johnmackintosh/ph_lookups/main/dz_intzone_cp_dz_names.csv")
To join these together, I could set a key on Data_Zone
and DataZone
respectively.
However, that would reorder the data, so, for now, I will do the join “by hand”:
DT <- lookup[DT, on = .(DataZone = Data_Zone)]
I’ll maintain the existing order by keying on the overall SIMD rank
setkey(DT, SIMD2020v2_Rank)
I also update some column names
setnames(DT, old = c("SIMD2020v2_Income_Domain_Rank", "SIMD2020_Employment_Domain_Rank", "SIMD2020_Health_Domain_Rank", "SIMD2020_Education_Domain_Rank", "SIMD2020_Access_Domain_Rank", "SIMD2020_Crime_Domain_Rank", "SIMD2020_Housing_Domain_Rank", "CP_Name"), new = c("Income", "Employment", "Health", "Education", "Access", "Crime", "Housing", "areaname"))
Which row has the highest income ranking?
DT[,.I[Income == max(Income)]] # [1] 437
The same for the Health ranking, but returning a data.table or vector respectively
# returns data.table DT[,.I[Health == max(Health)],.SD] # V1 # <int> # 1: 424 # returns single element vector DT[,.I[Health == max(Health)],.SD]$V1 # [1] 424
Return the maximum health ranking by area
DT[,.I[Health == max(Health)], areaname] areaname V1 <char> <int> 1: Inverness 417 2: Bute and Cowal 414 3: Caithness 358 4: East Ross 364 5: Helensburgh and Lomond 424 6: Nairn and Nairnshire 388 7: Oban, Lorn and the Isles 363 8: Mid-Argyll, Kintyre and Islay 295 9: Lochaber 252 10: Mid Ross 389 11: Sutherland 220 12: Skye, Lochalsh and West Ross 271 13: Badenoch and Strathspey 333
Remember - these are row indices, not the actual values
Extract multiple indices
DT[, .(min_health = .I[Health == min(Health)], max_health = .I[Health == max(Health)]), areaname] areaname min_health max_health <char> <int> <int> 1: Inverness 1 417 2: Bute and Cowal 3 414 3: Caithness 4 358 4: East Ross 5 364 5: Helensburgh and Lomond 7 424 6: Nairn and Nairnshire 12 388 7: Oban, Lorn and the Isles 14 363 8: Mid-Argyll, Kintyre and Islay 21 295 9: Lochaber 36 252 10: Mid Ross 37 389 11: Sutherland 82 220 12: Skye, Lochalsh and West Ross 93 271 13: Badenoch and Strathspey 276 333
What happens if we do something in i, then .I? Specifically, filter for the Inverness area, then find the min and max Health rankings
DT[ areaname == "Inverness", .(min_health = .I[Health == min(Health)], # still 1 max_health = .I[Health == max(Health)])] # now 94
We return a new index for the maximum value which is now row 94, having previously been row 417
If I set the key on DataZone
, then the indices change
setkey(DT, DataZone) DT[,.I[Income == max(Income)]] # now 108, was 437 DT[,.I[Health == max(Health)],.SD] # now 95, was 424
We can use .I
to answer the question here relating to selecting rows based on conditions
We can do it all in data.table
# recreate the example data library(data.table) id <- c(rep(102,9),rep(103,5),rep(104,4)) status <- rep(c('single','relationship','relationship','single','single','single'),3) status <- factor(status, levels = c("single" ,"relationship"), ordered = TRUE) # setting as factor for display reasons age <- c(17:19,22,23,26,28,32,33,21:25,21:24) DT <- data.table(id, status, age) # this is the desired output newdata <- DT[c(1,2,7,8,13,14,18),]
Create a new data.table by finding the indices based on the conditions (single and highest age, relationship and lowest age) and use rleid
to only pick out where the status changes, for each id
.
I also create a rownumber variable.
DT2 <- DT[DT[,.I[status == "single" & age == max(age) | status == "relationship" & age == min(age)], .(rleid(status),id)]$V1 ][,rn := rleid(status),id ][]
You will see what appears to be a random $V1
at the end of this line:
DT[DT[,.I[status == "single" & age == max(age) | status == "relationship" & age == min(age)], .(rleid(status),id)]$V1 ]
If I take that out, I get the following error:
Error: When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.
This confirms that what we’re doing here is a self join of DT to itself based on the row indices returned by the conditions we specify.
When the $V1
is returned, we get the following results:
id status age <num> <ord> <num> 1: 102 single 17 2: 102 relationship 18 3: 102 single 28 4: 102 relationship 32 5: 103 single 24 6: 103 relationship 25 7: 104 relationship 21 8: 104 single 24
This gives us far fewer rows than our original dataset, but still more than our desired results.
The row indexed by DT2[,.I[(rn == 1 & status == "relationship")]]
, returns the row index 7
.
This is the one we need to remove, because we don’t want to include results where the first status is “relationship”.
This is an anti join, joining DT2 to itself, minus the row indexed above.
I also update the rownumber for each id, so that each status increments appropriately.
DT2 <- DT2[!DT2[,.I[(rn == 1 & status == "relationship")]] ][,rn := seq(.N), .(id, status)][]
This returns the following:
id status age rn <num> <ord> <num> <int> 1: 102 single 17 1 2: 102 relationship 18 1 3: 102 single 28 2 4: 102 relationship 32 2 5: 103 single 24 1 6: 103 relationship 25 1 7: 104 single 24 1
Which matches the desired data perfectly
all.equal(DT2[,1:3], newdata) # [1] TRUE
Finally, use dcast
to pivot_wider, as per OP’s request
dcast(DT2, id ~ rn + status, value.var = "age") Key: <id> id 1_single 1_relationship 2_single 2_relationship <num> <num> <num> <num> <num> 1: 102 17 18 28 32 2: 103 24 25 NA NA 3: 104 24 NA NA NA
Key takeaway: .I
returns indices, and can be used in circumstances where you might otherwise use which
to identify rows meeting certain conditions.
You can use it to join a data.table to itself, or anti_join to itself, to return your desired results.