IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    .I in data.table

    John MacKintosh发表于 2024-01-03 00:00:00
    love 0
    [This article was first published on Data By John, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

    In this post I’m using a small extract from the SIMD2020 dataset to figure out what the special operator .I does.

    Files and code are on github if you’re interested

    # files and code : https://github.com/johnmackintosh/DT_dot_I
    library(data.table)
    DT <- fread("highdata.csv")
    lookup <- fread("https://raw.githubusercontent.com/johnmackintosh/ph_lookups/main/dz_intzone_cp_dz_names.csv")
    

    To join these together, I could set a key on Data_Zone and DataZone respectively. However, that would reorder the data, so, for now, I will do the join “by hand”:

    DT <- lookup[DT, on = .(DataZone = Data_Zone)]
    

    I’ll maintain the existing order by keying on the overall SIMD rank

    setkey(DT, SIMD2020v2_Rank)
    

    I also update some column names

    setnames(DT, 
             old = c("SIMD2020v2_Income_Domain_Rank",
                     "SIMD2020_Employment_Domain_Rank",  
                     "SIMD2020_Health_Domain_Rank",
                     "SIMD2020_Education_Domain_Rank", 
                     "SIMD2020_Access_Domain_Rank", 
                     "SIMD2020_Crime_Domain_Rank",    
                     "SIMD2020_Housing_Domain_Rank",
                     "CP_Name"),
    
             new = c("Income", "Employment", 
                     "Health",   "Education",
                     "Access",  "Crime", 
                     "Housing", "areaname"))
    

    Which row has the highest income ranking?

    DT[,.I[Income == max(Income)]]
    # [1] 437
    

    The same for the Health ranking, but returning a data.table or vector respectively

    # returns data.table
    DT[,.I[Health == max(Health)],.SD] 
    
    #      V1
    #   <int>
    # 1:   424
    
    # returns single element vector
    DT[,.I[Health == max(Health)],.SD]$V1
    # [1] 424
    

    Return the maximum health ranking by area

    DT[,.I[Health == max(Health)], areaname]
    
    
                             areaname    V1
                               <char> <int>
     1:                     Inverness   417
     2:                Bute and Cowal   414
     3:                     Caithness   358
     4:                     East Ross   364
     5:        Helensburgh and Lomond   424
     6:          Nairn and Nairnshire   388
     7:      Oban, Lorn and the Isles   363
     8: Mid-Argyll, Kintyre and Islay   295
     9:                      Lochaber   252
    10:                      Mid Ross   389
    11:                    Sutherland   220
    12:  Skye, Lochalsh and West Ross   271
    13:       Badenoch and Strathspey   333
    
    

    Remember - these are row indices, not the actual values

    Extract multiple indices

    DT[,
       .(min_health = .I[Health == min(Health)], 
         max_health = .I[Health == max(Health)]),
       areaname]
    
                            areaname min_health max_health
                               <char>      <int>      <int>
     1:                     Inverness          1        417
     2:                Bute and Cowal          3        414
     3:                     Caithness          4        358
     4:                     East Ross          5        364
     5:        Helensburgh and Lomond          7        424
     6:          Nairn and Nairnshire         12        388
     7:      Oban, Lorn and the Isles         14        363
     8: Mid-Argyll, Kintyre and Islay         21        295
     9:                      Lochaber         36        252
    10:                      Mid Ross         37        389
    11:                    Sutherland         82        220
    12:  Skye, Lochalsh and West Ross         93        271
    13:       Badenoch and Strathspey        276        333
    
    

    What happens if we do something in i, then .I? Specifically, filter for the Inverness area, then find the min and max Health rankings

    DT[ areaname == "Inverness",  .(min_health = .I[Health == min(Health)], # still 1
                                    max_health = .I[Health == max(Health)])] # now 94
    
    

    We return a new index for the maximum value which is now row 94, having previously been row 417

    If I set the key on DataZone, then the indices change

    setkey(DT, DataZone)
    
    DT[,.I[Income == max(Income)]] # now 108, was 437
    
    
    DT[,.I[Health == max(Health)],.SD]  # now 95, was 424
    
    

    We can use .I to answer the question here relating to selecting rows based on conditions

    We can do it all in data.table 🙂

    # recreate the example data
    
    library(data.table)
    id <- c(rep(102,9),rep(103,5),rep(104,4))
    status <- rep(c('single','relationship','relationship','single','single','single'),3)
    status <- factor(status, levels = c("single" ,"relationship"), ordered = TRUE)
    # setting as factor for display reasons
    age <- c(17:19,22,23,26,28,32,33,21:25,21:24)
    DT <- data.table(id, status, age)
    
    # this is the desired output
    newdata <- DT[c(1,2,7,8,13,14,18),]
    
    

    Create a new data.table by finding the indices based on the conditions (single and highest age, relationship and lowest age) and use rleid to only pick out where the status changes, for each id.

    I also create a rownumber variable.

    DT2 <- DT[DT[,.I[status == "single" & age == max(age) |  
                       status == "relationship" & age == min(age)], 
                 .(rleid(status),id)]$V1
              ][,rn := rleid(status),id
                ][]
    
    

    You will see what appears to be a random $V1 at the end of this line:

    DT[DT[,.I[status == "single" & age == max(age) |  
                       status == "relationship" & age == min(age)], 
                 .(rleid(status),id)]$V1
              ]
    

    If I take that out, I get the following error:

    Error: When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

    This confirms that what we’re doing here is a self join of DT to itself based on the row indices returned by the conditions we specify.

    When the $V1 is returned, we get the following results:

         id       status   age
       <num>        <ord> <num>
    1:   102       single    17
    2:   102 relationship    18
    3:   102       single    28
    4:   102 relationship    32
    5:   103       single    24
    6:   103 relationship    25
    7:   104 relationship    21
    8:   104       single    24
    
    

    This gives us far fewer rows than our original dataset, but still more than our desired results.

    The row indexed by DT2[,.I[(rn == 1 & status == "relationship")]], returns the row index 7. This is the one we need to remove, because we don’t want to include results where the first status is “relationship”.

    This is an anti join, joining DT2 to itself, minus the row indexed above.
    I also update the rownumber for each id, so that each status increments appropriately.

    DT2 <- DT2[!DT2[,.I[(rn == 1 & status == "relationship")]]
               ][,rn := seq(.N), .(id, status)][]
    

    This returns the following:

         id       status   age    rn
       <num>        <ord> <num> <int>
    1:   102       single    17     1
    2:   102 relationship    18     1
    3:   102       single    28     2
    4:   102 relationship    32     2
    5:   103       single    24     1
    6:   103 relationship    25     1
    7:   104       single    24     1
    
    

    Which matches the desired data perfectly

    all.equal(DT2[,1:3], newdata)
    # [1] TRUE
    

    Finally, use dcast to pivot_wider, as per OP’s request

    dcast(DT2,  
          id ~   rn + status, 
          value.var = "age")
    
    Key: <id>
          id 1_single 1_relationship 2_single 2_relationship
       <num>    <num>          <num>    <num>          <num>
    1:   102       17             18       28             32
    2:   103       24             25       NA             NA
    3:   104       24             NA       NA             NA
    

    Key takeaway: .I returns indices, and can be used in circumstances where you might otherwise use which to identify rows meeting certain conditions. You can use it to join a data.table to itself, or anti_join to itself, to return your desired results.

    To leave a comment for the author, please follow the link and comment on their blog: Data By John.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
    Continue reading: .I in data.table


沪ICP备19023445号-2号
友情链接