IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    Grouped Sequences in dplyr

    R on Alan Yeung发表于 2023-11-05 00:00:00
    love 0
    [This article was first published on R on Alan Yeung, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

    For a piece of work I had to calculate the number of matches that a team plays away from home in a row, which we will call days_on_the_road. I was not sure how to do this with dplyr but it’s basically a ‘grouped sequence’. For this post, I’ve created some dummy data to illustrate this idea. The num_matches_away variable is what we want to mimic using some data manipulation.

    library(tidyverse)
    
    sports_df <- tibble::tribble(
      ~team,    ~date,        ~home_or_away, ~num_matches_away,
      "Team A", "07/10/2022", "H",           0L,
      "Team A", "14/10/2022", "A",           1L,
      "Team A", "21/10/2022", "A",           2L,
      "Team A", "28/10/2022", "H",           0L,
      "Team A", "04/11/2022", "A",           1L,
      "Team A", "11/11/2022", "A",           2L,
      "Team A", "18/11/2022", "A",           3L,
      "Team A", "25/11/2022", "H",           0L,
      "Team A", "02/12/2022", "H",           0L,
      "Team A", "09/12/2022", "H",           0L,
      "Team B", "07/10/2022", "H",           0L,
      "Team B", "14/10/2022", "A",           1L,
      "Team B", "21/10/2022", "A",           2L,
      "Team B", "28/10/2022", "A",           3L,
      "Team B", "04/11/2022", "H",           0L,
      "Team B", "11/11/2022", "A",           1L,
      "Team B", "18/11/2022", "A",           2L,
      "Team B", "25/11/2022", "H",           0L,
      "Team B", "02/12/2022", "H",           0L,
      "Team B", "09/12/2022", "A",           1L
    ) %>%      
      mutate(date = as.Date(date, "%d/%m/%Y")) %>% 
      arrange(team, date)
    
    sports_df
    ## # A tibble: 20 × 4
    ##    team   date       home_or_away num_matches_away
    ##    <chr>  <date>     <chr>                   <int>
    ##  1 Team A 2022-10-07 H                           0
    ##  2 Team A 2022-10-14 A                           1
    ##  3 Team A 2022-10-21 A                           2
    ##  4 Team A 2022-10-28 H                           0
    ##  5 Team A 2022-11-04 A                           1
    ##  6 Team A 2022-11-11 A                           2
    ##  7 Team A 2022-11-18 A                           3
    ##  8 Team A 2022-11-25 H                           0
    ##  9 Team A 2022-12-02 H                           0
    ## 10 Team A 2022-12-09 H                           0
    ## 11 Team B 2022-10-07 H                           0
    ## 12 Team B 2022-10-14 A                           1
    ## 13 Team B 2022-10-21 A                           2
    ## 14 Team B 2022-10-28 A                           3
    ## 15 Team B 2022-11-04 H                           0
    ## 16 Team B 2022-11-11 A                           1
    ## 17 Team B 2022-11-18 A                           2
    ## 18 Team B 2022-11-25 H                           0
    ## 19 Team B 2022-12-02 H                           0
    ## 20 Team B 2022-12-09 A                           1

    I firstly came up with a complicated solution for this using a combination of slider::slide() and rle() (run length encoding).

    sports_df2 <- sports_df %>% 
      group_by(team) %>% 
      mutate(days_on_road = unlist(slider::slide(rle(home_or_away)$length, ~1:.x)),
             days_on_road = if_else(home_or_away == "H", 0L, days_on_road)) %>% 
      ungroup()
    
    identical(sports_df2$days_on_road, sports_df$num_matches_away)
    ## [1] TRUE

    These are some pretty cool functions and it was nice to know rle() existed but I was never really happy with this solution as it seemed overly complex and it’s difficult to understand what the code is doing by simply reading it. So I asked a colleague to try to solve this problem and they came up with a better solution which I’m grateful for! It involves using a combination of group_by() and seq_len() which is a whole lot simpler to understand in my opinion.

    sports_df3 <- sports_df2 %>%
      mutate(away = cumsum(home_or_away == "H")) %>%
      group_by(team, home_or_away, away) %>%
      mutate(days_on_road = seq_len(n())) %>%
      ungroup() %>%
      mutate(days_on_road = if_else(home_or_away == "H", 0L, days_on_road))
    
    identical(sports_df3$days_on_road, sports_df$num_matches_away)
    ## [1] TRUE

    I hope this is useful for some of you out there coding with R!

    To leave a comment for the author, please follow the link and comment on their blog: R on Alan Yeung.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
    Continue reading: Grouped Sequences in dplyr


沪ICP备19023445号-2号
友情链接