    Halloween Data Cleaning

    R on kieranhealy.org, published 2024-10-12 20:20:24
    [This article was first published on R on kieranhealy.org, and kindly contributed to R-bloggers.]

    This week in Modern Plain Text Computing we put together some of the things we’ve been learning about cleaning and tidying data. Here’s a somewhat sobering example using data from the Fatality Analysis Reporting System (FARS), which is how the NHTSA tracks information about road accidents in the United States. Our data file shows counts of pedestrians aged <16 killed in road accidents on each day of the month in the United States from 2008 to 2022.

    The data arrives, as is very often the case, in a spreadsheet that is not tidy. It looks like this:

    [Figure: FARS data reporting tool results. The Excel spreadsheet we get from querying the FARS data with its online report-generating tool.]

    Let’s try to read it in as cleanly as we can by skipping the few lines of metadata at the top. Those lines are useful as a reminder of exactly which query we ran, but they are not part of the dataset itself.

    library(tidyverse)
    library(here)
    library(socviz)
    
    fars_raw <- readxl::read_xlsx(here("files", "examples", "fars_crash_report.xlsx"), 
                                  skip = 7)
    #> New names:
    #> • `` -> `...1`
    #> • `` -> `...2`
    
    fars_raw |> 
      print(n=35)
    #> # A tibble: 221 × 35
    #>    ...1  ...2      `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`  `11`  `12`  `13`  `14`  `15`  `16`  `17`  `18`  `19`  `20`  `21`  `22`
    #>    <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #>  1 2008  January     1     0     0     3     1     0     1     0     2     0     2     2     1     2     0     0     1     0     0     0     1     0
    #>  2 <NA>  Februa…     0     0     1     1     1     0     1     3     0     0     0     0     0     0     0     1     0     1     0     0     0     0
    #>  3 <NA>  March       0     2     0     0     0     3     0     0     1     2     3     0     0     0     0     0     0     3     1     0     0     1
    #>  4 <NA>  April       0     2     2     1     1     0     2     0     0     1     3     1     1     1     0     2     1     1     1     0     1     1
    #>  5 <NA>  May         1     1     3     0     0     0     0     0     0     1     1     1     2     0     0     3     0     1     0     3     0     1
    #>  6 <NA>  June        6     1     2     1     1     1     0     3     2     1     0     0     0     0     1     1     3     0     0     1     1     0
    #>  7 <NA>  July        0     2     2     1     1     0     0     1     1     0     0     1     1     0     0     0     1     1     1     0     0     0
    #>  8 <NA>  August      1     0     1     0     0     0     2     0     3     0     0     1     1     0     2     6     1     1     0     0     0     1
    #>  9 <NA>  Septem…     3     0     2     1     1     1     1     1     1     0     0     2     1     0     3     0     0     0     0     1     0     1
    #> 10 <NA>  October     0     1     1     1     0     1     0     1     1     1     0     3     0     0     2     0     1     0     0     4     1     3
    #> 11 <NA>  Novemb…     0     0     4     1     1     0     0     0     0     1     1     0     0     1     0     2     1     0     1     0     3     1
    #> 12 <NA>  Decemb…     2     0     0     1     2     2     2     0     2     1     1     0     0     1     0     2     0     0     2     2     0     0
    #> 13 <NA>  Unknown     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    #> 14 <NA>  Total      14     9    18    11     9     8     9     9    13     8    11    11     7     5     8    17     9     8     6    11     7     9
    #> 15 2009  January     1     0     0     1     0     2     3     1     1     1     0     1     0     0     1     0     1     0     1     0     1     3
    #> 16 <NA>  Februa…     0     1     0     1     0     2     0     0     1     3     0     0     1     1     1     0     1     0     3     0     2     1
    #> 17 <NA>  March       0     1     1     1     1     1     2     0     2     1     0     2     0     1     1     0     1     0     0     0     1     0
    #> 18 <NA>  April       0     0     2     0     0     1     0     0     0     0     1     0     0     0     1     1     1     1     1     1     2     0
    #> 19 <NA>  May         1     1     1     2     0     2     1     2     3     2     1     0     0     0     1     1     0     2     2     2     1     1
    #> 20 <NA>  June        0     1     0     2     0     0     1     2     0     3     1     0     2     1     0     0     1     0     0     1     1     0
    #> 21 <NA>  July        0     1     1     2     2     0     0     0     0     4     2     0     1     0     0     3     1     0     1     0     0     0
    #> 22 <NA>  August      1     2     1     0     0     1     2     1     1     0     1     1     1     0     0     1     0     1     2     0     0     3
    #> 23 <NA>  Septem…     1     0     2     1     0     0     1     1     0     3     2     1     1     2     3     1     0     0     0     1     0     0
    #> 24 <NA>  October     0     1     1     0     1     3     1     2     0     1     1     0     0     0     0     1     1     0     0     1     1     0
    #> 25 <NA>  Novemb…     2     0     1     0     1     0     0     0     0     1     1     1     1     1     0     1     2     1     1     0     1     0
    #> 26 <NA>  Decemb…     0     1     1     1     1     0     0     1     1     0     1     1     1     0     0     0     0     0     1     0     1     0
    #> 27 <NA>  Unknown     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
    #> 28 <NA>  Total       6     9    11    11     6    12    11    10     9    19    11     7     8     6     8     9     9     5    12     6    11     8
    #> 29 2010  January     0     1     1     0     0     0     0     1     0     1     0     1     1     1     0     1     0     1     0     2     1     1
    #> 30 <NA>  Februa…     0     0     0     0     2     0     0     0     3     0     1     0     0     0     0     1     2     0     0     2     0     1
    #> 31 <NA>  March       0     0     0     1     1     2     0     1     0     1     1     0     0     0     1     0     0     0     4     1     0     1
    #> 32 <NA>  April       2     5     0     1     1     0     0     3     2     2     1     0     1     1     2     2     0     0     0     0     1     0
    #> 33 <NA>  May         0     0     0     2     2     0     1     0     1     0     0     0     1     0     0     0     0     0     0     0     1     0
    #> 34 <NA>  June        2     2     1     2     1     1     2     4     3     1     0     2     0     0     1     1     1     1     1     0     0     2
    #> 35 <NA>  July        2     2     0     1     0     0     0     0     1     0     3     2     1     0     2     1     1     2     1     0     0     1
    #> # ℹ 186 more rows
    #> # ℹ 11 more variables: `23` <dbl>, `24` <dbl>, `25` <dbl>, `26` <dbl>, `27` <dbl>, `28` <dbl>, `29` <dbl>, `30` <dbl>, `31` <dbl>, Unknown <dbl>,
    #> #   Total <dbl>

    We need to tidy this up. The first two columns don’t have proper names. The data is in wide format, the year values aren’t filled in down their column, and there are 'Unknown' and 'Total' entries in both the rows and the columns. We know what the column names should be, so we can supply them manually.

    colnames(fars_raw) <- c("year", "month", 1:31, "unknown", "total")
    
    
    fars <- fars_raw |> 
      # Drop the 'Unknown' and 'Total' columns
      select(!all_of(c("unknown", "total"))) |> 
      # Make sure there's no leading or trailing whitespace
      mutate(year = str_squish(year), 
             month = str_squish(month)) |> 
      # Remove 'Unknown' and 'Total' rows 
      filter(month %nin% c("Unknown", "Total")) |> 
      # Remove 'Redacted', 'Unknown' and 'Total' rows from `year` 
      # (This partially overlaps with the 'Unknown' and 'Total' rows in `month`.)
      filter(year %nin% c("Redacted", "Unknown", "Total")) |> 
      # Remove any lines that are all NAs
      filter(!if_all(everything(), \(x) is.na(x))) |> 
      # Fill in the year values downwards
      fill(year) |> 
      # Lengthen 
      pivot_longer(cols = `1`:`31`, names_to = "day",
                   values_to = "n")  
      
    
    fars
    #> # A tibble: 5,921 × 4
    #>    year  month   day       n
    #>    <chr> <chr>   <chr> <dbl>
    #>  1 2008  January 1         1
    #>  2 2008  January 2         0
    #>  3 2008  January 3         0
    #>  4 2008  January 4         3
    #>  5 2008  January 5         1
    #>  6 2008  January 6         0
    #>  7 2008  January 7         1
    #>  8 2008  January 8         0
    #>  9 2008  January 9         2
    #> 10 2008  January 10        0
    #> # ℹ 5,911 more rows

    We’re going to make a graph of average patterns by month and day over all available years. So first we aggregate the data by month and day and take the average across years.

    fars_daily_means <- fars |> 
        summarize(mean_n = mean(n, na.rm = TRUE), 
                .by = c(month, day)) 
    
    
    fars_daily_means
    #> # A tibble: 372 × 3
    #>    month   day   mean_n
    #>    <chr>   <chr>  <dbl>
    #>  1 January 1      0.4  
    #>  2 January 2      0.333
    #>  3 January 3      0.2  
    #>  4 January 4      0.667
    #>  5 January 5      0.467
    #>  6 January 6      0.6  
    #>  7 January 7      0.667
    #>  8 January 8      0.333
    #>  9 January 9      0.533
    #> 10 January 10     0.467
    #> # ℹ 362 more rows

    For the purposes of what we’re going to draw, it will make more sense to treat the months and month-day combinations as categorical variables, i.e. as ordered factors, rather than as true dates.

    dates <- unique(paste(fars$month, fars$day))
    
    dates[1:10]
    #>  [1] "January 1"  "January 2"  "January 3"  "January 4"  "January 5"  "January 6"  "January 7"  "January 8"  "January 9"  "January 10"
    
    
    
    fars_daily_means <- fars_daily_means |> 
      mutate(
        # Ordered categorical var of months
        month_fac = factor(month,
                           levels = unique(fars$month),
                           ordered = TRUE),
        # Convert from character to integer
        day = as.integer(day),
        # Ordered factor of month-days
        day_fac = factor(paste(month, day),
                         levels = dates,
                         ordered = TRUE),
        # Dummy variable: is it Halloween today?
        flag = ifelse(month == "October" & day == 31, TRUE, FALSE), 
        # Shortcut .by again
        .by = c(month, day)) |> 
      # Order for convenience
      arrange(day_fac) 
    
    fars_daily_means
    #> # A tibble: 372 × 6
    #>    month     day mean_n month_fac day_fac    flag 
    #>    <chr>   <int>  <dbl> <ord>     <ord>      <lgl>
    #>  1 January     1  0.4   January   January 1  FALSE
    #>  2 January     2  0.333 January   January 2  FALSE
    #>  3 January     3  0.2   January   January 3  FALSE
    #>  4 January     4  0.667 January   January 4  FALSE
    #>  5 January     5  0.467 January   January 5  FALSE
    #>  6 January     6  0.6   January   January 6  FALSE
    #>  7 January     7  0.667 January   January 7  FALSE
    #>  8 January     8  0.333 January   January 8  FALSE
    #>  9 January     9  0.533 January   January 9  FALSE
    #> 10 January    10  0.467 January   January 10 FALSE
    #> # ℹ 362 more rows

    Now we can draw a plot. The plot is generated by the code below, but there is also some theme setting, not shown, that makes the fonts a bit nicer.
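
    That theme code isn’t part of the post. Purely as a hypothetical illustration of the kind of adjustment meant here, one could set a default theme with a different base font before plotting; the font name below is a placeholder, not the author’s choice.

    # Hypothetical theme setting, not the code used in the original post.
    # "Lato" is a placeholder font family; substitute one installed locally.
    theme_set(
      theme_minimal(base_family = "Lato") +
        theme(plot.title = element_text(face = "bold"),
              strip.text = element_text(face = "bold"))
    )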

    fars_daily_means |> 
      ggplot( mapping = aes(x = day, y = mean_n, fill = flag)) +
        geom_col() +
        scale_fill_manual(values = c("gray30", "darkorange2")) +
        scale_x_continuous(breaks = c(1, 10, 20, 30)) + 
        guides(fill = "none") + 
        facet_wrap(~ month_fac, ncol = 1) +
        labs(x = "Day of the Month",
             y = "Mean Number of Child Pedestrians Killed",
             title = "Pedestrians aged <16 years killed\nin Fatal Motor Vehicle Crashes",
             subtitle = "Daily Average, 2008-22",
             caption = "Kieran Healy @kjhealy / Source: NHTSA Fatality Analysis Reporting System")
    [Figure: Our graph. The FARS daily averages drawn by the code above, faceted by month, with October 31 highlighted.]

    You can see the uptick in child pedestrian fatalities at Halloween pretty clearly in the data. There’s other structure here too, e.g. strong seasonal differences and also upticks on other US holidays. Some patterns in the original data are obscured when we average by day of the month rather than by day of the week, such as the difference between weekends and other days (a rough sketch of that weekday view follows below).

    Perhaps most important, there’s also an interesting risk-exposure problem here. On the one hand, we see more child fatalities at Halloween. But on the other, we can be pretty confident that exposure is also much higher: there are far more children wandering around the streets at Halloween than there normally would be. So it may be that the rate with respect to that denominator is the same as, or lower than, on “typical” days. But it’s not really possible to get that denominator from data like this. In any case, if you’re out walking around with your children this Halloween, stay safe. And rather more pertinently, if you’re out driving, be careful.
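
    On the day-of-week point above, here is a minimal sketch, not from the original post, of how one might re-aggregate the tidied fars tibble by weekday instead of by day of the month. It assumes an English locale when parsing the month names, and uses lubridate::wday(); lubridate is installed with the tidyverse, and the namespaced call works whether or not it is attached.

    # Sketch only: rebuild real dates from year, month, and day, then average
    # fatalities by day of the week. Impossible combinations (e.g. February 30)
    # parse to NA and are dropped.
    fars_weekday_means <- fars |> 
      mutate(date = as.Date(paste(year, month, day), format = "%Y %B %d"),
             weekday = lubridate::wday(date, label = TRUE, week_start = 1)) |> 
      filter(!is.na(date)) |> 
      summarize(mean_n = mean(n, na.rm = TRUE), .by = weekday)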
