
    Stratified Sampling in R: A Practical Guide with Base R and dplyr

    Steven P. Sanderson II, MPH, published on 2024-07-29 04:00:00
    [This article was first published on Steve's Data Tips and Tricks, and kindly contributed to R-bloggers.]

    Introduction

    Stratified sampling is a technique used to ensure that different subgroups (strata) within a population are represented in a sample. It is particularly useful when a simple random sample would leave small strata underrepresented. In this post, we’ll explore how to perform stratified sampling in R using both base R and the dplyr package. We’ll walk through examples and explain the code, so you can try these techniques on your own data.

    What is Stratified Sampling?

    In stratified sampling, the population is divided into non-overlapping strata based on a specific characteristic (e.g., age, gender, income level). A random sample is then taken from each stratum separately. Because every stratum is guaranteed to contribute rows, the sample tends to reflect the population's structure better than a simple random sample of the same size, especially when the strata differ considerably in size or characteristics.
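    To make the contrast with a simple random sample concrete, here is a minimal base R sketch. The `population` data frame and its 95%/5% split are made up for illustration and are not part of the examples that follow.

```r
# Hypothetical population: stratum "B" makes up only 5% of the rows
set.seed(42)
population <- data.frame(
  id = 1:200,
  stratum = rep(c("A", "B"), times = c(190, 10))
)

# Simple random sample: the number of "B" rows is left to chance
srs <- population[sample(nrow(population), 20), ]
srs_counts <- table(factor(srs$stratum, levels = c("A", "B")))

# Stratified sample: draw 10 rows from each stratum separately
pieces <- lapply(split(population, population$stratum), function(rows) {
  rows[sample(nrow(rows), 10), ]
})
stratified <- do.call(rbind, pieces)
strat_counts <- table(stratified$stratum)

print(srs_counts)    # the "B" count depends on the random draw
print(strat_counts)  # 10 rows per stratum by construction
```

    The simple random sample may contain very few "B" rows, or none at all, while the stratified sample contains exactly 10 from each stratum.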

    Stratified Sampling with Base R

    Let’s start with an example using base R. Suppose we have a dataset with information about individuals, including their gender and income. We want to sample a specific number of individuals from each gender group.

    Here’s how we can do it:

    # Sample data
    set.seed(123) # For reproducibility
    data <- data.frame(
      ID = 1:100,
      Gender = sample(c("Male", "Female"), 100, replace = TRUE),
      Income = rnorm(100, mean = 50000, sd = 10000)
    )
    
    # View the first few rows of the data
    head(data)
    #>   ID Gender   Income
    #> 1  1   Male 52533.19
    #> 2  2   Male 49714.53
    #> 3  3   Male 49571.30
    #> 4  4 Female 63686.02
    #> 5  5   Male 47742.29
    #> 6  6 Female 65164.71

    In this dataset, we have a column for Gender and another for Income. Let’s say we want to sample 10 males and 10 females.

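    # Stratified sampling function
    stratified_sample <- function(data, strat_column, size_per_stratum) {
      # One entry per distinct value in the stratification column
      strata <- unique(data[[strat_column]])
      sampled_data <- do.call(rbind, lapply(strata, function(stratum) {
        subset_data <- data[data[[strat_column]] == stratum, ]
        # Fail with a clear message if a stratum is too small
        if (nrow(subset_data) < size_per_stratum) {
          stop("Stratum '", stratum, "' has fewer than ", size_per_stratum, " rows")
        }
        # Draw row indices without replacement, then subset
        subset_data[sample(nrow(subset_data), size_per_stratum), ]
      }))
      return(sampled_data)
    }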
    
    # Perform stratified sampling
    sampled_data <- stratified_sample(data, "Gender", 10)
    
    # View the sampled data
    table(sampled_data$Gender)
    #> Female   Male 
    #>     10     10 
    head(sampled_data)
    #>      ID Gender   Income
    #> 45   45   Male 63606.52
    #> 69   69   Male 41502.96
    #> 83   83   Male 50412.33
    #> 29   29   Male 51813.03
    #> 49   49   Male 47643.00
    #> 100 100   Male 37129.70

    In this example:

    • We first create a function stratified_sample that takes the data, the column to stratify by, and the number of samples per stratum.
    • The function identifies unique strata, then samples the specified number of rows from each stratum.
    • The result is a combined dataset with samples from each group.
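    If you need a different number of rows from each stratum, the same idea extends naturally. The sketch below is one possible variation, not part of the original function: `stratified_sample_sizes()` and its `sizes` argument (a named vector with one entry per stratum) are hypothetical names, and `example_data` is a stand-in dataset with fixed group sizes.

```r
# Hypothetical helper: `sizes` is a named vector, one entry per stratum
stratified_sample_sizes <- function(data, strat_column, sizes) {
  groups <- split(data, data[[strat_column]])
  missing_strata <- setdiff(names(sizes), names(groups))
  if (length(missing_strata) > 0) {
    stop("Unknown strata: ", paste(missing_strata, collapse = ", "))
  }
  pieces <- lapply(names(sizes), function(stratum) {
    rows <- groups[[stratum]]
    if (sizes[[stratum]] > nrow(rows)) {
      stop("Stratum '", stratum, "' has only ", nrow(rows), " rows")
    }
    # Sample row positions without replacement within this stratum
    rows[sample(seq_len(nrow(rows)), sizes[[stratum]]), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

# Stand-in data with 55 males and 45 females
set.seed(123)
example_data <- data.frame(
  ID = 1:100,
  Gender = rep(c("Male", "Female"), times = c(55, 45)),
  Income = rnorm(100, mean = 50000, sd = 10000)
)

uneven_sample <- stratified_sample_sizes(
  example_data, "Gender", c(Female = 5, Male = 15)
)
table(uneven_sample$Gender)
```

    Here `split()` does the grouping that the original function does with `unique()` and logical subsetting; either approach works, and `split()` simply avoids rescanning the column for each stratum.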

    Stratified Sampling with dplyr

    Using sample_n

    The dplyr package makes data manipulation straightforward and efficient. Here’s how to do stratified sampling using dplyr:

    library(dplyr)
    
    # Stratified sampling with sample_n()
    sampled_data_n <- data %>%
      group_by(Gender) %>%
      sample_n(10)
    
    # View the sampled data
    sampled_data_n %>% count(Gender)
    #> # A tibble: 2 × 2
    #> # Groups:   Gender [2]
    #>   Gender     n
    #>   <chr>  <int>
    #> 1 Female    10
    #> 2 Male      10
    head(sampled_data_n)
    #> # A tibble: 6 × 3
    #> # Groups:   Gender [1]
    #>      ID Gender Income
    #>   <int> <chr>   <dbl>
    #> 1    81 Female 64446.
    #> 2     6 Female 65165.
    #> 3     8 Female 55846.
    #> 4    22 Female 26908.
    #> 5    98 Female 56879.
    #> 6    11 Female 53796.

    In this approach:

    • We use group_by() to group the data by the Gender column.
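    • sample_n() draws 10 rows from each group, without replacement by default. Note that sample_n() is superseded by slice_sample() as of dplyr 1.0.0; it still works, but new code may prefer the newer function.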
    • count() helps us verify the number of samples from each group.
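    For readers on dplyr 1.0.0 or later, the same fixed-size draw can be written with slice_sample(). This is a minimal sketch rather than a drop-in replacement for the code above: `example_data` is a stand-in built the same way as the post's dataset, and the final ungroup() is an extra step added here so the result is a plain tibble.

```r
library(dplyr)

# Stand-in data, built like the dataset used earlier in the post
set.seed(123)
example_data <- data.frame(
  ID = 1:100,
  Gender = sample(c("Male", "Female"), 100, replace = TRUE),
  Income = rnorm(100, mean = 50000, sd = 10000)
)

# Draw 10 rows per gender, then drop the grouping
sampled_slice <- example_data %>%
  group_by(Gender) %>%
  slice_sample(n = 10) %>%
  ungroup()

count(sampled_slice, Gender)
```

    One behavioural difference to keep in mind: when a group has fewer rows than requested, sample_n() raises an error, whereas slice_sample() is documented to return the whole group silently.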

    Using sample_frac() for Proportional Sampling

    If you want to sample a proportion of each stratum, you can use the sample_frac() function. For example, if you want to sample 20% of each gender group:

    # Stratified sampling with sample_frac()
    sampled_data_frac <- data %>%
      group_by(Gender) %>%
      sample_frac(0.2)
    
    # View the sampled data
    sampled_data_frac %>% count(Gender)
    #> # A tibble: 2 × 2
    #> # Groups:   Gender [2]
    #>   Gender     n
    #>   <chr>  <int>
    #> 1 Female     9
    #> 2 Male      11
    head(sampled_data_frac)
    #> # A tibble: 6 × 3
    #> # Groups:   Gender [1]
    #>      ID Gender Income
    #>   <int> <chr>   <dbl>
    #> 1    71 Female 51176.
    #> 2    92 Female 47378.
    #> 3    13 Female 46668.
    #> 4    48 Female 65326.
    #> 5    42 Female 55484.
    #> 6    76 Female 43481.

    In this example:

    • sample_frac() draws 20% of the rows from each group, which here works out to 9 females and 11 males.
    • This is useful when you want each stratum's share of the sample to match its share of the population, so that larger strata contribute more rows.
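    The proportional draw also has a slice_sample() counterpart through its prop argument. The sketch below uses a stand-in dataset with fixed group sizes (40 and 60 rows, an assumption made so the arithmetic is easy to follow) rather than the post's random data.

```r
library(dplyr)

# Stand-in data with known stratum sizes: 60 males, 40 females
example_data <- data.frame(
  ID = 1:100,
  Gender = rep(c("Male", "Female"), times = c(60, 40))
)

# Draw 20% of the rows within each gender
set.seed(123)
sampled_prop <- example_data %>%
  group_by(Gender) %>%
  slice_sample(prop = 0.2) %>%
  ungroup()

count(sampled_prop, Gender)
```

    With 40 and 60 rows per stratum, 20% gives 8 and 12 rows. When the product is not a whole number, the slice_sample() documentation says prop is rounded towards zero, so the counts may differ by a row from what sample_frac() returns on the same data.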

    Conclusion

    Stratified sampling is a powerful technique to ensure representation from all subgroups in your sample. Whether you’re using base R or dplyr, the process is straightforward and lets you draw either a fixed number or a fixed proportion of rows from each stratum.

    Feel free to try these methods on your data! Experimenting with different sizes and strata can help you understand how stratified sampling affects your analyses. Don’t hesitate to dive into the code and see how you can adapt it to your needs.


    Happy coding!
