IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    simstudy: another way to generate data from a non-standard density

    Keith Goldfeld发表于 2024-06-04 00:00:00
    love 0
    [This article was first published on ouR data generation, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

    One of my goals for the simstudy package is to make it as easy as possible to generate data from a wide range of data distributions. The recent update created the possibility of generating data from a customized distribution specified in a user-defined function. Last week, I added two functions, genDataDist and addDataDist, that allow data generation from an empirical distribution defined by a vector of integers. (See here for how to download latest development version.) This post provides a simple illustration of the new functionality.

    Here are the libraries needed, in case you want to follow along:

    library(simstudy)
    library(data.table)
    library(ggplot2)
    
    set.seed(1234)

    The target density is simply defined by specifying a vector that is intended to loosely represent a data distribution. We start by specifying the vector (which can be of any length):

    base_data <- 
      c(1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10)

    We can look at the density to make sure this is the distribution we are interested in drawing our data from:

    emp_density <- density(base_data, n = 10000)
    den_curve <- data.table(x = emp_density$x, y = emp_density$y)
    
    ggplot(data = den_curve, aes(x = x, y = y)) +
      geom_line(linewidth = 1) +
      scale_y_continuous(name = "density\n", limits = c(0, 0.11), 
         breaks = seq(0, .10, by = .02)) +
      scale_color_manual(values = colors) +
      theme(panel.grid = element_blank())

    Actually drawing samples from this distribution is a simple call to genDataDensity. The key argument is the data distribution as as represented by the vector of integers:

    dx <- genDataDensity(10000, dataDist = base_data, varname = "x1")

    Here’s a look at the sampled data and their relationship to the target density:

    ggplot(data = dx, aes(x=x1)) +
      geom_histogram(aes(y = after_stat(count / sum(count))), 
        binwidth = 1, fill = "grey", color = "black", alpha = .2) +
      geom_line(data = den_curve, aes(x = x, y = y), 
        color = "black", linewidth = 2) +
      scale_y_continuous(name = "density\n", limits = c(0, 0.11), 
         breaks = seq(0, .10, by = .02)) +
      scale_x_continuous(limits = c(-6, 15), breaks = seq(-5, 10, by = 5)) +
      theme(panel.grid = element_blank(),
            plot.title = element_text(face = "bold", size = 10))

    Just to show that this was not a fluke, here are three additional target distributions, specified with three different vectors:

    base_data <- list(
      c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7,   7,  8,  9, 10),
      c(1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 6, 6, 7, 7, 7, 8, 9, 10, 10, 10, 10, 10),
      c(1, 2, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10)
    )

    We can generate data from each of the distributions and then confirm that each one adequately fits its target distribution:

    dx1 <- genDataDensity(10000, dataDist = base_data[[1]], varname = "x1")
    dx2 <- genDataDensity(10000, dataDist = base_data[[2]], varname = "x1")
    dx3 <- genDataDensity(10000, dataDist = base_data[[3]], varname = "x1")

    Addendum: code to generate multiple distribution plot

    Here is a little more detail in case someone might find it useful to have the code that generates the “facet” plot. In the plot with the single distribution, I specified the histogram with this command:

    geom_histogram(aes(y = after_stat(count / sum(count))), ...)

    When I tried to apply this to the “facet” plot, the denominator of that plot (sum(count)) was not calculated for each subgroup (i.e., dataset), but was the total across all datasets. As a result, the dataset-specific proportions were underestimated; we can see that here:

    dx <- rbindlist(list(dx1, dx2, dx3), idcol = TRUE)
    
    ggplot(data = dx, aes(x=x1)) +
      geom_histogram(
        aes(y = after_stat(count / sum(count)), fill = factor(.id), color = factor(.id)),
        binwidth = 1, alpha = .2) +
      geom_line(data = dens, aes(x = x, y = y, color = factor(.id)), linewidth = 2) +
        xlab("\nx1") +
      ylab("density\n") +
      scale_fill_manual(values = colors) +
      scale_color_manual(values = colors) +
      theme(panel.grid = element_blank(),
            plot.title = element_text(face = "bold", size = 10),
            legend.position = "none") + 
      facet_grid(~ .id)  

    I looked around for a way to address this, but couldn’t find anything that obviously addressed this shortcoming (though I am convinced it must be possible, and I just couldn’t locate the solution). I considered using ggarrangeor something similar, but was not satisfied with the results. Instead, it turned out to be faster just to calculate the proportions myself. This is the process I used:

    First, I created a dataset with the bins (using a bin size of 1):

    cuts <- seq(dx[,floor(min(x1))], dx[,ceiling(max(x1))], by = 1)
    dcuts <- data.table(bin = 1:length(cuts), binlab = cuts)
    
    dcuts
    ##       bin binlab
    ##     <int>  <num>
    ##  1:     1     -3
    ##  2:     2     -2
    ##  3:     3     -1
    ##  4:     4      0
    ##  5:     5      1
    ##  6:     6      2
    ##  7:     7      3
    ##  8:     8      4
    ##  9:     9      5
    ## 10:    10      6
    ## 11:    11      7
    ## 12:    12      8
    ## 13:    13      9
    ## 14:    14     10
    ## 15:    15     11
    ## 16:    16     12
    ## 17:    17     13
    ## 18:    18     14

    Then, I allocated each observation to a bin using the cut function:

    dx[, bin := cut(x1, breaks = cuts, labels = FALSE)]
    dx <- merge(dx, dcuts, by = "bin")
    
    dx
    ## Key: <bin>
    ##          bin   .id    id        x1 binlab
    ##        <int> <int> <int>     <num>  <num>
    ##     1:     1     1  1251 -2.097413     -3
    ##     2:     1     1  2215 -2.580587     -3
    ##     3:     1     1  2404 -2.042049     -3
    ##     4:     1     1  3228 -2.078958     -3
    ##     5:     1     1  5039 -2.055471     -3
    ##    ---                                   
    ## 29996:    17     3  7690 13.290347     13
    ## 29997:    17     3  8360 13.083991     13
    ## 29998:    17     3  8860 13.149421     13
    ## 29999:    17     3  9214 13.043727     13
    ## 30000:    17     3  9743 13.199752     13

    Finally, I calculated the distribution-specific proportions (showing only the second distribution):

    dp <- dx[, .N, keyby = .(.id, binlab)]
    dp[, p := N/sum(N), keyby = .id]
    
    dp[.id == 2]
    ## Key: <.id>
    ##       .id binlab     N      p
    ##     <int>  <num> <int>  <num>
    ##  1:     2     -3     3 0.0003
    ##  2:     2     -2    38 0.0038
    ##  3:     2     -1   130 0.0130
    ##  4:     2      0   340 0.0340
    ##  5:     2      1   619 0.0619
    ##  6:     2      2   938 0.0938
    ##  7:     2      3  1161 0.1161
    ##  8:     2      4  1155 0.1155
    ##  9:     2      5  1035 0.1035
    ## 10:     2      6   882 0.0882
    ## 11:     2      7   828 0.0828
    ## 12:     2      8   861 0.0861
    ## 13:     2      9   822 0.0822
    ## 14:     2     10   641 0.0641
    ## 15:     2     11   384 0.0384
    ## 16:     2     12   140 0.0140
    ## 17:     2     13    23 0.0023

    And now the facet plot will work just fine. Here is the code and the plot (again).

    ggplot(data = dp, aes(x = binlab, y = p)) +
      geom_bar(aes(fill = factor(.id), color = factor(.id)), stat = "identity", alpha = .4) +
      geom_line(data = dens, aes(x = x, y = y, color = factor(.id)), 
                linewidth = 2) +
      xlab("\nx1") +
      ylab("density\n") +
      scale_fill_manual(values = colors) +
      scale_color_manual(values = colors) +
      theme(panel.grid = element_blank(),
            plot.title = element_text(face = "bold", size = 10),
            legend.position = "none") + 
      facet_grid(~ .id)  

    To leave a comment for the author, please follow the link and comment on their blog: ouR data generation.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
    Continue reading: simstudy: another way to generate data from a non-standard density


沪ICP备19023445号-2号
友情链接