One of my goals for the simstudy
package is to make it as easy as possible to generate data from a wide range of data distributions. The recent update created the possibility of generating data from a customized distribution specified in a user-defined function. Last week, I added two functions, genDataDist
and addDataDist
, that allow data generation from an empirical distribution defined by a vector of integers. (See here for how to download latest development version.) This post provides a simple illustration of the new functionality.
Here are the libraries needed, in case you want to follow along:
library(simstudy) library(data.table) library(ggplot2) set.seed(1234)
The target density is simply defined by specifying a vector that is intended to loosely represent a data distribution. We start by specifying the vector (which can be of any length):
base_data <- c(1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10)
We can look at the density to make sure this is the distribution we are interested in drawing our data from:
emp_density <- density(base_data, n = 10000) den_curve <- data.table(x = emp_density$x, y = emp_density$y) ggplot(data = den_curve, aes(x = x, y = y)) + geom_line(linewidth = 1) + scale_y_continuous(name = "density\n", limits = c(0, 0.11), breaks = seq(0, .10, by = .02)) + scale_color_manual(values = colors) + theme(panel.grid = element_blank())
Actually drawing samples from this distribution is a simple call to genDataDensity
. The key argument is the data distribution as as represented by the vector of integers:
dx <- genDataDensity(10000, dataDist = base_data, varname = "x1")
Here’s a look at the sampled data and their relationship to the target density:
ggplot(data = dx, aes(x=x1)) + geom_histogram(aes(y = after_stat(count / sum(count))), binwidth = 1, fill = "grey", color = "black", alpha = .2) + geom_line(data = den_curve, aes(x = x, y = y), color = "black", linewidth = 2) + scale_y_continuous(name = "density\n", limits = c(0, 0.11), breaks = seq(0, .10, by = .02)) + scale_x_continuous(limits = c(-6, 15), breaks = seq(-5, 10, by = 5)) + theme(panel.grid = element_blank(), plot.title = element_text(face = "bold", size = 10))
Just to show that this was not a fluke, here are three additional target distributions, specified with three different vectors:
base_data <- list( c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10), c(1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 5, 6, 6, 7, 7, 7, 8, 9, 10, 10, 10, 10, 10), c(1, 2, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10) )
We can generate data from each of the distributions and then confirm that each one adequately fits its target distribution:
dx1 <- genDataDensity(10000, dataDist = base_data[[1]], varname = "x1") dx2 <- genDataDensity(10000, dataDist = base_data[[2]], varname = "x1") dx3 <- genDataDensity(10000, dataDist = base_data[[3]], varname = "x1")
Here is a little more detail in case someone might find it useful to have the code that generates the “facet” plot. In the plot with the single distribution, I specified the histogram with this command:
geom_histogram(aes(y = after_stat(count / sum(count))), ...)
When I tried to apply this to the “facet” plot, the denominator of that plot (sum(count)
) was not calculated for each subgroup (i.e., dataset), but was the total across all datasets. As a result, the dataset-specific proportions were underestimated; we can see that here:
dx <- rbindlist(list(dx1, dx2, dx3), idcol = TRUE) ggplot(data = dx, aes(x=x1)) + geom_histogram( aes(y = after_stat(count / sum(count)), fill = factor(.id), color = factor(.id)), binwidth = 1, alpha = .2) + geom_line(data = dens, aes(x = x, y = y, color = factor(.id)), linewidth = 2) + xlab("\nx1") + ylab("density\n") + scale_fill_manual(values = colors) + scale_color_manual(values = colors) + theme(panel.grid = element_blank(), plot.title = element_text(face = "bold", size = 10), legend.position = "none") + facet_grid(~ .id)
I looked around for a way to address this, but couldn’t find anything that obviously addressed this shortcoming (though I am convinced it must be possible, and I just couldn’t locate the solution). I considered using ggarrange
or something similar, but was not satisfied with the results. Instead, it turned out to be faster just to calculate the proportions myself. This is the process I used:
First, I created a dataset with the bins (using a bin size of 1):
cuts <- seq(dx[,floor(min(x1))], dx[,ceiling(max(x1))], by = 1) dcuts <- data.table(bin = 1:length(cuts), binlab = cuts) dcuts ## bin binlab ## <int> <num> ## 1: 1 -3 ## 2: 2 -2 ## 3: 3 -1 ## 4: 4 0 ## 5: 5 1 ## 6: 6 2 ## 7: 7 3 ## 8: 8 4 ## 9: 9 5 ## 10: 10 6 ## 11: 11 7 ## 12: 12 8 ## 13: 13 9 ## 14: 14 10 ## 15: 15 11 ## 16: 16 12 ## 17: 17 13 ## 18: 18 14
Then, I allocated each observation to a bin using the cut
function:
dx[, bin := cut(x1, breaks = cuts, labels = FALSE)] dx <- merge(dx, dcuts, by = "bin") dx ## Key: <bin> ## bin .id id x1 binlab ## <int> <int> <int> <num> <num> ## 1: 1 1 1251 -2.097413 -3 ## 2: 1 1 2215 -2.580587 -3 ## 3: 1 1 2404 -2.042049 -3 ## 4: 1 1 3228 -2.078958 -3 ## 5: 1 1 5039 -2.055471 -3 ## --- ## 29996: 17 3 7690 13.290347 13 ## 29997: 17 3 8360 13.083991 13 ## 29998: 17 3 8860 13.149421 13 ## 29999: 17 3 9214 13.043727 13 ## 30000: 17 3 9743 13.199752 13
Finally, I calculated the distribution-specific proportions (showing only the second distribution):
dp <- dx[, .N, keyby = .(.id, binlab)] dp[, p := N/sum(N), keyby = .id] dp[.id == 2] ## Key: <.id> ## .id binlab N p ## <int> <num> <int> <num> ## 1: 2 -3 3 0.0003 ## 2: 2 -2 38 0.0038 ## 3: 2 -1 130 0.0130 ## 4: 2 0 340 0.0340 ## 5: 2 1 619 0.0619 ## 6: 2 2 938 0.0938 ## 7: 2 3 1161 0.1161 ## 8: 2 4 1155 0.1155 ## 9: 2 5 1035 0.1035 ## 10: 2 6 882 0.0882 ## 11: 2 7 828 0.0828 ## 12: 2 8 861 0.0861 ## 13: 2 9 822 0.0822 ## 14: 2 10 641 0.0641 ## 15: 2 11 384 0.0384 ## 16: 2 12 140 0.0140 ## 17: 2 13 23 0.0023
And now the facet plot will work just fine. Here is the code and the plot (again).
ggplot(data = dp, aes(x = binlab, y = p)) + geom_bar(aes(fill = factor(.id), color = factor(.id)), stat = "identity", alpha = .4) + geom_line(data = dens, aes(x = x, y = y, color = factor(.id)), linewidth = 2) + xlab("\nx1") + ylab("density\n") + scale_fill_manual(values = colors) + scale_color_manual(values = colors) + theme(panel.grid = element_blank(), plot.title = element_text(face = "bold", size = 10), legend.position = "none") + facet_grid(~ .id)