Last week I posted about the proportion of p values that will be ‘fragile’ under various types of scientific experiments or other inferences. The proportion fragile had been defined as the proportion of statistically significant p values that fall between 0.01 and 0.05. I was reacting to the question of whether it’s a good thing that the proportion fragile in the psychology literature seems to have declined over a decade or two from about 32% to about 26%. I had concluded that this didn’t really tell us much about progress in response to the replication crisis.
There was some good discussion on BlueSky when I posted on this and I would now qualify my views a little. The key points that people made that I hadn’t adequately taken into account in my presentation were:
This made me decide I should look more systematically into the difference between the “planned for” difference between two samples and the “actual” difference in the population. I had done some simulations in which the actual difference varied randomly around the planned-for one, but they were not systematic or comprehensive enough.
What I’ve done now is systematically compare every combination of a sampling strategy that gives 80% power for “minimum detectable differences” from 0.05 to 0.60 (12 different values, spaced 0.05 apart) with actual differences in means between 0.0 and 1.0 (11 different values, spaced 0.10 apart). With 10,000 simulations at each combination we have 12 x 11 x 10,000 = 1.32 million simulations.
We can see that many reasonable combinations of “planned for” and “actual” differences between the two populations give fragile proportions of p values that are quite different from 26%. In particular, when the actual difference is larger than the “minimum detectable difference” that the sample size was calculated to detect with 80% power (which is probably what most researchers are aiming for), the proportion fragile quickly drops well below 26%.
That proportion ranges from 80% when the real difference is zero (i.e. the null hypothesis is true and the p value distribution is uniform over [0, 1]), down to well below 10% when the sample size gives much more than 80% power to pick up the real difference between the two populations.
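Those endpoints can also be checked without simulation. For a two sample t test, the expected proportion of significant p values that are fragile is just the power at a 0.05 threshold minus the power at a 0.01 threshold, divided by the power at 0.05; under the null, power equals the threshold, so the ratio is (0.05 - 0.01) / 0.05 = 80%. Here’s a rough sketch of that check using base R’s power.t.test rather than the pwrss package used in the main code below; the 64 per group is just an illustrative sample size (roughly what 80% power for a difference of 0.5 standard deviations needs), not something taken from the simulations.
# Rough analytical check of the expected proportion fragile, no simulation needed:
frag <- function(d, n){
  p05 <- power.t.test(n = n, delta = d, sd = 1, sig.level = 0.05)$power
  p01 <- power.t.test(n = n, delta = d, sd = 1, sig.level = 0.01)$power
  (p05 - p01) / p05
}

frag(d = 0,   n = 64)  # null hypothesis true: returns 0.8
frag(d = 0.5, n = 64)  # actual difference equals the one planned for
frag(d = 1.0, n = 64)  # actual difference much bigger than planned for; well below 10%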
I think this is better at presenting the issues than my blog last week. Here’s how I think about this issue now:
Here’s the R code to run those 1.32 million simulations, making use of the foreach and doParallel R packages for parallel computing to speed things up a bit:
library(pwrss)
library(tidyverse)
library(scales)      # for percent(), used in the plots further down
library(glue)
library(foreach)
library(doParallel)
#' Function to run a two sample experiment with t test on difference of means
#' @returns a single p value
#' @param d difference in means of the two populations that samples are drawn
#'   from. If sd1 and sd2 are both 1, then d is a proportion of that sd and
#'   everything is scaleless.
experiment <- function(d, m1 = 0, sd1 = 1, sd2 = 1, n1 = 50, n2 = n1, seed = NULL){
  if(!is.null(seed)){
    set.seed(seed)
  }
  x1 <- rnorm(n1, m1, sd1)
  x2 <- rnorm(n2, m1 + d, sd2)
  t.test(x1, x2)$p.value
}
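# A couple of quick, illustrative checks of that helper (not part of the main
# simulation):
experiment(d = 0, seed = 123)  # no real difference, so p is just one draw from (roughly) a uniform distribution
experiment(d = 1, seed = 123)  # a large difference with the default 50 per group, so p should be tiny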
#--------------varying difference and sample sizes---------------
pd1 <- tibble(planned_diff = seq(from = 0.05, to = 0.60, by = 0.05)) |>
  # what sample size do we need to have 80% power, based on that planned
  # "minimum difference to detect"?:
  mutate(n_for_power = sapply(planned_diff, function(d){
    as.numeric(pwrss.t.2means(mu1 = d, power = 0.8, verbose = FALSE)$n[1])
  }))
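# A quick cross-check, just for reassurance (not part of the original analysis):
# base R's power.t.test should give essentially the same per-group sample sizes
# for 80% power as pwrss.t.2means does above:
pd1 |>
  mutate(n_base_r = sapply(planned_diff, function(d){
    power.t.test(delta = d, sd = 1, power = 0.8)$n
  }))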
# the actual differences, which can be from zero to much bigger than the minimum
# we planned power to detect:
pd2 <- tibble(actual_diff = seq(from = 0, to = 1.0, by = 0.1))
# Number of simulations to do for each combination of planned power (based on a
# given 'minimum difference to detect') and actual power (based on the real
# difference). When the real difference is zero, only about 1 in 20 reps will
# come up with a 'significant' difference, so 10,000 reps in total gives us a
# sample of about 500 significant tests to estimate the proportion fragile
# from, so still not huge. If I'd bothered I could have varied the number of
# reps per combination based on the number really needed, but I didn't bother.
reps_each <- 10000
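# Rough Monte Carlo precision check (my addition, not part of the original
# analysis): in the worst case of about 500 significant results when the null
# is true, a proportion of around 0.8 is estimated with a standard error of
# roughly sqrt(0.8 * 0.2 / 500), i.e. about 0.02, which is fine for the
# broad-brush comparisons in the charts below:
sqrt(0.8 * 0.2 / 500)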
# combine the planned-for and actual differences in a data frame with a row
# for each repeated sim we are going to do:
data <- expand_grid(pd1, pd2) |>
  mutate(link = 1) |>
  full_join(tibble(link = 1,
                   rep = 1:reps_each),
            relationship = "many-to-many", by = "link") |>
  select(-link) |>
  mutate(p = NA)
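# (Note, added here as an aside: an equivalent and slightly more direct way to
# build the same grid would be expand_grid(pd1, pd2, rep = 1:reps_each) |>
# mutate(p = NA); the link-and-join approach above gives the same rows.)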
print(glue("Running {nrow(data)} simulations. This will take a while."))
# set up parallel processing cluster
cluster <- makeCluster(7)
registerDoParallel(cluster)
clusterEvalQ(cluster, {
  library(foreach)
})
clusterExport(cluster, c("data", "experiment"))
results <- foreach(i = 1:nrow(data), .combine = rbind) %dopar% {
  set.seed(i)

  # the row of data for just this simulation:
  d <- data[i, ]

  # perform the simulation and capture the p value:
  d$p <- experiment(d = d$actual_diff,
                    n1 = d$n_for_power,
                    seed = d$rep)

  # return the result as a row of data, which will be rbinded into a single data
  # frame of all the parallel processes:
  return(d)
}
stopCluster(cluster)
#--------------summarise and present results--------------------------
# Summarise and calculate the proportions:
res <- results |>
  group_by(planned_diff, n_for_power, actual_diff) |>
  summarise(number_sig = sum(p < 0.05),
            prop_fragile = sum(p > 0.01 & p < 0.05) / number_sig)
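# Sanity check (my addition, not in the original post): when the actual
# difference is zero, the proportion fragile should be close to the theoretical
# 80%, whatever "minimum detectable difference" the sample size was planned for:
res |>
  ungroup() |>
  filter(actual_diff == 0) |>
  summarise(avg_prop_fragile = mean(prop_fragile))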
Here’s how I draw two plots to summarise that: the line plot shown above, and a heatmap shown below the code. Overall I think the line plot is the easier and clearer of the two to read.
# Line chart plot:
res |>
  mutate(pd_lab = glue("80% power planned for diff of {planned_diff}")) |>
  ggplot(aes(x = actual_diff, y = prop_fragile)) +
  facet_wrap(~pd_lab) +
  geom_vline(aes(xintercept = planned_diff), colour = "steelblue") +
  geom_hline(yintercept = 0.26, colour = "orange") +
  geom_point() +
  geom_line() +
  scale_y_continuous(labels = percent) +
  labs(y = "Proportion of significant <i>p</i> values that are between 0.01 and 0.05",
       x = "Actual difference (in standard deviations)",
       subtitle = "Vertical blue line shows where the actual difference equals the minimum difference to detect that the 80% power calculation was based upon.
Horizontal orange line shows the observed average proportion of 'fragile' p values in the recent psychology literature.",
       title = "Fragility of p values in relation to actual and planned differences in a two sample t test.")
# Heatmap:
res |>
  ggplot(aes(x = actual_diff, y = as.ordered(planned_diff), fill = prop_fragile)) +
  geom_tile() +
  geom_tile(data = filter(res, prop_fragile > 0.25 & prop_fragile < 0.31),
            fill = "white", alpha = 0.1, colour = "white", linewidth = 2) +
  scale_fill_viridis_c(labels = percent, direction = -1) +
  theme(panel.grid.minor = element_blank()) +
  labs(y = "Smallest detectable difference for 80% power",
       x = "Actual difference (in standard deviations)",
       fill = "Proportion of significant p values that are between 0.01 and 0.05:",
       subtitle = "Sample size is based on 80% power for the difference on the vertical axis. White boxes indicate where the proportion of fragile significant p values is between 25% and 31%.",
       title = "Fragility of p values in relation to actual and planned differences in a two sample t test.")
Here’s the heatmap we get from that. It’s prettier but I think not actually as clear as the simpler, faceted line plot.
That’s all for now. I don’t think there are any big implications from this, just a better understanding of what proportion of p values we might expect to be in this fragile range and what impacts on it.
I’m planning on looking at some real life data, on a completely different topic, in my next post.