Front-door adjustment: a superhero method for handling unobserved confounding by using a mediator, when one is present, to estimate causal effects accurately.
library(dagitty)

dag <- dagitty('dag {
bb="0,0,1,1"
U [pos="0.402,0.320"]
X [exposure,pos="0.176,0.539"]
Y [outcome,pos="0.614,0.539"]
Z [pos="0.409,0.539"]
U -> X
U -> Y
X -> Z
Z -> Y
}')

plot(dag)
You can easily draw a DAG on the dagitty web interface, then copy the generated model code and slot it inside dagitty('dag { ... }'), and voila!
- X is our treatment variable.
- Y is our outcome variable.
- U is our unmeasured confounder.
- Z is our mediator.

Because U is unmeasured, we cannot simply adjust for it; the dagitty sketch below makes this concrete.
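Here is a minimal sketch (not in the original post) using dagitty’s adjustmentSets() and latents(). With every node observed, dagitty would tell us to adjust for U; once we declare U latent on a copy of the graph, no back-door adjustment set exists, which is exactly why we reach for the front-door approach:

# If U were measured, back-door adjustment would work:
adjustmentSets(dag)
## expected: { U }

# Declare U as latent (unmeasured) on a copy of the DAG ...
dag_lat <- dag
latents(dag_lat) <- "U"

# ... and no back-door adjustment set remains (empty output):
adjustmentSets(dag_lat)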
library(tidyverse)
library(DT)

# Set seed for reproducibility
set.seed(1)

# Set sample size
n <- 1000

# Set noise for each node
ux <- rnorm(n)
uy <- rnorm(n)
uz <- rnorm(n)
uu <- rnorm(n)

# Set each node's structural equation
u <- uu
x <- 0.4*u + ux
z <- 0.6*x + uz
y <- -0.5*z + uy + 0.6*u

# Create a table
df <- tibble(x = x, y = y, z = z, u = u)

# For easier viewing on blog
datatable(df)
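Before modeling, we can also sanity-check that the simulated data are consistent with the DAG. This is a sketch, not part of the original post: dagitty’s localTests() tests the DAG’s implied conditional independencies (here via partial correlations), and the rename() step is only needed because our data columns are lowercase while the DAG nodes are uppercase.

# Do the data obey the DAG's implied conditional independencies?
# Estimates near zero / large p-values indicate consistency.
df_dag <- df %>% rename(X = x, Y = y, Z = z, U = u)
localTests(dag, df_dag, type = "cis")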
The reason to assign these specific coefficients is so that we know the ground truth: x affects y only through z, so the true total effect of x on y is 0.6 × (-0.5) = -0.3.

The Right Model

In order for us to know what a wrong model is, we have to know what the right model is first. Since we simulated the data, we have the luxury of adjusting for u, which closes the back-door path x ← u → y:

correct_model <- lm(y ~ x + u, df)
summary(correct_model)

## 
## Call:
## lm(formula = y ~ x + u, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4698 -0.7571 -0.0529  0.7869  3.8100 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.02431    0.03641  -0.668    0.505    
## x           -0.31841    0.03520  -9.045   <2e-16 ***
## u            0.61805    0.03810  16.220   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.151 on 997 degrees of freedom
## Multiple R-squared:  0.2142, Adjusted R-squared:  0.2126 
## F-statistic: 135.8 on 2 and 997 DF,  p-value: < 2.2e-16

Looking at the coefficient for x, we get -0.318, close to the true -0.3.

The Wrong Model

Now, let’s look at the wrong models. Let’s say, naively, we fit just y on x:

model_no_front <- lm(y ~ x, df)
summary(model_no_front)

## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1379 -0.8724 -0.0593  0.9068  4.0849 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.01287    0.04091  -0.315  0.75319   
## x           -0.09505    0.03641  -2.611  0.00917 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.294 on 998 degrees of freedom
## Multiple R-squared:  0.006784, Adjusted R-squared:  0.005789 
## F-statistic: 6.817 on 1 and 998 DF,  p-value: 0.009167

Did you notice that the x coefficient is now -0.095, nowhere near the true -0.3? The open back-door path through the unmeasured u biases the estimate.

Let’s look at another wrong model: what happens if we control for z?

model <- lm(y ~ z + x, df)
summary(model)

## 
## Call:
## lm(formula = y ~ z + x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0450 -0.8006  0.0104  0.8316  3.9043 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.005363   0.037782  -0.142    0.887    
## z           -0.483048   0.036701 -13.162  < 2e-16 ***
## x            0.216663   0.041125   5.268 1.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.195 on 997 degrees of freedom
## Multiple R-squared:  0.1538, Adjusted R-squared:  0.1521 
## F-statistic: 90.61 on 2 and 997 DF,  p-value: < 2.2e-16

Wait, what? Now the x coefficient has flipped sign to 0.217! Conditioning on the mediator z blocks the very path that carries the causal effect and leaves only the bias flowing through u. (In dagitty, adjusting for z in this DAG is flagged as a wrong adjustment for exactly this reason.)

What Do We Do? In Comes The Front-door Adjustment!

If we’re lucky, z satisfies the front-door criterion: it intercepts all of the effect of x on y, there is no unblocked back-door path from x to z, and every back-door path from z to y is blocked by x. Let’s look at the equation for the front-door adjustment:

P(y | do(x)) = Σ_z P(z | x) Σ_x' P(y | x', z) P(x')

In our linear simulation this reduces to multiplying two regression coefficients. It is divided into two estimation steps, plus a final step to put them together.

1. Estimate Z under do(X)
Did you notice that there is no need to adjust for anything at all in this step? The only back-door path from X to Z is X ← U → Y ← Z, and Y is a collider on that path, so it is already blocked. A plain regression of z on x therefore identifies P(z | do(x)).

# 1. Estimate Z under do(X)
model2 <- lm(z ~ x, df)
summary(model2)

## 
## Call:
## lm(formula = z ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5756 -0.6448 -0.0196  0.7013  2.7987 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01553    0.03258   0.477    0.634    
## x            0.64531    0.02900  22.254   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.03 on 998 degrees of freedom
## Multiple R-squared:  0.3317, Adjusted R-squared:  0.331 
## F-statistic: 495.2 on 1 and 998 DF,  p-value: < 2.2e-16

The x coefficient is 0.645, close to the true 0.6 we put into the simulation.
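If the collider argument is hard to see, dagitty can enumerate the paths for us. A minimal sketch (this call is not in the original post; paths() lists every path between two nodes and flags which ones are open):

# Every path from X to Z; $open flags the unblocked ones
paths(dag, from = "X", to = "Z")
## expected: X -> Z is open; X <- U -> Y <- Z is closed (collider at Y)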
2. Estimate Y under do(Z)
Did you notice that this time we do need to adjust for one node, and it is one we actually have data on? The back-door path Z ← X ← U → Y is open, and conditioning on X blocks it, achieving d-separation. Let’s check out this model:

# 2. Estimate Y under do(Z), adjusting for X
model <- lm(y ~ z + x, df)
summary(model)

## 
## Call:
## lm(formula = y ~ z + x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0450 -0.8006  0.0104  0.8316  3.9043 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.005363   0.037782  -0.142    0.887    
## z           -0.483048   0.036701 -13.162  < 2e-16 ***
## x            0.216663   0.041125   5.268 1.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.195 on 997 degrees of freedom
## Multiple R-squared:  0.1538, Adjusted R-squared:  0.1521 
## F-statistic: 90.61 on 2 and 997 DF,  p-value: < 2.2e-16

The z coefficient is -0.483, close to the true -0.5.
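Again we can let dagitty do the bookkeeping. A minimal sketch (not in the original post), reusing dag_lat from earlier where U is declared latent; asking for the adjustment set for the Z → Y effect should return { X }:

# With U latent, the only available adjustment set for Z -> Y is {X}
adjustmentSets(dag_lat, exposure = "Z", outcome = "Y")
## expected: { X }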
3. Put it together: x coefficient × z coefficient

The true total effect of x on y in our simulation is 0.6 × (-0.5) = -0.3, so that is the number the front-door product should recover.
Let’s see if this is true with our code and data:

# front-door estimate of the x -> y effect
model$coefficients[["z"]] * model2$coefficients[["x"]]

## [1] -0.3117137

Our estimate is -0.312. Yes, it’s not going to be exactly the true value, but it’s close enough! You can mess around with the seed, the sample size, and the coefficients to see how well the front-door estimate holds up; a small helper for that follows below.
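Here is a small helper for that experiment. It is a sketch, not part of the original post: front_door_sim() is a hypothetical function name, and it simply re-runs the simulation above with a different seed and returns the naive and front-door estimates side by side (true effect: -0.3).

front_door_sim <- function(seed = 1, n = 1000) {
  set.seed(seed)
  # same structural equations as the simulation above
  u <- rnorm(n)
  x <- 0.4 * u + rnorm(n)
  z <- 0.6 * x + rnorm(n)
  y <- -0.5 * z + 0.6 * u + rnorm(n)
  d <- data.frame(x, y, z)

  # naive regression vs. front-door product of coefficients
  naive      <- coef(lm(y ~ x, d))[["x"]]
  front_door <- coef(lm(z ~ x, d))[["x"]] * coef(lm(y ~ z + x, d))[["z"]]
  c(naive = naive, front_door = front_door)
}

# Try a few seeds; front_door should hover around the true -0.3
sapply(1:5, front_door_sim)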
Lessons learnt

- An unmeasured confounder biases the naive y ~ x estimate, and adjusting for the mediator z makes things worse; it can even flip the sign of the effect.
- When a mediator satisfies the front-door criterion, the causal effect is recoverable from two simple regressions: multiply the x coefficient from z ~ x by the z coefficient from y ~ z + x.
- dagitty makes it easy to draw the DAG, spot colliders, and check which adjustments are valid.