[This article was first published on
pacha.dev/blog, and kindly contributed to
R-bloggers].
I’ve been busy with the field exams, so I haven’t had much time to work on the blog.
The spuriouscorrelations package started as a fun project for one of my tutorials.
Here is an interesting case: the number of people who drowned by falling into a pool and the number of films Nicolas Cage appeared in.
library(spuriouscorrelations)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(ggplot2)
unique(spurious_correlations$var1)
[1] Suicides by hanging, strangulation and suffocation
[2] Number of people who drowned by falling into a pool
[3] Number of people who died by becoming tangled in their bedsheets
[4] Murders by steam, hot vapours and hot objects
[5] Computer science doctorates awarded in the US
[6] Sociology doctorates awarded in the US
[7] Civil engineering doctorates awarded in the US
[8] People who drowned after falling out of a fishing boat
[9] Drivers killed in collision with railway train
[10] Total US crude oil imports
[11] Number of people who drowned while in a swimming-pool
[12] Suicides by crashing of motor vehicle
[13] Number of people killed by venomous spiders
[14] Mathematics doctorates awarded
14 Levels: Civil engineering doctorates awarded in the US ...
drownings <- spurious_correlations %>%
  filter(
    var1 == "Number of people who drowned by falling into a pool"
  ) %>%
  select(year, var1, var2, var1_value, var2_value)
cor(drownings$var1_value, drownings$var2_value)
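To make clear what that number measures, here is a minimal sketch of the Pearson correlation computed by hand and compared with cor(). The vectors x and y are toy values invented for illustration; they are not taken from the package.

```r
# Toy vectors, invented for illustration (NOT values from the package)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson correlation from its definition:
# sum of cross-deviations divided by the square root of the
# product of the sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# cor() uses method = "pearson" by default, so both should agree
r_builtin <- cor(x, y)
all.equal(r_manual, r_builtin)
```

The coefficient only measures how closely two series move together in a straight-line sense. It says nothing about whether one causes the other, which is the whole point of the package.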
Now let’s plot the data.
# compute a scale factor so that max(var2_value * factor) ≈ max(var1_value)
max1 <- max(drownings$var1_value)
max2 <- max(drownings$var2_value)
ratio <- max1 / max2
ggplot(drownings, aes(x = year)) +
  # left axis: drownings on their original scale
  geom_line(aes(y = var1_value, color = "Drownings")) +
  # films are rescaled so both lines share the left axis range
  geom_line(aes(y = var2_value * ratio, color = "Films")) +
  scale_y_continuous(
    name = "Number of drownings",
    # right axis: undo the rescaling so labels show film counts
    sec.axis = sec_axis(~ . / ratio,
      name = "Number of films"
    ),
    limits = c(0, NA)
  ) +
  scale_color_manual(
    name = "",
    values = c(
      "Drownings" = "blue",
      "Films" = "red"
    )
  ) +
  theme_minimal() +
  labs(
    title = "Number of people who drowned by falling into a pool vs.\nNumber of films Nicolas Cage appeared in",
    caption = "Source: Spurious Correlations (Vigen 2015)"
  )
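The dual-axis trick in the plot relies on a forward transform (multiply by ratio) and its inverse (divide by ratio). Here is a small base R sketch of that round trip. The vectors primary and secondary are hypothetical toy series, not data from the package.

```r
# Toy series, invented for illustration (NOT values from the package)
primary   <- c(100, 120, 90, 110)  # would be drawn against the left axis
secondary <- c(2, 4, 1, 3)         # would be labelled on the right axis

# Same scale factor as in the plot: ratio of the two maxima
ratio <- max(primary) / max(secondary)

# Forward transform: the values the second geom_line() actually draws
scaled <- secondary * ratio

# Inverse transform: what sec_axis(~ . / ratio) applies to the axis breaks
recovered <- scaled / ratio

# The rescaled series peaks at the same height as the primary one,
# and the inverse transform gives back the original values
max(scaled) == max(primary)
all.equal(recovered, secondary)
```

Because the right axis is only a relabelling of the left one, the apparent closeness of the two lines depends entirely on the chosen ratio. A different scale factor would make the same data look far less related.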
Interested? You can install the package from GitHub:
pak::pkg_install("pachadotdev/spuriouscorrelations")