If you are into running, chances are that you will be chasing your PB (personal best) times. This post is about using R to search for your PBs, and to monitor them over time.
Usually runners target four distances for PBs: 5 km, 10 km, half marathon and full marathon. It’s likely that a PB will come in a race of exactly that distance, but not necessarily. For example, you can hit a 5K PB during a 10K race. Therefore we need a way to scan our runs for the fastest segment at each distance to track our PBs correctly. Similarly, the course for your best ever half marathon event may have measured slightly longer than the half marathon distance, so what was your “real” PB for that distance?
Strava and other fitness tracking software will find your best time for a given distance within your activities, but here we’ll use R. We’ll look at the data first. Scroll down to see the code.
Here we are looking at all running data over several years. Each run is shown as a dot, and the blue line represents my PB. We can look at 5K, 10K, half and full marathon. For all distances, my PB has come down over time (obviously) but satisfyingly, I set PBs in all four distances this year.
Notice that the density of dots is highest for the 5K plot. It shows all runs of 5K or more. The same for the other distances. We search all runs to find the fastest segment at the set distance shown on the plot.
In the plots above, we cannot see what the total distance of each run is. Did the PB times come from the set distance or a longer run? To show this we can plot the fastest segment versus the total distance of the run.
These plots show that my 10K PB currently comes from a half marathon run. Also, I managed a few sub-20 min 5K times during runs of that distance too. Apart from 10K, my PBs came from running the set distance.
The script analyses the four main distances, but obviously it could search for any distance at all. Examples that others like to track include 1K, 30K and integer mile distances.
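Because the search takes its target distance in metres, tracking other distances only needs a different number. A minimal sketch, where the names `set_distances`, `mile_distances` and `tolerance_window` are my own labels for illustration and not part of the script:

```r
# Hypothetical lookup of set distances in metres (names are illustrative)
set_distances <- c(
  "1K" = 1000,
  "1 mile" = 1609.344,
  "5K" = 5000,
  "10K" = 10000,
  "Half" = 21097.5,
  "30K" = 30000,
  "Full" = 42195
)

# integer mile distances, here 1 to 5 miles
mile_distances <- (1:5) * 1609.344
names(mile_distances) <- paste(1:5, "mile")

# the 1% window either side of each distance that the search accepts
tolerance_window <- data.frame(
  distance = unname(set_distances),
  lower = unname(set_distances) * 0.99,
  upper = unname(set_distances) * 1.01,
  row.names = names(set_distances)
)
print(tolerance_window)
```

Any of these values could be passed as the distance argument of the search function in place of the four used here.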
As mentioned above, Strava calculates these in what it calls Best Efforts. It allows you to enter your real PBs in a separate section.
How does this compare? The code finds these PBs:
These are all faster than the Strava best efforts. All GPS data is noisy, and different platforms and software make different assumptions, which makes comparisons quite difficult. Note that times for the shortest set distances are more error-prone due to GPS blips. My 400 m PB is definitely not 45 s. The current world record is 43 s.
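A quick way to see why a 45 s 400 m has to be a GPS blip is to convert it to an average speed. The sketch below uses two helper functions, `speed_kmh` and `is_blip`, which are illustrative names of mine. The 16 km/h ceiling is the cut-off that the script applies to my own data, so treat it as an assumption to adjust for your own running.

```r
# average speed in km/h for a distance (m) covered in a time (s)
speed_kmh <- function(dist_m, time_s) {
  (dist_m / time_s) * 3.6
}

# the implausible 400 m "PB" of 45 s
speed_kmh(400, 45)

# a 20 min 5K for comparison
speed_kmh(5000, 20 * 60)

# flag segments at or above a chosen ceiling (assumed: 16 km/h)
is_blip <- function(dist_m, time_s, max_kmh = 16) {
  speed_kmh(dist_m, time_s) >= max_kmh
}

is_blip(400, 45)
is_blip(5000, 20 * 60)
```

The 400 m segment works out at double the ceiling, whereas a 20 min 5K sits just under it.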
My actual PBs, measured by chip timing, are slower than both.
Yes, I really need to work on that marathon time…
According to this analysis my calculated PBs are a little bit better than my actual PBs.
I think it’s best to be consistent and either go off one set of numbers or another, because ultimately, the only point of tracking PBs is to try to improve on them. For this, a single goal to shoot for is necessary.
I’m starting this script with a large GPX file containing all of my runs. Other starting points might be a bunch of GPX files, one for each run, but we need to assemble a data frame of the timepoints and distances from all runs to do this calculation.
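If you are starting from one GPX file per run, something like the following could assemble the single data frame. This is only a sketch: `read_all_gpx` and its `reader` argument are hypothetical helpers of mine, and it assumes that the reader returns a data frame with the same columns for every file, including a `time` column.

```r
# Sketch: combine several single-run GPX files into one data frame.
# `read_all_gpx` and `reader` are hypothetical names; the default reader
# wraps trackeR::readGPX, the function the script uses for its single file.
read_all_gpx <- function(files,
                         reader = function(f) {
                           trackeR::readGPX(file = f, timezone = "GMT")
                         }) {
  # read each file into its own data frame
  runs <- lapply(files, reader)
  # stack them (assumes identical columns in every data frame)
  combined <- do.call(rbind, runs)
  # sort by timestamp so that the activities are in chronological order
  combined <- combined[order(combined$time), ]
  rownames(combined) <- NULL
  combined
}

# usage, assuming one GPX file per run in Data/
# files <- list.files("Data", pattern = "\\.gpx$", full.names = TRUE)
# runDF <- read_all_gpx(files)
```

The cumulative distance will restart in each file, but the script recalculates distance within each activity after splitting on time gaps, so that should not matter as long as consecutive runs are more than a few minutes apart.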
We use trackeR for loading in the GPX files and a few other libraries to help with wrangling dates/times and for plotting.
The code needs to search for the fastest segment of a set distance. My solution is far from optimal but it uses a few steps to reduce computation time. First, only runs of the set distance or greater are considered, by subsetting the large data frame. Second, for each time point we calculate the distance to every other point as a vector and pick out the points that lie the set distance away (within 1%). We take the time to those points, compare it to the best found so far, and continue the search. This is considerably faster than comparing each point to every other point in nested for-loops. Third, if the maximum of the distance vector is less than the set distance, we have run out of points and can stop the search. Further improvements would come from parallelising the code.
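To make the second and third steps concrete, here is a stripped-down sketch of the search for a single activity, run on made-up data. The function name `fastest_time` and the `tol` argument are illustrative and do not appear in the script, which works in speeds and converts to a time afterwards.

```r
# Toy activity (made-up data): one point per second,
# 4 m/s for 100 s followed by 5 m/s for 100 s
speed <- c(rep(4, 100), rep(5, 100))
distance_id <- cumsum(speed)   # cumulative distance in metres
time_id <- seq_along(speed)    # cumulative time in seconds

fastest_time <- function(distance_id, time_id, dist, tol = 0.01) {
  best <- Inf
  for (j in seq_len(length(distance_id) - 1)) {
    # distance from point j to every point, as a vector
    these_dists <- distance_id - distance_id[j]
    # not enough distance left after point j, so stop searching
    if (max(these_dists) < dist * (1 - tol)) break
    these_times <- time_id - time_id[j]
    # points that are the set distance (within tolerance) away from j
    hit <- these_dists > dist * (1 - tol) & these_dists < dist * (1 + tol)
    if (!any(hit)) next
    # scale each candidate time to exactly the set distance, keep the best
    best <- min(best, these_times[hit] * dist / these_dists[hit])
  }
  best
}

# fastest 400 m: comes from the 5 m/s section, so 80 s
fastest_time(distance_id, time_id, dist = 400)

# the activity is only 900 m long, so no 1 km segment exists
fastest_time(distance_id, time_id, dist = 1000)
```

The inner work is a handful of vector operations per starting point rather than a second loop, which is where the speed-up over nested for-loops comes from.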
library(trackeR)
library(ggplot2)
library(zoo)
library(hms)
library(lubridate)

## Functions ----

find_fastest_segment <- function(data, dist) {
  data$best_segment <- NA
  # for each activity id in activityDF, we will subset runDF and calculate the
  # speed for the fastest segment
  for (i in seq_len(nrow(data))) {
    activity_id <- data$activity_id[i]
    # check that max_distance is greater than the segment distance, if not, skip
    if (data$max_distance[i] < dist * 0.99) {
      data$best_segment[i] <- NA
      next
    }
    subsetDF <- runDF[runDF$activity_id == activity_id, ]
    # report progress for every 10th activity
    if (i %% 10 == 0) {
      cat("Processing activity", i, "of", nrow(data), "\n")
    }
    # calculate the best segment
    best_segment <- 0
    for (j in seq_len(nrow(subsetDF) - 1)) {
      these_dists <- subsetDF$distance_id - subsetDF$distance_id[j]
      # check if the maximum distance is less than the segment distance
      if (max(these_dists, na.rm = TRUE) < dist * 0.99) {
        break
      }
      these_times <- subsetDF$time_id - subsetDF$time_id[j]
      target_dists <- these_dists[these_dists > (dist * 0.99) &
                                    these_dists < (dist * 1.01)]
      # check we have some target distances
      if (length(target_dists) == 0) {
        # should not encounter this but we will go again if so (rather than break)
        next
      }
      target_times <- these_times[these_dists > (dist * 0.99) &
                                    these_dists < (dist * 1.01)]
      target_speeds <- target_dists / target_times
      segment_speed <- max(target_speeds, na.rm = TRUE)
      if (segment_speed > best_segment) {
        best_segment <- segment_speed
        data$best_segment[i] <- best_segment
      }
    }
  }
  # for each of the best_segment column, convert the speed to km/h
  # and convert to a time for the set distance
  # the column best_segment is in m/s, convert to km/h
  data$best_kmh <- data$best_segment * 3.6
  # convert the speed to a time in seconds for the set distance
  data$pace <- (3600 / data$best_kmh) * (dist / 1000)
  # convert the pace to a hms object
  data$pace <- hms::as_hms(data$pace)
  # remove rows with >= 16 kmh because there are gps blips in the data
  data <- data[data$best_kmh < 16, ]
  # remove rows with NA in the best_segment column
  data <- data[!is.na(data$best_segment), ]

  return(data)
}

calculate_pb_pace <- function(data) {
  # make a new column the best pace at that point in time, i.e. if the pace is
  # faster than previous that becomes the new best until it is surpassed
  data$best_pace <- NA
  for (i in seq_len(nrow(data))) {
    if (i == 1) {
      data$best_pace[i] <- data$pace[i]
    } else {
      if (data$pace[i] < data$best_pace[i - 1]) {
        data$best_pace[i] <- data$pace[i]
      } else {
        data$best_pace[i] <- data$best_pace[i - 1]
      }
    }
  }

  return(data)
}

## Script ----

# find the gpx file in Data/ if there is more than one, select a gpx file
# in the Data/ folder
files <- list.files(path = "Data", pattern = "\\.gpx$", full.names = TRUE)
if (length(files) > 1) {
  cat("Please select a GPX file:\n")
  for (i in seq_along(files)) {
    cat(i, ": ", basename(files[i]), "\n", sep = "")
  }
  choice <- as.integer(readline(prompt = "Enter the number of the file: "))
  filepath <- files[choice]
} else {
  filepath <- files[1]
}

runDF <- readGPX(file = filepath, timezone = "GMT")
# save runDF as an r data object
save(runDF, file = "Output/Data/runDF.RData")
# .. or load the runDF object
load("Output/Data/runDF.RData")

# calculate point-to-point distance from the cumulative distance
runDF$dist_point <- c(0, diff(runDF$distance, lag = 1))
# time calculations
runDF$time_temp <- strptime(runDF$time, format = "%Y-%m-%d %H:%M:%S")
runDF$time_temp <- as.numeric(runDF$time_temp)
runDF$time_point <- c(0, diff(runDF$time_temp, lag = 1))
# if time_point is greater than 380 s it means that it is a different activity
# so we set it to 0
runDF$time_point[runDF$time_point > 380] <- 0
# if time_point is 0 we set the dist_point to 0
runDF$dist_point[runDF$time_point == 0] <- 0
# recalculate the cumulative distance
runDF$distance <- cumsum(runDF$dist_point)
# label the activities with an id number that we can use to refer to them
runDF$activity_id <- cumsum(runDF$time_point == 0)
# for each activity calculate the cumulative distance and time
runDF$distance_id <- ave(runDF$dist_point, runDF$activity_id,
                         FUN = function(x) cumsum(x))
runDF$time_id <- ave(runDF$time_point, runDF$activity_id,
                     FUN = function(x) cumsum(x))

# make a dataframe with the activity id and maximum cumulative distance
runDF$max_distance <- ave(runDF$distance_id, runDF$activity_id, FUN = max)
runDF$the_time <- ave(runDF$time, runDF$activity_id, FUN = min)
activityDF <- data.frame(activity_id = unique(runDF$activity_id),
                         max_distance = unique(runDF$max_distance),
                         time = unique(runDF$the_time))
runDF$max_distance <- runDF$the_time <- NULL

# make separate dataframes for each distance
activityDF_5k <- activityDF[activityDF$max_distance > 5000 * 0.99, ]
activityDF_10k <- activityDF[activityDF$max_distance > 10000 * 0.99, ]
activityDF_half <- activityDF[activityDF$max_distance > 21097.5 * 0.99, ]
activityDF_full <- activityDF[activityDF$max_distance > 42195 * 0.99, ]

# do the calculations for each distance
activityDF_5k <- find_fastest_segment(activityDF_5k, 5000)
activityDF_10k <- find_fastest_segment(activityDF_10k, 10000)
activityDF_half <- find_fastest_segment(activityDF_half, 21097.5)
activityDF_full <- find_fastest_segment(activityDF_full, 42195)

# remove activity 410
activityDF_5k <- activityDF_5k[activityDF_5k$activity_id != 410, ]

# find pb over time
activityDF_5k <- calculate_pb_pace(activityDF_5k)
activityDF_10k <- calculate_pb_pace(activityDF_10k)
activityDF_half <- calculate_pb_pace(activityDF_half)
activityDF_full <- calculate_pb_pace(activityDF_full)

# save each of these dfs as an r data object
save(activityDF_5k, file = "Output/Data/activityDF_5k.RData")
save(activityDF_10k, file = "Output/Data/activityDF_10k.RData")
save(activityDF_half, file = "Output/Data/activityDF_half.RData")
save(activityDF_full, file = "Output/Data/activityDF_full.RData")

## Plots ----

ggplot(activityDF_5k, aes(x = time, y = pace)) +
  geom_point(aes(color = "Pace")) +
  geom_line(aes(y = best_pace, color = "Best Pace")) +
  labs(x = "Time", y = "5K Time") +
  # scale y from 0 to 30 min
  scale_y_continuous(limits = c(0, 1800), breaks = seq(0, 1800, by = 300),
                     labels = function(x) {
                       paste0(floor(x / 60), ":", sprintf("%02d", x %% 60))
                     }) +
  scale_color_manual(values = c("blue", "#ff00003f")) +
  theme_bw() +
  theme(legend.position = "none")
ggsave("Output/Plots/activityDF_5k_plot.png",
       width = 8, height = 6, bg = "white", dpi = 300)

ggplot(activityDF_10k, aes(x = time, y = pace)) +
  geom_point(aes(color = "Pace")) +
  geom_line(aes(y = best_pace, color = "Best Pace")) +
  labs(x = "Time", y = "10K Time") +
  # scale y from 0 to 70 min
  scale_y_continuous(limits = c(0, 4200), breaks = seq(0, 4200, by = 300),
                     labels = function(x) {
                       paste0(floor(x / 60), ":", sprintf("%02d", x %% 60))
                     }) +
  scale_color_manual(values = c("blue", "#ff00003f")) +
  theme_bw() +
  theme(legend.position = "none")
ggsave("Output/Plots/activityDF_10k_plot.png",
       width = 8, height = 6, bg = "white", dpi = 300)

ggplot(activityDF_half, aes(x = time, y = pace)) +
  geom_point(aes(color = "Pace")) +
  geom_line(aes(y = best_pace, color = "Best Pace")) +
  labs(x = "Time", y = "HM Time") +
  # scale y from 0 to 150 min
  scale_y_continuous(limits = c(0, 9000), breaks = seq(0, 9000, by = 600),
                     labels = function(x) {
                       paste0(floor(x / 3600), ":",
                              sprintf("%02d", floor(x / 60) %% 60), ":",
                              sprintf("%02d", x %% 60))
                     }) +
  scale_color_manual(values = c("blue", "#ff00003f")) +
  theme_bw() +
  theme(legend.position = "none")
ggsave("Output/Plots/activityDF_half_plot.png",
       width = 8, height = 6, bg = "white", dpi = 300)

ggplot(activityDF_full, aes(x = time, y = pace)) +
  geom_point(aes(color = "Pace")) +
  geom_line(aes(y = best_pace, color = "Best Pace")) +
  labs(x = "Time", y = "Marathon Time") +
  # scale y from 0 to 300 min
  scale_y_continuous(limits = c(0, 18000), breaks = seq(0, 18000, by = 1200),
                     labels = function(x) {
                       paste0(floor(x / 3600), ":",
                              sprintf("%02d", floor(x / 60) %% 60), ":",
                              sprintf("%02d", x %% 60))
                     }) +
  scale_color_manual(values = c("blue", "#ff00003f")) +
  theme_bw() +
  theme(legend.position = "none")
ggsave("Output/Plots/activityDF_full_plot.png",
       width = 8, height = 6, bg = "white", dpi = 300)

# graph of best 5k pace as a function of max_distance
ggplot(activityDF_5k, aes(x = max_distance, y = pace)) +
  geom_point(color = "#ff00003f") +
  labs(x = "Max Distance", y = "Best 5K Pace") +
  # scale y from 0 to 30 min
  scale_y_continuous(limits = c(0, 1800), breaks = seq(0, 1800, by = 300),
                     labels = function(x) {
                       paste0(floor(x / 60), ":", sprintf("%02d", x %% 60))
                     }) +
  lims(x = c(0, 70000)) +
  scale_color_manual(values = c("blue", "#ff00003f")) +
  theme_bw() +
  theme(legend.position = "none")
ggsave("Output/Plots/activityDF_5k_max_distance_plot.png",
       width = 8, height = 6, bg = "white", dpi = 300)

# graph of best 10k pace as a function of max_distance
ggplot(activityDF_10k, aes(x = max_distance, y = pace)) +
  geom_point(color = "#ff00003f") +
  labs(x = "Max Distance", y = "Best 10K Pace") +
  # scale y from 0 to 70 min
  scale_y_continuous(limits = c(0, 4200), breaks = seq(0, 4200, by = 300),
                     labels = function(x) {
                       paste0(floor(x / 60), ":", sprintf("%02d", x %% 60))
                     }) +
  lims(x = c(0, 70000)) +
  theme_bw() +
  theme(legend.position = "none")
ggsave("Output/Plots/activityDF_10k_max_distance_plot.png",
       width = 8, height = 6, bg = "white", dpi = 300)

# graph of best half pace as a function of max_distance
ggplot(activityDF_half, aes(x = max_distance, y = pace)) +
  geom_point(color = "#ff00003f") +
  labs(x = "Max Distance", y = "Best Half Pace") +
  # scale y from 0 to 150 min
  scale_y_continuous(limits = c(0, 9000), breaks = seq(0, 9000, by = 600),
                     labels = function(x) {
                       paste0(floor(x / 3600), ":",
                              sprintf("%02d", floor(x / 60) %% 60), ":",
                              sprintf("%02d", x %% 60))
                     }) +
  lims(x = c(0, 70000)) +
  theme_bw() +
  theme(legend.position = "none")
ggsave("Output/Plots/activityDF_half_max_distance_plot.png",
       width = 8, height = 6, bg = "white", dpi = 300)

# what is the last value of best pace?
# it's in seconds - convert to minutes and seconds
last_5k_pace <- activityDF_5k$best_pace[nrow(activityDF_5k)]
last_10k_pace <- activityDF_10k$best_pace[nrow(activityDF_10k)]
last_half_pace <- activityDF_half$best_pace[nrow(activityDF_half)]
last_full_pace <- activityDF_full$best_pace[nrow(activityDF_full)]
cat("Last 5K pace:", floor(last_5k_pace / 60), "min",
    last_5k_pace %% 60, "sec\n")
cat("Last 10K pace:", floor(last_10k_pace / 60), "min",
    last_10k_pace %% 60, "sec\n")
cat("Last Half pace:", floor(last_half_pace / 60), "min",
    last_half_pace %% 60, "sec\n")
cat("Last Full pace:", floor(last_full_pace / 60), "min",
    last_full_pace %% 60, "sec\n")
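As an aside, the running best that calculate_pb_pace builds with a loop is a cumulative minimum, so base R's `cummin()` should give the same result in one call. A small sketch with made-up times, assuming the rows are in chronological order and contain no missing values:

```r
# made-up times in seconds for successive runs, oldest first
pace <- c(1500, 1450, 1480, 1390, 1420, 1385)

# best time so far at each point: the cumulative minimum
best_pace <- cummin(pace)
best_pace

# which runs set a new PB? (the first run counts as one)
is_pb <- pace == best_pace & !duplicated(best_pace)
which(is_pb)
```

Here the PB line steps down at runs 1, 2, 4 and 6, and stays flat in between.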
—
The post title comes from “A Pace Far Different” by Gladie from their Safe Sins album.