I’m thrilled to announce the release of schematic, an R package that helps you (the developer) communicate data validation problems to non-technical users. With schematic, you can leverage tidyselect selectors and other conveniences to compare incoming data against a schema, avoiding punishing issues caused by invalid or poor-quality data.
schematic can now be installed via CRAN:
install.packages("schematic")
Learn more about schematic by checking out the docs.
Having built and deployed a number of Shiny apps and APIs that require users to upload data, I noticed a common pain point: how do I tell users, in simple terms, that there are issues with their data and, more importantly, what those issues are? I needed a way to present the user with error messages that satisfy two needs:
is.logical means).
There already exist a number of data validation packages for R, including (but not limited to) pointblank, data.validator, and validate, so why introduce a new player? schematic certainly shares similarities with many of these packages, but where I think it innovates over existing solutions is in its unique combination of the following:
All R errors that appear in this post are intentional for the purpose of demonstrating schematic’s error messaging.
schematic is extremely simple. You only need to do two things: create a schema, then check a data.frame against that schema.
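A schema is a set of rules for columns in a data.frame. A rule consists of two parts: a selector, which identifies the column(s) the rule applies to, and a predicate, a function that checks those columns.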
Let’s imagine a scenario where we have survey data and we want to ensure it matches our expectations. Here’s some sample survey data:
survey_data <- data.frame(
  id = c(1:3, NA, 5),
  name = c("Emmett", "Billy", "Sally", "Woolley", "Duchess"),
  age = c(19.2, 10, 22.5, 19, 19),
  sex = c("M", "M", "F", "M", NA),
  q_1 = c(TRUE, FALSE, FALSE, FALSE, TRUE),
  q_2 = c(FALSE, FALSE, TRUE, TRUE, TRUE),
  q_3 = c(TRUE, TRUE, TRUE, TRUE, FALSE)
)
We declare a schema using schema() and provide it with rules following the format selector ~ predicate:
library(schematic)

my_schema <- schema(
  id ~ is_incrementing,
  id ~ is_all_distinct,
  c(name, sex) ~ is.character,
  c(id, age) ~ is_whole_number,
  education ~ is.factor,
  sex ~ function(x) all(x %in% c("M", "F")),
  starts_with("q_") ~ is.logical,
  final_score ~ is.numeric
)
Then we use check_schema() to evaluate our data against the schema. Any and all errors will be captured in the error message:
check_schema(
  data = survey_data,
  schema = my_schema
)
Error in `check_schema()`:
! Schema Error:
- Columns `education` and `final_score` missing from data
- Column `id` failed check `is_incrementing`
- Column `age` failed check `is_whole_number`
- Column `sex` failed check `function(x) all(x %in% c("M", "F"))`
The error message will combine columns into a single statement if they share the same validation issue. schematic will also automatically report if any columns declared in the schema are missing from the data.
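To make the "report everything at once" idea concrete, here is a minimal base R sketch of a validator that collects every failure instead of stopping at the first one. It is not how schematic is implemented; `collect_failures()`, `toy_data`, and `toy_rules` are hypothetical names invented for this illustration, and the rules are a plain named list rather than schematic's selector ~ predicate syntax.

```r
# Hypothetical mini-validator for illustration only.
# This is NOT schematic's implementation.
# `rules` is a named list that maps a column name to a predicate function.
collect_failures <- function(data, rules) {
  failures <- character(0)
  for (col in names(rules)) {
    if (!col %in% names(data)) {
      # Missing columns are reported instead of raising an error right away
      failures <- c(failures, sprintf("Column `%s` missing from data", col))
    } else if (!isTRUE(rules[[col]](data[[col]]))) {
      failures <- c(failures, sprintf("Column `%s` failed its check", col))
    }
  }
  failures
}

toy_data <- data.frame(age = c(19.2, 10), sex = c("M", NA))

toy_rules <- list(
  age = function(x) all(x == floor(x)),
  sex = function(x) all(x %in% c("M", "F")),
  final_score = is.numeric
)

# All three problems come back together, one message per rule
failures <- collect_failures(toy_data, toy_rules)
print(failures)
```

Gathering the messages into one vector before reporting is what lets a user fix every problem in a single pass rather than re-uploading after each error.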
By default the error message is helpful for developers, but a non-technical person will have trouble understanding some or all of the errors. You can customize the output of each rule by supplying the rule as a named argument; the name then replaces the predicate in the error message.
Let’s fix up the previous example to make the messages more understandable.
my_helpful_schema <- schema(
  "values are increasing" = id ~ is_incrementing,
  "values are all distinct" = id ~ is_all_distinct,
  "is a string" = c(name, sex) ~ is.character,
  "is a string with specific levels" = education ~ is.factor,
  "is a whole number (no decimals)" = c(id, age) ~ is_whole_number,
  "has only entries 'F' or 'M'" = sex ~ function(x) all(x %in% c("M", "F")),
  "includes only TRUE or FALSE" = starts_with("q_") ~ is.logical,
  "is a number" = final_score ~ is.numeric
)

check_schema(
  data = survey_data,
  schema = my_helpful_schema
)
Error in `check_schema()`:
! Schema Error:
- Columns `education` and `final_score` missing from data
- Column `id` failed check `values are increasing`
- Column `age` failed check `is a whole number (no decimals)`
- Column `sex` failed check `has only entries 'F' or 'M'`
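In a Shiny app or an API you usually want to show that message to the user rather than let the error halt the request. One way to do that, sketched here in base R, is to wrap the check in tryCatch() and keep the message text. `capture_validation()` is a hypothetical helper, not part of schematic, and the stop() call is only a stand-in so the sketch runs without the package; in practice you would pass your check_schema() call instead.

```r
# Hypothetical helper (not part of schematic): evaluate a validation
# expression and return its error message instead of raising the error.
capture_validation <- function(expr) {
  tryCatch(
    {
      force(expr)  # the lazily evaluated argument runs here, inside tryCatch
      list(ok = TRUE, message = NULL)
    },
    error = function(e) list(ok = FALSE, message = conditionMessage(e))
  )
}

# Stand-in for a failing check. With schematic loaded you might write:
#   capture_validation(check_schema(data = survey_data, schema = my_helpful_schema))
result <- capture_validation(
  stop("Column `age` failed check `is a whole number (no decimals)`")
)

if (!result$ok) {
  # In an app, this string could be shown in a notification or an API response
  cat(result$message, "\n")
}
```

Returning a small list rather than re-throwing keeps the decision about how to display the problem (a modal, a 400 response, a log line) with the calling code.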
And that’s really all there is to it. schematic does come with a few handy predicate functions like is_whole_number(), which is a more permissive version of is.integer() that allows for columns stored as numeric or double but still requires non-decimal values.
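The difference from is.integer() is easy to see in base R: is.integer() tests the storage type, so whole values stored as doubles fail it. The sketch below uses `whole_number_sketch()`, a hypothetical stand-in written for this illustration; it is not schematic's implementation of is_whole_number(), and its choice to ignore NA values is an assumption.

```r
# Hypothetical stand-in for a permissive whole-number check.
# This is NOT schematic's implementation of is_whole_number().
# Assumption: NA values are ignored rather than treated as failures.
whole_number_sketch <- function(x) {
  is.numeric(x) && all(x == floor(x), na.rm = TRUE)
}

ids_as_double <- c(1, 2, 3)   # whole values, but stored as double
ages <- c(19.2, 10, 22.5)     # contains decimal values

is.integer(ids_as_double)           # FALSE: the storage type is double
whole_number_sketch(ids_as_double)  # TRUE: every value is whole
whole_number_sketch(ages)           # FALSE: 19.2 and 22.5 have decimals
```

This matters for uploaded data because CSV readers and JSON parsers often hand back doubles even when every value in the column is a whole number.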
Moreover, schematic includes a handful of modifiers that allow you to change the behavior of some predicates, for instance, allowing NAs with mod_nullable():
# Before using `mod_nullable()` this rule triggered an error
my_schema <- schema(
  "all values are increasing (except empty values)" = id ~ mod_nullable(is_incrementing)
)

check_schema(
  data = survey_data,
  schema = my_schema
)
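Conceptually, a modifier is a function that takes a predicate and returns a new predicate. The base R sketch below shows one way that pattern could work; `incrementing_sketch()` and `nullable_sketch()` are hypothetical names made up for this illustration and are not schematic's internals.

```r
# Hypothetical predicate and modifier for illustration only.
# These are NOT schematic's internals.
incrementing_sketch <- function(x) isTRUE(all(diff(x) > 0))

# A modifier takes a predicate and returns a new predicate that
# drops NA values before applying the original one.
nullable_sketch <- function(predicate) {
  function(x) predicate(x[!is.na(x)])
}

id <- c(1:3, NA, 5)

incrementing_sketch(id)                   # FALSE: the NA makes the comparison indeterminate
nullable_sketch(incrementing_sketch)(id)  # TRUE: 1, 2, 3, 5 is increasing
```

Because the modifier wraps the predicate instead of changing it, the same wrapper can be reused with any predicate that should tolerate missing values.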
In the end, my hope is to make schematic as simple as possible and help both developers and users. It’s a package I designed initially with the sole intention of saving myself from writing validation code that takes up 80% of the actual codebase.¹ I hope you find it useful too.
This post was created using R version 4.5.0 (2025-04-11) and schematic version 0.1.0.
¹ Not an exaggeration. I have a Plumber API that allows users to POST data to be processed. 80% of that Plumber code is to validate the incoming data.