The purpose of schematic is to make data validation easy, expressive, and user-focused. A common pain point of data validation is communicating bad data to users. schematic helps by crafting informative error messages that report every schema problem at once. As a developer, you can use these messages to guide non-technical users in fixing problems with their data.
Let’s start with some sample data of 5 people who answered 3 yes/no questions.
survey_data <- data.frame(
  id = c(1:3, NA, 5),
  name = c("Emmett", "Billy", "Sally", "Woolley", "Duchess"),
  age = c(19.2, 10, 22.5, 19, 19),
  sex = c("M", "M", "F", "M", NA),
  q_1 = c(TRUE, FALSE, FALSE, FALSE, TRUE),
  q_2 = c(FALSE, FALSE, TRUE, TRUE, TRUE),
  q_3 = c(TRUE, TRUE, TRUE, TRUE, FALSE)
)
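A schema is a set of rules for columns in a data.frame. These rules usually concern the type of data and its contents. A rule consists of two parts: a selector, which picks the column(s) the rule applies to, and a predicate, a function that checks those columns.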
We declare a schema using schema(). We do not need to provide any data at this point, so the schema can be easily reused. schematic also has several predicate functions built in to address common validations.
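The examples that follow apply a schema named my_schema. A minimal sketch of such a schema, inferred from the error output further down and from my_helpful_schema, might look like the following. The is.factor predicate for education is an assumption: only the column name appears in the output, not the check applied to it.

```r
library(schematic)

# Sketch only: the rules mirror the error output and the named schema
# shown later in this document.
my_schema <- schema(
  id ~ is_incrementing,
  id ~ is_all_distinct,
  c(name, sex) ~ is.character,
  c(id, age) ~ is_whole_number,
  education ~ is.factor, # assumption: the actual predicate is not shown
  sex ~ function(x) all(x %in% c("M", "F")),
  starts_with("q_") ~ is.logical,
  final_score ~ is.numeric
)
```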
Each rule follows the format selector ~ predicate. Select the columns on the left of the tilde using tidyselect syntax; the right side can be any callable function.
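To make the predicate side concrete, here is a small base-R sketch. It assumes, as the examples in this document suggest, that a predicate receives the selected column as a vector and returns a single TRUE or FALSE. The name only_m_or_f is hypothetical and not part of schematic.

```r
# A predicate takes a whole column (a vector) and returns one TRUE or FALSE.
# `only_m_or_f` is a hypothetical name used for illustration.
only_m_or_f <- function(x) all(x %in% c("M", "F"))

only_m_or_f(c("M", "F", "M")) # TRUE
only_m_or_f(c("M", NA))       # FALSE: NA is not one of the allowed values
```

A named function like this could then presumably stand on the right-hand side of a rule, for example sex ~ only_m_or_f, in place of an inline anonymous function.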
Once the schema has been created, you can check the data against it with check_schema(). This runs every check in the schema and then reports all failures together in a single error message.
check_schema(
  data = survey_data,
  schema = my_schema
)
#> Error in `check_schema()`:
#> ! Schema Error:
#> - Columns `education` and `final_score` missing from data
#> - Column `id` failed check `is_incrementing`
#> - Column `age` failed check `is_whole_number`
#> - Column `sex` failed check `function(x) all(x %in% c("M", "F"))`
What distinguishes schematic from other data validation packages is its holistic error messaging: it reports every failure at once rather than stopping at the first one.
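The difference between stopping at the first failure and reporting every failure can be illustrated in base R. This is only a sketch of the idea, not how schematic is implemented; the check names and variables are made up for the example.

```r
# Illustrative base-R sketch; this is not schematic's implementation.
ages <- c(19.2, 10, 22.5, 19, 19)
sexes <- c("M", "M", "F", "M", NA)

# Hypothetical named checks, each evaluating to a single TRUE or FALSE.
checks <- list(
  "age is a whole number" = all(ages %% 1 == 0),
  "sex is 'M' or 'F'" = all(sexes %in% c("M", "F"))
)

# Fail-fast: stopifnot() stops at the first failing check,
# so the second problem is never mentioned.
fail_fast <- tryCatch(
  stopifnot(checks[[1]], checks[[2]]),
  error = function(e) conditionMessage(e)
)

# Collect-all: gather the name of every failing check before reporting.
failed <- names(checks)[!unlist(checks)]
print(failed)
```

With the fail-fast approach a user fixes one problem, reruns, and only then learns about the next; collecting all failures lets them fix everything in one pass.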
By default the error message is aimed at developers: it names each failed check by its predicate, which a non-technical person may have trouble understanding. You can customize the message for each rule by passing the rule as a named argument; the name then replaces the predicate in the output.
Let’s fix up the previous example to make the messages more understandable.
my_helpful_schema <- schema(
  "values are increasing" = id ~ is_incrementing,
  "values are all distinct" = id ~ is_all_distinct,
  "is a string" = c(name, sex) ~ is.character,
  "is a whole number (no decimals)" = c(id, age) ~ is_whole_number,
  "has only entries 'F' or 'M'" = sex ~ function(x) all(x %in% c("M", "F")),
  "includes only TRUE or FALSE" = starts_with("q_") ~ is.logical,
  "is a number" = final_score ~ is.numeric
)
check_schema(
  data = survey_data,
  schema = my_helpful_schema
)
#> Error in `check_schema()`:
#> ! Schema Error:
#> - Column `final_score` missing from data
#> - Column `id` failed check `values are increasing`
#> - Column `age` failed check `is a whole number (no decimals)`
#> - Column `sex` failed check `has only entries 'F' or 'M'`
Now the message is easier for a non-technical person to understand and could be exposed to users in a Shiny app or a plumber endpoint.
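One way to surface the message in an application is to catch the error and pass its text along. The sketch below assumes that check_schema() signals an ordinary R error, as the output above suggests. The helper schema_problems() and the small schema are hypothetical names for illustration, not part of schematic.

```r
library(schematic)

# Hypothetical helper (not part of schematic): returns NULL when the data
# passes the schema, otherwise the error message as a single string.
schema_problems <- function(data, schema) {
  tryCatch(
    {
      check_schema(data = data, schema = schema)
      NULL
    },
    error = function(e) conditionMessage(e)
  )
}

# A small schema with one user-friendly rule name.
small_schema <- schema(
  "is a whole number (no decimals)" = age ~ is_whole_number
)

msg <- schema_problems(data.frame(age = c(19.2, 10)), small_schema)
if (!is.null(msg)) message(msg)
```

In a Shiny app the returned string could be shown in a notification, and in a plumber endpoint it could be returned in the response body; both uses are left to the application.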