
Creating Schemas and Validating Data
RV schemas are nested list objects that are passed to the
Schema class. The class automatically re-orders and
validates the schema before it is supplied to the Validator
class for data validation. During validation, the schema definition
determines the data matching and rule application. As a result,
understanding schemas is central to using RV effectively. Most of this
vignette focuses on constructing schemas, with the final section
demonstrating how to use them for data validation.
Creating a schema list
The lists used for RV schemas are nested list objects with rule-named leaf elements that determine validation behaviour. The names or positions of the nested lists are used to match their rules to the corresponding data elements.
list(
  type = "data.frame",
  id = list(
    type = "numeric"
  ),
  email = list(
    type = "character",
    regex = "@gmail.com$"
  ),
  list(
    min_length = 2
  )
)
Rules at the top level of the nested list are applied to the whole data object. Nested list elements are matched to data elements by their name if present or by position if there is no name present.
When matching by position, rule elements are first removed, so the
first non-rule element will always be matched against [[1]]
of the matching data node. See the following illustrations:
#> <list>
#> ├─Top level rule: "Applied to `data`."
#> └─<list>
#>   ├─Depth 1 rule: "Applied to `data[[1]]`."
#>   └─<list>
#>     └─Depth 2 rule: "Applied to `data[[1]][[1]]`."
#> <list>
#> ├─Top level rule: "Applied to `data`."
#> └─x: <list>
#>   ├─Depth 1 rule: "Applied to `data[['x']]`."
#>   └─x: <list>
#>     └─Depth 2 rule: "Applied to `data[['x']][['x']]`."
This behaviour continues no matter the level of nesting, so it is possible to apply rules to deeply nested values.
#> <list>
#> └─<list>
#>   └─<list>
#>     └─<list>
#>       └─"Applied to `data[[1]][[1]][[1]]`."
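The matching behaviour described above can be approximated in a few lines of base R. This is only an illustrative sketch, not RV's implementation: `match_nodes` is a hypothetical helper, and `rule_names` is a small assumed subset of the rules RV recognises.

```r
# Hypothetical helper, not part of RV: for each nested schema node, report
# which data element it would be matched to (by name if named, otherwise by
# position once the rule elements have been removed).
match_nodes <- function(schema, rule_names = c("type", "min_length", "regex")) {
  nms <- names(schema)
  if (is.null(nms)) nms <- rep("", length(schema))
  # A rule element is a non-list element named after a known rule.
  is_rule <- nms %in% rule_names & !vapply(schema, is.list, logical(1))
  node_names <- nms[!is_rule]
  vapply(
    seq_along(node_names),
    function(i) {
      if (nzchar(node_names[i])) {
        sprintf("data[['%s']]", node_names[i])
      } else {
        sprintf("data[[%d]]", i)
      }
    },
    character(1)
  )
}

schema <- list(
  type = "list",
  x = list(type = "character"),
  list(min_length = 2)
)
match_nodes(schema)
```

Because the top-level `type` rule is dropped before counting, the unnamed node is reported against position 2 rather than position 3.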
Double hits with positional matching
When matching schema elements to data, the Validator
first attempts to match by name before falling back to positional
matching. As data elements are not flagged when validated (and thus can
be validated multiple times), this can cause unexpected behaviour. See
the following example where the data is matched twice:
Validator(
  data = list(x = 1L),
  schema = list(
    list(type = "integer"), # matched positionally
    x = list(type = "character") # matched by name to same element
  )
)@errors
#> [[1]]
#> [[1]]$type
#> NULL
#>
#>
#> $x
#> $x$type
#> [1] "Is not type `character`."
It is strongly encouraged to use fully named schemas/data unless you are certain about their structure.
Using the Schema class
The Schema class takes the nested list and re-orders it,
transforms certain rules from strings to functions where necessary, then
validates the schema.
Schema re-ordering
Rules are applied in three separate passes according to their category: ‘control’, ‘transform’, and ‘validate’, with a special rule ‘apply_last’ occurring in a final pass.
For each category, the Schema reorders the list on
ingest to match the order of the rule names in the corresponding
Registry property (if no Registry is supplied, Schema uses a
default, uncustomised one):
r <- Registry()
r@control_rules
#> [1] "required" "default"
r@transform_rules
#> [1] "coerce" "apply"
r@validate_rules
#> [1] "type" "inherits" "allowed" "forbidden"
#> [5] "unique" "positive" "negative" "finite"
#> [9] "allow_na" "min_val" "max_val" "min_length"
#> [13] "max_length" "min_nrow" "max_nrow" "min_nchar"
#> [17] "max_nchar" "nzchar" "regex" "labelled"
#> [21] "levels" "ordered_levels" "dependency" "dependencies"
#> [25] "predicate"
Schema(
  list(
    min_length = 2L,
    type = "integer",
    default = 10L,
    coerce = "double"
  )
)@schema
#> $default
#> [1] 10
#>
#> $coerce
#> [1] "double"
#>
#> $type
#> [1] "integer"
#>
#> $min_length
#> [1] 2
The order of the rules within each of the Registry
properties can be edited to specify a different order, which can then be
fed to the Schema:
r <- Registry()
# Move "min_length" to the front of the validate rules (exact name match)
r@validate_rules <- c("min_length", setdiff(r@validate_rules, "min_length"))
Schema(
  schema = list(
    min_length = 2L,
    type = "integer",
    default = 10L,
    coerce = "double"
  ),
  registry = r
)@schema
#> $default
#> [1] 10
#>
#> $coerce
#> [1] "double"
#>
#> $min_length
#> [1] 2
#>
#> $type
#> [1] "integer"
For more information about the builtin rules and how they operate, or
how to add custom rules to a Registry, see the builtin rules vignette and the adding
rules vignette.
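The re-ordering step can be pictured with a small base R sketch. It is a simplification under stated assumptions: `reorder_rules` is a hypothetical helper rather than an RV function, and `rule_order` is a shortened rule list that only preserves the relative order shown in the Registry output above.

```r
# Assumed, shortened rule order (control, then transform, then validate).
rule_order <- c("required", "default", "coerce", "apply", "type", "min_length")

# Hypothetical helper, not part of RV: sort named rules by their position in
# `rule_order`, leaving anything unrecognised at the end.
reorder_rules <- function(node, rule_order) {
  pos <- match(names(node), rule_order, nomatch = length(rule_order) + 1L)
  node[order(pos)]
}

node <- list(min_length = 2L, type = "integer", default = 10L, coerce = "double")
names(reorder_rules(node, rule_order))

# Moving "min_length" ahead of "type" changes the resulting order.
custom_order <- c("required", "default", "coerce", "apply", "min_length", "type")
names(reorder_rules(node, custom_order))
```

The two calls give the same rule orders as the default and customised Registry examples above.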
String to function conversion
Certain rules accept character strings as input, which are
turned into functions during schema validation. The rules that this
applies to can be found in the Registry, along with the
function that does the conversion. Both can be edited.
BEWARE: No check is made on the content of the string, so use the
builtin converter with extreme care for user inputs - it is vulnerable
to code injection. This functionality can be removed by setting
the @str_to_fn_rules property to an empty character vector.
r <- Registry()
r@str_to_fn_rules
#> [1] "apply" "apply_last" "predicate"
r@str_to_fn_converter
#> function (str)
#> {
#>     tryCatch(as.function(eval(str2lang(str))), error = function(cnd) {
#>         NULL
#>     })
#> }
#> <bytecode: 0x55ba9e3a6530>
#> <environment: namespace:RV>
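Because the converter printed above evaluates whatever the string parses to, one possible mitigation is to inspect the parsed expression before evaluating it. The following is a rough sketch only: `is_function_literal` is a hypothetical guard, not something RV provides, and how it would be wired into a custom converter is left open.

```r
# Hypothetical guard, not part of RV: TRUE only when the string parses to a
# single bare `function(...)` definition.
is_function_literal <- function(str) {
  expr <- tryCatch(str2lang(str), error = function(cnd) NULL)
  is.call(expr) && identical(expr[[1]], as.name("function"))
}

is_function_literal("function(x) x > 10")
is_function_literal("system('echo injected')")
is_function_literal("{ message('side effect'); function(x) x }")
```

A check like this only stops code from running at conversion time. The body of an accepted function still runs when the rule is applied, so it narrows the risk rather than removing it.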
Schema(
  list(predicate = "function(x) x > 10")
)@schema
#> $predicate
#> function (x)
#> x > 10
#> <environment: 0x55ba9f976e88>
Schema validation and errors
Schema objects validate their list input and store an
@errors property that highlights validation errors.
The @errors list mirrors the structure of the input
schema list with NULL elements where the schema is valid
and error messages where the schema is invalid:
Schema(
  list(
    type = "not a type",
    list(apply = 1),
    list(type = "character"),
    list(a = list(min_length = function(x) x + 1))
  )
)@errors
#> $type
#> [1] "`not a type` not found in allowed types."
#>
#> [[2]]
#> [[2]]$apply
#> [1] "Must be a function (or valid string)."
#>
#>
#> [[3]]
#> [[3]]$type
#> NULL
#>
#>
#> [[4]]
#> [[4]]$a
#> [[4]]$a$min_length
#> [1] "Must be a single, positive, non-NA integerish value."
Users can draw on this list for their own error messages. Alternatively, if the
@error property is set to TRUE, an error is raised
with the non-NULL elements forming the message (possibly
truncated according to @error_print_opts); see
below.
Note: the error messages from the Validator instead show
the locations of the data that failed validation, see the ‘Errors’
section of ‘Data validation’ below.
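As an illustration of using the `@errors` structure in your own messages, the sketch below walks a nested errors list and keeps only the non-NULL entries. It is an assumption-laden example: `collect_errors` is a hypothetical helper, and the `errors` list is rebuilt by hand to mirror the output shown above rather than taken from a real Schema object.

```r
# Hypothetical helper, not part of RV: return the non-NULL messages from a
# nested errors list, each labelled with its path.
collect_errors <- function(errors, path = character()) {
  out <- character()
  nms <- names(errors)
  for (i in seq_along(errors)) {
    label <- if (!is.null(nms) && nzchar(nms[i])) nms[i] else sprintf("[[%d]]", i)
    here <- c(path, label)
    if (is.list(errors[[i]])) {
      out <- c(out, collect_errors(errors[[i]], here))
    } else if (!is.null(errors[[i]])) {
      out <- c(out, paste0(paste(here, collapse = " > "), ": ", errors[[i]]))
    }
  }
  out
}

# Hand-built list with the same shape as the `@errors` output above.
errors <- list(
  type = "`not a type` not found in allowed types.",
  list(apply = "Must be a function (or valid string)."),
  list(type = NULL),
  list(a = list(min_length = "Must be a single, positive, non-NA integerish value."))
)
collect_errors(errors)
```

The valid node (`[[3]]`) contributes nothing, so three messages are returned.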
Schema nodes
For each schema node Schema validates that:
- There are no duplicate names.
- All leaf elements are named.
- Leaf elements are named with recognised rules.
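The three node checks listed above could be expressed roughly as follows in base R. This is a sketch for intuition, not RV's code: `check_node` is a hypothetical helper and `known_rules` is an assumed, shortened rule list.

```r
# Hypothetical helper, not part of RV: flag the three problems for one node.
check_node <- function(node, known_rules = c("type", "min_length")) {
  nms <- names(node)
  if (is.null(nms)) nms <- rep("", length(node))
  is_leaf <- !vapply(node, is.list, logical(1))
  named <- nms[nzchar(nms)]
  c(
    duplicate_names = anyDuplicated(named) > 0,
    unnamed_leaf = any(is_leaf & !nzchar(nms)),
    unknown_rule = any(is_leaf & nzchar(nms) & !(nms %in% known_rules))
  )
}

bad <- list(
  x = list(type = "character"),
  x = list(type = "integer"),
  "character",
  my_rule = 1L
)
good <- list(type = "character", x = list(min_length = 2L))

check_node(bad)
check_node(good)
```

The sketch inspects a single node only; a full check would recurse into each nested list in the same way.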
Schema(
  list(
    x = list(type = "character"),
    x = list(type = "integer"),
    list("character"),
    list(my_rule = 1L)
  ),
  error = TRUE
)
#> Error:
#> ! <RV::Schema> object is invalid:
#> - Schema validation failed with the following errors:
#> ├─ x: Names must be unique at the same depth.
#> ├─ x: Names must be unique at the same depth.
#> ├─ [[3]]
#> │  └─ [[1]]: Schema leafs must be named with rules.
#> └─ [[4]]
#>    └─ my_rule: Unknown rule: `my_rule`.
Rule validation
Each rule has an associated schema validation rule that checks the value given. For example, the ‘predicate’ rule checks that given values are either strings or functions. The ‘dependency’ rule checks that given values are either a character vector (names), a numeric integerish vector, or a non-nested list containing a mix of the two.
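To make the ‘dependency’ description concrete, here is one way such a value check might look in base R. It is a sketch under assumptions: `is_valid_dependency` and `is_integerish` are hypothetical names, and treating NA or non-finite numbers as invalid is a guess rather than documented RV behaviour.

```r
# Hypothetical helpers, not part of RV.
# Assumption: NA and non-finite values do not count as integerish.
is_integerish <- function(x) {
  is.numeric(x) && all(is.finite(x)) && all(x == trunc(x))
}

is_valid_dependency <- function(value) {
  # A character vector of names or an integerish vector of positions.
  if (is.character(value) || is_integerish(value)) {
    return(TRUE)
  }
  # Otherwise a non-nested list whose elements are each one of the two.
  is.list(value) &&
    all(vapply(
      value,
      function(el) is.character(el) || is_integerish(el),
      logical(1)
    ))
}

is_valid_dependency(c("a", "b"))
is_valid_dependency(list("a", 2L))
is_valid_dependency(1.5)
is_valid_dependency(list("a", list(1)))
```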
Data validation
Data validation in RV is undertaken with the Validator
class, which ingests a Schema and applies the rules within
to the input data.
Validation process
Validator walks through the Schema list
object, matching data elements by name or position and applying the
rule-based behaviour.
Order of evaluation
The validation walk sequences along each schema node and recurses into list elements - following this basic pattern:
recursive_walk <- function(lst) {
  for (i in seq_along(lst)) {
    if (!is.list(lst[[i]])) {
      # do rule...
    } else {
      # recurse into list node...
      lst[[i]] <- recursive_walk(lst[[i]])
    }
  }
  lst
}
This has implications if you want to access transformed data, so it is important to consider when designing schemas (see below).
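To see the order this pattern produces, the walk can be turned into a small tracer. The sketch below is illustrative only: `walk_order` is a hypothetical function, and it ignores the separate rule-category passes and the Schema re-ordering described earlier, following only the depth-first pattern.

```r
# Hypothetical tracer, not part of RV: record the path of each leaf in the
# order the recursive walk reaches it.
walk_order <- function(lst, path = "") {
  visited <- character()
  nms <- names(lst)
  for (i in seq_along(lst)) {
    label <- if (!is.null(nms) && nzchar(nms[i])) nms[i] else sprintf("[[%d]]", i)
    here <- paste0(path, "/", label)
    if (is.list(lst[[i]])) {
      # Recurse fully into this node before moving to the next sibling.
      visited <- c(visited, walk_order(lst[[i]], here))
    } else {
      visited <- c(visited, here)
    }
  }
  visited
}

schema <- list(
  type = "list",
  a = list(type = "character", b = list(type = "integer")),
  c = list(type = "double")
)
walk_order(schema)
```

Everything nested under `a` is visited before `c`, so a rule under `c` would see any transformation made under `a`, but not the other way round.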
Referencing transformed elements
RV validation can access transformed data elements on the fly, so any
rules that use .data to access other data nodes will be
accessing the data state at that point of the schema walk, rather than
the original state of the data. See the following example:
s <- Schema(
  list(
    list(
      apply = "function(x, .data, ...) if (.data[[2]] == 1) x + 1"
    ),
    list(
      apply = "function(x, .data, ...) if (.data[[2]] == 0) x + 1"
    ),
    list(
      apply = "function(x, .data, ...) if (.data[[2]] == 1) x + 2"
    ),
    list(
      apply = "function(x, .data, ...) if (.data[[3]] == 2) x + 3"
    )
  )
)
Validator(c(0, 0, 0, 0), s)@data
#> [1] 0 1 2 3
The first element remains 0 because .data[[2]]
had not yet been transformed when it was evaluated, whilst the third and
fourth elements both change because the .data elements they reference had
already been transformed by the time of their evaluation.
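The same effect can be reproduced without RV by writing each result back into the data before the next rule runs. This is a base R sketch, not RV's mechanism: the `rules` list is a hypothetical stand-in for the schema, and the explicit `else x` branch is an assumption made so that each function always returns a value.

```r
# Hypothetical stand-ins for the four `apply` rules in the schema above.
rules <- list(
  function(x, .data) if (.data[[2]] == 1) x + 1 else x,
  function(x, .data) if (.data[[2]] == 0) x + 1 else x,
  function(x, .data) if (.data[[2]] == 1) x + 2 else x,
  function(x, .data) if (.data[[3]] == 2) x + 3 else x
)

# Live state: write back immediately so later rules see earlier changes.
data <- c(0, 0, 0, 0)
for (i in seq_along(rules)) {
  data[[i]] <- rules[[i]](data[[i]], data)
}
data

# Frozen state: every rule sees the original data instead.
snapshot <- c(0, 0, 0, 0)
frozen <- vapply(
  seq_along(rules),
  function(i) rules[[i]](snapshot[[i]], snapshot),
  numeric(1)
)
frozen
```

With the live state the result matches the output above, whereas the frozen variant only changes the second element.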
Errors
Validator objects also store an @errors
property that highlights data validation errors. As with
Schema, this property mirrors the structure of the
input schema list, with NULL elements where the data is
valid and error messages where the data is invalid.
Validator(
  data = list(a = 1, b = 2),
  schema = list(
    type = "double",
    a = list(type = "character"),
    list(type = "array")
  )
)@errors
#> $type
#> [1] "Is not type `double`."
#>
#> $a
#> $a$type
#> [1] "Is not type `character`."
#>
#>
#> [[3]]
#> [[3]]$type
#> [1] "Is not type `array`."
However, when @error is set to TRUE in the
Validator, the schema paths in the message are
converted to the matched data positions. See the following
example:
Validator(
  data = list(a = 1, b = 2),
  schema = list(
    type = "double",
    a = list(type = "character"),
    list(type = "array")
  ),
  error = TRUE
)
#> Error:
#> ! <RV::Validator> object is invalid:
#> - Data validation failed with the following errors:
#> ├─ type: Is not type `double`.
#> ├─ a
#> │  └─ type: Is not type `character`.
#> └─ [[2]]
#>    └─ type: Is not type `array`.
Hence, the message about ‘array’ is reported for element [[2]],
because that is the data element it was matched to: rule elements in the
node are removed before positional matching.
Validator short-circuits if the input
Schema is invalid:
v <- Validator(1L, list(type = "not a type"))
v@errors
#> $valid_schema
#> [1] FALSE
v@Schema@errors
#> $type
#> [1] "`not a type` not found in allowed types."
Validator(1L, list(type = "not a type"), error = TRUE)
#> Error:
#> ! <RV::Validator> object is invalid:
#> - Schema validation failed with the following errors:
#> └─ type: `not a type` not found in allowed types.
Validating data from different sources
Putting it all together, you can flexibly validate data in R from a myriad of different sources.
YAML -> list -> RV.
yaml_schema <- yaml::yaml.load(
  "
  type: 'list'
  a:
    type: 'character'
  b:
    type: 'list'
    a:
      type: 'numeric'
    b:
      type: 'character'
      min_nchar: 3
  "
)
yaml_data <- yaml::yaml.load(
  "
  a: 1
  b:
    a: 1
    b: 'Hi'
  "
)
Validator(yaml_data, yaml_schema, error = TRUE)
#> Error:
#> ! <RV::Validator> object is invalid:
#> - Data validation failed with the following errors:
#> ├─ a
#> │  └─ type: Is not type `character`.
#> └─ b
#>    └─ b
#>       └─ min_nchar: Char length(s) must be at least 3.
JSON -> list -> RV.
json_schema <- jsonlite::fromJSON(
  '{
    "type": "list",
    "a": {
      "type": "numeric",
      "min_length": 2
    },
    "b": {
      "type": "list",
      "a": {
        "type": "numeric",
        "max_val": 5
      },
      "b": {
        "type": "character"
      }
    }
  }'
)
json_data <- jsonlite::fromJSON(
  '{
    "a": 1,
    "b": {
      "a": 10,
      "b": "Hi"
    }
  }'
)
Validator(json_data, json_schema, error = TRUE)
#> Error:
#> ! <RV::Validator> object is invalid:
#> - Data validation failed with the following errors:
#> ├─ a
#> │  └─ min_length: Length must be at least 2.
#> └─ b
#>    └─ a
#>       └─ max_val: Value(s) must be at most 5.
SPSS, SAV, Excel, etc. -> data.frame -> RV.
# rectangular data, from `readr` readme
# works for any data.frame data, e.g., sav, dta, xls, xlsx, csv, tsv, etc.
rect_schema <- list(
  type = "data.frame",
  chicken = list(type = "character", nzchar = TRUE),
  sex = list(coerce = "factor", levels = c("rooster", "hen")),
  eggs_laid = list(type = "integer", positive = TRUE),
  motto = list(type = "character", nzchar = TRUE)
)
rect_data <- readr::read_csv(
  readr::readr_example("chickens.csv"),
  show_col_types = FALSE
)
Validator(rect_data, rect_schema, error = TRUE)
#> Error:
#> ! <RV::Validator> object is invalid:
#> - Data validation failed with the following errors:
#> └─ eggs_laid
#>    └─ type: Is not type `integer`.
RV can validate any list, data.frame, or atomic vector.