Handling meta data

Benjamin Becker

2024-10-09

Understanding data usually requires not only the data itself but also so-called meta data. Meta data usually includes variable labels, value labels, and codes that mark missing values.

While the data.frame class in R supports value labels to a certain degree through the factor class, this functionality is limited. Other data formats like .xlsx or .csv support no meta data at all. Commercial software like SPSS provides such functionality but cannot compete with the variety of tools for analyzing data that R provides.

eatGADS is an R package that was developed to bridge this gap. Its main purpose is to provide a data format in R specifically designed for storing meta data together with data in one place. To this end, it provides an S3 class called GADSdat. This vignette concentrates on how to import data into the GADSdat format and work with it in the R environment. In collaboration with the IQB Forschungsdatenzentrum (FDZ) the package can also be used to distribute data.

Note that eatGADS also allows the handling of large hierarchical data structures via relational databases. This functionality is explained in more detail in an additional vignette.

Setup

The package can be installed from GitHub. Note that older R versions had issues with installations from online repositories like GitHub. R version > 3.6.0 should work without any issues.

devtools::install_github("beckerbenj/eatGADS")
# loading the package
library(eatGADS)
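
Before installing, it can help to confirm that the running R version is above 3.6.0 and that devtools is available. The following is a minimal sketch in base R; the object names r_ok and has_devtools are illustrative only.

```r
# Check whether the running R version is above 3.6.0
r_ok <- getRversion() > "3.6.0"
if (!r_ok) {
  warning("R ", as.character(getRversion()),
          " detected; installing from GitHub may fail.")
}

# Check whether devtools is installed (needed for install_github)
has_devtools <- requireNamespace("devtools", quietly = TRUE)
if (!has_devtools) {
  message("devtools is not installed; run install.packages(\"devtools\") first.")
}
```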

Importing data into the GADSdat format

Importing from SPSS

R offers a variety of tools to import data from all sorts of data formats. SPSS data (.sav files) can be imported directly into the GADSdat format, with haven used as a backend. Note that this is the easiest way to import data into the GADSdat format.

# importing an SPSS file
gads <- import_spss("path/example.sav")

Importing from Excel etc.

All other file types should be imported into R first and then supplied as data.frames to import_raw. Below is a small selection of functions that import data as data.frames. For an extensive overview of importing functions using the package readr see also this book chapter, while the package readxl is explained in more detail on its homepage (https://readxl.tidyverse.org/). As these files are plain data files, meta data has to be supplied as separate data sheets.

Note that none of the data.frames can contain variables of the class factor, as this in itself constitutes meta data. If using base R to import data make sure to use the argument stringsAsFactors = FALSE. If necessary, convert factors to character via as.character.
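
If a data.frame already contains factors, they can be converted in one step. The following base R sketch uses a made-up data.frame (raw and its columns are hypothetical) and converts every factor column to character.

```r
# Hypothetical data.frame that still contains a factor column
raw <- data.frame(ID = 1:3, school = factor(c("A", "B", "A")))

# Identify factor columns and convert each of them to character
is_fac <- vapply(raw, is.factor, logical(1))
raw[is_fac] <- lapply(raw[is_fac], as.character)

# Inspect the resulting column classes
vapply(raw, class, character(1))
```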

# importing text files
input_txt <- read.table("path/example.txt", stringsAsFactors = FALSE)
# importing German csv files (; separated)
input_csv <- read.csv2("path/example.csv", stringsAsFactors = FALSE)
# importing Excel files
input_xlsx <- readxl::read_excel("path/example.xlsx")

import_raw takes three separate data.frames as input: the actual data set (df), the variable labels (varLabels) and the value labels (valLabels). These three objects have to be supplied in a very specific format.

The varLabels object has to contain two variables: varName, which should exactly correspond to the variable names in df, and varLabel, which should contain the desired variable labels as strings. Note that this data.frame should contain as many rows as there are variables in df.

The optional valLabels object has to contain four variables: varName, which should exactly correspond to the variable names in df; value, which should correspond to the respective values in df and has to be a numeric vector (labels for character vectors are currently not supported); valLabel, which should contain the value labels as strings; and missings, a column indicating whether the value is a missing code. Valid entries for missings are "valid" (no missing code) and "miss" (missing code). Note that this data.frame cannot contain any varName entries that are not variables in df. However, not all variables in df have to occur in valLabels.
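
The format rules for the value labels can be checked before calling import_raw. The helper below is a hypothetical base R sketch and not part of eatGADS; check_valLabels, demo_df and demo_valLabels are names made up for illustration. It returns a character vector of problems, which is empty if the input looks fine.

```r
# Hypothetical helper (not part of eatGADS): check value labels against df
check_valLabels <- function(df, valLabels) {
  required <- c("varName", "value", "valLabel", "missings")
  missing_cols <- setdiff(required, names(valLabels))
  if (length(missing_cols) > 0) {
    return(paste("missing column(s):", paste(missing_cols, collapse = ", ")))
  }
  problems <- character(0)
  # value has to be numeric
  if (!is.numeric(valLabels$value)) {
    problems <- c(problems, "value must be numeric")
  }
  # every varName has to be a variable in df
  unknown <- setdiff(unique(valLabels$varName), names(df))
  if (length(unknown) > 0) {
    problems <- c(problems,
                  paste("varName not in df:", paste(unknown, collapse = ", ")))
  }
  # missings may only contain "valid" or "miss"
  bad <- setdiff(unique(valLabels$missings), c("valid", "miss"))
  if (length(bad) > 0) {
    problems <- c(problems,
                  paste("invalid missings entry:", paste(bad, collapse = ", ")))
  }
  problems
}

# Made-up demo objects
demo_df <- data.frame(ID = 1:2, sex = c(0, 1))
demo_valLabels <- data.frame(varName = c("sex", "sex"), value = c(0, 1),
                             valLabel = c("male", "female"),
                             missings = c("valid", "valid"),
                             stringsAsFactors = FALSE)
check_valLabels(demo_df, demo_valLabels)

# A variable name that does not occur in demo_df is reported as a problem
demo_bad <- demo_valLabels
demo_bad$varName <- "gender"
check_valLabels(demo_df, demo_bad)
```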

# Example data set
df <- data.frame(ID = 1:4, sex = c(0, 0, 1, 1), 
                 forename = c("Tim", "Bill", "Ann", "Chris"), stringsAsFactors = FALSE)
# Example variable labels
varLabels <- data.frame(varName = c("ID", "sex", "forename"), 
                        varLabel = c("Person Identifier", "Sex as self reported", 
                                     "first name as reported by teacher"), 
                        stringsAsFactors = FALSE)
# Example value labels
valLabels <- data.frame(varName = rep("sex", 3), 
                        value = c(0, 1, -99), 
                        valLabel = c("male", "female", "missing - omission"), 
                        missings = c("valid", "valid", "miss"), stringsAsFactors = FALSE)

df
#>   ID sex forename
#> 1  1   0      Tim
#> 2  2   0     Bill
#> 3  3   1      Ann
#> 4  4   1    Chris
varLabels
#>    varName                          varLabel
#> 1       ID                 Person Identifier
#> 2      sex              Sex as self reported
#> 3 forename first name as reported by teacher
valLabels
#>   varName value           valLabel missings
#> 1     sex     0               male    valid
#> 2     sex     1             female    valid
#> 3     sex   -99 missing - omission     miss

# import 
gads <- import_raw(df = df, varLabels = varLabels, valLabels = valLabels)

GADSdat class

The resulting object is of the class GADSdat and contains a data sheet and a meta data sheet.

# Inspect resulting object
gads 
#> $dat
#>   ID sex forename
#> 1  1   0      Tim
#> 2  2   0     Bill
#> 3  3   1      Ann
#> 4  4   1    Chris
#> 
#> $labels
#>    varName                          varLabel format display_width labeled value           valLabel
#> 1       ID                 Person Identifier   <NA>            NA      no    NA               <NA>
#> 2      sex              Sex as self reported   <NA>            NA     yes   -99 missing - omission
#> 3      sex              Sex as self reported   <NA>            NA     yes     0               male
#> 4      sex              Sex as self reported   <NA>            NA     yes     1             female
#> 5 forename first name as reported by teacher   <NA>            NA      no    NA               <NA>
#>   missings
#> 1     <NA>
#> 2     miss
#> 3    valid
#> 4    valid
#> 5     <NA>
#> 
#> attr(,"class")
#> [1] "GADSdat" "list"

Saving GADSdat objects

GADSdat objects can be saved, for example, as RDS files. This is also the preferred data format for distributing GADSdat objects to the FDZ.

# Save the GADSdat object as an RDS file
saveRDS(gads, "path/gads.RDS")
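
A saved object can be read back with readRDS from base R. The sketch below assumes eatGADS is installed, writes to a temporary file instead of a real path, and uses small made-up objects (df_small, varLabels_small, gads_small) so that it runs on its own.

```r
library(eatGADS)

# Small GADSdat object, built with import_raw (value labels are optional)
df_small <- data.frame(ID = 1:2, sex = c(0, 1))
varLabels_small <- data.frame(varName = c("ID", "sex"),
                              varLabel = c("Person Identifier",
                                           "Sex as self reported"),
                              stringsAsFactors = FALSE)
gads_small <- import_raw(df = df_small, varLabels = varLabels_small)

# Save to a temporary file and read the object back in
rds_path <- tempfile(fileext = ".RDS")
saveRDS(gads_small, rds_path)
gads_back <- readRDS(rds_path)

# The class and the meta data survive the round trip
class(gads_back)
extractMeta(gads_back)
```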

Using GADSdat objects in R

eatGADS provides convenient functions for extracting data and meta data from GADSdat objects. extractMeta is used to access the meta data for specific variables (or all variables, if no specific variable name is provided).

# Extract meta data for one variable, then for all variables
extractMeta(gads, vars = c("sex"))
#>   varName             varLabel format display_width labeled value           valLabel missings
#> 2     sex Sex as self reported   <NA>            NA     yes   -99 missing - omission     miss
#> 3     sex Sex as self reported   <NA>            NA     yes     0               male    valid
#> 4     sex Sex as self reported   <NA>            NA     yes     1             female    valid
extractMeta(gads)
#>    varName                          varLabel format display_width labeled value           valLabel
#> 1       ID                 Person Identifier   <NA>            NA      no    NA               <NA>
#> 2      sex              Sex as self reported   <NA>            NA     yes   -99 missing - omission
#> 3      sex              Sex as self reported   <NA>            NA     yes     0               male
#> 4      sex              Sex as self reported   <NA>            NA     yes     1             female
#> 5 forename first name as reported by teacher   <NA>            NA      no    NA               <NA>
#>   missings
#> 1     <NA>
#> 2     miss
#> 3    valid
#> 4    valid
#> 5     <NA>

extractData is used to extract data. Its arguments define the structure of the resulting data. If convertMiss = TRUE (the default), values that are listed as missing codes are recoded to NA. The convertLabels argument specifies how value labels are used: if set to "character", all labeled values are recoded to character; if set to "factor", they are recoded to factor; if set to "numeric", the value labels are not applied.

# Extract data without applying labels
dat1 <- extractData(gads, convertMiss = TRUE, convertLabels = "numeric")
dat1
#>   ID sex forename
#> 1  1   0      Tim
#> 2  2   0     Bill
#> 3  3   1      Ann
#> 4  4   1    Chris

dat2 <- extractData(gads, convertMiss = TRUE, convertLabels = "character")
dat2
#>   ID    sex forename
#> 1  1   male      Tim
#> 2  2   male     Bill
#> 3  3 female      Ann
#> 4  4 female    Chris
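
The example data above contains no missing codes, so the effect of convertMiss is not visible there. The following sketch assumes eatGADS is installed and uses a made-up data set (df_miss and the related objects are hypothetical) in which one person has the missing code -99.

```r
library(eatGADS)

# Made-up data set in which the third person has the missing code -99
df_miss <- data.frame(ID = 1:3, sex = c(0, 1, -99))
varLabels_miss <- data.frame(varName = c("ID", "sex"),
                             varLabel = c("Person Identifier",
                                          "Sex as self reported"),
                             stringsAsFactors = FALSE)
valLabels_miss <- data.frame(varName = rep("sex", 3),
                             value = c(0, 1, -99),
                             valLabel = c("male", "female", "missing - omission"),
                             missings = c("valid", "valid", "miss"),
                             stringsAsFactors = FALSE)
gads_miss <- import_raw(df = df_miss, varLabels = varLabels_miss,
                        valLabels = valLabels_miss)

# Missing codes are recoded to NA
with_na <- extractData(gads_miss, convertMiss = TRUE, convertLabels = "numeric")
# Missing codes are kept as they are
raw_codes <- extractData(gads_miss, convertMiss = FALSE, convertLabels = "numeric")

with_na$sex
raw_codes$sex
```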

Modifying GADSdat objects

GADSdat objects can also be modified, although only a limited number of operations is supported. For smaller changes to the data and meta data, a number of convenience functions exist. These functions allow modifying variable labels (changeVarLabels), modifying variable names (changeVarNames) and recoding values (recodeGADS).

### wrapper functions
# Modify variable labels
gads2 <- changeVarLabels(gads, varName = c("ID"), varLabel = c("Test taker ID"))
extractMeta(gads2, vars = "ID")
#>   varName      varLabel format display_width labeled value valLabel missings
#> 1      ID Test taker ID   <NA>            NA      no    NA     <NA>     <NA>

# Modify variable name
gads3 <- changeVarNames(gads, oldNames = c("ID"), newNames = c("idstud"))
extractMeta(gads3, vars = "idstud")
#>   varName          varLabel format display_width labeled value valLabel missings
#> 1  idstud Person Identifier   <NA>            NA      no    NA     <NA>     <NA>
extractData(gads3)
#>   idstud    sex forename
#> 1      1   male      Tim
#> 2      2   male     Bill
#> 3      3 female      Ann
#> 4      4 female    Chris

# recode GADS
gads4 <- recodeGADS(gads, varName = "sex", oldValues = c(0, 1, -99), newValues = c(1, 2, 99))
extractMeta(gads4, vars = "sex")
#>   varName             varLabel format display_width labeled value           valLabel missings
#> 2     sex Sex as self reported   <NA>            NA     yes     1               male    valid
#> 3     sex Sex as self reported   <NA>            NA     yes     2             female    valid
#> 4     sex Sex as self reported   <NA>            NA     yes    99 missing - omission     miss
extractData(gads4, convertLabels = "numeric")
#>   ID sex forename
#> 1  1   1      Tim
#> 2  2   1     Bill
#> 3  3   2      Ann
#> 4  4   2    Chris

For simultaneous changes to multiple variables, a set of functions is implemented that extracts a change table and applies the changes entered into it. For an easier workflow, the change table can also be saved as an Excel file, modified in Excel, and imported into R again. See the help pages of the respective functions for more details.

# extract changeTable
varChanges <- getChangeMeta(gads, level = "variable")
# modify changeTable
varChanges[varChanges$varName == "ID", "varLabel_new"] <- "Test taker ID"
# Apply changes
gads5 <- applyChangeMeta(varChanges, gads)
extractMeta(gads5, vars = "ID")
#>   varName      varLabel format display_width labeled value valLabel missings
#> 1      ID Test taker ID   <NA>            NA      no    NA     <NA>     <NA>
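
The same change table can carry several changes at once. The sketch below assumes eatGADS is installed and changes two variable labels in one step via a named lookup vector; gads_ct, new_labels and the other object names are made up for illustration.

```r
library(eatGADS)

# Small made-up GADSdat object with two variables
df_ct <- data.frame(ID = 1:2, forename = c("Tim", "Ann"),
                    stringsAsFactors = FALSE)
varLabels_ct <- data.frame(varName = c("ID", "forename"),
                           varLabel = c("Person Identifier",
                                        "first name as reported by teacher"),
                           stringsAsFactors = FALSE)
gads_ct <- import_raw(df = df_ct, varLabels = varLabels_ct)

# Lookup vector: names are variable names, values are the new labels
new_labels <- c(ID = "Test taker ID", forename = "First name")

# Extract the change table and enter all new labels at once
changes <- getChangeMeta(gads_ct, level = "variable")
rows <- match(names(new_labels), changes$varName)
changes[rows, "varLabel_new"] <- unname(new_labels)

# Apply the changes and inspect the result
gads_ct2 <- applyChangeMeta(changes, gads_ct)
extractMeta(gads_ct2)[, c("varName", "varLabel")]
```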

Writing SPSS files

Objects of the class GADSdat can also be exported into the SPSS format, utilizing haven. Note that this function is slightly experimental and problems with specific character strings might occur.

write_spss(gads, "path/example_out.sav")

If the haven format is preferred for working in R, a GADSdat object can also be transformed to its equivalent tibble format, as if the data was imported from SPSS via haven.

haven_dat <- export_tibble(gads)
haven_dat
#> # A tibble: 4 × 3
#>      ID sex        forename
#>   <dbl> <dbl+lbl>  <chr>   
#> 1     1 0 [male]   Tim     
#> 2     2 0 [male]   Bill    
#> 3     3 1 [female] Ann     
#> 4     4 1 [female] Chris
lapply(haven_dat, attributes)
#> $ID
#> $ID$label
#> [1] "Person Identifier"
#> 
#> 
#> $sex
#> $sex$label
#> [1] "Sex as self reported"
#> 
#> $sex$na_values
#> [1] -99
#> 
#> $sex$class
#> [1] "haven_labelled_spss" "haven_labelled"     
#> 
#> $sex$labels
#> missing - omission               male             female 
#>                -99                  0                  1 
#> 
#> 
#> $forename
#> $forename$label
#> [1] "first name as reported by teacher"
#> 
#> $forename$na_values
#> character(0)