The cvsem package provides cross-validation (CV) of
structural equation models (SEM) across a user-defined number of folds.
CV is based on computing the discrepancy between the held-out test sample
covariance matrix and the model-implied covariance matrix from the training sample.
This approach to cross-validating SEMs is described in Cudeck and
Browne (1983) and Browne and Cudeck (1992). The individual models are fitted
via the lavaan package (Rosseel 2012) to obtain the model-implied
covariance matrix. The discrepancy between the implied matrix and the test
sample covariance matrix is obtained via a pre-specified metric
(defaults to Kullback-Leibler divergence, also known as the maximum likelihood
discrepancy). The cvsem function returns the average
discrepancy together with a corresponding standard error for each tested
model.
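To make the default metric concrete, here is a minimal base-R sketch of the maximum likelihood discrepancy between a test-sample covariance matrix `S` and a model-implied covariance matrix `Sigma`. The helper name `ml_discrepancy` and the toy matrices are illustrative assumptions and not part of cvsem. The package's own implementation may differ, for example by a constant factor (the Kullback-Leibler divergence between two zero-mean normal distributions is half of this quantity).

```r
# Hypothetical helper (not part of cvsem): ML discrepancy
# F = log|Sigma| - log|S| + tr(S %*% solve(Sigma)) - p
ml_discrepancy <- function(S, Sigma) {
  p <- nrow(S)
  log_det_Sigma <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)
  log_det_S <- as.numeric(determinant(S, logarithm = TRUE)$modulus)
  log_det_Sigma - log_det_S + sum(diag(S %*% solve(Sigma))) - p
}

# Toy 2 x 2 covariance matrices, made up for illustration
S <- matrix(c(1.0, 0.3, 0.3, 1.0), nrow = 2)
Sigma <- matrix(c(1.0, 0.5, 0.5, 1.0), nrow = 2)

ml_discrepancy(S, S)      # identical matrices: zero up to rounding
ml_discrepancy(S, Sigma)  # positive; larger values indicate a worse match
```

The discrepancy is zero only when the two matrices coincide, which is why smaller values indicate better out-of-sample fit.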
Currently, the provided model code needs to follow one of lavaan’s allowed specifications.
You can install the development version of cvsem from GitHub with:
# install.packages("devtools")
devtools::install_github("AnnaWysocki/cvsem")
Cross-validating the HolzingerSwineford1939 dataset
Load package and read in data from the lavaan package:
library(cvsem)
example_data <- lavaan::HolzingerSwineford1939
Assign descriptive column names:
colnames(example_data) <- c("id", "sex", "ageyr", "agemo", "school", "grade",
                            "visualPerception", "cubes", "lozenges", "comprehension",
                            "sentenceCompletion", "wordMeaning", "speededAddition",
                            "speededCounting", "speededDiscrimination")
Define some models to be compared with cvsem using lavaan notation:
model1 <- 'comprehension ~ sentenceCompletion + wordMeaning'

model2 <- 'comprehension ~ meaning

           ## Add some latent variables:
           meaning =~ wordMeaning + sentenceCompletion
           speed =~ speededAddition + speededDiscrimination + speededCounting
           speed ~~ meaning'

model3 <- 'comprehension ~ wordMeaning + speededAddition'
Gather the models into a named list object with cvgather. These could also be fitted lavaan objects based on the same data.
models <- cvgather(model1, model2, model3)
Define the number of folds k and call the cvsem function. Here we use k = 10 folds. CV is based on the discrepancy between the test sample covariance matrix and the model-implied matrix from the training data. The discrepancy between the sample and implied matrices is set via the discrepancyMetric argument. Currently three discrepancy metrics are available: KL-Divergence, Generalized Least Squares (GLS), and Frobenius Distance (FD). Here we use KL-Divergence.
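For intuition about the two alternative metrics, the following base-R sketch shows common textbook forms of a generalized least squares discrepancy and a Frobenius distance between a sample covariance matrix `S` and an implied matrix `Sigma`. The helper names and toy matrices are assumptions made for illustration, and the exact definitions used inside cvsem may differ (for example in the choice of weight matrix or scaling).

```r
# Hypothetical helpers (not part of cvsem)

# GLS discrepancy with the sample matrix as weight:
# F = 0.5 * tr[ (S^{-1} (S - Sigma))^2 ]
gls_discrepancy <- function(S, Sigma) {
  R <- solve(S) %*% (S - Sigma)
  0.5 * sum(diag(R %*% R))
}

# Frobenius distance: square root of the sum of squared element-wise differences
frobenius_distance <- function(S, Sigma) {
  sqrt(sum((S - Sigma)^2))
}

# Toy 2 x 2 covariance matrices, made up for illustration
S <- matrix(c(1.0, 0.3, 0.3, 1.0), nrow = 2)
Sigma <- matrix(c(1.0, 0.5, 0.5, 1.0), nrow = 2)

gls_discrepancy(S, Sigma)
frobenius_distance(S, Sigma)
```

Both quantities are zero when the two matrices are identical and grow as they diverge, so they can be read on the same "smaller is better" scale as the default metric.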
fit <- cvsem(data = example_data, Models = models, k = 10, discrepancyMetric = "KL-Divergence")
#> [1] "Cross-Validating model: model1"
#> [1] "Cross-Validating model: model2"
#> [1] "Cross-Validating model: model3"
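The following is a rough sketch of the kind of k-fold procedure described above, written directly against lavaan for a single model. It is an illustration under stated assumptions and not the cvsem implementation: the fold assignment, the helper `ml_discrepancy`, the use of `lavaan::sem()`, and the standard error computed as the standard deviation across folds divided by the square root of k are all assumptions. The sketch uses lavaan's original column names, where x4, x5, and x6 correspond to comprehension, sentenceCompletion, and wordMeaning.

```r
library(lavaan)

# Hypothetical helper (not part of cvsem): ML discrepancy between
# a test covariance matrix S and an implied covariance matrix Sigma
ml_discrepancy <- function(S, Sigma) {
  p <- nrow(S)
  log_det_Sigma <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)
  log_det_S <- as.numeric(determinant(S, logarithm = TRUE)$modulus)
  log_det_Sigma - log_det_S + sum(diag(S %*% solve(Sigma))) - p
}

set.seed(1)  # fold assignment is random
dat <- lavaan::HolzingerSwineford1939
model <- 'x4 ~ x5 + x6'
k <- 10

# Randomly assign each row to one of k folds of roughly equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(dat)))

fold_discrepancies <- vapply(seq_len(k), function(i) {
  train <- dat[fold_id != i, ]
  test <- dat[fold_id == i, ]

  # Fit on the training folds and extract the model-implied covariance
  fit_i <- sem(model, data = train)
  Sigma <- unclass(lavInspect(fit_i, "cov.ov"))

  # Covariance of the same observed variables in the held-out fold
  S <- cov(test[, colnames(Sigma)])

  ml_discrepancy(S, Sigma)
}, numeric(1))

cv_mean <- mean(fold_discrepancies)         # average discrepancy across folds
cv_se <- sd(fold_discrepancies) / sqrt(k)   # assumed form of the standard error
c(mean = cv_mean, se = cv_se)
```

Repeating this loop for each candidate model and ranking the models by their average discrepancy mirrors the comparison that the printed results report.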
Print the fitted cvsem object. Note that the model with the smallest (best) discrepancy is listed first. The reported value is the average of the discrepancy metric across all folds, also known as the expected cross-validation index (ECVI), together with the associated standard error.
fit
#> Cross-Validation Results of 3 models
#> based on k = 10 folds.
#>
#> Model E(KL-D) SE
#> 1 model1 1.29 0.44
#> 3 model3 2.28 0.50
#> 2 model2 3.48 0.64
Browne, Michael W., and Robert Cudeck. 1992. “Alternative Ways of Assessing Model Fit.” Sociological Methods & Research 21: 230–58.
Cudeck, Robert, and Michael W. Browne. 1983. “Cross-Validation of Covariance Structures.” Multivariate Behavioral Research 18: 147–67. https://doi.org/10.1207/s15327906mbr1802_2.
Rosseel, Yves. 2012. “lavaan: An R Package for Structural Equation Modeling.” Journal of Statistical Software 48 (2): 1–36. https://doi.org/10.18637/jss.v048.i02.