This vignette demonstrates one of the newer features in the SimDesign
package pertaining to multiple analysis function definitions that can be selected for each Design
condition or whenever they are applicable. The purpose of providing multiple analysis functions is to
Analyse()
function contains too much code to easily track, and toFunctionality speaking, this type of organization does not change how SimDesign
generally operates. For that reason, the coding style presented in this vignette can be considered optional. However, if any of the above points resonate well with you then following the details of this coding organization style may prove useful.
The usual work-flow with SimDesign
requires first calling SimFunctions()
to generate a working template, such as the following.
SimDesign::SimFunctions()
## #-------------------------------------------------------------------
##
## library(SimDesign)
##
## Design <- createDesign(factor1 = NA,
## factor2 = NA)
##
## #-------------------------------------------------------------------
##
## Generate <- function(condition, fixed_objects) {
## dat <- data.frame()
## dat
## }
##
## Analyse <- function(condition, dat, fixed_objects) {
## ret <- nc(stat1 = NaN, stat2 = NaN)
## ret
## }
##
## Summarise <- function(condition, results, fixed_objects) {
## ret <- c(bias = NaN, RMSE = NaN)
## ret
## }
##
## #-------------------------------------------------------------------
##
## res <- runSimulation(design=Design, replications=2, generate=Generate,
## analyse=Analyse, summarise=Summarise)
## res
which uses the default nAnalyses=1
to generate only a single Analyse()
function. In the context of multiple analysis functions, however, users may be more interested in passing the number of analysis functions they believe they will need in their simulation (e.g., if analyzing a \(t\)-test setup to compare the Welch versus independent samples t-test, then two analysis functions should be used). Passing nAnalyses=2
to SimFunctions()
creates the following template:
SimDesign::SimFunctions(nAnalyses = 2)
## #-------------------------------------------------------------------
##
## library(SimDesign)
##
## Design <- createDesign(factor1 = NA,
## factor2 = NA)
##
## #-------------------------------------------------------------------
##
## Generate <- function(condition, fixed_objects) {
## dat <- data.frame()
## dat
## }
##
## Analyse.A1 <- function(condition, dat, fixed_objects) {
## ret <- nc(stat1 = NaN, stat2 = NaN)
## ret
## }
##
## Analyse.A2 <- function(condition, dat, fixed_objects) {
## ret <- nc(stat1 = NaN, stat2 = NaN)
## ret
## }
##
## #-------------------------------------------------------------------
##
## Summarise <- function(condition, results, fixed_objects) {
## ret <- c(bias = NaN, RMSE = NaN)
## ret
## }
##
## #-------------------------------------------------------------------
##
## res <- runSimulation(design=Design, replications=2,generate=Generate,
## analyse=list(A1=Analyse.A1, A2=Analyse.A2),
## summarise=Summarise)
## res
Notice in this case that there are two Analyse.#()
definitions constructed, and when passed to runSimulation()
these are organized as a named list
. The names of the list will ultimately be attached to the names of the analysis objects so that there is no ambiguity in the outputted information. However, the inputs to the Analyse()
functions will always be the same, as the dat
object formed by the Generate()
call will be passed to each of these Analyse definitions (hence, the generate data is held constant across the respective analyses).
The above template should of course be modified to replace the less useful names of the Analyse.#()
components. By default users will want to change these to something like Analyse.some_statistic
, Analyse.some_other_statistic
, …, Analyse.some_other_other_statistic
, and so on, where the number of Analyse.#
function definitions will ultimately end up in the runSimulation(..., Analyse=list())
input. Supplying better names to the named list
component is also recommended as these will be used to name the associated results in the Summarise()
step.
Finally, note that all the rules about objects and object naming from the typical single Analyse function still apply and are properly checked internally for suitable names and consistency. The independently defined Analyse functions are also interchangable and removable/replaceable, which makes the structure of the Generate-Analyse-Summarise setup more modular with respect to the analysis components.
The following code is adopted from the Wiki, and so details about the simulation should be obtained from that source.
library(SimDesign)
# SimFunctions(nAnalyses = 2)
sample_sizes <- c(250, 500, 1000)
nitems <- c(10, 20)
Design <- createDesign(sample_size = sample_sizes,
nitems = nitems)
# create list of additional parameters which are fixed across conditions
set.seed(1)
pars_10 <- rbind(a = round(rlnorm(10, .3, .5)/1.702, 2),
d = round(rnorm(10, 0, .5)/1.702, 2))
pars_20 <- rbind(a = round(rlnorm(20, .3, .5)/1.702, 2),
d = round(rnorm(20, 0, .5)/1.702, 2))
pars <- list(ten=pars_10, twenty=pars_20)
P_logit <- function(a, d, Theta) exp(a * Theta + d) / (1 + exp(a * Theta + d))
P_ogive <- function(a, d, Theta) pnorm(a * Theta + d)
Generate <- function(condition, fixed_objects) {
N <- condition$sample_size
nitems <- condition$nitems
nitems_name <- ifelse(nitems == 10, 'ten', 'twenty')
#extract objects from fixed_objects
a <- fixed_objects[[nitems_name]]['a', ]
d <- fixed_objects[[nitems_name]]['d', ]
dat <- matrix(NA, N, nitems)
colnames(dat) <- paste0('item_', 1:nitems)
Theta <- rnorm(N)
for(j in 1:nitems){
p <- P_ogive(a[j], d[j], Theta)
for(i in 1:N)
dat[i,j] <- sample(c(1,0), 1, prob = c(p[i], 1 - p[i]))
}
as.data.frame(dat) #data.frame works nicer with lavaan
}
Analyse.FIML <- function(condition, dat, fixed_objects) {
mod <- mirt(dat, 1L, verbose=FALSE)
if(!extract.mirt(mod, 'converged')) stop('mirt did not converge')
cfs <- mirt::coef(mod, simplify = TRUE, digits = Inf)
FIML_as <- cfs$items[,1L] / 1.702
ret <- c(as=unname(FIML_as))
ret
}
Analyse.DWLS <- function(condition, dat, fixed_objects) {
nitems <- condition$nitems
lavmod <- paste0('F =~ ', paste0('NA*', colnames(dat)[1L], ' + '),
paste0(colnames(dat)[-1L], collapse = ' + '),
'\nF ~~ 1*F')
lmod <- sem(lavmod, dat, ordered = colnames(dat))
if(!lavInspect(lmod, 'converged')) stop('lavaan did not converge')
cfs2 <- lavaan::coef(lmod)
DWLS_alpha <- cfs2[1L:nitems]
const <- sqrt(1 - DWLS_alpha^2)
DWLS_as <- DWLS_alpha / const
ret <- c(as=unname(DWLS_as))
ret
}
Summarise <- function(condition, results, fixed_objects) {
nitems <- condition$nitems
nitems_name <- ifelse(nitems == 10, 'ten', 'twenty')
#extract objects from fixed_objects
a <- fixed_objects[[nitems_name]]['a', ]
pop <- c(a, a)
obt_bias <- bias(results, pop)
obt_RMSE <- RMSE(results, pop)
ret <- c(bias=obt_bias, RMSE=obt_RMSE)
ret
}
res <- runSimulation(Design, replications=100, verbose=FALSE, parallel=TRUE,
generate=Generate,
analyse=list(FIML=Analyse.FIML, DWLS=Analyse.DWLS),
summarise=Summarise, filename = 'mirt_lavaan',
packages=c('mirt', 'lavaan'), fixed_objects=pars)
res
## # A tibble: 6 × 86
## sample_size nitems bias.FIML.as1 bias.FIML.as2 bias.FIML.as3 bias.FIML.as4
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 250 10 0.0099887 0.013457 -0.020161 0.18616
## 2 500 10 -0.021136 0.018720 -0.018173 0.10424
## 3 1000 10 -0.0039473 -0.0014778 -0.013393 0.077890
## 4 250 20 0.060647 0.011787 0.064688 -0.026314
## 5 500 20 0.019202 0.045376 0.021837 -0.0063587
## 6 1000 20 0.019976 0.018608 0.0079817 -0.011498
## # ℹ 80 more variables: bias.FIML.as5 <dbl>, bias.FIML.as6 <dbl>,
## # bias.FIML.as7 <dbl>, bias.FIML.as8 <dbl>, bias.FIML.as9 <dbl>,
## # bias.FIML.as10 <dbl>, bias.DWLS.as1 <dbl>, bias.DWLS.as2 <dbl>,
## # bias.DWLS.as3 <dbl>, bias.DWLS.as4 <dbl>, bias.DWLS.as5 <dbl>,
## # bias.DWLS.as6 <dbl>, bias.DWLS.as7 <dbl>, bias.DWLS.as8 <dbl>,
## # bias.DWLS.as9 <dbl>, bias.DWLS.as10 <dbl>, RMSE.FIML.as1 <dbl>,
## # RMSE.FIML.as2 <dbl>, RMSE.FIML.as3 <dbl>, RMSE.FIML.as4 <dbl>, …
In this particular formulation the mirt
and lavaan
package analyses have been completely isolated into their own respective functions, and in principle could therefore be analyzed independently in future simulation studies. This adds a nicer layer of potential modularity to the Analyse portion of the SimDesign
framework, where re-using or modifying previous SimDesign
code should be less painful.
AnalyseIf()
In situations where analysis functions defined in the analyse
list should only be applied in certain design conditions, users can include an AnalyseIf()
definition at the beginning of their respective functions to ensure the analyses are only executed when the provided logical is TRUE
. This logical ensures the data generation conditions are suitable for the analysis function to be investigated; otherwise, it is skipped over in the generate-analyse-summarise work-flow.
As a continuation from above, say that an investigator was also interested in recovering the slope parameters of a factor analysis model where the observed indicator variable were continuous as well as discrete. The Design
definition may therefore look like the following.
Design <- createDesign(sample_size = sample_sizes,
nitems = nitems,
indicators = c('discrete', 'continuous'))
Design
## # A tibble: 12 × 3
## sample_size nitems indicators
## <dbl> <dbl> <chr>
## 1 250 10 discrete
## 2 500 10 discrete
## 3 1000 10 discrete
## 4 250 20 discrete
## 5 500 20 discrete
## 6 1000 20 discrete
## 7 250 10 continuous
## 8 500 10 continuous
## 9 1000 10 continuous
## 10 250 20 continuous
## 11 500 20 continuous
## 12 1000 20 continuous
Provided that the Generate()
step utilized this indicators
character vector, this would imply that the dat
object returned from Generate()
could consist of discrete or continuous data. In the case of continuous indicator variables, lavaan
could be used as it supports such indicator types; however, mirt
cannot. So, to ensure that only the analysis function pertaining to lavaan
is used one could include the following replacement definition that used mirt
, but now includes an AnalyseIf()
logical given the indicators
variable’s state.
Analyse.FIML <- function(condition, dat, fixed_objects) {
AnalyseIf(condition$indicators == 'discrete')
# equivalently:
# AnalyseIf(indicators == 'discrete', condition)
# with(condition, AnalyseIf(indicators == 'discrete'))
mod <- mirt(dat, 1L, verbose=FALSE)
if(!extract.mirt(mod, 'converged')) stop('mirt did not converge')
cfs <- coef(mod, simplify = TRUE, digits = Inf)
FIML_as <- cfs$items[,1L] / 1.702
ret <- c(as=unname(FIML_as))
ret
}
Using this definition the final object returned by runSimulation()
will provide suitable NA
placeholders (where appropriate). For continuous indicators the results will be presented as though mirt
was never used for the continuous indicator conditions controlled by the Design
object.
Note that a similar logic can be made for multiple Generate()
functions, templated via SimDesign::SimFunctions(nGenerate = 2)
, and requiring specific GenerateIf()
tests to indicate which generation function should be used. For those interested in this type of structure, the following could be used to separate the discrete/continuous data generation per Design
row:
Generate.G1 <- function(condition, fixed_objects) {
GenerateIf(condition$indicators == 'discrete')
...
dat
}
Generate.G2 <- function(condition, fixed_objects) {
GenerateIf(condition$indicators == 'continuous')
...
dat
}
res <- runSimulation(design=Design, replications=1000,
generate=list(G1=Generate.G1, G2=Generate.G2),
analyse=list(DWLS=Analyse.DWLS, FIML=Analyse.FIML),
summarise=Summarise)
Interestingly, AnalyseIf()
could also be used to select only one analysis function at a time given the components in the Design
object. For instance, if the Design
definition were constructed using
Design <- createDesign(sample_size = sample_sizes,
nitems = nitems,
method = c('FIML', 'DWLS'))
Design
## # A tibble: 12 × 3
## sample_size nitems method
## <dbl> <dbl> <chr>
## 1 250 10 FIML
## 2 500 10 FIML
## 3 1000 10 FIML
## 4 250 20 FIML
## 5 500 20 FIML
## 6 1000 20 FIML
## 7 250 10 DWLS
## 8 500 10 DWLS
## 9 1000 10 DWLS
## 10 250 20 DWLS
## 11 500 20 DWLS
## 12 1000 20 DWLS
and the analysis functions above were supplied defined as
Analyse.FIML <- function(condition, dat, fixed_objects) {
AnalyseIf(method == 'FIML', condition)
#...
}
Analyse.DWLS <- function(condition, dat, fixed_objects) {
AnalyseIf(method == 'DWLS', condition)
# ...
}
# ...
res <- runSimulation(Design, replications=100, verbose=FALSE, parallel=TRUE,
generate=Generate,
analyse=list(Analyse.FIML, Analyse.DWLS),
summarise=Summarise, filename = 'mirt_lavaan',
packages=c('mirt', 'lavaan'), fixed_objects=pars)
then only one analysis function will be applied at a time in the simulation experiment. Note that in this case there is no need to append ‘MML’ or ‘DWLS’ to the results
objects as this becomes redundant with the method
column in the Design
object, and so the analyse
list input is specified as an unnamed list (cf. earlier when the input was named, which appended MML.
and DWLS.
to the results
output in Summarise()
).
Users may find this a more natural setup than having to merge all analysis information into a single Analyse()
definition. The downside, however, is that the analysis function will be applied to different generated datasets, which while theoretically unbiased could have ramifications should the analysis functions throw errors at different rates (even when explicitly supplying a seed
vector input to runSimulation()
).