The MRTAnalysis
package provides user-friendly functions
to conduct primary and secondary analyses for micro-randomized trial
(MRT), where the proximal outcomes are continuous or binary and the
intervention option is binary. For continuous outcomes, the estimated
causal effects are on the additive scale. For binary outcomes, the
estimated causal effects are on the log relative risk scale. In
particular, this package can be used to
MRT is an experimental design for optimizing mobile health
interventions. The marginal and the moderated causal excursion effects
are the quantities of interest in primary and secondary analyses for
MRT. In this tutorial we briefly review MRT and causal excursion
effects, and illustrate the use of the estimators implemented in this
package for conducting primary and secondary analyses of MRT:
wcls()
(weighted and centered least squares, for MRT with
continuous outcomes) and emee()
(estimator for marginal
excursion effect, for MRT with binary outcome).
In an MRT, each participant is repeatedly randomized among treatment options many times throughout the trial. Suppose there are \(n\) participants, and for the \(i\)-th participant, they are enrolled in the trial for \(m_i\) decision points. (In many MRT, the number of decision points \(m_i\) is the same for all participants. This package also automatically handles the setting where \(m_i\) is different for different participants, so we present the data structure in a more general way.)
For the \(i\)-th participant at the \(t\)-th decision point, we use the triplet \((X_{it}, A_{it}, Y_{it})\) to denote the data collected, where
data.frame
input: each row in the data.frame
would correspond to \((X_{it}, A_{it},
Y_{it})\) (and possibly other variables — see below) for a
specific \((i,t)\) combination.An MRT may include availability considerations. When it is inappropriate or unethical to deliver interventions to an individual, that individual is considered “unavailable”, and no intervention will be delivered at that decision point so \(A_{it} = 0\).
Mathematically, we use \(I_{it}\) denotes the availability status of participant \(i\) at decision point \(t\): \(I_{it} = 1\) if available and \(I_{it} = 0\) if unavailable. Temporal-wise, availability \(I_{it}\) is determined before \(A_{it}\), and one can conceptualize it by considering \(I_{it}\) to be measured at the same time as \(X_{it}\).
To use any of the estimators in this package, you need to prepare
your data set as a data.frame
in long format, meaning that
each row records observations from a decision point of a participant
(i.e., \((X_{it}, A_{it}, Y_{it})\)
(and possibly other variables — see below) for a specific \((i,t)\) combination). The
data.frame
should be sorted so that consecutive rows should
be from adjacent decision points from the same participant. Furthermore,
the data set should contain the following columns:
The data set may also contain the following columns, depending on your MRT:
library(MRTAnalysis)
<- options(digits = 3) # save current options for restoring later current_options
The following code uses wcls()
to estimate the fully
marginal causal excursion effect from a synthetic data set
data_mimicHeartSteps
that mimics the HeartSteps V1 MRT
(Klasnja and others (2015)). This
data set contains observations for 37 individuals at 210 different time
points. The data set contains the following variables:
userid
: id number of an individual.time
: index of decision point.day_in_study
: day in the study.logstep_30min
: the proximal outcome, i.e., the step
count in the 30 minutes following the current decision point
(log-transformed).logstep_30min_lag1
: the proximal outcome at the
previous decision point (lag-1 outcome), i.e., the step count in the 30
minutes following the previous decision point (log-transformed).logstep_pre30min
: the step count in the 30 minutes
prior to the current decision point (log-transformed).is_at_home_or_work
: whether the individual is at home
or work (=1) or at other locations (=0) at the current decision
point.intervention
: whether the intervention is randomized to
be delivered (=1) or not (=0) at the current decision point; the
randomization probability is a constant 0.6 for this data set, mimicking
HeartSteps V1.avail
: whether the individual is available (=1) or not
(=0) at the current decision point.A summary of data_mimicHeartSteps
is as follows:
head(data_mimicHeartSteps, 10)
#> userid decision_point day_in_study logstep_30min logstep_30min_lag1
#> 1 1 1 0 2.3902 0.0000
#> 2 1 2 0 -0.6931 2.3902
#> 3 1 3 0 2.4647 -0.6931
#> 4 1 4 0 0.1207 2.4647
#> 5 1 5 0 0.8322 0.1207
#> 6 1 6 1 1.8450 0.8322
#> 7 1 7 1 4.6315 1.8450
#> 8 1 8 1 4.1929 4.6315
#> 9 1 9 1 -0.0344 4.1929
#> 10 1 10 1 -0.1495 -0.0344
#> logstep_pre30min is_at_home_or_work intervention rand_prob avail
#> 1 -0.693 1 0 0.6 0
#> 2 2.196 1 0 0.6 1
#> 3 4.589 1 1 0.6 1
#> 4 3.179 1 1 0.6 1
#> 5 3.295 0 0 0.6 0
#> 6 4.666 1 0 0.6 0
#> 7 3.774 0 0 0.6 1
#> 8 -0.693 1 1 0.6 1
#> 9 -0.693 0 1 0.6 1
#> 10 -0.693 1 1 0.6 1
In the following function call of wcls()
, we specify the
proximal outcome variable by outcome = logstep_30min
. We
specify the treatment variable by treatment = intervention
.
We specify the constant randomization probability by
rand_prob = 0.6
. We specify the fully marginal effect as
the quantity to be estimated by setting
moderator_formula = ~1
. We use
logstep_pre30min
as a control variable by setting
control_formula = ~logstep_pre30min
. We specify the
availability variable by availability = avail
.
<- wcls(
fit1 data = data_mimicHeartSteps,
id = "userid",
outcome = "logstep_30min",
treatment = "intervention",
rand_prob = 0.6,
moderator_formula = ~1,
control_formula = ~logstep_pre30min,
availability = "avail"
)#> Constant randomization probability 0.6 is used.
#> Constant numerator probability 0.5 is used.
summary(fit1)
#> $call
#> wcls(data = data_mimicHeartSteps, id = "userid", outcome = "logstep_30min",
#> treatment = "intervention", rand_prob = 0.6, moderator_formula = ~1,
#> control_formula = ~logstep_pre30min, availability = "avail")
#>
#> $causal_excursion_effect
#> Estimate 95% LCL 95% UCL StdErr Hotelling df1 df2 p-value
#> (Intercept) 0.157 0.031 0.284 0.0622 6.4 1 34 0.0162
The summary()
function provides the estimated causal
excursion effect as well as the 95% confidence interval, standard error,
and p-value. The only row in the output
$causal_excursion_effect
is named (Intercept)
,
indicating that this is the fully marginal effect (like an intercept in
the causal effect model). In particular, the estimated marginal
excursion effect is 0.157, with 95% confidence interval (0.031, 0.284),
and p-value 0.016. The confidence interval and the p-value are based on
a small sample correction technique that is based on Hotelling’s T
distribution, so the Hotelling’s T statistic (Hotelling
)
and the degrees of freedom (df1
and df2
) are
also printed. See Boruvka and others (2018)
for details on the small sample correction.
One can include more control variables, e.g., by setting
control_formula = ~logstep_pre30min + logstep_30min_lag1 + is_at_home_or_work
.
This is illustrated by the following code. Different choices of control
variables should lead to similar causal effect estimates, but better
control variables (i.e., those that are correlated with the proximal
outcome) usually decrease the standard error of the causal effect
estimates. This is the case here: the standard error of the marginal
causal excursion effect decreases slightly from 0.062 to 0.061 after we
included two additional control variables.
<- wcls(
fit2 data = data_mimicHeartSteps,
id = "userid",
outcome = "logstep_30min",
treatment = "intervention",
rand_prob = 0.6,
moderator_formula = ~1,
control_formula = ~ logstep_pre30min + logstep_30min_lag1 + is_at_home_or_work,
availability = "avail"
)#> Constant randomization probability 0.6 is used.
#> Constant numerator probability 0.5 is used.
summary(fit2)
#> $call
#> wcls(data = data_mimicHeartSteps, id = "userid", outcome = "logstep_30min",
#> treatment = "intervention", rand_prob = 0.6, moderator_formula = ~1,
#> control_formula = ~logstep_pre30min + logstep_30min_lag1 +
#> is_at_home_or_work, availability = "avail")
#>
#> $causal_excursion_effect
#> Estimate 95% LCL 95% UCL StdErr Hotelling df1 df2 p-value
#> (Intercept) 0.159 0.0351 0.283 0.0607 6.84 1 32 0.0135
One can ask summary()
to print out the fitted
coefficients for the control variables as well, by setting
show_control_fit = TRUE
. This is illustrated by the
following code. However, we caution against interpreting the fitted
coefficients for the control variables, because these coefficients are
only interpretable when the control model is correctly specified, which
can rarely be true in an MRT setting.
summary(fit2, show_control_fit = TRUE)
#> Interpreting the fitted coefficients for control variables is not recommended.
#> $call
#> wcls(data = data_mimicHeartSteps, id = "userid", outcome = "logstep_30min",
#> treatment = "intervention", rand_prob = 0.6, moderator_formula = ~1,
#> control_formula = ~logstep_pre30min + logstep_30min_lag1 +
#> is_at_home_or_work, availability = "avail")
#>
#> $causal_excursion_effect
#> Estimate 95% LCL 95% UCL StdErr Hotelling df1 df2 p-value
#> (Intercept) 0.159 0.0351 0.283 0.0607 6.84 1 32 0.0135
#>
#> $control_variables
#> Estimate 95% LCL 95% UCL StdErr Hotelling df1 df2 p-value
#> (Intercept) 1.8373 1.7222 1.9524 0.0565 1056.59 1 32 0.000000
#> logstep_pre30min 0.3400 0.2997 0.3803 0.0198 295.31 1 32 0.000000
#> logstep_30min_lag1 0.0399 0.0181 0.0618 0.0107 13.84 1 32 0.000764
#> is_at_home_or_work 0.1355 0.0263 0.2447 0.0536 6.39 1 32 0.016585
The following code uses wcls()
to estimate the causal
excursion effect moderated by the location of the individual,
is_at_home_or_work
. This is achieved by setting
moderator_formula = ~is_at_home_or_work
.
<- wcls(
fit3 data = data_mimicHeartSteps,
id = "userid",
outcome = "logstep_30min",
treatment = "intervention",
rand_prob = 0.6,
moderator_formula = ~is_at_home_or_work,
control_formula = ~ logstep_pre30min + logstep_30min_lag1 + is_at_home_or_work,
availability = "avail"
)#> Constant randomization probability 0.6 is used.
#> Constant numerator probability 0.5 is used.
summary(fit3)
#> $call
#> wcls(data = data_mimicHeartSteps, id = "userid", outcome = "logstep_30min",
#> treatment = "intervention", rand_prob = 0.6, moderator_formula = ~is_at_home_or_work,
#> control_formula = ~logstep_pre30min + logstep_30min_lag1 +
#> is_at_home_or_work, availability = "avail")
#>
#> $causal_excursion_effect
#> Estimate 95% LCL 95% UCL StdErr Hotelling df1 df2 p-value
#> (Intercept) 0.109 -0.0288 0.247 0.0677 2.606 1 31 0.117
#> is_at_home_or_work 0.135 -0.1661 0.436 0.1475 0.835 1 31 0.368
The moderated causal excursion effect is modeled as a linear form:
\(\beta_0 + \beta_1 \cdot
\text{is_at_home_or_work}\). The output
$causal_excursion_effect
contains two rows, one for \(\beta_0\) (with row name
(Intercept)
) and the other for \(\beta_1\) (with row name
is_at_home_or_work
). Here, \(\beta_0\) is the causal excursion effect
when the individual is not at home or work (estimate = 0.109, 95% CI =
(-0.029, 0.247), p = 0.117), and \(\beta_1\) is the difference in the causal
excursion effects between when at home or work and when at other
locations (estimate = 0.135, 95% CI = (-0.166, 0.435), p = 0.368).
One can also ask summary()
to calculate and print the
estimated coefficients for \(\beta_0 +
\beta_1\), the causal excursion effect when the individual is at
home or work, by using the lincomb
optional argument. This
is illustrated by the following code. We set
lincomb = c(1, 1)
, i.e., asks summary()
to
print out \([1, 1] \times (\beta_0, \beta_1)^T
= \beta_0 + \beta_1\). The third row in
$causal_excursion_effect
, named
(Intercept) + is_at_home_or_work
, is the fitted result for
this \(\beta_0 + \beta_1\) coefficient
combination.
summary(fit3, lincomb = c(1, 1))
#> $call
#> wcls(data = data_mimicHeartSteps, id = "userid", outcome = "logstep_30min",
#> treatment = "intervention", rand_prob = 0.6, moderator_formula = ~is_at_home_or_work,
#> control_formula = ~logstep_pre30min + logstep_30min_lag1 +
#> is_at_home_or_work, availability = "avail")
#>
#> $causal_excursion_effect
#> Estimate 95% LCL 95% UCL StdErr Hotelling df1
#> (Intercept) 0.109 -0.0288 0.247 0.0677 2.606 1
#> is_at_home_or_work 0.135 -0.1661 0.436 0.1475 0.835 1
#> (Intercept) + is_at_home_or_work 0.244 0.0846 0.403 0.1260 3.753 2
#> df2 p-value
#> (Intercept) 31 0.11661
#> is_at_home_or_work 31 0.36787
#> (Intercept) + is_at_home_or_work 31 0.00187
Based on the output, the causal excursion effect at home or work is estimated to be 0.244, with 95% CI (0.085, 0.403) and p-value 0.002.
The syntax of emee()
is exactly the same as
wcls()
. We illustrate the use of emee()
below
for completeness.
The following code uses emee()
to estimate the fully
marginal causal excursion effect from a synthetic data set
data_binary
. This data set contains observations for 100
individuals at 30 different time points. The data set contains the
following variables:
userid
: id number of an individual.time
: index of decision point.time_var1
: time-varying covariate 1 for illustration
purpose. Here it is defined as the “standardized time in study”, defined
as the current decision point index divided by the total number of
decision points.time_var2
: time-varying covariate 2 for illustration
purpose. Here it is the indicator of “the second half of the study”,
defined as whether the current decision point index is greater than the
total number of decision points divided by 2.Y
: the binary proximal outcome.A
: whether the intervention is randomized to be
delivered (=1) or not (=0) at the current decision point;rand_prob
: the randomization probability at each
decision point, which is not a constant over time.avail
: whether the individual is available (=1) or not
(=0) at the current decision point.A summary of data_binary
is as follows:
head(data_binary, 30)
#> userid time time_var1 time_var2 Y A avail rand_prob
#> 1 1 1 0.0333 0 0 0 1 0.3
#> 2 1 2 0.0667 0 0 1 1 0.5
#> 3 1 3 0.1000 0 0 0 1 0.7
#> 4 1 4 0.1333 0 1 0 0 0.3
#> 5 1 5 0.1667 0 0 0 0 0.5
#> 6 1 6 0.2000 0 0 0 0 0.7
#> 7 1 7 0.2333 0 0 0 1 0.3
#> 8 1 8 0.2667 0 0 1 1 0.5
#> 9 1 9 0.3000 0 0 0 1 0.7
#> 10 1 10 0.3333 0 1 0 1 0.3
#> 11 1 11 0.3667 0 0 0 1 0.5
#> 12 1 12 0.4000 0 0 1 1 0.7
#> 13 1 13 0.4333 0 1 0 1 0.3
#> 14 1 14 0.4667 0 0 0 1 0.5
#> 15 1 15 0.5000 0 1 1 1 0.7
#> 16 1 16 0.5333 1 0 0 1 0.3
#> 17 1 17 0.5667 1 1 1 1 0.5
#> 18 1 18 0.6000 1 1 0 1 0.7
#> 19 1 19 0.6333 1 1 0 1 0.3
#> 20 1 20 0.6667 1 1 0 1 0.5
#> 21 1 21 0.7000 1 0 1 1 0.7
#> 22 1 22 0.7333 1 1 0 1 0.3
#> 23 1 23 0.7667 1 1 1 1 0.5
#> 24 1 24 0.8000 1 1 0 0 0.7
#> 25 1 25 0.8333 1 1 1 1 0.3
#> 26 1 26 0.8667 1 1 1 1 0.5
#> 27 1 27 0.9000 1 1 0 1 0.7
#> 28 1 28 0.9333 1 1 1 1 0.3
#> 29 1 29 0.9667 1 0 1 1 0.5
#> 30 1 30 1.0000 1 1 1 1 0.7
In the following function call of emee()
, we specify the
proximal outcome variable by outcome = Y
. We specify the
treatment variable by treatment = A
. We specify the
randomization probability by rand_prob = rand_prob
(the
first rand_prob
is an argument of emee()
; the
second rand_prob
is a column in data_binary
).
We specify the fully marginal effect as the quantity to be estimated by
setting moderator_formula = ~1
. We use
time_var1
and time_var2
as control variables
by setting control_formula = ~ time_var1 + time_var2
. We
specify the availability variable by
availability = avail
.
<- emee(
fit4 data = data_binary,
id = "userid",
outcome = "Y",
treatment = "A",
rand_prob = "rand_prob",
moderator_formula = ~1,
control_formula = ~ time_var1 + time_var2,
availability = "avail"
)#> Constant numerator probability 0.5 is used.
summary(fit4)
#> $call
#> emee(data = data_binary, id = "userid", outcome = "Y", treatment = "A",
#> rand_prob = "rand_prob", moderator_formula = ~1, control_formula = ~time_var1 +
#> time_var2, availability = "avail")
#>
#> $causal_excursion_effect
#> Estimate 95% LCL 95% UCL StdErr t_value df p-value
#> (Intercept) 0.341 0.241 0.44 0.05 6.81 96 8.54e-10
The summary()
function provides the estimated causal
excursion effect as well as the 95% confidence interval, standard error,
and p-value. The only row in the output
$causal_excursion_effect
is named (Intercept)
,
indicating that this is the fully marginal effect (like an intercept in
the causal effect model). In particular, the estimated marginal
excursion effect is 0.341 (on the log relative risk scale), with 95%
confidence interval (0.241, 0.44), and p-value\(<0.001\).
One can ask summary()
to print out the fitted
coefficients for the control variables as well, by setting
show_control_fit = TRUE
. This is illustrated by the
following code. However, we caution against interpreting the fitted
coefficients for the control variables, because these coefficients are
only interpretable when the control model is correctly specified, which
can rarely be true in an MRT setting.
summary(fit4, show_control_fit = TRUE)
#> Interpreting the fitted coefficients for control variables is not recommended.
#> $call
#> emee(data = data_binary, id = "userid", outcome = "Y", treatment = "A",
#> rand_prob = "rand_prob", moderator_formula = ~1, control_formula = ~time_var1 +
#> time_var2, availability = "avail")
#>
#> $causal_excursion_effect
#> Estimate 95% LCL 95% UCL StdErr t_value df p-value
#> (Intercept) 0.341 0.241 0.44 0.05 6.81 96 8.54e-10
#>
#> $control_variables
#> Estimate 95% LCL 95% UCL StdErr t_value df p-value
#> (Intercept) -1.580 -1.7154 -1.45 0.068 -23.26 96 3.13e-41
#> time_var1 0.672 0.2878 1.06 0.194 3.47 96 7.76e-04
#> time_var2 0.248 0.0268 0.47 0.112 2.22 96 2.84e-02
The following code uses emee()
to estimate the causal
excursion effect moderated by time_var1
. This is achieved
by setting moderator_formula = ~time_var1
.
<- emee(
fit5 data = data_binary,
id = "userid",
outcome = "Y",
treatment = "A",
rand_prob = "rand_prob",
moderator_formula = ~time_var1,
control_formula = ~ time_var1 + time_var2,
availability = "avail"
)#> Constant numerator probability 0.5 is used.
summary(fit5)
#> $call
#> emee(data = data_binary, id = "userid", outcome = "Y", treatment = "A",
#> rand_prob = "rand_prob", moderator_formula = ~time_var1,
#> control_formula = ~time_var1 + time_var2, availability = "avail")
#>
#> $causal_excursion_effect
#> Estimate 95% LCL 95% UCL StdErr t_value df p-value
#> (Intercept) 0.0811 -0.180 0.342 0.132 0.616 95 0.5391
#> time_var1 0.4293 0.051 0.808 0.191 2.253 95 0.0266
The moderated causal excursion effect is modeled as a linear form:
\(\beta_0 + \beta_1 \cdot
\text{time_var1}\). The output
$causal_excursion_effect
contains two rows, one for \(\beta_0\) (with row name
(Intercept)
) and the other for \(\beta_1\) (with row name
time_var1
). Here, \(\beta_0\) is the causal excursion effect
when time_var1
\(=0\)
(estimate = 0.081, 95% CI = (-0.180, 0.342), p = 0.54), and \(\beta_1\) is the slope of
time_var1
in the causal excursion effect model (estimate =
0.429, 95% CI = (0.051, 0.808), p = 0.03).
One can also ask summary()
to calculate and print the
linear combination of coefficients and their confidence interval,
standard error, and p-value, by using the lincomb
optional
argument. The following code sets
lincomb = rbind(c(1, 0.0333), c(1, 0.5), c(1, 1))
, i.e.,
asks summary()
to print out the estimates for \[
\begin{bmatrix}1 & 0.0333\\
1 & 0.5\\
1 & 1
\end{bmatrix}\times\begin{bmatrix}\beta_{0}\\
\beta_{1}
\end{bmatrix}=\begin{bmatrix}\beta_{0}+0.0333\beta_{1}\\
\beta_{0}+0.5\beta_{1}\\
\beta_{0}+\beta_{1}
\end{bmatrix}.
\] Because \(\beta_1\) is the
slope of time_var1
, which is a scaled version of decision
time index that starts at 0.0333 and ends at 1, \(\beta_0 + 0.0333\beta_1\), \(\beta_0 + 0.5\beta_1\) and \(\beta_0 + \beta_1\) are the causal
excursion effects at the beginning of the study, mid-way during the
study, and at the end of the study, respectively. The 3rd to 5th rows in
$causal_excursion_effect
show these results. Note that the
interpretation is under the assumption that the causal excursion effect
changes linearly over time.
summary(fit5, lincomb = rbind(c(1, 0.0333), c(1, 0.5), c(1, 1)))
#> $call
#> emee(data = data_binary, id = "userid", outcome = "Y", treatment = "A",
#> rand_prob = "rand_prob", moderator_formula = ~time_var1,
#> control_formula = ~time_var1 + time_var2, availability = "avail")
#>
#> $causal_excursion_effect
#> Estimate 95% LCL 95% UCL StdErr t_value df
#> (Intercept) 0.0811 -0.180 0.342 0.1316 0.616 95
#> time_var1 0.4293 0.051 0.808 0.1905 2.253 95
#> (Intercept) + 0.0333*time_var1 0.0954 -0.154 0.345 0.1258 0.759 95
#> (Intercept) + 0.5*time_var1 0.2958 0.183 0.409 0.0569 5.202 95
#> (Intercept) + time_var1 0.5105 0.341 0.680 0.0854 5.978 95
#> p-value
#> (Intercept) 5.39e-01
#> time_var1 2.66e-02
#> (Intercept) + 0.0333*time_var1 4.50e-01
#> (Intercept) + 0.5*time_var1 1.13e-06
#> (Intercept) + time_var1 3.93e-08
options(current_options) # restore old options
Below are some references: