missoNet is a package that fits penalized multi-task regression – that is, regression with multiple correlated tasks or response variables – to simultaneously estimate the coefficients of a set of predictor variables for all tasks and the conditional response network structure given all predictors, via penalized maximum likelihood in an undirected conditional Gaussian graphical model. In contrast to most penalized multi-task regression (conditional graphical lasso) methods, missoNet can obtain estimates even when the response data are corrupted by missing values. The method automatically enjoys the theoretical and computational benefits of convexity, and returns solutions that are close to the estimates obtained without any missing values.
missoNet assumes the model \[ \mathbf{Y} = \mathbf{X}\mathbf{B}^* + \mathbf{E},\ \ \mathbf{E}_{i\cdot} \stackrel{iid}{\sim} \mathcal{MVN}(0_q,(\mathbf{\Theta}^*)^{-1})\ \ \forall i = 1,...,n, \] where \(\mathbf{Y} \in \mathbb{R}^{n\times q}\) and \(\mathbf{X} \in \mathbb{R}^{n\times p}\) represent the centered response data matrix and predictor data matrix, respectively (thus the intercepts are omitted). \(\mathbf{E}_{i\cdot} \in \mathbb{R}^{q}\) is the \(i\)th row (\(i\)th sample) of the error matrix and is drawn from a multivariate Gaussian distribution. \(n, p, q\) denote the sample size, the number of predictors and the number of responses, respectively. The regression coefficient matrix \(\mathbf{B}^* \in \mathbb{R}^{p\times q}\) and the precision (inverse covariance) matrix \(\mathbf{\Theta}^* \in \mathbb{R}^{q\times q}\) are parameters to be estimated in this model. Note that the entries of \(\mathbf{\Theta}^*\) have a one-to-one correspondence with partial correlations. That is, the random variables \(Y_j\) and \(Y_k\) are conditionally independent (i.e. \(\mathbf{\Theta}^*_{jk} = 0\)) given \(X\) and all other remaining variables in \(Y\) iff the corresponding partial correlation coefficient is zero.
To estimate the parameters, missoNet solves a penalized MLE problem: \[ (\hat{\mathbf{\Theta}},\hat{\mathbf{B}}) = \underset{\mathbf{\Theta} \succeq 0,\ \mathbf{B}}{\text{argmin}}\ \mathrm{tr}\left[\frac{1}{n}(\mathbf{Y} - \mathbf{XB})^\top(\mathbf{Y} - \mathbf{XB}) \mathbf{\Theta}\right] - \mathrm{log}|\mathbf{\Theta}| + \lambda_{\Theta}(\|\mathbf{\Theta}\|_{1,\mathrm{off}} + 1_{n\leq \mathrm{max}(p,q)} \|\mathbf{\Theta}\|_{1,\mathrm{diag}}) + \lambda_{B}\|\mathbf{B}\|_1. \] \(\mathbf{B}^*\), which maps the predictors to the responses, and \(\mathbf{\Theta}^*\), which reveals the responses’ conditional dependencies, are estimated for pairs of lasso penalties – \(\lambda_B\) and \(\lambda_\Theta\) – over a grid of values (on the log scale) covering the entire range of possible solutions. To learn more about sparse multi-task regression with inverse covariance estimation on datasets without missing values, see MRCE.
However, missingness in real data is inevitable. Most penalized multi-task regression / conditional graphical lasso methods either assume the data are fully observed (e.g. MRCE), or deal with missing data through a likelihood-based method such as the EM algorithm (e.g. cglasso). An important challenge in this context is the possible non-convexity of the associated optimization problem in the presence of missing data, as well as the high computational cost of iteratively updating the parameter estimates.
missoNet aims to handle a specific class of datasets in which the response matrix \(\mathbf{Y}\) contains data that are missing at random (MAR). To use missoNet, users need no additional knowledge for pre-processing missing data (e.g. imputation) nor for selecting the regularization parameters (\(\lambda_B\) and \(\lambda_\Theta\)); the method provides a unified framework that automatically solves a convex modification of the multi-task learning problem defined above, using corrupted datasets.
The package provides an integrated set of core routines for data simulation, model fitting and cross-validation, visualization of results, as well as predictions in new data. The function arguments are in the same style as those of glmnet, making it easy for experienced users to get started.
To install the package missoNet from CRAN, type the following command in the R console:
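install.packages("missoNet")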
Or install the development version of missoNet from GitHub:
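## The repository path below is our assumption for illustration;
## substitute the package's actual GitHub repository if it differs.
# install.packages("devtools")
devtools::install_github("yixiao-zeng/missoNet")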
The purpose of this section is to give users a general overview of the functions and their usage. We will briefly go over the main functions, basic operations and outputs.
First, we load the missoNet package:
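library(missoNet)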
We will use a synthetic dataset for illustration. To generate a set of data with missing response values, we can simply call the built-in function generateData:
## Specify a random seed for reproducibility.
## The overall missing rate in the response matrix is around 10%.
## The missing mechanism can also be "MCAR" or "MNAR".
sim.dat <- generateData(n = 150, p = 15, q = 12, rho = 0.1, missing.type = "MAR", with.seed = 1512)
tr <- 1:120 # training set indices
tst <- 121:150 # test set indices
This command returns an object containing all necessary components for analysis, including:

'X', 'Y': a predictor matrix \(\mathbf{X} \in \mathbb{R}^{n\times p}\) and a complete (clean) response matrix \(\mathbf{Y} \in \mathbb{R}^{n\times q}\), in which rows correspond to samples and columns correspond to variables. The rows of \(\mathbf{Y}\) are sampled from \(\mathcal{MVN}(\mathbf{X}\mathbf{B}^*,(\mathbf{\Theta}^*)^{-1})\).

'Beta', 'Theta': the true parameters \(\mathbf{B}^* \in \mathbb{R}^{p\times q}\) and \(\mathbf{\Theta}^* \in \mathbb{R}^{q\times q}\) of the generative model that will be estimated later (see the later section for more details on their structures and sparsity).

'Z': a corrupted response matrix \(\mathbf{Z} \in \mathbb{R}^{n\times q}\) whose elements \(\mathbf{Z}_{ij}\) (\(\forall\ i=1,...,n,\ j=1,...,q\)) are either NA or equal to \(\mathbf{Y}_{ij}\).

The probabilities of missingness for the response variables and the missing mechanism are specified by 'rho' and 'missing.type', respectively. 'rho' can be a scalar or a vector of length \(q\); in the code above, 'rho' = 0.1 indicates an overall missing rate of around 10%. In addition to "MAR", the 'missing.type' can also be "MCAR" or "MNAR"; see the later section for more details on how missing values are generated under different mechanisms.
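For instance, a quick sanity check of these components (a sketch; the component names follow the list above):

dim(sim.dat$X)          # 150 x 15 predictor matrix
dim(sim.dat$Z)          # 150 x 12 corrupted response matrix
mean(is.na(sim.dat$Z))  # overall proportion of NAs, roughly 0.1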
NOTE 1: the program only accepts missing values that are coded as either NAs or NaNs.
NOTE 2: although the program will provide estimates even when \(n \leq \text{max}(p,q)\), convergence is likely to be rather slow, and the variance of the estimates is likely to be excessively large. Therefore, we suggest that both the predictor columns and the response columns be reduced (filtered) if possible, prior to fitting any models. For example in genomics, where \(\mathbf{X}\) can be very wide (i.e. large \(p\)), features are often filtered based on variance or coefficient of variation.
We can easily visualize how missing data is distributed in the corrupted response matrix \(\mathbf{Z}\) using package visdat:
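## vis_miss() plots the missingness pattern of a data frame.
visdat::vis_miss(as.data.frame(sim.dat$Z))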
A single model can be fitted using the most basic call to missoNet:
## Training set
X.tr <- sim.dat$X[tr, ] # predictor matrix
Z.tr <- sim.dat$Z[tr, ] # corrupted response matrix
## Using the training set to fit the model.
## 'lambda.Beta' and 'lambda.Theta' are arbitrarily set to 0.1.
## 'verbose' = 0 suppresses printing of messages.
fit <- missoNet(X = X.tr, Y = Z.tr, lambda.Beta = 0.1, lambda.Theta = 0.1, verbose = 0)
However, missoNet is more commonly used with a grid of values for \(\lambda_B\) and \(\lambda_\Theta\). The two penalty vectors must have the same length, and the missoNet function will be called with each consecutive pair of values – i.e. with the first elements of the two vectors, then with the second elements, etc.
lambda.Beta.vec <- 10^(seq(from = 0, to = -1, length.out = 10)) # 10 values on the log scale, from 1 to 0.1
lambda.Theta.vec <- rep(0.1, 10) # for each value of 'lambda.Beta', 'lambda.Theta' remains constant at 0.1
fit_list <- missoNet(X = X.tr, Y = Z.tr, lambda.Beta = lambda.Beta.vec, lambda.Theta = lambda.Theta.vec, verbose = 0)
The command above returns a sequence of models for users to choose from. In many cases, users may prefer that the software selects the best fit among them. Cross-validation is perhaps the simplest and most widely used method for this task. cv.missoNet is the main function for cross-validation, along with supporting methods such as plot and predict.
Here we use cv.missoNet to perform a five-fold cross-validation; the samples will be permuted before being split into folds. For reproducibility, we assign a random seed to the permutation.
## If 'permute' = FALSE, the samples will be split into k folds in their original order,
## i.e. the first (n/'kfold') samples will be assigned to the first fold and so on.
cvfit <- cv.missoNet(X = X.tr, Y = Z.tr, kfold = 5,
lambda.Beta = lambda.Beta.vec, lambda.Theta = lambda.Theta.vec,
permute = TRUE, with.seed = 433, verbose = 0)
#> Warning in boundaryCheck(lambda.Theta = lambda.Theta, lambda.Beta = lambda.Beta, :
#> The optimal `lambda.Theta` is close to the upper boundary of the search scope,
#> try to provide a new sequence covering larger values for the following argument:
#>
#> 1. lambda.Theta
The program has warned us that the range of \(\lambda_\Theta\) is inappropriate and that we need to supply a sequence of \(\lambda_\Theta\) covering larger values. However, picking a suitable range for such hyperparameters may require experience or a series of trials and errors. Therefore, cv.missoNet provides a smarter way to solve this problem.
Let’s fit the cross-validation model again, this time running all folds in parallel on two CPU cores. The parallelization of missoNet models relies on package parallel, so make sure the parallel clusters have been registered beforehand.
## 'fit.1se = TRUE' tells the program to make additional estimations of the
## parameters at the largest value of 'lambda.Beta' / 'lambda.Theta' that gives
## the most regularized model such that the cross-validated error is within one
## standard error of the minimum.
cl <- parallel::makeCluster(min(parallel::detectCores()-1, 2))
cvfit <- cv.missoNet(X = X.tr, Y = Z.tr, kfold = 5,
fit.1se = TRUE, parallel = TRUE, cl = cl,
permute = TRUE, with.seed = 433, verbose = 1)
parallel::stopCluster(cl)
Note that we have not explicitly specified the regularization parameters \(\lambda_B\) and \(\lambda_\Theta\) in the command above. In this case, a grid of \(\lambda_B\) and \(\lambda_\Theta\) values in a (hopefully) reasonable range will be computed and used by the program. Users can even provide just one of the \(\lambda_B\) and \(\lambda_\Theta\) vectors, and the program will compute the other. It is generally difficult at the beginning to choose suitable \(\lambda\) sequences, so we suggest users start their analysis with cv.missoNet, let the program compute appropriate \(\lambda_B\) and \(\lambda_\Theta\) values itself, and then automatically pick the optimal combination at which the minimum cross-validated error is achieved.
cv.missoNet returns a 'cv.missoNet' object, a list with all the ingredients of the cross-validated fit. We can execute the plot method
## The values of mean cross-validated errors along with upper and lower standard deviation bounds
## can be accessed via 'cvfit$cvm', 'cvfit$cvup' and 'cvfit$cvlo', respectively.
plot(cvfit)
to visualize the mean cross-validated error in a heatmap style. The white solid box marks the position of the minimum mean cross-validated error with the corresponding \(\lambda_B\) and \(\lambda_\Theta\) ("lambda.min" := \(({\lambda_B}_\text{min}, {\lambda_\Theta}_\text{min})\)), and the white dashed boxes indicate the largest \(\lambda_B\) and \(\lambda_\Theta\) at which the mean cross-validated error is within one standard error of the minimum, with the other one fixed at "lambda.min" (i.e. "lambda.1se.Beta" := \(({\lambda_B}_\text{1se}, {\lambda_\Theta}_\text{min})\) and "lambda.1se.Theta" := \(({\lambda_B}_\text{min}, {\lambda_\Theta}_\text{1se})\)). We often use this “one-standard-error” rule when selecting the most parsimonious model; this acknowledges the fact that the cross-validation surface is estimated with error, so erring on the side of parsimony is preferable.
We can also plot the errors in a 3D scatter plot (without surfaces):
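A sketch of such a call follows; the argument names 'type' and 'plt.surf' are our assumptions, so check ?plot.cv.missoNet for the exact interface:

plot(cvfit, type = "cv.scatter", plt.surf = FALSE)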
After cross-validation, \(\mathbf{B}^*\) and \(\mathbf{\Theta}^*\) can be estimated at the three special \(\lambda\) pairs discussed above – "lambda.min", "lambda.1se.Beta" and "lambda.1se.Theta". The corresponding results, along with the \(\lambda\) values, are stored in three separate lists: 'est.min', 'est.1se.B' and 'est.1se.Tht' ('est.1se.B' and 'est.1se.Tht' are NULL unless the argument 'fit.1se' = TRUE when calling cv.missoNet).
Let’s extract the estimates of the model parameters \(\hat{\mathbf{B}}\) and \(\hat{\mathbf{\Theta}}\) at "lambda.min" and "lambda.1se.Beta", then plot them beside the ground truth \(\mathbf{B}^*\) and \(\mathbf{\Theta}^*\):
## Define a plotting function
plot_heatmap <- function(est, col, legend = FALSE, lgd_name, title) {
return(ComplexHeatmap::Heatmap(est, cluster_rows = FALSE, cluster_columns = FALSE,
col = col, show_heatmap_legend = legend, name = lgd_name,
column_names_gp = grid::gpar(fontsize = 8),
row_names_gp = grid::gpar(fontsize = 8), row_names_side = "left",
border = TRUE, column_title = title))
}
## Color space
col <- circlize::colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))
## Beta*
Beta_star <- sim.dat$Beta
Beta_star_ht <- plot_heatmap(Beta_star, col, title = expression(paste(bold(Beta), "*")))
## Beta_hat at "lambda.min"
Beta_hat_min <- cvfit$est.min$Beta
Beta_hat_min_ht <- plot_heatmap(Beta_hat_min, col, title = expression(paste(hat(bold(Beta)), " at `lambda.min`")))
## Beta_hat at "lambda.1se.Beta"
Beta_hat_1se <- cvfit$est.1se.B$Beta
Beta_hat_1se_ht <- plot_heatmap(Beta_hat_1se, col, legend = TRUE, lgd_name = "values",
title = expression(paste(hat(bold(Beta)), " at `lambda.Beta.1se`")))
## Theta*
Theta_star <- sim.dat$Theta
Theta_star_ht <- plot_heatmap(Theta_star, col, title = expression(paste(bold(Theta), "*")))
## Theta_hat at "lambda.min"
Theta_hat_min <- cvfit$est.min$Theta
Theta_hat_min_ht <- plot_heatmap(Theta_hat_min, col, title = expression(paste(hat(bold(Theta)), " at `lambda.min`")))
## Theta_hat at "lambda.1se.Beta"
Theta_hat_1se <- cvfit$est.1se.B$Theta
Theta_hat_1se_ht <- plot_heatmap(Theta_hat_1se, col, legend = TRUE, lgd_name = "values",
title = expression(paste(hat(bold(Theta)), " at `lambda.Beta.1se`")))
## Plot
Beta_star_ht + Beta_hat_min_ht + Beta_hat_1se_ht
Theta_star_ht + Theta_hat_min_ht + Theta_hat_1se_ht
Like other regression methods, predictions can be made from the fitted 'cv.missoNet' object as well. The code below gives predictions for a new input matrix 'newx' at "lambda.min" ('s' = "lambda.1se.Beta" and 's' = "lambda.1se.Theta" are available only when 'fit.1se' = TRUE):
## Predictions on the test set: newy = mu + newx %*% Beta.
## 's' can also be "lambda.1se.Beta" or "lambda.1se.Theta"
## (why 's' but not 'lambda'? Because we follow the naming rules of glmnet).
newy <- predict(cvfit, newx = sim.dat$X[tst, ], s = "lambda.min")
cat("dim(newy):", dim(newy))
#> dim(newy): 30 12
Users should now be able to use the missoNet package with the functions introduced so far. There are many more arguments in the functions that give users a great deal of flexibility. To learn more, move on to the next section. In the section on a real data application, we demonstrate a complete process of analyzing a neuroimaging and genetic dataset using missoNet, and we highly recommend that interested users browse this content.
missoNet includes a variety of functions for data simulation, goodness-of-fit evaluation, regularization parameter tuning, and visualization of results, as well as predictions in new data. In addition to the basic function arguments introduced so far, there are some other commonly used arguments available for users to customize the fit.
cv.missoNet
'rho': (optional) a scalar or a numeric vector of length \(q\) giving the missing probabilities of the response variables. The default is 'rho' = NULL, in which case the program computes the empirical missing rate for each column of 'Y' and uses these as the working missing probabilities; a user-supplied 'rho' overrides this default. The default setting should suffice in most cases; note that misspecified missing probabilities would introduce biases into the model. Use the global assignment of an overall missing probability (i.e. a scalar 'rho') with care, because it might introduce extra errors, especially when the outcomes in 'Y' have highly unbalanced missing rates across variables.
'lambda.Beta' and 'lambda.Theta': (optional) user-supplied sequences of regularization parameters for the lasso penalties, {\(\lambda_B\)} and {\(\lambda_\Theta\)}, among which the cross-validation procedure searches. When 'lambda.Beta' = NULL and/or 'lambda.Theta' = NULL (the default), a grid of \(\lambda_B\) and/or \(\lambda_\Theta\) values in a (hopefully) reasonable range will be computed and used by the program. In this case, the following arguments can be used to adjust the range of the sequences as well as the density of the grid (for brevity, similar control arguments are discussed together below, with “/” serving as a separator between \(\lambda_B\) and \(\lambda_\Theta\)):
'lamBeta.min.ratio' / 'lamTheta.min.ratio': the smallest value of \(\lambda_B\) / \(\lambda_\Theta\) is calculated as the data-derived \(\text{max}(\lambda_B)\) / \(\text{max}(\lambda_\Theta)\) multiplied by this positive ratio. This argument controls the width of the generated lambda sequence by trimming its smallest end. The default depends on the sample size \(n\) relative to the number of predictors \(p\) / responses \(q\): if \(n>p\) / \(n>q\), the default is 'lamBeta.min.ratio' = 1.0E-4 / 'lamTheta.min.ratio' = 1.0E-4, otherwise it is 1.0E-2 / 1.0E-2. A very small value of 'lamBeta.min.ratio' / 'lamTheta.min.ratio' may significantly increase runtime and lead to a saturated fit when \(n \leq p\) / \(n \leq q\).

'n.lamBeta' / 'n.lamTheta': the program generates 'n.lamBeta' / 'n.lamTheta' values, linear on the log scale, from \(\text{max}(\lambda_B)\) / \(\text{max}(\lambda_\Theta)\) down to (\(\text{max}(\lambda_B)*\)'lamBeta.min.ratio') / (\(\text{max}(\lambda_\Theta)*\)'lamTheta.min.ratio'). If \(n>p\) / \(n>q\), the default is 'n.lamBeta' = 40 / 'n.lamTheta' = 40, otherwise it is 20 / 20. This argument controls the density of the generated lambda sequence. Avoid supplying a very large number to both 'n.lamBeta' and 'n.lamTheta' at the same time, because the program will fit ('n.lamBeta' \(*\) 'n.lamTheta') models in total for each fold of the cross-validation, so the complexity grows quadratically in the sequence length. Instead, users can adopt a dynamic search strategy; see the next section for an example.

'lamBeta.scale.factor' / 'lamTheta.scale.factor': a positive multiplication factor controlling the overall magnitude of the entire \(\lambda_B\) / \(\lambda_\Theta\) sequence. This argument scales the lambda sequence (or, equivalently, shifts it on the log scale); the default is 'lamBeta.scale.factor' = 1 / 'lamTheta.scale.factor' = 1. A typical usage is when the program has warned that the magnitudes of the \(\lambda_B\) / \(\lambda_\Theta\) values are inappropriate (either too large or too small), so that the optimal value of \(\lambda_B\) / \(\lambda_\Theta\) selected by the cross-validation is close to either boundary of the search range.

'standardize' and 'standardize.response' are logical flags (the defaults are TRUE) for standardizing the variables in 'X' and 'Y' to have unit variances prior to fitting the model. The estimated results are always returned on the original scale. cv.missoNet computes appropriate lambda sequences relying on standardization. If users wish to compare the results with those of other software, it is suggested to supply the program with datasets that have been standardized beforehand (using 'scale()' or similar functions), then set 'standardize' and 'standardize.response' to FALSE.
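A minimal sketch of that workflow, reusing the training matrices from the earlier example:

## Standardize beforehand ('scale()' computes the column means/SDs with NAs
## removed, and NA entries stay NA), then disable internal standardization.
X.std <- scale(X.tr)
Z.std <- scale(Z.tr)
fit.std <- missoNet(X = X.std, Y = Z.std,
                    lambda.Beta = 0.1, lambda.Theta = 0.1,
                    standardize = FALSE, standardize.response = FALSE,
                    verbose = 0)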
'fit.1se' is a logical flag (the default is FALSE) telling the program whether \(\mathbf{B}^*\) and \(\mathbf{\Theta}^*\) need to be estimated at "lambda.1se.Beta" := \(({\lambda_B}_\text{1se}, {\lambda_\Theta}_\text{min})\) and "lambda.1se.Theta" := \(({\lambda_B}_\text{min}, {\lambda_\Theta}_\text{1se})\), where \({\lambda_B}_\text{1se}\) and \({\lambda_\Theta}_\text{1se}\) are the largest \(\lambda_B\) and \(\lambda_\Theta\), respectively, at which the mean cross-validated error is within one standard error of the minimum while the other one is fixed at "lambda.min" := \(({\lambda_B}_\text{min}, {\lambda_\Theta}_\text{min})\). The corresponding estimation results are stored in 'est.1se.B' and 'est.1se.Tht' inside the returned object.
'fit.relax': if TRUE (the default is FALSE), the program will re-estimate the edges in the active set (i.e. the non-zero off-diagonal elements) of \(\hat{\mathbf{\Theta}}\) without penalization (\(\lambda_\Theta = 0\)); such a debiased estimate of the network structure could be useful for some interdependency analyses. This “relaxed” fit is stored in 'relax.net'. WARNING: there may be convergence issues if the working empirical covariance matrix is not of full rank (e.g. \(n < q\)).
'verbose': can take the value 0, 1 or 2. 'verbose' = 0 – silent mode; 'verbose' = 1 (the default) – limited tracing with progress bars; 'verbose' = 2 – detailed tracing. Note that displaying the progress bars slightly increases the computational overhead compared to the silent mode. The detailed tracing should be used for monitoring progress only when the program runs extremely slowly, and it is not supported under 'parallel' = TRUE.
'parallel' and 'cl': if 'parallel' = TRUE, the program uses clusters to compute all folds of the cross-validation in parallel. Users must first register parallel clusters by using parallel::makeCluster('nCores') and supply the created cluster object to 'cl'. One way to determine the number of cores for parallelization could be 'nCores' = min(parallel::detectCores()-1, ceiling(kfold/2)). Note that if 'verbose' == 1 and 'nCores' == 'kfold', the progress bar for the cross-validation process will jump directly from 0 to 100, because the computations for all folds start and end at approximately the same time on a homogeneous machine.
missoNet
missoNet shares most arguments with cv.missoNet, except that 'lambda.Beta' and 'lambda.Theta' are not optional for missoNet. Instead, users must specify either a scalar or a vector for both 'lambda.Beta' and 'lambda.Theta'; the vectors must have the same length. missoNet will be called with each consecutive pair of 'lambda.Beta' and 'lambda.Theta' – i.e. with the first elements of the two vectors ('lambda.Beta[1]', 'lambda.Theta[1]'), then with the second elements ('lambda.Beta[2]', 'lambda.Theta[2]'), etc.
generateData
generateData provides a convenient way for users to quickly generate simulation data and start building their own analysis frameworks.
Given the predictor matrix \(\mathbf{X}\) and the true parameters \(\mathbf{B}^*\) and \(\mathbf{\Theta}^*\), a response matrix \(\mathbf{Y}\) without missing values is generated by \(\mathbf{Y} = \mathbf{X}\mathbf{B}^* + \mathbf{E}\), where the rows of \(\mathbf{E}\) are sampled from \(\mathcal{MVN}(0_q,(\mathbf{\Theta}^*)^{-1})\). A corrupted response matrix \(\mathbf{Z} := f_{\mathbf{M}}(\mathbf{Y})\) has elements \(\mathbf{Z}_{ij}\) = NA if \(\mathbf{M}_{ij} = 1\), otherwise \(\mathbf{Z}_{ij} = \mathbf{Y}_{ij}\). \(\mathbf{M}\) is an indicator matrix of missingness having the same dimensions as \(\mathbf{Y}\), parameterized by 'rho' and 'missing.type'; see below for details.
'Beta': (optional) a user-supplied true regression coefficient matrix \(\mathbf{B}^*\) (\(p \times q\)). If 'Beta' = NULL (the default), a dense (\(p \times q\)) matrix will first be created by setting each element to a random draw from \(\mathcal{N}(0, 1)\), then a sparse \(\mathbf{B}^*\) is generated by randomly assigning zeros to some of the elements of this matrix. Two kinds of sparsity can be controlled by the following arguments when \(\mathbf{B}^*\) is automatically generated:

'Beta.row.sparsity': a Bernoulli parameter between 0 and 1 controlling the approximate proportion of rows of \(\mathbf{B}^*\) in which at least one element could be nonzero.

'Beta.elm.sparsity': a Bernoulli parameter between 0 and 1 controlling the approximate proportion of nonzero elements among the rows of \(\mathbf{B}^*\) where not all elements are zeros.

'Theta': (optional) a user-supplied (positive definite) true precision matrix \(\mathbf{\Theta}^*\) (\(q \times q\)) for sampling the error matrix \(\mathbf{E} \sim \mathcal{MVN}(0_q,(\mathbf{\Theta}^*)^{-1})\). The default is 'Theta' = NULL, in which case the program generates a block-structured \(\mathbf{\Theta}^*\) having four blocks corresponding to four types of network structures: independent, weak graph, strong graph and chain. This argument is only needed when 'E' = NULL.
'Sigma.X': (optional) a user-supplied (positive definite) covariance matrix (\(p \times p\)) for sampling the predictor matrix \(\mathbf{X} \sim \mathcal{MVN}(0_p, \mathbf{\Sigma}_X)\). If 'Sigma.X' = NULL (the default), the function will generate \(\mathbf{X}\) using an AR(1) covariance matrix with 0.7 autocorrelation (i.e. \([\mathbf{\Sigma}_X]_{jk} = 0.7^{|j-k|}\)). This argument is only needed when 'X' = NULL.
'rho': a scalar or a vector of length \(q\) specifying the approximate proportion of missing values in each column of \(\mathbf{Z}\). Note that a scalar will be automatically converted to a vector filled with the same values.
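For example (the per-column rates below are arbitrary, chosen for illustration):

## 5% missing in the first six responses, 20% in the remaining six.
rho.vec <- c(rep(0.05, 6), rep(0.20, 6))
sim.dat2 <- generateData(n = 150, p = 15, q = 12, rho = rho.vec,
                         missing.type = "MCAR", with.seed = 1512)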
'missing.type': this argument determines how the missing values in the corrupted response matrix \(\mathbf{Z}\) are randomly generated, and can be one of "MCAR", "MAR" and "MNAR":
"MCAR"
: missing completely at random. For all \(i=1,\dots,n\) and \(j=1,\dots,q\): \(\mathbf{Z}_{ij}\) = NA
if
\(\mathbf{M}_{ij}=1\), otherwise \(\mathbf{Z}_{ij}\) = \(\mathbf{Y}_{ij}\), where \(\mathbf{M}_{ij}\) are i.i.d. Bernoulli
draws with probability 'rho[j]'
;"MAR"
: missing at random. For all \(i=1,\dots,n\) and \(j=1,\dots,q\): \(\mathbf{Z}_{ij}\) = NA
if
\(\mathbf{M}_{ij}=1\), otherwise \(\mathbf{Z}_{ij}\) = \(\mathbf{Y}_{ij}\), where \(\mathbf{M}_{ij}\) are Bernoulli draws with
probability ('rho[j]'
\(*\frac{c}{1+e^{-[\mathbf{X}\mathbf{B}^*]_{ij}}}\)),
in which \(c\) is a constant correcting
the missing rate of the \(j\)th column
of \(\mathbf{Y}\) back to the level of
'rho[j]'
;"MNAR"
: missing not at random. For all \(i=1,\dots,n\) and \(j=1,\dots,q\): \(\mathbf{Z}_{ij}\) = NA
if
\(\mathbf{M}_{ij}=1\), otherwise \(\mathbf{Z}_{ij}\) = \(\mathbf{Y}_{ij}\), where \(\mathbf{M}_{ij} = 1 \cdot (\mathbf{Y}_{ij} <
T^j)\), in which \(T^{j} =
\mathcal{Q}\)('rho[j]'
, \(\mathbf{Y}_{\cdot j}\)) is the
'rho[j]'
-quantile of \(\mathbf{Y}_{\cdot j}\). In other words,
\(\mathbf{Y}_{ij}\) will be missing if
it is smaller than the 'rho[j]'
-quantile of the \(j\)th column of \(\mathbf{Y}\).plot
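To make the "MNAR" rule concrete, here is a minimal standalone sketch (ours, not code from the package) that applies the thresholding to a generic response matrix:

## Illustrative only: Y_ij becomes NA whenever it falls below the
## rho[j]-quantile T^j of the j-th column of Y.
make_mnar <- function(Y, rho) {
  Z <- Y
  for (j in seq_len(ncol(Y))) {
    T.j <- quantile(Y[, j], probs = rho[j])
    Z[Y[, j] < T.j, j] <- NA
  }
  Z
}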
plot

'detailed.axes': a logical flag telling the plotting function whether detailed axes should be plotted. The default is 'detailed.axes' = TRUE; set it to FALSE when the \(\lambda\) values are too dense (e.g. if you have assigned a large number to 'n.lamBeta' or 'n.lamTheta').

In this section, we apply missoNet to a relatively high-dimensional dataset – bgsmtr_example_data, part of the bgsmtr package, obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI-1) database.
This example dataset consists of 15 structural neuroimaging measures and 486 single nucleotide polymorphisms (SNPs) determined from a sample of 632 subjects. Importantly, the 486 SNPs cover 33 genes deemed associated with Alzheimer’s disease. In this case, \(\mathbf{X}\) is a 632-by-486 matrix containing minor allele counts (i.e. 0, 1 or 2) for 486 SNPs on the 632 subjects. \(\mathbf{Y}\) is a 632-by-15 matrix containing volumetric and cortical thickness measures for 15 regions of interest. For more details, see Greenlaw et al. (2017) <doi:10.1093/bioinformatics/btx215> and bgsmtr.
First, let’s load the dataset and randomly split the data into training and test sets. As a demonstration, we only use the SNPs with large variance (i.e. variance in the top 50% of all SNPs, \(p\) = 243) as predictor variables. In most practical cases, this kind of unsupervised filtering of variables for dimension reduction prior to analysis can help the algorithm make faster progress and produce more robust results.
## Load data
library(bgsmtr, quietly = TRUE)
data(bgsmtr_example_data)
## Transpose data matrix to make rows correspond to
## samples and columns correspond to variables.
SNP <- t(bgsmtr_example_data$SNP_data) # predictor matrix
BM <- t(bgsmtr_example_data$BrainMeasures) # complete response matrix
## Unsupervised filtering of the top 50% variables.
SNP_var <- apply(SNP, 2, var)
SNP_subset <- SNP_var > quantile(SNP_var, 0.5)
SNP <- SNP[ ,SNP_subset]
set.seed(123) # a random seed for reproducibility
tr <- sample(1:632, 550, replace = FALSE) # training set indices
tst <- c(1:632)[-tr] # test set indices
cat("dim(SNP):", dim(SNP[tr, ]), "
dim(BM):", dim(BM[tr, ]))
#> dim(SNP): 550 243
#> dim(BM): 550 15
Since this dataset does not contain missing values, we manually drop around 5% of values in the matrix of brain measures (outcomes) following the MCAR mechanism:
## Generate the indicator matrix of missingness.
M <- matrix(1, nrow(BM), ncol(BM))
NA.pos <- do.call("cbind", lapply(1:ncol(BM), function(x) {
rbinom(nrow(BM), size = 1, prob = 0.05) == 1
}))
## All missing values should be coded as 'NA's or 'NaN's.
M[NA.pos] <- NA
BM.mis <- BM * M
BM.mis[1:6, 5:8]
#> Left_InfLatVent.adj Left_LatVent.adj Left_EntCtx.adj Left_Fusiform.adj
#> V1 -473.42061 -4595.2672 0.60678817 NA
#> V2 100.12989 -9163.7594 0.29868885 0.196951758
#> V3 373.36131 -10659.8137 0.18835349 0.105863582
#> V4 876.03547 NA -1.19832801 -0.233082858
#> V5 -887.40929 -6789.0606 0.09203347 0.098447465
#> V6 354.12157 NA -0.32757347 0.310694508
We remark that this is a challenging task: the relatively small sample size and the presence of non-Gaussian variables give the estimates large variance and make them depend heavily on how the data are split for the cross-validation, which is likely to cause convergence difficulties and abnormal cross-validation behavior (e.g. the optimal area of \(\lambda := (\lambda_B, \lambda_\Theta)\) is heavily skewed towards the boundary of the search domain). Therefore, we adopt a dynamic “coarse + fine” search strategy here to avoid unnecessary over-searching in areas with obviously excessive cross-validated errors.
We first perform a large-scope coarse search for the optimal lambda pair \(({\lambda_B}_\text{min}^\text{coarse}, {\lambda_\Theta}_\text{min}^\text{coarse})\) that achieves the minimum prediction error using cv.missoNet. We set 'lamBeta.scale.factor' = 3 and 'lamTheta.scale.factor' = 3 (the default is 1) to raise the upper boundaries of the search ranges for \(\lambda_B\) and \(\lambda_\Theta\) (i.e. \(\text{max}(\lambda_B)\) and \(\text{max}(\lambda_\Theta)\)) a little, while 'lamBeta.min.ratio' and 'lamTheta.min.ratio' are given a small value close to zero, 1E-4 (the default), to ensure that the entire range of all possible solutions is covered. For a rough search, the number of lambda values ('n.lamBeta' and 'n.lamTheta') usually does not need to be very large, because we only need approximate ranges of \(\lambda_B\) and \(\lambda_\Theta\) in which \(({\lambda_B}_\text{min}^\text{coarse}, {\lambda_\Theta}_\text{min}^\text{coarse})\) is most likely to lie.
## Model I (the coarse-grid fit)
## To be more in line with real-world usage, the model is trained without specified
## missing probabilities for the response variables ('rho' = NULL) and lambda sequences
## for the lasso penalties ('lambda.Beta' = NULL, 'lambda.Theta' = NULL), in which case
## the program will automatically compute and use reasonable values.
cl <- parallel::makeCluster(min(parallel::detectCores()-1, 2))
cvfit.BM <- cv.missoNet(X = SNP[tr, ], Y = BM.mis[tr, ], kfold = 5,
rho = NULL, lambda.Beta = NULL, lambda.Theta = NULL,
lamBeta.min.ratio = 1e-4, lamTheta.min.ratio = 1e-4,
n.lamBeta = 20, n.lamTheta = 20,
lamBeta.scale.factor = 3, lamTheta.scale.factor = 3,
standardize = TRUE, standardize.response = TRUE,
permute = TRUE, with.seed = 433,
parallel = TRUE, cl = cl, verbose = 1)
parallel::stopCluster(cl)
Then we plot the heatmap of the standardized mean cross-validated errors:
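## Heatmap of the standardized mean cross-validated errors of Model I.
plot(cvfit.BM)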
Now we can pick more reasonable ranges of \(\lambda_B\) and \(\lambda_\Theta\) for a further fine-grid search, according to the figure above. We observe that the current optimal \(\lambda\) = \(({\lambda_B}_\text{min}^\text{coarse}, {\lambda_\Theta}_\text{min}^\text{coarse})\) achieving the minimum cross-validated error (white box) is about (0.24, 0.62), while \(\text{max}(\lambda_B)\) and \(\text{max}(\lambda_\Theta)\) are around 2.66 (appropriate) and 29.94 (too large), respectively. Therefore, we decide to shrink 'lamTheta.scale.factor' from 3 to 0.3 (= 3/10), as we believe 2.99 (= 29.94/10) is a more appropriate upper boundary for \(\lambda_\Theta\). 'lamBeta.scale.factor' can remain at 3 so that \(\text{max}(\lambda_B)\) will still be 2.66. Similarly, the new \(\text{min}(\lambda_B)\) and \(\text{min}(\lambda_\Theta)\) are set to 0.027 (= \(2.66*0.01\)) and 0.030 (= \(2.99*0.01\)), respectively, by setting 'lamBeta.min.ratio' = 0.01 and 'lamTheta.min.ratio' = 0.01.
TIPS: sometimes the lambda sequences automatically computed by the program may have inappropriate ranges. If the optimal lambda pair "lambda.min" := \(({\lambda_B}_\text{min}, {\lambda_\Theta}_\text{min})\) selected by the cross-validation is located at the boundary of the search domain, users can use the arguments 'x.scale.factor' and/or 'x.min.ratio' to shift and/or zoom the search ranges accordingly.
Because we are going to do a fine-grid search, the number of lambda values can be slightly increased (here we increase 'n.lamBeta' and 'n.lamTheta' from 20 to 40). Typically we suggest 'n.x' = -log10('x.min.ratio') * c, where c \(\in\) [10, 20] (x is a placeholder for lamBeta or lamTheta), to achieve a balance between the computational fidelity and speed.
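For instance, the fine-grid settings below satisfy this rule:

## Rule-of-thumb check: 'n.x' = -log10('x.min.ratio') * c, with c in [10, 20].
-log10(0.01) * 20 # = 40, the 'n.lamBeta'/'n.lamTheta' used below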
## Model II (the fine-grid fit)
## For brevity, arguments taking the defaults are not specified explicitly.
cl <- parallel::makeCluster(min(parallel::detectCores()-1, 2))
cvfit2.BM <- cv.missoNet(X = SNP[tr, ], Y = BM.mis[tr, ], kfold = 5,
lamBeta.min.ratio = 0.01, lamTheta.min.ratio = 0.01,
n.lamBeta = 40, n.lamTheta = 40,
lamBeta.scale.factor = 3, lamTheta.scale.factor = 0.3,
fit.relax = TRUE, permute = TRUE, with.seed = 433,
parallel = TRUE, cl = cl, verbose = 1)
parallel::stopCluster(cl)
## Certainly, users can specify the lambda values (linear on the log scale) by themselves
## for the cross-validation search rather than use the control arguments. The following
## commands should return almost the same results as the above ones (subtle differences
## come from rounding the float numbers).
# cl <- parallel::makeCluster(min(parallel::detectCores()-1, 2))
# cvfit2.BM <- cv.missoNet(X = SNP[tr, ], Y = BM.mis[tr, ], kfold = 5,
# lambda.Beta = 10^(seq(from = log10(2.66), to = log10(2.66*0.01), length.out = 40)),
# lambda.Theta = 10^(seq(from = log10(2.99), to = log10(2.99*0.01), length.out = 40)),
# fit.relax = TRUE, permute = TRUE, with.seed = 433,
# parallel = TRUE, cl = cl, verbose = 0)
# parallel::stopCluster(cl)
Plot the standardized mean cross-validated errors of this refitted model; 'detailed.axes' = FALSE prevents the axis labels from being squeezed together:
cat("\"lambda.min\":
- lambda.Beta:", cvfit2.BM$est.min$lambda.Beta, "
- lambda.Theta:", cvfit2.BM$est.min$lambda.Theta)
plot(cvfit2.BM, detailed.axes = FALSE)
#> "lambda.min":
#> - lambda.Beta: 0.198283
#> - lambda.Theta: 0.573265
We found that the new "lambda.min" = \(({\lambda_B}_\text{min}^\text{fine}, {\lambda_\Theta}_\text{min}^\text{fine})\) is (0.20, 0.57), which is actually very close to the one from the coarse search, \(({\lambda_B}_\text{min}^\text{coarse}, {\lambda_\Theta}_\text{min}^\text{coarse})\) = (0.24, 0.62), showing that a single search can provide reasonable accuracy most of the time, as long as sufficiently large search ranges are given.
As we did before, the estimated parameters \(\hat{\mathbf{B}}\) and \(\hat{\mathbf{\Theta}}\) at the new "lambda.min" can be visualized by heatmaps. For the sake of brevity, we shall skip the discussion of "lambda.1se.Beta" and "lambda.1se.Theta" here.
## Theta_hat at "lambda.min"
Theta_hat <- cvfit2.BM$est.min$Theta
rownames(Theta_hat) <- colnames(Theta_hat) <- colnames(BM)
col <- circlize::colorRamp2(c(-2, 0, 2), c("blue", "white", "red"))
ht1 <- plot_heatmap(Theta_hat, col = col, legend = TRUE, lgd_name = "values",
title = expression(paste(hat(bold(Theta)), " at `lambda.min` (unclustered)")))
plot(ht1)
## Beta_hat at "lambda.min"
Beta_hat <- cvfit2.BM$est.min$Beta
colnames(Beta_hat) <- colnames(BM)
col <- circlize::colorRamp2(c(-0.05, 0, 0.05), c("blue", "white", "red"))
ht2 <- plot_heatmap(Beta_hat, col = col, legend = TRUE, lgd_name = "values",
title = expression(paste(hat(bold(Beta)), " at `lambda.min` (unclustered)")))
plot(ht2)
## Relaxed network
relax.net <- cvfit2.BM$est.min$relax.net
rownames(relax.net) <- colnames(relax.net) <- colnames(BM)
col <- circlize::colorRamp2(c(-4, 0, 4), c("blue", "white", "red"))
ht3 <- plot_heatmap(relax.net, col = col, legend = TRUE, lgd_name = "values",
title = "Relaxed fit of network (unclustered)")
plot(ht3)
## Plot the partial correlation matrix of the 15 neuroimaging measures.
## Must supply a dataset without missing values, so we use the original dataset here.
pcor.mat <- ppcor::pcor(BM) # not BM.mis
corrplot::corrplot(pcor.mat$estimate, p.mat = pcor.mat$p.value,
sig.level = 0.01, insig = "blank",
tl.cex = 0.5, tl.col = "black", mar = c(.5, .1, 2, .1),
title = "Partial correlation matrix (unclustered)")
The conditional network structure \(\hat{\mathbf{\Theta}}\) can be represented by an undirected graph \(\mathcal{G}\) consisting of a set of vertices \(V\) and a set of edges \(E\), where an edge connects a pair of variables (\(Y_j\) and \(Y_k\)) if and only if they are conditionally dependent (\(\hat{\mathbf{\Theta}}_{jk} \neq 0\)). One important feature of missoNet is that the program supports fitting a relaxed conditional graphical model upon the response variables (by setting 'fit.relax' = TRUE). The active edges (non-zero off-diagonal elements) of \(\hat{\mathbf{\Theta}}\) will be re-estimated without penalization ('lambda.Theta' = 0); such a debiased estimate of the network structure would be useful for exploring the conditional interdependencies among traits of interest.
The two figures in the second row above show the relaxed conditional network and the partial correlation matrix of the 15 neuroimaging measures. As can be readily seen, missoNet sharply simplifies the dependencies compared with the naive partial correlation analysis below right (insignificant correlations with \(p>0.01\) are not plotted): by regressing on the SNPs and adjusting for the genetic effects, it helps us focus on the truly important, non-genetically induced connections in the brain. Moreover, we emphasize that, unlike traditional partial correlation analysis, which requires a complete dataset (pairwise-complete cases can be used, but the results tend to be biased), missoNet substantially reduces the difficulty of data acquisition and processing by allowing missing values in the tasks.
If we cluster the estimated coefficient matrix \(\hat{\mathbf{B}}\) by rows (SNPs) in an unsupervised manner:
colnames(Beta_hat) <- colnames(BM) # brain measure names
rownames(Beta_hat) <- colnames(SNP) # SNP names
Gene_groups <- bgsmtr_example_data$SNP_groups[SNP_subset] # gene names
ComplexHeatmap::pheatmap(Beta_hat,
annotation_row = data.frame("Gene" = Gene_groups, row.names = rownames(Beta_hat)),
color = circlize::colorRamp2(c(-0.05,0,0.05), c("blue","white","red")), show_rownames = FALSE,
border_color = NA, border = TRUE, name = "values", cluster_rows = TRUE, cluster_cols = FALSE,
column_title = expression(paste(hat(bold(Beta)), " at `lambda.min` (clustered)")))
The figure above shows that a very large proportion of the coefficients are zero, and that there is an obvious clustering of the contributing SNPs with respect to the genes (see the row-wise annotation), implying potential pathways between certain SNPs and changes in local brain structures.
Finally, we end this section by refitting the model using the original dataset without missing values:
## Model III (the complete-data fit)
## All arguments are kept unchanged compared to model II, except BM.mis -> BM.
cl <- parallel::makeCluster(min(parallel::detectCores()-1, 2))
cvfit3.BM <- cv.missoNet(X = SNP[tr, ], Y = BM[tr, ], kfold = 5,
lamBeta.min.ratio = 0.01, lamTheta.min.ratio = 0.01,
n.lamBeta = 40, n.lamTheta = 40,
lamBeta.scale.factor = 3, lamTheta.scale.factor = 0.3,
fit.relax = TRUE, permute = TRUE, with.seed = 433,
parallel = TRUE, cl = cl, verbose = 1)
parallel::stopCluster(cl)
Then we can assess the prediction performance of the models trained using the corrupted vs. complete dataset.
## Predictions on the test set.
newy.model2 <- predict(cvfit2.BM, newx = SNP[tst, ], s = "lambda.min")
newy.model3 <- predict(cvfit3.BM, newx = SNP[tst, ], s = "lambda.min")
cat("MAE on the test set:
- Model II (corrupted):", mean(abs(BM[tst, ] - newy.model2)), "
- Model III (complete):", mean(abs(BM[tst, ] - newy.model3)))
#> MAE on the test set:
#> - Model II (corrupted): 2543.588
#> - Model III (complete): 2543.591
As expected, there is no obvious difference between the two models with regard to the mean absolute prediction error (MAE) on the test set, since we generated the missing data simply following the MCAR mechanism. Real missing data usually fall on a continuum between one extreme, where the systematic missingness pattern depends entirely on the observed data (pure MAR), and the other extreme, where it depends entirely on the missing data (pure MNAR). Even so, missoNet can still provide good estimates compared with algorithms that cannot tolerate missing values.