gesso
This vignette is a tutorial about how to use the gesso
package
The package is developed to fit a regularized regression model that we call gesso for the joint selection of gene-environment (GxE) interactions. The model focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure. We developed and implemented an efficient fitting algorithm and screening rules that can discard large numbers of irrelevant predictors with high accuracy.
gesso model induces hierarchical selection of the (GxE) interaction terms via overlapped group lasso structure. The model has two tuning parameters \(\lambda_1\) and \(\lambda_2\) allowing flexible, data-dependent control over group sparsity and additional interactions sparsity.
Response (outcome) variable can be either continuous or binary 0/1 variable. For a binary response, the IRLS procedure is used with the custom screening rules we developed.
The package supports sparse matrices dgCMatrix
and (filebacked) bigmatrix format from the bigmemory
package for large or out of RAM datasets.
The package allows
NOTE: RcppThread
cannot be installed with gcc
4.8.5 version https://github.com/tnagler/RcppThread/issues/13. In this case, we recommend updating gcc
to a more recent version.
Installation can take a couple of minutes because of the requirement to install dependent packages (dplyr
, Rcpp
, RcppEigen
, RcppThread
, BH
, bigmemory
)
## install.packages("devtools")
library(devtools)
devtools::install_github("NataliaZemlianskaia/gesso")
Attaching libraries for the examples below
First, we generate the dataset using data.gen()
function from the package. The function allows us to generate a binary (0/1) matrix of genotypes G
of a given sample_size
and p
number of features, binary vector E
of environmental measurements, and a response variable Y
\(\sim g(\beta_0 + \beta_E E + \beta_G G + \beta_{G\times E} G\times E)\) - either quantitative or binary depending on the family
parameter.
We can specify the number of non-zero main effects (n_g_non_zero
) and interactions (n_gxe_non_zero
) we want to generate. We also specified a strong_hierarchical mode for our dataset, which means that (1) if interaction effect is generated as non-zero, it’s respective genetic main effect is also generated as non-zero \((\beta_{G\times E} \ne 0 \longrightarrow \beta_G \ne 0)\), and (2) effect sizes of the main effects are larger than that of interaction effects \((|\beta_G| \ge |\beta_{G\times E}|)\).
family = "gaussian"
sample_size = 180; p = 400; n_g_non_zero = 10; n_gxe_non_zero = 5
data = data.gen(seed=1, sample_size=sample_size, p=p,
n_g_non_zero=n_g_non_zero,
n_gxe_non_zero=n_gxe_non_zero,
mode="strong_hierarchical",
family=family)
Let’s look at the dataset we generated data$G_train
, data$E_train
, and data$Y_train
## [1] 180 400
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 0 0 0 0
## [4,] 1 0 0 1 0
## [5,] 0 0 0 0 1
## [6,] 1 0 0 0 0
## [1] 0 0 0 0 0 0 0 1 0 1
## [1] -9.546546 -2.713549 5.875298 1.984317 2.432890 6.798349
We generated model coefficients data$Beta_G
and data$Beta_GxE
such that we got sum(data$Beta_G != 0)
= 10 non-zero main effects and sum(data$Beta_GxE != 0)
= 5 non-zero interaction effects as we specified.
## [1] 10 5
We also imposed strong_hierarchical relationships between main effects and interaction effects
## [,1] [,2]
## [1,] -3 0.0
## [2,] -3 0.0
## [3,] 3 0.0
## [4,] 3 0.0
## [5,] 3 0.0
## [6,] 3 1.5
## [7,] 3 -1.5
## [8,] -3 -1.5
## [9,] -3 1.5
## [10,] 3 1.5
To tune the model over a 2D grid of hyper-parameters (lambda_1
and lambda_2
) we use the function gesso.cv()
, where we specify
G, E, Y
and family
parameter for the response variable; matrix G
should be a numeric matrix, vector E
should be a numeric vector. Vector Y
could either be a continuous or binary 0/1 numeric vectortolerance
for the convergence of the fitting algorithmgrid_size
to automatically generate grid values for cross-validationgrid_min_ratio
is an important parameter that controls the sparsity of the coefficients vector. For substantially high-dimensional problems where p
>> n
or when the goal is to select only a few interactions we recommend setting grid_min_ratio
to 0.1nfolds
to specify how many folds to use in cross-validationparallel
TRUE to enable parallel cross-validation (TRUE by default)seed
set the random seed to control random folds assignmentsnormalize
TRUE to normalize matrix G
(such that column-wise sd is equal to 1) and vector E
(TRUE by default)normalize_response
TRUE to normalize vector Y
(FALSE by default)There is an option to use a custom grid in gesso.cv()
function by specifying grid
parameter instead of grid_size
parameter. The user can also specify alpha
parameter to enable tuning only over lambda_1
while assuming lambda_2
= lambda_1
* alpha
.
verbose
parameter is set to TRUE by default in gesso.cv()
, but we can set it to FALSE to avoid outputting function messages about partial execution time.
start = Sys.time()
tune_model = gesso.cv(G=data$G_train, E=data$E_train, Y=data$Y_train,
family=family, grid_size=20, tolerance=1e-4,
grid_min_ratio=1e-2,
parallel=TRUE, nfolds=3,
normalize=TRUE,
normalize_response=TRUE,
seed=1,
max_iterations=10000)
## Compute grid:
## Time difference of 0.0009548664 secs
## Parallel cv:
## Time difference of 0.686111 secs
## Fit on the full dataset:
## Time difference of 0.6701159 secs
## Time difference of 1.367196 secs
gesso.coef()
functionTo obtain interaction and main effect coefficients corresponding to the model with a particular pair of tuning parameters we use gesso.coef()
function. We need to specify a model fit object and a pair of lambda
= (lambda_1
, lambda_2
) values organized in a tibble (ex: lambda = tibble(lambda_1=tune_model$grid[1], lambda_2=tune_model$grid[1])
).
Below we set fit = tune_model$fit
- model fit on the full dataset and lambda = tune_model$lambda_min
- the pair of tuning parameters corresponding to the minimum cross-validated error that gesso.cv()
function returns.
coefficients = gesso.coef(fit=tune_model$fit, lambda=tune_model$lambda_min)
gxe_coefficients = coefficients$beta_gxe
g_coefficients = coefficients$beta_g
Check if all non-zero interaction features were recovered by the model
## [,1] [,2]
## [1,] 1.5 1.4029323
## [2,] -1.5 -1.0089132
## [3,] -1.5 -1.1648283
## [4,] 1.5 0.7450604
## [5,] 1.5 1.3221858
Check that the largest estimated interaction effects correspond to the true non-zero coefficients
## [1] 1.5 1.5 -1.5 -1.5 1.5 0.0 0.0 0.0 0.0 0.0
Calculate principal selection metrics with selection.metrics()
funciton available in the package
selection_gesso = selection.metrics(true_b_g=data$Beta_G, true_b_gxe=data$Beta_GxE,
estimated_b_g=g_coefficients,
estimated_b_gxe=gxe_coefficients)
cbind(selection_gesso)
## selection_gesso
## b_g_non_zero 46
## b_gxe_non_zero 22
## mse_b_g 0.1957665
## mse_b_gxe 0.4391993
## sensitivity_g 1
## specificity_g 0.9076923
## precision_g 0.2173913
## sensitivity_gxe 1
## specificity_gxe 0.956962
## precision_gxe 0.2272727
## auc_g 1
## auc_gxe 1
Compare with the standard Lasso model (we use glmnet
package)
set.seed(1)
tune_model_glmnet = cv.glmnet(x=cbind(data$E_train, data$G_train,
data$G_train * data$E_train),
y=data$Y_train,
nfolds=3,
family=family)
coef_glmnet = coef(tune_model_glmnet, s=tune_model_glmnet$lambda.min)
g_glmnet = coef_glmnet[3: (p + 2)]
gxe_glmnet = coef_glmnet[(p + 3): (2 * p + 2)]
cbind(data$Beta_GxE[data$Beta_GxE != 0], gxe_glmnet[data$Beta_GxE != 0])
## [,1] [,2]
## [1,] 1.5 1.35563472
## [2,] -1.5 0.00000000
## [3,] -1.5 0.00000000
## [4,] 1.5 0.03736643
## [5,] 1.5 0.93958681
## [1] 1.5 1.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
selection_glmnet = selection.metrics(data$Beta_G, data$Beta_GxE, g_glmnet, gxe_glmnet)
cbind(selection_gesso, selection_glmnet)
## selection_gesso selection_glmnet
## b_g_non_zero 46 25
## b_gxe_non_zero 22 29
## mse_b_g 0.1957665 0.2679464
## mse_b_gxe 0.4391993 1.181034
## sensitivity_g 1 1
## specificity_g 0.9076923 0.9615385
## precision_g 0.2173913 0.4
## sensitivity_gxe 1 0.6
## specificity_gxe 0.956962 0.9341772
## precision_gxe 0.2272727 0.1034483
## auc_g 1 1
## auc_gxe 1 0.7756962
We see that the gesso model outperforms the Lasso model in terms of the detection of interaction terms in all important selection metrics, including GxE detection AUC (auc_gxe
), sensitivity (sensitivity_gxe
), and precision (precision_gxe
). The estimation bias of the interaction coefficients is also substantially lower for the gesso model (mse_b_gxe
).
gesso.coefnum()
functionTo control how parsimonious is our model with respect to the selected non-zero interaction terms we can use gesso.coefnum()
function. We obtain interaction and main effect coefficients corresponding to the target GxE model, where instead of specifying tuning parameters lambda
, we set target_b_gxe_non_zero
parameter which is the number of at most (specify less_than=TRUE
- default value) or at least (specify less_than=FALSE
) non-zero interactions we want to include in the model, depending on the parameter less_than
.
The target model is selected based on the averaged cross-validation (cv) results (tune_model$cv_result): for each pair of parameters lambda
=(lambda_1
, lambda_2
) in the grid and for each cv fold we obtain a number of non-zero estimated interaction terms, then average cv results by lambda and choose the tuning parameters corresponding to the minimum average cv loss that have at most or at least target_b_gxe_non_zero
non-zero interaction terms. Returned coefficients are obtained by fitting the model on the full dataset with the selected tuning parameters.
Note that the number of estimated non-zero interactions will only approximately reflect the numbers obtained on cv datasets.
coefficients = gesso.coefnum(cv_model=tune_model, target_b_gxe_non_zero=5)
gxe_coefficients = coefficients$beta_gxe
g_coefficients = coefficients$beta_g
Calculate principal selection metrics
selection_gesso = selection.metrics(data$Beta_G, data$Beta_GxE, g_coefficients,
gxe_coefficients)
cbind(selection_gesso, selection_glmnet)
## selection_gesso selection_glmnet
## b_g_non_zero 34 25
## b_gxe_non_zero 5 29
## mse_b_g 0.341325 0.2679464
## mse_b_gxe 1.153749 1.181034
## sensitivity_g 1 1
## specificity_g 0.9384615 0.9615385
## precision_g 0.2941176 0.4
## sensitivity_gxe 0.8 0.6
## specificity_gxe 0.9974684 0.9341772
## precision_gxe 0.8 0.1034483
## auc_g 1 1
## auc_gxe 0.8992405 0.7756962
Here we see that the number of non-zero interactions in the model (b_gxe_non_zero
) is 5 (vs 22 non-zero terms in the model corresponding to the minimal cross-validated error), leading to substantially increased precision and decreased sensitivity for GxE detection.
With gesso.coef()
we obtain coefficients corresponding to the best model in terms of minimal cross-validated error and use gesso.predict()
function to apply our estimated model to a new test dataset that can also be generated by the function data.gen()
.
We calculate test R-squared to assess model performance.
coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)
beta_0 = coefficients$beta_0; beta_e = coefficients$beta_e
beta_g = coefficients$beta_g; beta_gxe = coefficients$beta_gxe
new_G = data$G_test; new_E = data$E_test
new_Y = gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E)
test_R2_gesso = cor(new_Y, data$Y_test)^2
Compare with the standard Lasso (we use glmnet
package)
new_Y_glmnet = predict(tune_model_glmnet, newx=cbind(new_E, new_G, new_G * new_E),
s=tune_model_glmnet$lambda.min)
test_R2_glmnet = cor(new_Y_glmnet[,1], data$Y_test)^2
cbind(test_R2_gesso, test_R2_glmnet)
## test_R2_gesso test_R2_glmnet
## [1,] 0.983552 0.9662947
The package allows adding unpenalized covariates to the model (for example, important adjustment demographic variables like age and gender). In this example, we first generate data with additional covariates (specify n_confounders
parameter in data.gen()
) and then show how to fit the model, the user just needs to specify numeric matrix C
of covariates organized by columns.
family = "gaussian"
sample_size = 180; p = 400; n_g_non_zero = 10; n_gxe_non_zero = 5
n_confounders = 2
grid = 10^seq(-3, log10(1), length.out = 20)
data = data.gen(seed=1, sample_size=sample_size, p=p,
n_g_non_zero=n_g_non_zero,
n_gxe_non_zero=n_gxe_non_zero,
mode="strong_hierarchical",
family=family,
n_confounders=n_confounders)
tune_model = gesso.cv(G=data$G_train, E=data$E_train, Y=data$Y_train,
C=data$C_train,
family=family, grid=grid, tolerance=1e-4,
parallel=TRUE, nfolds=3,
normalize=TRUE,
normalize_response=TRUE,
verbose=FALSE,
seed=1)
To convert data matrix to the sparse format use as
function.
G_train_sparse = as(data$G_train, "dgCMatrix")
start = Sys.time()
fit = gesso.fit(G=G_train_sparse, E=data$E_train, Y=data$Y_train,
tolerance=1e-4,
grid_size=20, grid_min_ratio=1e-1,
normalize=TRUE,
normalize_response=TRUE)
time_sparse = difftime(Sys.time(), start, units="secs"); time_sparse
## Time difference of 0.07574105 secs
bigmemory
package) option exampleFor out of RAM objects function attach.big.matrix("g_train.desc")
can be used to upload the data given that the files are already created in a correct format (g_train.desc
, g_train.bin
).
## how to create filebacked matrix
G_train = filebacked.big.matrix(nrow=dim(data$G_train)[1],
ncol=dim(data$G_train)[2],
backingpath="./",
backingfile="g_train.bin",
descriptorfile="g_train.desc")
for (i in 1:dim(data$G_train)[2]){
G_train[, i] = data$G_train[, i]
}
## how to attach filebacked matrix
G_train = attach.big.matrix("g_train.desc")
## is.filebacked(G_train) should return TRUE
fit_bm = gesso.fit(G_train, data$E_train, data$Y_train, family=family)
tune_model_bm = gesso.cv(G_train, data$E_train, data$Y_train, family=family)
We use the same dataset we generated to demonstrate how our screening rules work.
Working sets (WS) are the sets of variables left after we applied our screening rules to the full set of predictors. Histogram of the working set sizes shows that most of the time we have to fit only a handful of variables.
2-dimensional plot below shows the log10(WS size) for each (\(\lambda_1\), \(\lambda_2\)) fit.
df = data.frame(lambda_1_factor = factor(fit$lambda_1),
lambda_2_factor = factor(fit$lambda_2),
ws = fit$working_set_size)
log_0 = function(x){
return(ifelse(x == 0, 0, log10(x)))
}
ggplot(df, aes(lambda_1_factor, lambda_2_factor, fill=log_0(ws))) +
scale_fill_distiller(palette = "RdBu") +
scale_x_discrete("lambda_1", breaks=c(1)) +
scale_y_discrete("lambda_2", breaks=c(1)) +
labs(fill='log WS') +
geom_tile()