Getting started with gesso

Natalia Zemlianskaia

2021-11-28

Purpose of this vignette

This vignette is a tutorial about how to use the gesso package

Overview

The package is developed to fit a regularized regression model that we call gesso for the joint selection of gene-environment (GxE) interactions. The model focuses on a single environmental exposure and induces a main-effect-before-interaction hierarchical structure. We developed and implemented an efficient fitting algorithm and screening rules that can discard large numbers of irrelevant predictors with high accuracy.

gesso model induces hierarchical selection of the (GxE) interaction terms via overlapped group lasso structure. The model has two tuning parameters \(\lambda_1\) and \(\lambda_2\) allowing flexible, data-dependent control over group sparsity and additional interactions sparsity.

Response (outcome) variable can be either continuous or binary 0/1 variable. For a binary response, the IRLS procedure is used with the custom screening rules we developed.

The package supports sparse matrices dgCMatrix and (filebacked) bigmatrix format from the bigmemory package for large or out of RAM datasets.

Package highlights

The package allows

System requirements

NOTE: RcppThread cannot be installed with gcc 4.8.5 version https://github.com/tnagler/RcppThread/issues/13. In this case, we recommend updating gcc to a more recent version.

Installation

Installation can take a couple of minutes because of the requirement to install dependent packages (dplyr, Rcpp, RcppEigen, RcppThread, BH, bigmemory)

## install.packages("devtools")

library(devtools)
devtools::install_github("NataliaZemlianskaia/gesso")

Attaching libraries for the examples below

library(gesso)
library(glmnet)
library(ggplot2)
library(bigmemory)

Data generation

First, we generate the dataset using data.gen() function from the package. The function allows us to generate a binary (0/1) matrix of genotypes G of a given sample_size and p number of features, binary vector E of environmental measurements, and a response variable Y \(\sim g(\beta_0 + \beta_E E + \beta_G G + \beta_{G\times E} G\times E)\) - either quantitative or binary depending on the family parameter.

We can specify the number of non-zero main effects (n_g_non_zero) and interactions (n_gxe_non_zero) we want to generate. We also specified a strong_hierarchical mode for our dataset, which means that (1) if interaction effect is generated as non-zero, it’s respective genetic main effect is also generated as non-zero \((\beta_{G\times E} \ne 0 \longrightarrow \beta_G \ne 0)\), and (2) effect sizes of the main effects are larger than that of interaction effects \((|\beta_G| \ge |\beta_{G\times E}|)\).

family = "gaussian"
sample_size = 180; p = 400; n_g_non_zero = 10; n_gxe_non_zero = 5

data = data.gen(seed=1, sample_size=sample_size, p=p, 
                n_g_non_zero=n_g_non_zero, 
                n_gxe_non_zero=n_gxe_non_zero, 
                mode="strong_hierarchical",
                family=family)

Let’s look at the dataset we generated data$G_train, data$E_train, and data$Y_train

dim(data$G_train)
## [1] 180 400
head(data$G_train[,1:5])
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    1    0    0    1    0
## [5,]    0    0    0    0    1
## [6,]    1    0    0    0    0
data$E_train[1:10]
##  [1] 0 0 0 0 0 0 0 1 0 1
head(data$Y_train)
## [1] -9.546546 -2.713549  5.875298  1.984317  2.432890  6.798349

We generated model coefficients data$Beta_G and data$Beta_GxE such that we got sum(data$Beta_G != 0) = 10 non-zero main effects and sum(data$Beta_GxE != 0) = 5 non-zero interaction effects as we specified.

c(sum(data$Beta_G != 0), sum(data$Beta_GxE != 0))
## [1] 10  5

We also imposed strong_hierarchical relationships between main effects and interaction effects

cbind(data$Beta_G[data$Beta_G != 0], data$Beta_GxE[data$Beta_G != 0])
##       [,1] [,2]
##  [1,]   -3  0.0
##  [2,]   -3  0.0
##  [3,]    3  0.0
##  [4,]    3  0.0
##  [5,]    3  0.0
##  [6,]    3  1.5
##  [7,]    3 -1.5
##  [8,]   -3 -1.5
##  [9,]   -3  1.5
## [10,]    3  1.5

Selection example

To tune the model over a 2D grid of hyper-parameters (lambda_1 and lambda_2) we use the function gesso.cv(), where we specify

There is an option to use a custom grid in gesso.cv() function by specifying grid parameter instead of grid_size parameter. The user can also specify alpha parameter to enable tuning only over lambda_1 while assuming lambda_2 = lambda_1 * alpha.

verbose parameter is set to TRUE by default in gesso.cv(), but we can set it to FALSE to avoid outputting function messages about partial execution time.

start = Sys.time()
tune_model = gesso.cv(G=data$G_train, E=data$E_train, Y=data$Y_train, 
                      family=family, grid_size=20, tolerance=1e-4,
                      grid_min_ratio=1e-2,
                      parallel=TRUE, nfolds=3,
                      normalize=TRUE,
                      normalize_response=TRUE,
                      seed=1,
                      max_iterations=10000)
## Compute grid:  
## Time difference of 0.0009548664 secs
## Parallel cv: 
## Time difference of 0.686111 secs
## Fit on the full dataset: 
## Time difference of 0.6701159 secs
Sys.time() - start
## Time difference of 1.367196 secs

Obtain best model with gesso.coef() function

To obtain interaction and main effect coefficients corresponding to the model with a particular pair of tuning parameters we use gesso.coef() function. We need to specify a model fit object and a pair of lambda = (lambda_1, lambda_2) values organized in a tibble (ex: lambda = tibble(lambda_1=tune_model$grid[1], lambda_2=tune_model$grid[1])).

Below we set fit = tune_model$fit - model fit on the full dataset and lambda = tune_model$lambda_min - the pair of tuning parameters corresponding to the minimum cross-validated error that gesso.cv() function returns.

Check if all non-zero interaction features were recovered by the model

##      [,1]       [,2]
## [1,]  1.5  1.4029323
## [2,] -1.5 -1.0089132
## [3,] -1.5 -1.1648283
## [4,]  1.5  0.7450604
## [5,]  1.5  1.3221858

Check that the largest estimated interaction effects correspond to the true non-zero coefficients

##  [1]  1.5  1.5 -1.5 -1.5  1.5  0.0  0.0  0.0  0.0  0.0

Calculate principal selection metrics with selection.metrics() funciton available in the package

##                 selection_gesso
## b_g_non_zero    46             
## b_gxe_non_zero  22             
## mse_b_g         0.1957665      
## mse_b_gxe       0.4391993      
## sensitivity_g   1              
## specificity_g   0.9076923      
## precision_g     0.2173913      
## sensitivity_gxe 1              
## specificity_gxe 0.956962       
## precision_gxe   0.2272727      
## auc_g           1              
## auc_gxe         1

Compare with the standard Lasso model (we use glmnet package)

##      [,1]       [,2]
## [1,]  1.5 1.35563472
## [2,] -1.5 0.00000000
## [3,] -1.5 0.00000000
## [4,]  1.5 0.03736643
## [5,]  1.5 0.93958681
##  [1] 1.5 1.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
##                 selection_gesso selection_glmnet
## b_g_non_zero    46              25              
## b_gxe_non_zero  22              29              
## mse_b_g         0.1957665       0.2679464       
## mse_b_gxe       0.4391993       1.181034        
## sensitivity_g   1               1               
## specificity_g   0.9076923       0.9615385       
## precision_g     0.2173913       0.4             
## sensitivity_gxe 1               0.6             
## specificity_gxe 0.956962        0.9341772       
## precision_gxe   0.2272727       0.1034483       
## auc_g           1               1               
## auc_gxe         1               0.7756962

We see that the gesso model outperforms the Lasso model in terms of the detection of interaction terms in all important selection metrics, including GxE detection AUC (auc_gxe), sensitivity (sensitivity_gxe), and precision (precision_gxe). The estimation bias of the interaction coefficients is also substantially lower for the gesso model (mse_b_gxe).

Obtain target model with gesso.coefnum() function

To control how parsimonious is our model with respect to the selected non-zero interaction terms we can use gesso.coefnum() function. We obtain interaction and main effect coefficients corresponding to the target GxE model, where instead of specifying tuning parameters lambda, we set target_b_gxe_non_zero parameter which is the number of at most (specify less_than=TRUE - default value) or at least (specify less_than=FALSE) non-zero interactions we want to include in the model, depending on the parameter less_than.

The target model is selected based on the averaged cross-validation (cv) results (tune_model$cv_result): for each pair of parameters lambda=(lambda_1, lambda_2) in the grid and for each cv fold we obtain a number of non-zero estimated interaction terms, then average cv results by lambda and choose the tuning parameters corresponding to the minimum average cv loss that have at most or at least target_b_gxe_non_zero non-zero interaction terms. Returned coefficients are obtained by fitting the model on the full dataset with the selected tuning parameters.

Note that the number of estimated non-zero interactions will only approximately reflect the numbers obtained on cv datasets.

Calculate principal selection metrics

##                 selection_gesso selection_glmnet
## b_g_non_zero    34              25              
## b_gxe_non_zero  5               29              
## mse_b_g         0.341325        0.2679464       
## mse_b_gxe       1.153749        1.181034        
## sensitivity_g   1               1               
## specificity_g   0.9384615       0.9615385       
## precision_g     0.2941176       0.4             
## sensitivity_gxe 0.8             0.6             
## specificity_gxe 0.9974684       0.9341772       
## precision_gxe   0.8             0.1034483       
## auc_g           1               1               
## auc_gxe         0.8992405       0.7756962

Here we see that the number of non-zero interactions in the model (b_gxe_non_zero) is 5 (vs 22 non-zero terms in the model corresponding to the minimal cross-validated error), leading to substantially increased precision and decreased sensitivity for GxE detection.

Prediction example

With gesso.coef() we obtain coefficients corresponding to the best model in terms of minimal cross-validated error and use gesso.predict() function to apply our estimated model to a new test dataset that can also be generated by the function data.gen().

We calculate test R-squared to assess model performance.

coefficients = gesso.coef(tune_model$fit, tune_model$lambda_min)
beta_0 = coefficients$beta_0; beta_e = coefficients$beta_e                   
beta_g = coefficients$beta_g; beta_gxe = coefficients$beta_gxe     

new_G = data$G_test; new_E = data$E_test
new_Y = gesso.predict(beta_0, beta_e, beta_g, beta_gxe, new_G, new_E)
test_R2_gesso = cor(new_Y, data$Y_test)^2

Compare with the standard Lasso (we use glmnet package)

new_Y_glmnet = predict(tune_model_glmnet, newx=cbind(new_E, new_G, new_G * new_E), 
                       s=tune_model_glmnet$lambda.min)
test_R2_glmnet = cor(new_Y_glmnet[,1], data$Y_test)^2
cbind(test_R2_gesso, test_R2_glmnet)
##      test_R2_gesso test_R2_glmnet
## [1,]      0.983552      0.9662947

Adding unpenalized covariates to the model

The package allows adding unpenalized covariates to the model (for example, important adjustment demographic variables like age and gender). In this example, we first generate data with additional covariates (specify n_confounders parameter in data.gen()) and then show how to fit the model, the user just needs to specify numeric matrix C of covariates organized by columns.

family = "gaussian"
sample_size = 180; p = 400; n_g_non_zero = 10; n_gxe_non_zero = 5
n_confounders = 2

grid = 10^seq(-3, log10(1), length.out = 20)

data = data.gen(seed=1, sample_size=sample_size, p=p, 
                n_g_non_zero=n_g_non_zero, 
                n_gxe_non_zero=n_gxe_non_zero, 
                mode="strong_hierarchical",
                family=family,
                n_confounders=n_confounders)

tune_model = gesso.cv(G=data$G_train, E=data$E_train, Y=data$Y_train, 
                      C=data$C_train,
                      family=family, grid=grid, tolerance=1e-4,
                      parallel=TRUE, nfolds=3,
                      normalize=TRUE,
                      normalize_response=TRUE,
                      verbose=FALSE,
                      seed=1)

Sparse matrix option example

To convert data matrix to the sparse format use as function.

G_train_sparse = as(data$G_train, "dgCMatrix")

start = Sys.time()
fit = gesso.fit(G=G_train_sparse, E=data$E_train, Y=data$Y_train, 
                tolerance=1e-4,
                grid_size=20, grid_min_ratio=1e-1,
                normalize=TRUE,
                normalize_response=TRUE)
time_sparse = difftime(Sys.time(), start, units="secs"); time_sparse
## Time difference of 0.07574105 secs

Bigmatrix (bigmemory package) option example

For out of RAM objects function attach.big.matrix("g_train.desc") can be used to upload the data given that the files are already created in a correct format (g_train.desc, g_train.bin).


## how to create filebacked matrix
G_train = filebacked.big.matrix(nrow=dim(data$G_train)[1], 
                                ncol=dim(data$G_train)[2],
                                backingpath="./",
                                backingfile="g_train.bin",
                                descriptorfile="g_train.desc")

for (i in 1:dim(data$G_train)[2]){
    G_train[, i] = data$G_train[, i]
}

## how to attach filebacked matrix 
G_train = attach.big.matrix("g_train.desc")
## is.filebacked(G_train) should return TRUE

fit_bm = gesso.fit(G_train, data$E_train, data$Y_train, family=family)
tune_model_bm = gesso.cv(G_train, data$E_train, data$Y_train, family=family)                     

Working sets

We use the same dataset we generated to demonstrate how our screening rules work.

Working sets (WS) are the sets of variables left after we applied our screening rules to the full set of predictors. Histogram of the working set sizes shows that most of the time we have to fit only a handful of variables.

hist(fit$working_set_size, breaks = 100, col="blue")

2-dimensional plot below shows the log10(WS size) for each (\(\lambda_1\), \(\lambda_2\)) fit.

df = data.frame(lambda_1_factor = factor(fit$lambda_1),
                lambda_2_factor = factor(fit$lambda_2),
                ws = fit$working_set_size)

log_0 = function(x){
  return(ifelse(x == 0, 0, log10(x)))
}

ggplot(df, aes(lambda_1_factor, lambda_2_factor, fill=log_0(ws))) + 
  scale_fill_distiller(palette = "RdBu") +
  scale_x_discrete("lambda_1", breaks=c(1)) +
  scale_y_discrete("lambda_2", breaks=c(1)) +
  labs(fill='log WS') +
  geom_tile()