ProSGPV is a package that performs variable selection with Second-Generation P-Values (SGPV). This document illustrates how ProSGPV performs variable selection with data from generalized linear models (GLM) and Cox proportional hazards models. Technical details about the algorithm can be found in Zuo, Stewart, and Blume (2020).

We will use Logistic regression to illustrate how ProSGPV selects variables in a low-dimensional (\(n>p\)) real-world data example and how to choose a model when each run gives slightly different results. We will use Poisson regression to show how ProSGPV works with simulated high-dimensional Poisson data, along with a visualization tool for the selection process. We will use the Cox proportional hazards model to show how a fast one-stage algorithm works with simulated data, together with a visualization tool for the one-stage ProSGPV.
To install the ProSGPV package from CRAN, you can do

install.packages("ProSGPV")

Alternatively, you can install the development version of ProSGPV by doing

devtools::install_github("zuoyi93/ProSGPV")
Once the package is installed, we can load the package to the current environment.
library(ProSGPV)
#> Loading required package: glmnet
#> Loading required package: Matrix
#> Loaded glmnet 4.1-2
#> Loading required package: brglm2
#> Package 'ProSGPV' version 1.0.0
Traditionally, maximum likelihood (ML) has been used to fit a Logistic model. However, when the sample size is small or effect sizes are strong, complete or quasi-complete separation may occur, which can substantially bias the parameter estimates (Albert and Anderson 1984; Rahman and Sultana 2017). Recently, Kosmidis and Firth (2021) proposed using a Jeffreys prior to address the separation issue in Logistic regression, and that approach is implemented in the ProSGPV package.
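As a quick illustration of separation (this toy example is ours, not from the package), base R's glm shows the problem directly: when the outcome is perfectly predicted by a covariate, the ML slope estimate diverges.

```r
# Toy data with complete separation: y is 0 whenever x < 4 and 1 otherwise,
# so maximum likelihood pushes the slope toward infinity.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(0, 0, 0, 1, 1, 1)
fit <- glm(y ~ x, family = binomial)
# glm warns that fitted probabilities numerically 0 or 1 occurred;
# the slope estimate and its standard error are both enormous.
coef(fit)["x"]
```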
In this section, we will use a real-world data example to showcase how ProSGPV performs variable selection when the outcome is binary. Lower back pain can be caused by a variety of problems with any part of the complex, interconnected network of spinal muscles, nerves, bones, discs, or tendons in the lumbar spine. Dr. Henrique da Mota collected 12 biomechanical attributes from 310 patients, of whom 100 are normal and 210 are abnormal (Disk Hernia or Spondylolisthesis). The goal is to differentiate the normal patients from the abnormal ones using those 12 variables. The biomechanical attributes include pelvic incidence, pelvic tilt, lumbar lordosis angle, sacral slope, pelvic radius, degree of spondylolisthesis, pelvic slope, direct tilt, thoracic slope, cervical tilt, sacrum angle, and scoliosis slope. The data set spine can be accessed in the ProSGPV package.
x <- spine[,-ncol(spine)]
y <- spine[,ncol(spine)]

sgpv.2s.l <- pro.sgpv(x,y,family="binomial")
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred

sgpv.2s.l
#> Selected variables are sacral_slope pelvic_radius degree_spondylolisthesis
ProSGPV selects three of the 12 variables: sacral slope, pelvic radius, and degree of spondylolisthesis.
We can get a summary of the model that we selected.
summary(sgpv.2s.l)
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
#>
#> Call:
#> glm(formula = Response ~ ., family = "binomial", data = data.d,
#> method = "brglmFit", type = "MPL_Jeffreys")
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.78904 -0.48146 0.04248 0.33705 2.26768
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 17.42056 3.15274 5.526 3.28e-08 ***
#> sacral_slope -0.11300 0.02214 -5.105 3.31e-07 ***
#> pelvic_radius -0.11716 0.02254 -5.197 2.03e-07 ***
#> degree_spondylolisthesis 0.15797 0.02162 7.306 2.75e-13 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 389.86 on 309 degrees of freedom
#> Residual deviance: 184.42 on 306 degrees of freedom
#> AIC: 192.42
#>
#> Number of Fisher Scoring iterations: 5
Coefficient estimates can be extracted with the coef function.
coef(sgpv.2s.l)
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
#> [1] 0.0000000 0.0000000 0.0000000 -0.1130012 -0.1171616 0.1579717
#> [7] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
Predicted probabilities can be obtained with the predict function.
head(predict(sgpv.2s.l))
#> Warning: brglmFit: fitted probabilities numerically 0 or 1 occurred
#> 1 2 3 4 5 6
#> 0.7765972 0.8116983 0.3053596 0.9012994 0.8132581 0.3842353
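If class labels rather than probabilities are needed, the probabilities can be thresholded by hand; the 0.5 cutoff below is an arbitrary illustrative choice, not a package default.

```r
# Hypothetical predicted probabilities, e.g. as returned by predict() above
probs <- c(0.78, 0.81, 0.31, 0.90, 0.81, 0.38)
labels <- as.integer(probs > 0.5)  # 1 = abnormal, 0 = normal (illustrative coding)
labels
#> [1] 1 1 0 1 1 0
```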
The gen.sim.data function is used in this section to generate high-dimensional (\(p>n\)) simulated data from a Poisson distribution. gen.sim.data has many arguments, as it can generate data with continuous, binary, count, and survival outcomes: n is the number of observations; p is the number of variables; s is the number of true signals; family can take the value "gaussian", "binomial", "poisson", or "cox"; beta.min is the smallest effect size in absolute value; beta.max is the largest effect size; rho is the autocorrelation coefficient in the design matrix; nu is the signal-to-noise ratio in the continuous-outcome case; sig is the standard deviation in the covariance matrix of the design matrix; intercept is used in the linear mean vector of the GLM; scale and shape are parameters of the Weibull survival time; and rateC is the censoring rate.
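As a rough sketch of what such a generator does in the Poisson case (this mirrors the argument descriptions above, not ProSGPV's internal code), one could write:

```r
# Illustrative only: AR(1)-correlated design plus a count outcome,
# mirroring the roles of n, p, s, rho, and intercept described above.
set.seed(1)
n <- 80; p <- 10; s <- 2; rho <- 0.5; intercept <- 0
Sigma <- rho^abs(outer(1:p, 1:p, "-"))   # AR(1) correlation structure
x <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
beta <- c(rep(0.5, s), rep(0, p - s))    # s nonzero effects, the rest zero
y <- rpois(n, lambda = exp(intercept + x %*% beta))
```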
In this section, we simulate 80 observations with 100 explanatory variables and one count outcome. The number of true signals is four. The smallest log rate ratio is 0.5 and the largest is 1.5. When \(p>n\), only the two-stage ProSGPV is used, even if stage is set to 1.
set.seed(1)
data.log <- gen.sim.data(n=80,p=100,s=4,beta.min=0.5,beta.max=1.5,family="poisson")

x <- data.log[[1]]
y <- data.log[[2]]
(true.index <- data.log[[3]])
#> [1]   3  43  89 100
(true.beta <- data.log[[4]])
#> [1] 0.0000000 0.0000000 0.5000000 0.0000000 0.0000000 0.0000000
#> [7] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [13] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [19] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [25] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [31] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [37] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [43] 0.8333333 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [49] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [55] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [61] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [67] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [73] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [79] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [85] 0.0000000 0.0000000 0.0000000 0.0000000 -1.1666667 0.0000000
#> [91] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [97] 0.0000000 0.0000000 0.0000000 -1.5000000
sgpv.2s.p <- pro.sgpv(x,y,family="poisson")

sgpv.2s.p
#> Selected variables are V3 V43 V89 V100

We see that the two-stage ProSGPV recovered the true support. Similarly, a summary of the final model is available by calling summary.
summary(sgpv.2s.p)
#>
#> Call:
#> glm(formula = Response ~ ., family = "poisson", data = data.d)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -1.7468 -0.6263 -0.2324 0.3456 1.7695
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -0.15872 0.12192 -1.302 0.193
#> V3 0.59830 0.04679 12.786 <2e-16 ***
#> V43 0.89750 0.06422 13.975 <2e-16 ***
#> V89 -1.15970 0.05135 -22.583 <2e-16 ***
#> V100 -1.57528 0.08177 -19.265 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for poisson family taken to be 1)
#>
#> Null deviance: 1018.60 on 79 degrees of freedom
#> Residual deviance: 47.37 on 75 degrees of freedom
#> AIC: 243.92
#>
#> Number of Fisher Scoring iterations: 5
Coefficients can be extracted by coef. The estimates are quite close to the truth.
coef(sgpv.2s.p)
#> [1] 0.0000000 0.0000000 0.5982981 0.0000000 0.0000000 0.0000000
#> [7] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [13] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [19] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [25] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [31] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [37] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [43] 0.8974952 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [49] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [55] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [61] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [67] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [73] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [79] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [85] 0.0000000 0.0000000 0.0000000 0.0000000 -1.1597040 0.0000000
#> [91] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [97] 0.0000000 0.0000000 0.0000000 -1.5752756
In-sample prediction can be done as follows.
head(predict(sgpv.2s.p))
#> 1 2 3 4 5 6
#> 0.09056540 1.52559015 0.09082897 2.36900015 3.64998025 2.35678653
Out-of-sample prediction is also available by passing new data to the newdata argument.
set.seed(1)
data.log <- gen.sim.data(n=10,p=100,s=4,beta.min=0.5,beta.max=1.5,family="poisson")

x.new <- data.log[[1]]
y.new <- data.log[[2]]

data.frame(Observed=y.new,Predicted=predict(sgpv.2s.p,newdata=x.new))
#> Observed Predicted
#> 1 1 0.2511760
#> 2 0 0.7257742
#> 3 0 0.2668769
#> 4 7 6.4782805
#> 5 9 10.2552082
#> 6 0 0.2627837
#> 7 1 1.0470441
#> 8 0 0.4297897
#> 9 4 1.2498434
#> 10 7 5.5053366
The predictions agree well with the observed counts.

The selection process can be visualized below. lambda.max sets the upper limit of the X-axis. Only effects whose confidence bound does not reach the null region are kept. Selected variables are labeled in blue on the Y-axis. The vertical dotted line marks the \(\lambda\) selected by the generalized information criterion (Fan and Tang 2013).
plot(sgpv.2s.p,lambda.max = 0.5)
You can also choose to view only one line per variable by setting lpv to 1. That line is the 95% confidence bound that is closer to the null.
plot(sgpv.2s.p,lambda.max = 0.5,lpv=1)
In this section, we will use simulated low-dimensional survival data to illustrate how the one-stage and two-stage ProSGPV select variables, along with additional visualization of the one-stage algorithm's selection process. gen.sim.data is used to simulate 100 observations.
set.seed(1)
data.cox <- gen.sim.data(n=100, p=20, s=4, family="cox",
                         beta.min=0.5, beta.max=1.5, sig=2)

x <- data.cox[[1]]
y <- data.cox[[2]]
(true.index <- data.cox[[3]])
#> [1]  1  3  4 10
true.beta <- data.cox[[4]]

sgpv.2s.c <- pro.sgpv(x,y,stage=2,family="cox")

sgpv.2s.c
#> Selected variables are V1 V3 V4 V10
The two-stage algorithm successfully recovered the true support. We can also run the fast one-stage algorithm by setting stage to 1.
sgpv.1s.c <- pro.sgpv(x,y,stage=1,family="cox")

sgpv.1s.c
#> Selected variables are V1 V3 V4 V10
It succeeded as well. summary, coef, and predict are also available for the one-stage ProSGPV object.

We can visualize the one-stage selection process by calling the plot function.
plot(sgpv.1s.c)
Effect estimates (in this case log hazard ratios) and the corresponding 95% confidence intervals are plotted for each variable in the full Cox model. Two vertical green bars indicate the null region. Only effects whose confidence bounds do not overlap with the null region are kept. Selected variables are colored in blue.
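The selection rule just described can be written down in a few lines; keep_variable below is a hypothetical helper for exposition, not part of the ProSGPV API, and assumes a symmetric null region.

```r
# Keep a variable only when its confidence interval lies entirely
# outside the null region [-null.bound, null.bound].
keep_variable <- function(ci.lower, ci.upper, null.bound) {
  ci.lower > null.bound | ci.upper < -null.bound
}
keep_variable(0.2, 0.6, null.bound = 0.1)    # TRUE: interval clear of the null region
keep_variable(-0.05, 0.3, null.bound = 0.1)  # FALSE: interval overlaps it
```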
When the design matrix is highly correlated, or signals are known to be dense in the data, the ProSGPV algorithm yields a null bound that is slightly higher than the noise level, which hurts support-recovery performance by missing some small true effects. One way to address this issue is to replace the constant null bound in ProSGPV with a generalized variance inflation factor (GVIF)-adjusted null bound. See Fox and Monette (1992) for details on how to calculate the GVIF for each variable in the design matrix. Essentially, we deflate the standard error of each variable's coefficient estimate by its GVIF, and thus produce a smaller null bound, which admits more variables into the final selection set. This adjustment is also found to be helpful when signals are dense.
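As a schematic of the idea only (ordinary VIFs from base R, a made-up constant bound, and a simple square-root deflation; none of this reproduces ProSGPV's actual computation):

```r
# Illustrative only: correlated columns get VIF > 1, so their
# adjusted bound shrinks below the constant bound.
set.seed(1)
x <- matrix(rnorm(200), 50, 4)
x[, 2] <- x[, 1] + rnorm(50, sd = 0.3)    # make columns 1 and 2 correlated
vif <- diag(solve(cor(x)))                # VIF_j = 1 / (1 - R_j^2)
null.bound <- 0.1                         # hypothetical constant null bound
adjusted.bound <- null.bound / sqrt(vif)  # smaller bound for inflated variables
```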