Marta Azevedo
Center of Mathematics, University of Minho, Guimarães,
Portugal
Email: marta.vasconcelos4@gmail.com
Luís Meira-Machado
Center of Mathematics, University of Minho, Guimarães,
Portugal
Email: lmachado@math.uminho.pt
Artur Araújo
University of Vigo, Vigo, Spain
Understanding the relationship between continuous predictors and binary outcomes is a common challenge in statistical analysis. Traditional parametric regression methods often impose rigid functional forms, limiting their adaptability to complex datasets. To address this constraint and enhance flexibility, various smoothing techniques, particularly those based on splines, have been widely explored. Expressing results as spline-based odds ratio (OR) curves provides a nuanced understanding of the impact each continuous covariate has on the outcome.
In this vignette, we present flexOR, an R package that offers a robust framework for pointwise nonparametric estimation of odds ratio curves for continuous predictors. To better understand the effect each continuous covariate has on the outcome, results can be expressed as spline-based odds ratio (OR) curves, taking a specific covariate value as the reference. flexOR enables users to estimate odds ratio curves without imposing strict assumptions on the underlying functional form, allowing a refined exploration of relationships with binary outcomes. The package incorporates several options for automatically selecting the degrees of freedom in multivariable models, enhancing its adaptability to diverse datasets. Additionally, flexOR provides intuitive visualization functions that facilitate the interpretation and presentation of the estimated odds ratio curves.
Key words: logistic models; generalized additive models; odds ratio; reference value; smoothing splines.
Logistic regression models (Hosmer and Lemeshow, 2013) stand as essential tools in statistical analysis, especially when dealing with binary outcome variables. Unlike linear regression tailored for continuous dependent variables, logistic regression is specifically designed to predict the probability of an event. This makes it particularly useful in scenarios such as estimating the likelihood of a patient developing coronary disease or experiencing a specific medical outcome.
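Formally, for a binary outcome $Y$ and covariates $x_1,\dots,x_p$, the standard logistic model relates the event probability to the covariates through the log odds:

```latex
\operatorname{logit}\{P(Y=1 \mid x_1,\dots,x_p)\}
  = \log\frac{P(Y=1 \mid x_1,\dots,x_p)}{1 - P(Y=1 \mid x_1,\dots,x_p)}
  = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p
```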
Within logistic regression, addressing the nonlinear effects of continuous predictors is a crucial challenge. Traditional approaches either categorize predictors using dummy variables or incorporate them through polynomial terms. However, these methods have limitations, including the difficulty of choosing optimal cutpoints and a loss of power when estimating odds ratios.
To address these issues, generalized additive models (GAMs) (Hastie and Tibshirani, 1990; Wood, 2017) offer an alternative solution. Particularly, the use of smoothing splines within GAMs enhances the capacity to model nonlinear effects more effectively. Splines provide flexibility in representing nonlinear relationships, mitigating the risk of overfitting associated with high-order polynomials. Smoothing splines offer a balance between flexibility and simplicity by applying a penalty term to control the smoothness of the fitted curve.
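In the additive logistic model, the linear terms above are replaced by smooth functions, each of which can be estimated as a smoothing spline that balances goodness of fit against a roughness penalty:

```latex
% Additive logistic model with smooth covariate effects
\operatorname{logit}\{P(Y=1 \mid x_1,\dots,x_p)\} = \beta_0 + \sum_{j=1}^{p} f_j(x_j)

% Each f_j maximizes a penalized log-likelihood; larger \lambda_j
% forces a smoother (eventually linear) fitted curve
\ell(\beta_0, f_1,\dots,f_p) \;-\; \sum_{j=1}^{p} \lambda_j \int f_j''(t)^2 \, dt
```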
This paper explores the use of additive models, specifically focusing on the utilization of smoothing splines, as a recognized approach to capture nonlinear effects in logistic regression analyses.
Despite the advantages of splines and smoothing splines, the challenge of selecting the number and placement of knots defining the smooth line remains. This arbitrary determination may obscure crucial features in the dataset, requiring a careful balance to avoid oversmoothing or undersmoothing. Various methods, including Akaike’s Information Criterion (AIC) (Akaike, 1974) and Bayesian Information Criterion (BIC) (Schwarz, 1978), have been proposed to address this issue. However, these criteria pose challenges in multivariable settings.
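Both criteria trade off fit against complexity: $\mathrm{AIC} = -2\hat{\ell} + 2\,df$ and $\mathrm{BIC} = -2\hat{\ell} + df \log n$, where $df$ denotes the (effective) degrees of freedom. As a minimal base-R sketch of these formulas, using the built-in mtcars data rather than the diabetes example discussed later:

```r
# Illustrate the AIC/BIC formulas on an ordinary logistic fit
# (built-in mtcars data; any glm would do):
fit <- glm(am ~ mpg, data = mtcars, family = binomial)
ll  <- as.numeric(logLik(fit))   # maximized log-likelihood
df  <- attr(logLik(fit), "df")   # number of estimated parameters
n   <- nobs(fit)

aic_manual <- -2 * ll + 2 * df
bic_manual <- -2 * ll + df * log(n)

all.equal(aic_manual, AIC(fit))  # TRUE
all.equal(bic_manual, BIC(fit))  # TRUE
```

For smoothing splines, `df` is replaced by the effective degrees of freedom of the smoother, which is how dfgam (introduced below) compares candidate amounts of smoothing.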
The odds ratio (OR) function emerges as a predominant metric for characterizing the impact of continuous covariates in the additive logistic model. Building on previous work in clinical survival studies (Meira-Machado, 2013), we aim to estimate odds ratios and their confidence intervals through a nonparametric approach, taking a specific reference value for the covariate. The flexOR package introduces the dfgam function, offering a systematic means of determining the optimal number of degrees of freedom in the multivariable additive logistic model.
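Throughout, the odds ratio curve for a covariate $x$ with smooth effect $f$ is taken relative to a chosen reference value $x_{\mathrm{ref}}$:

```latex
\operatorname{OR}(x;\, x_{\mathrm{ref}})
  = \frac{\operatorname{odds}(x)}{\operatorname{odds}(x_{\mathrm{ref}})}
  = \exp\{ f(x) - f(x_{\mathrm{ref}}) \},
\qquad
\operatorname{LnOR}(x) = f(x) - f(x_{\mathrm{ref}})
```

In particular, $\operatorname{LnOR}(x_{\mathrm{ref}}) = 0$, which is why the curve and its confidence bands shrink to zero at the reference value.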
The dfgam function allows users to obtain degrees of freedom using criteria such as AIC, AICc, and BIC, providing a versatile tool for model selection. It can also extract degrees of freedom using other criteria implemented in the mgcv R package from Simon Wood (Wood, 2006; Wood, 2017), including Restricted Maximum Likelihood (REML), and Generalized Cross Validation (GCV) with a Cp penalty (GCV.Cp). The choice among these criteria depends on the data’s characteristics and modeling objectives, emphasizing the importance of considering factors such as sample size and the nature of relationships being modeled.
The next section of the paper describes the features and functions of the flexOR package using a real data application to diabetes.
This section explores the Pima Indians Diabetes Database, a widely recognized dataset in the domains of machine learning and statistics. Originating from a study conducted by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) in the 1990s, the dataset focuses on the Pima Indian population in Arizona, USA.
Comprising various demographic, clinical, and diagnostic measurements, the dataset includes information on whether each individual developed diabetes. Researchers and data scientists frequently employ this dataset to construct and evaluate predictive models for diabetes, utilizing features such as age (years), BMI (Body Mass Index), blood pressure, diabetes pedigree function, and other health-related variables. The dataset is conveniently accessible through the mlbench R package.
This dataset comprises 768 observations across 9 variables, encompassing details on pregnancy, glucose levels, blood pressure, triceps thickness, insulin levels, body mass index, pedigree, age, and diabetes status. The subsequent presentation of the R object structure provides comprehensive information regarding data types and values within each variable.
library(mlbench)
data("PimaIndiansDiabetes2")
str(PimaIndiansDiabetes2)
#> 'data.frame': 768 obs. of 9 variables:
#> $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
#> $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
#> $ pressure: num 72 66 64 66 40 74 50 NA 70 96 ...
#> $ triceps : num 35 29 NA 23 35 NA 32 NA 45 NA ...
#> $ insulin : num NA NA NA 94 168 NA 88 NA 543 NA ...
#> $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
#> $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
#> $ age : num 50 31 32 21 33 30 26 29 53 54 ...
#> $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
After identifying the variables to be included in the model and determining those requiring a nonlinear effect through smoothing splines, we utilized the dfgam function. This function enabled us to obtain optimal degrees of freedom, minimizing the AIC of the model. The resulting degrees of freedom for the nonlinear predictors, namely age and body mass index (mass), were determined to be 3.3 and 4.1, respectively. Subsequently, these optimal degrees of freedom were incorporated into the generalized additive model (GAM) using the gam function:
library(flexOR)
library(gam)
#> Loading required package: splines
#> Loading required package: foreach
#> Loaded gam 1.22-3
df2 <- dfgam(response="diabetes",
             nl.predictors=c("age","mass"),
             other.predictors=c("pedigree"),
             smoother="s",
             method="AIC",
             data = PimaIndiansDiabetes2)
df2$df
#> df
#> age 3.3
#> mass 4.1
m2 <- gam(diabetes ~ s(age, df=3.3) + s(mass, df=4.1) + pedigree,
          data=PimaIndiansDiabetes2, family=binomial)
Subsequently, we employ the flexOR function, which, in turn, provides information used to generate a plot illustrating the smooth log odds ratio curve. This curve offers valuable insights into the relationship between the risk of diabetes and body mass index.
or2 <- flexOR(data = PimaIndiansDiabetes2,
              response = "diabetes",
              formula = ~s(age, 3.3) + s(mass, 4.1) + pedigree)
plot(
  x = or2,
  predictor = "mass",
  ref.value = 40,
  ref.label = "Ref. value",
  col.area = c("grey75", "grey90"),
  main = "Smooth odds ratio for mass",
  xlab = "Body mass index",
  ylab = "Log Odds Ratio (Ln OR)",
  lty = c(1,2,2,3,3),
  round.x = 1,
  conf.level = c(0.8, 0.95)
)
This figure offers a comprehensive illustration of the relationship between body mass index (BMI) and the risk of diabetes within the Pima Indian population in Arizona, USA. The log odds ratio (LnOR) is shown together with 80% (grey) and 95% (light grey) confidence bands, using a reference value of 40 for BMI. Users may specify a single confidence level, although two are also possible, as demonstrated in the input command above. Additionally, the argument 'ylog' provides the flexibility to generate a plot that is not on the log scale.
It is important to note that normal BMI values typically fall between 19 and 25. In this dataset, individuals with a BMI below 40 show a lower risk of diabetes relative to the reference. The risk of diabetes follows a distinctive pattern: it increases rapidly up to a BMI of about 30, remains relatively stable between 30 and 40, and rises sharply again beyond a BMI of 40.
Plotly's R graphing library is a powerful tool for creating interactive, publication-quality graphs. Its versatility allows users to go beyond static representations toward dynamic visualizations, enabling a more engaging exploration of the data. In the example provided, the input commands generate a smoothed log odds ratio curve with two confidence bands. The interactive nature of these plots allows users to zoom, pan, and hover over data points, facilitating a more detailed and insightful interpretation of the results. Below are the input commands, along with the corresponding figure.
library(plotly)
#> Loading required package: ggplot2
#>
#> Attaching package: 'plotly'
#> The following object is masked from 'package:ggplot2':
#>
#> last_plot
#> The following object is masked from 'package:stats':
#>
#> filter
#> The following object is masked from 'package:graphics':
#>
#> layout
p <- plot(
  x = or2,
  predictor = "mass",
  ref.value = 40,
  ref.label = "Reference Label",
  main = "Smooth odds ratio for mass",
  xlab = "Mass",
  ylab = "Log Odds Ratio (Ln OR)",
  lty = c(1,2,2,3,3),
  #xlim = c(18, 67),
  #ylim = c(-4.5, 4.5),
  round.x = 1,
  conf.level = c(0.8, 0.95)
)
tmat <- p$estimates
xref <- p$xref
mdata <- or2$dataset
jj <- match(sort(unique(mdata$mass)), mdata$mass)
# Plotly to get shaded (two-levels) confidence bands
fig <- plot_ly(x=mdata$mass[jj],
y=tmat[jj,5],
type = 'scatter', mode = 'lines',
line = list(color = 'transparent'),
showlegend = FALSE, name = '80%UCI')
fig <- fig %>% add_trace(y = ~tmat[jj,3], type = 'scatter', mode = 'lines',
fill = 'tonexty', fillcolor='rgba(0,100,80,0.3)',
line = list(color = 'transparent'),
showlegend = FALSE, name = '95%UCI')
fig <- fig %>% add_trace(y = ~tmat[jj,2], type = 'scatter', mode = 'lines',
fill = 'tonexty', fillcolor='rgba(0,100,80,0.3)',
line = list(color = 'transparent'),
showlegend = FALSE, name = '95%LCI')
fig <- fig %>% add_trace(y = ~tmat[jj,4], type = 'scatter', mode = 'lines',
fill = 'tonexty', fillcolor='rgba(0,100,80,0.3)',
line = list(color = 'transparent'),
showlegend = FALSE, name = '80%LCI')
fig <- fig %>% add_trace(y = ~tmat[jj,1], type = 'scatter', mode = 'lines',
line = list(color='rgb(0,100,80)'),
showlegend = FALSE, name = 'LnOR')
fig <- fig %>% add_annotations( x = xref,
#y = floor(min(tmat[jj,])),
#y = min(tmat[jj,]),
y = floor_to(min(tmat[jj,]), to=0.5),
xref = "x", yref = "y",
axref = "x", ayref = "y",
text = paste("Ref. value =",xref),
showarrow = T,
ax = xref,
ay = max(tmat[jj,])/2)
fig <- fig %>% layout(#title = ""
plot_bgcolor='rgb(229,229,229)',
xaxis = list(title = "Body mass index",
gridcolor = 'rgb(255,255,255)',
showgrid = TRUE,
showline = FALSE,
showticklabels = TRUE,
tickcolor = 'rgb(127,127,127)',
ticks = 'outside',
zeroline = FALSE),
yaxis = list(title = "Log Odds Ratio (Ln OR)",
gridcolor = 'rgb(255,255,255)',
showgrid = TRUE,
showline = FALSE,
showticklabels = TRUE,
tickcolor = 'rgb(127,127,127)',
ticks = 'outside',
#range = c(-4.5,4.5),
zeroline = FALSE))
fig
This figure illuminates the intricate relationship between body mass index and the risk of diabetes among the Pima Indian population in Arizona, USA. The log odds ratio (LnOR) is presented alongside 80% and 95% confidence bands, all relative to a reference BMI value of 40.
The flexOR package also enables users to generate predictions based on the object or2 obtained from the flexOR function. The output provides predicted values along with confidence intervals for the log odds ratio at different levels of the predictor variable body mass index. The reference value is set at 40, and the confidence level is specified as 95%. The resulting table displays the reference value, log odds ratios, and corresponding lower and upper bounds for the given prediction values.
pdval <- c(20, 25, 30, 35, 40, 45, 50, 55, 60, 65)
predict(or2, predictor = "mass", ref.value = 40, conf.level = 0.95,
        prediction.values = pdval, ref.label = "Ref.")
#> Ref. LnOR lower .95 upper .95
#> 20 -3.20826636 -3.7680373 -2.64849542
#> 25 -1.61356211 -2.0333903 -1.19373390
#> 30 -0.40263002 -0.6825155 -0.12274455
#> 35 -0.07505977 -0.2150025 0.06488297
#> 40 0.00000000 0.0000000 0.00000000
#> 45 0.45600760 0.3160649 0.59595034
#> 50 1.05087046 0.7709850 1.33075593
#> 55 1.63284725 1.2130190 2.05267546
#> 60 2.21353442 1.6537635 2.77330536
#> 65 2.83405429 2.1343406 3.53376797
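Since these predictions are on the log scale, odds ratios and their confidence limits are obtained by exponentiation. As a quick base-R illustration, using the BMI = 20 row of the table above:

```r
# predict() returns log odds ratios relative to the reference (BMI = 40);
# exponentiating gives odds ratios. Values copied from the BMI = 20 row:
lnor  <- -3.20826636
lower <- -3.76803730
upper <- -2.64849542

round(exp(c(OR = lnor, lower = lower, upper = upper)), 3)
# OR = 0.040, 95% CI (0.023, 0.071): an individual with BMI 20 has
# roughly 0.04 times the odds of diabetes of one with BMI 40.
```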
The package relies primarily on the gam package by Trevor Hastie, chosen because it allows users to specify the desired degrees of freedom for each nonlinear term. The following input commands fit the model using the degrees of freedom obtained with the AIC method. The resulting output provides detailed information about the model and its components: deviance residuals, null and residual deviance, AIC, and ANOVA tables for both the parametric and nonparametric effects. Together, these give a comprehensive overview of the model's performance and the significance of each predictor.
m2 <- gam(diabetes ~ s(age, df=3.3) + s(mass, df=4.1) + pedigree,
          data=PimaIndiansDiabetes2, family=binomial)
summary(m2)
#>
#> Call: gam(formula = diabetes ~ s(age, df = 3.3) + s(mass, df = 4.1) +
#> pedigree, family = binomial, data = PimaIndiansDiabetes2)
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.0298 -0.8524 -0.4506 1.0021 2.5324
#>
#> (Dispersion Parameter for binomial family taken to be 1)
#>
#> Null Deviance: 981.5278 on 756 degrees of freedom
#> Residual Deviance: 793.0878 on 747.5998 degrees of freedom
#> AIC: 811.8882
#> 11 observations deleted due to missingness
#>
#> Number of Local Scoring Iterations: NA
#>
#> Anova for Parametric Effects
#> Df Sum Sq Mean Sq F value Pr(>F)
#> s(age, df = 3.3) 1.0 26.00 26.003 27.803 1.760e-07 ***
#> s(mass, df = 4.1) 1.0 41.24 41.242 44.097 6.012e-11 ***
#> pedigree 1.0 14.09 14.089 15.064 0.0001131 ***
#> Residuals 747.6 699.21 0.935
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Anova for Nonparametric Effects
#> Npar Df Npar Chisq P(Chi)
#> (Intercept)
#> s(age, df = 3.3) 2.3 26.264 3.159e-06 ***
#> s(mass, df = 4.1) 3.1 18.827 0.0003328 ***
#> pedigree
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
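The proportion of deviance explained by this model can be recovered directly from the null and residual deviance reported in the summary above:

```r
# Proportion of deviance explained, using the values printed above:
null_dev  <- 981.5278
resid_dev <- 793.0878
dev_explained <- 1 - resid_dev / null_dev
round(dev_explained, 3)
# 0.192, i.e. the model explains about 19% of the null deviance
```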
A good alternative is provided by the method="GCV.Cp" argument of mgcv::gam, which yields comparable degrees of freedom. Below we show how the mgcv R package can be used.
m2.mgcv <- mgcv::gam(diabetes ~ s(age) + s(mass) + pedigree,
                     data=PimaIndiansDiabetes2,
                     family=binomial,
                     method="GCV.Cp")
m2.mgcv
#>
#> Family: binomial
#> Link function: logit
#>
#> Formula:
#> diabetes ~ s(age) + s(mass) + pedigree
#>
#> Estimated degrees of freedom:
#> 3.30 3.96 total = 9.26
#>
#> UBRE score: 0.07247985
summary(m2.mgcv)$edf
#> [1] 3.303847 3.960116
summary(m2.mgcv)
#>
#> Family: binomial
#> Link function: logit
#>
#> Formula:
#> diabetes ~ s(age) + s(mass) + pedigree
#>
#> Parametric coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -1.3344 0.1646 -8.108 5.14e-16 ***
#> pedigree 0.9883 0.2650 3.730 0.000192 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Approximate significance of smooth terms:
#> edf Ref.df Chi.sq p-value
#> s(age) 3.304 4.13 61.31 <2e-16 ***
#> s(mass) 3.960 4.98 54.54 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> R-sq.(adj) = 0.212 Deviance explained = 19.2%
#> UBRE = 0.07248 Scale est. = 1 n = 757
The output shown above presents the results of an additive logistic regression model that incorporates smooth terms for age and mass, along with a parametric term for pedigree. The estimated degrees of freedom for the smooth terms are 3.304 for age and 3.960 for mass, contributing to a total estimated degrees of freedom of 9.26. The Un-Biased Risk Estimate (UBRE) score is low at 0.0725, suggesting a favorable model fit. Among the parametric coefficients, the intercept is -1.3344 and the pedigree coefficient is 0.9883, both statistically significant. The approximate significance of the smooth terms confirms highly significant contributions of age and mass to the model. The adjusted R-squared is 0.212, and the model explains 19.2% of the deviance. Together, the low p-values and the UBRE score indicate that the model provides a reasonable fit to the data.
The following input commands can be utilized to obtain the smooth odds ratio (not on a log scale) for age.
p2 <- plot(
  x = or2,
  predictor = "age",
  ref.value = 50,
  ref.label = "Reference Label",
  main = "Smooth odds ratio for age",
  xlab = "Age (years)",
  ylab = "Log Odds Ratio (Ln OR)",
  lty = c(1,2,2,3,3),
  round.x = 1,
  conf.level = 0.95
)
tmat <- exp(p2$estimates)
xref <- p2$xref
mdata <- or2$dataset
jj <- match(sort(unique(mdata$age)), mdata$age)
fig <- plot_ly(x=mdata$age[jj],y=tmat[jj,3],
type = 'scatter', mode = 'lines',
line = list(color = 'transparent'),
showlegend = FALSE, name = '95%UCI')
fig <- fig %>% add_trace(y = ~tmat[jj,1], type = 'scatter', mode = 'lines',
fill = 'tonexty', fillcolor='rgba(0,100,80,0.3)',
line = list(color='rgb(0,100,80)'),
showlegend = FALSE, name = 'OR')
fig <- fig %>% add_trace(y = ~tmat[jj,2], type = 'scatter', mode = 'lines',
fill = 'tonexty', fillcolor='rgba(0,100,80,0.3)',
line = list(color = 'transparent'),
showlegend = FALSE, name = '95%LCI')
fig <- fig %>% add_annotations( x = xref,
y = floor_to(min(tmat[jj,]), to=0.5),
xref = "x", yref = "y",
axref = "x", ayref = "y",
text = paste("Ref. value =",xref),
showarrow = T,
ax = xref,
ay = 1.05)
fig <- fig %>% layout(#title = ""
plot_bgcolor='rgb(229,229,229)',
xaxis = list(title = " ",
gridcolor = 'rgb(255,255,255)',
showgrid = TRUE,
showline = FALSE,
showticklabels = TRUE,
tickcolor = 'rgb(127,127,127)',
ticks = 'outside',
tickvals = c(20,30,40,50,60,70,80),
zeroline = FALSE),
yaxis = list(title = "Odds Ratio (OR)",
gridcolor = 'rgb(255,255,255)',
showgrid = TRUE,
showline = FALSE,
showticklabels = TRUE,
tickcolor = 'rgb(127,127,127)',
ticks = 'outside',
#range = c(-0.5,3.5),
zeroline = FALSE))
fig
The authors gratefully acknowledge financial support from Portuguese funds through FCT - Fundação para a Ciência e a Tecnologia, under Projects UIDB/00013/2020, UIDP/00013/2020, and EXPL/MAT-STA/0956/2021.