Quick start

Jorge Cabral

Introduction

In this vignette, the theoretical considerations are skipped and simple examples of several GCEstim functions are given.

lmgce()

Consider dataThesis (see “Generalized Maximum Entropy framework”).

#>       X001                X002               X003                X004         
#>  Min.   :-0.632481   Min.   :-0.33234   Min.   :-0.176809   Min.   :-0.22971  
#>  1st Qu.:-0.106850   1st Qu.:-0.07391   1st Qu.:-0.056879   1st Qu.:-0.08982  
#>  Median :-0.003362   Median :-0.01655   Median :-0.004816   Median :-0.01277  
#>  Mean   : 0.016769   Mean   :-0.01209   Mean   :-0.007288   Mean   :-0.01441  
#>  3rd Qu.: 0.128388   3rd Qu.: 0.06120   3rd Qu.: 0.043791   3rd Qu.: 0.06338  
#>  Max.   : 0.494303   Max.   : 0.23579   Max.   : 0.131458   Max.   : 0.19744  
#>        y          
#>  Min.   :-11.196  
#>  1st Qu.: -5.017  
#>  Median : -3.595  
#>  Mean   : -3.702  
#>  3rd Qu.: -2.225  
#>  Max.   :  1.780

Suppose we want to obtain the estimates for the model

\[\begin{equation} \mathbf{y}=\beta_0\mathbf{1}_N + \beta_1\mathbf{X001} + \beta_2\mathbf{X002} + \beta_3\mathbf{X003} + \beta_4\mathbf{X004}. \end{equation}\]

Using the function lmgce() and setting:

the desired model is computed in a Two-stage Adaptive Standardized-coefficients-informed Weighted Generalized Cross Entropy (TASW-GCE) Framework. In particular, with 5 points for the signal support space, 3 points for the noise support space, a weight of 0.5 and uniform priors for both signal and support (help("lmgce")).

res.lmgce.v01 <-
  lmgce(
    formula = y ~ X001 + X002 + X003 + X004,
    data = dataThesis)
#> Warning in lmgce(formula = y ~ X001 + X002 + X003 + X004, data = dataThesis): 
#> 
#> The minimum error was found for the highest upper limit of the
#>             support. Confirm if higher values should be tested.

If we would like to see a concise summary of the result we can simply type,

res.lmgce.v01
#> 
#> Call:
#> lmgce(formula = y ~ X001 + X002 + X003 + X004, data = dataThesis)
#> 
#> Coefficients:
#> (Intercept)         X001         X002         X003         X004  
#>     -3.6957       0.4688     -11.6308     -11.4579      -4.0745

To obtain a more detailed evaluation of the fitted model, we can use summary()

summary(res.lmgce.v01)
#> 
#> Call:
#> lmgce(formula = y ~ X001 + X002 + X003 + X004, data = dataThesis)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.3538 -0.8861 -0.1564  0.3509  1.8425 
#> 
#> Coefficients:
#>               Estimate Std. Deviation   z value Pr(>|t|)    
#> (Intercept) -3.696e+00      6.455e-06 -572528.9   <2e-16 ***
#> X001         4.688e-01      6.787e-04     690.7   <2e-16 ***
#> X002        -1.163e+01      7.944e-04  -14640.3   <2e-16 ***
#> X003        -1.146e+01      4.761e-03   -2406.9   <2e-16 ***
#> X004        -4.074e+00      2.575e-03   -1582.2   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Normalized Entropy:
#>              NormEnt  SupportLL SupportUL
#> (Intercept) 0.422679  -4.311289  4.311289
#> X001        0.998446  -9.377392  9.377392
#> X002        0.730749 -18.577182 18.577182
#> X003        0.922430 -32.878041 32.878041
#> X004        0.976360 -20.975389 20.975389
#> 
#> Residual standard error: 1.102 on 70 degrees of freedom
#> Chosen Upper Limit for Standardized Supports: 0.9059, Chosen Error: 1se
#> Multiple R-squared: 0.7364, Adjusted R-squared: 0.7214
#> NormEnt: 0.8101, CV-NormEnt:  0.81 (0.0009853)
#> RMSE: 1.064, CV-RMSE: 1.034 (0.281)

The estimated GCE coefficients are \(\widehat{\boldsymbol{\beta}}^{TASW\text{-}GCE}=\) (-3.696, 0.469, -11.631, -11.458, -4.074).

(coef.res.lmgce.v01 <- coef(res.lmgce.v01))
#> (Intercept)        X001        X002        X003        X004 
#>  -3.6957113   0.4687837 -11.6307903 -11.4579458  -4.0744707

The prediction root mean squared error (RMSE) is \(RMSE^{ y, TASW\text{-}GCE} \approx\) 1.064.

res.lmgce.v01$error.measure
#> [1] 1.064306

The prediction 5-fold cross-validation (CV) error is \(CV\text{-}RMSE^{y,TASW\text{-}GCE} \approx\) 1.034.

res.lmgce.v01$error.measure.cv.mean
#> [1] 1.033978

Z-score confidence intervals (CIs) are obtained with the confint S3 method for class lmgce

confint(res.lmgce.v01, level = 0.95)
#>                    2.5%      97.5%
#> (Intercept)  -3.6957240  -3.695699
#> X001          0.4674534   0.470114
#> X002        -11.6323474 -11.629233
#> X003        -11.4672763 -11.448615
#> X004         -4.0795179  -4.069424

and can be visualized with plot

plot(res.lmgce.v01, which = 1)$p1

Color-filled dots represent the GCE estimates while non-filled dots represent the OLS estimates (lmgce uses by default OLS = TRUE).

Bootstrap CIs are available by setting method = "percentile" or method = "basic", defining a number of replicates boot.B = 1000 and bootstrap method (three methods available, see help("confint.lmgce"))

res.lmgce.v01.confint <- 
  confint(
    res.lmgce.v01,
    level = 0.95,
    method = "basic",
    boot.B = 100,
    boot.method = "residuals"
  )
res.lmgce.v01.confint
#>                     2.5%      97.5%
#> (Intercept)  -3.85996624  -3.552136
#> X001          0.07113198   1.331706
#> X002        -15.17611611 -12.038166
#> X003        -14.34763341 -11.293882
#> X004         -5.89776869  -3.157915

or

res.lmgce.v02 <-
  update(res.lmgce.v01,
         boot.B = 100,
         boot.method = "residuals")
#> Warning in lmgce(formula = y ~ X001 + X002 + X003 + X004, data = dataThesis, : 
#> 
#> The minimum error was found for the highest upper limit of the
#>             support. Confirm if higher values should be tested.
res.lmgce.v02.confint <-
  confint(
    res.lmgce.v02,
    level = 0.95,
    method = "basic"
  )
res.lmgce.v02.confint
#>                     2.5%      97.5%
#> (Intercept)  -3.85996624  -3.552136
#> X001          0.07113198   1.331706
#> X002        -15.17611611 -12.038166
#> X003        -14.34763341 -11.293882
#> X004         -5.89776869  -3.157915
plot(res.lmgce.v02, which = 1, ci.method = "basic")$p1

From the previous summary we can see that the standardized parameter chosen was 0.9059

res.lmgce.v01$support.stdUL
#> [1] 0.9059

and corresponds to the support spaces which cross-validation RMSE (CV-RMSE) is not greater than the minimum CV-RMSE plus one standard error (1se)

res.lmgce.v01$support.signal.1se
#> [1] 0.9059

During estimation, standardized limits are reverted to the original scale

res.lmgce.v01$support.matrix #original scale
#>              SupportLL SupportUL
#> (Intercept)  -4.311289  4.311289
#> X001         -9.377392  9.377392
#> X002        -18.577182 18.577182
#> X003        -32.878041 32.878041
#> X004        -20.975389 20.975389

lmgce uses by default support.signal = NULL and support.signal.vector = NULL meaning that it will evaluate the errormeasure = "RMSE" with cross-validation (cv = TRUE, cv.nfolds = 5) in a set of support.signal.vector.n = 20 supports spaces logarithmically between support.signal.vector.min = 0.3 and support.signal.vector.min = 20. It will then choose the minimum, one standard error or elbow error (see help("lmgce")).

plot(res.lmgce.v01, which = 2)$p2

Red dots represent the CV-error and green dots the CV-Normalized entropy. Whiskers have the length of two standard errors of the CV mean for each of the 20 support spaces tested. The dotted horizontal line is the OLS CV-error. The black vertical dotted line corresponds to the support spaces that produced the lowest CV-error. The black vertical dashed line corresponds to the support spaces that produced the 1se CV-error. The red vertical dotted line corresponds to the support spaces that produced the elbow CV-error.

With plot it is possible to obtain the trace of the estimates

plot(res.lmgce.v01, which = 3)$p3

In the last two plots are depicted the final solutions. That is to say that after choosing the support spaces limits based on the defined error, the number of points of the support spaces and their probability support.signal.points = c(1/5, 1/5, 1/5, 1/5, 1/5) and support.signal.points = c(1/3, 1/3, 1/3), twosteps.n = 1 extra estimation(s) is(are) performed. This estimation uses the GCE framework even if the previous steps were by default on the GME framework. The distribution of probabilities used is the one estimated for the chosen support spaces.

res.lmgce.v01$p0
#>                   p_1       p_2        p_3        p_4        p_5
#> (Intercept) 0.7765862 0.1738391 0.03891393 0.00871090 0.00194994
#> X001        0.1805333 0.1897807 0.19950181 0.20972083 0.22046329
#> X002        0.5508961 0.2536953 0.11683017 0.05380191 0.02477652
#> X003        0.3663295 0.2533743 0.17524802 0.12121147 0.08383673
#> X004        0.2856466 0.2344356 0.19240585 0.15791119 0.12960076
res.lmgce.v01$w0[1:5, ] # for the first 5 lines
#>         p_1       p_2       p_3
#> 1 0.2672761 0.3286375 0.4040863
#> 2 0.3305380 0.3333255 0.3361365
#> 3 0.2975938 0.3320069 0.3703994
#> 4 0.4073613 0.3282064 0.2644323
#> 5 0.3513099 0.3330158 0.3156743

The trace of the CV-error with the reestimations can be obtained with plot

plot(res.lmgce.v01, which = 6)$p6

The pre reestimation CV-error is depicted by the red dot and final/reestimated CV-error corresponds to the dark red dot. The horizontal dotted line represents again the OLS CV-error. The final estimated vectors of probabilities is

res.lmgce.v01$p
#>                   p_1       p_2        p_3        p_4        p_5
#> (Intercept) 0.7766072 0.1738273 0.03890761 0.00870866 0.00194925
#> X001        0.1804988 0.1897617 0.19950000 0.20973803 0.22050147
#> X002        0.5509065 0.2536937 0.11682650 0.05379887 0.02477450
#> X003        0.3662867 0.2533674 0.17525898 0.12122992 0.08385701
#> X004        0.2856185 0.2344269 0.19241041 0.15792455 0.12961962
res.lmgce.v01$w[1:5, ]
#>         p_1       p_2       p_3
#> 1 0.2672858 0.3286390 0.4040752
#> 2 0.3305303 0.3333255 0.3361442
#> 3 0.2975896 0.3320065 0.3704039
#> 4 0.4073550 0.3282073 0.2644377
#> 5 0.3513128 0.3330157 0.3156716

After looking at the results and traces we may feel the need to select a different support, lets say the minimum. That can obviously be done using, for instance

lmgce(y ~ X001 + X002 + X003 + X004,
      data = dataThesis,
      errormeasure.which = "min")
#or
update(res.lmgce.v01, errormeasure.which = "min")

but that implies a complete reestimation and can be very time-costly. Since the results of all the evaluated support spaces are stored, choosing a different support should be done with changesupport

res.lmgce.v01.min <- changesupport(res.lmgce.v01, "min")
summary(res.lmgce.v01.min)
#> 
#> Call:
#> lmgce(formula = y ~ X001 + X002 + X003 + X004, data = dataThesis)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.6246 -0.5646  0.1015  0.5219  2.0774 
#> 
#> Coefficients:
#>               Estimate Std. Deviation   z value Pr(>|t|)    
#> (Intercept) -4.045e+00      1.397e-05 -289577.5   <2e-16 ***
#> X001        -2.022e+00      1.469e-03   -1376.7   <2e-16 ***
#> X002        -1.773e+01      1.719e-03  -10313.2   <2e-16 ***
#> X003         6.359e+00      1.030e-02     617.3   <2e-16 ***
#> X004        -1.518e+01      5.573e-03   -2724.2   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Normalized Entropy:
#>              NormEnt  SupportLL SupportUL
#> (Intercept) 0.964998  -17.14915  17.14915
#> X001        0.999941 -207.02929 207.02929
#> X002        0.998838 -410.13759 410.13759
#> X003        0.999952 -725.86468 725.86468
#> X004        0.999332 -463.08398 463.08398
#> 
#> Residual standard error: 0.9815 on 70 degrees of freedom
#> Chosen Upper Limit for Standardized Supports: 20, Chosen Error: min
#> Multiple R-squared: 0.8361, Adjusted R-squared: 0.8267
#> NormEnt: 0.9926, CV-NormEnt: 0.9926 (4.881e-06)
#> RMSE: 0.9483, CV-RMSE: 0.9242 (0.2376)
plot(res.lmgce.v01.min)
#> $p1

#> 
#> $p2

#> 
#> $p3

#> 
#> $p4

#> 
#> $p6

Note that the GCE estimates are now “closer” to the OLS estimates.
In this example the true coefficients are known and we can also observe that the 1se approach gives estimates “closer” to those coefficients.

data.frame("Supp_1se" = coef(res.lmgce.v01),
           "Supp_min" = coef(res.lmgce.v01.min),
           "OLS" = coef(res.lmgce.v01$results$OLS$res),
           "TRUE" = coef.dataThesis)
#>                Supp_1se   Supp_min        OLS TRUE.
#> (Intercept)  -3.6957113  -4.045302  -4.060541    -4
#> X001          0.4687837  -2.022251  -7.470739     0
#> X002        -11.6307903 -17.731222 -24.117970   -16
#> X003        -11.4579458   6.359279  44.677570   -12
#> X004         -4.0744707 -15.181731 -35.934748    -5

Several generic functions can used with lmgce class objects, for instance

fitted(res.lmgce.v01)[1:5]
#>         1         2         3         4         5 
#> -1.923023 -5.968547 -4.174583 -5.388990 -3.342422
predict(res.lmgce.v01, dataGCE[1,])
#>         1 
#> -4.127846


lmgceAddin()

An add-in to easily generate the R code for a lmgce analysis and to perform that analysis can be accessed with

lmgceAddin()

Data can be imported from a file or be called from the Environment.



The parameters of lmgce() can be changed.



Several outputs can be visualized.



The expression for lmgce() can be exported.



cv.lmgce()

lmgce() allows for the evaluation of several support spaces and, given a certain criterion, selects one of them. But GCE estimation is also sensitive to the number of points of the signal and noise support spaces and to the weight that each of the spaces has in the loss function. cv.lmgce() tests the combination of these parameters.

res.cv.lmgce <-
  cv.lmgce(
    y ~ X001 + X002 + X003 + X004,
    data = dataThesis,
    support.signal.points = c(3, 5, 7),
    support.noise.points = c(3, 5, 7),
    weight = c(0.1, 0.5, 0.9))
res.cv.lmgce
#> 
#> Call:
#> NULL
#> 
#> 
#> Best combination:
#> 5 points for the signal support; 7 points for the noise support; a weight of 0.1 for the noise support.
#> 
#> Coefficients:
#> (Intercept)         X001         X002         X003         X004  
#>     -3.7180       0.5083     -14.0683     -12.0299      -4.6681

It returns CV-errors, convergence of the optimization algorithm (0 means that convergence was obtained) and time.

res.cv.lmgce$results[order(res.cv.lmgce$results$error.measure.cv.mean),][1:10,-6] 
#>    support.signal.points support.noise.points weight error.measure
#> 8                      5                    7    0.1      1.005753
#> 4                      3                    5    0.1      1.014175
#> 1                      3                    3    0.1      1.017924
#> 5                      5                    5    0.1      1.018902
#> 9                      7                    7    0.1      1.019524
#> 2                      5                    3    0.1      1.020289
#> 7                      3                    7    0.1      1.029128
#> 18                     7                    7    0.5      1.029072
#> 6                      7                    5    0.1      1.035479
#> 3                      7                    3    0.1      1.036940
#>    error.measure.cv.mean convergence     time
#> 8              0.9699387           0 8.786073
#> 4              0.9778044           0 8.073933
#> 1              0.9813602           0 7.877780
#> 5              0.9822721           0 8.322070
#> 9              0.9828507           0 8.528754
#> 2              0.9836312           0 8.033961
#> 7              0.9919176           0 8.597163
#> 18             0.9979500           0 7.475526
#> 6              0.9981236           0 8.189654
#> 3              0.9995922           0 8.289415

The model with the combination that produces the lowest CV-error is kept.

summary(res.cv.lmgce$best)
#> 
#> Call:
#> cv.lmgce(formula = y ~ X001 + X002 + X003 + X004, data = dataThesis, 
#>     support.signal.points = c(3, 5, 7), support.noise.points = c(3, 
#>         5, 7), weight = c(0.1, 0.5, 0.9))
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.7959 -0.9089 -0.2023  0.2463  1.7122 
#> 
#> Coefficients:
#>               Estimate Std. Deviation z value Pr(>|t|)    
#> (Intercept) -3.718e+00      6.790e-06 -547568   <2e-16 ***
#> X001         5.083e-01      7.140e-04     712   <2e-16 ***
#> X002        -1.407e+01      8.357e-04  -16835   <2e-16 ***
#> X003        -1.203e+01      5.008e-03   -2402   <2e-16 ***
#> X004        -4.668e+00      2.709e-03   -1723   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Normalized Entropy:
#>              NormEnt   SupportLL  SupportUL
#> (Intercept) 0.864389   -8.155165   8.155165
#> X001        0.999966  -68.557748  68.557748
#> X002        0.993318 -135.817063 135.817063
#> X003        0.998443 -240.370090 240.370090
#> X004        0.999424 -153.350260 153.350260
#> 
#> Residual standard error: 1.041 on 70 degrees of freedom
#> Chosen Upper Limit for Standardized Supports: 6.623, Chosen Error: 1se
#> Multiple R-squared: 0.8074, Adjusted R-squared: 0.7964
#> NormEnt: 0.9711, CV-NormEnt: 0.9711 (9.783e-05)
#> RMSE: 1.006, CV-RMSE: 0.9699 (0.2987)

Using plot we obtain

plot(res.cv.lmgce, which = 1, ncol = 3)$p1 + ggplot2::ylim(0.96, 1.08) 

tsbootgce()

Consider the time series in the ts object moz_ts.

plot(moz_ts)

Let:

Let us obtain the estimated model

\[\begin{equation} CO2_t=\beta_0 + \beta_1 EPC_{t-1} + \beta_2 EUS_{t-2} + \beta_3 GDP_{t-1} \end{equation}\]

Using the function tsbootgce() and setting:

the desired model is computed. formula allows additional functions available in R package dynlm).

res.tsbootgce <- 
  tsbootgce(
    formula = CO2 ~ 1 + L(EPC, 1) + L(EUS, 2) + L(GDP, 0),
    data = moz_ts
    )

Note that by default tsbootgce uses reps = 1000 replicates as elements of an ensemble, which retain the shape of the original time series, as well as the time dependence structure of the autocorrelation and the partial autocorrelation functions. The plot function can give us pointwise percentile envelopes (red, green and blue areas) and the median distribution (dashed yellow line) of the generated time series as well as the original data (black line).

plot(res.tsbootgce, which = 1)$p1

A concise summary of the result can be obtained simply by typing,

res.tsbootgce
#> 
#> Call:
#> tsbootgce(formula = CO2 ~ 1 + L(EPC, 1) + L(EUS, 2) + L(GDP, 
#>     0), data = moz_ts)
#> 
#> Coefficients:
#>            (Intercept)  L(EPC, 1)   L(EUS, 2)   L(GDP, 0) 
#> tsbootgce   -33.92155      0.02398     0.26146     0.03765
#> MEbootOLS  -180.52201     -0.01012     0.54850     0.12507
#> lmgce       -30.04711      0.02333     0.25202     0.03879
#> lm         -176.86713     -0.02947     0.53022     0.14320

The tsbootgce coefficients displayed by default are coef.method = "mode".

coef(res.tsbootgce)
#>  (Intercept)    L(EPC, 1)    L(EUS, 2)    L(GDP, 0) 
#> -33.92154689   0.02398360   0.26145975   0.03765138

and the \(95\%\) highest density regions are

confint(res.tsbootgce)
#>                     2.5%        97.5%
#> (Intercept) -49.28332851 -18.59101696
#> L(EPC, 1)     0.01930974   0.02898740
#> L(EUS, 2)     0.22748595   0.29642204
#> L(GDP, 0)     0.03324949   0.04209782

The empirical distribution, confidence intervals and measure of central tendency of each estimate can be visualized by typing

plot(res.tsbootgce,
     which = 2,
     ci.levels = c(0.90, 0.95, 0.99),
     ci.method = c("hdr" #,"basic" #,"percentile"
       ))$p2

neagging()

The estimates obtained in each bootstrap repetition have a normalized entropy associated that can be used as a weight for an aggregation procedure named neagging.

lmgce object

lmgce objects keep the bootstrap result in object$results$bootstrap when the argument boot.B is greater or equal to 10. In the case of existence of these results we can call the function neagging setting only the object parameter.

res.neagging.lmgce <- neagging(res.lmgce.v02)

The trace of the in sample error can be obtained with plot.

plot(res.neagging.lmgce)

The dotted horizontal line represents the in sample error of the lmgce model. The neagging minimum in sample error, represented by the vertical dashed line, was obtained aggregating 76 models

which.min(res.neagging.lmgce$error)[[1]]
#> [1] 76

The trace of the estimates can also be visualized with plot, where the dotted horizontal lines represent the estimates from the lmgce model.

plot(res.neagging.lmgce, which = 2) 

The estimated coefficients that produce the lowest in sample error are \(\widehat{\boldsymbol{\beta}}^{neagging}=\) (-3.701, 0.485, -9.72, -10.134, -4.005).

coef(res.neagging.lmgce)
#> (Intercept)        X001        X002        X003        X004 
#>  -3.7008289   0.4847928  -9.7204382 -10.1339057  -4.0050171

The estimated coefficients, when \(1000\) models are aggregated, are \(\widehat{\boldsymbol{\beta}}^{neagging}=\) (-3.699, 0.502, -9.707, -10.084, -3.986).

coef(res.neagging.lmgce, which = ncol(res.neagging.lmgce$matrix))
#> (Intercept)        X001        X002        X003        X004 
#>  -3.6990340   0.5021118  -9.7069818 -10.0843435  -3.9856238

tsbootgce object

tsbootgce objects always keep the bootstrap result in object$results$bootstrap so we can call the function neagging setting only the object parameter.

res.neagging.tsbootgce <- neagging(res.tsbootgce)
plot(res.neagging.tsbootgce)

In this case the in sample error is lower in the neagging approach.

Acknowledgements

This work was supported by Fundação para a Ciência e Tecnologia (FCT) through CIDMA and projects https://doi.org/10.54499/UIDB/04106/2020 and https://doi.org/10.54499/UIDP/04106/2020.