User Guide

Stuart Lacy

2018-07-04

library(epitab)
library(dplyr)

epitab provides functionality for building contingency tables with a variety of additional configurations. It was initially designed for use in epidemiology, as an extension to the Epi::stat.table function. However, by identifying core components of a descriptive table, it is flexible enough to be used in a variety of disciplines and situations. This vignette provides an overview of the types of tables that can be built using epitab.

Data

For demonstration purposes, a simulated data set representing an observational study of a disease will be used. This fictitious disease primarily affects elderly people, and not every patient receives first-line treatment. The disease itself comes in two variants: A and B.

set.seed(17)
treat <- data.frame(age=abs(rnorm(100, 60, 20)),
                    sex=factor(sample(c("M", "F"), 100, replace=TRUE)),
                    variant=factor(sample(c("A", "B"), 100, replace=TRUE)),
                    treated=factor(sample(c("Yes", "No"), 100, replace=TRUE), levels=c("Yes", "No")))
# Bin continuous age into the four groups used throughout this guide
treat$agebin <- cut(treat$age, breaks=c(0, 40, 60, 80, 9999), labels=c("0-40", "41-60", "61-80", "80+"))
treat %>%
    head() %>%
    knitr::kable()
|      age|sex |variant |treated |agebin |
|--------:|:---|:-------|:-------|:------|
| 39.69983|F   |B       |Yes     |0-40   |
| 58.40727|F   |A       |No      |41-60  |
| 55.34026|M   |B       |No      |41-60  |
| 43.65464|F   |B       |No      |41-60  |
| 75.44182|F   |B       |Yes     |61-80  |
| 56.68776|M   |B       |Yes     |41-60  |

Contingency tables

Contingency tables are useful tools for exploratory analysis of a data set, and highlight relationships between one or more independent variables and (typically one) outcome of interest. The example code below shows how to build a basic contingency table to view how treatment varies by age group and sex.

Both the independent and outcome variables are passed in through lists, where the column names must be quoted strings (thereby allowing for these tables to be used in automated scripts). The list entry labels are used to provide the column and row labels. The crosstab_funcs argument specifies what summary measures should be calculated for each covariate / outcome combination. The freq function calculates the frequency of each of these cells, and (optionally) provides the proportion in parentheses.

contingency_table(independents=list("Age"="agebin",
                                    "Sex"="sex"),
                  outcomes=list("Treated"="treated"),
                  crosstab_funcs=list(freq()),
                  data=treat)
##         |          |            |Treated       |              |
##         |          |All         |Yes           |No            |
## ---------------------------------------------------------------
##         |          |            |              |              |
##         |Total     |100         |45 (100)      |55 (100)      |
##         |          |            |              |              |
## Age     |0-40      |13 (13)     |5 (11.1)      |8 (14.5)      |
##         |41-60     |34 (34)     |12 (26.7)     |22 (40)       |
##         |61-80     |39 (39)     |21 (46.7)     |18 (32.7)     |
##         |80+       |14 (14)     |7 (15.6)      |7 (12.7)      |
##         |          |            |              |              |
## Sex     |F         |49 (49)     |23 (51.1)     |26 (47.3)     |
##         |M         |51 (51)     |22 (48.9)     |29 (52.7)     |

Using this standard contingency table as a starting point, there are several ways to customise the table. The presence of the overall frequency column is controlled by the marginal argument. There are also options to freq that specify the formatting of the cross-tabulated frequencies (see ?freq for more details).

contingency_table(independents=list("Age"="agebin",
                                    "Sex"="sex"),
                  outcomes=list("Treated"="treated"),
                  crosstab_funcs=list(freq(proportion = "none")),
                  marginal=FALSE,
                  data=treat)
##         |          |Treated     |       |
##         |          |Yes         |No     |
## -----------------------------------------
##         |          |            |       |
##         |Total     |45          |55     |
##         |          |            |       |
## Age     |0-40      |5           |8      |
##         |41-60     |12          |22     |
##         |61-80     |21          |18     |
##         |80+       |7           |7      |
##         |          |            |       |
## Sex     |F         |23          |26     |
##         |M         |22          |29     |

Note that multiple outcomes can be selected, although it still results in a 2-way contingency table between all the covariates and the outcomes independently. It is not currently possible to produce a 3-way contingency table.

contingency_table(independents=list("Age"="agebin",
                                    "Sex"="sex"),
                  outcomes=list("Treated"="treated", "Variant"="variant"),
                  crosstab_funcs=list(freq()),
                  data=treat)
##         |          |            |Treated       |              |Variant       |              |
##         |          |All         |Yes           |No            |A             |B             |
## ---------------------------------------------------------------------------------------------
##         |          |            |              |              |              |              |
##         |Total     |100         |45 (100)      |55 (100)      |55 (100)      |45 (100)      |
##         |          |            |              |              |              |              |
## Age     |0-40      |13 (13)     |5 (11.1)      |8 (14.5)      |7 (12.7)      |6 (13.3)      |
##         |41-60     |34 (34)     |12 (26.7)     |22 (40)       |14 (25.5)     |20 (44.4)     |
##         |61-80     |39 (39)     |21 (46.7)     |18 (32.7)     |24 (43.6)     |15 (33.3)     |
##         |80+       |14 (14)     |7 (15.6)      |7 (12.7)      |10 (18.2)     |4 (8.9)       |
##         |          |            |              |              |              |              |
## Sex     |F         |49 (49)     |23 (51.1)     |26 (47.3)     |29 (52.7)     |20 (44.4)     |
##         |M         |51 (51)     |22 (48.9)     |29 (52.7)     |26 (47.3)     |25 (55.6)     |

Adding supplementary information

Additional statistics can be added to these contingency tables in two ways. Column-wise measures act on each outcome in turn without regard to the covariates, while row-wise measures are those that are calculated for every level of each independent variable.

Column-wise measures

It is often the case that in addition to the categorical variables included in the contingency table, there are continuous attributes that we are interested in. The col_funcs argument to contingency_table calculates summary measures for each outcome and can be used for this purpose.

The example below shows how to calculate mean age across treatment types, using the provided function summary_mean, to which the name of the continuous variable of interest is passed as a string.
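
A call of roughly the following form produces this kind of table. The row label "Mean age" is taken from the name of the list entry, mirroring how independents and outcomes are labelled. The data set is recreated at the top only so that the chunk runs on its own.

```r
library(epitab)

# Simulated data from the Data section, repeated so the chunk is self-contained
set.seed(17)
treat <- data.frame(age=abs(rnorm(100, 60, 20)),
                    sex=factor(sample(c("M", "F"), 100, replace=TRUE)),
                    variant=factor(sample(c("A", "B"), 100, replace=TRUE)),
                    treated=factor(sample(c("Yes", "No"), 100, replace=TRUE), levels=c("Yes", "No")))
treat$agebin <- cut(treat$age, breaks=c(0, 40, 60, 80, 9999), labels=c("0-40", "41-60", "61-80", "80+"))

# summary_mean receives the name of the continuous column as a string
tab_mean <- contingency_table(independents=list("Age"="agebin",
                                                "Sex"="sex"),
                              outcomes=list("Treated"="treated"),
                              crosstab_funcs=list(freq()),
                              col_funcs=list("Mean age"=summary_mean("age")),
                              data=treat)
tab_mean
```

The exact figures depend on the random number generator of the R version in use, so they may differ from the printed output.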

##              |          |            |Treated       |              |
##              |          |All         |Yes           |No            |
## --------------------------------------------------------------------
##              |          |            |              |              |
##              |Total     |100         |45 (100)      |55 (100)      |
##              |          |            |              |              |
## Age          |0-40      |13 (13)     |5 (11.1)      |8 (14.5)      |
##              |41-60     |34 (34)     |12 (26.7)     |22 (40)       |
##              |61-80     |39 (39)     |21 (46.7)     |18 (32.7)     |
##              |80+       |14 (14)     |7 (15.6)      |7 (12.7)      |
##              |          |            |              |              |
## Sex          |F         |49 (49)     |23 (51.1)     |26 (47.3)     |
##              |M         |51 (51)     |22 (48.9)     |29 (52.7)     |
##              |          |            |              |              |
## Mean age     |          |60.05       |62.2          |58.28         |

As with crosstab_funcs, multiple summary values can be passed to col_funcs. The example below shows the use of the other column-wise function provided with epitab: summary_median.
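
A sketch of such a call, with both column-wise functions supplied as separate named entries of col_funcs:

```r
library(epitab)

# Simulated data from the Data section, repeated so the chunk is self-contained
set.seed(17)
treat <- data.frame(age=abs(rnorm(100, 60, 20)),
                    sex=factor(sample(c("M", "F"), 100, replace=TRUE)),
                    variant=factor(sample(c("A", "B"), 100, replace=TRUE)),
                    treated=factor(sample(c("Yes", "No"), 100, replace=TRUE), levels=c("Yes", "No")))
treat$agebin <- cut(treat$age, breaks=c(0, 40, 60, 80, 9999), labels=c("0-40", "41-60", "61-80", "80+"))

# Each list entry adds one summary row beneath the cross-tabulation
tab_summary <- contingency_table(independents=list("Age"="agebin",
                                                   "Sex"="sex"),
                                 outcomes=list("Treated"="treated"),
                                 crosstab_funcs=list(freq()),
                                 col_funcs=list("Mean age"=summary_mean("age"),
                                                "Median age"=summary_median("age")),
                                 data=treat)
tab_summary
```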

##                |          |            |Treated       |              |
##                |          |All         |Yes           |No            |
## ----------------------------------------------------------------------
##                |          |            |              |              |
##                |Total     |100         |45 (100)      |55 (100)      |
##                |          |            |              |              |
## Age            |0-40      |13 (13)     |5 (11.1)      |8 (14.5)      |
##                |41-60     |34 (34)     |12 (26.7)     |22 (40)       |
##                |61-80     |39 (39)     |21 (46.7)     |18 (32.7)     |
##                |80+       |14 (14)     |7 (15.6)      |7 (12.7)      |
##                |          |            |              |              |
## Sex            |F         |49 (49)     |23 (51.1)     |26 (47.3)     |
##                |M         |51 (51)     |22 (48.9)     |29 (52.7)     |
##                |          |            |              |              |
## Mean age       |          |60.05       |62.2          |58.28         |
##                |          |            |              |              |
## Median age     |          |60.94       |62.79         |58.41         |

Row-wise measures

Another common addition is to display the coefficients of a regression model that relates the independent variables with an outcome (although not necessarily the same outcome displayed in the contingency table). For example, we may be interested to see how treatment varies by age group by looking at the odds ratios (ORs) of a univariate logistic regression. This functionality is provided by the row_funcs argument to contingency_table, which accepts a named list of functions that meet the correct requirements. The two functions provided with this package are odds_ratio and hazard_ratio, used to display coefficients resulting from logistic regression and Cox regression respectively.

The example below shows how to specify that the odds ratios should be calculated in addition to the cross-tabulated frequencies. The only required argument to odds_ratio is the name of the outcome variable.
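
A minimal sketch of this call is given here. The name of the row_funcs entry ("OR") becomes the column header.

```r
library(epitab)

# Simulated data from the Data section, repeated so the chunk is self-contained
set.seed(17)
treat <- data.frame(age=abs(rnorm(100, 60, 20)),
                    sex=factor(sample(c("M", "F"), 100, replace=TRUE)),
                    variant=factor(sample(c("A", "B"), 100, replace=TRUE)),
                    treated=factor(sample(c("Yes", "No"), 100, replace=TRUE), levels=c("Yes", "No")))
treat$agebin <- cut(treat$age, breaks=c(0, 40, 60, 80, 9999), labels=c("0-40", "41-60", "61-80", "80+"))

# odds_ratio is given the outcome column name; one univariate model is fitted per covariate
tab_or <- contingency_table(independents=list("Age"="agebin",
                                              "Sex"="sex"),
                            outcomes=list("Treated"="treated"),
                            crosstab_funcs=list(freq()),
                            row_funcs=list("OR"=odds_ratio("treated")),
                            data=treat)
tab_or
```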

##         |          |            |Treated       |              |                       |
##         |          |All         |Yes           |No            |OR                     |
## ---------------------------------------------------------------------------------------
##         |          |            |              |              |                       |
##         |Total     |100         |45 (100)      |55 (100)      |                       |
##         |          |            |              |              |                       |
## Age     |0-40      |13 (13)     |5 (11.1)      |8 (14.5)      |1                      |
##         |41-60     |34 (34)     |12 (26.7)     |22 (40)       |1.15 (0.29 - 4.25)     |
##         |61-80     |39 (39)     |21 (46.7)     |18 (32.7)     |0.54 (0.14 - 1.90)     |
##         |80+       |14 (14)     |7 (15.6)      |7 (12.7)      |0.63 (0.13 - 2.87)     |
##         |          |            |              |              |                       |
## Sex     |F         |49 (49)     |23 (51.1)     |26 (47.3)     |1                      |
##         |M         |51 (51)     |22 (48.9)     |29 (52.7)     |1.17 (0.53 - 2.58)     |

Additional arguments to odds_ratio allow the model to adjust for every other covariate included in independents, specify the largest group as the baseline, and select whether to include confidence intervals. Note that multiple functions can be provided to row_funcs. While the table below may not fit on the page of this document, it fits neatly within the standard R terminal output. Strategies for neatly displaying tables are discussed later on.
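
The sketch below shows one way these options might be combined. The argument names adjusted and relevel_baseline are recalled from the package help rather than stated in this guide, so treat them as assumptions and confirm them with ?odds_ratio.

```r
library(epitab)

# Simulated data from the Data section, repeated so the chunk is self-contained
set.seed(17)
treat <- data.frame(age=abs(rnorm(100, 60, 20)),
                    sex=factor(sample(c("M", "F"), 100, replace=TRUE)),
                    variant=factor(sample(c("A", "B"), 100, replace=TRUE)),
                    treated=factor(sample(c("Yes", "No"), 100, replace=TRUE), levels=c("Yes", "No")))
treat$agebin <- cut(treat$age, breaks=c(0, 40, 60, 80, 9999), labels=c("0-40", "41-60", "61-80", "80+"))

# Assumed arguments (check ?odds_ratio):
#   relevel_baseline=TRUE  uses the largest group as the reference level
#   adjusted=TRUE          adjusts for the other covariates in independents
tab_adj <- contingency_table(independents=list("Age"="agebin",
                                               "Sex"="sex"),
                             outcomes=list("Treated"="treated"),
                             crosstab_funcs=list(freq()),
                             row_funcs=list("OR"=odds_ratio("treated", relevel_baseline=TRUE),
                                            "Adj OR"=odds_ratio("treated", adjusted=TRUE,
                                                                relevel_baseline=TRUE)),
                             data=treat)
tab_adj
```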

##         |          |            |Treated       |              |                       |                       |
##         |          |All         |Yes           |No            |OR                     |Adj OR                 |
## ---------------------------------------------------------------------------------------------------------------
##         |          |            |              |              |                       |                       |
##         |Total     |100         |45 (100)      |55 (100)      |                       |                       |
##         |          |            |              |              |                       |                       |
## Age     |0-40      |13 (13)     |5 (11.1)      |8 (14.5)      |1.87 (0.53 - 7.15)     |1.84 (0.51 - 7.16)     |
##         |41-60     |34 (34)     |12 (26.7)     |22 (40)       |2.14 (0.84 - 5.61)     |2.12 (0.83 - 5.60)     |
##         |61-80     |39 (39)     |21 (46.7)     |18 (32.7)     |1                      |1                      |
##         |80+       |14 (14)     |7 (15.6)      |7 (12.7)      |1.17 (0.34 - 4.03)     |1.17 (0.34 - 4.03)     |
##         |          |            |              |              |                       |                       |
## Sex     |F         |49 (49)     |23 (51.1)     |26 (47.3)     |0.86 (0.39 - 1.89)     |0.95 (0.42 - 2.14)     |
##         |M         |51 (51)     |22 (48.9)     |29 (52.7)     |1                      |1                      |

Another use case in epidemiology is when survival is the outcome of interest. Such data is more appropriately modelled using Cox regression, which can be specified with the hazard_ratio function. This requires the outcome to be specified as a string detailing a Surv object, for example hazard_ratio("Surv(time, status)"). See the help page ?hazard_ratio for further details.
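
As a rough illustration, the chunk below adds two hypothetical columns, time and status, to the simulated data and requests hazard ratios. These columns are invented purely for this sketch and carry no meaning. The survival package is attached explicitly so that Surv is available.

```r
library(epitab)
library(survival)  # provides Surv()

# Simulated data from the Data section, repeated so the chunk is self-contained
set.seed(17)
treat <- data.frame(age=abs(rnorm(100, 60, 20)),
                    sex=factor(sample(c("M", "F"), 100, replace=TRUE)),
                    variant=factor(sample(c("A", "B"), 100, replace=TRUE)),
                    treated=factor(sample(c("Yes", "No"), 100, replace=TRUE), levels=c("Yes", "No")))
treat$agebin <- cut(treat$age, breaks=c(0, 40, 60, 80, 9999), labels=c("0-40", "41-60", "61-80", "80+"))

# Hypothetical follow-up data, simulated only for illustration
treat$time <- rexp(100, rate=0.1)      # follow-up time
treat$status <- rbinom(100, 1, 0.7)    # 1 = event observed, 0 = censored

# The outcome is passed as a string describing the Surv object
tab_hr <- contingency_table(independents=list("Age"="agebin",
                                              "Sex"="sex"),
                            outcomes=list("Treated"="treated"),
                            crosstab_funcs=list(freq()),
                            row_funcs=list("HR"=hazard_ratio("Surv(time, status)")),
                            data=treat)
tab_hr
```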

Tips

There is no limit to the number of column-wise and row-wise functions that can be supplied, although too many can hinder readability and detract from the purpose of the table.
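
For instance, row-wise and column-wise functions can be supplied in the same call. In this sketch the adjusted and relevel_baseline arguments of odds_ratio are assumed names; see ?odds_ratio.

```r
library(epitab)

# Simulated data from the Data section, repeated so the chunk is self-contained
set.seed(17)
treat <- data.frame(age=abs(rnorm(100, 60, 20)),
                    sex=factor(sample(c("M", "F"), 100, replace=TRUE)),
                    variant=factor(sample(c("A", "B"), 100, replace=TRUE)),
                    treated=factor(sample(c("Yes", "No"), 100, replace=TRUE), levels=c("Yes", "No")))
treat$agebin <- cut(treat$age, breaks=c(0, 40, 60, 80, 9999), labels=c("0-40", "41-60", "61-80", "80+"))

# Two extra columns (row_funcs) and two extra rows (col_funcs)
tab_both <- contingency_table(independents=list("Age"="agebin",
                                                "Sex"="sex"),
                              outcomes=list("Treated"="treated"),
                              crosstab_funcs=list(freq()),
                              row_funcs=list("OR"=odds_ratio("treated", relevel_baseline=TRUE),
                                             "Adj OR"=odds_ratio("treated", adjusted=TRUE,
                                                                 relevel_baseline=TRUE)),
                              col_funcs=list("Mean age"=summary_mean("age"),
                                             "Median age"=summary_median("age")),
                              data=treat)
tab_both
```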

##                |          |            |Treated       |              |                       |                       |
##                |          |All         |Yes           |No            |OR                     |Adj OR                 |
## ----------------------------------------------------------------------------------------------------------------------
##                |          |            |              |              |                       |                       |
##                |Total     |100         |45 (100)      |55 (100)      |                       |                       |
##                |          |            |              |              |                       |                       |
## Age            |0-40      |13 (13)     |5 (11.1)      |8 (14.5)      |1.87 (0.53 - 7.15)     |1.84 (0.51 - 7.16)     |
##                |41-60     |34 (34)     |12 (26.7)     |22 (40)       |2.14 (0.84 - 5.61)     |2.12 (0.83 - 5.60)     |
##                |61-80     |39 (39)     |21 (46.7)     |18 (32.7)     |1                      |1                      |
##                |80+       |14 (14)     |7 (15.6)      |7 (12.7)      |1.17 (0.34 - 4.03)     |1.17 (0.34 - 4.03)     |
##                |          |            |              |              |                       |                       |
## Sex            |F         |49 (49)     |23 (51.1)     |26 (47.3)     |0.86 (0.39 - 1.89)     |0.95 (0.42 - 2.14)     |
##                |M         |51 (51)     |22 (48.9)     |29 (52.7)     |1                      |1                      |
##                |          |            |              |              |                       |                       |
## Mean age       |          |60.05       |62.2          |58.28         |                       |                       |
##                |          |            |              |              |                       |                       |
## Median age     |          |60.94       |62.79         |58.41         |                       |                       |

This flexibility of epitab allows for either simple summary tables that are used to highlight a trend within the data, or more complex reference tables that hold a large amount of summary statistics.
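
A larger reference table might be built along these lines, with two outcomes and one odds ratio column per outcome. The relevel_baseline argument is an assumed name; see ?odds_ratio.

```r
library(epitab)

# Simulated data from the Data section, repeated so the chunk is self-contained
set.seed(17)
treat <- data.frame(age=abs(rnorm(100, 60, 20)),
                    sex=factor(sample(c("M", "F"), 100, replace=TRUE)),
                    variant=factor(sample(c("A", "B"), 100, replace=TRUE)),
                    treated=factor(sample(c("Yes", "No"), 100, replace=TRUE), levels=c("Yes", "No")))
treat$agebin <- cut(treat$age, breaks=c(0, 40, 60, 80, 9999), labels=c("0-40", "41-60", "61-80", "80+"))

# Each odds_ratio call names the outcome that its regression models
tab_ref <- contingency_table(independents=list("Age"="agebin",
                                               "Sex"="sex"),
                             outcomes=list("Treated"="treated",
                                           "Disease variant"="variant"),
                             crosstab_funcs=list(freq()),
                             row_funcs=list("Treatment OR"=odds_ratio("treated", relevel_baseline=TRUE),
                                            "Disease variant OR"=odds_ratio("variant", relevel_baseline=TRUE)),
                             col_funcs=list("Mean age"=summary_mean("age"),
                                            "Median age"=summary_median("age")),
                             data=treat)
tab_ref
```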

##                |          |            |Treated       |              |Disease variant     |              |                       |                       |
##                |          |All         |Yes           |No            |A                   |B             |Treatment OR           |Disease variant OR     |
## ----------------------------------------------------------------------------------------------------------------------------------------------------------
##                |          |            |              |              |                    |              |                       |                       |
##                |Total     |100         |45 (100)      |55 (100)      |55 (100)            |45 (100)      |                       |                       |
##                |          |            |              |              |                    |              |                       |                       |
## Age            |0-40      |13 (13)     |5 (11.1)      |8 (14.5)      |7 (12.7)            |6 (13.3)      |1.87 (0.53 - 7.15)     |1.37 (0.38 - 4.92)     |
##                |41-60     |34 (34)     |12 (26.7)     |22 (40)       |14 (25.5)           |20 (44.4)     |2.14 (0.84 - 5.61)     |2.29 (0.90 - 5.96)     |
##                |61-80     |39 (39)     |21 (46.7)     |18 (32.7)     |24 (43.6)           |15 (33.3)     |1                      |1                      |
##                |80+       |14 (14)     |7 (15.6)      |7 (12.7)      |10 (18.2)           |4 (8.9)       |1.17 (0.34 - 4.03)     |0.64 (0.15 - 2.30)     |
##                |          |            |              |              |                    |              |                       |                       |
## Sex            |F         |49 (49)     |23 (51.1)     |26 (47.3)     |29 (52.7)           |20 (44.4)     |0.86 (0.39 - 1.89)     |0.72 (0.32 - 1.58)     |
##                |M         |51 (51)     |22 (48.9)     |29 (52.7)     |26 (47.3)           |25 (55.6)     |1                      |1                      |
##                |          |            |              |              |                    |              |                       |                       |
## Mean age       |          |60.05       |62.2          |58.28         |62.27               |57.32         |                       |                       |
##                |          |            |              |              |                    |              |                       |                       |
## Median age     |          |60.94       |62.79         |58.41         |63.76               |56.45         |                       |                       |

contingency_table can even be used when there is no cross-tabulation, for example as a means of displaying regression coefficients.
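
The sketch below assumes that outcomes and crosstab_funcs are optional and can simply be left out, which this guide implies but does not state explicitly; see ?contingency_table if the call fails. The adjusted and relevel_baseline arguments are likewise assumed names.

```r
library(epitab)

# Simulated data from the Data section, repeated so the chunk is self-contained
set.seed(17)
treat <- data.frame(age=abs(rnorm(100, 60, 20)),
                    sex=factor(sample(c("M", "F"), 100, replace=TRUE)),
                    variant=factor(sample(c("A", "B"), 100, replace=TRUE)),
                    treated=factor(sample(c("Yes", "No"), 100, replace=TRUE), levels=c("Yes", "No")))
treat$agebin <- cut(treat$age, breaks=c(0, 40, 60, 80, 9999), labels=c("0-40", "41-60", "61-80", "80+"))

# No outcomes or crosstab_funcs: only the regression coefficients are shown
tab_coefs <- contingency_table(independents=list("Age"="agebin",
                                                 "Sex"="sex"),
                               row_funcs=list("OR"=odds_ratio("treated", relevel_baseline=TRUE),
                                              "Adj OR"=odds_ratio("treated", adjusted=TRUE,
                                                                  relevel_baseline=TRUE)),
                               data=treat)
tab_coefs
```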

##         |          |OR                     |Adj OR                 |
## --------------------------------------------------------------------
##         |          |                       |                       |
## Age     |0-40      |1.87 (0.53 - 7.15)     |1.84 (0.51 - 7.16)     |
##         |41-60     |2.14 (0.84 - 5.61)     |2.12 (0.83 - 5.60)     |
##         |61-80     |1                      |1                      |
##         |80+       |1.17 (0.34 - 4.03)     |1.17 (0.34 - 4.03)     |
##         |          |                       |                       |
## Sex     |F         |0.86 (0.39 - 1.89)     |0.95 (0.42 - 2.14)     |
##         |M         |1                      |1                      |

Publication quality tables

The default print method of these contingency tables is designed for a standard wide R console, where the entire table fits width-wise. However, for situations where a table is being produced for distribution or publication of any type, greater attention to detail and appearance is required. epitab provides several options for exporting clean-looking tables.

neat_table to HTML and PDF

The neat_table function provided in epitab builds a cleanly formatted table for output to HTML or LaTeX, using knitr::kable and the kableExtra package. The output of neat_table is a kable object and so can be passed to kableExtra::kable_styling(), allowing for the specification of various cosmetic settings. See the help files for both neat_table and kableExtra::kable_styling for further details.
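
A sketch of this workflow follows. The styling options chosen here (striped, hover, full_width) are only examples of what kable_styling accepts, and the adjusted argument of odds_ratio is an assumed name.

```r
library(epitab)

# Simulated data from the Data section, repeated so the chunk is self-contained
set.seed(17)
treat <- data.frame(age=abs(rnorm(100, 60, 20)),
                    sex=factor(sample(c("M", "F"), 100, replace=TRUE)),
                    variant=factor(sample(c("A", "B"), 100, replace=TRUE)),
                    treated=factor(sample(c("Yes", "No"), 100, replace=TRUE), levels=c("Yes", "No")))
treat$agebin <- cut(treat$age, breaks=c(0, 40, 60, 80, 9999), labels=c("0-40", "41-60", "61-80", "80+"))

tab <- contingency_table(independents=list("Age"="agebin",
                                           "Sex"="sex"),
                         outcomes=list("Treated"="treated"),
                         crosstab_funcs=list(freq()),
                         col_funcs=list("Mean age"=summary_mean("age")),
                         row_funcs=list("OR"=odds_ratio("treated"),
                                        "Adj OR"=odds_ratio("treated", adjusted=TRUE)),
                         data=treat)

# neat_table returns a kable object, which kable_styling can decorate further
styled <- kableExtra::kable_styling(neat_table(tab),
                                    bootstrap_options=c("striped", "hover"),
                                    full_width=FALSE)
styled
```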

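|          | All     | Treated: Yes | Treated: No | OR                 | Adj OR             |
|:---------|:--------|:-------------|:------------|:-------------------|:-------------------|
| Total    | 100     | 45 (100)     | 55 (100)    |                    |                    |
| Age      |         |              |             |                    |                    |
| 0-40     | 13 (13) | 5 (11.1)     | 8 (14.5)    | 1                  | 1                  |
| 41-60    | 34 (34) | 12 (26.7)    | 22 (40)     | 1.15 (0.29 - 4.25) | 1.15 (0.29 - 4.31) |
| 61-80    | 39 (39) | 21 (46.7)    | 18 (32.7)   | 0.54 (0.14 - 1.90) | 0.54 (0.14 - 1.96) |
| 80+      | 14 (14) | 7 (15.6)     | 7 (12.7)    | 0.63 (0.13 - 2.87) | 0.63 (0.13 - 2.96) |
| Sex      |         |              |             |                    |                    |
| F        | 49 (49) | 23 (51.1)    | 26 (47.3)   | 1                  | 1                  |
| M        | 51 (51) | 22 (48.9)    | 29 (52.7)   | 1.17 (0.53 - 2.58) | 1.06 (0.47 - 2.39) |
| Mean age | 60.05   | 62.2         | 58.28       |                    |                    |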

Due to the vignette markdown theme, these styling changes won’t appear in this HTML. The screenshot below shows how the table appears when using the above code in the default Rmarkdown template.

For outputting to PDF documents using LaTeX, the same neat_table function can be used, but the latex output format must now be specified. It is also highly recommended to use the booktabs argument, which produces far cleaner looking tables.
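
A minimal sketch of the LaTeX variant is shown here. The format is passed as the second argument, and booktabs=TRUE is assumed to be forwarded to knitr::kable; see ?neat_table.

```r
library(epitab)

# Simulated data from the Data section, repeated so the chunk is self-contained
set.seed(17)
treat <- data.frame(age=abs(rnorm(100, 60, 20)),
                    sex=factor(sample(c("M", "F"), 100, replace=TRUE)),
                    variant=factor(sample(c("A", "B"), 100, replace=TRUE)),
                    treated=factor(sample(c("Yes", "No"), 100, replace=TRUE), levels=c("Yes", "No")))
treat$agebin <- cut(treat$age, breaks=c(0, 40, 60, 80, 9999), labels=c("0-40", "41-60", "61-80", "80+"))

tab <- contingency_table(independents=list("Age"="agebin",
                                           "Sex"="sex"),
                         outcomes=list("Treated"="treated"),
                         crosstab_funcs=list(freq()),
                         row_funcs=list("OR"=odds_ratio("treated")),
                         data=treat)

# Request LaTeX output with booktabs rules for use in a PDF document
latex_tab <- neat_table(tab, "latex", booktabs=TRUE)
latex_tab
```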

The above call will display the table below using the default Rmarkdown template.

kable

If full control of the table appearance is required, the raw character matrix is provided as the mat attribute of the output of contingency_table. It can be used in conjunction with knitr::kable and kableExtra. NB: The default value for format in kable is pandoc, which does not work well with epitab; try html or markdown instead.
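
The following sketch assumes the matrix is reached as tab$mat, and passes it to knitr::kable with the html format before applying some example styling. The adjusted argument of odds_ratio is an assumed name.

```r
library(epitab)

# Simulated data from the Data section, repeated so the chunk is self-contained
set.seed(17)
treat <- data.frame(age=abs(rnorm(100, 60, 20)),
                    sex=factor(sample(c("M", "F"), 100, replace=TRUE)),
                    variant=factor(sample(c("A", "B"), 100, replace=TRUE)),
                    treated=factor(sample(c("Yes", "No"), 100, replace=TRUE), levels=c("Yes", "No")))
treat$agebin <- cut(treat$age, breaks=c(0, 40, 60, 80, 9999), labels=c("0-40", "41-60", "61-80", "80+"))

tab <- contingency_table(independents=list("Age"="agebin",
                                           "Sex"="sex"),
                         outcomes=list("Treated"="treated"),
                         crosstab_funcs=list(freq()),
                         row_funcs=list("OR"=odds_ratio("treated"),
                                        "Adj OR"=odds_ratio("treated", adjusted=TRUE)),
                         data=treat)

# tab$mat is the raw character matrix, including the header and label cells
raw_kable <- knitr::kable(tab$mat, format="html")
styled_kable <- kableExtra::kable_styling(raw_kable,
                                          bootstrap_options=c("striped", "hover"),
                                          full_width=FALSE)
styled_kable
```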

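|     |       | All     | Treated: Yes | Treated: No | OR                 | Adj OR             |
|:----|:------|:--------|:-------------|:------------|:-------------------|:-------------------|
|     | Total | 100     | 45 (100)     | 55 (100)    |                    |                    |
| Age | 0-40  | 13 (13) | 5 (11.1)     | 8 (14.5)    | 1                  | 1                  |
|     | 41-60 | 34 (34) | 12 (26.7)    | 22 (40)     | 1.15 (0.29 - 4.25) | 1.15 (0.29 - 4.31) |
|     | 61-80 | 39 (39) | 21 (46.7)    | 18 (32.7)   | 0.54 (0.14 - 1.90) | 0.54 (0.14 - 1.96) |
|     | 80+   | 14 (14) | 7 (15.6)     | 7 (12.7)    | 0.63 (0.13 - 2.87) | 0.63 (0.13 - 2.96) |
| Sex | F     | 49 (49) | 23 (51.1)    | 26 (47.3)   | 1                  | 1                  |
|     | M     | 51 (51) | 22 (48.9)    | 29 (52.7)   | 1.17 (0.53 - 2.58) | 1.06 (0.47 - 2.39) |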

Again note that the stylistic changes made above will not display in the vignette you are currently reading due to the vignette template, but they will appear in your Rmarkdown output as shown below.

Word

Since Word is a proprietary format, it is challenging to directly embed tables into documents. The most convenient method to export a contingency table into Word involves the following steps:

  1. Save the raw table as a CSV (see code snippet below)
  2. Open the CSV in Excel and copy the table
  3. Right click in a Word document and select Paste Options -> Use Destination Styles
  4. Adjust table appearance using standard Table Tools | Design options
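
Step 1 can be sketched as below, assuming the raw matrix is available as tab$mat. Row and column names are suppressed because the matrix already contains its own header and label cells. The file is written to a temporary directory here; any path can be substituted.

```r
library(epitab)

# Simulated data from the Data section, repeated so the chunk is self-contained
set.seed(17)
treat <- data.frame(age=abs(rnorm(100, 60, 20)),
                    sex=factor(sample(c("M", "F"), 100, replace=TRUE)),
                    variant=factor(sample(c("A", "B"), 100, replace=TRUE)),
                    treated=factor(sample(c("Yes", "No"), 100, replace=TRUE), levels=c("Yes", "No")))
treat$agebin <- cut(treat$age, breaks=c(0, 40, 60, 80, 9999), labels=c("0-40", "41-60", "61-80", "80+"))

tab <- contingency_table(independents=list("Age"="agebin",
                                           "Sex"="sex"),
                         outcomes=list("Treated"="treated"),
                         crosstab_funcs=list(freq()),
                         data=treat)

# Write the character matrix as comma-separated values without extra names
csv_path <- file.path(tempdir(), "table.csv")
write.table(tab$mat, csv_path, sep=",", row.names=FALSE, col.names=FALSE)
```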

Custom summary functions

In the above examples, the summary functions used to build up the table in crosstab_funcs, row_funcs, and col_funcs have been provided by epitab. However, for greater flexibility, any correctly parametrised function can be supplied instead. This section details the appropriate form for each of these 3 arguments.

Cross-tabulated

The functions passed in to crosstab_funcs are run for each combination of outcome and independent variable level.

Arguments:

The function must return a vector of length one, representing the statistic for this covariate-level / outcome-level pair.

The example function below calculates the proportion of each treatment type per covariate level (rather than also displaying the counts as freq does).
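
A sketch of such a function is given here. The argument names (data, outcome_level, outcome_name, independent_level, independent_name) are an assumption about what contingency_table supplies and should be checked against ?contingency_table. The function is exercised directly on a small toy data frame so that the chunk does not rely on that assumption.

```r
# Hypothetical cross-tabulation function; the argument names are assumed
proportion <- function(data, outcome_level=NULL, outcome_name=NULL,
                       independent_level=NULL, independent_name=NULL) {
    # Restrict to the outcome level of interest, if one is given
    if (!is.null(outcome_name) && !is.null(outcome_level)) {
        data <- data[data[[outcome_name]] == outcome_level, ]
    }
    denominator <- nrow(data)
    # Without a covariate level, every remaining row counts (the Total row)
    if (is.null(independent_name) || is.null(independent_level)) {
        numerator <- denominator
    } else {
        numerator <- sum(data[[independent_name]] == independent_level)
    }
    # Return a single formatted percentage
    sprintf("%.1f%%", 100 * numerator / denominator)
}

# Toy data: two treated patients (one F, one M) and two untreated (both F)
toy <- data.frame(treated=c("Yes", "Yes", "No", "No"),
                  sex=c("F", "M", "F", "F"))
prop_treated_f <- proportion(toy, "Yes", "treated", "F", "sex")
prop_treated_f

# If the assumed signature is right, it would be supplied as:
# contingency_table(..., crosstab_funcs=list(proportion), data=treat)
```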

##         |          |          |Treated     |           |
##         |          |All       |Yes         |No         |
## --------------------------------------------------------
##         |          |          |            |           |
##         |Total     |100       |100.0%      |100.0%     |
##         |          |          |            |           |
## Age     |0-40      |13.0%     |11.1%       |14.5%      |
##         |41-60     |34.0%     |26.7%       |40.0%      |
##         |61-80     |39.0%     |46.7%       |32.7%      |
##         |80+       |14.0%     |15.6%       |12.7%      |
##         |          |          |            |           |
## Sex     |F         |49.0%     |51.1%       |47.3%      |
##         |M         |51.0%     |48.9%       |52.7%      |

Column-wise

The column-wise functions provided with epitab are summary_mean and summary_median, and are used to investigate relationships between the outcome variables that aren’t necessarily associated with the categorical covariates.

Args:

Returns:

The function must return a single value, representing the statistic for this outcome level.

The example function below extends summary_mean by adding the standard deviation in parentheses. It is hard-coded to work for the continuous variable age in the dummy data set.
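
One possible form of this function is sketched below. The argument names (data, outcome_level, outcome_name) are an assumption about what contingency_table passes to column-wise functions; check ?contingency_table. It is called directly on a toy data frame so that the chunk runs regardless.

```r
# Hypothetical column-wise function, hard-coded to the age column.
# The argument names are assumed.
mean_sd <- function(data, outcome_level=NULL, outcome_name=NULL) {
    # Restrict to one outcome level when given; otherwise use all rows
    if (!is.null(outcome_name) && !is.null(outcome_level)) {
        data <- data[data[[outcome_name]] == outcome_level, ]
    }
    # A single string: mean followed by the standard deviation in parentheses
    paste0(round(mean(data$age), 2), " (", round(sd(data$age), 2), ")")
}

# Toy data for a direct check
toy <- data.frame(age=c(50, 60, 70, 80),
                  treated=c("Yes", "Yes", "No", "No"))
overall <- mean_sd(toy)
treated_only <- mean_sd(toy, "Yes", "treated")
overall
treated_only

# If the assumed signature is right, it would be supplied as:
# contingency_table(..., col_funcs=list("Mean age (sd)"=mean_sd), data=treat)
```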

##                   |          |                  |Treated          |                  |
##                   |          |All               |Yes              |No                |
## --------------------------------------------------------------------------------------
##                   |          |                  |                 |                  |
##                   |Total     |100               |45 (100)         |55 (100)          |
##                   |          |                  |                 |                  |
## Age               |0-40      |13 (13)           |5 (11.1)         |8 (14.5)          |
##                   |41-60     |34 (34)           |12 (26.7)        |22 (40)           |
##                   |61-80     |39 (39)           |21 (46.7)        |18 (32.7)         |
##                   |80+       |14 (14)           |7 (15.6)         |7 (12.7)          |
##                   |          |                  |                 |                  |
## Sex               |F         |49 (49)           |23 (51.1)        |26 (47.3)         |
##                   |M         |51 (51)           |22 (48.9)        |29 (52.7)         |
##                   |          |                  |                 |                  |
## Mean age          |          |60.05             |62.2             |58.28             |
##                   |          |                  |                 |                  |
## Mean age (sd)     |          |60.05 (20.13)     |62.2 (20.44)     |58.28 (19.89)     |

Row-wise

Row-wise functions are used to estimate summary measures for the independent variables outside of the contingency table. This can be useful for providing a summary statistic that is not necessarily related to the outcome variables. The example here runs a linear regression on a continuous outcome, namely continuous age. This is not a particularly helpful analysis, since one of the covariates is age group, but it serves as an illustration. These functions are run for each independent variable and must be parametrised as follows:

Args:

The function must return a vector with length equal to the number of levels of var.
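
A sketch of a function of this shape follows. Only var is named in this guide; the other argument names (data, all_vars) and their order are assumptions, so check ?contingency_table before relying on them. The function is called directly on a toy data frame so that the chunk runs regardless.

```r
# Hypothetical row-wise function: univariate linear regression of age on var.
# The argument names data and all_vars, and the argument order, are assumed.
linear_regression <- function(data, var, all_vars) {
    fit <- lm(as.formula(paste("age ~", var)), data=data)
    # Drop the intercept; the baseline level is marked with 1, following the
    # convention of the tables in this guide
    c(1, round(unname(coef(fit))[-1], 3))
}

# Toy data: mean age is 35 for F and 65 for M, a difference of 30
toy <- data.frame(age=c(30, 40, 60, 70),
                  sex=factor(c("F", "F", "M", "M")))
coefs <- linear_regression(toy, "sex", c("sex"))
coefs

# If the assumed signature is right, it would be supplied as:
# contingency_table(..., row_funcs=list("Regression on age"=linear_regression), data=treat)
```

The returned vector has one entry per level of var, as required.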

##         |          |            |Treated       |              |                      |
##         |          |All         |Yes           |No            |Regression on age     |
## --------------------------------------------------------------------------------------
##         |          |            |              |              |                      |
##         |Total     |100         |45 (100)      |55 (100)      |                      |
##         |          |            |              |              |                      |
## Age     |0-40      |13 (13)     |5 (11.1)      |8 (14.5)      |1                     |
##         |41-60     |34 (34)     |12 (26.7)     |22 (40)       |23.49                 |
##         |61-80     |39 (39)     |21 (46.7)     |18 (32.7)     |42.379                |
##         |80+       |14 (14)     |7 (15.6)      |7 (12.7)      |65.6                  |
##         |          |            |              |              |                      |
## Sex     |F         |49 (49)     |23 (51.1)     |26 (47.3)     |1                     |
##         |M         |51 (51)     |22 (48.9)     |29 (52.7)     |-6.563                |