In addition to flashier and ggplot2, we will make use of package cowplot below in order to arrange plots into grids:
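# Load the packages used in this vignette (all appear in the session
# information at the end of the vignette):
library(flashier)
library(ggplot2)
library(cowplot)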
Empirical Bayes matrix factorization (Wang and Stephens, 2021) decomposes a data matrix \(Y \in \mathbb{R}^{n \times p}\) as \[ Y = LF' + E, \] with “loadings” \(L \in \mathbb{R}^{n \times K}\), “factors” \(F \in \mathbb{R}^{p \times K}\), and residual errors \[ e_{ij} \sim N(0, 1 / \tau_{ij}).\] The model puts priors on each factor and each set of loadings: \[f_{\cdot k} \sim g^{(f)}_k,\ \ell_{\cdot k} \sim g^{(\ell)}_k,\] with \(g^{(\ell)}_k\) and \(g^{(f)}_k\) assumed to belong to some families of distributions \(\mathcal{G}^{(\ell)}\) and \(\mathcal{G}^{(f)}\) and estimated using the data. The default choice of prior family for both factors and loadings is the family of point-normal distributions: \[ g \sim \pi_0 \delta_0 + (1 - \pi_0) N(0, \sigma^2), \] where both \(\pi_0\) and \(\sigma^2\) are free parameters. This family is especially useful when factors and loadings can be expected to be at least modestly sparse; in other settings, different prior families might be preferred.
To avoid over-parametrization, it is necessary to make some assumptions about the precision parameters \(\tau_{ij}\). The default assumption is that all \(\tau_{ij}\)s are equal: \[ e_{ij} \sim N(0, 1 / \tau).\]
Note that when the prior families \(\mathcal{G}^{(\ell)}\) and \(\mathcal{G}^{(f)}\) are closed under scaling (as is typically the case), then the model as formulated above is not identifiable, since we can scale each set of loadings \(\ell_{\cdot k}\) by any constant \(d_k \ne 0\) if we also scale factor \(f_{\cdot k}\) by \(1 / d_k\).
We can make the matrix factorization unique by writing \[ Y = LDF' + E, \] with the scales of loadings \(\ell_{\cdot 1}, \ldots, \ell_{\cdot K}\) and factors \(f_{\cdot 1}, \ldots, f_{\cdot K}\) constrained in some fashion (for example, by requiring \(\| \ell_{\cdot k} \|_2 = 1\) and \(\| f_{\cdot k} \|_2 = 1\) for all \(k\)). We refer to such a factorization as an LDF factorization.
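To make the model concrete, the following small simulation (our own sketch, not part of the GTEx analysis below) generates a data matrix \(Y = LF' + E\) with sparse, point-normal-style loadings and factors and a constant residual precision:
# Simulate a small EBMF dataset: n = 100 rows, p = 40 columns, K = 2 factors.
set.seed(1)
n <- 100; p <- 40; K <- 2
# Sparse loadings and factors: exact zeros mixed with normal draws.
LL <- matrix(rnorm(n * K) * rbinom(n * K, 1, 0.2), n, K)
FF <- matrix(rnorm(p * K) * rbinom(p * K, 1, 0.5), p, K)
# Y = LF' + E, with constant residual precision tau = 4 (residual sd = 1/2).
Y <- tcrossprod(LL, FF) + matrix(rnorm(n * p, sd = 0.5), n, p)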
The gtex dataset and the flash() function
Throughout this vignette, we will use the main flash() function to produce a number of EBMF fits to the gtex dataset, which comprises a subset of the data used as the primary example
in Wang and Stephens (2021). The dataset is derived from data made
available by the Genotype Tissue Expression (GTEx) project (Lonsdale et
al. 2013), which provides \(z\)-scores
for assessing the significance of effects of genetic variants (“single
nucleotide polymorphisms” or SNPs) on gene expression across 44 human
tissues. To reduce the data to a more manageable size, Urbut et
al. (2019) chose the “top” SNP for each gene — that is, the SNP
associated with the largest (absolute) \(z\)-score over all 44 tissues. This process
yields a 16,069 by 44 matrix of \(z\)-scores, with rows corresponding to
SNP-gene pairs or “eQTLs” and columns corresponding to tissues. The
gtex
dataset included in flashier
is further
subsampled down to 1000 rows.
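We can load the data and confirm its dimensions (gtex_colors, also included in flashier, is a vector of colors for the 44 tissues that we use for plotting below):
data(gtex, gtex_colors)
dim(gtex)
#> [1] 1000   44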
One potential aim in applying matrix factorization methods to the
gtex
dataset is to analyze patterns of effect sharing. For
example, Wang and Stephens (2021) point out an EBMF factor that has
large values across brain tissues (cortex, cerebellum, basal ganglia,
etc.), but values much closer to zero for all non-brain tissues. We can
expect eQTLs that are loaded on this factor to affect brain and
non-brain tissues much differently. Other factors have large values for
a single tissue only (e.g., testis). We can expect eQTLs loaded on such
factors to be more or less tissue-specific.
To fit the EBMF model, we use function flash(), which adds up to greedy_Kmax factor/loadings pairs using the
“greedy” algorithm described in Wang and Stephens (2021). In brief, the
algorithm adds new factor/loadings pairs one at a time, optimizing each
pair as it gets added without returning to re-optimize previously added
pairs. In other words, it finds the best rank-one fit; then the best
rank-two fit with the first factor/loadings pair fixed; then the best
rank-three fit with the first two factor/loadings pairs fixed; and so
on. The algorithm stops adding factor/loadings pairs when they fail to
yield an increase in the variational lower bound on the log likelihood
(or ELBO), and in this sense EBMF can “automatically” select the number
of factor/loadings pairs \(K\). Thus it is often sufficient simply to set greedy_Kmax as high as is practically manageable (the default is greedy_Kmax = 50).
Here, however, we will restrict the number of factor/loadings pairs in
order to ensure that the vignette can be run quickly:
data(gtex, gtex_colors)
fit_default <- flash(gtex, greedy_Kmax = 5)
#> Adding factor 1 to flash object...
#> Adding factor 2 to flash object...
#> Adding factor 3 to flash object...
#> Adding factor 4 to flash object...
#> Adding factor 5 to flash object...
#> Wrapping up...
#> Done.
#> Nullchecking 5 factors...
#> Done.
By default, flash()
performs a “nullcheck” after the
greedy algorithm terminates: that is, it loops over each greedily added
factor/loadings pair and checks that removing it does not produce an
increase in the ELBO (i.e., a better fit). If it does, then the
factor/loadings pair is removed from the final fit.
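If desired, the nullcheck can be skipped via the nullcheck argument (a sketch; we set verbose = 0 to suppress progress updates):
fit_no_nullcheck <- flash(gtex, greedy_Kmax = 5, nullcheck = FALSE, verbose = 0)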
The output of function flash() is an object of class flash. Useful information about the fit is stored as list elements; for example, the ELBO can be accessed as:
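fit_default$elbo
#> [1] -83131.69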
Additionally, a few class methods have been implemented, including fitted(), which returns \(\mathbb{E} (LF')\); residuals(), which yields the matrix of expected residuals \(Y - \mathbb{E} (LF')\); and ldf(), which gives the LDF factorization with the scaling of loadings and factors determined by argument type.
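For instance, a brief sketch of these accessors:
gtex_fitted <- fitted(fit_default)    # E(LF'): same dimensions as gtex
gtex_resid <- residuals(fit_default)  # Y - E(LF')
gtex_ldf <- ldf(fit_default, type = "m")  # LDF factorization, max-norm scaling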
Finally, flashier provides a convenient plot() method. By default, plot() returns two figures: a “scree plot” showing the proportion of variance explained by each factor/loadings pair, and plots of posterior means of either factors or loadings (depending on the argument to pm_which). Here, we supply colors for the posterior means plots in order to match the figures included in Wang and Stephens (2021):
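# Default plot: scree plot plus posterior means of factors, colored by tissue.
plot(fit_default, pm_which = "factors", pm_colors = gtex_colors)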
Most of the large values for factor 2 above correspond to brain tissues (yellow), but the factor is not very “clean” in the sense that many other tissues have medium to large values as well. Factor 3 is primarily confined to whole blood, but there again appears to be a lot of noise, which makes biological interpretation difficult.
In the following sections, we will attempt to clean up these factors by choosing better (non-default) arguments to various parameters in function flash(). We note in passing that flashier supplies two interfaces for fitting EBMF models: one via flash(), and a second, pipeable interface that provides many more options for customization. In this vignette, we only consider the first, simpler interface; for an introduction to the second, see the “Advanced flashier” vignette.
We begin by considering parameter var_type. As mentioned above, the default model (var_type = 0) estimates a single precision parameter \(\tau\): \[ e_{ij} \sim N(0, 1 / \tau).\] It is also possible to estimate a precision parameter for each row (var_type = 1) or for each column (var_type = 2). For example, setting var_type = 2 gives the model \[ e_{ij} \sim N(0, 1 / \tau_j),\] so that residual variance is assumed to vary from column to column but to be constant within each column. Note that this “by-column” assumption contains the constant variance assumption var_type = 0 as a special case, so it will typically yield a higher ELBO (we write “typically” rather than “always” because the optimization surface is in general non-convex, so there is no guarantee that the global optimum will be found):
fit_var_bycol <- flash(gtex, greedy_Kmax = 5, var_type = 2, verbose = 0)
c(const_elbo = fit_default$elbo, bycol_elbo = fit_var_bycol$elbo)
#> const_elbo bycol_elbo
#> -83131.69 -80490.27
The most general assumption allowed by flashier is that the matrix of precision parameters \(\tau_{ij}\) is an arbitrary rank-one matrix. That is, for each \(i\) and \(j\), \(\tau_{ij} = \tau^{(1)}_i \tau^{(2)}_j\), where \(\tau^{(1)}\) is an \(n\)-vector to be estimated and \(\tau^{(2)}\) is a \(p\)-vector to be estimated. This assumption, which we refer to as the “Kronecker” variance assumption, can be motivated by noting that it describes a model in which residuals are distributed \(e_{ij} \sim N(0, 1)\), but the rows and columns of \(Y\) have each been scaled by some (unknown) constants: \[ \text{diag} (\tau^{(1)}) Y \text{diag} (\tau^{(2)}) = L F' + E.\]
Just as the by-column assumption usually yields an increase in ELBO relative to the constant variance assumption, the Kronecker assumption will generally yield a higher ELBO than the by-column assumption:
fit_var_kronecker <- flash(gtex, greedy_Kmax = 5, var_type = c(1, 2), verbose = 0)
c(bycol_elbo = fit_var_bycol$elbo, kron_elbo = fit_var_kronecker$elbo)
#> bycol_elbo kron_elbo
#> -80490.27 -78636.44
Although the Kronecker assumption can yield large increases in ELBO, one should be careful when using it with very large datasets, as the precision parameters must be estimated at each iteration via an alternating maximization algorithm. In contrast, precision parameters can be obtained analytically (and thus much more quickly) under a constant, by-row, or by-column variance assumption. Note also that despite the large increase in ELBO from the by-column to the Kronecker variance assumption, there are no obvious differences in the plotted posterior means:
p1 <- plot(
fit_var_bycol,
pm_which = "factors",
pm_colors = gtex_colors,
order_by_pve = FALSE,
include_scree = FALSE
) + ggtitle("By-column variance assumption")
p2 <- plot(
fit_var_kronecker,
pm_which = "factors",
pm_colors = gtex_colors,
order_by_pve = FALSE,
include_scree = FALSE
) + ggtitle("Kronecker variance assumption")
plot_grid(p1, p2, nrow = 2)
To compare the fits more directly, we can use function ldf() to extract factor values, with argument type set to "m" in order to match the values displayed by plot.flash() ("m" stands for maximum norm; setting type = "m" scales the columns of \(F\) so that the maximum absolute value over each column is equal to 1). A scatterplot confirms that estimates are very similar, with minor differences in factors 2, 4, and 5:
comparison_df <- data.frame(
var_by_col = as.vector(ldf(fit_var_bycol, type = "m")$F),
var_kronecker = as.vector(ldf(fit_var_kronecker, type = "m")$F),
k = factor(rep(1:5, each = ncol(gtex)))
)
ggplot(comparison_df, aes(x = var_by_col, y = var_kronecker, col = k)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1, linetype = "dashed") +
scale_color_brewer(palette = "Set1") +
labs(x = "By-column variance assumption", y = "Kronecker variance assumption") +
theme_minimal()
In many scenarios, the data \(Y\) is observed with some known error. In such a case, it might be preferable to fit the model \[ Y = L F' + S + E, \] where \(S_{ij} \sim N(0, s^2_{ij})\) with the \(s^2_{ij}\)s fixed. In simple cases, this model can almost be reduced to the model described above. For example, since the gtex dataset consists of \(z\)-scores, we might choose to set the \(s_{ij}\)s identically equal to one. With by-column precision parameters \(\tau_j\), this yields the model \[ Y_{ij} \sim N \left(\sum_k \ell_{ik} f_{jk}, 1 / \tau_{j} + 1 \right). \] This is equivalent to the previously described model \[ Y_{ij} \sim N \left(\sum_k \ell_{ik} f_{jk}, 1 / \tilde{\tau}_{j} \right) \] provided that we add the constraints \(\tilde{\tau}_j \le 1\) for all \(j\). This model should yield a lower ELBO than the model with arbitrary column-specific variances (again, we write “should” because it is not guaranteed that flash() will converge to a global optimum), but it is arguably the more correct model.
fit_with_meas_err <- flash(gtex, S = 1, greedy_Kmax = 5, var_type = 2, verbose = 0)
c(bycol_elbo = fit_var_bycol$elbo, meas_err_elbo = fit_with_meas_err$elbo)
#> bycol_elbo meas_err_elbo
#> -80490.27 -80491.84
We note here that it is also possible to pass a matrix of standard errors \(s_{ij}\) as argument to S. If these standard errors are to account for all residual variance (that is, if the matrix of “true” means can be assumed to be exactly low-rank rather than approximately so), then one can set var_type = NULL. If not, then updates to precision parameters \(\tau_{ij}\) must be accomplished via one call to function optimize() for each precision parameter, so variance estimation can be very slow. In such cases, we generally recommend leaving S unspecified and allowing all residual variance to be estimated.
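For example, to fix the residual variances at one (treating the \(z\)-score standard errors as accounting for all residual variance), one could fit the following (a sketch, not evaluated above):
fit_fixed_s <- flash(gtex, S = 1, greedy_Kmax = 5, var_type = NULL, verbose = 0)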
We next turn to parameter ebnm_fn. Note that the argument to ebnm_fn is a function; indeed, in flashier, prior families \(\mathcal{G}^{(\ell)}_k\) and \(\mathcal{G}^{(f)}_k\) are not specified directly, but rather implicitly, via the functions used to solve the empirical Bayes normal means (EBNM) subproblems, which have the form \[ \begin{gather} x_i \sim \mathcal{N} \left( \theta_i, s_i^2 \right) \\ \theta_i \sim g \in \mathcal{G}. \end{gather} \] (Here, the \(x_i\)s and \(s_i\)s are known observations and standard errors, and the \(\theta_i\)s are unknown “means.”) Effectively, ebnm_fn not only (implicitly) specifies \(\mathcal{G}\), but also (directly) determines the method for estimating \(g \in \mathcal{G}\).
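To make the subproblem concrete, here is a small sketch (ours, on simulated data) that calls one of these EBNM solvers, from package ebnm, directly:
# Sparse "means" observed with unit standard errors.
set.seed(1)
theta <- rnorm(200) * rbinom(200, 1, 0.2)
x <- theta + rnorm(200)
# Solve the EBNM problem with a point-normal prior family.
ebnm_res <- ebnm_point_normal(x = x, s = 1)
head(ebnm_res$posterior$mean)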
A number of useful EBNM solvers are provided by package ebnm (Willwerscheid and Stephens, 2021). The default argument is ebnm_fn = ebnm_point_normal, which takes the prior family \(\mathcal{G}\) (for both factors and loadings) to be the family of point-normal distributions:
\[ g \sim \pi_0 \delta_0 + (1 - \pi_0) N(0,
\sigma^2). \] In the scenario under consideration, this choice of
prior family makes sense, as it is reasonable to expect some degree of
sparsity in both factors and loadings: some effects might only be shared
across a few related tissues (e.g., brain tissues but not other
tissues), and patterns of sharing may not be ubiquitous across SNP-gene
pairs (e.g., many effects may be constant across brain and non-brain
tissues).
By varying ebnm_fn, we also vary our assumptions about the “true” underlying distributions of factors and loadings. For example, if we can reasonably expect each set of loadings \(\ell_{\cdot k}\) to be heavy-tailed, then we might prefer a more flexible family of distributions, such as the family of scale mixtures of normals (ebnm_fn = ebnm_normal_scale_mixture) or the family of symmetric distributions that are unimodal at zero (ebnm_fn = ebnm_unimodal_symmetric). Note that these families are nested, so that the unimodal family should yield a higher ELBO than scale mixtures of normals, which should in turn yield a higher ELBO than the family of point-normals:
fit_normalmix <- flash(
gtex,
greedy_Kmax = 5,
ebnm_fn = ebnm_normal_scale_mixture,
verbose = 0
)
fit_unimodal <- flash(
gtex,
greedy_Kmax = 5,
ebnm_fn = ebnm_unimodal_symmetric,
verbose = 0
)
c(pn_elbo = fit_default$elbo,
normalmix_elbo = fit_normalmix$elbo,
unimodal_elbo = fit_unimodal$elbo)
#> pn_elbo normalmix_elbo unimodal_elbo
#> -83131.69 -82981.84 -82978.90
It is also possible to introduce sign constraints via the choice of prior family; for example, to constrain factors to be nonnegative, we need only choose a prior family with nonnegative support, such as the family of point-exponential distributions (ebnm_fn = ebnm_point_exponential) or the family of all nonnegative distributions that are unimodal at zero (ebnm_fn = ebnm_unimodal_nonnegative). This constraint can model the (reasonable) assumption that when effects are shared across tissues, they are shared in the same direction (that is, expression either increases or decreases across a number of related tissues, but does not increase in some and simultaneously decrease in others). We can constrain factors while allowing loadings to remain unconstrained by passing a list as argument to ebnm_fn:
fit_seminonnegative <- flash(
gtex,
greedy_Kmax = 5,
var_type = 2,
ebnm_fn = c(ebnm_point_normal, ebnm_point_exponential),
verbose = 0
)
The result is a “semi-nonnegative” factorization, which in the scenario under consideration yields sparse, interpretable factors (especially in comparison to the factors produced using all default arguments to flash(); see above). Here, factor 2 represents sharing among brain, pituitary, and testis tissues; factor 3, among whole blood, spleen tissue, lung tissue, and lymphocytes; and factor 5, among heart and muscle tissues. Factor 4 represents effects uniquely expressed in testis tissue:
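# Posterior means of the semi-nonnegative factors (same plotting options as above):
plot(
  fit_seminonnegative,
  pm_which = "factors",
  pm_colors = gtex_colors,
  order_by_pve = FALSE,
  include_scree = FALSE
)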
The “backfitting” algorithm generally requires a large increase in fitting time but can substantially improve the overall model fit. As described above, the greedy algorithm leaves previously added factor/loadings pairs alone when optimizing each newly added factor/loadings pair. In contrast, the backfitting algorithm iterates over a number of existing factor/loadings pairs, updating each pair one at a time until all have converged (see Wang and Stephens, 2021). In this way, factors and loadings that are added early on can, so to speak, use information contained in subsequent factors and loadings to improve the overall model fit. Since, by default, a backfit takes the greedy fit as its starting point, it is guaranteed to produce an increase in ELBO:
fit_greedy <- flash(
gtex,
greedy_Kmax = 5,
var_type = 2,
ebnm_fn = c(ebnm_point_normal, ebnm_point_exponential),
backfit = FALSE,
verbose = 0
)
fit_backfit <- flash(
gtex,
greedy_Kmax = 5,
var_type = 2,
ebnm_fn = c(ebnm_point_normal, ebnm_point_exponential),
backfit = TRUE,
verbose = 0
)
c(greedy_elbo = fit_greedy$elbo, bf_elbo = fit_backfit$elbo)
#> greedy_elbo bf_elbo
#> -81194.23 -80893.79
Qualitatively, results can vary: for large, complex fits, the improvement can be considerable; for smaller fits, there might be few obvious differences. Interestingly, backfitting the semi-nonnegative fit from above produces a less visually appealing fit. We suspect that this occurs because 5 factor/loadings pairs is not enough to account for the structure in the data; if one increases greedy_Kmax, one obtains a more satisfying backfit.
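For example, one might try the following (a sketch, not evaluated here in order to keep the vignette fast):
fit_backfit_k10 <- flash(
  gtex,
  greedy_Kmax = 10,
  var_type = 2,
  ebnm_fn = c(ebnm_point_normal, ebnm_point_exponential),
  backfit = TRUE,
  verbose = 0
)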
p1 <- plot(
fit_greedy,
pm_which = "factors",
pm_colors = gtex_colors,
order_by_pve = FALSE,
include_scree = FALSE
) + ggtitle("Greedy fit")
p2 <- plot(
fit_backfit,
pm_which = "factors",
pm_colors = gtex_colors,
order_by_pve = FALSE,
include_scree = FALSE
) + ggtitle("Backfit")
plot_grid(p1, p2, nrow = 2)
We again obtain a more direct comparison of fits using function ldf(). Here it is apparent that backfitting makes factor 2 less sparse by “borrowing” from factor 1. Other factors remain very similar:
comparison_df <- data.frame(
fit_greedy = as.vector(ldf(fit_greedy, type = "m")$F),
fit_backfit = as.vector(ldf(fit_backfit, type = "m")$F),
k = factor(rep(1:5, each = ncol(gtex)))
)
ggplot(comparison_df, aes(x = fit_greedy, y = fit_backfit, col = k)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1, linetype = "dashed") +
scale_color_brewer(palette = "Set1") +
labs(x = "Greedy fit", y = "Backfit") +
theme_minimal()
One of the list elements in the object returned by flash() is a function that can sample from posterior distributions on loadings and factors. Take the backfit above as an example. To better understand which tissues are bound up with whole blood effects, we might like confidence intervals for the third factor. We construct 95% confidence intervals by drawing 200 samples from the posterior and then finding the 2.5% and 97.5% quantiles as follows:
# Set seed for reproducibility.
set.seed(1)
# Use returned sampler to sample from posterior.
samp <- fit_backfit$sampler(nsamp = 200)
# Only keep factor 3.
factor3_samp <- lapply(samp, function(x) x[[2]][, 3])
# Normalize the loadings.
factor3_samp <- sapply(factor3_samp, function(x) x / max(abs(x)))
# Get 95% confidence intervals.
factor3_ci <- apply(factor3_samp, 1, quantile, c(0.025, 0.975))
Since the plot() method returns a ggplot2 object, it can be customized using ggplot2 syntax, making the addition of error bars a simple task:
plot(fit_backfit, kset = 3, pm_colors = gtex_colors, include_scree = FALSE) +
geom_errorbar(aes(ymin = factor3_ci[1, ], ymax = factor3_ci[2, ]))
Confidence intervals for spleen tissue, lung tissue, and lymphocytes do not contain zero, so we can be reasonably confident that this pattern of sharing is “real” and not just an artefact of the data. We note, however, that empirical Bayes methods are known to underestimate uncertainty in the data (since they do not account for uncertainty in estimates \(g \in \mathcal{G}\)), so we offer this last conclusion with a large grain of salt.
The following R version and packages were used to generate this vignette:
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.3
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.3 cowplot_1.1.1 ggplot2_3.4.3 flashier_1.0.7 magrittr_2.0.3
#> [6] ebnm_1.1-2
#>
#> loaded via a namespace (and not attached):
#> [1] softImpute_1.4-1 tidyselect_1.2.0 xfun_0.40 bslib_0.5.1
#> [5] purrr_1.0.2 ashr_2.2-63 splines_4.2.1 lattice_0.21-9
#> [9] colorspace_2.1-0 vctrs_0.6.3 generics_0.1.3 htmltools_0.5.6.1
#> [13] yaml_2.3.7 utf8_1.2.3 rlang_1.1.1 mixsqp_0.3-48
#> [17] jquerylib_0.1.4 pillar_1.9.0 withr_2.5.1 glue_1.6.2
#> [21] RColorBrewer_1.1-3 trust_0.1-8 lifecycle_1.0.3 stringr_1.5.0
#> [25] munsell_0.5.0 gtable_0.3.4 evaluate_0.22 labeling_0.4.3
#> [29] knitr_1.44 fastmap_1.1.1 invgamma_1.1 irlba_2.3.5.1
#> [33] parallel_4.2.1 fansi_1.0.5 Rcpp_1.0.11 scales_1.2.1
#> [37] cachem_1.0.8 horseshoe_0.2.0 jsonlite_1.8.7 truncnorm_1.0-9
#> [41] farver_2.1.1 deconvolveR_1.2-1 digest_0.6.33 stringi_1.7.12
#> [45] grid_4.2.1 cli_3.6.1 tools_4.2.1 sass_0.4.7
#> [49] tibble_3.2.1 tidyr_1.3.0 pkgconfig_2.0.3 Matrix_1.6-1.1
#> [53] SQUAREM_2021.1 rmarkdown_2.25 rstudioapi_0.15.0 R6_2.5.1
#> [57] compiler_4.2.1