This vignette demonstrates that covariance and correlation data can be compressed and visualized using multidimensional scaling using the same principles and machinery as for distance and similarity data. I argue that these geometric representations have interpretative advantages over other methods and should be more widely used.
Because ordination is grounded in geometric data analysis, it is easiest and most straightforward on naturally geometric data. For example, the glucose and insulin data collected by Reaven and Miller (1970) consists of five continuous-valued measurements for each of 145 persons. Because the variables (measurements) have different scales, in order to represent the cases (persons) geometrically we must separately center and scale them. The most basic ordination method, principal components analysis (PCA), computes a singular value decomposition \(X = U D V^\top\) of the centered and scaled data matrix \(X \in \mathbb{R}^{n\times p}\):
<- scale(heplots::Diabetes[, -6L], center = TRUE, scale = TRUE)
x <- svd(x)
s <- length(s$d) r
The matrix factors \(U \in \mathbb{R}^{n\times r}\) and \(V \in \mathbb{R}^{p\times r}\) arise from eigendecompositions of \(X X^\top\) and of \(X^\top X\), respectively, which have the same set of eigenvalues \(\lambda_1,\ldots,\lambda_r\). The square roots of these eigenvalues make up the diagonal matrix \(D \in \mathbb{R}^{r\times r}\). The conventional PCA biplot uses principal coordinates for the cases, given by the rows of \(U D\), and standardized coordinates for the variables, given by the rows of \(V\).
The columns of \(U = \left[\,u_1\,\cdots\,u_r\,\right]\) and \(V = \left[\,v_1\,\cdots\,v_r\,\right]\) comprise the eigenvectors of \(X X^\top\) and of \(X^\top X\). They are orthonormal, which means that \(U^\top U = I_r = V^\top V\) and that the total variance (called inertia) in each matrix is \(\sum_{j=1}^{r}{ {v_j}^2 } = r = \sum_{j=1}^{r}{ {v_j}^2 }\). Meanwhile, the diagonal \(D\) contains all of the inertia of the centered and scaled data matrix \(X\):
# inertia of the (scaled) data
sum(x^2)
## [1] 720
# inertia of the case and variable factors
sum(s$u^2)
## [1] 5
sum(s$v^2)
## [1] 5
# inertia of the diagonal factor
sum(s$d^2)
## [1] 720
For the (2-dimensional, i.e. planar) biplot, this inertia is conferred on the cases (converting standardized coordinates to principal coordinates) so that the planar distances between the points in the plot representing the cases are approximately equal to their Euclidean distances in \(X\). Because the biplot only accommodates two principal components (it is a projection onto a subset of orthogonal coordinates), the planar distances underestimate the true distances. Here is a scatterplot of the true distances and their biplot approximations:
# distances between cases
<- dist(x)
x.dist # distances between cases (principal coordinates)
<- dist(s$u[, 1:2] %*% diag(s$d[1:2]))
s.dist # scatterplot
plot(
x = as.vector(x.dist),
y = as.vector(s.dist),
xlim = c(0, 8), ylim = c(0, 8),
asp = 1, pch = 19, cex = .5,
xlab = "Case distances in centered and scaled data",
ylab = "Case point distances in planar biplot"
)lines(x = c(0, 8), y = c(0, 8))
In contrast, the variables are better understood through their correlations, which are approximately preserved by their standardized coordinates. Writing \(X = [\,y_1\,\cdots\,y_m\,]\) as an array of column variables, the covariance between \(y_i\) and \(y_j\) is proportional to their inner product \[\textstyle \operatorname{cov}(y_i,y_j) = \frac{1}{n} y_i \cdot y_j = \frac{1}{n} \lVert y_i\rVert\lVert y_j\rVert\cos\theta_{ij}\text,\] so that the cosine of the angle \(\theta_{ij}\) between them equals their correlation: \[\cos\theta_{ij} = \frac{\operatorname{cov}(y_i,y_j)}{\sqrt{\operatorname{cov}(y_i,y_i)\operatorname{cov}(y_j,y_j)}/n} = \frac{\operatorname{cov}(y_i,y_j)}{\sigma_i\sigma_j} = r_{ij}\]
Here the cosines between the vectors in the biplot are plotted against the variable correlations \(r_{ij}\) in the centered and scaled data. While the vectors are shorter in the biplot for the same reason that distances are shorter, their consines may be larger or smaller depending on their respective coordinates in the remaining coordinates.
# correlations between variables
<- cor(x)
x.cor # magnitudes of variable vectors
<- apply(s$v[, 1:2] %*% diag(s$d[1:2]), 1, norm, "2")
s.len # cosines between variables (principal coordinates)
<- (s$v[, 1:2] / s.len) %*% diag(s$d[1:2]^2) %*% t(s$v[, 1:2] / s.len)
s.cor # scatterplot
plot(
x = as.vector(x.cor[lower.tri(x.cor)]),
y = as.vector(s.cor[lower.tri(s.cor)]),
xlim = c(-1, 1), ylim = c(-1, 1),
asp = 1, pch = 19, cex = .5,
xlab = "Variable correlations in centered and scaled data",
ylab = "Variable vector cosines in planar biplot"
)lines(x = c(-1, 1), y = c(-1, 1))
The faithful approximation of inter-case distances by principal coordinates is the idea behind classical multidimensional scaling (CMDS), which can be applied to a data set of distances \(\delta_{ij},\ 1\leq i\leq j\leq n\) in the absence of coordinates (variables). CMDS produces a set of artificial coordinates for the cases that yield nested best approximations of the inter-case distances in terms of the sum of squared errors. The technique uses the eigendecomposition of a doubly-centered matrix of squared distances, which produces a matrix \(U \Lambda^{1/2}\) whose first \(r\) coordinates—for any \(r\leq n\)—minimize the variance of \((U \Lambda^{1/2}) (U \Lambda^{1/2})^\top - \Delta = U \Lambda U^\top - \Delta\), where \(\Delta = (\delta_{ij}) \in \mathbb{R}^{n\times n}\), to obtain the approximation. The columns of \(U \Lambda^{1/2}\) are then called principal coordinates, and the first \(r\) most faithfully recover the \(\delta_{ij}\) using points in \(\mathbb{R}^r\) of any possible \(r\). For example, because the road distances between several U.S. cities arise from a roughly 2-dimensional process, the point distances in a CMDS are very close approximations:
<- as.matrix(UScitiesD)
d <- diag(1, nrow(d)) - matrix(1/nrow(d), nrow(d), nrow(d))
cent <- -.5 * cent %*% (d^2) %*% cent
d.cent <- svd(d.cent)
d.cmds <- d.cmds$u[, 1:2] %*% diag(sqrt(d.cmds$d[1:2]))
d.coord # scatterplot
plot(
x = as.vector(UScitiesD),
y = as.vector(dist(d.coord)),
xlim = c(0, max(UScitiesD)), ylim = c(0, max(UScitiesD)),
asp = 1, pch = 19, cex = .5,
xlab = "City road distances",
ylab = "Point distances in planar CMDS"
)lines(x = c(0, max(UScitiesD)), y = c(0, max(UScitiesD)))
In practice, the goal of CMDS is usually to produce a scatterplot in which the distances \(\sqrt{(x_j-x_i)^2+(y_j-y_i)^2}\) between the points that represent the \(n\) cases approximate their original distances \(\delta_{ij}\). In this case, the artificial coordinates approximately recover the geographic arrangement. By chance, this CMDS rotated the conventional cardinal directions by about \(\pi\) radians (though at least the map is recognizable from above rather than below the surface of the Earth):
plot(
pch = NA, asp = 1,
d.coord, xlab = "First principal coordinate", ylab = "Second principal coordinate"
)text(d.coord, labels = rownames(d), cex = .9)
The faithful approximation of inter-variable correlations by the angles between their coordinate vectors provides a dual CMDS procedure. Suppose we have data that consist not of distances between cases but of covariances \(\operatorname{cov}(y_i,y_j),\ 1\leq i\leq j\leq p\) between variables. Again the data are coordinate-free, so PCA is inapplicable. But were the data derived from a (not necessarily centered or scaled) case–variable matrix \(X\), then the covariance matrix \(C=(\operatorname{cov}(y_i,y_j))\) would have been obtained as \(C=\frac{1}{n}X^\top X\). This, up to scalar, is the matrix whose eigenvectors would be given by \(V\) in the SVD \(X = U D V^\top\). Therefore, we can obtain artificial coordinates for these variables that approximate what we know of their geometry—thinking of the variables as unknown vectors whose magnitudes and angles are encoded in \(C\)—via an eigendecomposition \(C = V \Lambda V^\top\): Take \(Y = V \Lambda^{1/2} \in \mathbb{R}^{p\times r}\), so that \(Y^\top Y \approx C\).
While covariances of the unscaled diabetes data are not meaningful, they can be used to validate the technique. Again because the eigendecomposition is intrinsically ordered by variance, the first \(r\) eigenvectors provide the most faithful \(r\)-dimensional approximation; i’ll take \(r=2\) in anticipation of a biplot:
# covariances and standard deviations
<- cov(x)
c <- diag(sqrt(diag(c)))
s # eigendecomposition of covariance matrix
<- eigen(c)
c.eigen # artificial coordinates
<- c.eigen$vectors[, 1:2] %*% diag(sqrt(c.eigen$values[1:2]))
c.coord # scatterplot
<- c.coord %*% t(c.coord)
c.inner plot(
x = as.vector(c[lower.tri(c)]),
y = as.vector(c.inner[lower.tri(c.inner)]),
xlim = range(c[lower.tri(c)]), ylim = range(c[lower.tri(c)]),
asp = 1, pch = 19, cex = .5,
xlab = "Measurement covariances in unscaled data",
ylab = "Vector inner products in planar CMDS"
)lines(x = range(c[lower.tri(c)]), y = range(c[lower.tri(c)]))
Thus, whereas CMDS of cases approximates distances, CMDS of variables approximates covariances. This equips us with an analogous visualization technique for correlation data: Represent each variable \(y_i\) as a unit vector \(\hat{y}_i\) in such a way that their angle cosines \(\cos\theta_{ij}\) equal their correlations \(r_{ij}\), then project these vectors onto the plane in which their variance is maximized. This is exactly how variables are commonly represented in row-principal PCA biplots, by the equivalence of \(V\) in the SVD \(X = U D V^\top\) with \(E\) in the eigendecomposition \(X^\top X = E \Lambda E^\top\). Here is the result for the five physiological measurements of the adult cohort in the diabetes study:
<- cor(heplots::Diabetes[, -6L])
c <- eigen(c)
c.eigen <- c.eigen$vectors[, 1:2] %*% diag(sqrt(c.eigen$values[1:2]))
c.coord plot(
pch = NA, asp = 1,
c.coord, xlab = "First principal coordinate", ylab = "Second principal coordinate"
)arrows(0, 0, c.coord[, 1L], c.coord[, 2L])
text(c.coord, labels = rownames(c), cex = .9)
We can immediately see that the two tests of glucose level (fasting and not) are highly correlated, that they vary independently of relative weight, and that the two measures of insulin response (steady state plasma glucose and the insulin test) are similarly associated with relative weight but differentially associated with glucose level.
In this setting, we could have obtained this visualization directly from the case–variable data \(X\), as in PCA. The technique becomes uniquely useful when we have correlations without underlying coordinates.
library(ordr)
library(dplyr)
A natural use case for CMDS of variables is the analysis of multiple rankings of the same set of objects in terms of their concordance. Rankings’ concordance is often measured using rank correlations such as Kendall’s \(\tau\), which may be general correlation coefficients in the sense proposed by Kendall but are not associated with an underlying geometry. In this setting, there is no original \(X\) nor Euclidean coordinates. Nevertheless, we can use CMDS to represent these rankings as unit vectors in Euclidean space whose pairwise cosines approximate their rank correlations!
A real-world example is provided by the Quacquarelli
Symonds World University Rankings, which include rankings of
hundreds of world universities along six dimensions: academic
reputation, employer reputation, faculty–student ratio, citations per
faculty, international faculty ratio, and international student ratio.
QS weight these rankings differently in their overall assessment, but
our analysis will compare the rankings to each other across
universities, ignoring these weights. The subset qswur_usa
installed with ordr include U.S.-based universities
ranked in the years 2017–2020, their
classifications by QS, and their six integer-valued rankings.
(Scores used to generate the rankings are included with the QS data
files but omitted from qswur_usa
.)
data(qswur_usa)
print(qswur_usa)
## # A tibble: 612 × 13
## year instit…¹ size focus res age status rk_ac…² rk_em…³ rk_ra…⁴ rk_ci…⁵
## <int> <chr> <fct> <fct> <fct> <int> <chr> <int> <int> <int> <int>
## 1 2017 MASSACH… M CO VH 5 B 6 4 12 10
## 2 2017 STANFOR… L FC VH 5 A 5 5 20 14
## 3 2017 HARVARD… L FC VH 5 B 1 1 42 8
## 4 2017 CALIFOR… S CO VH 5 B 23 90 3 4
## 5 2017 UNIVERS… L FC VH 5 B 13 47 54 49
## 6 2017 PRINCET… M CO VH 5 B 10 32 115 3
## 7 2017 YALE UN… L FC VH 5 B 9 12 4 138
## 8 2017 CORNELL… L FC VH 5 B 21 49 163 24
## 9 2017 JOHNS H… L FC VH 5 B 41 139 11 70
## 10 2017 UNIVERS… L FC VH 5 B 33 35 16 103
## # … with 602 more rows, 2 more variables: rk_intl_faculty <int>,
## # rk_intl_students <int>, and abbreviated variable names ¹institution,
## # ²rk_academic, ³rk_employer, ⁴rk_ratio, ⁵rk_citations
For this example, i’ll focus only on rankings for the year 2020 for which all rankings were available. This leaves me with only 38 universities, so my conclusions must be taken with caution! Since the rankings were subsetted from the full international data set, they are not contiguous (some integers between rankings never appear). To resolve this, i’ll also recalibrate the rankings by matching each vector of ranks to the vector of its sorted unique values:
%>%
qswur_usa filter(year == 2020L) %>%
select(institution, starts_with("rk_")) %>%
mutate_at(
vars(starts_with("rk_")),
~ match(., sort(unique(as.numeric(.))))
%>%
) filter_at(vars(starts_with("rk_")), ~ ! is.na(.)) ->
qswur_usa2020print(qswur_usa2020)
## # A tibble: 38 × 7
## institution rk_ac…¹ rk_em…² rk_ra…³ rk_ci…⁴ rk_in…⁵ rk_in…⁶
## <chr> <int> <int> <int> <int> <int> <int>
## 1 MASSACHUSETTS INSTITUTE OF T… 3 2 8 4 1 6
## 2 STANFORD UNIVERSITY 2 3 5 7 2 21
## 3 HARVARD UNIVERSITY 1 1 17 5 8 29
## 4 CALIFORNIA INSTITUTE OF TECH… 12 17 1 2 3 12
## 5 UNIVERSITY OF CHICAGO 9 11 19 22 20 14
## 6 PRINCETON UNIVERSITY 7 7 35 1 26 22
## 7 CORNELL UNIVERSITY 11 13 40 9 5 19
## 8 UNIVERSITY OF PENNSYLVANIA 14 12 11 31 12 28
## 9 YALE UNIVERSITY 6 4 2 46 17 37
## 10 COLUMBIA UNIVERSITY 8 8 10 47 51 3
## # … with 28 more rows, and abbreviated variable names ¹rk_academic,
## # ²rk_employer, ³rk_ratio, ⁴rk_citations, ⁵rk_intl_faculty, ⁶rk_intl_students
This subset of universities is now contiguously ranked along the six dimensions described above. The Kendall correlation \(\tau_{ij}\) between two rankings measures their concordance. To calculate it, every pair of universities contributes either \(+1\) or \(-1\) according as the rankings \(i\) and \(j\) place that pair in the same order, and the sum is scaled down by the number of pairs \({n\choose 2}\) so that the result lies between \(-1\) and \(1\). We interpret \(\tau_{ij}=1\) as perfect concordance (the rankings are equivalent), \(\tau_{ij}=-1\) as perfect discordance (the rankings are reversed), and \(\tau_{ij}=0\) as independence (the rankings are independent).
rk_academic
),rk_employer
)rk_ratio
)rk_citations
)rk_intl_faculty
)rk_intl_students
)These not variations on a single theme (measures of the same construct), like different measures of guideline adherence or positive affect. Though they do all seem potentially sensitive to a university’s resources, including finance and prestige. I intuit that the two reputation indexes should be positively correlated, and that the two international personnel ratios should be as well. I also wonder if the faculty–student ratio might be anti-correlated with the number of citations per faculty, separating more research-focused institutions from more teaching-focused ones.
Heatmaps are commonly, perhaps most commonly, used to visualize
correlation matrices, and they make a useful contrast to the CMDS
correlation plot. Below i use heatmap()
to produce a basic
correlation heatmap, but the blog
post version of this vignette uses the corrplot
package to more elegant effect. What can be learned from a glance at
this plot? While the rankings by academic and employer reputations are
highly concordant, those by international faculty and student ratios are
less so. The faculty–student ratio and faculty citation rankings have
the weakest concordance of any pair, but they are still positively
correlated.
<- cor(select(qswur_usa2020, starts_with("rk_")), method = "kendall")
corr heatmap(corr, scale = "none")
This visualization is useful, but it’s very busy: To compare any pair of rankings, i have to find the cell in the grid corresponding to that pair and refer back to the color scale to assess its meaning. I can’t rely on the nearby cells for context, because they may be stronger or weaker than average and skew my interpretation. For example, the visibly weak associations between the faculty–student ratio and other rankings (the third row/column) happen to be arranged so that the slightly stronger among them, with the two reputational variables, are sandwiched between the even stronger associations between the two reputational rankings and between them and the faculty citations ranking. Meanwhile, this ranking’s weaker associations are sandwiched between more typical, but still comparatively stronger, associations. A different ordering of the variables might “obscure” these pattern and “reveal” others, which forces the analyst to choose between these emphases rather than allowing the viewer to assess them on an even footing.
The plot is also strictly pairwise: Every correlation between two rankings occupies its own cell—two, in fact, making almost half of the plot duplicative. This means that a subset analysis of, say, three rankings requires focusing on three cells at the corners of a right triangle while ignoring all the surrounding cells. This is not an easy visual task. It would be straightforward to create a new plot for any subset, but then the larger context of the remaining rankings would be lost.
Finally, the color and size scales themselves are less than desirable in a plot of data that all lie on a single scale (\(-1 \leq r \leq 1\)). Contrast this to a histogram, which also visualizes a single data type (count) but uses only one axis to represent count (the other to represent value) and does not require color at all. In general, plots of simple data should use very few scales (aesthetic mappings)—not only to make plots more readable but also to allow for additional scales that highlight additional data properties.
CMDS of variables offers a natural alternative visualization, which Gower, Gardner–Lubbe, and le Roux (2011) call a “correlation monoplot” in that it consists of but one of the two layers of a PCA or other biplot. As with CMDS of cases, the point isn’t to overlay the case scores and variable loadings from a singular value decomposition, but to use the scores or loadings alone to endow the cases or variables with a Euclidean geometry they didn’t originally have. To that end, i’ll plot the variables as vectors with tails at the origin and heads at their principal coordinates \(Y = V \Lambda^{1/2}\), with a unit circle included for reference:
eigen_ord(corr) %>%
as_tbl_ord() %>%
augment_ord() %>%
mutate_rows(metric = stringr::str_remove(name, "rk_")) %>%
confer_inertia(1) ->
c_eigen%>%
c_eigen ggbiplot() +
theme_minimal() +
geom_unit_circle() +
geom_rows_vector() +
geom_rows_text_radiate(aes(label = metric)) +
scale_x_reverse(expand = expansion(add = .4)) +
scale_y_continuous(expand = expansion(add = .3)) +
ggtitle("Kendall correlations between university rankings",
"CMDS correlation monoplot")
With respect to the pairwise correlations, the biplot is significantly less precise: Though the vectors all have unit length in \(\mathbb{R}^r\) (\(r\leq p=6\)), their projections onto the first two principal coordinates are much shorter, indicating that much of the geometric configuration requires additional dimensions to represent. Indeed, these coordinates capture only \(48.2\%+14.3\%=62.5\%\) of the inertia in the full, six-dimensional eigendecomposition. This means that the angles between the vectors must be interpreted with caution: For example, it looks like the academic and employer reputation rankings are extremely correlated, but the apparent alignment of the vectors could be an artifact of the projection, when in fact they “rise” and “fall” in opposite directions along the remaining principal coordinates. The correlation heatmap, by comparison, leaves no such ambiguity.
However, the biplot far surpasses the heatmap at parsimony: Each variable is represented by a single vector, and the angle cosines between the variable vectors roughly approximate their correlations. For instance, the rankings based on international student and faculty ratios have correlation around \(\cos(\frac{\pi}{4})=\frac{1}{\sqrt{2}}\), corresponding to either explaining half the “variance” in the other—not technically meaningful in the ranking context but a useful conceptual anchor. Meanwhile, the faculty–student ratio ranking is nearly independent of the faculty citation ranking, contrary to my intuition that these rankings would reflect a reverse association between research- and teaching-oriented institutions. The convenience of recognizing correlations as cosines may be worth the significant risk of error, especially since that error (the residual \(37.5\%\) of inertia) can be exactly quantified.
Moreover, the principal coordinates of the variable vectors indicate their loadings onto the first and second principal moments of inertia—the two dimensions that capture the most variation in the data. For example, the first principal coordinate is most aligned with the two reputational rankings, suggesting that a general prestige ranking is the strongest overall component of the several specific rankings. In contrast, the faculty–student ratio and faculty citation rankings load most strongly onto the second principal coordinate, suggesting that the divide between research- and teaching-focused institutions may yet be important to understanding how universities compare along these different metrics. These observations, provisional though they are, would be difficult to discern from the heatmap. More importantly, unlike the secondary patterns visible in the heatmap, these are not artifacts of the layout but arise directly from the (correlational) data.
This last point means that observations made from a correlation monoplot can be validated from the CMDS coordinates. In particular, we can examine the variables’ loadings onto the third principal coordinate, and we can check whether the reputational rankings are aligned or misaligned along it.
%>%
c_eigen fortify(.matrix = "rows") %>%
select(-name, -.matrix)
## # A tibble: 6 × 8
## EV1 EV2 EV3 EV4 EV5 EV6 metric .element
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 -0.834 -0.0907 0.412 -0.0430 -0.0206 0.351 academic active
## 2 -0.795 -0.0964 0.477 0.0416 -0.181 -0.311 employer active
## 3 -0.517 0.771 -0.0480 -0.331 0.158 -0.0372 ratio active
## 4 -0.731 -0.352 -0.239 0.0278 0.528 -0.0685 citations active
## 5 -0.631 -0.233 -0.521 -0.392 -0.352 0.00783 intl_faculty active
## 6 -0.603 0.262 -0.324 0.665 -0.140 0.0312 intl_students active
Based on the third principal coordinates, the reputational rankings are aligned, as we knew already from the correlation matrix and heatmap. What’s a bit more interesting is that this component seems to separate these two rankings from those having to do with faculty citation rates and the international compositions of the faculty and student body. Based on the decomposition of inertia, this third principal coordinate is nearly as important as the second! It therefore makes sense to plot the two together, in effect examining the residuals after regressing out the first principal coordinate:
%>%
c_eigen ggbiplot(aes(x = 2, y = 3)) +
theme_minimal() +
geom_unit_circle() +
geom_rows_vector() +
geom_rows_text_radiate(aes(label = metric)) +
scale_x_continuous(expand = expansion(add = .5)) +
scale_y_continuous(expand = expansion(add = .5)) +
ggtitle("Kendall correlations between university rankings",
"CMDS correlation monoplot, second & third principal coordinates")
The primary antitheses of the reputational rankings, after removing the first principal coordinate, are the two rankings based on international composition. The third principal coordinate aligns closely with this axis. This axis is also largely independent of the axis that distinguishes research- from teaching-oriented institutions, which is closely aligned with the second principal coordinate. From my own limited experience, i’d hazard a guess that this reflects two tiers of international representation among students and faculty, one expressed by the most prestigious institutions that recruit highly qualified applicants from all over the world, and the other expressed by institutions that are not especially prestigious but are located in communities or regions with high percentages of international residents.
In sum, we have interpreted the three strongest axes along which the QS rankings vary: (1) overall quality, prestige, and reputation; (2) research versus teaching orientation; and (3) international personnel ratios, which come at some cost to reputation. This is of course a preliminary and amateur analysis! But a visualization scheme that encourages hypothesis generation is worth having on hand.