Stephen Huan

Sparse Cholesky factorization
by greedy conditional selection

This post summarizes joint work with Joseph Guinness, Matthias Katzfuss, Houman Owhadi, and Florian Schäfer. For additional details, see the paper or source code.

Thanks Seulgi for drawing the preview image! The full resolution image may be found here.

The problem

Our motivating example is a regression problem that we model with Gaussian processes.

In our setting, we have $N$ points whose corresponding covariance matrix $\Theta$ is formed by

$$\begin{aligned}
\Theta_{i, j} \coloneqq K(\bm{x}_i, \bm{x}_j),
\end{aligned}$$

or pairwise evaluation of a kernel function $K(\cdot, \cdot)$. Commonly used kernels include the Matérn family of covariance functions, which include the exponential kernel and squared exponential kernel (also known as the Gaussian or RBF kernel) as special cases. The kernel function along with a mean function $\mu(\cdot)$ define a Gaussian process $f \sim \mathcal{GP}(\mu, K)$ such that for any set of points $X$, $f(X)$ is jointly distributed according to a multivariate Gaussian.
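As a concrete (hypothetical) example of this setup, the sketch below builds $\Theta$ by pairwise evaluation of the exponential kernel, i.e. Matérn with $\nu = 1/2$; the point set, length scale `ell`, and helper name `exponential_kernel` are illustrative choices, not from the paper.

```python
# A minimal sketch: form Theta[i, j] = K(x_i, x_j) for the exponential kernel
# K(x, y) = exp(-||x - y|| / ell), the Matérn kernel with smoothness nu = 1/2.
import numpy as np
from scipy.spatial.distance import cdist

def exponential_kernel(X, Y, ell=1.0):
    """Pairwise kernel matrix between the rows of X and the rows of Y."""
    return np.exp(-cdist(X, Y) / ell)

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))      # N = 100 points in the unit square
Theta = exponential_kernel(X, X)    # dense N x N covariance matrix
```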

If we partition our points into training points, for which we know the value of $f$, and testing points, at which we wish to predict the value of $f$, then the Gaussian process modeling of $f$ enables us to compute the posterior distribution of the testing points in closed form as

$$\begin{aligned}
\mathbb{E}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}] &= \bm{\mu}_\text{Pr} + \Theta_{\text{Pr}, \text{Tr}} \Theta_{\text{Tr}, \text{Tr}}^{-1} (\bm{y}_\text{Tr} - \bm{\mu}_\text{Tr}) \\
\operatorname{Cov}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}] &= \Theta_{\text{Pr}, \text{Pr}} - \Theta_{\text{Pr}, \text{Tr}} \Theta_{\text{Tr}, \text{Tr}}^{-1} \Theta_{\text{Tr}, \text{Pr}}.
\end{aligned} \tag{1}$$

Note that our predictions naturally come equipped with uncertainty estimates from the posterior covariance. Due to the ability to measure uncertainty and to encode prior information in the kernel function, Gaussian processes and kernel methods enjoy widespread usage in geostatistics, machine learning, and optimal experimental design.

However, computing (1) requires inverting the covariance matrix of the training points. Additional statistical inference often requires computing quantities like the likelihood or log determinant. For a dense covariance matrix $\Theta$, directly computing these quantities has a computational cost of $\mathcal{O}(N^3)$ and a memory cost of $\mathcal{O}(N^2)$, which is prohibitively expensive for large $N$. Can we efficiently approximate (1) while maintaining predictive accuracy?
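For reference, here is a minimal sketch of the direct computation we are trying to avoid, written against the hypothetical `Theta` built above with index arrays `train` and `pred` (both illustrative names); the dense solve against $\Theta_{\text{Tr}, \text{Tr}}$ is the $\mathcal{O}(N^3)$ bottleneck.

```python
# Direct evaluation of the closed-form posterior (1) with a zero prior mean.
import numpy as np

def gp_posterior(Theta, train, pred, y_train):
    """Posterior mean and covariance of the prediction points."""
    K_tt = Theta[np.ix_(train, train)]          # training covariance
    K_pt = Theta[np.ix_(pred, train)]           # prediction-training covariance
    K_pp = Theta[np.ix_(pred, pred)]            # prediction covariance
    mean = K_pt @ np.linalg.solve(K_tt, y_train)
    cov = K_pp - K_pt @ np.linalg.solve(K_tt, K_pt.T)
    return mean, cov
```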

The screening effect

Spatial statisticians have long observed the screening effect: conditional on nearby points, more distant points have very little influence on the prediction.

[Figure: unconditional covariance (left) and conditional covariance (right).]
Screening effect in a Matérn kernel with $\ell = 1$ and $\nu = 1/2$. Unconditional correlation with $(0, 0)$ is shown before and after conditioning on four nearby points.

In other words, after conditioning on a few points, the conditional correlation localizes around the target point, implying approximate conditional sparsity.

The Vecchia approximation

We can exploit this conditional sparsity by factoring the joint distribution with the chain rule.

$$\begin{aligned}
\pi(\bm{y}) &= \pi(y_1)\, \pi(y_2 \mid y_1)\, \pi(y_3 \mid y_1, y_2) \dotsm \pi(y_N \mid y_1, y_2, \dotsc, y_{N - 1}).
\end{aligned}$$

The Vecchia approximation proposes to approximate the joint distribution by

$$\begin{aligned}
\pi(\bm{y}) &\approx \pi(y_{i_1})\, \pi(y_{i_2} \mid y_{s_2})\, \pi(y_{i_3} \mid y_{s_3}) \dotsm \pi(y_{i_N} \mid y_{s_N})
\end{aligned} \tag{2}$$

for some ordering $i_1, \dotsc, i_N$ and sparsity sets $s_k \subseteq \{ i_1, \dotsc, i_{k - 1} \}$ with $\lvert s_k \rvert \ll k$. The quality of the approximation depends significantly on the chosen ordering and sparsity pattern. In order to exploit the screening effect as much as possible, it is natural to take the nearest neighbors as the sparsity pattern. For the ordering, we would like to select points that have good coverage of the entire space. Towards that end, we form the reverse-maximin ordering by first selecting the last index $i_N$ arbitrarily and then choosing for $k = N - 1, \dotsc, 1$ the index

$$\begin{aligned}
i_k = \operatorname*{argmax}_{i \in -\mathcal{I}_{k + 1}} \; \min_{j \in \mathcal{I}_{k + 1}} \lVert \bm{x}_i - \bm{x}_j \rVert
\end{aligned}$$

where $-\mathcal{I} \coloneqq \{ 1, \dotsc, N \} \setminus \mathcal{I}$ and $\mathcal{I}_n \coloneqq \{ i_n, i_{n + 1}, \dotsc, i_N \}$; that is, at every iteration we select the point furthest from the previously selected points. This ordering and sparsity pattern can be computed efficiently with geometric data structures like the $k$-d tree or ball tree.
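The sketch below is a deliberately simple $\mathcal{O}(N^2)$ version of the ordering (the function name and brute-force distance updates are mine; the faster tree-based construction referenced above is not attempted here). The nearest-neighbor sparsity sets can then be found with a $k$-d tree restricted to the allowed candidates.

```python
# A minimal O(N^2) sketch of the reverse-maximin ordering.
import numpy as np

def reverse_maximin(X):
    """order[k] = i_k: the last index is arbitrary, and each earlier index
    maximizes the minimum distance to all points already selected."""
    N = len(X)
    order = np.zeros(N, dtype=int)
    dist = np.full(N, np.inf)          # distance to the selected set
    order[-1] = 0                      # select the last index arbitrarily
    for k in range(N - 2, -1, -1):
        dist = np.minimum(dist, np.linalg.norm(X - X[order[k + 1]], axis=1))
        dist[order[k + 1:]] = -np.inf  # never re-select a point
        order[k] = np.argmax(dist)     # furthest from previously selected points
    return order
```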

Sparse Cholesky factors

From another perspective, the Vecchia approximation implicitly forms an approximate Cholesky factor of the covariance matrix (a lower triangular matrix $L$ such that $L L^{\top} \approx \Theta$).

We can hope for (approximate) sparsity in the Cholesky factor as Cholesky factorization can be viewed as recursively computing the block decomposition

$$\begin{aligned}
\operatorname{chol}(\Theta) &= \begin{pmatrix} \text{Id} & 0 \\ \Theta_{2, 1} \Theta_{1, 1}^{-1} & \text{Id} \end{pmatrix} \begin{pmatrix} \operatorname{chol}(\Theta_{1, 1}) & 0 \\ 0 & \operatorname{chol}(\Theta_{2, 2} - \Theta_{2, 1} \Theta_{1, 1}^{-1} \Theta_{1, 2}) \end{pmatrix}
\end{aligned}$$

where $\Theta_{2, 2} - \Theta_{2, 1} \Theta_{1, 1}^{-1} \Theta_{1, 2}$ is the conditional covariance in (1). Thus Cholesky factorization is a process of iteratively conditioning the underlying Gaussian process, which we hope will lead to approximate sparsity from the screening effect.

KL minimization

We can use the same ordering and sparsity pattern as the Vecchia approximation (2). In order to compute the entries of $L$, we can minimize the Kullback-Leibler (KL) divergence between two centered Gaussians with the true and approximate covariance matrices over the space $\mathcal{S}$ of lower triangular matrices satisfying the specified sparsity pattern,

$$\begin{aligned}
L &\coloneqq \operatorname*{argmin}_{\hat{L} \in \mathcal{S}} \, \mathbb{D}_{\operatorname{KL}}\!\left(\mathcal{N}(\bm{0}, \Theta) \;\middle\|\; \mathcal{N}(\bm{0}, (\hat{L} \hat{L}^{\top})^{-1})\right).
\end{aligned}$$

Note that we want $L L^{\top} \approx \Theta^{-1}$, the inverse of the covariance matrix (sometimes called the precision matrix). While $\Theta$ encodes marginal relationships between the variables, $\Theta^{-1}$ encodes conditional relationships and therefore often leads to sparser factors.

Luckily, this optimization problem has an explicit closed-form solution

$$\begin{aligned}
L_{s_i, i} &= \frac{\Theta_{s_i, s_i}^{-1} \bm{e}_1}{\sqrt{\bm{e}_1^{\top} \Theta_{s_i, s_i}^{-1} \bm{e}_1}}
\end{aligned}$$

which computes the nonzero entries for the $i$-th column of $L$ in time $\mathcal{O}(\lvert s_i \rvert^3)$.
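This formula is a one-liner in code; the sketch below assumes `Theta` is the dense covariance matrix and `s` lists the indices of the sparsity set $s_i$ with the column index $i$ placed first (the function name is mine).

```python
# Nonzero entries L[s_i, i] of the KL-optimal column, costing O(|s_i|^3).
import numpy as np

def kl_optimal_column(Theta, s):
    """Return Theta_{s,s}^{-1} e_1 / sqrt(e_1^T Theta_{s,s}^{-1} e_1)."""
    e1 = np.eye(len(s))[:, 0]
    m = np.linalg.solve(Theta[np.ix_(s, s)], e1)   # Theta_{s,s}^{-1} e_1
    return m / np.sqrt(m[0])                       # m[0] = e_1^T Theta_{s,s}^{-1} e_1
```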

If we were to directly use a sparse Cholesky factor to approximate $\Theta_{\text{Tr}, \text{Tr}}^{-1}$ in (1), we would still need to compute $\Theta_{\text{Pr}, \text{Tr}}$, which could be prohibitive. Instead, we put "prediction points first" by forming the joint covariance matrix of training and prediction points

$$\begin{aligned}
L = \begin{pmatrix} L_{\text{Pr}, \text{Pr}} & 0 \\ L_{\text{Tr}, \text{Pr}} & L_{\text{Tr}, \text{Tr}} \end{pmatrix}, \qquad L L^{\top} \approx \begin{pmatrix} \Theta_{\text{Pr}, \text{Pr}} & \Theta_{\text{Pr}, \text{Tr}} \\ \Theta_{\text{Tr}, \text{Pr}} & \Theta_{\text{Tr}, \text{Tr}} \end{pmatrix}^{-1}.
\end{aligned}$$

Given the sparse approximate Cholesky factor $L$, (1) can be approximated efficiently with

$$\begin{aligned}
\mathbb{E}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}] &= -L_{\text{Pr}, \text{Pr}}^{-\top} L_{\text{Tr}, \text{Pr}}^{\top} \bm{y}_\text{Tr} \\
\operatorname{Cov}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}] &= L_{\text{Pr}, \text{Pr}}^{-\top} L_{\text{Pr}, \text{Pr}}^{-1}.
\end{aligned}$$

This can be viewed as a generalization of the Vecchia approximation (2).
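In code, these formulas reduce to sparse triangular solves; the sketch below assumes `L` is the joint factor in SciPy sparse (CSC) format with the `n_pred` prediction points ordered first, and computes the dense posterior covariance only over those prediction points (the variable names are illustrative).

```python
# Posterior mean and covariance from the "prediction points first" factor.
import numpy as np
from scipy.sparse.linalg import spsolve_triangular

def predict(L, n_pred, y_train):
    L_pp = L[:n_pred, :n_pred].tocsr()   # prediction-prediction block
    L_tp = L[n_pred:, :n_pred]           # training-prediction block
    # mean: -L_pp^{-T} (L_tp^T y_train), a single sparse triangular solve
    mean = -spsolve_triangular(L_pp.T.tocsr(), L_tp.T @ y_train, lower=False)
    # covariance: L_pp^{-T} L_pp^{-1}, dense since there are few prediction points
    L_pp_inv = spsolve_triangular(L_pp, np.eye(n_pred), lower=True)
    cov = L_pp_inv.T @ L_pp_inv
    return mean, cov
```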

Decomposing the KL divergence

Plugging the optimal $L$ back into the KL divergence yields the interesting decomposition

$$\begin{aligned}
2\, \mathbb{D}_{\operatorname{KL}}\!\left(\mathcal{N}(\bm{0}, \Theta) \;\middle\|\; \mathcal{N}(\bm{0}, (L L^{\top})^{-1})\right) &= \sum_{i = 1}^N \left[ \log\left( \Theta_{i, i \mid s_i \setminus \{ i \}} \right) - \log\left( \Theta_{i, i \mid i + 1:} \right) \right].
\end{aligned}$$

This sum is the accumulated difference in posterior log variance over a series of independent regression problems: each predicts the $i$-th variable from a subset of the variables after it in the ordering. The term $\log ( \Theta_{i, i \mid s_i \setminus \{ i \}} )$, obtained when the regression is restricted to the variables in the sparsity pattern $s_i$, is compared to the ground truth $\log ( \Theta_{i, i \mid i + 1:} )$.
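This identity is easy to verify numerically. The sketch below (reusing the hypothetical `kl_optimal_column` from above) builds $L$ column by column for a given list `pattern` of sparsity sets, assuming the rows and columns of `Theta` are already arranged in the chosen ordering, and compares the two sides.

```python
# Numerical check of the KL decomposition: lhs and rhs should agree.
import numpy as np

def posterior_var(Theta, i, s):
    """Conditional variance Theta_{i, i | s}."""
    if len(s) == 0:
        return Theta[i, i]
    return Theta[i, i] - Theta[i, s] @ np.linalg.solve(Theta[np.ix_(s, s)], Theta[s, i])

def check_kl_decomposition(Theta, pattern):
    N = Theta.shape[0]
    L, rhs = np.zeros((N, N)), 0.0
    for i in range(N):
        s = [i] + sorted(int(j) for j in pattern[i] if j > i)  # i first, then s_i \ {i}
        L[s, i] = kl_optimal_column(Theta, s)
        rhs += np.log(posterior_var(Theta, i, s[1:])) \
             - np.log(posterior_var(Theta, i, list(range(i + 1, N))))
    P = L @ L.T                                                      # approximates Theta^{-1}
    lhs = np.trace(Theta @ P) - N - np.linalg.slogdet(Theta @ P)[1]  # 2 * KL divergence
    return lhs, rhs
```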

We can generalize this formula if multiple columns share the same sparsity pattern, a setting called aggregated or supernodal Cholesky factorization, which improves computational efficiency. If $\tilde{i}$ is a list of aggregated columns, its contribution to the KL divergence is

$$\begin{aligned}
\sum_{i \in \tilde{i}} \log\left( \Theta_{i, i \mid s_i \setminus \{ i \}} \right) &= \operatorname{logdet}(\Theta_{\tilde{i}, \tilde{i} \mid \tilde{s}}),
\end{aligned}$$

or the posterior log determinant of the group's covariance matrix. However, this formula assumes the aggregated columns are adjacent in the ordering, which may not be the case.

In the case of non-adjacent columns, a sparsity entry can be below one column but above another, thus only partially conditioning the group when it is added to the sparsity pattern. Writing $\bm{y}_{\parallel k}$ for the conditioning of the variables in $\bm{y}$ on $k$, but skipping the first $p$ variables, we have

$$\begin{aligned}
\Theta_{\tilde{i}, \tilde{i} \parallel k} &\coloneqq \operatorname{Cov}[\bm{y}_{\parallel k}] = \begin{pmatrix} L_{:p} L_{:p}^{\top} & L_{:p} {L'}_{p + 1:}^{\top} \\ {L'}_{p + 1:} L_{:p}^{\top} & {L'}_{p + 1:} {L'}_{p + 1:}^{\top} \end{pmatrix} = \begin{pmatrix} L_{:p} \\ {L'}_{p + 1:} \end{pmatrix} \begin{pmatrix} L_{:p} \\ {L'}_{p + 1:} \end{pmatrix}^{\top}
\end{aligned}$$

where $L$ is the fully unconditional Cholesky factor and $L'$ is the fully conditional factor.

[Figure: partial Cholesky factor.]
Decomposition of the covariance matrix of partially conditioned variables into a fully unconditional Cholesky factor and a fully conditional Cholesky factor.

Armed with this representation, the contribution to the KL divergence is simply

$$\begin{aligned}
\sum_{i \in \tilde{i}} \log\left( \Theta_{i, i \mid s_i \setminus \{ i \}} \right) &= \operatorname{logdet}(\Theta_{\tilde{i}, \tilde{i} \parallel k}).
\end{aligned}$$

Information-theoretic objectives

The previous formulas all imply that the sparsity selection should minimize the posterior variance (or log determinant) of the target, both measures of the target's uncertainty. In other words, the selected entries should be maximally informative about the target.

To make this notion precise, we can define the mutual information or information gain as

$$\begin{aligned}
\mathbb{I}[\bm{y}_\text{Pr}; \bm{y}_\text{Tr}] &\coloneqq \mathbb{H}[\bm{y}_\text{Pr}] - \mathbb{H}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}]
\end{aligned}$$

where $\mathbb{H}[X]$ is the entropy of the random variable $X$. Since the entropy of $\bm{y}_\text{Pr}$ is constant, maximizing the mutual information is equivalent to minimizing the posterior entropy. For multivariate Gaussians the entropy is monotonically increasing with the log determinant, so maximizing mutual information is indeed equivalent to minimizing posterior variance!
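Concretely, this last step uses the standard entropy formula for a multivariate Gaussian (stated here for reference),

$$\begin{aligned}
\mathbb{H}[\mathcal{N}(\bm{\mu}, \Sigma)] = \tfrac{1}{2} \operatorname{logdet}(2 \pi e \, \Sigma) = \tfrac{1}{2} \operatorname{logdet}(\Sigma) + \tfrac{n}{2} \log(2 \pi e),
\end{aligned}$$

so for a fixed dimension $n$, minimizing the posterior entropy is precisely minimizing the posterior log determinant.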

In addition, we observe an information-theoretic analogue of the EV-VE identity

$$\begin{aligned}
\operatorname{Var}[\bm{y}_\text{Pr}] &= \mathbb{E}[\operatorname{Var}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}]] + \operatorname{Var}[\mathbb{E}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}]] \\
\mathbb{H}[\bm{y}_\text{Pr}] &= \mathbb{H}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}] + \mathbb{I}[\bm{y}_\text{Pr}; \bm{y}_\text{Tr}]
\end{aligned}$$

where the posterior entropy can be identified with the posterior variance and the mutual information corresponds to the variance of the optimal estimator $\mathbb{E}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}]$ from (1), which can be intuitively understood as the extent to which the estimator actually varies with the data.

Finally, the expected mean squared error of the estimator $\mathbb{E}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}]$, defined as $\text{MSE} \coloneqq \mathbb{E}[(\bm{y}_\text{Pr} - \mathbb{E}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}])^2]$, is simply the posterior variance because

$$\begin{aligned}
\text{MSE} = \mathbb{E}\!\left[\mathbb{E}[(\bm{y}_\text{Pr} - \mathbb{E}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}])^2 \mid \bm{y}_\text{Tr}]\right] = \mathbb{E}[\operatorname{Var}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}]] = \operatorname{Var}[\bm{y}_\text{Pr} \mid \bm{y}_\text{Tr}],
\end{aligned}$$

from the quirk that the posterior covariance (1) does not depend on observing $\bm{y}_\text{Tr}$.

Greedy selection

Rather than use distance as a geometric proxy for importance, why not use information directly? The following toy example illustrates the benefits of selection by information.

In this 1-d regression problem, the goal is to predict the $y$-value of the target point, given the locations of many training points. However, the predictor can only use the $y$-values of two training points, so which points it ends up selecting will determine the prediction.

[Figure: nearest neighbors (left) and conditional selection (right).]
The red line is the conditional mean $\mu$, conditional on the selected points, and the $\pm 2 \sigma$ confidence interval is shaded according to the conditional variance $\sigma^2$.

If one selects the nearest neighbors (left panel), both points are to the right of the target and quite close to each other. This results in an inaccurate prediction with high uncertainty.

A more balanced view of the situation is obtained when picking the slightly farther but ultimately more informative point to the left (right panel), improving the prediction and reducing the uncertainty. In higher dimensions this phenomenon is only exacerbated, as there are more "sides" of the target and so more reason to spread out the selected points.

The resulting selections on a 2-d grid with Matérn kernels of different smoothnesses are visualized in this YouTube video (I recommend putting it on double time; it's quite satisfying).

It is intractable (and probably NP-hard) to select the optimal $k$ points out of $N$ total, so we greedily select the most informative point, conditional on all points previously selected. The naive time complexity is $\mathcal{O}(N k^4)$, but by maintaining a partial Cholesky factor we can reduce this to $\mathcal{O}(N k^2)$. For multiple prediction points, application of the matrix determinant lemma

$$\begin{aligned}
\operatorname{logdet}(\Theta_{\text{Pr}, \text{Pr} \mid I, k}) - \operatorname{logdet}(\Theta_{\text{Pr}, \text{Pr} \mid I}) &= \log(\Theta_{k, k \mid I, \text{Pr}}) - \log(\Theta_{k, k \mid I})
\end{aligned}$$

switches the roles of the prediction points and the training points, allowing the change in log determinant to be computed much more efficiently as the change in variance of the training point after conditioning on the prediction points. For $m$ targets the time complexity is $\mathcal{O}(N k^2 + N m^2 + m^3)$, gracefully reducing to the complexity of the single-point algorithm when $m = 1$. Finally, efficient application of rank-one downdating accounts for partially conditioned points at no additional asymptotic time cost.
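The single-target $\mathcal{O}(N k^2)$ idea can be sketched compactly (this is my own simplified rendering, not the paper's implementation; `Theta`, `target`, and `cands` are assumed names): conditional variances and covariances are maintained through a growing partial Cholesky factor rather than recomputed from scratch at every step.

```python
# Greedy selection of k informative points for a single target in O(N k^2).
import numpy as np

def greedy_select(Theta, target, cands, k):
    cands = np.asarray(cands)
    cond_var = Theta[cands, cands].astype(float)    # Var[y_j | selected so far]
    cond_cov = Theta[cands, target].astype(float)   # Cov[y_j, y_target | selected]
    factors = np.zeros((len(cands), k))             # partial Cholesky rows (candidates)
    factor_t = np.zeros(k)                          # partial Cholesky row (target)
    selected = []
    for step in range(k):
        # greedy criterion: largest reduction of the target's posterior variance
        scores = np.full(len(cands), -np.inf)
        ok = cond_var > 1e-12
        scores[ok] = cond_cov[ok] ** 2 / cond_var[ok]
        scores[selected] = -np.inf
        best = int(np.argmax(scores))
        selected.append(best)
        # new column of the partial Cholesky factor from conditioning on cands[best]
        col = Theta[cands, cands[best]] - factors[:, :step] @ factors[best, :step]
        col /= np.sqrt(cond_var[best])
        col_t = Theta[target, cands[best]] - factor_t[:step] @ factors[best, :step]
        col_t /= np.sqrt(cond_var[best])
        factors[:, step], factor_t[step] = col, col_t
        # rank-one updates of the conditional quantities
        cond_var -= col ** 2
        cond_cov -= col * col_t
    return cands[selected]
```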

Sparsity pattern

We can directly apply our algorithm to select the sparsity pattern of Cholesky factors.

[Figure: Cholesky factor (left) and point selection (right).]
Equivalence of sparsity selection in Cholesky factors and point selection in GPs.

For a column of a Cholesky factor in isolation, the target point is the diagonal entry, candidates are below it, and the selected entries are added to the sparsity pattern. Points violating lower triangularity are not shown. Thus, sparsity selection in Cholesky factorization (left panel) is analogous to training point selection in directed Gaussian process regression (right panel).

Numerical experiments

Here I give, in my opinion, the most thought-provoking experiments from the paper.

Nearest neighbors classification

We perform image classification on the MNIST dataset of handwritten digits (greyscale images of the digits 0-9) using a very simple $k$-nearest neighbors approach: select $k$ images, either with nearest neighbors ($k$-NN) or with conditional selection (C$k$-NN). Take the most frequently occurring label in the selected images, and use that as the predicted label.
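As a sketch of the two classifiers (my own rendering, reusing the hypothetical `greedy_select` from the greedy-selection section; the kernel, preprocessing, and hyperparameters used in the paper are not reproduced here):

```python
# k-NN vs. Ck-NN prediction for a single test image by majority vote.
import numpy as np
from collections import Counter

def knn_predict(kernel, X_train, y_train, x_test, k, conditional=False):
    # joint covariance of the (flattened) test image at index 0 and all training images
    X = np.vstack([x_test[None, :], X_train])
    Theta = kernel(X, X)
    if conditional:   # Ck-NN: greedy conditional selection of k training images
        idx = np.asarray(greedy_select(Theta, 0, np.arange(1, len(X)), k)) - 1
    else:             # k-NN: the k largest covariances with the test image
        idx = np.argsort(-Theta[0, 1:])[:k]
    return Counter(y_train[idx]).most_common(1)[0][0]   # majority vote
```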

[Figure: MNIST classification.]
Classification accuracy of $k$-NN variants on the MNIST dataset.

The $k$-NN algorithm (shown in blue) linearly decreases in accuracy with increasing $k$, as the bias-variance trade-off would predict. However, not only does C$k$-NN maintain a higher accuracy for all $k$, but its accuracy actually increases until around $k \approx 10$.

This suggests that conditioning is somehow able to "robustify" the selection procedure.

Sparse signal recovery

Our selection algorithm can be viewed as the covariance equivalent of orthogonal matching pursuit. To see this, let $\Theta$ be a symmetric positive-definite (s.p.d.) matrix, so it admits a factorization $\Theta = F^{\top} F$ for a matrix of feature vectors $F$. Perform a QR factorization $F = QR$ so that

$$\begin{aligned}
\Theta = (QR)^{\top} (QR) = R^{\top} (Q^{\top} Q) R = R^{\top} R.
\end{aligned}$$

But $R$ is an upper triangular matrix, so by the uniqueness of the Cholesky factorization, $R$ is a Cholesky factor of $\Theta$. As the QR factorization performs iterative orthogonalization in the feature space $F$ through the Gram–Schmidt process, this corresponds to conditioning in $\Theta$.
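A quick numerical check of this correspondence (illustrative only; the random sizes and seed are arbitrary): the $R$ factor of $F$ matches the upper-triangular Cholesky factor of $\Theta = F^{\top} F$ once its rows are rescaled to have positive diagonal, which QR does not enforce.

```python
# The R of F = QR coincides with the Cholesky factor of Theta = F^T F.
import numpy as np

rng = np.random.default_rng(0)
F = rng.standard_normal((50, 10))           # feature vectors
Theta = F.T @ F

R = np.linalg.qr(F, mode="r")               # upper-triangular R
U = np.linalg.cholesky(Theta).T             # upper-triangular Cholesky factor
R *= np.sign(np.diag(R))[:, None]           # fix row signs so the diagonal is positive
print(np.allclose(R, U))                    # True up to floating-point error
```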

Motivated by this connection, we consider an experiment in which we randomly generate an a priori sparse Cholesky factor $L$ and attempt to recover it given only the measurements $Q \coloneqq L L^{\top}$. We report the intersection over union (IOU) of the sparsity patterns.

[Figure: nearest neighbors (left) and conditional selection (right).]
Accuracy over varying matrix size $N$ and nonzero entries per column $s$.

Conditional selection maintains near-perfect accuracy with increasing problem size and factor density, while the naive alternatives perform significantly worse.

Preconditioning the conjugate gradient

The conjugate gradient method is a classical iterative algorithm for solving linear systems with s.p.d. matrices. In this experiment, we use sparse Cholesky factors as preconditioners for the conjugate gradient and track how many nonzeros per column are needed to converge to a specified precision within a fixed number of iterations (here, 50).
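A sketch of the setup (illustrative, not the paper's experiment harness): since $L L^{\top} \approx \Theta^{-1}$, applying the preconditioner is just two sparse triangular matrix-vector products.

```python
# Preconditioned conjugate gradient with an approximate factor L L^T ~ Theta^{-1}.
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def preconditioned_cg(Theta, b, L, maxiter=50):
    N = Theta.shape[0]
    # the preconditioner approximates the inverse of Theta: M r = L (L^T r)
    M = LinearOperator((N, N), matvec=lambda r: L @ (L.T @ r))
    # rtol is the relative tolerance (named `tol` in older SciPy releases)
    x, info = cg(Theta, b, M=M, rtol=1e-8, maxiter=maxiter)
    return x, info   # info == 0 means it converged within maxiter iterations
```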

[Figure: conjugate gradient.]
Number of nonzeros per column to converge within 50 iterations.

Our conditional selection methods (shown in orange and red) need a nearly constant number of nonzeros with increasing problem size $N$. The nearest neighbor-based methods, however, start increasing sharply. This could hint at a qualitatively different regime in which the conditional methods discover hidden structure that nearest neighbors cannot access.

Conclusion

Replacing nearest neighbors with conditional selection has the potential to significantly improve the accuracy of sparsity patterns and the statistical efficiency of the resulting approximate sparse Cholesky factor, enabling faster and more accurate computation of statistical quantities of interest such as the posterior mean, posterior variance, likelihood, and log determinant, as well as the inverse of the covariance matrix.

Beyond Gaussian processes, efficient computation with large structured positive-definite matrices could accelerate optimization routines (through Newton's method and its variants), the solution of linear systems by preconditioning iterative methods like the conjugate gradient, and more. Optimal transport, mirror descent, and the natural gradient are all possibilities.

By replacing geometric criteria hand-crafted for a specific family of covariance functions with relatively general information-theoretic criteria, this work represents an important step toward automating the discovery of exploitable structure in complex positive-definite matrices. Importantly, the selection methods only require access to the entries of the matrix, not the underlying points, allowing them to be applied in principle to any s.p.d. matrix.

Postscript

$$\begin{aligned}
\text{Conditioning in Gaussian processes} \iff \text{Gaussian elimination} \iff \text{Sodachi Oikura's favorite}
\end{aligned}$$
[Image: two portraits, with Euler on the left and Gauss on the right.]

…けれど私が1等尊敬してるのはガウス先生だよ。

...but the [mathematician] I respect the most is Gauss.

---from Zoku Owarimonogatari episode 2, at roughly 13:00.