by greedy conditional selection

*This post summarizes joint work with
Joseph Guinness,
Matthias Katzfuss,
Houman Owhadi, and Florian Schäfer. For
additional details, see the
paper or
source code*.

*Thanks Seulgi for
drawing the preview image! The full resolution image may be found
here.*

Our motivating example is a regression problem that we model with Gaussian processes.

In our setting, we have $N$ points whose corresponding covariance matrix $\Theta$ is formed by

$\begin{aligned} \Theta_{i, j} \coloneqq K(\bm{ x}_i, \bm{
x}_j), \end{aligned}$

or pairwise evaluation of a kernel function $K(\cdot, \cdot)$. Commonly used kernels include the Matérn family of covariance functions, which include the exponential kernel and squared exponential kernel (also known as the Gaussian or RBF kernel) as special cases. The kernel function along with a mean function $\mu(\cdot)$ define a Gaussian process $f \sim \mathcal{GP}(\mu, K)$ such that for any set of points $X$, $f(X)$ is jointly distributed according to a multivariate Gaussian.

If we partition our points into training points for which we know the value of $f$ and testing points at which we wish to predict the value of $f$, then the Gaussian process modeling of $f$ enables us to compute the posterior distribution of the testing points in closed form as

$\begin{aligned} \providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}} &= \bm{ \mu}_\text{Pr} + \Theta_{\text{Pr}, \text{Tr}} \Theta_{\text{Tr}, \text{Tr}}^{-1} (\bm{ y}_\text{Tr} - \bm{ \mu}_\text{Tr}) \\ \providecommand{\Covvhelper}{ \@ifstar{\Covvstar}{\Covvnostar} } \providecommand{\Covvnostar}[1]{\mathbb{C}\text{ov}[#1]} \providecommand{\Covvstar}[1]{\mathbb{C}\text{ov}\left[#1\right]} \Covvhelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}} &= \Theta_{\text{Pr}, \text{Pr}} - \Theta_{\text{Pr}, \text{Tr}} \Theta_{\text{Tr}, \text{Tr}}^{-1} \Theta_{\text{Tr}, \text{Pr}}. \end{aligned}$Note that our predictions naturally come equipped with uncertainty estimates from the posterior covariance. Due to the ability to measure uncertainty and to encode prior information in the kernel function, Gaussian processes and kernel methods enjoy widespread usage in geostatistics, machine learning, and optimal experimental design.

However, computing (1) requires inverting the covariance matrix of the training points. Additional statistical inference often requires computing quantities like the likelihood or log determinant. For a dense covariance matrix $\Theta$, directly computing these quantities has a computational cost of $\mathcal{O}(N^3)$ and a memory cost of $\mathcal{O}(N^2)$, which is prohibitively expensive for large $N$. Can we efficiently approximate (1) while maintaining predictive accuracy?

Spatial statisticians have long observed the *screening effect*,
or the observation that conditional on nearby points, more distant
points have very little influence on the prediction.

In other words, after conditioning on a few points, the conditional correlation localizes around the target point, implying approximate conditional sparsity.

We can exploit this conditional sparsity by looking at the joint distribution.

$\begin{aligned} \pi(\bm{ y}) &= \pi(y_1) \pi(y_2 \mid
y_1) \pi(y_3 \mid y_1, y_2) \dotsm \pi(y_N \mid y_1, y_2,
\dotsc, y_{N - 1}). \end{aligned}$

The Vecchia approximation proposes to approximate the joint distribution by

$\begin{aligned} \pi(\bm{ y}) &\approx \pi(y_{i_1}) \pi(y_{i_2} \mid y_{s_2}) \pi(y_{i_3} \mid y_{s_3}) \dotsm \pi(y_{i_N} \mid y_{s_N}) \end{aligned}$for some ordering $i_1, \dotsc, i_N$ and sparsity sets $s_k \subseteq \{ i_1, \dotsc, i_{k - 1} \}$ with $\providecommand{\cardhelper}{ \@ifstar{\cardstar}{\cardnostar} } \providecommand{\cardnostar}[1]{\lvert #1 \rvert} \providecommand{\cardstar}[1]{\left \lvert #1 \right \rvert} \cardhelper{s_k} \ll k$. The quality of the approximation depends significantly on the chosen ordering and sparsity pattern. In order to exploit the screening effect as much as possible, it is natural to take the nearest neighbors as the sparsity pattern. For the ordering, we would like to select points that have good coverage of the entire space. Towards that end, we form the reverse-maximin ordering by first selecting the last index $i_N$ arbitrarily and then choosing for $k = N - 1, \dotsc, 1$ the index

$\begin{aligned} i_k = \operatorname*{argmax}_{i \in
-\mathcal{I}_{k + 1}} \; \min_{j \in \mathcal{I}_{k + 1}}
\providecommand{\normhelper}{
\@ifstar{\normstar}{\normnostar} }
\providecommand{\normnostar}[1]{\lVert #1 \rVert}
\providecommand{\normstar}[1]{\left \lVert #1 \right \rVert}
\normhelper{\bm{ x}_i - \bm{ x}_j} \end{aligned}$

where $-\mathcal{I} \coloneqq \{ 1, \dotsc, N \} \setminus \mathcal{I}$ and $\mathcal{I}_n \coloneqq \{ i_n, i_{n + 1}, \dotsc, i_N \}$, that is, select the point furthest from previously selected points every iteration. This ordering and sparsity pattern can be computed efficiently with geometric data structures like the $k$-d tree or ball tree.

From another perspective, the Vecchia approximation implicitly forms an approximate Cholesky factor of the covariance matrix (a lower triangular matrix $L$ such that $L L^{\top} \approx \Theta$).

We can hope for (approximate) sparsity in the Cholesky factor as Cholesky factorization can be viewed as recursively computing the block decomposition

$\begin{aligned} \operatorname{chol}(\Theta) &=
\begin{pmatrix} \text{Id} & 0 \\
\textcolor{#c7740e}{\Theta_{2, 1} \Theta_{1, 1}^{-1}} &
\text{Id} \end{pmatrix} \begin{pmatrix}
\operatorname{chol}(\Theta_{1, 1}) & 0 \\ 0 &
\operatorname{chol}(\textcolor{#a1b4c7}{ \Theta_{2, 2} -
\Theta_{2, 1} \Theta_{1, 1}^{-1} \Theta_{1, 2} })
\end{pmatrix} \\ \end{aligned}$

where $\Theta_{2, 2} - \Theta_{2, 1} \Theta_{1, 1}^{-1} \Theta_{1, 2}$ is the conditional covariance in (1). Thus Cholesky factorization is a process of iteratively conditioning the underlying Gaussian process, which we hope will lead to approximate sparsity from the screening effect.

We can use the same ordering and sparsity pattern as the Vecchia approximation (2). In order to compute the entries of $L$, we can minimize the Kullback-Leibler (KL) divergence between two centered Gaussians with the true and approximate covariance matrices over the space $\mathcal{S}$ of lower triangular matrices satisfying the specified sparsity pattern,

$\begin{aligned} L &\coloneqq
\operatorname*{argmin}_{\hat{L} \in \mathcal{S}} \,
\providecommand{\KLhelper}{ \@ifstar{\KLstar}{\KLnostar} }
\providecommand{\KLnostar}[2]{\mathbb{D}_{\operatorname{KL}}(#1\;
\| \;#2)}
\providecommand{\KLstar}[2]{\mathbb{D}_{\operatorname{KL}}\left(#1\;
\middle\| \;#2\right)} \KLhelper*{\mathcal{N}(\bm{ 0},
\Theta)}{\mathcal{N}(\bm{ 0}, (\hat{L}
\hat{L}^{\top})^{-1})}. \end{aligned}$

Note that we want $L L^{\top} \approx \Theta^{-1}$, the inverse of the covariance matrix (sometimes called the precision matrix). While $\Theta$ encodes marginal relationships between the variables, $\Theta^{-1}$ encodes conditional relationships and therefore often leads to sparser factors.

Luckily this optimization problem has a closed-form explicit solution

$\begin{aligned} L_{s_i, i} &= \frac{\Theta_{s_i,
s_i}^{-1} \bm{ e}_1} {\sqrt{\bm{ e}_1^{\top} \Theta_{s_i,
s_i}^{-1} \bm{ e}_1}} \end{aligned}$

which computes the nonzero entries for the $i$-th column of $L$ in time $\mathcal{O}(\providecommand{\cardhelper}{ \@ifstar{\cardstar}{\cardnostar} } \providecommand{\cardnostar}[1]{\lvert #1 \rvert} \providecommand{\cardstar}[1]{\left \lvert #1 \right \rvert} \cardhelper{s_i}^3)$.

If we were to directly use a sparse Cholesky factor to approximate $\Theta_{\text{Tr}, \text{Tr}}^{-1}$ in (1), we would still need to compute $\Theta_{\text{Pr}, \text{Tr}}$ which could be prohibitive. Instead, we put "prediction points first" by forming the joint covariance matrix of training and prediction points

$\begin{aligned} L = \begin{pmatrix} L_{\text{Pr},
\text{Pr}} & 0 \\ L_{\text{Tr}, \text{Pr}} &
L_{\text{Tr}, \text{Tr}} \end{pmatrix} \qquad L L^{\top}
\approx \begin{pmatrix} \Theta_{\text{Pr}, \text{Pr}} &
\Theta_{\text{Pr}, \text{Tr}} \\ \Theta_{\text{Tr},
\text{Pr}} & \Theta_{\text{Tr}, \text{Tr}}
\end{pmatrix}^{-1} \end{aligned}$

Given the sparse approximate Cholesky factor $L$, (1) can be approximated efficiently with

$\begin{aligned} \providecommand{\Ehelper}{
\@ifstar{\Estar}{\Enostar} }
\providecommand{\Enostar}[1]{\mathbb{E}[#1]}
\providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]}
\Ehelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}} &=
-L_{\text{Pr}, \text{Pr}}^{-\top} L_{\text{Tr},
\text{Pr}}^{\top} \bm{ y}_\text{Tr} \\
\providecommand{\Covvhelper}{
\@ifstar{\Covvstar}{\Covvnostar} }
\providecommand{\Covvnostar}[1]{\mathbb{C}\text{ov}[#1]}
\providecommand{\Covvstar}[1]{\mathbb{C}\text{ov}\left[#1\right]}
\Covvhelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}} &=
L_{\text{Pr}, \text{Pr}}^{-\top} L_{\text{Pr},
\text{Pr}}^{-1}. \end{aligned}$

This can be viewed as a generalization of the Vecchia approximation (2).

Plugging the optimal $L$ back into the KL divergence yields the interesting decomposition

$\begin{aligned} 2 \providecommand{\KLhelper}{
\@ifstar{\KLstar}{\KLnostar} }
\providecommand{\KLnostar}[2]{\mathbb{D}_{\operatorname{KL}}(#1\;
\| \;#2)}
\providecommand{\KLstar}[2]{\mathbb{D}_{\operatorname{KL}}\left(#1\;
\middle\| \;#2\right)} \KLhelper*{\mathcal{N}(\bm{ 0},
\Theta)}{\mathcal{N}(\bm{ 0}, (L L^{\top})^{-1})} &=
\sum_{i = 1}^N \left [ \log \left ( \Theta_{i, i \mid s_i
\setminus \{ i \}} \right ) - \log \left ( \Theta_{i, i \mid
i + 1:} \right ) \right ]. \end{aligned}$

This sum is the accumulated *difference* in posterior log
variance for a series of independent regression problems: each to
predict the
$i$-th variable given a subset of the variables after it in the ordering.
The term
$\log \left ( \Theta_{i, i \mid s_i \setminus \{ i \}} \right )$
obtained when restricted to the variables in the sparsity pattern
$s_i$
is compared to the ground truth
$\log \left ( \Theta_{i, i \mid i + 1:} \right )$.

We can generalize this formula if multiple columns share the same sparsity pattern, a setting called aggregated or supernodal Cholesky factorization which improves computational efficiency. If $\tilde{i}$ is a list of aggregated columns, its contribution to the KL divergence is

$\begin{aligned} \sum_{i \in \tilde{i}} \log \left
(\Theta_{i, i \mid s_i \setminus \{ i \} } \right ) &=
\operatorname{logdet}(\Theta_{\tilde{i}, \tilde{i} \mid
\tilde{s}}) \end{aligned}$

or the posterior log determinant of the group's covariance matrix. However, this formula assumes the aggregated columns are adjacent in the ordering, which may not be the case.

In the case of non-adjacent columns, a sparsity entry can be below one
column but above another, thus only *partially* conditioning the
group when it is added to the sparsity pattern. Writing
$\bm{ y}_{\parallel k}$
for the conditioning of the variables in
$\bm{ y}$
by
$k$
but skipping the first
$p$, we have

$\begin{aligned} \Theta_{\tilde{i}, \tilde{i} \parallel k}
&\coloneqq \providecommand{\Covvhelper}{
\@ifstar{\Covvstar}{\Covvnostar} }
\providecommand{\Covvnostar}[1]{\mathbb{C}\text{ov}[#1]}
\providecommand{\Covvstar}[1]{\mathbb{C}\text{ov}\left[#1\right]}
\Covvhelper{\bm{ y}_{\parallel k}} = \begin{pmatrix} L_{:p}
L_{:p}^{\top} & L_{:p} {L'}_{p + 1:}^{\top} \\ {L'}_{p +
1:} L_{:p}^{\top} & {L'}_{p + 1:} {L'}_{p + 1:}^{\top}
\end{pmatrix} = \begin{pmatrix} L_{:p} \\ {L'}_{p + 1:}
\end{pmatrix} \begin{pmatrix} L_{:p} \\ {L'}_{p + 1:}
\end{pmatrix}^{\top} \end{aligned}$

where $L$ is the fully unconditional Cholesky factor and $L'$ is the fully conditional factor.

Armed with this representation, the contribution to the KL divergence is simply

$\begin{aligned} \sum_{i \in \tilde{i}} \log \left
(\Theta_{i, i \mid s_i \setminus \{ i \} } \right ) &=
\operatorname{logdet}(\Theta_{\tilde{i}, \tilde{i} \parallel
k}). \end{aligned}$

The previous formulas all imply that the sparsity selection should
minimize the posterior variance (or log determinant) of the target which
are measures of the uncertainty of the target. In other words, the
selected entries should be *maximally informative* to the target.

To make this notion precise, we can define the mutual information or information gain as

$\begin{aligned} \providecommand{\MIhelper}{
\@ifstar{\MIstar}{\MInostar} }
\providecommand{\MInostar}[2]{\mathbb{I}[#1; #2]}
\providecommand{\MIstar}[2]{\mathbb{I}\left[#1; #2\right]}
\MIhelper{\bm{ y}_\text{Pr}}{\bm{ y}_\text{Tr}}
&\coloneqq \providecommand{\Entropyhelper}{
\@ifstar{\Entropystar}{\Entropynostar} }
\providecommand{\Entropynostar}[1]{\mathbb{H}[#1]}
\providecommand{\Entropystar}[1]{\mathbb{H}\left[#1\right]}
\Entropyhelper{\bm{ y}_\text{Pr}} -
\providecommand{\Entropyhelper}{
\@ifstar{\Entropystar}{\Entropynostar} }
\providecommand{\Entropynostar}[1]{\mathbb{H}[#1]}
\providecommand{\Entropystar}[1]{\mathbb{H}\left[#1\right]}
\Entropyhelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}}
\end{aligned}$

where $\providecommand{\Entropyhelper}{ \@ifstar{\Entropystar}{\Entropynostar} } \providecommand{\Entropynostar}[1]{\mathbb{H}[#1]} \providecommand{\Entropystar}[1]{\mathbb{H}\left[#1\right]} \Entropyhelper{X}$ is the entropy of the random variable $X$. Since the entropy of $\bm{ y}_\text{Pr}$ is constant, maximizing the mutual information is equivalent to minimizing the posterior entropy. For multivariate Gaussians the entropy is monotonically increasing with the log determinant, so maximizing mutual information is indeed equivalent to minimizing posterior variance!

In addition, we observe an information-theoretic analogue of the EV-VE identity

$\begin{aligned}
\textcolor{#ea8810}{\providecommand{\Varhelper}{
\@ifstar{\Varstar}{\Varnostar} }
\providecommand{\Varnostar}[1]{\mathbb{V}\text{ar}[#1]}
\providecommand{\Varstar}[1]{\mathbb{V}\text{ar}\left[#1\right]}
\Varhelper{\bm{ y}_\text{Pr}}} &=
\textcolor{#a1b4c7}{\providecommand{\Ehelper}{
\@ifstar{\Estar}{\Enostar} }
\providecommand{\Enostar}[1]{\mathbb{E}[#1]}
\providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]}
\Ehelper{\providecommand{\Varhelper}{
\@ifstar{\Varstar}{\Varnostar} }
\providecommand{\Varnostar}[1]{\mathbb{V}\text{ar}[#1]}
\providecommand{\Varstar}[1]{\mathbb{V}\text{ar}\left[#1\right]}
\Varhelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}}}} +
\textcolor{#b8420f}{\providecommand{\Varhelper}{
\@ifstar{\Varstar}{\Varnostar} }
\providecommand{\Varnostar}[1]{\mathbb{V}\text{ar}[#1]}
\providecommand{\Varstar}[1]{\mathbb{V}\text{ar}\left[#1\right]}
\Varhelper{\providecommand{\Ehelper}{
\@ifstar{\Estar}{\Enostar} }
\providecommand{\Enostar}[1]{\mathbb{E}[#1]}
\providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]}
\Ehelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}}}} \\
\textcolor{#ea8810}{\providecommand{\Entropyhelper}{
\@ifstar{\Entropystar}{\Entropynostar} }
\providecommand{\Entropynostar}[1]{\mathbb{H}[#1]}
\providecommand{\Entropystar}[1]{\mathbb{H}\left[#1\right]}
\Entropyhelper{\bm{ y}_\text{Pr}}} &=
\textcolor{#a1b4c7}{\providecommand{\Entropyhelper}{
\@ifstar{\Entropystar}{\Entropynostar} }
\providecommand{\Entropynostar}[1]{\mathbb{H}[#1]}
\providecommand{\Entropystar}[1]{\mathbb{H}\left[#1\right]}
\Entropyhelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}}} +
\textcolor{#b8420f}{\providecommand{\MIhelper}{
\@ifstar{\MIstar}{\MInostar} }
\providecommand{\MInostar}[2]{\mathbb{I}[#1; #2]}
\providecommand{\MIstar}[2]{\mathbb{I}\left[#1; #2\right]}
\MIhelper{\bm{ y}_\text{Pr}}{\bm{ y}_\text{Tr}}}
\end{aligned}$

where the posterior entropy can be identified with the posterior variance and the mutual information corresponds to the variance of the optimal estimator $\providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}}$ (1), which can be intuitively understood as the extent to which the estimator actually varies with data.

Finally, the expected mean squared error of the estimator $\providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}}$ which is defined as $\text{MSE} \coloneqq \providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{(\bm{ y}_\text{Pr} - \providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}})^2}$ is simply the posterior variance because

$\begin{aligned} \text{MSE} = \providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{\providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{(\bm{ y}_\text{Pr} - \providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}})^2 \mid \bm{ y}_\text{Tr}}} = \providecommand{\Ehelper}{ \@ifstar{\Estar}{\Enostar} } \providecommand{\Enostar}[1]{\mathbb{E}[#1]} \providecommand{\Estar}[1]{\mathbb{E}\left[#1\right]} \Ehelper{\providecommand{\Varhelper}{ \@ifstar{\Varstar}{\Varnostar} } \providecommand{\Varnostar}[1]{\mathbb{V}\text{ar}[#1]} \providecommand{\Varstar}[1]{\mathbb{V}\text{ar}\left[#1\right]} \Varhelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}}} = \providecommand{\Varhelper}{ \@ifstar{\Varstar}{\Varnostar} } \providecommand{\Varnostar}[1]{\mathbb{V}\text{ar}[#1]} \providecommand{\Varstar}[1]{\mathbb{V}\text{ar}\left[#1\right]} \Varhelper{\bm{ y}_\text{Pr} \mid \bm{ y}_\text{Tr}}, \end{aligned}$from the quirk that the posterior covariance (1) does not depend on observing $\bm{ y}_\text{Tr}$.

Rather than use distance as a geometric proxy for importance, why not use information directly? The following toy example illustrates the benefits of selection by information.

In this 1-d regression problem, the goal is to predict the $y$-value of the $\textcolor{#ea8810}{\text{target}}$ point, given the locations of many $\textcolor{#a1b4c7}{\text{training}}$ points. However, the predictor can only use the $y$-values of two training points, so which points it ends up $\textcolor{#23553c}{\text{selecting}}$ will determine the prediction.

If one selects the nearest neighbors (left panel), both points are to the right of the target and quite close to each other. This results in an inaccurate prediction with high uncertainty.

A more balanced view of the situation is obtained when picking the slightly farther but ultimately more informative point to the left (right panel), improving the prediction and reducing the uncertainty. In higher dimensions this phenomena is only exacerbated as there are more "sides" of the target, and so more reason to spread out the selected points.

The resulting selections on a 2-d grid with Matérn kernels of different smoothnesses are visualized in this YouTube video (I recommend putting it on double time; it's quite satisfying).

It is intractable (and probably NP hard) to select the optimal $k$ points out of $N$ points total, so we greedily select the most informative point, conditional on all points previously selected. The naive time complexity is $\mathcal{O}(N k^4)$, but by maintaining a partial Cholesky factor we can reduce this to $\mathcal{O}(N k^2)$. For multiple points, application of the matrix determinant lemma

$\begin{aligned} \operatorname{logdet}(\Theta_{\text{Pr},
\text{Pr} \mid I, k}) -
\operatorname{logdet}(\Theta_{\text{Pr}, \text{Pr} \mid I})
&= \log(\Theta_{k, k \mid I, \text{Pr}}) -
\log(\Theta_{k, k \mid I}) \end{aligned}$

switches the roles of the prediction points and the training points,
allowing the change in log determinant to be computed much more
efficiently as the change in variance of the *training* point
after conditioning on the prediction points. For
$m$
targets the time complexity is
$\mathcal{O}(N k^2 + N m^2 + m^3)$, gracefully reducing to the complexity of the single-point algorithm
when
$m = 1$. Finally, efficient application of
rank-one downdating
accounts for partially conditioned points at no additional asymptotic
time cost.

We can directly apply our algorithm to select the sparsity pattern of Cholesky factors.

For a column of a Cholesky factor in isolation, the $\textcolor{#ea8810}{\text{target}}$ point is the $\textcolor{#ea8810}{\text{diagonal}}$ entry, $\textcolor{#a1b4c7}{\text{candidates}}$ are $\textcolor{#a1b4c7}{\text{below}}$ it, and the $\textcolor{#23553c}{\text{selected}}$ entries are added to the $\textcolor{#23553c}{\text{sparsity pattern}}$. Points violating lower triangularity are not shown. Thus, sparsity selection in Cholesky factorization (left panel) is analogous to training point selection in directed Gaussian process regression (right panel).

Here I give, in my opinion, the most thought-provoking experiments from the paper.

We perform image classification on the MNIST dataset of handwritten digits (greyscale images of digits from 0-9) using a very simple $k$-nearest neighbors approach: select $k$ images, either with nearest neighbors ($k$-NN) or with conditional selection (C$k$-NN). Take the most frequently occurring label in the selected images, and use that as the predicted label.

The
$k$-NN algorithm (shown in blue) linearly decreases in accuracy with
increasing
$k$
as the bias-variance trade-off would predict. However, not only does
C$k$-NN maintain a higher accuracy for all
$k$, but its accuracy actually *increases* until around
$k \approx 10$.

This suggests conditioning able to somehow "robustify" the selection procedure.

Our selection algorithm can be viewed as the covariance equivalent of orthogonal matching pursuit. To see this, let $\Theta$ be a symmetric positive-definite (s.p.d.) matrix, so it admits a factorization $\Theta = F^{\top} F$ for feature vectors $F$. Perform a QR factorization on $F = QR$ so that

$\begin{aligned} \Theta = (QR)^{\top} (QR) = R^{\top}
(Q^{\top} Q) R = R^{\top} R. \end{aligned}$

But $R$ is an upper triangular matrix, so by the uniqueness of the Cholesky factorization, $R$ is a Cholesky factor of $\Theta$. As the QR factorization performs iterative orthogonalization in the feature space $F$ through the Gram–Schmidt process, this corresponds to conditioning in $\Theta$.

Motivated by this connection, we consider an experiment in which we
randomly generate an *a priori* sparse Cholesky factor
$L$
and attempt to recover it given only the measurements
$Q \coloneqq L L^{\top}$. We report the intersection over union (IOU) of the sparsity patterns.

Conditional selection maintains a near-prefect accuracy with increasing problem size and factor density, while the other naive methods perform significantly worse.

The conjugate gradient is a classical iterative algorithm for solving linear systems in s.p.d. matrices. In this experiment, we use sparse Cholesky factors as a preconditioner for the conjugate gradient and track how many nonzeros per column we need to converge to a specified precision within a fixed number of iterations (here, 50).

Our conditional selection methods (shown in orange and red) need a nearly constant number of nonzeros with increasing problem size $N$. The nearest neighbor-based methods, however, start sharply increasing. This could hint at a qualitatively different regime in which the conditional methods discover hidden structure nearest neighbors cannot access.

Replacing nearest neighbors with conditional selection has the potential to significantly improve the accuracy of sparsity patterns and the statistical efficiency of the resulting approximate sparse Cholesky factor, resulting in faster and more accurate computation of statistical quantities of interest such as the posterior mean, posterior variance, likelihood, log determinant, and inversion of the covariance matrix.

Beyond Gaussian processes, efficient computation with large structured positive-definite matrices could accelerate optimization routines (through Newton's method and its variants), solving linear systems with iterative methods like the conjugate gradient by preconditioning, and more. Optimal transport, mirror descent, and the natural gradient are all possibilities.

By replacing geometric criteria hand-crafted towards a specific family
of covariance functions with relatively general information-theoretic
criteria, this work represents an important step in automating the
discovery of exploitable structure in complex positive-definite
matrices. Importantly, the selection methods only require access to the
entries of the matrix, *not* the underlying points, allowing it
to be applied in principle to any s.p.d. matrix.

$\begin{aligned} \text{Conditioning in Gaussian processes}
\iff \text{Gaussian elimination} \iff \text{Sodachi Oikura's
favorite} \end{aligned}$

さみしいも、たのしい。Page source. Last updated: .