Our motivating example is a regression problem that we model with Gaussian processes. In our setting, we have $N$ points whose corresponding covariance matrix $\Theta$ is formed by $\Theta_{i,j} := K(x_i, x_j)$, or pairwise evaluation of a kernel function $K(\cdot, \cdot)$. Commonly used kernels include the Matérn family of covariance functions, which includes the exponential kernel and the squared exponential kernel (also known as the Gaussian or RBF kernel) as special cases. The kernel function, along with a mean function $\mu(\cdot)$, defines a Gaussian process $f \sim \mathcal{GP}(\mu, K)$ such that for any set of points $X$, $f(X)$ is jointly distributed according to a multivariate Gaussian.
If we partition our points into training points, for which we know the value of $f$, and testing points, at which we wish to predict the value of $f$, then the Gaussian process modeling of $f$ enables us to compute the posterior distribution of the testing points in closed form as

$$y_{\text{Pr}} \mid y_{\text{Tr}} \sim \mathcal{N}\!\left(\mu_{\text{Pr}} + \Theta_{\text{Pr}, \text{Tr}} \Theta_{\text{Tr}, \text{Tr}}^{-1} (y_{\text{Tr}} - \mu_{\text{Tr}}),\; \Theta_{\text{Pr}, \text{Pr}} - \Theta_{\text{Pr}, \text{Tr}} \Theta_{\text{Tr}, \text{Tr}}^{-1} \Theta_{\text{Tr}, \text{Pr}}\right), \tag{1}$$

where $y_{\text{Tr}}$ and $y_{\text{Pr}}$ denote the values of $f$ at the training and prediction (testing) points, respectively. Note that our predictions naturally come equipped with uncertainty estimates from the posterior covariance. Due to the ability to measure uncertainty and to encode prior information in the kernel function, Gaussian processes and kernel methods enjoy widespread usage in geostatistics, machine learning, and optimal experimental design.
However, computing (1) requires inverting the covariance matrix of the training points. Additional statistical inference often requires computing quantities like the likelihood or the log determinant. For a dense covariance matrix $\Theta$, directly computing these quantities has a computational cost of $\mathcal{O}(N^3)$ and a memory cost of $\mathcal{O}(N^2)$, which is prohibitively expensive for large $N$. Can we efficiently approximate (1) while maintaining predictive accuracy?
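To make the cost concrete, here is a minimal NumPy sketch of the dense computation behind (1), assuming a zero prior mean and a squared exponential kernel; the function names (`sq_exp_kernel`, `gp_posterior`) are illustrative, not from any particular library.

```python
import numpy as np

def sq_exp_kernel(X, Y, length_scale=1.0):
    """Squared exponential (RBF) kernel matrix K(X, Y)."""
    sq_dists = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale**2))

def gp_posterior(X_train, y_train, X_test, kernel=sq_exp_kernel):
    """Dense GP posterior of (1): O(N^3) time and O(N^2) memory in the training size."""
    K_tt = kernel(X_train, X_train) + 1e-8 * np.eye(len(X_train))  # jitter for stability
    K_pt = kernel(X_test, X_train)
    K_pp = kernel(X_test, X_test)
    L = np.linalg.cholesky(K_tt)                     # the O(N^3) bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    V = np.linalg.solve(L, K_pt.T)
    mean = K_pt @ alpha                              # posterior mean
    cov = K_pp - V.T @ V                             # posterior covariance
    return mean, cov
```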
Spatial statisticians have long observed the screening effect: conditional on nearby points, more distant points have very little influence on the prediction. In other words, after conditioning on a few points, the conditional correlation localizes around the target point, implying approximate conditional sparsity. This is the idea behind the Vecchia approximation, which factors the joint density into conditionals over small conditioning sets,

$$p(y) \approx \prod_{k=1}^{N} p\!\left(y_{i_k} \mid y_{s_k}\right), \tag{2}$$

for some ordering $i_1, \ldots, i_N$ and sparsity sets $s_k \subseteq \{i_1, \ldots, i_{k-1}\}$ with $|s_k| \ll k$. The quality of the approximation depends significantly on the chosen ordering and sparsity pattern. In order to exploit the screening effect as much as possible, it is natural to take the nearest neighbors as the sparsity pattern. For the ordering, we would like to select points that have good coverage of the entire space. Towards that end, we form the reverse-maximin ordering by first selecting the last index $i_N$ arbitrarily and then choosing, for $k = N-1, \ldots, 1$, the index

$$i_k = \operatorname*{argmax}_{i \in -I_{k+1}} \; \min_{j \in I_{k+1}} \lVert x_i - x_j \rVert$$

where $-I := \{1, \ldots, N\} \setminus I$ and $I_n := \{i_n, i_{n+1}, \ldots, i_N\}$; that is, at every iteration we select the point furthest from the previously selected points. This ordering and sparsity pattern can be computed efficiently with geometric data structures like the k-d tree or ball tree.
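As a rough illustration (not the efficient k-d tree construction mentioned above), a quadratic-time sketch of the reverse-maximin ordering and a simplified nearest-neighbor sparsity pattern might look like the following; `reverse_maximin_order` and `nn_sparsity` are hypothetical helper names.

```python
import numpy as np
from scipy.spatial import cKDTree

def reverse_maximin_order(X):
    """Reverse-maximin ordering: pick the last index arbitrarily, then repeatedly
    take the point farthest from everything already selected. O(N^2) for clarity."""
    N = len(X)
    order = np.empty(N, dtype=int)
    order[N - 1] = 0                               # arbitrary choice of i_N
    dist = np.linalg.norm(X - X[0], axis=1)        # distance to the selected set
    for k in range(N - 2, -1, -1):
        i = int(np.argmax(dist))
        order[k] = i
        dist = np.minimum(dist, np.linalg.norm(X - X[i], axis=1))
    return order

def nn_sparsity(X, order, n_neighbors):
    """Sparsity sets s_k: nearest neighbors of x_{i_k} among earlier-ordered points.
    Simplified: query extra neighbors globally, then filter to earlier points."""
    rank = np.empty(len(order), dtype=int)
    rank[order] = np.arange(len(order))            # position of each point in the ordering
    tree = cKDTree(X)
    sparsity = []
    for pos, i in enumerate(order):
        _, nbrs = tree.query(X[i], k=min(len(X), 4 * n_neighbors + 1))
        earlier = [int(j) for j in np.atleast_1d(nbrs) if rank[j] < pos]
        sparsity.append(earlier[:n_neighbors])
    return sparsity
```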
From another perspective, the Vecchia approximation implicitly forms an approximate Cholesky factor of the covariance matrix (a lower triangular matrix $L$ such that $LL^\top \approx \Theta$). We can hope for (approximate) sparsity in the Cholesky factor because Cholesky factorization can be viewed as recursively computing the block decomposition

$$\Theta = \begin{pmatrix} \Theta_{1,1} & \Theta_{1,2} \\ \Theta_{2,1} & \Theta_{2,2} \end{pmatrix} = \begin{pmatrix} \mathrm{Id} & 0 \\ \Theta_{2,1} \Theta_{1,1}^{-1} & \mathrm{Id} \end{pmatrix} \begin{pmatrix} \Theta_{1,1} & 0 \\ 0 & \Theta_{2,2} - \Theta_{2,1} \Theta_{1,1}^{-1} \Theta_{1,2} \end{pmatrix} \begin{pmatrix} \mathrm{Id} & \Theta_{1,1}^{-1} \Theta_{1,2} \\ 0 & \mathrm{Id} \end{pmatrix},$$

where $\Theta_{2,2} - \Theta_{2,1} \Theta_{1,1}^{-1} \Theta_{1,2}$ is the conditional covariance in (1). Thus Cholesky factorization is a process of iteratively conditioning the underlying Gaussian process, which we hope will lead to approximate sparsity from the screening effect.
We can use the same ordering and sparsity pattern as the Vecchia approximation (2). In order to compute the entries of $L$, we can minimize the Kullback-Leibler (KL) divergence between two centered Gaussians with the true and approximate covariance matrices over the space $\mathcal{S}$ of lower triangular matrices satisfying the specified sparsity pattern,

$$L := \operatorname*{argmin}_{\hat{L} \in \mathcal{S}} \; \mathbb{D}_{\mathrm{KL}}\!\left(\mathcal{N}(0, \Theta) \,\middle\|\, \mathcal{N}\!\left(0, (\hat{L}\hat{L}^\top)^{-1}\right)\right).$$

Note that we want $LL^\top \approx \Theta^{-1}$, the inverse of the covariance matrix (sometimes called the precision matrix). While $\Theta$ encodes marginal relationships between the variables, $\Theta^{-1}$ encodes conditional relationships and therefore often leads to sparser factors.
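A classic example of this (included here as an aside, not from the original text) is Brownian motion observed at increasing times: the covariance $\min(s, t)$ is dense, but the Markov property makes the precision matrix tridiagonal, and hence its Cholesky factor bidiagonal, i.e., exactly sparse.

```python
import numpy as np

# Brownian motion observed at times 1, ..., n: Cov(s, t) = min(s, t).
n = 6
t = np.arange(1, n + 1, dtype=float)
Theta = np.minimum.outer(t, t)                 # dense covariance
Prec = np.linalg.inv(Theta)                    # precision matrix

print(np.round(Prec, 3))                       # tridiagonal: conditional independence
print(np.round(np.linalg.cholesky(Prec), 3))   # its Cholesky factor is bidiagonal
```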
Luckily, this optimization problem has an explicit closed-form solution,

$$L_{s_i, i} = \frac{\Theta_{s_i, s_i}^{-1} e_1}{\sqrt{e_1^\top \Theta_{s_i, s_i}^{-1} e_1}},$$

which computes the nonzero entries of the $i$-th column of $L$ in time $\mathcal{O}(|s_i|^3)$.
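In code, the closed-form column update is a single small linear solve per column; here is a sketch assuming the convention that $i$ is listed first in its own sparsity set $s_i$ (so $e_1$ picks out the diagonal entry).

```python
import numpy as np

def kl_optimal_column(theta_ss):
    """Nonzero entries L_{s_i, i} of column i, given Theta restricted to s_i
    (with i itself listed first). Cost: one dense solve, O(|s_i|^3)."""
    e1 = np.zeros(len(theta_ss))
    e1[0] = 1.0
    v = np.linalg.solve(theta_ss, e1)     # Theta_{s_i, s_i}^{-1} e_1
    return v / np.sqrt(v[0])              # v[0] = e_1^T Theta_{s_i, s_i}^{-1} e_1
```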
If we were to directly use a sparse Cholesky factor to approximate $\Theta_{\text{Tr}, \text{Tr}}^{-1}$ in (1), we would still need to compute $\Theta_{\text{Pr}, \text{Tr}}$, which could be prohibitive. Instead, we put "prediction points first" by forming the joint covariance matrix of training and prediction points,

$$\Theta = \begin{pmatrix} \Theta_{\text{Pr}, \text{Pr}} & \Theta_{\text{Pr}, \text{Tr}} \\ \Theta_{\text{Tr}, \text{Pr}} & \Theta_{\text{Tr}, \text{Tr}} \end{pmatrix},$$

with the prediction points ordered before the training points, and factoring this joint matrix directly; the posterior quantities in (1) can then be extracted from the joint factor. Plugging the optimal entries back into the objective, the KL divergence reduces (up to a constant factor and additive terms independent of the sparsity pattern) to the sum

$$\sum_{i=1}^{N} \left[ \log\!\left(\Theta_{i, i \mid s_i \setminus \{i\}}\right) - \log\!\left(\Theta_{i, i \mid i+1:}\right) \right].$$

This sum is the accumulated difference in posterior log variance for a series of independent regression problems: each to predict the $i$-th variable given a subset of the variables after it in the ordering. The term $\log(\Theta_{i, i \mid s_i \setminus \{i\}})$, obtained when restricting to the variables in the sparsity pattern $s_i$, is compared to the ground truth $\log(\Theta_{i, i \mid i+1:})$.
We can generalize this formula if multiple columns share the same sparsity pattern, a setting called aggregated or supernodal Cholesky factorization, which improves computational efficiency. If $\tilde{i}$ is a list of aggregated columns, its contribution to the KL divergence is

$$\sum_{i \in \tilde{i}} \log\!\left(\Theta_{i, i \mid s_i \setminus \{i\}}\right) = \log\det\!\left(\Theta_{\tilde{i}, \tilde{i} \mid \tilde{s}}\right),$$

or the posterior log determinant of the group's covariance matrix. However, this formula assumes the aggregated columns are adjacent in the ordering, which may not be the case.
In the case of non-adjacent columns, a sparsity entry can be below one column but above another, thus only partially conditioning the group when it is added to the sparsity pattern. Writing $y_{\parallel k}$ for the conditioning of the variables in $y$ by $k$ but skipping the first $p$, we obtain an analogous posterior log-determinant formula for the group's contribution that accounts for these partially conditioned variables.
The previous formulas all imply that the sparsity selection should minimize the posterior variance (or log determinant) of the target, both of which measure the uncertainty of the target. In other words, the selected entries should be maximally informative about the target. To make this notion precise, we can define the mutual information or information gain as

$$\mathbb{I}\!\left[y_{\text{Pr}}; y_{\text{Tr}}\right] := \mathbb{H}\!\left[y_{\text{Pr}}\right] - \mathbb{H}\!\left[y_{\text{Pr}} \mid y_{\text{Tr}}\right]$$

where $\mathbb{H}[X]$ is the entropy of the random variable $X$. Since the entropy of $y_{\text{Pr}}$ is constant, maximizing the mutual information is equivalent to minimizing the posterior entropy. For multivariate Gaussians the entropy is monotonically increasing in the log determinant of the covariance, so maximizing mutual information is indeed equivalent to minimizing posterior variance!
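Concretely (a standard fact, stated here for reference), the entropy of an $N$-dimensional Gaussian depends on its covariance only through the log determinant,

$$\mathbb{H}\!\left[\mathcal{N}(\mu, \Sigma)\right] = \tfrac{1}{2} \log\det(2\pi e\, \Sigma) = \tfrac{N}{2}\log(2\pi e) + \tfrac{1}{2}\log\det(\Sigma),$$

so minimizing posterior entropy is the same as minimizing the posterior log determinant.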
In addition, we observe an information-theoretic analogue of the EV-VE identity (the law of total variance),

$$\mathbb{H}\!\left[y_{\text{Pr}}\right] = \mathbb{H}\!\left[y_{\text{Pr}} \mid y_{\text{Tr}}\right] + \mathbb{I}\!\left[y_{\text{Pr}}; y_{\text{Tr}}\right],$$

where the posterior entropy can be identified with the posterior variance and the mutual information corresponds to the variance of the optimal estimator $\mathbb{E}[y_{\text{Pr}} \mid y_{\text{Tr}}]$, which can be intuitively understood as the extent to which the estimator actually varies with the data.
Finally, the expected mean squared error of the estimator $\mathbb{E}[y_{\text{Pr}} \mid y_{\text{Tr}}]$, defined as

$$\mathrm{MSE} := \mathbb{E}\!\left[\left(y_{\text{Pr}} - \mathbb{E}[y_{\text{Pr}} \mid y_{\text{Tr}}]\right)^2\right],$$

is simply the posterior variance: by the tower property the MSE equals $\mathbb{E}\!\left[\mathrm{Var}[y_{\text{Pr}} \mid y_{\text{Tr}}]\right]$, and for Gaussians the posterior variance does not depend on the observed values of $y_{\text{Tr}}$, so the expectation is the posterior variance itself.
Rather than use distance as a geometric proxy for importance, why not use information directly? The following toy example illustrates the benefits of selection by information.

In this 1-d regression problem, the goal is to predict the $y$-value of the target point, given the locations of many training points. However, the predictor can only use the $y$-values of two training points, so which points it ends up selecting will determine the prediction.
If one selects the nearest neighbors (left panel), both points are to
the right of the target and quite close to each other. This results in
an inaccurate prediction with high uncertainty.
A more balanced view of the situation is obtained when picking the slightly farther but ultimately more informative point to the left (right panel), improving the prediction and reducing the uncertainty. In higher dimensions this phenomenon is only exacerbated, as there are more "sides" of the target and so more reason to spread out the selected points.
The resulting selections on a 2-d grid with Matérn kernels of different
smoothnesses are visualized in this
YouTube video (I recommend
putting it on double time; it's quite satisfying).
It is intractable (and probably NP-hard) to select the optimal $k$ points out of $N$ points total, so we greedily select the most informative point, conditional on all points previously selected. The naive time complexity is $\mathcal{O}(Nk^4)$, but by maintaining a partial Cholesky factor we can reduce this to $\mathcal{O}(Nk^2)$. For multiple points, application of the matrix determinant lemma switches the roles of the prediction points and the training points, allowing the change in log determinant to be computed much more efficiently as the change in variance of the training point after conditioning on the prediction points. For $m$ targets the time complexity is $\mathcal{O}(Nk^2 + Nm^2 + m^3)$, gracefully reducing to the complexity of the single-point algorithm when $m = 1$. Finally, efficient application of rank-one downdating accounts for partially conditioned points at no additional asymptotic time cost.
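The following is a simplified single-target sketch of this greedy procedure (my own illustration, not the paper's optimized implementation); it assumes a scalar kernel callable `kernel(x, y)` and a 1-d `x_target`, and maintains the partial Cholesky columns explicitly so that each step only downdates running conditional (co)variances.

```python
import numpy as np

def greedy_select(kernel, X_cand, x_target, k):
    """Greedily pick k candidates that most reduce the target's posterior variance.

    A partial Cholesky factor over the selected pivots keeps each step at
    O(N * #selected) arithmetic, so the total cost is O(N k^2)."""
    N = len(X_cand)
    pts = np.vstack([X_cand, x_target[None, :]])            # target stored last
    t = N                                                   # index of the target
    cond_var = np.array([kernel(p, p) for p in pts])        # Var[y_i | selected]
    cond_cov = np.array([kernel(p, pts[t]) for p in pts])   # Cov[y_i, y_t | selected]
    U = np.zeros((N + 1, k))                                # partial Cholesky columns
    selected = []
    for step in range(k):
        # variance reduction of the target from conditioning on each candidate
        gain = np.full(N, -np.inf)
        mask = np.ones(N, dtype=bool)
        mask[selected] = False                              # never re-select
        gain[mask] = cond_cov[:N][mask] ** 2 / cond_var[:N][mask]
        j = int(np.argmax(gain))
        selected.append(j)
        # new partial Cholesky column for pivot j
        col = np.array([kernel(p, pts[j]) for p in pts]) - U[:, :step] @ U[j, :step]
        col /= np.sqrt(cond_var[j])
        U[:, step] = col
        # downdate conditional (co)variances given the newly selected point
        cond_var = cond_var - col ** 2
        cond_cov = cond_cov - col * col[t]
    return selected
```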
We can directly apply our algorithm to select the sparsity pattern of Cholesky factors. For a column of a Cholesky factor in isolation, the target point is the diagonal entry, candidates are below it, and the selected entries are added to the sparsity pattern. Points violating lower triangularity are not shown. Thus, sparsity selection in Cholesky factorization (left panel) is analogous to training point selection in directed Gaussian process regression (right panel).
We perform image classification on the MNIST dataset of handwritten digits (greyscale images of the digits 0-9) using a very simple $k$-nearest neighbors approach: select $k$ images, either with nearest neighbors ($k$-NN) or with conditional selection (C$k$-NN), take the most frequently occurring label among the selected images, and use that as the predicted label.
The $k$-NN algorithm (shown in blue) linearly decreases in accuracy with increasing $k$, as the bias-variance trade-off would predict. However, not only does C$k$-NN maintain a higher accuracy for all $k$, but its accuracy actually increases until around $k \approx 10$. This suggests conditioning is able to somehow "robustify" the selection procedure.
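As a toy illustration of how such a classifier could be wired together (reusing the hypothetical `greedy_select` and `sq_exp_kernel` sketches above, and not the actual experimental code):

```python
import numpy as np
from collections import Counter

def cknn_classify(X_train, y_train, x_query, k, length_scale=10.0):
    """Toy Ck-NN: pick k training images by greedy conditional selection,
    then take a majority vote over their labels."""
    kernel = lambda a, b: float(np.exp(-np.sum((a - b) ** 2) / (2 * length_scale**2)))
    idx = greedy_select(kernel, X_train, x_query, k)
    return Counter(int(y_train[i]) for i in idx).most_common(1)[0][0]
```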
Our selection algorithm can be viewed as the covariance equivalent of orthogonal matching pursuit. To see this, let $\Theta$ be a symmetric positive-definite (s.p.d.) matrix, so it admits a factorization $\Theta = F^\top F$ for feature vectors $F$. Performing a QR factorization $F = QR$ gives

$$\Theta = (QR)^\top (QR) = R^\top (Q^\top Q) R = R^\top R.$$

But $R$ is an upper triangular matrix, so by the uniqueness of the Cholesky factorization, $R$ is a Cholesky factor of $\Theta$. As the QR factorization performs iterative orthogonalization in the feature space $F$ through the Gram–Schmidt process, this corresponds to conditioning in $\Theta$.
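This identity is easy to check numerically; the snippet below is an illustrative check (with a sign fix, since QR is only unique up to the signs of $R$'s diagonal) that $R^\top$ matches the lower Cholesky factor of $\Theta = F^\top F$.

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.standard_normal((50, 10))       # feature vectors (one column per point)
Theta = F.T @ F                         # s.p.d. covariance matrix

_, R = np.linalg.qr(F)                  # QR in feature space
R = np.sign(np.diag(R))[:, None] * R    # make the diagonal of R positive
L = np.linalg.cholesky(Theta)           # Theta = L L^T

print(np.allclose(R.T, L))              # True: R is a Cholesky factor of Theta
```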
Motivated by this connection, we consider an experiment in which we randomly generate an a priori sparse Cholesky factor $L$ and attempt to recover it given only the measurements $Q := LL^\top$. We report the intersection over union (IOU) of the sparsity patterns. Conditional selection maintains a near-perfect accuracy with increasing problem size and factor density, while the other naive methods perform significantly worse.
The conjugate gradient method is a classical iterative algorithm for solving linear systems in s.p.d. matrices. In this experiment, we use a sparse Cholesky factor as a preconditioner for the conjugate gradient and track how many nonzeros per column we need to converge to a specified precision within a fixed number of iterations (here, 50).
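For concreteness, a sparse factor $L$ with $LL^\top \approx \Theta^{-1}$ plugs into SciPy's conjugate gradient as the preconditioner operator; this is a generic sketch of that wiring, not the experiment's actual code.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def preconditioned_cg(Theta, b, L, maxiter=50):
    """Solve Theta x = b by CG, preconditioned with L L^T ~ Theta^{-1}."""
    n_iters = 0
    def count(_):
        nonlocal n_iters
        n_iters += 1
    M = LinearOperator(Theta.shape, matvec=lambda r: L @ (L.T @ r))  # approximate inverse
    x, info = cg(Theta, b, M=M, maxiter=maxiter, callback=count)
    return x, info, n_iters
```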
Our conditional selection methods (shown in orange and red) need a nearly constant number of nonzeros with increasing problem size $N$. The nearest neighbor-based methods, however, start sharply increasing. This could hint at a qualitatively different regime in which the conditional methods discover hidden structure that nearest neighbors cannot access.
Replacing nearest neighbors with conditional selection has the potential
to significantly improve the accuracy of sparsity patterns and the
statistical efficiency of the resulting approximate sparse Cholesky
factor, resulting in faster and more accurate computation of statistical
quantities of interest such as the posterior mean, posterior variance,
likelihood, log determinant, and inversion of the covariance matrix.
Beyond Gaussian processes, efficient computation with large structured
positive-definite matrices could accelerate optimization routines
(through Newton's method and its variants), solving linear systems with
iterative methods like the conjugate gradient by preconditioning, and
more. Optimal transport, mirror descent, and the natural gradient are
all possibilities.
By replacing geometric criteria hand-crafted towards a specific family
of covariance functions with relatively general information-theoretic
criteria, this work represents an important step in automating the
discovery of exploitable structure in complex positive-definite
matrices. Importantly, the selection methods only require access to the entries of the matrix, not the underlying points, allowing them to be applied in principle to any s.p.d. matrix.