Many contemporary generative models for scientific fields are trained and deployed across a wide range of spatial resolutions. In such settings, it is natural to seek a continuum description: the target law is a Borel probability measure on an infinite-dimensional state space, training data are evaluations of random fields on arbitrary grids, and the learned model is an operator that can be queried consistently on new discretizations. Neural operator architectures have largely resolved the representational side of this objective: one can parameterize drifts, scores, or transport maps in a manner that is stable under refinement and supports evaluation on meshes unseen during training. The remaining obstacle is efficiency. Even when evaluation of the learned operator is nearly grid-independent in complexity (up to FFT or spectral costs), the end-to-end cost of sampling is dominated by the number of iterative updates required to reach a prescribed accuracy in distribution. In other words, once expressivity and resolution invariance are in place, the practical bottleneck becomes the mixing time of the sampling dynamics rather than the cost of a single model evaluation.
We focus on annealed Langevin-type samplers in Hilbert spaces, where the generative procedure is implemented by a sequence of stochastic steps driven by a learned drift (or score) at a range of noise levels. For such samplers, it is well understood in finite dimensions that performance is governed by geometry: the drift may be accurate, yet progress is throttled along poorly conditioned directions. This phenomenon is not an artifact of discretization. In function space, anisotropy is typically more severe: high- and low-frequency components have vastly different natural scales, and naive isotropic noise schedules can force the algorithm to take step sizes constrained by the stiffest modes. Consequently, the number of iterations required to achieve a fixed accuracy can become large, even if each iteration is resolution-invariant. From a continuum standpoint, this is precisely a preconditioning problem, and it suggests that learning only the drift is insufficient: to obtain genuine speedups, one must also adapt the covariance structure of the injected noise (equivalently, the reference Gaussian measure) that defines the geometry of the score and of the Langevin dynamics.
Our central premise is that the covariance schedule is the next essential degree of freedom once operator learning has provided resolution-invariant drift evaluation. We therefore propose to learn (i) a discretization-consistent drift operator and (ii) a family of trace-class covariance operators that act as learned preconditioners across annealing scales. The covariances are not treated as incidental hyperparameters but as trainable objects constrained to remain in an admissible set ensuring well-posedness in the infinite-dimensional setting. This changes the geometry in which the score is computed and in which the Langevin dynamics evolve, while preserving the underlying continuum formulation and enabling efficient sampling of the driving Gaussian random fields through structured representations (e.g. Fourier-diagonal or low-rank parameterizations).
Three considerations make this joint learning nontrivial in Hilbert space. First, Gaussian measures in infinite dimensions are mutually singular unless their covariances satisfy stringent equivalence conditions; the score is naturally defined along a Cameron–Martin space that depends on the reference measure. Second, the injected noise must be trace-class to define a bona fide stochastic evolution on the ambient space. Third, to retain discretization-independent guarantees, all bounds must be stable under Galerkin projection and must not degrade with dimension. These constraints motivate parameterizations that enforce uniform operator inequalities relative to a fixed baseline covariance. Under such constraints, the reference measures share a common Cameron–Martin space, the denoising objective is well-defined, and the resulting Langevin dynamics admit a continuum interpretation whose discretizations inherit dimension-free stability properties.
The contributions of this work are organized around this thesis.
We formulate a denoising score matching objective in a real separable Hilbert space in which the corruption noise is Gaussian with a trace-class covariance at each annealing scale. The learned drift targets a preconditioned score drift associated with the Radon–Nikodym derivative of the perturbed law with respect to the reference Gaussian measure. This formulation cleanly separates (a) the data distribution and its perturbations from (b) the choice of reference geometry, making explicit how covariance learning influences both the training signal and the sampling dynamics. Importantly, the objective is defined in function space; discretizations are only numerical surrogates used to approximate expectations and operator evaluations.
We introduce structured covariance parameterizations and regularizers that preserve trace-classness, positivity, and uniform equivalence to a baseline covariance. These admissibility conditions ensure that the family of Gaussian references remains mutually absolutely continuous and that the relevant Cameron–Martin derivatives are taken along a fixed subspace. On the sampling side, the same conditions support existence and uniqueness of solutions to the corresponding Langevin-type stochastic evolution (viewed as an SPDE or as an annealed sequence of such dynamics). This yields a principled foundation for learning preconditioners in infinite dimensions without silently changing the underlying measure-theoretic setting from scale to scale.
We provide an accuracy bound for the distribution produced by an annealed Euler–Maruyama sampler driven by the learned drift and the learned covariance schedule. The bound decomposes into three terms: a drift approximation component controlled by the mean-squared error of the learned drift under the intermediate perturbed laws, a discretization component controlled by the step sizes, and an annealing component capturing the mismatch induced by the finite noise schedule. Crucially, the constants depend on continuum quantities (such as Lipschitz/dissipativity parameters and uniform equivalence bounds for the covariances) and not on the discretization dimension. This establishes that, in the contractive regime, one can target a fixed Wasserstein accuracy with a number of sampling steps that does not grow with resolution, provided the drift approximation error and integrator stability are controlled uniformly.
In strongly log-concave regimes, the convergence rate of preconditioned Langevin dynamics is governed by a condition-number-like quantity in the metric induced by the covariance. We show that optimizing the covariance within an admissible family provably improves an upper bound on mixing time by reducing this proxy. Moreover, the improvement is not merely an artifact of the analysis: within the same covariance family and under first-order access, one cannot in general do better than the corresponding condition-number dependence. Thus, covariance learning is not only beneficial but, within structured families, essentially the correct and information-theoretically necessary lever for accelerating sampling. This perspective aligns with finite-dimensional preconditioning intuition while operating in a function-space framework in which the relevant geometry is encoded by operators rather than matrices.
From an algorithmic standpoint, the proposed method retains the operational simplicity of denoising-based training while adding a learnable preconditioning schedule that can be implemented efficiently in spectral bases. The resulting sampler trades expensive per-iteration costs for fewer iterations by improving contraction along difficult directions. Since the covariances are learned in the same continuum-first setting as the drift, the procedure is compatible with variable discretizations at both training and sampling time. In particular, one can exploit fast Gaussian random field sampling routines (FFT/KL) while maintaining the invariance properties expected of neural operator models.
The remainder of the paper develops these ideas systematically. We first recall the denoising drift–diffusion operator (DDO) framework in Hilbert space, emphasizing Gaussian reference measures, Cameron–Martin derivatives, and the function-space Langevin dynamics that underpin annealed sampling. We then introduce admissible covariance parameterizations, establish well-posedness and measure equivalence, and derive dimension-free error bounds and covariance-induced acceleration guarantees.
We briefly recall the drift–diffusion operator (DDO) viewpoint for denoising-based learning in a real separable Hilbert space H, emphasizing the role of Gaussian reference measures, Cameron–Martin derivatives, and the resulting function-space Langevin dynamics. Throughout, we treat all objects at the continuum level; discretizations enter only as numerical approximations of the operators and expectations defined below.
Let $C$ be a trace-class, positive, self-adjoint operator on $H$, and let $\mu_C := \mathcal N(0, C)$ denote the centered Gaussian Borel probability measure on $H$. The associated Cameron–Martin space is
$$
H_{\mu_C} := \mathrm{Im}\,(C^{1/2}) \subseteq H,
\qquad
\langle h_1, h_2\rangle_{H_{\mu_C}} := \langle C^{-1/2} h_1,\, C^{-1/2} h_2\rangle_H,
$$
where $C^{-1/2}$ is understood on $\mathrm{Im}\,(C^{1/2})$. We use the standard Cameron–Martin theorem: for any $h \in H_{\mu_C}$, the shifted measure $\mu_C^h := \mathcal N(h, C)$ is equivalent to $\mu_C$, with
$$
\frac{d\mu_C^h}{d\mu_C}(v)
=
\exp\Big( \langle C^{-1} h, v\rangle_H - \tfrac12 \|h\|_{H_{\mu_C}}^2 \Big),
\qquad \mu_C\text{-a.s.}
$$
In contrast, if $h \notin H_{\mu_C}$, then $\mu_C^h \perp \mu_C$. This dichotomy is the basic measure-theoretic reason that, in infinite dimensions, the ``score'' must be understood relative to a reference Gaussian and along $H_{\mu_C}$ rather than along the full space $H$.
Given a measurable function $\Phi : H \to \mathbb R$ and a point $u \in H$, we say that $\Phi$ is (Fréchet) differentiable along $H_{\mu_C}$ at $u$ if there exists a continuous linear functional $D_{H_{\mu_C}}\Phi(u) \in H_{\mu_C}^*$ such that
$$
\Phi(u + h) = \Phi(u) + D_{H_{\mu_C}}\Phi(u)[h] + o(\|h\|_{H_{\mu_C}}),
\qquad h \in H_{\mu_C}.
$$
We identify $H_{\mu_C}^*$ with $H_{\mu_C}$ via the Riesz isomorphism on $(H_{\mu_C}, \langle\cdot,\cdot\rangle_{H_{\mu_C}})$: for $\ell \in H_{\mu_C}^*$, we write $R\ell \in H_{\mu_C}$ for the unique element satisfying $\ell[h] = \langle R\ell, h\rangle_{H_{\mu_C}}$. Concretely, when $\ell[h] = \langle g, h\rangle_{H_{\mu_C}}$, then $R\ell = g$, and note the identity $\langle g, h\rangle_{H_{\mu_C}} = \langle C^{-1} g, h\rangle_H$.
Fix a scale $t$ and abbreviate $C_t := C_t(\varphi)$, $\mu_t := \mathcal N(0, C_t)$. We consider perturbations of the form
$$
v_t = A_t u + \eta_t,
\qquad u \sim \mu,\quad \eta_t \sim \mathcal N(0, C_t) \text{ independent},
$$
where either $A_t = \mathrm{Id}$ and $\mu(H_{\mu_t}) = 1$, or $A_t$ is a bounded linear smoother with $A_t(H) \subseteq H_{\mu_t}$. Let $\nu_t$ denote the law of $v_t$. Conditional on $u$, the random variable $v_t$ has law $\mathcal N(A_t u, C_t)$, i.e. a Cameron–Martin shift of $\mu_t$ by $A_t u$. Under the standing assumption that $A_t u \in H_{\mu_t}$ almost surely (either directly or by smoothing), the Cameron–Martin theorem yields the conditional Radon–Nikodym derivative
$$
\frac{d\mathcal N(A_t u, C_t)}{d\mu_t}(v)
=
\exp\Big( \langle C_t^{-1}A_t u, v\rangle_H - \tfrac12 \|A_t
u\|_{H_{\mu_t}}^2 \Big),
\qquad \mu_t\text{-a.s.}
$$
Averaging over $u$ shows $\nu_t \ll \mu_t$ with
$$
\frac{d\nu_t}{d\mu_t}(v)
=
\mathbb E_{u \sim \mu}\Big[ \exp\Big( \langle C_t^{-1} A_t u, v\rangle_H - \tfrac12 \|A_t u\|_{H_{\mu_t}}^2 \Big) \Big].
$$
We write $\Phi_t(v) := \log\big(\frac{d\nu_t}{d\mu_t}(v)\big)$, defined $\nu_t$-a.s. (and $\mu_t$-a.s. on the event where the right-hand side is finite and nonzero).
The DDO identity is the observation that the ``preconditioned score'' $R\,D_{H_{\mu_t}}\Phi_t$ coincides with a conditional mean of the latent clean component. Formally differentiating in a Cameron–Martin direction $h \in H_{\mu_t}$ gives
$$
D_{H_{\mu_t}}\Phi_t(v)[h]
=
\frac{\mathbb E\big[ \exp(\cdots)\,\langle C_t^{-1}A_t u,
h\rangle_H\big]}{\mathbb E\big[\exp(\cdots)\big]}
=
\Big\langle C_t^{-1}\,\mathbb E\big[A_t u \mid v_t = v\big],\,
h\Big\rangle_H
=
\Big\langle \mathbb E\big[A_t u \mid v_t = v\big],\,
h\Big\rangle_{H_{\mu_t}},
$$
where $\exp(\cdots)$ denotes the exponential weight in the Radon–Nikodym derivative above, and the conditional expectation is taken with respect to the posterior induced by the joint law of $(u, v_t)$. By the Riesz identification in $H_{\mu_t}$, we obtain the pointwise relation
$$
R\,D_{H_{\mu_t}}\Phi_t(v) = \mathbb E\big[A_t u \mid v_t = v\big].
$$
Consequently, the preconditioned drift field
$$
F(v, t; \varphi) := -v + R\,D_{H_{\mu_t}}\Phi_t(v)
$$
admits the equivalent denoising form
$$
F(v, t; \varphi) = -\,\mathbb E\big[\eta_t \mid v_t = v\big],
$$
since $v_t = A_t u + \eta_t$. This is precisely the identity that makes denoising objectives compatible with learning the drift driving Langevin-type sampling.
Consider the mean-squared denoising functional at scale $t$,
$$
\mathcal J_t(G) := \mathbb E_{u \sim \mu,\ \eta_t \sim \mathcal N(0, C_t)}\big\| \eta_t + G(A_t u + \eta_t) \big\|_H^2,
$$
defined for measurable $G : H \to H$ with $\mathbb E\|G(v_t)\|_H^2 < \infty$. By the Hilbert-space orthogonality principle (conditional expectation as $L^2$-projection), the minimizer satisfies
$$
G^*(v) = -\,\mathbb E\big[\eta_t \mid v_t = v\big],
\qquad \nu_t\text{-a.s.}
$$
Comparing with the denoising form of the preconditioned drift shows $G^*(\cdot) = F(\cdot, t; \varphi)$. Thus, training a parametrized operator $F_\theta(\cdot, t)$ by minimizing $\mathbb E\|\eta_t + F_\theta(v_t, t)\|_H^2$ targets the drift without requiring explicit access to $\Phi_t$ or its derivative. The role of the Gaussian reference is implicit but essential: it determines the Cameron–Martin geometry in which $\Phi_t$ is differentiable and in which the identity holds.
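The orthogonality argument above can be checked numerically in the simplest scalar Gaussian case. The following sketch (plain numpy; the variances `su2`, `se2` and all variable names are illustrative, not from the paper) fits the best linear denoiser by least squares and compares it with the closed-form conditional mean $-\mathbb E[\eta \mid v]$:

```python
import numpy as np

# Scalar toy model: u ~ N(0, su2), eta ~ N(0, se2), v = u + eta (A_t = Id).
# The minimizer of E|eta + G(v)|^2 is G*(v) = -E[eta | v] = -(se2/(su2+se2)) v,
# so fitting the best linear G by least squares should recover that slope.
rng = np.random.default_rng(0)
su2, se2 = 2.0, 0.5
n = 200_000
u = rng.normal(0.0, np.sqrt(su2), n)
eta = rng.normal(0.0, np.sqrt(se2), n)
v = u + eta

# Best linear fit G(v) = a*v minimizing E|eta + a*v|^2  =>  a = -E[eta v]/E[v^2].
a_hat = -np.mean(eta * v) / np.mean(v * v)
a_star = -se2 / (su2 + se2)   # closed-form conditional-mean slope
print(a_hat, a_star)
```

In higher dimensions the same least-squares structure holds, with the scalar slope replaced by an operator acting along the Cameron–Martin space.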
The drift is designed so that $\nu_t$ is invariant for a preconditioned Langevin evolution on $H$. Let $W^{C_t}$ be a $C_t$-Wiener process on $H$ (well-defined since $C_t$ is trace-class). The corresponding Langevin SPDE reads
$$
dU_s = \big( -U_s + R\,D_{H_{\mu_t}}\Phi_t(U_s) \big)\, ds + \sqrt{2}\, dW^{C_t}_s,
$$
where $t$ is fixed. Under standard dissipativity and Lipschitz hypotheses on $F(\cdot, t; \varphi)$, this SPDE is well posed and admits $\nu_t$ as an invariant law; heuristically, the term $-U_s$ represents the Ornstein–Uhlenbeck pullback associated with $\mu_t$, while $R\,D_{H_{\mu_t}}\Phi_t$ corrects the dynamics to target $\nu_t$ rather than $\mu_t$.
Annealed Langevin sampling uses a finite sequence of scales $t = 1, \dots, T$ and runs for a short time at each scale, with $\nu_t$ serving as an intermediate bridge between an initial Gaussian-like law and the target-like law at the final scale. In computation one typically replaces the SPDE by an Euler–Maruyama discretization: for a step size $h_t > 0$ and iterates $\{U_n\} \subset H$,
$$
U_{n+1} = U_n + h_t F_\theta(U_n,t) + \sqrt{2h_t}\,\xi_n,
\qquad
\xi_n \sim \mathcal N(0,C_t),
$$
where the learned field $F_\theta(\cdot, t)$ approximates $F(\cdot, t; \varphi)$ and the Gaussian increments $\xi_n$ implement the $C_t$-covariance noise. The continuum interpretation is that discretizations of the SPDE inherit their stability and convergence properties from the Hilbert-space formulation, provided the covariance is trace-class and the drift is controlled in the ambient norm.
We consider a continuum-first generative learning problem on a real separable Hilbert space $H$. Our data are i.i.d. samples $\{u_i\}_{i=1}^N \subset H$ drawn from an unknown target law $\mu$ on $(H, \mathcal B(H))$. The goal is to construct, from these samples, a discretization-consistent sampler whose output law $\hat\mu$ approximates $\mu$ in a transport metric (typically $W_2$), while keeping the number of drift evaluations and Gaussian random field (GRF) draws bounded independently of the numerical resolution used in computation.
Our approach is based on a multi-scale perturbation scheme indexed by
a finite set I = [T] = {1, …, T}.
At each scale t ∈ I
we introduce a centered Gaussian reference measure
μt = 𝒩(0, Ct(φ)),
where Ct(φ)
is a learnable covariance operator depending on parameters φ. Unlike fixed-noise denoising
formulations, we treat {Ct(φ)}t ∈ I
as part of the model to be learned jointly with the drift (or score)
operator, with the intent of shaping the geometry of the intermediate
distributions {νt} in a way
that improves mixing and reduces the number of sampling steps.
The operator inequalities constitute a uniform equivalence condition: they prevent degeneracy of the covariance at any scale and, crucially, imply that all Gaussian references $\{\mu_t\}$ share the same Cameron–Martin space. Indeed, by standard results on equivalent Gaussian measures, the uniform equivalence condition ensures
$$
H_{\mu_t} = \mathrm{Im}\,\big(C_t(\varphi)^{1/2}\big) = \mathrm{Im}\,\big(C_\star^{1/2}\big)
\quad \text{for all } t \in I,
$$
with equivalence of the corresponding Cameron–Martin norms up to constants depending only on $m$ and $M$. This shared Cameron–Martin structure is a primary technical requirement: it allows us to interpret Cameron–Martin derivatives, preconditioned scores, and stability estimates uniformly in $t$, and it prevents the learning problem from inadvertently moving between mutually singular reference geometries across scales.
The role of (A)–(B) is measure-theoretic rather than aesthetic. Conditional on u, the law of vt is 𝒩(Atu, Ct(φ)). In infinite dimensions, Gaussian shifts by vectors outside the Cameron–Martin space produce measures that are singular with respect to μt. Thus, (A) or (B) is needed to ensure Atu ∈ Hμt almost surely, so that 𝒩(Atu, Ct(φ)) is a Cameron–Martin shift of μt for μ-almost every u, and the mixture law νt of vt is absolutely continuous with respect to μt. In practice, (B) is often the relevant regime: a mild smoothing At can be used to reconcile rough data laws μ with the Cameron–Martin regularity induced by trace-class Ct(φ).
We write νt := ℒ(vt) for the marginal law of the corrupted sample at scale t. The intermediate measures {νt} serve two purposes. First, they provide the training distributions for denoising-based regression of the drift field at each t. Second, they define an annealing path from an initial, noise-dominated distribution to a final, data-dominated distribution, to be traversed by a Langevin-type sampler.
At each $t$, the quantity of interest is the preconditioned score drift $F(\cdot, t; \varphi)$ associated with $(\nu_t, \mu_t)$, defined (as in the preceding section) through the log-density $\Phi_t = \log(d\nu_t / d\mu_t)$ by
$$
F(u, t; \varphi) = -u + R\,D_{H_{\mu_t}}\Phi_t(u).
$$
This drift depends on $\varphi$ both explicitly through $C_t(\varphi)$ (hence through $\mu_t$ and the Cameron–Martin geometry) and implicitly through the induced corrupted law $\nu_t$. We approximate $F(\cdot, t; \varphi)$ by a parametrized, discretization-consistent map $F_\theta(\cdot, t) : H \to H$, typically realized as a neural operator so that evaluation on different spatial grids corresponds to the same underlying continuum mapping.
The learning problem is therefore genuinely coupled: changing φ modifies the target of the regression problem for Fθ, while changing θ modifies the quality of the sampler that will ultimately be used with the covariances Ct(φ). Our objective is to choose (θ, φ) so that (i) Fθ(⋅, t) closely matches F(⋅, t; φ) in L2(νt; H) uniformly over t, and (ii) the resulting annealed Langevin dynamics, using {Ct(φ)} as its diffusion covariances, mixes rapidly at each scale. The admissibility constraints above are simultaneously a well-posedness requirement (trace-class noise and shared Cameron–Martin space) and a stabilization mechanism preventing the learned covariance schedule from collapsing onto low-dimensional or excessively stiff geometries.
Although all definitions are posed on H, training and sampling are implemented through finite-dimensional discretizations (e.g. Galerkin projections, grid evaluations, spectral truncations). We assume that the dataset can be queried on arbitrary discretizations and that, at sampling time, the learned maps Fθ(⋅, t) and covariance samplers for Ct(φ) can be evaluated at the same discretization without re-training. This motivates the explicit separation between (i) continuum objects μ, νt, μt, F(⋅, t; φ), and (ii) their numerical approximations. The uniform equivalence plays a key role here: it provides a scale- and parameter-uniform control of norms and regularity in the underlying function space, which is precisely what is needed to formulate discretization-robust error bounds for the induced samplers.
The remainder of the paper makes this problem setup constructive by specifying (a) an explicit joint training objective for (θ, φ) based on denoising, and (b) structured parameterizations of {Ct(φ)} that enforce admissibility while permitting efficient GRF sampling and differentiation with respect to φ.
Our training criterion is a preconditioned Langevin–denoising
objective in which the covariance schedule is treated as a learnable
component of the model. For each scale t ∈ I, recall that we
observe corrupted samples vt = Atu + ηt
with ηt ∼ 𝒩(0, Ct(φ)),
and we seek a drift surrogate Fθ(⋅, t)
approximating the preconditioned score drift F(⋅, t; φ)
associated with (νt, μt).
The empirical counterpart of this goal is captured by the population risk
$$
\mathcal L(\theta, \varphi)
=
\sum_{t \in I} \mathbb E_{u \sim \mu,\ \eta_t \sim \mathcal N(0, C_t(\varphi))}
\big\| \eta_t + F_\theta(A_t u + \eta_t, t) \big\|_H^2
\;+\; \mathcal R(\varphi),
$$
where $\mathcal R$ is a regularizer enforcing admissibility and additional structure. The appearance of $\eta_t$ (rather than an explicit score) is deliberate: it yields a regression problem with a readily sampled target and admits an implementation by reparameterized Gaussian draws. Formally, under the standing assumptions ensuring $\nu_t \ll \mu_t$ and the existence of $D_{H_{\mu_t}}\Phi_t$,
the minimizer over sufficiently rich function classes satisfies the
usual denoising identity: at fixed t, the optimal drift field in L2(νt; H)
coincides with the preconditioned score drift F(⋅, t; φ). Hence
the dependence of ℒ on φ is essential rather than
ancillary: changing φ changes
the reference μt, the
Cameron–Martin geometry, and the induced corrupted law νt, and
therefore changes the regression target itself.
We enforce the operator inequalities $m\,C_\star \preceq C_t(\varphi) \preceq M\,C_\star$ by adopting structured families in which admissibility is built in at the continuum level. A convenient template (used repeatedly below) is the congruence form
$$
C_t(\varphi) = C_\star^{1/2}\, S_t(\varphi)\, C_\star^{1/2},
$$
where $S_t(\varphi)$ is bounded, self-adjoint, positive, and satisfies $\sigma(S_t(\varphi)) \subseteq [m, M]$ for all $t$. Then $C_t(\varphi)$ is trace-class and automatically satisfies the operator inequalities; moreover, the corresponding Cameron–Martin spaces coincide with $\mathrm{Im}\,(C_\star^{1/2})$, and all $t$-dependent norms are equivalent with constants depending only on $m, M$. In what follows we therefore focus on designing $S_t(\varphi)$ with (i) spectrum control, (ii) efficient sampling of $\eta_t$, and (iii) differentiability with respect to $\varphi$ as required by gradient-based training.
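As a sanity check on the congruence template, the following sketch (numpy, with an illustrative 4-mode truncation; all names are ours) builds $C = C_\star^{1/2} S\, C_\star^{1/2}$ with $\sigma(S) \subseteq [m, M]$ and verifies the operator inequalities $m\,C_\star \preceq C \preceq M\,C_\star$ by testing positive semidefiniteness of the two differences:

```python
import numpy as np

# Sketch: verify that the congruence form C = Cstar^{1/2} S Cstar^{1/2}, with
# spec(S) in [m, M], yields m*Cstar <= C <= M*Cstar in the operator order.
rng = np.random.default_rng(1)
m, M = 0.5, 2.0
lam = np.array([1.0, 0.5, 0.25, 0.125])          # trace-class baseline spectrum
Cstar = np.diag(lam)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))     # random orthogonal basis
S = Q @ np.diag(rng.uniform(m, M, 4)) @ Q.T      # self-adjoint, spec(S) in [m, M]
root = np.diag(np.sqrt(lam))                     # Cstar^{1/2}
C = root @ S @ root

# The inequalities hold iff C - m*Cstar and M*Cstar - C are both PSD.
lo = np.linalg.eigvalsh(C - m * Cstar).min()
hi = np.linalg.eigvalsh(M * Cstar - C).min()
print(lo >= -1e-12, hi >= -1e-12)
```

The check succeeds for any such $S$ because $C - m\,C_\star = C_\star^{1/2}(S - mI)\,C_\star^{1/2}$ is a congruence of a positive semidefinite operator, and likewise for the upper bound.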
Assume that $C_\star$ is diagonal in a known orthonormal basis $\{e_k\}_{k \ge 1} \subset H$ (e.g. the Fourier basis on a periodic domain), with $C_\star e_k = \lambda_k^\star e_k$, $\sum_k \lambda_k^\star < \infty$. A particularly efficient family is the diagonal schedule
$$
C_t(\varphi)\, e_k = s_{t,k}\, \lambda_k^\star\, e_k,
\qquad s_{t,k} \in [m, M].
$$
In this case GRF sampling is trivial: if $\{\xi_k\}_{k \ge 1}$ are i.i.d. standard normals,
$$
\eta_t = \sum_{k \ge 1} \sqrt{s_{t,k}\, \lambda_k^\star}\; \xi_k\, e_k,
$$
which is implemented by truncating to the active discretization modes and using FFTs for basis transforms when needed. To parameterize the spectral multipliers while respecting bounds, we may set
$$
s_{t,k} = m + (M - m)\, \sigma(\alpha_{t,k}),
$$
with unconstrained parameters $\alpha_{t,k} \in \mathbb R$ and a smooth squashing map $\sigma : \mathbb R \to (0, 1)$ (e.g. logistic). This yields exact satisfaction of the operator inequalities and avoids nondifferentiable projection. Additionally, we can impose mild schedule regularity across $t$ by penalizing discrete variations,
$$
\sum_{t=1}^{T-1} \sum_{k \in \mathcal K} \big( s_{t+1,k} - s_{t,k} \big)^2,
$$
for the set $\mathcal K$ of represented modes; this reduces spurious oscillations of the geometry between adjacent annealing scales.
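The diagonal construction can be sketched in a few lines of numpy; the baseline spectrum `lam_star` and all parameter values are illustrative assumptions, not choices made in the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Diagonal schedule sketch: s_{t,k} = m + (M - m)*sigmoid(alpha_{t,k}), so the
# multipliers lie strictly inside (m, M) and the operator bounds hold exactly.
rng = np.random.default_rng(2)
m, M = 0.5, 2.0
K = 64
lam_star = 1.0 / (1.0 + np.arange(K)) ** 2       # summable baseline spectrum
alpha = rng.normal(size=K)                       # unconstrained parameters
s = m + (M - m) * sigmoid(alpha)                 # spectral multipliers
lam_t = s * lam_star                             # eigenvalues of C_t

# GRF sampling is diagonal: eta = sum_k sqrt(s_k lambda_k^star) xi_k e_k.
xi = rng.normal(size=K)
eta_coeffs = np.sqrt(lam_t) * xi                 # coefficients of one draw

trace_t = lam_t.sum()                            # Tr(C_t) <= M * Tr(Cstar)
print(s.min() > m, s.max() < M, trace_t <= M * lam_star.sum())
```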
Purely diagonal schedules cannot capture correlations across modes induced by nonstationarity or global constraints in the data. To enlarge expressivity while preserving tractability, we may augment a diagonal $S_t$ by a finite-rank term:
$$
S_t = D_t + U_t^* U_t,
$$
where $D_t$ is diagonal in the reference basis and $U_t : H \to \mathbb R^r$ (equivalently $U_t^* : \mathbb R^r \to H$) has rank $r \ll$ (active dimension). Exact spectral enforcement $\sigma(S_t) \subseteq [m, M]$ for this family can be achieved by construction only in special cases; in practice we therefore combine (i) a base diagonal component parameterized as in the diagonal family above, with strict bounds $D_t \succeq m I$ and $\|D_t\| \le M$, and (ii) a soft spectral penalty controlling departures from $[m, M]$ on the active discretization:
$$
\sum_{t \in I} \Big[ \big( \lambda_{\min}(S_t) - m \big)_-^2 + \big( \lambda_{\max}(S_t) - M \big)_+^2 \Big],
$$
where $(x)_\pm = \max\{\pm x, 0\}$ and $\lambda_{\min}, \lambda_{\max}$ are computed on the finite-dimensional truncation. This empirical enforcement is consistent with the continuum intent when the parameterization is resolution-invariant and the rank $r$ is fixed as resolution increases. Sampling from $\mathcal N(0, C_t)$ with a diagonal-plus-low-rank $S_t$ can be done by drawing a diagonal component plus a low-rank component, using a Woodbury-type factorization on the discretization; gradients are correspondingly efficient since all dependence on $\varphi$ passes through diagonal scalings and $r$-dimensional linear algebra.
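The factorized sampling step can be made concrete as follows (numpy sketch; the dimensions `n`, `r` and the scaling of the low-rank factor are illustrative). Writing the rectangular factor $A = [D^{1/2} \mid V]$ with $V = U_t^*$, a single matrix–vector product yields a draw whose covariance is exactly $D + V V^\top$:

```python
import numpy as np

# Sketch of sampling N(0, C) with C = D + V V^T (V = U_t^*, rank r), on an
# active discretization with n modes. Drawing eta = D^{1/2} xi1 + V xi2 with
# independent standard normals realizes exactly this covariance, since
# Cov(eta) = D^{1/2} I D^{1/2} + V I V^T = D + V V^T.
rng = np.random.default_rng(3)
n, r = 8, 2
d = rng.uniform(0.5, 2.0, n)                     # diagonal part D
V = 0.1 * rng.normal(size=(n, r))                # low-rank factor
C = np.diag(d) + V @ V.T

# The sampling map is the rectangular factor A = [D^{1/2} | V]: Cov = A A^T.
A = np.hstack([np.diag(np.sqrt(d)), V])
xi = rng.normal(size=n + r)
eta = A @ xi                                     # one draw from N(0, C)

print(np.allclose(A @ A.T, C))
```

The same rectangular factor underlies Woodbury-type identities when inverses or log-determinants of $C$ are needed, at $O(n r^2)$ cost.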
On domains where Fourier diagonalization is not appropriate (nonperiodic boundaries, manifolds, irregular geometries), we may define $C_t(\varphi)$ implicitly through an elliptic SPDE. A canonical choice is a Matérn-type precision operator $L_t(\varphi)$ (self-adjoint, positive) with
$$
C_t(\varphi) = L_t(\varphi)^{-1},
\qquad
L_t(\varphi) = \tau_t\, \big( \kappa_t^2 I - \Delta \big)^{p},
$$
with hyperparameters $\varphi = (\tau_t, \kappa_t)_{t \in I}$ and fixed integer $p$ chosen so that $C_t$ is trace-class in the spatial dimension of interest. Sampling proceeds by the factorization $\eta_t = L_t(\varphi)^{-1/2} \xi$ with $\xi$ white noise, implemented numerically by solving a linear elliptic problem (or iterated solves when $p > 1$) at the working discretization. Admissibility relative to $C_\star$ is enforced by restricting $(\tau_t, \kappa_t)$ to compact intervals and, when needed, choosing $C_\star$ within the same SPDE family (so that the uniform operator inequalities follow from uniform ellipticity bounds). This yields a parameterization with a small number of physically interpretable degrees of freedom and a sampling cost comparable to a few PDE solves per Langevin step.
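On a 1D periodic grid the precision operator is Fourier-diagonal, so the elliptic solve degenerates to a spectral scaling. The sketch below (numpy; the parameter values are illustrative, and the normalization constants of continuum white noise are omitted) draws one approximate sample and checks positivity of the spectrum and finiteness of the trace:

```python
import numpy as np

# Sketch of Matern-type sampling on a 1D periodic grid, where the precision
# L = tau * (kappa^2 I - Laplacian)^p is Fourier-diagonal with eigenvalues
# tau * (kappa^2 + w_k^2)^p; eta = L^{-1/2} xi reduces to a spectral scaling.
rng = np.random.default_rng(4)
n, tau, kappa, p = 128, 1.0, 2.0, 2
w = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / n)   # angular Fourier frequencies
ell = tau * (kappa**2 + w**2) ** p               # eigenvalues of L (all positive)

xi_hat = np.fft.fft(rng.normal(size=n))          # white noise in Fourier space
eta = np.real(np.fft.ifft(xi_hat / np.sqrt(ell)))  # one approximate draw

trace_C = np.sum(1.0 / ell)                      # Tr(C) = sum of 1/eigenvalues
print(ell.min() > 0.0, np.isfinite(trace_C), eta.shape)
```

On nonperiodic domains the division by $\sqrt{\ell}$ is replaced by a sparse elliptic solve (finite elements or finite differences) at the working discretization.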
When $H = L^2(D)$ and $C_t(\varphi)$ arises as an integral operator with kernel $k_t^\varphi(x, y)$, we can parameterize $k_t^\varphi$ by stationary kernels (e.g. squared exponential, Matérn) or mixtures thereof. On regular grids, stationary kernels are diagonal in Fourier space and reduce to the diagonal schedule above, with $s_{t,k}$ determined by a low-dimensional spectral density. This connects kernel hyperparameters directly to the spectrum of $C_t$, allowing us to enforce bounds by constraining the spectral density between $m$ and $M$ multiples of the baseline density.
Beyond hard or soft enforcement of the admissibility constraints, we include terms promoting numerical stability and efficient sampling. Typical choices include: (i) trace control, e.g. $\sum_t (\mathrm{Tr}(C_t) - \mathrm{Tr}(C_\star))^2$, to prevent excessive noise at intermediate scales; (ii) smoothness across $t$, as in the discrete-variation penalty above; and (iii) structural sparsity (for low-rank schedules) via $\|U_t\|_{\mathrm{HS}}^2$ or nuclear-norm surrogates. Importantly, all such penalties are posed in terms of continuum-consistent quantities (trace, Hilbert–Schmidt norms, spectral multipliers) so that their effect does not depend qualitatively on discretization.
Optimization of $\mathcal L(\theta, \varphi)$ requires differentiating through $\eta_t \sim \mathcal N(0, C_t(\varphi))$. We use the standard reparameterization
$$
\eta_t = C_t(\varphi)^{1/2}\, \xi,
\qquad \xi \sim \mathcal N(0, I),
$$
implemented by spectral scaling in the Fourier-diagonal case, by low-rank factorizations in the diagonal-plus-low-rank case, or by elliptic solves in the SPDE case. This yields unbiased stochastic gradients of $\mathcal L$ with respect to $\varphi$ under the usual differentiability assumptions on the map $\varphi \mapsto C_t(\varphi)^{1/2}$ on the active discretization. In the diagonal case the derivatives are explicit and cheap; in SPDE families, automatic differentiation through the linear solve (or an adjoint solve) provides gradients with a cost comparable to sampling itself. Since the loss is expressed in the ambient $H$-norm, we additionally ensure that discretized implementations use quadrature-consistent inner products so that training remains invariant under refinement (up to the numerical approximation of $F_\theta$ and of the sampling map $\varphi \mapsto \eta_t$).
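In the diagonal case the pathwise gradient is explicit, which makes it easy to validate reparameterized differentiation. The sketch below (numpy; the toy loss $f(\alpha) = \|\eta(\alpha)\|^2$ and all names are illustrative) compares the analytic pathwise gradient with central finite differences at a fixed noise draw $\xi$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Pathwise (reparameterization) gradients in the diagonal case:
# eta(alpha) = sqrt(s(alpha)) * xi, s = m + (M - m)*sigmoid(alpha), xi fixed.
# For the toy loss f(alpha) = ||eta||^2 = sum_k s_k xi_k^2, the pathwise
# gradient is explicit and can be checked against finite differences.
rng = np.random.default_rng(5)
m, M, K = 0.5, 2.0, 16
alpha = rng.normal(size=K)
xi = rng.normal(size=K)

def loss(a):
    s = m + (M - m) * sigmoid(a)
    return np.sum(s * xi**2)

sig = sigmoid(alpha)
grad = (M - m) * sig * (1.0 - sig) * xi**2       # analytic pathwise gradient

eps = 1e-6
fd = np.array([(loss(alpha + eps * np.eye(K)[k]) - loss(alpha - eps * np.eye(K)[k]))
               / (2 * eps) for k in range(K)])   # central finite differences
print(np.max(np.abs(grad - fd)))
```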
In summary, the PL-DDO formulation couples a denoising regression objective with a constrained, continuum-defined covariance schedule. The structured families above are designed so that (i) admissibility and common Cameron–Martin geometry are maintained uniformly, (ii) GRF sampling remains fast at arbitrary resolutions, and (iii) gradients with respect to φ can be computed reliably. These properties are the prerequisites for the sampling procedure developed next, where the learned Ct(φ) is used as the diffusion covariance in an annealed, preconditioned Langevin scheme.
Having learned a drift surrogate Fθ(⋅, t) together with an admissible covariance schedule Ct(φ), we define a sampler by discretizing a preconditioned Langevin dynamics on H whose diffusion at scale t is precisely Ct(φ). Throughout, we treat the annealing index t ∈ I = {1, …, T} as piecewise-constant ``time’’ and run a short Markov chain at each t, warm-started from the previous scale.
Fix a scale $t$. The ideal continuum dynamics corresponding to the preconditioned score drift $F(\cdot, t; \varphi)$ is the $H$-valued stochastic evolution
$$
dU_s = F(U_s, t; \varphi)\, ds + \sqrt{2}\, dW^{C_t(\varphi)}_s,
$$
where $W^{C_t(\varphi)}$ denotes a $C_t(\varphi)$-Wiener process. Replacing $F$ by $F_\theta$ yields the learned dynamics. In the annealed setting we run for a finite horizon at each $t$, with the terminal state at $t$ serving as the initial condition at $t + 1$. We emphasize that $C_t(\varphi)$ plays a dual role: it specifies the diffusion covariance in the dynamics and, through the training problem, it shapes the Cameron–Martin geometry in which the learned drift is an approximation to the true preconditioned score drift.
Let $h_t > 0$ be a step size at scale $t$, and let $M$ be the number of inner steps per scale. On a finite-dimensional discretization $H_n \subset H$ (e.g. a Galerkin space) we use the explicit EM update
$$
u^{(n+1)} = u^{(n)} + h_t\, F_\theta(u^{(n)}, t) + \sqrt{2 h_t}\; \xi_t^{(n)},
\qquad \xi_t^{(n)} \sim \mathcal N(0, C_t(\varphi)),
$$
where $\{\xi_t^{(n)}\}$ are independent across iterations $n$ and across scales $t$. In practice we draw $\xi_t^{(n)}$ via the reparameterization $\xi_t^{(n)} = C_t(\varphi)^{1/2} z^{(n)}$ with $z^{(n)} \sim \mathcal N(0, I)$ in the chosen basis. The update is the natural discretization of the continuum dynamics and is easy to implement when $F_\theta$ is a neural operator and $C_t(\varphi)$ admits fast sampling.
At the continuum level, explicit EM is sensitive to stiffness in the drift. In our setting this stiffness is particularly visible because the preconditioned drift decomposes as
$$
F_\theta(u, t) = -u + G_\theta(u, t),
$$
so that the linear damping term $-u$ is always present, while $G_\theta$ may be only locally Lipschitz on typical data regimes. This observation motivates a semi-implicit alternative that treats $-u$ more stably.
A standard stabilization in (infinite-dimensional) Langevin integrators is to apply a CN discretization to the linear Ornstein–Uhlenbeck component and keep the nonlinear part explicit. Concretely, applying CN to $-u$ and EM to $G_\theta$ yields
$$
u^{(n+1)} = u^{(n)} - \tfrac{h_t}{2}\big( u^{(n)} + u^{(n+1)} \big) + h_t\, G_\theta(u^{(n)}, t) + \sqrt{2 h_t}\; \xi_t^{(n)}.
$$
Solving for $u^{(n+1)}$ gives the explicit affine form
$$
u^{(n+1)} = a_t\, u^{(n)} + b_t\, G_\theta(u^{(n)}, t) + \frac{\sqrt{2 h_t}}{1 + h_t/2}\; \xi_t^{(n)},
\qquad
a_t = \frac{1 - h_t/2}{1 + h_t/2},
\quad
b_t = \frac{h_t}{1 + h_t/2}.
$$
Two properties of this scheme are particularly useful for our purposes. First, the linear factor satisfies $|a_t| < 1$ for all $h_t > 0$, which prevents the explicit instability that can arise in the EM update when $h_t$ is not sufficiently small relative to the linear damping. Second, when $G_\theta \equiv 0$ the CN scheme preserves the correct Gaussian invariant law for the linear preconditioned OU dynamics (up to discretization effects in $H_n$), a feature that is aligned with the role of $\mu_t = \mathcal N(0, C_t(\varphi))$ as the reference measure at each scale. For these reasons, CN-type updates are often preferable at small noise levels (late in the annealing) where stability constraints are most restrictive.
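The stability comparison can be made exact on the linear test problem $G_\theta \equiv 0$ in a single mode with noise variance $c$. Iterating the variance recursions induced by the two updates (a short numpy sketch; the values of $h$ and $c$ are illustrative) shows that explicit EM equilibrates at a step-size-biased variance while the CN scheme recovers the exact invariant variance $c$, consistent with the invariance property noted above:

```python
import numpy as np

# One-mode linear test problem (G_theta = 0, noise variance c, target variance c).
# EM:  u <- (1-h) u + sqrt(2h) xi        => variance recursion v <- (1-h)^2 v + 2hc
# CN:  u <- a u + sqrt(2h)/(1+h/2) xi    => variance recursion v <- a^2 v + 2hc/(1+h/2)^2
h, c = 0.5, 1.0
a = (1 - h / 2) / (1 + h / 2)                    # CN linear factor, |a| < 1

v_em, v_cn = 0.0, 0.0
for _ in range(200):                             # iterate to the fixed points
    v_em = (1 - h) ** 2 * v_em + 2 * h * c
    v_cn = a**2 * v_cn + 2 * h * c / (1 + h / 2) ** 2

print(v_em, c / (1 - h / 2), v_cn, c)            # EM is biased, CN is exact
```

The EM fixed point is $c/(1 - h/2)$, an $O(h)$ inflation of the target variance, while the CN fixed point equals $c$ for every $h > 0$ in this linear setting.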
We stress that both the EM and the CN updates use exactly the same learned ingredients (Fθ, Ct(φ)); the distinction is purely numerical. Our theory in the next section is stated for the continuum dynamics and for stable discretizations; empirically, the CN variant enlarges the range of usable step sizes without sacrificing resolution invariance.
We initialize at the coarsest/noisiest scale by drawing
$$
u_1^{(0)} \sim \mathcal N\big(0, C_1(\varphi)\big),
$$
and then, for $t = 1, \dots, T$, we run $M$ iterations of either the EM or the CN update with covariance $C_t(\varphi)$, producing a terminal state $u_t^{(M)}$. We set $u_{t+1}^{(0)} := u_t^{(M)}$ and proceed. The output is $u_T^{(M)}$, which we interpret as an approximate sample from $\nu_T$ (and, in regimes where $\nu_T \approx \mu$, from $\mu$). This warm-started annealing is essential: it amortizes mixing by transporting mass through a sequence of intermediate geometries rather than attempting to reach $\nu_T$ in one long chain.
Because Ct(φ) is learned, the geometry of the sampler varies with φ and can change markedly across t. We therefore choose {ht} in a manner that is robust to such variation. At a minimum, stability of explicit EM suggests taking ht inversely proportional to an estimate of the Lipschitz constant of Fθ(⋅, t) on typical states. In practice we approximate this by one of the following proxies on the active discretization: (i) a running estimate of ∥∇uFθ(u, t)∥ via automatic differentiation on minibatches of chain states; (ii) a spectral proxy based on the linear part, namely ensuring ht is not so large that the effective contraction (1 − ht) becomes negative; or (iii) a conservative schedule decreasing with t when t corresponds to decreasing noise (late scales are typically more ``peaked’’ and hence less forgiving).
For CN updates, we may choose larger ht because the linear stiffness is controlled by |at| < 1. Nevertheless, the nonlinear term Gθ still imposes a practical limit: excessively large bt leads to poor approximation of the target drift and can cause transient divergence before the −u term contracts. A simple and effective heuristic is to pick ht so that bt remains below a prescribed threshold (e.g. bt ≤ b̄) while allowing at to be close to 1 at early (high-noise) scales and smaller at late (low-noise) scales. We also note that the uniform equivalence bounds mC⋆ ≼ Ct(φ) ≼ MC⋆ imply that the typical noise energy and the induced Cameron–Martin norm remain comparable across t; this prevents the schedule from producing degenerate diffusion directions and thereby supports stable choices of ht that do not depend on discretization dimension.
All updates are executed on a finite-dimensional representation Hn determined by the query discretization. The drift Fθ(⋅, t) is evaluated as a discretization-consistent neural operator, and ξt(n) is drawn by sampling the truncated Gaussian 𝒩(0, PnCt(φ)Pn), where Pn : H → Hn is the projection associated with the discretization. Provided the sampling map Ct(φ)1/2 is implemented in a manner compatible with Pn (Fourier truncation, KL truncation, or discretized SPDE solves), the resulting chain is consistent under refinement in the sense that it approximates the same continuum algorithm, rather than a dimension-dependent surrogate.
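A minimal sketch of such a discretization-compatible Gaussian draw, via a Karhunen–Loève truncation in an L2-orthonormal sine basis on [0, 1] (the basis, eigenvalue decay, and helper name are illustrative assumptions): the law of the retained modes is independent of the evaluation grid, so refining the grid does not change the continuum algorithm being approximated.

```python
import numpy as np

def sample_grf_kl(n_grid, n_modes, eigs, rng=None):
    """Draw N(0, Pn C Pn) by sampling KL coefficients sqrt(lambda_j) xi_j
    and evaluating the truncated expansion on a midpoint grid of [0, 1]."""
    rng = np.random.default_rng(rng)
    x = (np.arange(n_grid) + 0.5) / n_grid
    j = np.arange(1, n_modes + 1)
    phi = np.sqrt(2.0) * np.sin(np.pi * np.outer(j, x))  # orthonormal in L2(0,1)
    xi = rng.standard_normal(n_modes)
    return (np.sqrt(eigs) * xi) @ phi

eigs = np.arange(1, 33) ** -2.0                  # trace-class spectral decay
u_coarse = sample_grf_kl(64, 32, eigs, rng=0)    # same KL coefficients,
u_fine = sample_grf_kl(256, 32, eigs, rng=0)     # evaluated on a finer grid
```

With a fixed seed the two calls share their KL coefficients, so u_fine is a refinement of the same continuum sample rather than an unrelated draw.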
The purpose of this section is to make explicit the algorithmic object to be analyzed: an annealed, preconditioned Langevin chain whose preconditioner is learned but constrained to an admissible operator family. In the next section we justify that, under the standing spectral bounds on Ct(φ) and standard monotonicity/Lipschitz hypotheses on the drift, the underlying continuum dynamics is well posed and the measures appearing in the construction are mathematically meaningful, so that the training and sampling procedures are posed on H rather than tied to a particular discretization.
We justify that, under our standing admissibility and regularity hypotheses, the continuum objects underlying training and sampling are mathematically meaningful on H: (i) the reference measures μt = 𝒩(0, Ct(φ)) and the perturbed laws νt are well defined and satisfy νt ∼ μt; (ii) the log-density Φt = log (dνt/dμt) admits a Cameron–Martin derivative so that the preconditioned score drift F(⋅, t; φ) is defined νt-a.s.; (iii) the continuum dynamics with time-varying covariance is well posed; and (iv) the denoising objective is finite.
Our basic structural requirement is the uniform operator inequality
mC⋆ ≼ Ct(φ) ≼ MC⋆ for all t ∈ I,
for fixed 0 < m ≤ M < ∞ and trace-class C⋆. First, the upper bound implies each Ct(φ) is trace-class with the uniform bound Tr(Ct(φ)) ≤ M Tr(C⋆), and hence ηt ∼ 𝒩(0, Ct(φ)) is H-valued with 𝔼∥ηt∥2 = Tr(Ct(φ)).
Second, the equivalence yields a shared Cameron–Martin space: by standard operator monotonicity arguments (e.g. comparing Ct1/2 and C⋆1/2 via functional calculus), we have
Im (Ct(φ)1/2) = Im (C⋆1/2) =: Hμ,
and the corresponding Cameron–Martin norms are uniformly equivalent. In particular, translations by elements of Hμ preserve measure equivalence for all μt, uniformly in t.
A convenient sufficient mechanism for the uniform inequality is the congruence parameterization
Ct(φ) = C⋆1/2 St(φ) C⋆1/2,
where St(φ) is bounded, self-adjoint, positive, and σ(St(φ)) ⊂ [m, M]. The trace bound and the shared Cameron–Martin space then follow immediately from trace-class stability under bounded congruences and the spectral bounds on St(φ).
Fix t ∈ I. Under
assumption (A) we have vt = u + ηt,
while under (B) we have vt = Atu + ηt
with At(H) ⊆ Hμ.
In either case, conditioning on u yields that νt is a mixture
of Cameron–Martin shifts of μt:
νt(⋅) = ∫μt(⋅ − Atu) μ(du),
where in case (A) we interpret At = Id and
require μ(Hμ) = 1.
For each fixed x ∈ Hμ, the Cameron–Martin theorem gives
(dμt(⋅ − x)/dμt)(v) = exp (⟨x, v⟩Hμt − ½∥x∥Hμt2),
with the pairing and norm taken in the Cameron–Martin geometry of μt. Applying this with x = Atu ∈ Hμ and averaging over u yields the explicit Radon–Nikodym derivative
(dνt/dμt)(v) = ∫ exp (⟨Atu, v⟩Hμt − ½∥Atu∥Hμt2) μ(du).
The right-hand side is strictly positive μt-a.s. and finite provided, for instance, that 𝔼∥Atu∥Hμ2 < ∞, which we take as a mild integrability condition (automatic in many applications where At is smoothing and u ∈ H has finite second moment). Consequently νt ≪ μt, and since each component measure μt(⋅ − Atu) is itself equivalent to μt, the mixture satisfies μt ≪ νt as well; hence νt ∼ μt and Φt = log (dνt/dμt) is μt-a.s. (equivalently νt-a.s.) well defined.
Moreover, this Radon–Nikodym identity is formulated intrinsically on H using Hμ duality, so it is consistent under discretization: Galerkin projections simply replace μt and νt by their pushforwards to Hn, without altering the continuum identity.
The preconditioned drift is defined by
F(u, t; φ) := − u + R DHμΦt(u),
where DHμΦt denotes the Fréchet derivative along Hμ (i.e. directional differentiability in Cameron–Martin directions). In the present setting, existence of DHμΦt can be justified under standard regularity assumptions on the density dνt/dμt. One sufficient condition is that the map
$$
v \;\mapsto\; \frac{d\nu_t}{d\mu_t}(v)
$$
belongs to a Gaussian Sobolev class W1, 2(μt)
on H, in which case DHμΦt
exists μt-a.s. and is
square integrable; see the Malliavin/abstract Wiener space literature on
differentiability of log-densities with respect to Gaussian reference
measures. For our purposes, we record this as an assumption: DHμΦt
exists and is such that u ↦ F(u, t; φ)
is globally Lipschitz and dissipative uniformly in t, as required in Theorem~1.
We consider the annealed dynamics in which t is piecewise constant. Let 0 = s0 < s1 < ⋯ < sT be switching times and set t(s) = k for s ∈ [sk − 1, sk). We may realize the noise via a cylindrical Wiener process Bs on H and write
dus = F(us, t(s); φ) ds + √2 Ct(s)(φ)1/2 dBs.
By the uniform trace bound, each Ct(s)(φ)1/2 is Hilbert–Schmidt, so the stochastic integral above is well defined in H. If F(⋅, t; φ) is globally Lipschitz and satisfies a uniform dissipativity condition
⟨F(u, t; φ), u⟩ ≤ − α∥u∥2 + β for all u ∈ H,
then standard infinite-dimensional SDE/SPDE theory (e.g. via a fixed-point argument for mild solutions on each interval [sk − 1, sk) combined with Grönwall estimates and Itô isometry using the trace bound) yields existence and uniqueness of a strong solution with continuous H-valued paths, together with uniform moment bounds such as
sups ≤ sT 𝔼∥us∥2 < ∞,
where the constants depend on α, β, m, M, Tr(C⋆)
but not on any discretization dimension. The same statement applies to
the learned dynamics obtained by replacing F with Fθ, provided
Fθ(⋅, t)
obeys analogous Lipschitz/dissipativity bounds (which can be enforced by
architecture and/or spectral normalization and is compatible with our
regularization of φ).
Fix t and consider the autonomous fixed-scale dynamics. The definition of F(⋅, t; φ) as a
μt-preconditioned
score drift implies that νt is formally
invariant: for smooth cylindrical test functions f, the generator
ℒtf(u) = ⟨F(u, t; φ), Df(u)⟩ + Tr (Ct(φ)D2f(u))
satisfies ∫ℒtf dνt = 0.
This identity is the infinite-dimensional analogue of the
finite-dimensional fact that Langevin with drift equal to the score of
the target has the target as invariant law; here the
integration-by-parts is performed with respect to the Gaussian reference
μt and
transferred to νt using Φt. Under the
dissipativity condition above, one obtains a Feller Markov semigroup with at
most one invariant probability measure, hence νt is the unique
invariant law at scale t.
Finally, we verify that the denoising objective is finite under mild growth conditions. Since ηt ∼ 𝒩(0, Ct(φ)) is H-valued with 𝔼∥ηt∥2 = Tr(Ct(φ)), we have 𝔼∥ηt∥2 ≤ M Tr(C⋆) by the uniform trace bound. Moreover vt = Atu + ηt is H-valued, and if Fθ(⋅, t) has at most linear growth uniformly in t, namely ∥Fθ(u, t)∥ ≤ c0 + c1∥u∥, then
𝔼∥Fθ(vt, t)∥2 ≤ 2c02 + 2c12 𝔼∥vt∥2 < ∞,
whenever 𝔼∥u∥2 < ∞ and the trace bound holds.
Thus ℒ(θ, φ) is
finite for admissible φ and
such Fθ,
and the regularizer ℛ(φ) may be viewed as ensuring the standing operator bounds needed for admissibility and well-posedness. This
completes the well-posedness layer required before we turn to
quantitative sampling error bounds.
We now quantify the discrepancy between the distribution produced by the annealed Euler–Maruyama sampler driven by the learned drift Fθ and the terminal perturbed law νT (and, by a further triangle inequality, the target μ when νT ≈ μ). Our objective in this section is to exhibit an error decomposition whose constants depend on continuum quantities and on the uniform equivalence constants m, M, but not on any discretization resolution.
Fix t ∈ I and a step size ht > 0. We consider the (one-step) Euler kernels on H,
Kth : u ↦ u + ht F(u, t; φ) + √(2ht) ξt,  K̂th : u ↦ u + ht Fθ(u, t) + √(2ht) ξt,
where ξt ∼ 𝒩(0, Ct(φ)) is independent of u. For a probability measure ρ on H we write ρKth for its pushforward under the kernel. The full sampler is the composition of M steps per scale, hence the terminal law is
π̂ := ρ0 K̂1h M K̂2h M⋯K̂Th M, K̂th M := (K̂th)M,
with initialization ρ0 = μ1 = 𝒩(0, C1(φ)) for definiteness. We analogously define the ``ideal discrete-time’’ law
πh := ρ0 K1h M K2h M⋯KTh M.
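The composition defining π̂ can be sketched as a warm-started double loop in a shared spectral basis (illustrative function name; a toy drift stands in for Fθ):

```python
import numpy as np

def annealed_em_sampler(F_theta, ct_eigs, h, M, rng=None):
    """Annealed Euler-Maruyama: M inner steps per scale t with noise
    N(0, C_t(phi)) drawn mode-wise; the terminal state of scale t
    warm-starts scale t+1, and the final state is a draw from pi-hat."""
    rng = np.random.default_rng(rng)
    T, n = ct_eigs.shape
    u = np.sqrt(ct_eigs[0]) * rng.standard_normal(n)   # rho_0 = N(0, C_1(phi))
    for t in range(T):
        for _ in range(M):
            xi = np.sqrt(ct_eigs[t]) * rng.standard_normal(n)
            u = u + h * F_theta(u, t) + np.sqrt(2.0 * h) * xi
    return u
```

For the Ornstein–Uhlenbeck drift F(u, t) = −u the chain's stationary per-mode variance is ct/(1 − h/2), which converges to the correct value ct as h → 0 and gives a quick correctness check.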
The estimates below rest on a standard contractivity condition for
the fixed-scale dynamics. Concretely, we assume that for each t ∈ I the drift F(⋅, t; φ) is
globally Lipschitz on H with
constant L and is (strongly)
monotone with constant α > 0,
uniformly in t. In
preconditioned models the natural monotonicity is often formulated in
the Ct(φ)−1-geometry;
using mC⋆ ≼ Ct(φ) ≼ MC⋆,
one passes between the corresponding norms with constants depending only
on m, M. In
particular, any α stated in a Ct−1-metric yields an α′ in the ambient H-metric, with α′ depending on m, M and the baseline geometry, but not on discretization dimension.
We decompose the terminal error as
W2(π̂, νT) ≤ W2(π̂, πh) + W2(πh, νT).
The first term is the drift-approximation error; the second term splits into a discretization bias (Euler bias) and an annealing error (finite number of scales / imperfect bridging of {νt}).
Let δ satisfy the uniform mean-square error bound
supt ∈ I 𝔼vt ∼ νt ∥Fθ(vt, t) − F(vt, t; φ)∥2 ≤ δ.
To convert this mean-square bound into a Wasserstein bound for the Markov kernels, we use a synchronous coupling: evolve two chains from the same initial point and with the same Gaussian increment ξt at each step.
For one Euler step at scale t,
this gives
U+ − Û+ = ht(F(U, t; φ) − Fθ(U, t)),
hence
𝔼∥U+ − Û+∥2 ≤ ht2 𝔼∥F(U, t; φ) − Fθ(U, t)∥2.
Under the standing moment bounds from the previous section and the fact that, for contractive kernels, the chain remains close (in W2) to νt after a short burn-in, one obtains a perturbation estimate of the form
W2(ρK̂th, ρKth) ≤ c1 ht δ1/2,
with c1 depending on m, M and the uniform Lipschitz/dissipativity constants, but independent of any discretization. Iterating across M steps and T scales, and using the contractivity of Kth implied by strong monotonicity when ht ≤ 1/L, yields
W2(π̂, πh) ≲ c1 δ1/2 ∑t M ht.
In the common regime ht ≡ h
and fixed total ``algorithmic time’’ ∑tMht = O(1),
the drift-induced error is O(δ1/2).
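The coupling step above is easy to verify numerically: with shared Gaussian increments the noise cancels from the chain difference, leaving only the accumulated drift mismatch (a toy sketch; the helper name and the constant-bias perturbation are our own):

```python
import numpy as np

def coupled_gap(F_true, F_approx, u0, ct_eigs, h, n_steps, rng=None):
    """Synchronous coupling: two Euler chains from the same state, driven
    by the SAME increments; ||u - v|| isolates the drift approximation error."""
    rng = np.random.default_rng(rng)
    u, v = u0.copy(), u0.copy()
    for _ in range(n_steps):
        noise = np.sqrt(2.0 * h) * np.sqrt(ct_eigs) * rng.standard_normal(u0.shape)
        u = u + h * F_true(u) + noise
        v = v + h * F_approx(v) + noise
    return np.linalg.norm(u - v)

bias = np.zeros(8); bias[0] = 0.1            # drift error of norm delta = 0.1
gap = coupled_gap(lambda u: -u, lambda u: -u + bias,
                  np.ones(8), 0.5 * np.ones(8), h=0.1, n_steps=500, rng=1)
```

For this linear pair the gap recursion e+ = (1 − h)e − h·bias converges to norm 0.1, independently of the injected noise.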
Even with the true drift, the Euler kernel Kth has an invariant law νt, h which differs from the continuum invariant νt. Under standard regularity (global Lipschitz and monotonicity) and trace-class noise, one has a strong error estimate for the Euler scheme in Hilbert spaces, which implies a Wasserstein bias bound of the form
W2(νt, h, νt) ≤ c2 ht1/2,
again with constants depending on the continuum parameters and on m, M through the noise geometry, but not on any discretization resolution.
Moreover, because we run only M steps per scale, we incur a finite-time mixing error to νt, h. By contractivity of Kth in W2 (a discrete analogue of the continuum contraction), there exist γt > 0 and Cmix < ∞ such that for any initial ρ with finite second moment,
W2(ρ(Kth)M, νt, h) ≤ Cmix e−γtMht W2(ρ, νt, h),
with γt comparable to the dissipativity constant α up to factors controlled by m, M. Combining the bias and mixing estimates across scales yields an accumulated discretization-and-mixing contribution bounded by
∑t (c2 ht1/2 + Cmix e−γtMht Ξt),
where Ξt is a scale-dependent warm-start term (controlled by second moments, hence dimension-free under the standing moment bounds). In particular, taking Mht ≳ γt−1log (1/ε) makes the second term O(ε) uniformly over discretizations.
The remaining discrepancy is intrinsic to the annealing/bridging
design: even if we could sample from νt at each
scale, a finite number of scales and the chosen forward map u ↦ Atu
determine how closely νT approximates
μ (and how difficult the
transitions νt → νt + 1
are). We encode this as an additive term AnnealErr({νt}), for instance
AnnealErr({νt}) := W2(νT, μ) + ∑t < T W2(νt, νt + 1),
or any sharper telescoping surrogate reflecting the actual schedule and
the number of inner steps. This term depends on the modeling choice of
scales and corruption operators, but it is a continuum quantity and
therefore does not deteriorate with grid refinement.
Collecting , , , and the annealing contribution yields the following representative estimate, which is a refined version of Theorem~2 stated in the overview.
Since W2 metrizes weak convergence on sets with uniformly bounded second moments, the theorem above immediately implies a bounded-Lipschitz estimate: for ∥ ⋅ ∥BL the dual norm over functions with ∥f∥∞ ≤ 1 and Lip(f) ≤ 1, we have
∥π̂ − νT∥BL ≤ W1(π̂, νT) ≤ W2(π̂, νT),
hence the same decomposition applies (with the same dimension-free
constants) to bounded test-function errors.
Finally, we emphasize why the bounds are genuinely resolution-independent. The contraction parameters and Euler error constants are obtained at the level of the continuum Hilbert space H, using only (i) trace-class noise bounds and (ii) drift coercivity/Lipschitz properties measured in norms whose equivalence is controlled by m, M. For a Galerkin projection Pn : H → Hn, the projected dynamics inherit the same constants uniformly in n, and W2-stability under weak convergence allows passing to the n → ∞ limit. Thus, once δ, {ht}, and the annealing design are fixed, the number of steps TM needed to reach a prescribed accuracy bound does not blow up with the grid.
In this section we isolate a regime in which the dependence of mixing on the covariance choice Ct(φ) can be characterized by a single operator condition number, and in which optimizing Ct(φ) within an admissible family yields a provable (and essentially unimprovable) reduction in the number of inner Langevin steps required at each scale.
Fix t ∈ I. We assume that νt is absolutely continuous with respect to μt = 𝒩(0, Ct(φ)) and admits a log-concave density of the form
(dνt/dμt)(u) ∝ exp (− Ψt(u)),
where Ψt : H → ℝ is differentiable along the Cameron–Martin space Hμt, with derivative DHμtΨt(u) ∈ Hμt*. With Φt = log (dνt/dμt), we have Φt = −Ψt + const, hence the true drift reads
F(u, t; φ) = −u − RDHμtΨt(u).
We further assume that Ψt is α-strongly convex and L-smooth in the preconditioned geometry, uniformly in t, in the following sense: for all u, v ∈ H and all t,
α∥u − v∥Ct−12 ≤ ⟨DHμtΨt(u) − DHμtΨt(v), u − v⟩ ≤ L∥u − v∥Ct−12,
where ∥w∥Ct−1 := ∥Ct(φ)−1/2w∥ on the common image space (well-defined thanks to uniform equivalence). This condition is the function-space analogue of Hessian bounds in preconditioned coordinates; it implies monotonicity and Lipschitz properties of F(⋅, t; φ) with constants that depend on (α, L) and the equivalence constants (m, M), but not on discretization.
For a bounded, positive, self-adjoint operator C satisfying mC⋆ ≼ C ≼ MC⋆, define the preconditioned smoothness and convexity moduli by
Lt(C) := supu ≠ v ⟨DHμtΨt(u) − DHμtΨt(v), u − v⟩ / ∥u − v∥C−12,  αt(C) := infu ≠ v ⟨DHμtΨt(u) − DHμtΨt(v), u − v⟩ / ∥u − v∥C−12,
and the associated condition proxy
κt(C) := Lt(C)/αt(C).
When the two-sided bound above holds with C = Ct(φ), we have Lt(Ct(φ)) ≤ L and αt(Ct(φ)) ≥ α, hence κt(Ct(φ)) ≤ L/α.
More generally, if one compares two covariances C and C̃ in the admissible family, the
uniform inequalities mC⋆ ≼ C, C̃ ≼ MC⋆
imply
∥w∥C−12 ≃m, M ∥w∥C̃−12,
so κt(⋅)
varies in a controlled manner over the family, with constants depending
only on m, M.
Consider the fixed-scale Euler kernel Kth from the previous section with true drift F(⋅, t; φ). Under the strong convexity and smoothness assumptions above, one obtains a discrete-time contraction estimate in W2 whose rate is governed by αt(Ct(φ)), with admissible step sizes determined by Lt(Ct(φ)). A representative statement is as follows.
The salient point is that the dependence on the covariance enters through κt(Ct(φ)), and the bounds are dimension-free because αt(C) and Lt(C) are continuum quantities expressed in operator norms on H (with equivalence controlled by m, M).
Let 𝒞 be an admissible family of
covariances satisfying mC⋆ ≼ C ≼ MC⋆,
for instance the congruence family C = C⋆1/2SC⋆1/2
with S bounded, positive,
self-adjoint and σ(S) ⊂ [m, M].
For each scale t, define the
best achievable proxy in the family by
κt⋆ := infC ∈ 𝒞κt(C).
Since the proposition above shows that the number of inner steps Mt need only scale linearly with κt(C), any choice Ct(φ) that reduces κt(Ct(φ)) yields a commensurate reduction in the required number of inner steps.
In particular, comparing to the baseline C⋆, we obtain an
improvement factor bounded by κt(C⋆)/κt(Ct(φ)).
From the modeling perspective, the role of the covariance-learning component φ is precisely to search, within 𝒞, for covariances that better match the local anisotropy of Ψt as seen through the preconditioned moduli; this is the function-space analogue of preconditioning by an approximate inverse Hessian.
We now show that the κ-dependence above cannot, in general, be improved when the covariance is restricted to a prescribed family 𝒞. The sharpest statement is obtained on quadratic instances, for which the proxy κt(C) is exact.
Fix t and consider the
Gaussian target νt = 𝒩(0, Q)
with Q trace-class and
strictly positive. Suppose Q
commutes with all C ∈ 𝒞 (e.g.,
all are diagonal in a common Fourier/KL basis). Then νt ≪ μt = 𝒩(0, C) with Ψt(u) = ½⟨Bu, u⟩ for a bounded positive operator B diagonal in the same basis, and
the preconditioned Langevin dynamics decouple mode-wise. Writing {λi(C)}
and {λi(Q)}
for eigenvalues in the shared basis, one verifies that the relevant
preconditioned spectrum is {λi(C)/λi(Q)},
hence
$$
\kappa_t(C)
\;=\;
\frac{\max_i \lambda_i(C)/\lambda_i(Q)}{\min_i
\lambda_i(C)/\lambda_i(Q)}.
$$
For the Euler scheme with step size h ≤ 1/Lt(C),
the i-th mode evolves as a
one-dimensional AR(1) recursion whose contraction factor is 1 − hai
with ai ∈ [αt(C), Lt(C)].
Consequently, no choice of h
can make all modes contract faster than the slowest one without
violating stability on the fastest one, and the number of steps required
to reduce the W2
distance by a factor ε must
scale at least linearly in Lt(C)/αt(C).
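In this commuting setting both the proxy κt(C) and the implied iteration count are elementary functions of the shared spectra, which makes the linear-in-κ scaling easy to tabulate (a sketch; function names are ours):

```python
import numpy as np

def kappa(lam_C, lam_Q):
    """Condition proxy for commuting covariances: the preconditioned
    spectrum is {lambda_i(C) / lambda_i(Q)}."""
    r = lam_C / lam_Q
    return r.max() / r.min()

def steps_to_contract(lam_C, lam_Q, eps=1e-3):
    """With h = 1/L_t(C) (stability on the stiffest mode), the slowest AR(1)
    factor is 1 - alpha_t/L_t = 1 - 1/kappa, so steps scale linearly in kappa."""
    r = lam_C / lam_Q
    slowest = 1.0 - r.min() / r.max()
    return int(np.ceil(np.log(eps) / np.log(slowest)))
```

On a two-mode example with λ(Q) = (1, 0.01), an isotropic C gives κ = 100 and a near-matched C gives κ = 2, with the step counts differing by roughly the same factor.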
The two theorems together justify the interpretation of covariance learning as an essentially optimal acceleration mechanism within a constrained operator class: improvements in mixing are possible precisely to the extent that one can reduce κt(C) over C ∈ 𝒞, and for commuting Gaussian instances this reduction is both necessary and sufficient up to universal constants.
We present three experimental suites designed to isolate (i) whether
learning the covariance schedule Ct(φ)
produces measurable acceleration at fixed accuracy, (ii) whether this
acceleration is stable across discretizations (continuum-first
invariance), and (iii) whether the method remains effective in
conditional settings where per-sample throughput is the primary
constraint. In all experiments we train by the objective
ℒ(θ, φ) = 𝔼t 𝔼u, ηt ∥ηt + Fθ(Atu + ηt, t)∥2 + λ ℛ(φ),
with a schedule t ∈ [T] and inner sampling
steps M per scale in the
annealed Euler–Maruyama sampler. Unless stated otherwise, At = Id in
unconditional tasks and At is chosen as
a fixed low-pass smoother in tasks where μ(Hμt) < 1
is expected at fine scales. We parameterize Ct(φ)
in a Fourier/KL basis via a congruence Ct(φ) = C⋆1/2St(φ)C⋆1/2
with St(φ)
diagonal and σ(St) ⊂ [m, M]
enforced by projection after each optimizer step; this ensures a shared
Cameron–Martin space and eliminates discretization-dependent
admissibility failures.
We report (1) steps-to-accuracy: the smallest TM achieving a prescribed test metric threshold, (2) wall-clock cost: end-to-end runtime including GRF sampling and operator evaluation at fixed hardware and batch size, and (3) cross-resolution transfer: performance when training and sampling are carried out on different discretizations. For unconditional generative tasks we use sliced W2 (finite-dimensional projections), MMD with a Sobolev kernel, and domain-relevant summary statistics (spectral energy, structure functions). For conditional inverse problems we also report posterior predictive error, coverage of credible intervals, and effective sample size per second for downstream quantities of interest. Baselines include (i) fixed Ct ≡ C⋆ with identical Fθ architecture, (ii) hand-designed schedules (e.g. geometric scaling of C⋆), and (iii) classical function-space samplers (pCN and preconditioned Crank–Nicolson variants) where applicable.
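For reference, the sliced-W2 metric used here admits a short Monte Carlo estimator over random one-dimensional projections (a sketch assuming equal-size sample sets on a common grid; the function name is ours):

```python
import numpy as np

def sliced_w2(X, Y, n_proj=64, rng=None):
    """Sliced Wasserstein-2 between sample sets X, Y (rows = discretized
    fields): average the 1-D W2^2 costs, each computed exactly by sorting
    the projected samples, over random unit directions."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_proj):
        v = rng.standard_normal(X.shape[1])
        v /= np.linalg.norm(v)
        a, b = np.sort(X @ v), np.sort(Y @ v)
        total += np.mean((a - b) ** 2)
    return np.sqrt(total / n_proj)
```

For a pure translation Y = X + c·w the estimator concentrates around c∥w∥/√d, which gives a convenient sanity check.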
Our first suite uses synthetic measures μ on H = L2(𝕋d) constructed as mixtures of Gaussian random fields with known and intentionally mismatched spectral structure. Concretely, we sample a latent label z ∈ {1, …, K} and then draw u ∼ 𝒩(0, Qz), where each Qz is diagonal in the Fourier basis with eigenvalues λk(Qz) concentrated on disjoint bands and with prescribed anisotropy ratios across modes. This setting is designed so that (i) the optimal preconditioner in the commuting Gaussian case is explicit, and (ii) the κ-proxy varies over orders of magnitude depending on whether C matches the dominant bands of {Qz}.
We train PL-DDO with a Fourier neural operator for Fθ(⋅, t)
and compare learned Ct(φ)
against C⋆ chosen
as a Matérn-type prior with smooth spectrum. We observe that learned
schedules shift mass toward the active frequency bands and, as t increases, progressively sharpen
the spectrum while remaining within the admissible bounds [m, M]. The principal
quantitative outcome is that the inner step count M required to reach a fixed
sliced-W2 threshold
decreases proportionally to the reduction in the empirically estimated
condition proxy
$$
\widehat\kappa_t(C)
=
\frac{\widehat L_t(C)}{\widehat \alpha_t(C)},
\qquad
\widehat L_t(C),\widehat\alpha_t(C)
\ \text{estimated from finite differences along sampled directions.}
$$
In particular, at fixed T we
find that replacing C⋆ by the learned Ct(φ)
yields a consistent reduction in M across mixture weights and across
different anisotropy ratios, while the fixed-covariance baseline
requires small step sizes ht (stability
dictated by the fastest mode) and therefore more inner steps.
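The finite-difference estimator for κ̂t can be sketched as follows (a heuristic matching the display above; on a quadratic drift it recovers the exact spectral range of the Hessian):

```python
import numpy as np

def kappa_hat(F, states, eps=1e-4, n_dir=200, rng=None):
    """Empirical condition proxy kappa_hat = L_hat / alpha_hat: both moduli
    are estimated from directional finite differences of the drift F at
    sampled chain states (a diagnostic, not a rigorous bound)."""
    rng = np.random.default_rng(rng)
    stiffness = []
    for u in states:
        for _ in range(n_dir):
            d = rng.standard_normal(u.shape)
            d /= np.linalg.norm(d)
            # directional curvature of the negated drift along d
            stiffness.append(-np.dot(F(u + eps * d) - F(u), d) / eps)
    stiffness = np.array(stiffness)
    return stiffness.max() / stiffness.min()
```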
To test discretization robustness we train on a coarse grid, but sample and evaluate on finer grids by drawing ηt ∼ 𝒩(0, Ct(φ)) directly at the evaluation resolution. Because Ct(φ) is defined through a basis-consistent spectral rule and Fθ is discretization-consistent, the step reduction persists with only minor degradation in summary statistics. We also repeat evaluation under random regridding (nonuniform sample points followed by interpolation to the operator input format) and find that the learned covariance remains effective; in contrast, hand-tuned schedules exhibit nontrivial sensitivity to the discretization change, typically manifesting as either overly rough samples (excess high-frequency power) or slow mixing in the low-frequency modes.
Our second suite targets non-Gaussian measures obtained as pushforwards of Gaussian initial conditions through a PDE flow. Let u0 ∼ 𝒩(0, Q) be a divergence-free random field and define u = 𝒢(u0), where 𝒢 maps initial vorticity to a time-τ state of 2D Navier–Stokes on 𝕋2 at moderate Reynolds number. The resulting μ exhibits intermittency, nontrivial phase correlations, and heavy-tailed pointwise statistics, hence constitutes a stringent test of whether covariance learning (a purely second-order geometric mechanism) can nevertheless accelerate sampling when combined with a learned drift.
We use two evaluation modes. In the fixed-resolution mode we compare sample quality as a function of TM at a fixed resolution, reporting energy spectra E(k), vorticity kurtosis, and MMD. In the super-resolution mode we train Fθ on low-resolution observations of u but sample at a higher resolution; this probes whether the learned objects remain meaningful in the continuum limit. We employ a low-pass smoother At at early noise scales to ensure vt lies in Hμt while preserving the large-scale structures that govern long-range dependence.
Empirically, learning Ct(φ) yields a clear reduction in the number of inner steps required to match the spectral statistics: relative to C⋆, the learned schedule reallocates variance across modes in a manner consistent with the observed energy cascade, which in turn permits larger stable step sizes ht without destabilizing the high-frequency components. This translates to wall-clock improvements because each inner step includes both an operator evaluation and a GRF draw; decreasing M reduces both costs simultaneously. We emphasize that the measured acceleration is not an artifact of reduced accuracy: for each method we select TM to meet a fixed tolerance on a jointly chosen metric set (spectrum error, MMD, and a low-order moment vector), and then compare runtimes at that tolerance.
Resolution invariance is assessed by training on one discretization (with Fourier truncation n) and sampling on multiple larger truncations n′. We observe that the learned Ct(φ) continues to provide the same qualitative spectral shaping across n′, and that the required M to achieve a fixed tolerance does not grow with n′ within the tested range. In contrast, when Ct is hand-scaled without learning, we see either (i) a need to retune the schedule as n′ increases or (ii) degradation in high-frequency statistics. These behaviors are consistent with the theoretical emphasis on continuum-defined operator inequalities and shared Cameron–Martin geometry.
Our third suite addresses conditional sampling for elliptic inverse problems, with the goal of demonstrating many-sample throughput when repeatedly conditioning on new data. We consider a log-permeability field u ∈ H and observations y = 𝒪(𝒮(u)) + ξ, where 𝒮 solves the Darcy PDE and 𝒪 extracts sparse pressure measurements. The target is the posterior μ(⋅ ∣ y), which is non-Gaussian and anisotropic, and whose geometry varies with y. We train a conditional drift Fθ(⋅, t; y) by augmenting the neural operator input with a representation of y and the observation locations, while still learning a covariance schedule Ct(φ) shared across conditions; this isolates the effect of learning a global preconditioner that improves mixing uniformly over y.
We compare against (i) pCN initialized from the prior, (ii) pCN with hand-tuned step size, and (iii) PL-DDO with fixed C⋆. We report posterior predictive RMSE on held-out sensor locations, coverage of pointwise credible intervals, and effective sample size per second for low-dimensional quantities of interest (e.g. average flux through a boundary). The primary outcome is that learning Ct(φ) reduces the number of inner Langevin steps needed to reach stable posterior summaries, which directly increases throughput when many posterior samples are required per observation instance. In addition, the learned Ct(φ) decreases sensitivity to discretization: training on a moderate mesh and sampling on refined meshes yields comparable posterior summaries without retuning, provided that the PDE solver and the GRF sampler are consistent with the continuum operators.
Finally, we perform an ablation in which we freeze φ after pretraining on unconditional
data and then finetune only θ
conditionally. This separates ``geometry learning’’ from ``score
fitting’’ and shows that a substantial fraction of the speedup persists,
indicating that the covariance schedule captures reusable anisotropy
common to the posterior family. Across all three suites, the aggregate
evidence supports the central experimental claims: (i) covariance
learning yields consistent step and wall-clock reductions at fixed
accuracy, (ii) these gains are maintained across discretizations, and
(iii) the method remains effective in conditional regimes where sampling
throughput dominates training cost.
Our central design choice is to learn a covariance schedule t ↦ Ct(φ) while keeping all definitions in the continuum Hilbert space H. This choice yields tangible acceleration in the regimes where preconditioning is known to control contraction, but it also introduces failure modes that do not arise when Ct is fixed in advance. We summarize the main pitfalls and the mechanisms we use to avoid them, then relate the method to function-space MCMC and adaptive preconditioning, and finally outline limitations and future directions.
In infinite dimensions, ``covariance learning’’ is not an innocuous parametric extension: changing Ct changes the reference Gaussian measure μt = 𝒩(0, Ct), and in general two Gaussian measures on H can be mutually singular. Since our objective is formulated through the score term DHμtΦt with Φt = log (dνt/dμt), we require νt ≪ μt and a common Cameron–Martin geometry across t. The uniform equivalence constraint
mC⋆ ≼ Ct(φ) ≼ MC⋆
is precisely the condition that prevents ``admissibility collapse’’: it ensures Im (Ct1/2) = Im (C⋆1/2) and rules out degenerate limits in which Ct loses rank in directions that matter to μ.
Without such a constraint, stochastic gradient descent can drive some
eigenvalues of Ct to 0 (to reduce injected noise and artificially
simplify the denoising task) or to +∞
(to overwhelm the data term), and either behavior can render μt incompatible
with νt in
the Feldman–Hájek sense. In practice we therefore treat admissibility
as a hard constraint (projection/clipping of spectra in a fixed basis)
augmented by a soft regularizer ℛ(φ) that discourages repeated
saturation at the bounds m and
M. The projection is not
merely numerical hygiene; it enforces the measure-theoretic
prerequisites under which the continuum objective is well-defined.
Even under admissibility, learning Ct and Fθ jointly can admit spurious optima. The denoising target ηt is drawn from 𝒩(0, Ct(φ)), so the distribution of training pairs (vt, ηt) changes with φ. This creates a feedback loop: the optimizer may alter Ct to generate ``easier’’ corruptions for the current Fθ, rather than covariances that are optimal for sampling. Two mitigations are important. First, we constrain the parameterization of Ct(φ) to be sufficiently low-dimensional and basis-consistent (e.g. diagonal or structured in a Fourier/KL basis), which limits the ability of Ct to fit idiosyncrasies of a finite dataset. Second, we interpret the learned schedule as a rather than a direct model of data covariance; accordingly, we evaluate it by downstream sampling performance at fixed accuracy rather than by training loss alone. This distinction matters because a lower denoising loss does not necessarily imply improved Langevin contraction if, for instance, Ct concentrates variance in directions where Fθ is inaccurate.
A more subtle pitfall is overfitting φ to the training discretization. If Ct(φ) is parameterized directly on a grid (e.g. as a dense matrix or pixel-wise kernel), then changing the resolution changes the parameterization itself, and the learned covariance can encode grid artifacts (aliasing, boundary effects, or anisotropies tied to coordinate axes). Such overfitting may even reduce the training objective while deteriorating continuum performance. Our continuum-first formulation is intended to preclude this failure mode: we choose Ct(φ) through operator rules that commute with refinement (Fourier multipliers, KL truncations consistent with a fixed basis, or SPDE-based covariances). Nevertheless, discretization overfitting can reappear if the chosen basis is not aligned with the domain geometry or if the implementation uses inconsistent quadrature/interpolation across grids. A practical diagnostic is to re-evaluate the learned schedule under multiple discretizations and regridding operators; a principled direction is to explicitly enforce discretization-consistency by constraining Ct(φ) to lie in a family with a well-defined continuum limit (e.g. pseudodifferential operators of fixed order, or Matérn-type SPDE covariances with learned coefficients).
From the viewpoint of function-space MCMC, our sampler is a learned, annealed analogue of preconditioned Langevin dynamics. When F is replaced by the exact gradient of a log-density and Ct ≡ C⋆, one recovers the standard preconditioned Langevin SPDE studied in the Da Prato–Zabczyk setting. The insistence on a fixed Cameron–Martin space mirrors the design of pCN and related dimension-robust methods: pCN achieves mesh-invariant behavior precisely because its proposal preserves a reference Gaussian measure and avoids discretization-dependent drift terms. Our approach departs from classical pCN in two ways. First, we use a learned drift Fθ to approximate a score, thereby enabling non-Gaussian targets without explicit likelihood evaluations. Second, we allow the reference covariance to vary with t, but only within a uniformly equivalent family to preserve the function-space geometry needed for well-posedness and stability.
There is also a clear connection to adaptive MCMC and quasi-Newton preconditioning, where one tunes a covariance online based on empirical chain statistics. In finite dimensions, covariance adaptation can improve mixing but must be controlled to preserve ergodicity (e.g. diminishing adaptation). In function spaces, naive covariance adaptation is particularly delicate because empirical covariances are typically finite-rank and therefore singular with respect to the ambient Gaussian reference. Our method can be viewed as an analogue: we learn Ct(φ) from data (and optionally from conditional contexts y) under explicit operator inequalities that preserve equivalence to C⋆. This replaces online adaptation with a constrained optimization problem whose output can be reused across many sampling runs, a key advantage in conditional regimes where per-instance throughput dominates. A hybrid direction is to allow mild online adjustment within the admissible family (e.g. updating a small number of spectral parameters) while enforcing a diminishing adaptation schedule and maintaining mC⋆ ≼ Ct ≼ MC⋆ at all times.
A fundamental limitation is that our reference measures μt are Gaussian. While learning Ct can reshape second-order geometry, it cannot represent genuinely non-Gaussian reference structure (e.g. skewness, heavy tails, or nonlinear constraints). In highly non-log-concave or strongly multimodal settings, covariance learning may yield limited gains because the bottleneck is not a condition number but rather energy barriers and metastability. One possible extension is to replace μt by a non-Gaussian reference obtained via a learned transport map Tt applied to a Gaussian base, leading to μt = (Tt)#𝒩(0, C⋆). In infinite dimensions, however, constructing Tt with sufficient regularity and ensuring absolute continuity properties (and tractable computation of the analogue of DHμtΦt) is nontrivial. Any such extension would require a careful replacement of Cameron–Martin calculus by an appropriate differential structure on the transported measure.
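To make the pushforward construction μt = (Tt)#𝒩(0, C⋆) concrete in a finite-dimensional surrogate, one can take a coordinatewise, strictly monotone map as a stand-in for Tt. The specific map below is purely illustrative (its derivative 1 + α·exp(z) is strictly positive, so it is invertible); it shows how sampling from a transported reference works, while the regularity and Cameron–Martin issues discussed above are precisely what this toy model sidesteps.

```python
import numpy as np

rng = np.random.default_rng(1)
c_star = np.arange(1, 65) ** -2.0        # diagonal surrogate for C_star

def transport_map(z, alpha=0.2):
    """Illustrative stand-in for T_t: strictly increasing coordinatewise
    (derivative 1 + alpha*exp(z) > 0), hence invertible, so the pushforward
    (T_t)#N(0, C_star) is a well-defined non-Gaussian (skewed) measure."""
    return z + alpha * np.expm1(z)

# Sampling from the transported reference: draw from the Gaussian base,
# then push each sample forward through the map.
z = np.sqrt(c_star) * rng.standard_normal((20000, 64))
u = transport_map(z)
```

Even this simple map produces positive skewness in every mode, i.e. second-order (covariance) learning alone could not represent the resulting reference.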
Many targets of interest live on subsets of H or satisfy hard constraints: divergence-free vector fields, positivity constraints, unit-norm constraints, or conservation laws. A Gaussian reference on H is mismatched to such geometry, and unconstrained Langevin updates can violate constraints unless corrected. One direction is to incorporate constraints through projections (projected Langevin) or by formulating dynamics on a manifold (Riemannian Langevin in function spaces), where the preconditioner becomes a position-dependent metric. Learning a covariance Ct(u) would considerably increase expressivity but also raises substantial well-posedness questions for the resulting multiplicative-noise SPDE and for discretization-invariant implementation.
Beyond the extensions above, several theoretical and practical questions remain. On the theory side, it is natural to seek guarantees under weaker conditions than global Lipschitz/dissipativity, as many score models are only locally regular and many data measures induce drifts with polynomial growth. It is also of interest to quantify how covariance learning interacts with drift approximation error δ: while preconditioning can improve contraction, it may also amplify directions where Fθ is inaccurate, suggesting that joint learning should balance mixing proxies with approximation stability. On the practical side, one may enforce additional structure on t ↦ Ct(φ) (e.g. monotonicity of certain spectral quantities) to stabilize training, and one may co-design At with Ct when μ(Hμt) < 1 is expected. Finally, connecting covariance learning to bridge-based objectives (e.g. Schrödinger bridges) may yield alternative training criteria in which the role of Ct is explicit as an entropic regularization parameter, potentially clarifying when and why learned preconditioning provides the most benefit.
In summary, covariance learning is powerful precisely because it alters the geometry of the sampling problem; for the same reason it must be constrained at the level of operators and measures. The uniform-equivalence framework provides a principled safety rail: it prevents singularity, supports discretization-invariant implementation, and places learned preconditioning on the same function-space footing as classical dimension-robust MCMC, while still leaving room for meaningful acceleration within an admissible family.