Preconditioner-Learned Denoising Diffusion Operators: Dimension-Free Acceleration via Learnable Trace-Class Covariances

Table of Contents

  1. Introduction: motivation (throughput beyond resolution), why covariance/preconditioning is the next bottleneck, contributions and summary of guarantees.
  2. Background: DDO framework in Hilbert space; Gaussian reference measures, Cameron–Martin derivatives, denoising score matching objective, function-space Langevin and annealed Langevin.
  3. Problem setup: joint learning of score drift and covariance schedule; admissible covariance families (uniform equivalence, trace-class, shared Cameron–Martin space); forward corruption with optional smoothing operators.
  4. PL-DDO objective and parameterization: structured Ct(φ) (Fourier-diagonal + low-rank, SPDE-based, kernel hyperparameters); regularizers enforcing operator inequalities; practical GRF sampling (KL/FFT) and gradients through noise.
  5. Sampling algorithm: annealed preconditioned Langevin with learned Ct; Euler–Maruyama vs Crank–Nicolson discretization; stability heuristics and step-size scheduling.
  6. Theory I (well-posedness): existence/uniqueness of the continuum dynamics with time-varying preconditioner; conditions ensuring invariant/target measures are well-defined; finiteness of the training objective.
  7. Theory II (dimension-free sampling error): W2 (or bounded-Lipschitz) bounds decomposing error into score approximation, discretization, and annealing; explicit dependence on m, M but not on grid resolution.
  8. Theory III (acceleration and tightness): log-concave special case; mixing-time/spectral-gap proxy in terms of an operator condition number; upper bounds improved by learned Ct; matching lower bound showing near-optimality within the covariance family.
  9. Experiments: (a) Gaussian-mixture GRFs (control ground-truth spectral structure), (b) Navier–Stokes pushforward measures (super-resolution + mixing), (c) conditional Darcy posteriors (many-sample throughput). Report step reduction, wall-clock, resolution invariance, and robustness to discretization changes.
  10. Discussion: covariance learning pitfalls (singularity, overfitting to discretization), connections to function-space MCMC and adaptive preconditioning, limitations and future directions (non-Gaussian references, manifold constraints).

Content

1. Introduction: motivation (throughput beyond resolution), why covariance/preconditioning is the next bottleneck, contributions and summary of guarantees.

Many contemporary generative models for scientific fields are trained and deployed across a wide range of spatial resolutions. In such settings, it is natural to seek a discretization-invariant description: the target law is a Borel probability measure on an infinite-dimensional state space, training data are evaluations of random fields on arbitrary grids, and the learned model is an operator that can be queried consistently on new discretizations. Neural operator architectures have largely resolved the representational side of this objective: one can parameterize drifts, scores, or transport maps in a manner that is stable under refinement and supports evaluation on meshes unseen during training. The remaining obstacle is throughput. Even when evaluation of the learned operator is nearly grid-independent in complexity (up to FFT or spectral costs), the end-to-end cost of sampling is dominated by the number of iterative updates required to reach a prescribed accuracy in distribution. In other words, once expressivity and resolution invariance are in place, the practical bottleneck becomes the mixing of the sampling dynamics rather than the cost of a single model evaluation.

We focus on annealed Langevin-type samplers in Hilbert spaces, where the generative procedure is implemented by a sequence of stochastic steps driven by a learned drift (or score) at a range of noise levels. For such samplers, it is well understood in finite dimensions that performance is governed by geometry: the drift may be accurate, yet progress is throttled along poorly conditioned directions. This phenomenon is not an artifact of discretization. In function space, anisotropy is typically more severe: high- and low-frequency components have vastly different natural scales, and naive isotropic noise schedules can force the algorithm to take step sizes constrained by the stiffest modes. Consequently, the number of iterations required to achieve a fixed accuracy can become large, even if each iteration is resolution-invariant. From a continuum standpoint, this is precisely a preconditioning problem, and it suggests that learning only the drift is insufficient: to obtain genuine speedups, one must also adapt the covariance structure of the injected noise (equivalently, the reference Gaussian measure) that defines the geometry of the score and of the Langevin dynamics.

Our central premise is that the covariance schedule is the next essential degree of freedom once operator learning has provided resolution-invariant drift evaluation. We therefore propose to learn (i) a discretization-consistent drift operator and (ii) a family of trace-class covariance operators that act as learned preconditioners across annealing scales. The covariances are not treated as incidental hyperparameters but as trainable objects constrained to remain in an admissible set ensuring well-posedness in the infinite-dimensional setting. This changes the geometry in which the score is computed and in which the Langevin dynamics evolve, while preserving the underlying continuum formulation and enabling efficient sampling of the driving Gaussian random fields through structured representations (e.g. Fourier-diagonal or low-rank parameterizations).

Three considerations make this joint learning nontrivial in Hilbert space. First, Gaussian measures in infinite dimensions are mutually singular unless their covariances satisfy stringent equivalence conditions; the score is naturally defined along a Cameron–Martin space that depends on the reference measure. Second, the injected noise must be trace-class to define a bona fide stochastic evolution on the ambient space. Third, to retain discretization-independent guarantees, all bounds must be stable under Galerkin projection and must not degrade with dimension. These constraints motivate parameterizations that enforce uniform operator inequalities relative to a fixed baseline covariance. Under such constraints, the reference measures share a common Cameron–Martin space, the denoising objective is well-defined, and the resulting Langevin dynamics admit a continuum interpretation whose discretizations inherit dimension-free stability properties.

The contributions of this work are organized around this thesis.

We formulate a denoising score matching objective in a real separable Hilbert space in which the corruption noise is Gaussian with a trace-class covariance at each annealing scale. The learned drift targets a preconditioned score drift associated with the Radon–Nikodym derivative of the perturbed law with respect to the reference Gaussian measure. This formulation cleanly separates (a) the data distribution and its perturbations from (b) the choice of reference geometry, making explicit how covariance learning influences both the training signal and the sampling dynamics. Importantly, the objective is defined in function space; discretizations are only numerical surrogates used to approximate expectations and operator evaluations.

We introduce structured covariance parameterizations and regularizers that preserve trace-classness, positivity, and uniform equivalence to a baseline covariance. These admissibility conditions ensure that the family of Gaussian references remains mutually absolutely continuous and that the relevant Cameron–Martin derivatives are taken along a fixed subspace. On the sampling side, the same conditions support existence and uniqueness of solutions to the corresponding Langevin-type stochastic evolution (viewed as an SPDE or as an annealed sequence of such dynamics). This yields a principled foundation for learning preconditioners in infinite dimensions without silently changing the underlying measure-theoretic setting from scale to scale.

We provide an accuracy bound for the distribution produced by an annealed Euler–Maruyama sampler driven by the learned drift and the learned covariance schedule. The bound decomposes into three terms: a drift approximation component controlled by the mean-squared error of the learned drift under the intermediate perturbed laws, a discretization component controlled by the step sizes, and an annealing component capturing the mismatch induced by the finite noise schedule. Crucially, the constants depend on continuum quantities (such as Lipschitz/dissipativity parameters and uniform equivalence bounds for the covariances) and not on the discretization dimension. This establishes that, in the contractive regime, one can target a fixed Wasserstein accuracy with a number of sampling steps that does not grow with resolution, provided the drift approximation error and integrator stability are controlled uniformly.

In strongly log-concave regimes, the convergence rate of preconditioned Langevin dynamics is governed by a condition-number-like quantity in the metric induced by the covariance. We show that optimizing the covariance within an admissible family provably improves an upper bound on mixing time by reducing this proxy. Moreover, the improvement is not merely an artifact of the analysis: within the same covariance family and under first-order access, one cannot in general do better than the corresponding condition-number dependence. Thus, covariance learning is not only beneficial but, within structured families, essentially the correct and information-theoretically necessary lever for accelerating sampling. This perspective aligns with finite-dimensional preconditioning intuition while operating in a function-space framework in which the relevant geometry is encoded by operators rather than matrices.

From an algorithmic standpoint, the proposed method retains the operational simplicity of denoising-based training while adding a learnable preconditioning schedule that can be implemented efficiently in spectral bases. The resulting sampler trades expensive per-iteration costs for fewer iterations by improving contraction along difficult directions. Since the covariances are learned in the same continuum-first setting as the drift, the procedure is compatible with variable discretizations at both training and sampling time. In particular, one can exploit fast Gaussian random field sampling routines (FFT/KL) while maintaining the invariance properties expected of neural operator models.

The remainder of the paper develops these ideas systematically. We first recall the denoising drift–diffusion operator (DDO) framework in Hilbert space, emphasizing Gaussian reference measures, Cameron–Martin derivatives, and the function-space Langevin dynamics that underpin annealed sampling. We then introduce admissible covariance parameterizations, establish well-posedness and measure equivalence, and derive dimension-free error bounds and covariance-induced acceleration guarantees.


2. Background: DDO framework in Hilbert space; Gaussian reference measures, Cameron–Martin derivatives, denoising score matching objective, function-space Langevin and annealed Langevin.

We briefly recall the drift–diffusion operator (DDO) viewpoint for denoising-based learning in a real separable Hilbert space H, emphasizing the role of Gaussian reference measures, Cameron–Martin derivatives, and the resulting function-space Langevin dynamics. Throughout, we treat all objects at the continuum level; discretizations enter only as numerical approximations of the operators and expectations defined below.

Let C be a trace-class, positive, self-adjoint operator on H, and let μC := 𝒩(0, C) denote the centered Gaussian Borel probability measure on H. The associated Cameron–Martin space is
HμC := Im (C1/2) ⊆ H,   ⟨h1, h2⟩HμC := ⟨C−1/2h1, C−1/2h2⟩H,
where C−1/2 is understood on Im (C1/2). We use the standard Cameron–Martin theorem: for any h ∈ HμC, the shifted measure μCh := 𝒩(h, C) is equivalent to μC and
$$ \frac{d\mu_C^h}{d\mu_C}(v) = \exp\Big( \langle C^{-1}h, v\rangle_H - \tfrac12 \|h\|_{H_{\mu_C}}^2 \Big), \qquad \mu_C\text{-a.s.} $$
In contrast, if h ∉ HμC, then μCh ⟂ μC. This dichotomy is the basic measure-theoretic reason that, in infinite dimensions, the "score" must be understood relative to a reference Gaussian and along HμC rather than along the full space H.

Given a measurable function Φ : H → ℝ and a point u ∈ H, we say that Φ is (Fréchet) differentiable along HμC at u if there exists a continuous linear functional DHμCΦ(u) ∈ HμC* such that
Φ(u + h) = Φ(u) + DHμCΦ(u)[h] + o(∥h∥HμC),   h ∈ HμC.
We identify HμC* with HμC via the Riesz isomorphism on (HμC, ⟨⋅, ⋅⟩HμC): for ℓ ∈ HμC*, we write Rℓ ∈ HμC for the unique element satisfying ℓ[h] = ⟨Rℓ, h⟩HμC. Concretely, when ℓ[h] = ⟨g, h⟩HμC, then Rℓ = g, and note the identity ⟨g, h⟩HμC = ⟨C−1g, h⟩H.
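To make the Riesz identification concrete, the following finite-dimensional sanity check (an illustrative sketch in which a positive-definite matrix stands in for the trace-class operator C) verifies the identity ⟨g, h⟩HμC = ⟨C−1g, h⟩H numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
C = A @ A.T + n * np.eye(n)                 # positive-definite stand-in for C

# C^{-1/2} via the symmetric eigendecomposition C = Q diag(lam) Q^T
lam, Q = np.linalg.eigh(C)
C_inv_sqrt = Q @ np.diag(lam ** -0.5) @ Q.T

g = rng.standard_normal(n)
h = rng.standard_normal(n)

lhs = (C_inv_sqrt @ g) @ (C_inv_sqrt @ h)   # <C^{-1/2} g, C^{-1/2} h>_H
rhs = np.linalg.solve(C, g) @ h             # <C^{-1} g, h>_H
assert abs(lhs - rhs) < 1e-8
```

The two quantities agree because C is symmetric, so C−1/2 · C−1/2 = C−1; in infinite dimensions the same algebra holds on Im (C1/2).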

Fix a scale t and abbreviate Ct := Ct(φ), μt := 𝒩(0, Ct). We consider perturbations of the form
vt = Atu + ηt,   u ∼ μ,  ηt ∼ 𝒩(0, Ct)  independent,
where either At = Id and μ(Hμt) = 1, or At is a bounded linear smoother with At(H) ⊆ Hμt. Let νt denote the law of vt. Conditional on u, the random variable vt has law 𝒩(Atu, Ct), i.e. a Cameron–Martin shift of μt by Atu. Under the standing assumption that Atu ∈ Hμt almost surely (either directly or by smoothing), the Cameron–Martin theorem yields the conditional Radon–Nikodym derivative
$$ \frac{d\mathcal N(A_t u, C_t)}{d\mu_t}(v) = \exp\Big( \langle C_t^{-1}A_t u, v\rangle_H - \tfrac12 \|A_t u\|_{H_{\mu_t}}^2 \Big), \qquad \mu_t\text{-a.s.} $$
Averaging over u shows νt ≪ μt with
$$ \frac{d\nu_t}{d\mu_t}(v) = \mathbb E_{u \sim \mu}\Big[ \exp\Big( \langle C_t^{-1}A_t u, v\rangle_H - \tfrac12 \|A_t u\|_{H_{\mu_t}}^2 \Big) \Big]. $$
We write $\Phi_t(v) := \log\big(\frac{d\nu_t}{d\mu_t}(v)\big)$, defined νt-a.s. (and μt-a.s. on the event where the right-hand side is finite and nonzero).

The DDO identity is the observation that the "preconditioned score" RDHμtΦt coincides with a conditional mean of the latent clean component. Formally differentiating in a Cameron–Martin direction h ∈ Hμt gives
$$ D_{H_{\mu_t}}\Phi_t(v)[h] = \frac{\mathbb E\big[ \exp(\cdots)\,\langle C_t^{-1}A_t u, h\rangle_H\big]}{\mathbb E\big[\exp(\cdots)\big]} = \Big\langle C_t^{-1}\,\mathbb E\big[A_t u \mid v_t = v\big],\, h\Big\rangle_H = \Big\langle \mathbb E\big[A_t u \mid v_t = v\big],\, h\Big\rangle_{H_{\mu_t}}, $$
where exp (⋯) denotes the exponential weight in the mixture representation of dνt/dμt above, and the conditional expectation is taken with respect to the posterior induced by the joint law of (u, vt). By the Riesz identification in Hμt, we obtain the pointwise relation
$$ R\,D_{H_{\mu_t}}\Phi_t(v) = \mathbb E\big[A_t u \mid v_t = v\big], \qquad \nu_t\text{-a.s.} $$
Consequently, the preconditioned drift field
$$ F(v, t; \varphi) := -v + R\,D_{H_{\mu_t}}\Phi_t(v) $$
admits the equivalent denoising form
$$ F(v, t; \varphi) = -\,\mathbb E\big[\eta_t \mid v_t = v\big], $$
since vt = Atu + ηt implies 𝔼[Atu ∣ vt = v] = v − 𝔼[ηt ∣ vt = v]. This is precisely the identity that makes denoising objectives compatible with learning the drift driving Langevin-type sampling.

Consider the mean-squared denoising functional at scale t,
$$ \mathcal J_t(G) := \mathbb E_{u \sim \mu,\ \eta_t \sim \mathcal N(0, C_t)}\big\| \eta_t + G(A_t u + \eta_t) \big\|_H^2, $$
defined for measurable G : H → H with 𝔼∥G(vt)∥H2 < ∞. By the Hilbert-space orthogonality principle (conditional expectation as L2-projection), the minimizer satisfies
G*(v) = − 𝔼[ηt ∣ vt = v],   νt-a.s.
Comparing with the denoising form of the drift shows G*(⋅) = F(⋅, t; φ). Thus, training a parametrized operator Fθ(⋅, t) by minimizing 𝔼∥ηt + Fθ(vt, t)∥H2 targets the drift without requiring explicit access to Φt or its derivative. The role of the Gaussian reference is implicit but essential: it determines the Cameron–Martin geometry in which Φt is differentiable and in which the identity holds.
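The orthogonality principle can be checked in a one-dimensional caricature of the corruption model (a sketch with illustrative variances s2, c and At = Id, not the paper's setting): since (u, v) are jointly Gaussian, the conditional expectation is linear, and the least-squares coefficient of the best linear denoiser G(v) = a v recovers −𝔼[ηt ∣ vt = v] = −(c/(s2 + c)) v exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
s2, c, N = 2.0, 0.5, 200_000             # Var(u), Var(eta), sample count
u = rng.normal(0.0, np.sqrt(s2), N)      # clean scalar "field"
eta = rng.normal(0.0, np.sqrt(c), N)     # corruption noise
v = u + eta                              # observation v = u + eta

# argmin_a E|eta + a v|^2 in closed form; in the jointly Gaussian case the
# minimizer over all measurable G is this linear map: a* = -c / (s2 + c).
a_hat = -np.sum(eta * v) / np.sum(v * v)
a_true = -c / (s2 + c)
```

With these parameters a_true = −0.2, and the Monte Carlo fit a_hat matches it to a few parts in a thousand.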

The drift is designed so that νt is invariant for a preconditioned Langevin evolution on H. Let WCt be a Ct-Wiener process on H (well-defined since Ct is trace-class). The corresponding Langevin SPDE reads
$$ dU_s = \big( -U_s + R\,D_{H_{\mu_t}}\Phi_t(U_s) \big)\, ds + \sqrt{2}\, dW^{C_t}_s = F(U_s, t; \varphi)\, ds + \sqrt{2}\, dW^{C_t}_s, $$
where t is fixed. Under standard dissipativity and Lipschitz hypotheses on F(⋅, t; φ), this equation is well posed and admits νt as an invariant law; heuristically, the term −Us represents the Ornstein–Uhlenbeck pullback associated with μt, while RDHμtΦt corrects the dynamics to target νt rather than μt.

Annealed Langevin sampling uses a finite sequence of scales t = 1, …, T and runs for a short time at each scale, with νt serving as an intermediate bridge between an initial Gaussian-like law and the target-like law at the final scale. In computation one typically replaces the SPDE by an Euler–Maruyama discretization: for a step size ht > 0 and iterates {Un} ⊂ H,
$$ U_{n+1} = U_n + h_t F_\theta(U_n,t) + \sqrt{2h_t}\,\xi_n, \qquad \xi_n \sim \mathcal N(0,C_t), $$
where the learned field Fθ(⋅, t) approximates F(⋅, t; φ) and the Gaussian increments ξn implement the Ct-covariance noise. The continuum interpretation is that discretizations of the SPDE inherit their stability and convergence properties from the Hilbert-space formulation, provided the covariance is trace-class and the drift is controlled in the ambient norm.
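The annealed loop can be sketched in a few lines; this is a minimal illustration with a placeholder drift F_theta (the pure Ornstein–Uhlenbeck pullback) and a hypothetical diagonal schedule sqrtC, neither of which is the trained model:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, K = 64, 5, 20                      # modes, annealing scales, inner steps
h = [0.05] * T                           # per-scale step sizes h_t
# Hypothetical diagonal C_t^{1/2}: noise amplitude shrinks as t increases
sqrtC = [np.full(n, 1.0 / (t + 1)) for t in range(T)]

def F_theta(u, t):
    # Placeholder drift: the Ornstein–Uhlenbeck pullback -u alone
    # (a trained neural operator would supply the full drift).
    return -u

U = sqrtC[0] * rng.standard_normal(n)    # warm start from N(0, C_1)
for t in range(T):                       # anneal from noisy to data-dominated
    for _ in range(K):
        xi = sqrtC[t] * rng.standard_normal(n)               # xi ~ N(0, C_t)
        U = U + h[t] * F_theta(U, t) + np.sqrt(2 * h[t]) * xi
```

Each inner iteration is one Euler–Maruyama step of the scale-t dynamics; the terminal state at scale t warm-starts scale t + 1.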


3. Problem setup: joint learning of score drift and covariance schedule; admissible covariance families (uniform equivalence, trace-class, shared Cameron–Martin space); forward corruption with optional smoothing operators.

We consider a continuum-first generative learning problem on a real separable Hilbert space H. Our data are i.i.d. samples {ui}i = 1N ⊂ H drawn from an unknown target law μ on (H, ℬ(H)). The goal is to construct, from these samples, a discretization-consistent sampler whose output law μ̂ approximates μ in a transport metric (typically W2), while keeping the number of drift evaluations and Gaussian random field (GRF) draws bounded independently of the numerical resolution used in computation.

Our approach is based on a multi-scale perturbation scheme indexed by a finite set I = [T] = {1, …, T}. At each scale t ∈ I we introduce a centered Gaussian reference measure
μt = 𝒩(0, Ct(φ)),
where Ct(φ) is a learnable covariance operator depending on parameters φ. Unlike fixed-noise denoising formulations, we treat {Ct(φ)}t ∈ I as part of the model to be learned jointly with the drift (or score) operator, with the intent of shaping the geometry of the intermediate distributions {νt} in a way that improves mixing and reduces the number of sampling steps.

Since all objects are defined on H, we must enforce operator-theoretic constraints ensuring that the associated Gaussian measures are well defined and comparable across scales. We fix once and for all a baseline trace-class, self-adjoint, positive operator C on H. We call a family {Ct(φ)}t ∈ I admissible if, for all t ∈ I, Ct(φ) is trace-class, self-adjoint, and positive, and there exist constants 0 < m ≤ M, independent of t and φ, with
$$ m\, C \preceq C_t(\varphi) \preceq M\, C. $$
The operator inequalities constitute a uniform equivalence condition: they prevent degeneracy of the covariance at any scale and, crucially, imply that all Gaussian references {μt} share the same Cameron–Martin space. Indeed, by standard results for equivalent Gaussian measures, this condition ensures
Hμt = Im (Ct(φ)1/2) = Im (C1/2)   for all t ∈ I,
with equivalence of the corresponding Cameron–Martin norms up to constants depending only on m and M. This shared Cameron–Martin structure is a primary technical requirement: it allows us to interpret Cameron–Martin derivatives, preconditioned scores, and stability estimates uniformly in t, and it prevents the learning problem from inadvertently moving between mutually singular reference geometries across scales.

For each t ∈ I we define a corruption map producing a noisy observation vt ∈ H from a clean sample u ∼ μ by
$$ v_t = A_t u + \eta_t, \qquad \eta_t \sim \mathcal N(0, C_t(\varphi)), $$
where ηt is independent of u, and At : H → H is either the identity or a bounded linear smoothing operator. We impose one of the following conditions:

(A) At = Id and the data law charges the Cameron–Martin space, μ(Hμt) = 1; or
(B) At is a bounded linear smoother with At(H) ⊆ Hμt.

The role of (A)–(B) is measure-theoretic rather than aesthetic. Conditional on u, the law of vt in the corruption map above is 𝒩(Atu, Ct(φ)). In infinite dimensions, Gaussian shifts by vectors outside the Cameron–Martin space produce measures that are singular with respect to μt. Thus, (A) or (B) is needed to ensure Atu ∈ Hμt almost surely, so that 𝒩(Atu, Ct(φ)) is a Cameron–Martin shift of μt for μ-almost every u, and the mixture law νt of vt is absolutely continuous with respect to μt. In practice, (B) is often the relevant regime: a mild smoothing At can be used to reconcile rough data laws μ with the Cameron–Martin regularity induced by trace-class Ct(φ).

We write νt := ℒ(vt) for the marginal law of the corrupted sample at scale t. The intermediate measures {νt} serve two purposes. First, they provide the training distributions for denoising-based regression of the drift field at each t. Second, they define an annealing path from an initial, noise-dominated distribution to a final, data-dominated distribution, to be traversed by a Langevin-type sampler.

At each t, the quantity of interest is the preconditioned score drift F(⋅, t; φ) associated with (νt, μt), defined (as in the preceding section) through the log-density Φt = log (dνt/dμt) by
F(u, t; φ) = −u + RDHμtΦt(u).
This drift depends on φ both explicitly through Ct(φ) (hence through μt and the Cameron–Martin geometry) and implicitly through the induced corrupted law νt. We approximate F(⋅, t; φ) by a parametrized, discretization-consistent map Fθ(⋅, t) : H → H, typically realized as a neural operator so that evaluation on different spatial grids corresponds to the same underlying continuum mapping.

The learning problem is therefore genuinely coupled: changing φ modifies the target of the regression problem for Fθ, while changing θ modifies the quality of the sampler that will ultimately be used with the covariances Ct(φ). Our objective is to choose (θ, φ) so that (i) Fθ(⋅, t) closely matches F(⋅, t; φ) in L2(νt; H) uniformly over t, and (ii) the resulting annealed Langevin dynamics, using {Ct(φ)} as its diffusion covariances, mixes rapidly at each scale. The admissibility constraints above are simultaneously a well-posedness requirement (trace-class noise and shared Cameron–Martin space) and a stabilization mechanism preventing the learned covariance schedule from collapsing onto low-dimensional or excessively stiff geometries.

Although all definitions are posed on H, training and sampling are implemented through finite-dimensional discretizations (e.g. Galerkin projections, grid evaluations, spectral truncations). We assume that the dataset can be queried on arbitrary discretizations and that, at sampling time, the learned maps Fθ(⋅, t) and covariance samplers for Ct(φ) can be evaluated at the same discretization without re-training. This motivates the explicit separation between (i) continuum objects μ, νt, μt, F(⋅, t; φ), and (ii) their numerical approximations. The uniform equivalence plays a key role here: it provides a scale- and parameter-uniform control of norms and regularity in the underlying function space, which is precisely what is needed to formulate discretization-robust error bounds for the induced samplers.

The remainder of the paper makes this problem setup constructive by specifying (a) an explicit joint training objective for (θ, φ) based on denoising, and (b) structured parameterizations of {Ct(φ)} that enforce admissibility while permitting efficient GRF sampling and differentiation with respect to φ.


4. PL-DDO objective and parameterization: structured Ct(φ) (Fourier-diagonal + low-rank, SPDE-based, kernel hyperparameters); regularizers enforcing operator inequalities; practical GRF sampling (KL/FFT) and gradients through noise.

Our training criterion is a preconditioned Langevin–denoising objective in which the covariance schedule is treated as a learnable component of the model. For each scale t ∈ I, recall that we observe corrupted samples vt = Atu + ηt with ηt ∼ 𝒩(0, Ct(φ)), and we seek a drift surrogate Fθ(⋅, t) approximating the preconditioned score drift F(⋅, t; φ) associated with (νt, μt). The empirical counterpart of this goal is captured by the population risk
$$ \mathcal L(\theta, \varphi) := \sum_{t \in I} \mathbb E_{u \sim \mu,\ \eta_t \sim \mathcal N(0, C_t(\varphi))}\big\| \eta_t + F_\theta(A_t u + \eta_t, t) \big\|_H^2 \;+\; \mathcal R(\varphi), $$
where ℛ(φ) is a regularizer enforcing admissibility and additional structure. The appearance of ηt (rather than an explicit score) is deliberate: it yields a regression problem with a readily sampled target and admits an implementation by reparameterized Gaussian draws. Formally, under the standing assumptions ensuring νt ≪ μt and the existence of DHμtΦt, the minimizer over sufficiently rich function classes satisfies the usual denoising identity: at fixed t, the optimal drift field in L2(νt; H) coincides with the preconditioned score drift F(⋅, t; φ). Hence the dependence of ℒ on φ is essential rather than ancillary: changing φ changes the reference μt, the Cameron–Martin geometry, and the induced corrupted law νt, and therefore changes the regression target itself.

We enforce the operator inequalities mC ≼ Ct(φ) ≼ MC by adopting structured families in which admissibility is built in at the continuum level. A convenient template (used repeatedly below) is the congruence form
$$ C_t(\varphi) = C^{1/2}\, S_t(\varphi)\, C^{1/2}, $$
where St(φ) is bounded, self-adjoint, positive, and satisfies σ(St(φ)) ⊆ [m, M] for all t. Then Ct(φ) is trace-class and automatically satisfies the operator inequalities; moreover, the corresponding Cameron–Martin spaces coincide with Im (C1/2), and all t-dependent norms are equivalent with constants depending only on m, M. In what follows we therefore focus on designing St(φ) with (i) spectrum control, (ii) efficient sampling of ηt, and (iii) differentiability with respect to φ as required by gradient-based training.

Assume that C is diagonal in a known orthonormal basis {ek}k ≥ 1 ⊂ H (e.g. the Fourier basis on a periodic domain), with Cek = λkek and ∑kλk < ∞. A particularly efficient family is
$$ S_t(\varphi)\, e_k = s_{t,k}(\varphi)\, e_k, \qquad \text{equivalently} \quad C_t(\varphi)\, e_k = s_{t,k}(\varphi)\, \lambda_k\, e_k. $$
In this case GRF sampling is trivial: if {ξk}k ≥ 1 are i.i.d. standard normals,
$$ \eta_t = \sum_{k \ge 1} \sqrt{s_{t,k}(\varphi)\, \lambda_k}\;\xi_k\, e_k \;\sim\; \mathcal N(0, C_t(\varphi)), $$
which is implemented by truncating to the active discretization modes and using FFTs for basis transforms when needed. To parameterize the spectral multipliers while respecting the bounds [m, M], we may set
$$ s_{t,k}(\varphi) = m + (M - m)\,\sigma(\alpha_{t,k}), $$
with unconstrained parameters αt, k ∈ ℝ and a smooth squashing map σ : ℝ → (0, 1) (e.g. logistic). This yields exact satisfaction of the operator inequalities and avoids nondifferentiable projection. Additionally, we can impose mild schedule regularity across t by penalizing discrete variations,
$$ \mathcal R_{\mathrm{sched}}(\varphi) = \sum_{t=1}^{T-1} \sum_{k \in \mathcal K} \big( s_{t+1,k}(\varphi) - s_{t,k}(\varphi) \big)^2, $$
for the set 𝒦 of represented modes; this reduces spurious oscillations of the geometry between adjacent annealing scales.
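The diagonal schedule at a single scale can be sketched as follows (an illustrative truncation; the baseline spectrum lam and logits alpha are hypothetical, and the logistic squashing keeps every multiplier strictly inside (m, M)):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, M = 256, 0.5, 2.0                    # truncation size and spectral bounds
lam = np.arange(1, n + 1) ** -2.5          # baseline eigenvalues, sum finite
alpha = rng.standard_normal(n)             # unconstrained logits alpha_{t,k}
s = m + (M - m) / (1.0 + np.exp(-alpha))   # logistic squashing into (m, M)
assert m < s.min() and s.max() < M         # operator inequalities, exactly

xi = rng.standard_normal(n)                # i.i.d. standard normals xi_k
coeffs = np.sqrt(s * lam) * xi             # basis coefficients of eta_t
```

On a periodic grid the coefficients would be mapped to physical space with an inverse FFT; the bounds hold by construction for every parameter value, so no projection step is needed during training.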

Purely diagonal schedules cannot capture correlations across modes induced by nonstationarity or global constraints in the data. To enlarge expressivity while preserving tractability, we may augment a diagonal St by a finite-rank term:
$$ S_t(\varphi) = D_t + U_t^* U_t, $$
where Dt is diagonal in the reference basis and Ut : H → ℝr (equivalently Ut* : ℝr → H) has rank r (active dimension). Exact spectral enforcement σ(St) ⊆ [m, M] for this family can be achieved by construction only in special cases; in practice we therefore combine (i) a base diagonal component parameterized as in the Fourier-diagonal family with strict bounds Dt ≽ mI and ∥Dt∥ ≤ M, and (ii) a soft spectral penalty controlling departures from [m, M] on the active discretization:
$$ \mathcal R_{\mathrm{spec}}(\varphi) = \sum_{t \in I} \Big[ \big( m - \lambda_{\min}(S_t) \big)_+^2 + \big( \lambda_{\max}(S_t) - M \big)_+^2 \Big], $$
where (x)+ = max {x, 0} and λmin, λmax are computed on the finite-dimensional truncation. This empirical enforcement is consistent with the continuum intent when the parameterization is resolution-invariant and the rank r is fixed as resolution increases. Sampling from 𝒩(0, Ct) with the diagonal-plus-low-rank family can be done by drawing a diagonal component plus a low-rank component, using a Woodbury-type factorization on the discretization; gradients are correspondingly efficient since all dependence on φ passes through diagonal scalings and r-dimensional linear algebra.
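A small sketch of the diagonal-plus-low-rank family on an n-mode truncation (all sizes and scalings illustrative; for clarity it uses a dense symmetric square root, whereas a production sampler would use the Woodbury-type factorization mentioned above):

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, m, M = 32, 3, 0.5, 2.0
D = np.diag(rng.uniform(m, M, n))          # base diagonal with m*I <= D <= M*I
U = 0.1 * rng.standard_normal((r, n))      # rank-r factor U_t : H -> R^r
S = D + U.T @ U                            # S_t = D_t + U_t^* U_t, SPD

# Soft spectral penalty on the truncation: distance of sigma(S_t) from [m, M]
lam = np.linalg.eigvalsh(S)
penalty = max(m - lam[0], 0.0) ** 2 + max(lam[-1] - M, 0.0) ** 2

# Draw eta ~ N(0, S_t) via the symmetric square root S_t^{1/2}
w, Q = np.linalg.eigh(S)
eta = Q @ (np.sqrt(w) * (Q.T @ rng.standard_normal(n)))
```

The penalty is differentiable almost everywhere in the parameters through the extreme eigenvalues, and it vanishes whenever the low-rank update stays inside the admissible spectral band.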

On domains where Fourier diagonalization is not appropriate (nonperiodic boundaries, manifolds, irregular geometries), we may define Ct(φ) implicitly through an elliptic SPDE. A canonical choice is a Matérn-type precision operator Lt(φ) (self-adjoint, positive) with
$$ C_t(\varphi) = L_t(\varphi)^{-1}, \qquad L_t(\varphi) = \tau_t\,\big( \kappa_t^2\, I - \Delta \big)^{p}, $$
with hyperparameters φ = (τt, κt)t ∈ I and fixed integer p chosen so that Ct is trace-class in the spatial dimension of interest. Sampling proceeds by the factorization ηt = Lt(φ)−1/2ξ with ξ white noise, implemented numerically by solving a linear elliptic problem (or iterated solves when p > 1) at the working discretization. Admissibility relative to C is enforced by restricting (τt, κt) to compact intervals and, when needed, choosing C within the same SPDE family (so that the operator inequalities follow from uniform ellipticity bounds). This yields a parameterization with a small number of physically interpretable degrees of freedom and a sampling cost comparable to a few PDE solves per Langevin step.
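The elliptic-solve route can be illustrated in one dimension (a sketch with p = 1, a dense Dirichlet finite-difference Laplacian, and illustrative values of τ, κ; quadrature and mass-matrix factors are omitted): factoring the precision as L = RᵀR and solving Rη = ξ gives Cov(η) = R⁻¹R⁻ᵀ = L⁻¹, i.e. one linear solve per draw.

```python
import numpy as np

rng = np.random.default_rng(5)
n, tau, kappa = 100, 1.0, 5.0
h = 1.0 / (n + 1)
# 1-D Dirichlet finite-difference Laplacian (negative definite)
lap = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
       + np.diag(np.ones(n - 1), -1)) / h ** 2
L = tau * (kappa ** 2 * np.eye(n) - lap)   # precision operator (p = 1), SPD

# eta = L^{-1/2} xi: with the Cholesky factorization L = R^T R (R upper
# triangular), solving R eta = xi yields Cov(eta) = R^{-1} R^{-T} = L^{-1}.
R = np.linalg.cholesky(L).T
xi = rng.standard_normal(n)                # discretized white noise
eta = np.linalg.solve(R, xi)
```

At scale one would replace the dense factorization with a sparse or multigrid solver; the per-draw cost then matches the "few PDE solves per Langevin step" quoted above.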

When H = L2(D) and Ct(φ) arises as an integral operator with kernel ktφ(x, y), we can parameterize ktφ by stationary kernels (e.g. squared exponential, Matérn) or mixtures thereof. On regular grids, stationary kernels are diagonal in Fourier space and reduce to the Fourier-diagonal family with st, k determined by a low-dimensional spectral density. This connects kernel hyperparameters directly to the spectrum of Ct, allowing us to enforce bounds by constraining the spectral density between m and M multiples of the baseline density.

Beyond hard or soft enforcement of the operator inequalities, we include terms promoting numerical stability and efficient sampling. Typical choices include: (i) trace control, e.g. ∑t(Tr(Ct) − Tr(C))2, to prevent excessive noise at intermediate scales; (ii) smoothness across t, as in the schedule-variation penalty; and (iii) structural sparsity (for low-rank schedules) via ∥Ut∥HS2 or nuclear-norm surrogates. Importantly, all such penalties are posed in terms of continuum-consistent quantities (trace, Hilbert–Schmidt norms, spectral multipliers) so that their effect does not depend qualitatively on discretization.

Optimization of ℒ(θ, φ) requires differentiating through ηt ∼ 𝒩(0, Ct(φ)). We use the standard reparameterization
$$ \eta_t = C_t(\varphi)^{1/2}\, \xi, \qquad \xi \sim \mathcal N(0, I), $$
with ξ a white-noise draw on the active discretization, implemented by spectral scaling in the Fourier-diagonal case, by low-rank factorizations in the diagonal-plus-low-rank case, or by elliptic solves in the SPDE family. This yields unbiased stochastic gradients of ℒ with respect to φ under the usual differentiability assumptions on the map φ ↦ Ct(φ)1/2 on the active discretization. In the diagonal case the derivatives are explicit and cheap; in SPDE families, automatic differentiation through the linear solve (or an adjoint solve) provides gradients with a cost comparable to sampling itself. Since the loss is expressed in the ambient H-norm, we additionally ensure that discretized implementations use quadrature-consistent inner products so that training remains invariant under refinement (up to the numerical approximation of Fθ and of the sampling map φ ↦ ηt).
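Unbiasedness of the pathwise gradient can be checked in the diagonal case with the stand-in loss f(η) = ∥η∥² (illustrative sizes and logits): with ηk = √(sk λk) ξk, the pathwise derivative ∂f/∂αk = ξk² λk ∂sk/∂αk has expectation λk ∂sk/∂αk, which is exactly ∂𝔼[f]/∂αk since 𝔼[f] = ∑k sk λk.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, M, N = 8, 0.5, 2.0, 400_000
lam = 1.0 / (1.0 + np.arange(n)) ** 2    # baseline eigenvalues lambda_k
alpha = rng.standard_normal(n)           # unconstrained parameters alpha_k
sig = 1.0 / (1.0 + np.exp(-alpha))       # logistic squashing
s = m + (M - m) * sig                    # multipliers s_k in (m, M)
ds = (M - m) * sig * (1.0 - sig)         # derivative d s_k / d alpha_k

# Pathwise (reparameterized) gradient of f(eta) = ||eta||^2: the same xi
# used to generate eta is reused in the derivative, giving an unbiased
# estimate of the exact gradient lam_k * ds_k.
xi = rng.standard_normal((N, n))
grad_pathwise = (xi ** 2 * lam * ds).mean(axis=0)
grad_exact = lam * ds                    # since E[f] = sum_k s_k lam_k
```

The Monte Carlo average matches the closed-form gradient to within a relative error of order √(2/N) per component.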

In summary, the PL-DDO formulation couples a denoising regression objective with a constrained, continuum-defined covariance schedule. The structured families above are designed so that (i) admissibility and common Cameron–Martin geometry are maintained uniformly, (ii) GRF sampling remains fast at arbitrary resolutions, and (iii) gradients with respect to φ can be computed reliably. These properties are the prerequisites for the sampling procedure developed next, where the learned Ct(φ) is used as the diffusion covariance in an annealed, preconditioned Langevin scheme.


5. Sampling algorithm: annealed preconditioned Langevin with learned Ct; Euler–Maruyama vs Crank–Nicolson discretization; stability heuristics and step-size scheduling.

Having learned a drift surrogate Fθ(⋅, t) together with an admissible covariance schedule Ct(φ), we define a sampler by discretizing a preconditioned Langevin dynamics on H whose diffusion at scale t is precisely Ct(φ). Throughout, we treat the annealing index t ∈ I = {1, …, T} as a piecewise-constant "time" and run a short Markov chain at each t, warm-started from the previous scale.

Fix a scale t. The ideal continuum dynamics corresponding to the preconditioned score drift F(⋅, t; φ) is the H-valued stochastic evolution
$$ dU_s = F(U_s, t; \varphi)\, ds + \sqrt{2}\, dW^{C_t(\varphi)}_s, $$
where WCt(φ) denotes a Ct(φ)-Wiener process. Replacing F by Fθ yields the learned dynamics. In the annealed setting we run for a finite horizon at each t, with the terminal state at t serving as the initial condition at t + 1. We emphasize that Ct(φ) plays a dual role: it specifies the diffusion covariance in this evolution and, through the training problem, it shapes the Cameron–Martin geometry in which the learned drift is an approximation to the true preconditioned score drift.

Let ht > 0 be a step size at scale t, and let M be the number of inner steps per scale. On a finite-dimensional discretization Hn ⊂ H (e.g. a Galerkin space) we use the explicit EM update

$$ u^{(n+1)} \;=\; u^{(n)} \;+\; h_t\, F_\theta\big(u^{(n)}, t\big) \;+\; \sqrt{2 h_t}\; \xi_t^{(n)}, \qquad \xi_t^{(n)} \sim \mathcal{N}\big(0, C_t(\varphi)\big), $$
where {ξt(n)} are independent across iterations n and across scales t. In practice we draw ξt(n) via the reparameterization ξt(n) = Ct(φ)1/2z(n) with z(n) ∼ 𝒩(0, I) in the chosen basis. This update is the natural discretization of the continuum dynamics above and is easy to implement when Fθ is a neural operator and Ct(φ) admits fast sampling.
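A compact sketch of this update (the drift `F_theta` is a hypothetical stand-in; Ct(φ) is assumed Fourier-diagonal with a real, even square-root spectrum, so the draw ξt = Ct(φ)1/2z reduces to an FFT filter):

```python
import numpy as np

def sample_grf_fourier(sqrt_spectrum, rng):
    # xi = C_t^{1/2} z with z white noise: filter white noise by the
    # square-root spectrum in the Fourier basis (assumed real and even).
    z = rng.standard_normal(sqrt_spectrum.shape[0])
    return np.real(np.fft.ifft(sqrt_spectrum * np.fft.fft(z)))

def em_step(u, t, h, F_theta, sqrt_spectrum, rng):
    # One explicit Euler-Maruyama step: u + h F_theta(u, t) + sqrt(2h) xi.
    xi = sample_grf_fourier(sqrt_spectrum, rng)
    return u + h * F_theta(u, t) + np.sqrt(2.0 * h) * xi
```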

At the continuum level, explicit EM is sensitive to stiffness in the drift. In our setting this stiffness is particularly visible because the preconditioned drift decomposes as

$$ F_\theta(u, t) \;=\; -\,u \;+\; G_\theta(u, t), $$

so that the linear damping term −u is always present, while Gθ may be only locally Lipschitz on typical data regimes. This observation motivates a semi-implicit alternative that treats the damping term more stably.

A standard stabilization in (infinite-dimensional) Langevin integrators is to apply a CN discretization to the linear Ornstein–Uhlenbeck component and keep the nonlinear part explicit. Concretely, applying CN to the −u term and EM to Gθ yields

$$ u^{(n+1)} \;=\; u^{(n)} \;-\; \frac{h_t}{2}\big( u^{(n)} + u^{(n+1)} \big) \;+\; h_t\, G_\theta\big(u^{(n)}, t\big) \;+\; \sqrt{2 h_t}\; \xi_t^{(n)}. $$
Solving for u(n + 1) gives the explicit affine form

$$ u^{(n+1)} \;=\; a_t\, u^{(n)} \;+\; b_t\, G_\theta\big(u^{(n)}, t\big) \;+\; c_t\, \xi_t^{(n)}, \qquad a_t = \frac{1 - h_t/2}{1 + h_t/2}, \quad b_t = \frac{h_t}{1 + h_t/2}, \quad c_t = \frac{\sqrt{2 h_t}}{1 + h_t/2}. $$
Two properties of this affine update are particularly useful for our purposes. First, the linear factor satisfies |at| < 1 for all ht > 0, which prevents the explicit instability that can arise in the EM update when ht is not sufficiently small relative to the linear damping. Second, when Gθ ≡ 0 the CN scheme preserves the correct Gaussian invariant law for the linear preconditioned OU dynamics (up to discretization effects in Hn), a feature that is aligned with the role of μt = 𝒩(0, Ct(φ)) as the reference measure at each scale. For these reasons, CN-type updates are often preferable at small noise levels (late in the annealing), where stability constraints are most restrictive.
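These coefficients can be checked numerically. The sketch below assumes the standard CN/EM splitting, under which at = (1 − ht/2)/(1 + ht/2), bt = ht/(1 + ht/2), and the noise weight is ct = √(2ht)/(1 + ht/2); the invariance claim for Gθ ≡ 0 then reduces, mode by mode, to the identity ct²/(1 − at²) = 1:

```python
import numpy as np

def cn_coefficients(h):
    # Crank-Nicolson on the linear damping -u, explicit on G_theta.
    a = (1.0 - h / 2.0) / (1.0 + h / 2.0)    # |a| < 1 for every h > 0
    b = h / (1.0 + h / 2.0)                  # weight on the nonlinear part
    c = np.sqrt(2.0 * h) / (1.0 + h / 2.0)   # weight on the C_t^{1/2} noise
    return a, b, c

# With G_theta = 0, each mode follows u_{n+1} = a u_n + c xi, whose
# stationary variance c^2 Var(xi) / (1 - a^2) equals Var(xi) exactly:
# the CN chain preserves the Gaussian reference law N(0, C_t) mode-wise.
```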

We stress that both the EM and the CN updates use exactly the same learned ingredients (Fθ, Ct(φ)); the distinction is purely numerical. Our theory in the next section is stated for the continuum dynamics and for stable discretizations; empirically, the CN scheme enlarges the range of usable step sizes without sacrificing resolution invariance.

We initialize at the coarsest/noisiest scale by drawing

$$ u_1^{(0)} \;\sim\; \mu_1 \;=\; \mathcal{N}\big(0, C_1(\varphi)\big), $$
and then, for t = 1, …, T, we run M iterations of either the EM or the CN update with covariance Ct(φ), producing a terminal state ut(M). We set ut + 1(0) := ut(M) and proceed. The output is uT(M), which we interpret as an approximate sample from νT (and, in regimes where νT ≈ μ, from μ). This warm-started annealing is essential: it amortizes mixing by transporting mass through a sequence of intermediate geometries rather than attempting to reach νT in one long chain.
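The warm-started loop can be sketched as follows (EM inner updates; `F_theta` and `sample_noise` are hypothetical stand-ins for the learned drift and the 𝒩(0, Ct(φ)) sampler):

```python
import numpy as np

def annealed_sampler(F_theta, sample_noise, h, M, T, rng):
    # Annealed, preconditioned Langevin: M inner EM steps per scale t,
    # with the terminal state at scale t initializing scale t + 1.
    u = sample_noise(1, rng)                     # u_1^{(0)} ~ N(0, C_1)
    for t in range(1, T + 1):
        for _ in range(M):
            xi = sample_noise(t, rng)            # xi ~ N(0, C_t)
            u = u + h[t - 1] * F_theta(u, t) + np.sqrt(2.0 * h[t - 1]) * xi
    return u                                     # approximate sample from nu_T
```

For the linear drift F(u, t) = −u with identity covariance, each coordinate is an OU-type AR(1) recursion whose stationary variance is 1/(1 − h/2), close to the target value 1 for small h.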

Because Ct(φ) is learned, the geometry of the sampler varies with φ and can change markedly across t. We therefore choose {ht} in a manner that is robust to such variation. At a minimum, stability of explicit EM suggests taking ht inversely proportional to an estimate of the Lipschitz constant of Fθ(⋅, t) on typical states. In practice we approximate this by one of the following proxies on the active discretization: (i) a running estimate of ∥∇uFθ(u, t)∥ via automatic differentiation on minibatches of chain states; (ii) a spectral proxy based on the linear part, namely ensuring that ht is not so large that the effective contraction factor (1 − ht) becomes negative in the EM update; or (iii) a conservative schedule decreasing with t when t corresponds to decreasing noise (late scales are typically more "peaked" and hence less forgiving).

For CN updates, we may choose larger ht because the linear stiffness is controlled by |at| < 1. Nevertheless, the nonlinear term Gθ still imposes a practical limit: an excessively large bt leads to poor approximation of the target drift and can cause transient divergence before the −u term contracts. A simple and effective heuristic is to pick ht so that bt remains below a prescribed threshold b̄, while allowing at to be close to 1 at early (high-noise) scales and smaller at late (low-noise) scales. We also note that the uniform equivalence bounds m C⋆ ≼ Ct(φ) ≼ M C⋆ imply that the typical noise energy and the induced Cameron–Martin norm remain comparable across t; this prevents the schedule from producing degenerate diffusion directions and thereby supports stable choices of ht that do not depend on the discretization dimension.
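If bt takes the form ht/(1 + ht/2) (the CN/EM splitting weight assumed here), the threshold rule inverts in closed form; a sketch with a hypothetical threshold b_max < 2:

```python
def step_size_for_b_threshold(b_max):
    # Invert b_t = h_t / (1 + h_t / 2) = b_max for h_t (valid for b_max < 2):
    # h_t = b_max / (1 - b_max / 2) is the largest step keeping b_t <= b_max.
    assert 0.0 < b_max < 2.0
    return b_max / (1.0 - b_max / 2.0)
```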

All updates are executed on a finite-dimensional representation Hn determined by the query discretization. The drift Fθ(⋅, t) is evaluated as a discretization-consistent neural operator, and ξt(n) is drawn by sampling the truncated Gaussian 𝒩(0, PnCt(φ)Pn), where Pn : H → Hn is the projection associated with the discretization. Provided the sampling map Ct(φ)1/2 is implemented in a manner compatible with Pn (Fourier truncation, KL truncation, or discretized SPDE solves), the resulting chain is consistent under refinement in the sense that it approximates the same continuum algorithm, rather than a dimension-dependent surrogate.

The purpose of this section is to make explicit the algorithmic object to be analyzed: an annealed, preconditioned Langevin chain whose preconditioner is learned but constrained to an admissible operator family. In the next section we justify that, under the standing spectral bounds on Ct(φ) and standard monotonicity/Lipschitz hypotheses on the drift, the underlying continuum dynamics is well posed and the measures appearing in the construction are mathematically meaningful, so that the training and sampling procedures are posed on H rather than tied to a particular discretization.


6. Theory I (well-posedness): existence/uniqueness of the continuum dynamics with time-varying preconditioner; conditions ensuring invariant/target measures are well-defined; finiteness of the training objective.

We justify that, under our standing admissibility and regularity hypotheses, the continuum objects underlying training and sampling are mathematically meaningful on H: (i) the reference measures μt = 𝒩(0, Ct(φ)) and the perturbed laws νt are well defined and satisfy νt ∼ μt; (ii) the log-density Φt = log (dνt/dμt) admits a Cameron–Martin derivative so that the preconditioned score drift F(⋅, t; φ) is defined νt-a.s.; (iii) the continuum dynamics with time-varying covariance is well posed; and (iv) the denoising objective is finite.

Our basic structural requirement is the uniform operator inequality

$$ m\, C_\star \;\preceq\; C_t(\varphi) \;\preceq\; M\, C_\star \qquad \text{for all } t \in I, $$
for fixed 0 < m ≤ M < ∞ and trace-class C⋆. First, this inequality implies that each Ct(φ) is trace-class, with the uniform trace bound

$$ \mathrm{Tr}\big(C_t(\varphi)\big) \;\le\; M\, \mathrm{Tr}(C_\star) \;<\; \infty, $$
and hence ηt ∼ 𝒩(0, Ct(φ)) is H-valued with 𝔼∥ηt∥² = Tr(Ct(φ)). Second, the equivalence yields a shared Cameron–Martin space: by standard operator monotonicity arguments (e.g. comparing Ct1/2 and C⋆1/2 via functional calculus), we have

$$ \mathrm{Im}\big(C_t(\varphi)^{1/2}\big) \;=\; \mathrm{Im}\big(C_\star^{1/2}\big) \;=:\; H_\mu \qquad \text{for all } t \in I, $$
and the corresponding Cameron–Martin norms are uniformly equivalent. In particular, translations by elements of Hμ preserve measure equivalence for all μt, uniformly in t.

A convenient sufficient mechanism for these bounds is the congruence parameterization

$$ C_t(\varphi) \;=\; C_\star^{1/2}\, S_t(\varphi)\, C_\star^{1/2}, $$
where St(φ) is bounded, self-adjoint, positive, and σ(St(φ)) ⊂ [m, M]. Then the operator inequality, the trace bound, and the shared Cameron–Martin space all follow immediately from trace-class stability under bounded congruences and the spectral bounds on St(φ).
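In a basis diagonalizing both C⋆ and St(φ), the congruence reduces to elementwise multiplication of spectra, and admissibility can be enforced by clipping; a sketch (hypothetical names, diagonal case only):

```python
import numpy as np

def admissible_covariance_spectrum(base_spectrum, s_raw, m, M):
    # C_t(phi) = C*^{1/2} S_t(phi) C*^{1/2} with everything diagonal:
    # clipping the multipliers to [m, M] enforces m C* <= C_t <= M C*,
    # hence a shared Cameron-Martin space and Tr(C_t) <= M Tr(C*).
    return base_spectrum * np.clip(s_raw, m, M)
```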

Fix t ∈ I. Under assumption (A) we have vt = u + ηt, while under (B) we have vt = Atu + ηt with At(H) ⊆ Hμ. In either case, conditioning on u yields that νt is a mixture of Cameron–Martin shifts of μt:
νt(⋅) = ∫μt(⋅ − Atu) μ(du),
where in case (A) we interpret At = Id and require μ(Hμ) = 1. For each fixed x ∈ Hμ, the Cameron–Martin theorem gives

$$ \frac{d\mu_t(\cdot - x)}{d\mu_t}(v) \;=\; \exp\!\Big( \langle x, v \rangle_{H_{\mu_t}} - \tfrac{1}{2} \|x\|_{H_{\mu_t}}^2 \Big), $$
Applying this identity with x = Atu ∈ Hμ and averaging over u yields the explicit Radon–Nikodym derivative

$$ \frac{d\nu_t}{d\mu_t}(v) \;=\; \int_H \exp\!\Big( \langle A_t u, v \rangle_{H_{\mu_t}} - \tfrac{1}{2} \|A_t u\|_{H_{\mu_t}}^2 \Big)\, \mu(du). $$
The right-hand side is strictly positive μt-a.s. and finite provided, for instance, that 𝔼∥Atu∥Hμ² < ∞, which we take as a mild integrability condition (automatic in many applications where At is smoothing and u ∈ H has finite second moment). Consequently νt ≪ μt, and since each component measure μt(⋅ − Atu) is itself equivalent to μt, the mixture satisfies μt ≪ νt as well; hence νt ∼ μt and Φt = log (dνt/dμt) is μt-a.s. (equivalently νt-a.s.) well defined.
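The Cameron–Martin factor can be sanity-checked in one dimension, where μt = 𝒩(0, c) gives ⟨x, v⟩Hμ = xv/c and ∥x∥²Hμ = x²/c, so the shifted-to-unshifted density ratio should equal exp(xv/c − x²/(2c)); a sketch:

```python
import numpy as np

def gauss_pdf(v, mean, var):
    # Density of N(mean, var) at v.
    return np.exp(-(v - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def cameron_martin_density(x, v, var):
    # exp(<x, v>_{H_mu} - 0.5 ||x||_{H_mu}^2) for mu = N(0, var) in 1D.
    return np.exp(x * v / var - 0.5 * x ** 2 / var)
```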

Moreover, the Radon–Nikodym identity above is formulated intrinsically on H using Hμ duality, so it is consistent under discretization: Galerkin projections simply replace μt and νt by their pushforwards to Hn, without altering the continuum identity.

The preconditioned drift is defined by
F(u, t; φ) :=  − u + RDHμΦt(u),
where DHμΦt denotes the Fréchet derivative along Hμ (i.e. directional differentiability in Cameron–Martin directions). In the present setting, existence of DHμΦt can be justified under standard regularity assumptions on the density dνt/dμt. One sufficient condition is that the map
$$ v \;\mapsto\; \frac{d\nu_t}{d\mu_t}(v) $$
belongs to a Gaussian Sobolev class W1, 2(μt) on H, in which case DHμΦt exists μt-a.s. and is square integrable; see the Malliavin/abstract Wiener space literature on differentiability of log-densities with respect to Gaussian reference measures. For our purposes, we record this as an assumption: DHμΦt exists and is such that u ↦ F(u, t; φ) is globally Lipschitz and dissipative uniformly in t, as required in Theorem~1.

We consider the annealed dynamics in which t is piecewise constant. Let 0 = s0 < s1 < ⋯ < sT be switching times and set t(s) = k for s ∈ [sk − 1, sk). We may realize the noise via a cylindrical Wiener process Bs on H and write

$$ du_s \;=\; F\big(u_s, t(s); \varphi\big)\, ds \;+\; \sqrt{2}\; C_{t(s)}(\varphi)^{1/2}\, dB_s. $$
By the uniform trace bound, each Ct(s)(φ)1/2 is Hilbert–Schmidt, so the stochastic integral above is well defined in H. If F(⋅, t; φ) is globally Lipschitz and satisfies a uniform dissipativity condition

$$ \big\langle F(u, t; \varphi),\, u \big\rangle \;\le\; \beta \;-\; \alpha\, \|u\|^2 \qquad \text{for all } u \in H,\; t \in I, $$
then standard infinite-dimensional SDE/SPDE theory (e.g. a fixed-point argument for mild solutions on each interval [sk − 1, sk), combined with Grönwall estimates and the Itô isometry via the trace bound) yields existence and uniqueness of a strong solution with continuous H-valued paths, together with uniform moment bounds such as

$$ \sup_{0 \le s \le s_T} \mathbb{E}\, \|u_s\|^2 \;\le\; K\big( 1 + \mathbb{E}\, \|u_0\|^2 \big), $$
where the constants depend on α, β, m, M, Tr(C⋆) but not on any discretization dimension. The same statement applies to the learned dynamics obtained by replacing F with Fθ, provided Fθ(⋅, t) obeys analogous Lipschitz/dissipativity bounds (which can be enforced by architecture and/or spectral normalization and is compatible with our regularization of φ).

Fix t and consider the autonomous dynamics at this fixed scale. The definition of F(⋅, t; φ) as a μt-preconditioned score drift implies that νt is formally invariant: for smooth cylindrical test functions f, the generator
ℒtf(u) = ⟨F(u, t; φ), Df(u)⟩ + Tr (Ct(φ)D²f(u))
satisfies ∫ℒtf dνt = 0. This identity is the infinite-dimensional analogue of the finite-dimensional fact that Langevin dynamics with drift equal to the score of the target has the target as its invariant law; here the integration by parts is performed with respect to the Gaussian reference μt and transferred to νt using Φt. Under the dissipativity condition above, one obtains a Feller Markov semigroup with at most one invariant probability measure, hence νt is the unique invariant law at scale t.

Finally, we verify that the denoising objective is finite under mild growth conditions. Since ηt ∼ 𝒩(0, Ct(φ)) is H-valued, the trace bound gives

$$ \mathbb{E}\, \|\eta_t\|^2 \;=\; \mathrm{Tr}\big(C_t(\varphi)\big) \;\le\; M\, \mathrm{Tr}(C_\star) \;<\; \infty. $$
Moreover vt = Atu + ηt is H-valued, and if Fθ(⋅, t) has at most linear growth uniformly in t, namely ∥Fθ(u, t)∥ ≤ c0 + c1∥u∥, then
𝔼∥Fθ(vt, t)∥² ≤ 2c0² + 2c1² 𝔼∥vt∥² < ∞,
whenever 𝔼∥u∥² < ∞ and the trace bound holds. Thus ℒ(θ, φ) is finite for admissible φ and such Fθ, and the regularizer ℛ(φ) may be viewed as ensuring the standing operator bounds needed for the well-posedness statements above. This completes the well-posedness layer required before we turn to quantitative sampling error bounds.


7. Theory II (dimension-free sampling error): W2 (or bounded-Lipschitz) bounds decomposing error into score approximation, discretization, and annealing; explicit dependence on m, M but not on grid resolution.

We now quantify the discrepancy between the distribution produced by the annealed Euler–Maruyama sampler driven by the learned drift Fθ and the terminal perturbed law νT (and, by a further triangle inequality, the target μ when νT ≈ μ). Our objective in this section is to exhibit an error decomposition whose constants depend on continuum quantities and on the uniform equivalence constants m, M, but not on any discretization resolution.

Fix t ∈ I and a step size ht > 0. We consider the (one-step) Euler kernels on H

$$ K_t^h(u, \cdot) \;=\; \mathrm{Law}\Big( u + h_t F(u, t; \varphi) + \sqrt{2 h_t}\, \xi_t \Big), \qquad \widehat{K}_t^h(u, \cdot) \;=\; \mathrm{Law}\Big( u + h_t F_\theta(u, t) + \sqrt{2 h_t}\, \xi_t \Big), $$
where ξt ∼ 𝒩(0, Ct(φ)) is independent of u. For a probability measure ρ on H we write ρKth for its pushforward under the kernel, and (Kth)M for the M-fold composition. The full sampler is the composition of M learned-kernel steps per scale, hence the terminal law is

$$ \widehat{\pi} \;:=\; \rho_0\, \big(\widehat{K}_1^h\big)^{M} \big(\widehat{K}_2^h\big)^{M} \cdots \big(\widehat{K}_T^h\big)^{M}, $$
with initialization ρ0 = μ1 = 𝒩(0, C1(φ)) for definiteness. We analogously define the "ideal discrete-time" law

$$ \pi^h \;:=\; \rho_0\, \big(K_1^h\big)^{M} \big(K_2^h\big)^{M} \cdots \big(K_T^h\big)^{M}. $$

The estimates below rest on a standard contractivity condition for the fixed-scale dynamics. Concretely, we assume that for each t ∈ I the drift F(⋅, t; φ) is globally Lipschitz on H with constant L and is (strongly) monotone with constant α > 0,

$$ \big\langle F(u, t; \varphi) - F(v, t; \varphi),\; u - v \big\rangle \;\le\; -\,\alpha\, \|u - v\|^2 \qquad \text{for all } u, v \in H, $$
uniformly in t. In preconditioned models the natural monotonicity is often formulated in the Ct(φ)−1-geometry; using m C⋆ ≼ Ct(φ) ≼ M C⋆, one passes between the corresponding norms with constants depending only on m, M. In particular, any monotonicity constant stated in a Ct−1-metric yields an α in the H-norm condition, with α depending on m, M and the baseline geometry, but not on the discretization dimension.

We decompose the terminal error as

$$ W_2(\widehat{\pi}, \nu_T) \;\le\; W_2(\widehat{\pi}, \pi^h) \;+\; W_2(\pi^h, \nu_T). $$
The first term is the drift-approximation error; the second term splits into a discretization contribution (Euler bias) and an annealing contribution (finite number of scales / imperfect bridging of {νt}).

Let δ satisfy the uniform mean-square drift-approximation bound

$$ \sup_{t \in I}\; \mathbb{E}_{v \sim \nu_t} \big\| F(v, t; \varphi) - F_\theta(v, t) \big\|^2 \;\le\; \delta. $$
To convert this into a Wasserstein bound for the Markov kernels, we use a synchronous coupling: evolve two chains from the same initial point and with the same Gaussian increment ξt at each step. Writing U⁺ and Û⁺ for one Euler step at scale t under the true and the learned drift, respectively, this gives
U⁺ − Û⁺ = ht(F(U, t; φ) − Fθ(U, t)),
hence
𝔼∥U⁺ − Û⁺∥² ≤ ht² 𝔼∥F(U, t; φ) − Fθ(U, t)∥².
Under the standing moment bounds from Section 6 and the fact that, for contractive kernels, the chain remains close (in W2) to νt after a short burn-in, one obtains a perturbation estimate of the form

$$ W_2\big( \rho\, \widehat{K}_t^h,\; \rho\, K_t^h \big) \;\le\; c_1\, h_t\, \delta^{1/2}, $$
with c1 depending on m, M and the uniform Lipschitz/dissipativity constants, but independent of any discretization. Iterating across M steps and T scales, and using the contractivity of Kth implied by strong monotonicity when ht ≤ 1/L, yields

$$ W_2\big( \widehat{\pi}, \pi^h \big) \;\le\; c_1\, \delta^{1/2} \sum_{t=1}^{T} M h_t. $$
In the common regime ht ≡ h with fixed total "algorithmic time" ∑t M ht = O(1), the drift-induced error is O(√δ).

Even with the true drift, the Euler kernel Kth has an invariant law νt, h which differs from the continuum invariant νt. Under standard regularity (global Lipschitz drift and monotonicity) and trace-class noise, one has a strong error estimate for the Euler scheme in Hilbert spaces, which implies a Wasserstein bias bound of the form

$$ W_2\big( \nu_{t,h},\; \nu_t \big) \;\le\; c_2\, h_t^{1/2}, $$
again with constants depending on the continuum parameters and on m, M through the noise geometry, but not on any discretization resolution.

Moreover, because we run only M steps per scale, we incur a finite-time mixing error relative to νt, h. By contractivity of Kth in W2 (a discrete analogue of the continuum contraction), there exist γt > 0 and Cmix < ∞ such that for any initial ρ with finite second moment,

$$ W_2\big( \rho\, (K_t^h)^{M},\; \nu_{t,h} \big) \;\le\; C_{\mathrm{mix}}\; e^{-\gamma_t M h_t}\; W_2\big( \rho,\; \nu_{t,h} \big), $$
with γt comparable to the monotonicity constant α up to factors controlled by m, M. Combining the bias and mixing estimates across scales yields an accumulated discretization-and-mixing contribution bounded by

$$ \sum_{t=1}^{T} \Big( c_2\, h_t^{1/2} \;+\; C_{\mathrm{mix}}\, e^{-\gamma_t M h_t}\, \Xi_t \Big), $$
where Ξt is a scale-dependent warm-start term (controlled by second moments, hence dimension-free under the standing moment bounds). In particular, taking Mht ≳ γt−1 log (1/ε) makes the second term O(ε) uniformly over discretizations.
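The resulting rule for the number of inner steps is immediate to implement; a sketch (`gamma_t` and `h_t` are placeholders for the estimated contraction rate and the chosen step size):

```python
import numpy as np

def inner_steps_for_accuracy(gamma_t, h_t, eps):
    # Choose M with M h_t >= gamma_t^{-1} log(1/eps), so that the
    # warm-start term C_mix exp(-gamma_t M h_t) Xi_t is O(eps).
    return int(np.ceil(np.log(1.0 / eps) / (gamma_t * h_t)))
```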

The remaining discrepancy is intrinsic to the annealing/bridging design: even if we could sample from νt at each scale, a finite number of scales and the chosen forward map u ↦ Atu determine how closely νT approximates μ (and how difficult the transitions νt → νt + 1 are). We encode this as an additive term AnnealErr({νt}), for instance

$$ \mathrm{AnnealErr}\big(\{\nu_t\}\big) \;:=\; W_2(\nu_T, \mu) \;+\; \max_{1 \le t < T} W_2(\nu_t, \nu_{t+1}), $$
or any sharper telescoping surrogate reflecting the actual schedule and the number of inner steps. This term depends on the modeling choice of scales and corruption operators, but it is a continuum quantity and therefore does not deteriorate with grid refinement.

Collecting the drift-approximation, discretization, and finite-time mixing estimates together with the annealing contribution yields the following representative estimate, which is a refined version of Theorem~2 stated in the overview.

Since W2 metrizes weak convergence on sets with uniformly bounded second moments, the estimate above immediately implies a bounded-Lipschitz bound: for ∥ ⋅ ∥BL the dual norm over functions f with ∥f∥∞ ≤ 1 and Lip(f) ≤ 1, we have
∥π̂ − νT∥BL ≤ W1(π̂, νT) ≤ W2(π̂, νT),
hence the same decomposition applies (with the same dimension-free constants) to bounded test-function errors.

Finally, we emphasize why the bounds are genuinely resolution-independent. The contraction parameters and Euler error constants are obtained at the level of the continuum Hilbert space H, using only (i) trace-class noise bounds and (ii) drift coercivity/Lipschitz properties measured in norms whose equivalence is controlled by m, M. For a Galerkin projection Pn : H → Hn, the projected dynamics inherit the same constants uniformly in n, and W2-stability under weak convergence allows passing to the n → ∞ limit. Thus, once δ, {ht}, and the annealing design are fixed, the number of steps TM needed to reach a prescribed accuracy bound does not blow up with the grid.


8. Theory III (acceleration and tightness): log-concave special case; mixing-time/spectral-gap proxy in terms of an operator condition number; upper bounds improved by learned Ct; matching lower bound showing near-optimality within the covariance family.

In this section we isolate a regime in which the dependence of mixing on the covariance choice Ct(φ) can be characterized by a single operator condition number, and in which optimizing Ct(φ) within an admissible family yields a provable (and essentially unimprovable) reduction in the number of inner Langevin steps required at each scale.

Fix t ∈ I. We assume that νt is absolutely continuous with respect to μt = 𝒩(0, Ct(φ)) and admits a log-concave density of the form

$$ \frac{d\nu_t}{d\mu_t}(u) \;\propto\; \exp\big( -\Psi_t(u) \big), $$
where Ψt : H → ℝ is differentiable along the Cameron–Martin space Hμt, with derivative DHμtΨt(u) ∈ Hμt*. With Φt = log (dνt/dμt), we have Φt = −Ψt + const, hence the true drift reads
F(u, t; φ) = −u − RDHμtΨt(u).
We further assume that Ψt is α-strongly convex and L-smooth in the preconditioned geometry, uniformly in t, in the following sense: for all u, v ∈ H and all t,

$$ \alpha\, \|u - v\|_{C_t^{-1}}^2 \;\le\; \big( D_{H_{\mu_t}}\Psi_t(u) - D_{H_{\mu_t}}\Psi_t(v) \big)(u - v) \;\le\; L\, \|u - v\|_{C_t^{-1}}^2, $$
where ∥w∥Ct−1 := ∥Ct(φ)−1/2w∥ on the common image space (well defined thanks to uniform equivalence). This condition is the function-space analogue of Hessian bounds in preconditioned coordinates; it implies monotonicity and Lipschitz properties of F(⋅, t; φ) with constants that depend on (α, L) and the equivalence constants (m, M), but not on the discretization.

For a bounded, positive, self-adjoint operator C satisfying m C⋆ ≼ C ≼ M C⋆, define the preconditioned smoothness and convexity moduli by

$$ L_t(C) \;:=\; \sup_{u \ne v} \frac{\big( D_{H_{\mu_t}}\Psi_t(u) - D_{H_{\mu_t}}\Psi_t(v) \big)(u - v)}{\|u - v\|_{C^{-1}}^2}, \qquad \alpha_t(C) \;:=\; \inf_{u \ne v} \frac{\big( D_{H_{\mu_t}}\Psi_t(u) - D_{H_{\mu_t}}\Psi_t(v) \big)(u - v)}{\|u - v\|_{C^{-1}}^2}, $$

and the associated condition-number proxy

$$ \kappa_t(C) \;:=\; \frac{L_t(C)}{\alpha_t(C)}. $$
When these bounds hold with C = Ct(φ), we have Lt(Ct(φ)) ≤ L and αt(Ct(φ)) ≥ α, hence κt(Ct(φ)) ≤ L/α. More generally, if one compares two covariances C and C̃ in the admissible family, the uniform inequalities m C⋆ ≼ C, C̃ ≼ M C⋆ imply
∥w∥C−1² ≃m, M ∥w∥C̃−1²,
so κt(⋅) varies in a controlled manner over the family, with constants depending only on m, M.

Consider the fixed-scale Euler kernel Kth from the previous section with the true drift F(⋅, t; φ). Under the convexity and smoothness moduli above, one obtains a discrete-time contraction estimate in W2 whose rate is governed by αt(Ct(φ)), with admissible step sizes determined by Lt(Ct(φ)). A representative statement is as follows.

The salient point is that the dependence on the covariance enters through κt(Ct(φ)), and the bounds are dimension-free because αt(C) and Lt(C) are continuum quantities expressed in operator norms on H (with equivalence controlled by m, M).

Let 𝒞 be an admissible family of covariances satisfying m C⋆ ≼ C ≼ M C⋆, for instance the congruence family C = C⋆1/2SC⋆1/2 with S bounded, positive, self-adjoint, and σ(S) ⊂ [m, M]. For each scale t, define the best achievable proxy in the family by
κt⋆ := infC ∈ 𝒞 κt(C).
Since the proposition above shows that Mt need only scale linearly with κt(C), any choice Ct(φ) that reduces κt(Ct(φ)) yields a commensurate reduction in the required number of inner steps. In particular, comparing to the baseline C⋆, we obtain an improvement factor bounded by κt(C⋆)/κt(Ct(φ)). From the modeling perspective, the role of the covariance-learning component φ is precisely to search, within 𝒞, for covariances that better match the local anisotropy of Ψt as seen through the moduli above; this is the function-space analogue of preconditioning by an approximate inverse Hessian.

We now show that the κ-dependence above cannot, in general, be improved when the covariance is restricted to a prescribed family 𝒞. The sharpest statement is obtained on quadratic instances, for which the proxy κt(C) is exact.

Fix t and consider the Gaussian target νt = 𝒩(0, Q) with Q trace-class and strictly positive. Suppose Q commutes with all C ∈ 𝒞 (e.g., all are diagonal in a common Fourier/KL basis). Then νt ≪ μt = 𝒩(0, C) with $\Psi_t(u)=\tfrac12\langle B u,u\rangle$ for a bounded positive operator B diagonal in the same basis, and the preconditioned Langevin dynamics decouple mode-wise. Writing {λi(C)} and {λi(Q)} for eigenvalues in the shared basis, one verifies that the relevant preconditioned spectrum is {λi(C)/λi(Q)}, hence
$$ \kappa_t(C) \;=\; \frac{\max_i \lambda_i(C)/\lambda_i(Q)}{\min_i \lambda_i(C)/\lambda_i(Q)}. $$
For the Euler scheme with step size h ≤ 1/Lt(C), the i-th mode evolves as a one-dimensional AR(1) recursion whose contraction factor is 1 − hai with ai ∈ [αt(C), Lt(C)]. Consequently, no choice of h can make all modes contract faster than the slowest one without violating stability on the fastest one, and the number of steps required to reduce the W2 distance by a factor ε must scale at least linearly in Lt(C)/αt(C).
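This mode-wise picture is easy to reproduce numerically. The sketch below (illustrative only, with a conservative step h = 1/(2 maxᵢ aᵢ)) computes κt(C) for commuting spectra and the number of steps until the slowest mode contracts by ε, which grows essentially linearly in κ:

```python
import numpy as np

def kappa(spec_C, spec_Q):
    # kappa_t(C) = (max_i lam_i(C)/lam_i(Q)) / (min_i lam_i(C)/lam_i(Q)).
    r = spec_C / spec_Q
    return r.max() / r.min()

def steps_to_contract(spec_C, spec_Q, eps):
    # Mode i contracts per step by 1 - h a_i with a_i = lam_i(C)/lam_i(Q).
    # Stability of the fastest mode caps h; with h = 1/(2 max a_i) the
    # slowest mode needs on the order of kappa * log(1/eps) steps.
    a = spec_C / spec_Q
    h = 0.5 / a.max()
    rate = -np.log(1.0 - h * a.min())   # slowest per-step contraction exponent
    return int(np.ceil(np.log(1.0 / eps) / rate))
```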

The upper and lower bounds above together justify the interpretation of covariance learning as an essentially optimal acceleration mechanism within a constrained operator class: improvements in mixing are possible precisely to the extent that one can reduce κt(C) over C ∈ 𝒞, and for commuting Gaussian instances this reduction is both necessary and sufficient up to universal constants.


9. Experiments: (a) Gaussian-mixture GRFs (control ground-truth spectral structure), (b) Navier–Stokes pushforward measures (super-resolution + mixing), (c) conditional Darcy posteriors (many-sample throughput). Report step reduction, wall-clock, resolution invariance, and robustness to discretization changes.

We present three experimental suites designed to isolate (i) whether learning the covariance schedule Ct(φ) produces measurable acceleration at fixed accuracy, (ii) whether this acceleration is stable across discretizations (continuum-first invariance), and (iii) whether the method remains effective in conditional settings where per-sample throughput is the primary constraint. In all experiments we train by the objective
ℒ(θ, φ) = 𝔼t 𝔼u, ηt ∥ηt + Fθ(Atu + ηt, t)∥² + λ ℛ(φ),
with a schedule t ∈ [T] and M inner sampling steps per scale in the annealed Euler–Maruyama sampler. Unless stated otherwise, At = Id in unconditional tasks, and At is chosen as a fixed low-pass smoother in tasks where μ(Hμt) < 1 is expected at fine scales. We parameterize Ct(φ) in a Fourier/KL basis via the congruence Ct(φ) = C⋆1/2St(φ)C⋆1/2 with St(φ) diagonal and σ(St) ⊂ [m, M] enforced by projection after each optimizer step; this ensures a shared Cameron–Martin space and eliminates discretization-dependent admissibility failures.

We report (1) steps-to-tolerance: the smallest TM achieving a prescribed test metric threshold, (2) wall-clock time: end-to-end runtime including GRF sampling and operator evaluation at fixed hardware and batch size, and (3) resolution transfer: performance when training and sampling are carried out on different discretizations. For unconditional generative tasks we use sliced W2 (finite-dimensional projections), MMD with a Sobolev kernel, and domain-relevant summary statistics (spectral energy, structure functions). For conditional inverse problems we also report posterior predictive error, coverage of credible intervals, and effective sample size per second for downstream quantities of interest. Baselines include (i) fixed Ct ≡ C⋆ with identical Fθ architecture, (ii) hand-designed schedules (e.g. geometric scaling of C⋆), and (iii) classical function-space samplers (pCN and related Crank–Nicolson variants) where applicable.

Our first suite uses synthetic measures μ on H = L2(𝕋d) constructed as mixtures of Gaussian random fields with known and intentionally mismatched spectral structure. Concretely, we sample a latent label z ∈ {1, …, K} and then draw u ∼ 𝒩(0, Qz), where each Qz is diagonal in the Fourier basis with eigenvalues λk(Qz) concentrated on disjoint bands and with prescribed anisotropy ratios across modes. This setting is designed so that (i) the optimal preconditioner in the commuting Gaussian case is explicit, and (ii) the κ-proxy varies over orders of magnitude depending on whether C matches the dominant bands of {Qz}.

We train PL-DDO with a Fourier neural operator for Fθ(⋅, t) and compare learned Ct(φ) against C⋆ chosen as a Matérn-type prior with smooth spectrum. We observe that learned schedules shift mass toward the active frequency bands and, as t increases, progressively sharpen the spectrum while remaining within the admissible bounds [m, M]. The principal quantitative outcome is that the inner step count M required to reach a fixed sliced-W2 threshold decreases proportionally to the reduction in the empirically estimated condition proxy
$$ \widehat\kappa_t(C) = \frac{\widehat L_t(C)}{\widehat \alpha_t(C)}, \qquad \widehat L_t(C),\widehat\alpha_t(C) \ \text{estimated from finite differences along sampled directions.} $$
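A sketch of such a finite-difference estimator (the `drift` callable is a hypothetical stand-in; directions are random unit vectors on the active discretization):

```python
import numpy as np

def kappa_proxy_hat(drift, u_samples, rng, n_dirs=64, delta=1e-4):
    # Estimate L_t and alpha_t as the largest and smallest directional
    # difference quotients of the drift, then return their ratio.
    ratios = []
    for u in u_samples:
        for _ in range(n_dirs):
            d = rng.standard_normal(u.shape)
            d /= np.linalg.norm(d)
            ratios.append(np.linalg.norm(drift(u + delta * d) - drift(u)) / delta)
    ratios = np.array(ratios)
    return ratios.max() / ratios.min()   # kappa_hat = L_hat / alpha_hat
```

For a linear drift u ↦ −Au with A = diag(1, 4), the estimate approaches the exact condition number 4 as the number of sampled directions grows.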
In particular, at fixed T we find that replacing C⋆ by the learned Ct(φ) yields a consistent reduction in M across mixture weights and across different anisotropy ratios, while the fixed-covariance baseline requires small step sizes ht (stability dictated by the fastest mode) and therefore more inner steps.

To test discretization robustness we train on a coarse grid, but sample and evaluate on finer grids by drawing ηt ∼ 𝒩(0, Ct(φ)) directly at the evaluation resolution. Because Ct(φ) is defined through a basis-consistent spectral rule and Fθ is discretization-consistent, the step reduction persists with only minor degradation in summary statistics. We also repeat evaluation under random regridding (nonuniform sample points followed by interpolation to the operator input format) and find that the learned covariance remains effective; in contrast, hand-tuned schedules exhibit nontrivial sensitivity to the discretization change, typically manifesting as either overly rough samples (excess high-frequency power) or slow mixing in the low-frequency modes.

Our second suite targets non-Gaussian measures obtained as pushforwards of Gaussian initial conditions through a PDE flow. Let u0 ∼ 𝒩(0, Q) be a divergence-free random field and define u = 𝒢(u0), where 𝒢 maps initial vorticity to a time-τ state of 2D Navier–Stokes on 𝕋2 at moderate Reynolds number. The resulting μ exhibits intermittency, nontrivial phase correlations, and heavy-tailed pointwise statistics, hence constitutes a stringent test of whether covariance learning (a purely second-order geometric mechanism) can nevertheless accelerate sampling when combined with a learned drift.

We use two evaluation modes. In the matched-resolution mode we compare sample quality as a function of TM at a fixed resolution, reporting energy spectra E(k), vorticity kurtosis, and MMD. In the super-resolution mode we train Fθ on low-resolution observations of u but sample at a higher resolution; this probes whether the learned objects remain meaningful in the continuum limit. We employ a low-pass smoother At at early noise scales to ensure vt lies in Hμt while preserving the large-scale structures that govern long-range dependence.

Empirically, learning Ct(φ) yields a clear reduction in the number of inner steps required to match the spectral statistics: relative to C⋆, the learned schedule reallocates variance across modes in a manner consistent with the observed energy cascade, which in turn permits larger stable step sizes ht without destabilizing the high-frequency components. This translates to wall-clock improvements because each inner step includes both an operator evaluation and a GRF draw; decreasing M reduces both costs simultaneously. We emphasize that the measured acceleration is not an artifact of reduced accuracy: for each method we select TM to meet a fixed tolerance on a jointly chosen metric set (spectrum error, MMD, and a low-order moment vector), and then compare runtimes at that tolerance.

Resolution invariance is assessed by training on one discretization (with Fourier truncation n) and sampling on multiple larger truncations n′ > n. We observe that the learned Ct(φ) continues to provide the same qualitative spectral shaping across n′, and that the required M to achieve a fixed tolerance does not grow with n′ within the tested range. In contrast, when Ct is hand-scaled without learning, we see either (i) a need to retune the schedule as n increases or (ii) degradation in high-frequency statistics. These behaviors are consistent with the theoretical emphasis on continuum-defined operator inequalities and shared Cameron–Martin geometry.

Our third suite addresses conditional sampling for elliptic inverse problems, with the goal of demonstrating many-sample throughput when repeatedly conditioning on new data. We consider a log-permeability field u ∈ H and observations y = 𝒪(𝒮(u)) + ξ, where 𝒮 solves the Darcy PDE and 𝒪 extracts sparse pressure measurements. The target is the posterior μ(⋅ ∣ y), which is non-Gaussian and anisotropic, and whose geometry varies with y. We train a conditional drift Fθ(⋅, t; y) by augmenting the neural operator input with a representation of y and the observation locations, while still learning a covariance schedule Ct(φ) shared across conditions; this isolates the effect of learning a global preconditioner that improves mixing uniformly over y.

We compare against (i) pCN initialized from the prior, (ii) pCN with hand-tuned step size, and (iii) PL-DDO with fixed C. We report posterior predictive RMSE on held-out sensor locations, coverage of pointwise credible intervals, and effective sample size per second for low-dimensional quantities of interest (e.g. average flux through a boundary). The primary outcome is that learning Ct(φ) reduces the number of inner Langevin steps needed to reach stable posterior summaries, which directly increases throughput when many posterior samples are required per observation instance. In addition, the learned Ct(φ) decreases sensitivity to discretization: training on a moderate mesh and sampling on refined meshes yields comparable posterior summaries without retuning, provided that the PDE solver and the GRF sampler are consistent with the continuum operators.

Finally, we perform an ablation in which we freeze φ after pretraining on unconditional data and then finetune only θ conditionally. This separates "geometry learning" from "score fitting" and shows that a substantial fraction of the speedup persists, indicating that the covariance schedule captures reusable anisotropy common to the posterior family. Across all three suites, the aggregate evidence supports the central experimental claims: (i) covariance learning yields consistent step and wall-clock reductions at fixed accuracy, (ii) these gains are maintained across discretizations, and (iii) the method remains effective in conditional regimes where sampling throughput dominates training cost.


10. Discussion: covariance learning pitfalls (singularity, overfitting to discretization), connections to function-space MCMC and adaptive preconditioning, limitations and future directions (non-Gaussian references, manifold constraints).

Our central design choice is to learn a covariance schedule t ↦ Ct(φ) while keeping all definitions in the continuum Hilbert space H. This choice yields tangible acceleration in the regimes where preconditioning is known to control contraction, but it also introduces failure modes that do not arise when Ct is fixed in advance. We summarize the main pitfalls and the mechanisms we use to avoid them, then relate the method to function-space MCMC and adaptive preconditioning, and finally outline limitations and future directions.

In infinite dimensions, "covariance learning" is not an innocuous parametric extension: changing Ct changes the reference Gaussian measure μt = 𝒩(0, Ct), and in general two Gaussian measures on H can be mutually singular. Since our objective is formulated through the score term DHμtΦt with Φt = log(dνt/dμt), we require νt ≪ μt and a common Cameron–Martin geometry across t. The uniform equivalence constraint

m C⋆ ≼ Ct(φ) ≼ M C⋆

is precisely the condition that prevents "admissibility collapse": it ensures Im(Ct^{1/2}) = Im(C⋆^{1/2}) and rules out degenerate limits in which Ct loses rank in directions that matter to νt. Without such a constraint, stochastic gradient descent can drive some eigenvalues of Ct to 0 (to reduce injected noise and artificially simplify the denoising task) or to +∞ (to overwhelm the data term), and either behavior can render μt incompatible with νt in the Feldman–Hájek sense. In practice we therefore treat admissibility as a hard constraint (projection/clipping of spectra in a fixed basis) augmented by a soft regularizer ℛ(φ) that discourages repeated saturation at the bounds m and M. The projection is not merely numerical hygiene; it enforces the measure-theoretic prerequisites under which the continuum objective is well-defined.
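A minimal sketch of the hard projection and the soft saturation regularizer, assuming Ct and C⋆ are simultaneously diagonalized in a fixed basis so that the operator inequality reduces to eigenvalue clipping; the j^{-2} reference spectrum and the log-barrier form of ℛ are illustrative choices, not the exact ones used in training:

```python
import numpy as np

def project_admissible(lam, lam_star, m=0.5, M=2.0):
    """Hard-project the eigenvalues lam of C_t (in a basis diagonalizing
    C_star) into the admissible band [m*lam_star, M*lam_star], enforcing
    m*C_star <= C_t <= M*C_star."""
    return np.clip(lam, m * lam_star, M * lam_star)

def saturation_penalty(lam, lam_star, m=0.5, M=2.0, eps=1e-3):
    """Soft regularizer R(phi): a log-barrier discouraging eigenvalue
    ratios from sitting on the bounds m and M."""
    r = lam / lam_star                   # ratio to the reference spectrum
    r = np.clip(r, m + eps, M - eps)     # keep the barrier finite
    return -np.sum(np.log(r - m) + np.log(M - r))

# Trace-class reference spectrum, e.g. lam_star_j ~ j^{-2} (illustrative).
j = np.arange(1, 65)
lam_star = j ** -2.0
# A candidate spectrum that SGD has pushed out of the admissible band:
lam = lam_star * np.concatenate([np.full(32, 5.0), np.full(32, 0.01)])
lam_proj = project_admissible(lam, lam_star)
ratios = lam_proj / lam_star
penalty = saturation_penalty(lam_star, lam_star)
```

The clipping step is what prevents the degenerate limits described above; the barrier term only discourages the optimizer from relying on the clip at every iteration.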

Even under admissibility, learning Ct and Fθ jointly can admit spurious optima. The denoising target ηt is drawn from 𝒩(0, Ct(φ)), so the distribution of training pairs (vt, ηt) changes with φ. This creates a feedback loop: the optimizer may alter Ct to generate "easier" corruptions for the current Fθ, rather than covariances that are optimal for sampling. Two mitigations are important. First, we constrain the parameterization of Ct(φ) to be sufficiently low-dimensional and basis-consistent (e.g. diagonal or structured in a Fourier/KL basis), which limits the ability of Ct to fit idiosyncrasies of a finite dataset. Second, we interpret the learned schedule as a preconditioner rather than a direct model of the data covariance; accordingly, we evaluate it by downstream sampling performance at fixed accuracy rather than by training loss alone. This distinction matters because a lower denoising loss does not necessarily imply improved Langevin contraction if, for instance, Ct concentrates variance in directions where Fθ is inaccurate.

A more subtle pitfall is overfitting φ to the training discretization. If Ct(φ) is parameterized directly on a grid (e.g. as a dense matrix or pixel-wise kernel), then changing the resolution changes the parameterization itself, and the learned covariance can encode grid artifacts (aliasing, boundary effects, or anisotropies tied to coordinate axes). Such overfitting may even reduce the training objective while deteriorating continuum performance. Our continuum-first formulation is intended to preclude this failure mode: we choose Ct(φ) through operator rules that commute with refinement (Fourier multipliers, KL truncations consistent with a fixed basis, or SPDE-based covariances). Nevertheless, discretization overfitting can reappear if the chosen basis is not aligned with the domain geometry or if the implementation uses inconsistent quadrature/interpolation across grids. A practical diagnostic is to re-evaluate the learned schedule under multiple discretizations and regridding operators; a principled direction is to explicitly enforce discretization-consistency by constraining Ct(φ) to lie in a family with a well-defined continuum limit (e.g. pseudodifferential operators of fixed order, or Matérn-type SPDE covariances with learned coefficients).
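The refinement-commuting property is easy to see for a Fourier-multiplier parameterization: the multiplier is a function of the continuum frequency, so grids of different sizes simply evaluate the same rule at more modes. A sketch with an illustrative Matérn-type multiplier (the exponents, constants, and function names are hypothetical):

```python
import numpy as np

def matern_multiplier(freqs, kappa=4.0, alpha=2.0, sigma=1.0):
    """Fourier-multiplier covariance c(k) = sigma^2 (kappa^2 + |k|^2)^(-alpha):
    an SPDE/Matern-type rule defined on continuum frequencies, so the
    learned parameters transfer unchanged under grid refinement."""
    return sigma**2 * (kappa**2 + freqs**2) ** (-alpha)

def spectrum_on_grid(n):
    """Eigenvalues of the discretized covariance on a periodic grid of size n."""
    k = np.fft.fftfreq(n, d=1.0 / n)  # integer frequencies on this grid
    return k, matern_multiplier(k)

k_coarse, lam_coarse = spectrum_on_grid(64)
k_fine, lam_fine = spectrum_on_grid(128)

# Low modes present on both grids receive identical variances: refining
# the grid adds modes but does not alter existing ones.
shared = 16
err = max(
    abs(lam_coarse[k_coarse == m][0] - lam_fine[k_fine == m][0])
    for m in range(shared)
)
```

A grid-native parameterization (a dense matrix per resolution) has no analogous consistency check, which is exactly how discretization overfitting slips in.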

From the viewpoint of function-space MCMC, our sampler is a learned, annealed analogue of preconditioned Langevin dynamics. When Fθ is replaced by the exact gradient of a log-density and Ct ≡ C, one recovers the standard preconditioned Langevin SPDE studied in the Da Prato–Zabczyk setting. The insistence on a fixed Cameron–Martin space mirrors the design of pCN and related dimension-robust methods: pCN achieves mesh-invariant behavior precisely because its proposal preserves a reference Gaussian measure and avoids discretization-dependent drift terms. Our approach departs from classical pCN in two ways. First, we use a learned drift Fθ to approximate a score, thereby enabling non-Gaussian targets without explicit likelihood evaluations. Second, we allow the reference covariance to vary with t, but only within a uniformly equivalent family to preserve the function-space geometry needed for well-posedness and stability.
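A finite-dimensional caricature of the two updates, assuming a diagonal C and, for the Langevin step, the exact Gaussian score F(u) = -C^{-1}u so the preconditioned drift reduces to -u; all names, constants, and the two-mode "spectrum" are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-mode stand-in for the reference spectrum: C diagonal in this basis.
lam = np.array([1.0, 0.25])
C_half = np.diag(np.sqrt(lam))

def pcn_proposal(u, beta):
    """pCN: u' = sqrt(1 - beta^2) u + beta xi with xi ~ N(0, C).
    The proposal preserves the reference Gaussian N(0, C) exactly."""
    xi = C_half @ rng.standard_normal(2)
    return np.sqrt(1.0 - beta**2) * u + beta * xi

def precond_langevin_step(u, h):
    """Euler-Maruyama step of preconditioned Langevin for target N(0, C)
    with exact score F(u) = -C^{-1} u, so the drift C F(u) = -u."""
    xi = C_half @ rng.standard_normal(2)
    return (1.0 - h) * u + np.sqrt(2.0 * h) * xi

# The Langevin chain should equilibrate to variance ~ lam per mode.
u, h = np.zeros(2), 0.05
samples = np.empty((100000, 2))
for t in range(100000):
    u = precond_langevin_step(u, h)
    samples[t] = u
var = samples[10000:].var(axis=0)
u_pcn = pcn_proposal(u, beta=0.3)
```

Note the division of labor: pCN needs no drift at all, while the Langevin step carries the (learned or exact) score information and is the one our sampler anneals.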

There is also a clear connection to adaptive MCMC and quasi-Newton preconditioning, where one tunes a covariance online based on empirical chain statistics. In finite dimensions, covariance adaptation can improve mixing but must be controlled to preserve ergodicity (e.g. diminishing adaptation). In function spaces, naive covariance adaptation is particularly delicate because empirical covariances are typically finite-rank and therefore singular with respect to the ambient Gaussian reference. Our method can be viewed as an offline analogue: we learn Ct(φ) from data (and optionally from conditional contexts y) under explicit operator inequalities that preserve equivalence to C. This replaces online adaptation with a constrained optimization problem whose output can be reused across many sampling runs, a key advantage in conditional regimes where per-instance throughput dominates. A hybrid direction is to allow mild online adjustment within the admissible family (e.g. updating a small number of spectral parameters) while enforcing a diminishing adaptation schedule and maintaining mC ≼ Ct ≼ MC at all times.
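A hybrid update of this kind could look as follows; the polynomial decay exponent, the noisy stand-in for a mixing-proxy gradient, and the function names are illustrative assumptions, with the admissible band enforced by clipping after every step:

```python
import numpy as np

def adapt_spectrum(lam, grad, step0, t, lam_star, m=0.5, M=2.0):
    """One diminishing-adaptation update of a few spectral parameters of
    C_t: the step size decays like t^{-0.6}, so the total adaptation is
    summable in the sense used to preserve ergodicity, and the result is
    clipped back into the admissible band [m*lam_star, M*lam_star]."""
    step = step0 / (1.0 + t) ** 0.6
    lam_new = lam - step * grad
    return np.clip(lam_new, m * lam_star, M * lam_star)

lam_star = np.arange(1, 9) ** -2.0
lam = lam_star.copy()
rng = np.random.default_rng(2)
for t in range(200):
    # Stand-in for a stochastic gradient of a mixing proxy.
    noisy_grad = rng.standard_normal(8) * lam_star
    lam = adapt_spectrum(lam, noisy_grad, step0=0.1, t=t, lam_star=lam_star)

ratios = lam / lam_star
```

The clip guarantees that no matter how noisy the online signal is, the chain never leaves the uniformly equivalent family.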

A fundamental limitation is that our reference measures μt are Gaussian. While learning Ct can reshape second-order geometry, it cannot represent genuinely non-Gaussian reference structure (e.g. skewness, heavy tails, or nonlinear constraints). In highly non-log-concave or strongly multimodal settings, covariance learning may yield limited gains because the bottleneck is not a condition number but rather energy barriers and metastability. One possible extension is to replace μt by a non-Gaussian reference obtained via a learned transport map Tt applied to a Gaussian base, leading to μt = (Tt)#𝒩(0, C). In infinite dimensions, however, constructing Tt with sufficient regularity and ensuring absolute continuity properties (and tractable computation of the analogue of DHμtΦt) is nontrivial. Any such extension would require a careful replacement of Cameron–Martin calculus by an appropriate differential structure on the transported measure.

Many targets of interest live on subsets of H or satisfy hard constraints: divergence-free vector fields, positivity constraints, unit-norm constraints, or conservation laws. A Gaussian reference on H is mismatched to such geometry, and unconstrained Langevin updates can violate constraints unless corrected. One direction is to incorporate constraints through projections (projected Langevin) or by formulating dynamics on a manifold (Riemannian Langevin in function spaces), where the preconditioner becomes a position-dependent metric. Learning a state-dependent covariance Ct(u) would considerably increase expressivity, but it would also raise substantial well-posedness questions for the resulting multiplicative-noise SPDE and for discretization-invariant implementation.

Beyond the extensions above, several theoretical and practical questions remain. On the theory side, it is natural to seek guarantees under weaker conditions than global Lipschitz/dissipativity, as many score models are only locally regular and many data measures induce drifts with polynomial growth. It is also of interest to quantify how covariance learning interacts with drift approximation error δ: while preconditioning can improve contraction, it may also amplify directions where Fθ is inaccurate, suggesting that joint learning should balance mixing proxies with approximation stability. On the practical side, one may enforce additional structure on t ↦ Ct(φ) (e.g. monotonicity of certain spectral quantities) to stabilize training, and one may co-design At with Ct when μ(Hμt) < 1 is expected. Finally, connecting covariance learning to bridge-based objectives (e.g. Schrödinger bridges) may yield alternative training criteria in which the role of Ct is explicit as an entropic regularization parameter, potentially clarifying when and why learned preconditioning provides the most benefit.

In summary, covariance learning is powerful precisely because it alters the geometry of the sampling problem; for the same reason it must be constrained at the level of operators and measures. The uniform-equivalence framework provides a principled safety rail: it prevents singularity, supports discretization-invariant implementation, and places learned preconditioning on the same function-space footing as classical dimension-robust MCMC, while still leaving room for meaningful acceleration within an admissible family.