Many contemporary generative models for scientific fields are trained and deployed across a wide range of spatial resolutions. In such settings, it is natural to seek a continuum description: the target law is a Borel probability measure on an infinite-dimensional state space, training data are evaluations of random fields on arbitrary grids, and the learned model is an operator that can be queried consistently on new discretizations. Neural operator architectures have largely resolved the representational side of this objective: one can parameterize drifts, scores, or transport maps in a manner that is stable under refinement and supports evaluation on meshes unseen during training. The remaining obstacle is efficiency. Even when evaluation of the learned operator is nearly grid-independent in complexity (up to FFT or spectral costs), the end-to-end cost of sampling is dominated by the number of iterative updates required to reach a prescribed accuracy in distribution. In other words, once expressivity and resolution invariance are in place, the practical bottleneck becomes the mixing time of the sampling dynamics rather than the cost of a single model evaluation.
We focus on annealed Langevin-type samplers in Hilbert spaces, where the generative procedure is implemented by a sequence of stochastic steps driven by a learned drift (or score) at a range of noise levels. For such samplers, it is well understood in finite dimensions that performance is governed by geometry: the drift may be accurate, yet progress is throttled along poorly conditioned directions. This phenomenon is not an artifact of discretization. In function space, anisotropy is typically more severe: high- and low-frequency components have vastly different natural scales, and naive isotropic noise schedules can force the algorithm to take step sizes constrained by the stiffest modes. Consequently, the number of iterations required to achieve a fixed accuracy can become large, even if each iteration is resolution-invariant. From a continuum standpoint, this is precisely a preconditioning problem, and it suggests that learning only the drift is insufficient: to obtain genuine speedups, one must also adapt the covariance structure of the injected noise (equivalently, the reference Gaussian measure) that defines the geometry of the score and of the Langevin dynamics.
Our central premise is that the covariance schedule is the next essential degree of freedom once operator learning has provided resolution-invariant drift evaluation. We therefore propose to learn (i) a discretization-consistent drift operator and (ii) a family of trace-class covariance operators that act as learned preconditioners across annealing scales. The covariances are not treated as incidental hyperparameters but as trainable objects constrained to remain in an admissible set ensuring well-posedness in the infinite-dimensional setting. This changes the geometry in which the score is computed and in which the Langevin dynamics evolve, while preserving the underlying continuum formulation and enabling efficient sampling of the driving Gaussian random fields through structured representations (e.g. Fourier-diagonal or low-rank parameterizations).
Three considerations make this joint learning nontrivial in Hilbert space. First, Gaussian measures in infinite dimensions are mutually singular unless their covariances satisfy stringent equivalence conditions; the score is naturally defined along a Cameron–Martin space that depends on the reference measure. Second, the injected noise must be trace-class to define a bona fide stochastic evolution on the ambient space. Third, to retain discretization-independent guarantees, all bounds must be stable under Galerkin projection and must not degrade with dimension. These constraints motivate parameterizations that enforce uniform operator inequalities relative to a fixed baseline covariance. Under such constraints, the reference measures share a common Cameron–Martin space, the denoising objective is well-defined, and the resulting Langevin dynamics admit a continuum interpretation whose discretizations inherit dimension-free stability properties.
The contributions of this work are organized around this thesis.
We formulate a denoising score matching objective in a real separable Hilbert space in which the corruption noise is Gaussian with a trace-class covariance at each annealing scale. The learned drift targets a preconditioned score drift associated with the Radon–Nikodym derivative of the perturbed law with respect to the reference Gaussian measure. This formulation cleanly separates (a) the data distribution and its perturbations from (b) the choice of reference geometry, making explicit how covariance learning influences both the training signal and the sampling dynamics. Importantly, the objective is defined in function space; discretizations are only numerical surrogates used to approximate expectations and operator evaluations.
We introduce structured covariance parameterizations and regularizers that preserve trace-classness, positivity, and uniform equivalence to a baseline covariance. These admissibility conditions ensure that the family of Gaussian references remains mutually absolutely continuous and that the relevant Cameron–Martin derivatives are taken along a fixed subspace. On the sampling side, the same conditions support existence and uniqueness of solutions to the corresponding Langevin-type stochastic evolution (viewed as an SPDE or as an annealed sequence of such dynamics). This yields a principled foundation for learning preconditioners in infinite dimensions without silently changing the underlying measure-theoretic setting from scale to scale.
We provide an accuracy bound for the distribution produced by an annealed Euler–Maruyama sampler driven by the learned drift and the learned covariance schedule. The bound decomposes into three terms: a drift approximation component controlled by the mean-squared error of the learned drift under the intermediate perturbed laws, a discretization component controlled by the step sizes, and an annealing component capturing the mismatch induced by the finite noise schedule. Crucially, the constants depend on continuum quantities (such as Lipschitz/dissipativity parameters and uniform equivalence bounds for the covariances) and not on the discretization dimension. This establishes that, in the contractive regime, one can target a fixed Wasserstein accuracy with a number of sampling steps that does not grow with resolution, provided the drift approximation error and integrator stability are controlled uniformly.
In strongly log-concave regimes, the convergence rate of preconditioned Langevin dynamics is governed by a condition-number-like quantity in the metric induced by the covariance. We show that optimizing the covariance within an admissible family provably improves an upper bound on mixing time by reducing this proxy. Moreover, the improvement is not merely an artifact of the analysis: within the same covariance family and under first-order access, one cannot in general do better than the corresponding condition-number dependence. Thus, covariance learning is not only beneficial but, within structured families, essentially the correct and information-theoretically necessary lever for accelerating sampling. This perspective aligns with finite-dimensional preconditioning intuition while operating in a function-space framework in which the relevant geometry is encoded by operators rather than matrices.
From an algorithmic standpoint, the proposed method retains the operational simplicity of denoising-based training while adding a learnable preconditioning schedule that can be implemented efficiently in spectral bases. The resulting sampler trades expensive per-iteration costs for fewer iterations by improving contraction along difficult directions. Since the covariances are learned in the same continuum-first setting as the drift, the procedure is compatible with variable discretizations at both training and sampling time. In particular, one can exploit fast Gaussian random field sampling routines (FFT/KL) while maintaining the invariance properties expected of neural operator models.
The remainder of the paper develops these ideas systematically. We first recall the denoising drift–diffusion operator (DDO) framework in Hilbert space, emphasizing Gaussian reference measures, Cameron–Martin derivatives, and the function-space Langevin dynamics that underpin annealed sampling. We then introduce admissible covariance parameterizations, establish well-posedness and measure equivalence, and derive dimension-free error bounds and covariance-induced acceleration guarantees.
We briefly recall the drift–diffusion operator (DDO) viewpoint for denoising-based learning in a real separable Hilbert space H, emphasizing the role of Gaussian reference measures, Cameron–Martin derivatives, and the resulting function-space Langevin dynamics. Throughout, we treat all objects at the continuum level; discretizations enter only as numerical approximations of the operators and expectations defined below.
Let $C$ be a trace-class, positive, self-adjoint operator on $H$, and let $\mu_C := \mathcal N(0, C)$ denote the centered Gaussian Borel probability measure on $H$. The associated Cameron–Martin space is
$$
H_{\mu_C} := \mathrm{Im}\,(C^{1/2}) \subseteq H,
\qquad
\langle h_1, h_2\rangle_{H_{\mu_C}} := \langle C^{-1/2} h_1,\, C^{-1/2} h_2\rangle_H,
$$
where $C^{-1/2}$ is understood on $\mathrm{Im}\,(C^{1/2})$. We use the standard Cameron–Martin theorem: for any $h \in H_{\mu_C}$, the shifted measure $\mu_C^h := \mathcal N(h, C)$ is equivalent to $\mu_C$, with
$$
\frac{d\mu_C^h}{d\mu_C}(v)
=
\exp\Big( \langle C^{-1} h, v\rangle_H - \tfrac12 \|h\|_{H_{\mu_C}}^2 \Big),
\qquad \mu_C\text{-a.s.}
$$
In contrast, if $h \notin H_{\mu_C}$, then $\mu_C^h \perp \mu_C$. This dichotomy is the basic measure-theoretic reason that, in infinite dimensions, the ``score'' must be understood relative to a reference Gaussian and along $H_{\mu_C}$ rather than along the full space $H$.
Given a measurable function $\Phi : H \to \mathbb R$ and a point $u \in H$, we say that $\Phi$ is (Fréchet) differentiable along $H_{\mu_C}$ at $u$ if there exists a continuous linear functional $D_{H_{\mu_C}}\Phi(u) \in H_{\mu_C}^*$ such that
$$
\Phi(u + h) = \Phi(u) + D_{H_{\mu_C}}\Phi(u)[h] + o(\|h\|_{H_{\mu_C}}),
\qquad h \in H_{\mu_C}.
$$
We identify $H_{\mu_C}^*$ with $H_{\mu_C}$ via the Riesz isomorphism on $(H_{\mu_C}, \langle\cdot,\cdot\rangle_{H_{\mu_C}})$: for $\ell \in H_{\mu_C}^*$, we write $R\ell \in H_{\mu_C}$ for the unique element satisfying $\ell[h] = \langle R\ell, h\rangle_{H_{\mu_C}}$. Concretely, when $\ell[h] = \langle g, h\rangle_{H_{\mu_C}}$, then $R\ell = g$, and note the identity $\langle g, h\rangle_{H_{\mu_C}} = \langle C^{-1} g, h\rangle_H$.
Fix a scale $t$ and abbreviate $C_t := C_t(\varphi)$, $\mu_t := \mathcal N(0, C_t)$. We consider perturbations of the form
$$
v_t = A_t u + \eta_t,
\qquad u \sim \mu,\quad \eta_t \sim \mathcal N(0, C_t) \text{ independent},
$$
where either $A_t = \mathrm{Id}$ and $\mu(H_{\mu_t}) = 1$, or $A_t$ is a bounded linear smoother with $A_t(H) \subseteq H_{\mu_t}$. Let $\nu_t$ denote the law of $v_t$. Conditional on $u$, the random variable $v_t$ has law $\mathcal N(A_t u, C_t)$, i.e. a Cameron–Martin shift of $\mu_t$ by $A_t u$. Under the standing assumption that $A_t u \in H_{\mu_t}$ almost surely (either directly or by smoothing), the Cameron–Martin theorem yields the conditional Radon–Nikodym derivative
$$
\frac{d\mathcal N(A_t u, C_t)}{d\mu_t}(v)
=
\exp\Big( \langle C_t^{-1}A_t u, v\rangle_H - \tfrac12 \|A_t
u\|_{H_{\mu_t}}^2 \Big),
\qquad \mu_t\text{-a.s.}
$$
Averaging over $u$ shows $\nu_t \ll \mu_t$ with
$$
\frac{d\nu_t}{d\mu_t}(v)
=
\mathbb E_{u \sim \mu}\Big[ \exp\Big( \langle C_t^{-1} A_t u, v\rangle_H - \tfrac12 \|A_t u\|_{H_{\mu_t}}^2 \Big) \Big].
$$
We write $\Phi_t(v) := \log\big(\frac{d\nu_t}{d\mu_t}(v)\big)$, defined $\nu_t$-a.s. (and $\mu_t$-a.s. on the event where the right-hand side is finite and nonzero).
The DDO identity is the observation that the ``preconditioned score'' $R\,D_{H_{\mu_t}}\Phi_t$ coincides with a conditional mean of the latent clean component. Formally differentiating in a Cameron–Martin direction $h \in H_{\mu_t}$ gives
$$
D_{H_{\mu_t}}\Phi_t(v)[h]
=
\frac{\mathbb E\big[ \exp(\cdots)\,\langle C_t^{-1}A_t u,
h\rangle_H\big]}{\mathbb E\big[\exp(\cdots)\big]}
=
\Big\langle C_t^{-1}\,\mathbb E\big[A_t u \mid v_t = v\big],\,
h\Big\rangle_H
=
\Big\langle \mathbb E\big[A_t u \mid v_t = v\big],\,
h\Big\rangle_{H_{\mu_t}},
$$
where $\exp(\cdots)$ denotes the exponential weight in the Radon–Nikodym derivative above, and the conditional expectation is taken with respect to the posterior induced by the joint law of $(u, v_t)$. By the Riesz identification in $H_{\mu_t}$, we obtain the pointwise relation
$$
R\,D_{H_{\mu_t}}\Phi_t(v) = \mathbb E\big[A_t u \mid v_t = v\big].
$$
Consequently, the preconditioned drift field
$$
F(v, t; \varphi) := -v + R\,D_{H_{\mu_t}}\Phi_t(v)
$$
admits the equivalent denoising form
$$
F(v, t; \varphi) = -\,\mathbb E\big[\eta_t \mid v_t = v\big],
$$
since $v_t = A_t u + \eta_t$. This is precisely the identity that makes denoising objectives compatible with learning the drift driving Langevin-type sampling.
Consider the mean-squared denoising functional at scale $t$,
$$
\mathcal J_t(G) := \mathbb E_{u \sim \mu,\ \eta_t \sim \mathcal N(0, C_t)}\big\| \eta_t + G(A_t u + \eta_t) \big\|_H^2,
$$
defined for measurable $G : H \to H$ with $\mathbb E\|G(v_t)\|_H^2 < \infty$. By the Hilbert-space orthogonality principle (conditional expectation as $L^2$-projection), the minimizer satisfies
$$
G^*(v) = -\,\mathbb E\big[\eta_t \mid v_t = v\big],
\qquad \nu_t\text{-a.s.}
$$
Comparing with the denoising form of the preconditioned drift shows $G^*(\cdot) = F(\cdot, t; \varphi)$. Thus, training a parametrized operator $F_\theta(\cdot, t)$ by minimizing $\mathbb E\|\eta_t + F_\theta(v_t, t)\|_H^2$ targets the drift without requiring explicit access to $\Phi_t$ or its derivative. The role of the Gaussian reference is implicit but essential: it determines the Cameron–Martin geometry in which $\Phi_t$ is differentiable and in which the identity holds.
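The orthogonality argument above can be checked numerically in the simplest scalar Gaussian case. The following sketch (plain numpy; the variances `su2`, `se2` and all variable names are illustrative, not from the paper) fits the best linear denoiser by least squares and compares it with the closed-form conditional mean $-\mathbb E[\eta \mid v]$:

```python
import numpy as np

# Scalar toy model: u ~ N(0, su2), eta ~ N(0, se2), v = u + eta (A_t = Id).
# The minimizer of E|eta + G(v)|^2 is G*(v) = -E[eta | v] = -(se2/(su2+se2)) v,
# so fitting the best linear G by least squares should recover that slope.
rng = np.random.default_rng(0)
su2, se2 = 2.0, 0.5
n = 200_000
u = rng.normal(0.0, np.sqrt(su2), n)
eta = rng.normal(0.0, np.sqrt(se2), n)
v = u + eta

# Best linear fit G(v) = a*v minimizing E|eta + a*v|^2  =>  a = -E[eta v]/E[v^2].
a_hat = -np.mean(eta * v) / np.mean(v * v)
a_star = -se2 / (su2 + se2)   # closed-form conditional-mean slope
print(a_hat, a_star)
```

In higher dimensions the same least-squares structure holds, with the scalar slope replaced by an operator acting along the Cameron–Martin space.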
The drift is designed so that $\nu_t$ is invariant for a preconditioned Langevin evolution on $H$. Let $W^{C_t}$ be a $C_t$-Wiener process on $H$ (well-defined since $C_t$ is trace-class). The corresponding Langevin SPDE reads
$$
dU_s = \big( -U_s + R\,D_{H_{\mu_t}}\Phi_t(U_s) \big)\, ds + \sqrt{2}\, dW^{C_t}_s,
$$
where $t$ is fixed. Under standard dissipativity and Lipschitz hypotheses on $F(\cdot, t; \varphi)$, this SPDE is well posed and admits $\nu_t$ as an invariant law; heuristically, the term $-U_s$ represents the Ornstein–Uhlenbeck pullback associated with $\mu_t$, while $R\,D_{H_{\mu_t}}\Phi_t$ corrects the dynamics to target $\nu_t$ rather than $\mu_t$.
Annealed Langevin sampling uses a finite sequence of scales $t = 1, \dots, T$ and runs for a short time at each scale, with $\nu_t$ serving as an intermediate bridge between an initial Gaussian-like law and the target-like law at the final scale. In computation one typically replaces the SPDE by an Euler–Maruyama discretization: for a step size $h_t > 0$ and iterates $\{U_n\} \subset H$,
$$
U_{n+1} = U_n + h_t F_\theta(U_n,t) + \sqrt{2h_t}\,\xi_n,
\qquad
\xi_n \sim \mathcal N(0,C_t),
$$
where the learned field $F_\theta(\cdot, t)$ approximates $F(\cdot, t; \varphi)$ and the Gaussian increments $\xi_n$ implement the $C_t$-covariance noise. The continuum interpretation is that discretizations of the SPDE inherit their stability and convergence properties from the Hilbert-space formulation, provided the covariance is trace-class and the drift is controlled in the ambient norm.
We consider a continuum-first generative learning problem on a real separable Hilbert space $H$. Our data are i.i.d. samples $\{u_i\}_{i=1}^N \subset H$ drawn from an unknown target law $\mu$ on $(H, \mathcal B(H))$. The goal is to construct, from these samples, a discretization-consistent sampler whose output law $\hat\mu$ approximates $\mu$ in a transport metric (typically $W_2$), while keeping the number of drift evaluations and Gaussian random field (GRF) draws bounded independently of the numerical resolution used in computation.
Our approach is based on a multi-scale perturbation scheme indexed by
a finite set I = [T] = {1, …, T}.
At each scale t ∈ I
we introduce a centered Gaussian reference measure
μt = 𝒩(0, Ct(φ)),
where Ct(φ)
is a learnable covariance operator depending on parameters φ. Unlike fixed-noise denoising
formulations, we treat {Ct(φ)}t ∈ I
as part of the model to be learned jointly with the drift (or score)
operator, with the intent of shaping the geometry of the intermediate
distributions {νt} in a way
that improves mixing and reduces the number of sampling steps.
The operator inequalities constitute a uniform equivalence condition: they prevent degeneracy of the covariance at any scale and, crucially, imply that all Gaussian references $\{\mu_t\}$ share the same Cameron–Martin space. Indeed, by standard results on equivalent Gaussian measures, the uniform equivalence condition ensures
$$
H_{\mu_t} = \mathrm{Im}\,\big(C_t(\varphi)^{1/2}\big) = \mathrm{Im}\,\big(C_\star^{1/2}\big)
\quad \text{for all } t \in I,
$$
with equivalence of the corresponding Cameron–Martin norms up to constants depending only on $m$ and $M$. This shared Cameron–Martin structure is a primary technical requirement: it allows us to interpret Cameron–Martin derivatives, preconditioned scores, and stability estimates uniformly in $t$, and it prevents the learning problem from inadvertently moving between mutually singular reference geometries across scales.
The role of (A)–(B) is measure-theoretic rather than aesthetic. Conditional on u, the law of vt is 𝒩(Atu, Ct(φ)). In infinite dimensions, Gaussian shifts by vectors outside the Cameron–Martin space produce measures that are singular with respect to μt. Thus, (A) or (B) is needed to ensure Atu ∈ Hμt almost surely, so that 𝒩(Atu, Ct(φ)) is a Cameron–Martin shift of μt for μ-almost every u, and the mixture law νt of vt is absolutely continuous with respect to μt. In practice, (B) is often the relevant regime: a mild smoothing At can be used to reconcile rough data laws μ with the Cameron–Martin regularity induced by trace-class Ct(φ).
We write νt := ℒ(vt) for the marginal law of the corrupted sample at scale t. The intermediate measures {νt} serve two purposes. First, they provide the training distributions for denoising-based regression of the drift field at each t. Second, they define an annealing path from an initial, noise-dominated distribution to a final, data-dominated distribution, to be traversed by a Langevin-type sampler.
At each $t$, the quantity of interest is the preconditioned score drift $F(\cdot, t; \varphi)$ associated with $(\nu_t, \mu_t)$, defined (as in the preceding section) through the log-density $\Phi_t = \log(d\nu_t / d\mu_t)$ by
$$
F(u, t; \varphi) = -u + R\,D_{H_{\mu_t}}\Phi_t(u).
$$
This drift depends on $\varphi$ both explicitly through $C_t(\varphi)$ (hence through $\mu_t$ and the Cameron–Martin geometry) and implicitly through the induced corrupted law $\nu_t$. We approximate $F(\cdot, t; \varphi)$ by a parametrized, discretization-consistent map $F_\theta(\cdot, t) : H \to H$, typically realized as a neural operator so that evaluation on different spatial grids corresponds to the same underlying continuum mapping.
The learning problem is therefore genuinely coupled: changing φ modifies the target of the regression problem for Fθ, while changing θ modifies the quality of the sampler that will ultimately be used with the covariances Ct(φ). Our objective is to choose (θ, φ) so that (i) Fθ(⋅, t) closely matches F(⋅, t; φ) in L2(νt; H) uniformly over t, and (ii) the resulting annealed Langevin dynamics, using {Ct(φ)} as its diffusion covariances, mixes rapidly at each scale. The admissibility constraints above are simultaneously a well-posedness requirement (trace-class noise and shared Cameron–Martin space) and a stabilization mechanism preventing the learned covariance schedule from collapsing onto low-dimensional or excessively stiff geometries.
Although all definitions are posed on H, training and sampling are implemented through finite-dimensional discretizations (e.g. Galerkin projections, grid evaluations, spectral truncations). We assume that the dataset can be queried on arbitrary discretizations and that, at sampling time, the learned maps Fθ(⋅, t) and covariance samplers for Ct(φ) can be evaluated at the same discretization without re-training. This motivates the explicit separation between (i) continuum objects μ, νt, μt, F(⋅, t; φ), and (ii) their numerical approximations. The uniform equivalence plays a key role here: it provides a scale- and parameter-uniform control of norms and regularity in the underlying function space, which is precisely what is needed to formulate discretization-robust error bounds for the induced samplers.
The remainder of the paper makes this problem setup constructive by specifying (a) an explicit joint training objective for (θ, φ) based on denoising, and (b) structured parameterizations of {Ct(φ)} that enforce admissibility while permitting efficient GRF sampling and differentiation with respect to φ.
Our training criterion is a preconditioned Langevin–denoising
objective in which the covariance schedule is treated as a learnable
component of the model. For each scale t ∈ I, recall that we
observe corrupted samples vt = Atu + ηt
with ηt ∼ 𝒩(0, Ct(φ)),
and we seek a drift surrogate Fθ(⋅, t)
approximating the preconditioned score drift F(⋅, t; φ)
associated with (νt, μt).
The empirical counterpart of this goal is captured by the population risk
$$
\mathcal L(\theta, \varphi)
=
\sum_{t \in I} \mathbb E_{u \sim \mu,\ \eta_t \sim \mathcal N(0, C_t(\varphi))}
\big\| \eta_t + F_\theta(A_t u + \eta_t, t) \big\|_H^2
\;+\; \mathcal R(\varphi),
$$
where $\mathcal R$ is a regularizer enforcing admissibility and additional structure. The appearance of $\eta_t$ (rather than an explicit score) is deliberate: it yields a regression problem with a readily sampled target and admits an implementation by reparameterized Gaussian draws. Formally, under the standing assumptions ensuring $\nu_t \ll \mu_t$ and the existence of $D_{H_{\mu_t}}\Phi_t$,
the minimizer over sufficiently rich function classes satisfies the
usual denoising identity: at fixed t, the optimal drift field in L2(νt; H)
coincides with the preconditioned score drift F(⋅, t; φ). Hence
the dependence of ℒ on φ is essential rather than
ancillary: changing φ changes
the reference μt, the
Cameron–Martin geometry, and the induced corrupted law νt, and
therefore changes the regression target itself.
We enforce the operator inequalities $m\,C_\star \preceq C_t(\varphi) \preceq M\,C_\star$ by adopting structured families in which admissibility is built in at the continuum level. A convenient template (used repeatedly below) is the congruence form
$$
C_t(\varphi) = C_\star^{1/2}\, S_t(\varphi)\, C_\star^{1/2},
$$
where $S_t(\varphi)$ is bounded, self-adjoint, positive, and satisfies $\sigma(S_t(\varphi)) \subseteq [m, M]$ for all $t$. Then $C_t(\varphi)$ is trace-class and automatically satisfies the operator inequalities; moreover, the corresponding Cameron–Martin spaces coincide with $\mathrm{Im}\,(C_\star^{1/2})$, and all $t$-dependent norms are equivalent with constants depending only on $m, M$. In what follows we therefore focus on designing $S_t(\varphi)$ with (i) spectrum control, (ii) efficient sampling of $\eta_t$, and (iii) differentiability with respect to $\varphi$ as required by gradient-based training.
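As a sanity check on the congruence template, the following sketch (numpy, with an illustrative 4-mode truncation; all names are ours) builds $C = C_\star^{1/2} S\, C_\star^{1/2}$ with $\sigma(S) \subseteq [m, M]$ and verifies the operator inequalities $m\,C_\star \preceq C \preceq M\,C_\star$ by testing positive semidefiniteness of the two differences:

```python
import numpy as np

# Sketch: verify that the congruence form C = Cstar^{1/2} S Cstar^{1/2}, with
# spec(S) in [m, M], yields m*Cstar <= C <= M*Cstar in the operator order.
rng = np.random.default_rng(1)
m, M = 0.5, 2.0
lam = np.array([1.0, 0.5, 0.25, 0.125])          # trace-class baseline spectrum
Cstar = np.diag(lam)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))     # random orthogonal basis
S = Q @ np.diag(rng.uniform(m, M, 4)) @ Q.T      # self-adjoint, spec(S) in [m, M]
root = np.diag(np.sqrt(lam))                     # Cstar^{1/2}
C = root @ S @ root

# The inequalities hold iff C - m*Cstar and M*Cstar - C are both PSD.
lo = np.linalg.eigvalsh(C - m * Cstar).min()
hi = np.linalg.eigvalsh(M * Cstar - C).min()
print(lo >= -1e-12, hi >= -1e-12)
```

The check succeeds for any such $S$ because $C - m\,C_\star = C_\star^{1/2}(S - mI)\,C_\star^{1/2}$ is a congruence of a positive semidefinite operator, and likewise for the upper bound.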
Assume that $C_\star$ is diagonal in a known orthonormal basis $\{e_k\}_{k \ge 1} \subset H$ (e.g. the Fourier basis on a periodic domain), with $C_\star e_k = \lambda_k^\star e_k$, $\sum_k \lambda_k^\star < \infty$. A particularly efficient family is the diagonal schedule
$$
C_t(\varphi)\, e_k = s_{t,k}\, \lambda_k^\star\, e_k,
\qquad s_{t,k} \in [m, M].
$$
In this case GRF sampling is trivial: if $\{\xi_k\}_{k \ge 1}$ are i.i.d. standard normals,
$$
\eta_t = \sum_{k \ge 1} \sqrt{s_{t,k}\, \lambda_k^\star}\; \xi_k\, e_k,
$$
which is implemented by truncating to the active discretization modes and using FFTs for basis transforms when needed. To parameterize the spectral multipliers while respecting bounds, we may set
$$
s_{t,k} = m + (M - m)\, \sigma(\alpha_{t,k}),
$$
with unconstrained parameters $\alpha_{t,k} \in \mathbb R$ and a smooth squashing map $\sigma : \mathbb R \to (0, 1)$ (e.g. logistic). This yields exact satisfaction of the operator inequalities and avoids nondifferentiable projection. Additionally, we can impose mild schedule regularity across $t$ by penalizing discrete variations,
$$
\sum_{t=1}^{T-1} \sum_{k \in \mathcal K} \big( s_{t+1,k} - s_{t,k} \big)^2,
$$
for the set $\mathcal K$ of represented modes; this reduces spurious oscillations of the geometry between adjacent annealing scales.
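The diagonal construction can be sketched in a few lines of numpy; the baseline spectrum `lam_star` and all parameter values are illustrative assumptions, not choices made in the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Diagonal schedule sketch: s_{t,k} = m + (M - m)*sigmoid(alpha_{t,k}), so the
# multipliers lie strictly inside (m, M) and the operator bounds hold exactly.
rng = np.random.default_rng(2)
m, M = 0.5, 2.0
K = 64
lam_star = 1.0 / (1.0 + np.arange(K)) ** 2       # summable baseline spectrum
alpha = rng.normal(size=K)                       # unconstrained parameters
s = m + (M - m) * sigmoid(alpha)                 # spectral multipliers
lam_t = s * lam_star                             # eigenvalues of C_t

# GRF sampling is diagonal: eta = sum_k sqrt(s_k lambda_k^star) xi_k e_k.
xi = rng.normal(size=K)
eta_coeffs = np.sqrt(lam_t) * xi                 # coefficients of one draw

trace_t = lam_t.sum()                            # Tr(C_t) <= M * Tr(Cstar)
print(s.min() > m, s.max() < M, trace_t <= M * lam_star.sum())
```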
Purely diagonal schedules cannot capture correlations across modes induced by nonstationarity or global constraints in the data. To enlarge expressivity while preserving tractability, we may augment a diagonal $S_t$ by a finite-rank term:
$$
S_t = D_t + U_t^* U_t,
$$
where $D_t$ is diagonal in the reference basis and $U_t : H \to \mathbb R^r$ (equivalently $U_t^* : \mathbb R^r \to H$) has rank $r \ll$ (active dimension). Exact spectral enforcement $\sigma(S_t) \subseteq [m, M]$ for this family can be achieved by construction only in special cases; in practice we therefore combine (i) a base diagonal component parameterized as in the diagonal family above, with strict bounds $D_t \succeq m I$ and $\|D_t\| \le M$, and (ii) a soft spectral penalty controlling departures from $[m, M]$ on the active discretization:
$$
\sum_{t \in I} \Big[ \big( \lambda_{\min}(S_t) - m \big)_-^2 + \big( \lambda_{\max}(S_t) - M \big)_+^2 \Big],
$$
where $(x)_\pm = \max\{\pm x, 0\}$ and $\lambda_{\min}, \lambda_{\max}$ are computed on the finite-dimensional truncation. This empirical enforcement is consistent with the continuum intent when the parameterization is resolution-invariant and the rank $r$ is fixed as resolution increases. Sampling from $\mathcal N(0, C_t)$ with a diagonal-plus-low-rank $S_t$ can be done by drawing a diagonal component plus a low-rank component, using a Woodbury-type factorization on the discretization; gradients are correspondingly efficient since all dependence on $\varphi$ passes through diagonal scalings and $r$-dimensional linear algebra.
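The factorized sampling step can be made concrete as follows (numpy sketch; the dimensions `n`, `r` and the scaling of the low-rank factor are illustrative). Writing the rectangular factor $A = [D^{1/2} \mid V]$ with $V = U_t^*$, a single matrix–vector product yields a draw whose covariance is exactly $D + V V^\top$:

```python
import numpy as np

# Sketch of sampling N(0, C) with C = D + V V^T (V = U_t^*, rank r), on an
# active discretization with n modes. Drawing eta = D^{1/2} xi1 + V xi2 with
# independent standard normals realizes exactly this covariance, since
# Cov(eta) = D^{1/2} I D^{1/2} + V I V^T = D + V V^T.
rng = np.random.default_rng(3)
n, r = 8, 2
d = rng.uniform(0.5, 2.0, n)                     # diagonal part D
V = 0.1 * rng.normal(size=(n, r))                # low-rank factor
C = np.diag(d) + V @ V.T

# The sampling map is the rectangular factor A = [D^{1/2} | V]: Cov = A A^T.
A = np.hstack([np.diag(np.sqrt(d)), V])
xi = rng.normal(size=n + r)
eta = A @ xi                                     # one draw from N(0, C)

print(np.allclose(A @ A.T, C))
```

The same rectangular factor underlies Woodbury-type identities when inverses or log-determinants of $C$ are needed, at $O(n r^2)$ cost.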
On domains where Fourier diagonalization is not appropriate (nonperiodic boundaries, manifolds, irregular geometries), we may define $C_t(\varphi)$ implicitly through an elliptic SPDE. A canonical choice is a Matérn-type precision operator $L_t(\varphi)$ (self-adjoint, positive) with
$$
C_t(\varphi) = L_t(\varphi)^{-1},
\qquad
L_t(\varphi) = \tau_t\, \big( \kappa_t^2 I - \Delta \big)^{p},
$$
with hyperparameters $\varphi = (\tau_t, \kappa_t)_{t \in I}$ and fixed integer $p$ chosen so that $C_t$ is trace-class in the spatial dimension of interest. Sampling proceeds by the factorization $\eta_t = L_t(\varphi)^{-1/2} \xi$ with $\xi$ white noise, implemented numerically by solving a linear elliptic problem (or iterated solves when $p > 1$) at the working discretization. Admissibility relative to $C_\star$ is enforced by restricting $(\tau_t, \kappa_t)$ to compact intervals and, when needed, choosing $C_\star$ within the same SPDE family (so that the uniform operator inequalities follow from uniform ellipticity bounds). This yields a parameterization with a small number of physically interpretable degrees of freedom and a sampling cost comparable to a few PDE solves per Langevin step.
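On a 1D periodic grid the precision operator is Fourier-diagonal, so the elliptic solve degenerates to a spectral scaling. The sketch below (numpy; the parameter values are illustrative, and the normalization constants of continuum white noise are omitted) draws one approximate sample and checks positivity of the spectrum and finiteness of the trace:

```python
import numpy as np

# Sketch of Matern-type sampling on a 1D periodic grid, where the precision
# L = tau * (kappa^2 I - Laplacian)^p is Fourier-diagonal with eigenvalues
# tau * (kappa^2 + w_k^2)^p; eta = L^{-1/2} xi reduces to a spectral scaling.
rng = np.random.default_rng(4)
n, tau, kappa, p = 128, 1.0, 2.0, 2
w = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / n)   # angular Fourier frequencies
ell = tau * (kappa**2 + w**2) ** p               # eigenvalues of L (all positive)

xi_hat = np.fft.fft(rng.normal(size=n))          # white noise in Fourier space
eta = np.real(np.fft.ifft(xi_hat / np.sqrt(ell)))  # one approximate draw

trace_C = np.sum(1.0 / ell)                      # Tr(C) = sum of 1/eigenvalues
print(ell.min() > 0.0, np.isfinite(trace_C), eta.shape)
```

On nonperiodic domains the division by $\sqrt{\ell}$ is replaced by a sparse elliptic solve (finite elements or finite differences) at the working discretization.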
When $H = L^2(D)$ and $C_t(\varphi)$ arises as an integral operator with kernel $k_t^\varphi(x, y)$, we can parameterize $k_t^\varphi$ by stationary kernels (e.g. squared exponential, Matérn) or mixtures thereof. On regular grids, stationary kernels are diagonal in Fourier space and reduce to the diagonal schedule above, with $s_{t,k}$ determined by a low-dimensional spectral density. This connects kernel hyperparameters directly to the spectrum of $C_t$, allowing us to enforce bounds by constraining the spectral density between $m$ and $M$ multiples of the baseline density.
Beyond hard or soft enforcement of the admissibility constraints, we include terms promoting numerical stability and efficient sampling. Typical choices include: (i) trace control, e.g. $\sum_t (\mathrm{Tr}(C_t) - \mathrm{Tr}(C_\star))^2$, to prevent excessive noise at intermediate scales; (ii) smoothness across $t$, as in the discrete-variation penalty above; and (iii) structural sparsity (for low-rank schedules) via $\|U_t\|_{\mathrm{HS}}^2$ or nuclear-norm surrogates. Importantly, all such penalties are posed in terms of continuum-consistent quantities (trace, Hilbert–Schmidt norms, spectral multipliers) so that their effect does not depend qualitatively on discretization.
Optimization of $\mathcal L(\theta, \varphi)$ requires differentiating through $\eta_t \sim \mathcal N(0, C_t(\varphi))$. We use the standard reparameterization
$$
\eta_t = C_t(\varphi)^{1/2}\, \xi,
\qquad \xi \sim \mathcal N(0, I),
$$
implemented by spectral scaling in the Fourier-diagonal case, by low-rank factorizations in the diagonal-plus-low-rank case, or by elliptic solves in the SPDE case. This yields unbiased stochastic gradients of $\mathcal L$ with respect to $\varphi$ under the usual differentiability assumptions on the map $\varphi \mapsto C_t(\varphi)^{1/2}$ on the active discretization. In the diagonal case the derivatives are explicit and cheap; in SPDE families, automatic differentiation through the linear solve (or an adjoint solve) provides gradients with a cost comparable to sampling itself. Since the loss is expressed in the ambient $H$-norm, we additionally ensure that discretized implementations use quadrature-consistent inner products so that training remains invariant under refinement (up to the numerical approximation of $F_\theta$ and of the sampling map $\varphi \mapsto \eta_t$).
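In the diagonal case the pathwise gradient is explicit, which makes it easy to validate reparameterized differentiation. The sketch below (numpy; the toy loss $f(\alpha) = \|\eta(\alpha)\|^2$ and all names are illustrative) compares the analytic pathwise gradient with central finite differences at a fixed noise draw $\xi$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Pathwise (reparameterization) gradients in the diagonal case:
# eta(alpha) = sqrt(s(alpha)) * xi, s = m + (M - m)*sigmoid(alpha), xi fixed.
# For the toy loss f(alpha) = ||eta||^2 = sum_k s_k xi_k^2, the pathwise
# gradient is explicit and can be checked against finite differences.
rng = np.random.default_rng(5)
m, M, K = 0.5, 2.0, 16
alpha = rng.normal(size=K)
xi = rng.normal(size=K)

def loss(a):
    s = m + (M - m) * sigmoid(a)
    return np.sum(s * xi**2)

sig = sigmoid(alpha)
grad = (M - m) * sig * (1.0 - sig) * xi**2       # analytic pathwise gradient

eps = 1e-6
fd = np.array([(loss(alpha + eps * np.eye(K)[k]) - loss(alpha - eps * np.eye(K)[k]))
               / (2 * eps) for k in range(K)])   # central finite differences
print(np.max(np.abs(grad - fd)))
```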
In summary, the PL-DDO formulation couples a denoising regression objective with a constrained, continuum-defined covariance schedule. The structured families above are designed so that (i) admissibility and common Cameron–Martin geometry are maintained uniformly, (ii) GRF sampling remains fast at arbitrary resolutions, and (iii) gradients with respect to φ can be computed reliably. These properties are the prerequisites for the sampling procedure developed next, where the learned Ct(φ) is used as the diffusion covariance in an annealed, preconditioned Langevin scheme.
Having learned a drift surrogate Fθ(⋅, t) together with an admissible covariance schedule Ct(φ), we define a sampler by discretizing a preconditioned Langevin dynamics on H whose diffusion at scale t is precisely Ct(φ). Throughout, we treat the annealing index t ∈ I = {1, …, T} as piecewise-constant ``time’’ and run a short Markov chain at each t, warm-started from the previous scale.
Fix a scale $t$. The ideal continuum dynamics corresponding to the preconditioned score drift $F(\cdot, t; \varphi)$ is the $H$-valued stochastic evolution
$$
dU_s = F(U_s, t; \varphi)\, ds + \sqrt{2}\, dW^{C_t(\varphi)}_s,
$$
where $W^{C_t(\varphi)}$ denotes a $C_t(\varphi)$-Wiener process. Replacing $F$ by $F_\theta$ yields the learned dynamics. In the annealed setting we run for a finite horizon at each $t$, with the terminal state at $t$ serving as the initial condition at $t + 1$. We emphasize that $C_t(\varphi)$ plays a dual role: it specifies the diffusion covariance in the dynamics and, through the training problem, it shapes the Cameron–Martin geometry in which the learned drift is an approximation to the true preconditioned score drift.
Let $h_t > 0$ be a step size at scale $t$, and let $M$ be the number of inner steps per scale. On a finite-dimensional discretization $H_n \subset H$ (e.g. a Galerkin space) we use the explicit EM update
$$
u^{(n+1)} = u^{(n)} + h_t\, F_\theta(u^{(n)}, t) + \sqrt{2 h_t}\; \xi_t^{(n)},
\qquad \xi_t^{(n)} \sim \mathcal N(0, C_t(\varphi)),
$$
where $\{\xi_t^{(n)}\}$ are independent across iterations $n$ and across scales $t$. In practice we draw $\xi_t^{(n)}$ via the reparameterization $\xi_t^{(n)} = C_t(\varphi)^{1/2} z^{(n)}$ with $z^{(n)} \sim \mathcal N(0, I)$ in the chosen basis. The update is the natural discretization of the continuum dynamics and is easy to implement when $F_\theta$ is a neural operator and $C_t(\varphi)$ admits fast sampling.
At the continuum level, explicit EM is sensitive to stiffness in the drift. In our setting this stiffness is particularly visible because the preconditioned drift decomposes as
$$
F_\theta(u, t) = -u + G_\theta(u, t),
$$
so that the linear damping term $-u$ is always present, while $G_\theta$ may be only locally Lipschitz on typical data regimes. This observation motivates a semi-implicit alternative that treats $-u$ more stably.
A standard stabilization in (infinite-dimensional) Langevin integrators is to apply a CN discretization to the linear Ornstein–Uhlenbeck component and keep the nonlinear part explicit. Concretely, applying CN to $-u$ and EM to $G_\theta$ yields
$$
u^{(n+1)} = u^{(n)} - \tfrac{h_t}{2}\big( u^{(n)} + u^{(n+1)} \big) + h_t\, G_\theta(u^{(n)}, t) + \sqrt{2 h_t}\; \xi_t^{(n)}.
$$
Solving for $u^{(n+1)}$ gives the explicit affine form
$$
u^{(n+1)} = a_t\, u^{(n)} + b_t\, G_\theta(u^{(n)}, t) + \frac{\sqrt{2 h_t}}{1 + h_t/2}\; \xi_t^{(n)},
\qquad
a_t = \frac{1 - h_t/2}{1 + h_t/2},
\quad
b_t = \frac{h_t}{1 + h_t/2}.
$$
Two properties of this scheme are particularly useful for our purposes. First, the linear factor satisfies $|a_t| < 1$ for all $h_t > 0$, which prevents the explicit instability that can arise in the EM update when $h_t$ is not sufficiently small relative to the linear damping. Second, when $G_\theta \equiv 0$ the CN scheme preserves the correct Gaussian invariant law for the linear preconditioned OU dynamics (up to discretization effects in $H_n$), a feature that is aligned with the role of $\mu_t = \mathcal N(0, C_t(\varphi))$ as the reference measure at each scale. For these reasons, CN-type updates are often preferable at small noise levels (late in the annealing) where stability constraints are most restrictive.
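The stability comparison can be made exact on the linear test problem $G_\theta \equiv 0$ in a single mode with noise variance $c$. Iterating the variance recursions induced by the two updates (a short numpy sketch; the values of $h$ and $c$ are illustrative) shows that explicit EM equilibrates at a step-size-biased variance while the CN scheme recovers the exact invariant variance $c$, consistent with the invariance property noted above:

```python
import numpy as np

# One-mode linear test problem (G_theta = 0, noise variance c, target variance c).
# EM:  u <- (1-h) u + sqrt(2h) xi        => variance recursion v <- (1-h)^2 v + 2hc
# CN:  u <- a u + sqrt(2h)/(1+h/2) xi    => variance recursion v <- a^2 v + 2hc/(1+h/2)^2
h, c = 0.5, 1.0
a = (1 - h / 2) / (1 + h / 2)                    # CN linear factor, |a| < 1

v_em, v_cn = 0.0, 0.0
for _ in range(200):                             # iterate to the fixed points
    v_em = (1 - h) ** 2 * v_em + 2 * h * c
    v_cn = a**2 * v_cn + 2 * h * c / (1 + h / 2) ** 2

print(v_em, c / (1 - h / 2), v_cn, c)            # EM is biased, CN is exact
```

The EM fixed point is $c/(1 - h/2)$, an $O(h)$ inflation of the target variance, while the CN fixed point equals $c$ for every $h > 0$ in this linear setting.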
We stress that both the EM and the CN updates use exactly the same learned ingredients (Fθ, Ct(φ)); the distinction is purely numerical. Our theory in the next section is stated for the continuum dynamics and for stable discretizations; empirically, the CN variant enlarges the range of usable step sizes without sacrificing resolution invariance.
We initialize at the coarsest/noisiest scale by drawing
$$
u_1^{(0)} \sim \mathcal N\big(0, C_1(\varphi)\big),
$$
and then, for $t = 1, \dots, T$, we run $M$ iterations of either the EM or the CN update with covariance $C_t(\varphi)$, producing a terminal state $u_t^{(M)}$. We set $u_{t+1}^{(0)} := u_t^{(M)}$ and proceed. The output is $u_T^{(M)}$, which we interpret as an approximate sample from $\nu_T$ (and, in regimes where $\nu_T \approx \mu$, from $\mu$). This warm-started annealing is essential: it amortizes mixing by transporting mass through a sequence of intermediate geometries rather than attempting to reach $\nu_T$ in one long chain.
Because Ct(φ) is learned, the geometry of the sampler varies with φ and can change markedly across t. We therefore choose {ht} in a manner that is robust to such variation. At a minimum, stability of explicit EM suggests taking ht inversely proportional to an estimate of the Lipschitz constant of Fθ(⋅, t) on typical states. In practice we approximate this by one of the following proxies on the active discretization: (i) a running estimate of ∥∇uFθ(u, t)∥ via automatic differentiation on minibatches of chain states; (ii) a spectral proxy based on the linear part, namely ensuring ht is not so large that the effective contraction (1 − ht) becomes negative; or (iii) a conservative schedule decreasing with t when t corresponds to decreasing noise (late scales are typically more ``peaked’’ and hence less forgiving).
For CN updates, we may choose larger ht because the linear stiffness is controlled by |at| < 1. Nevertheless, the nonlinear term Gθ still imposes a practical limit: excessively large bt leads to poor approximation of the target drift and can cause transient divergence before the −u term contracts. A simple and effective heuristic is to pick ht so that bt remains below a prescribed threshold (e.g. bt ≤ b̄) while allowing at to be close to 1 at early (high-noise) scales and smaller at late (low-noise) scales. We also note that the uniform equivalence bounds mC⋆ ≼ Ct(φ) ≼ MC⋆ imply that the typical noise energy and the induced Cameron–Martin norm remain comparable across t; this prevents the schedule from producing degenerate diffusion directions and thereby supports stable choices of ht that do not depend on discretization dimension.
All updates are executed on a finite-dimensional representation Hn determined by the query discretization. The drift Fθ(⋅, t) is evaluated as a discretization-consistent neural operator, and ξt(n) is drawn by sampling the truncated Gaussian 𝒩(0, PnCt(φ)Pn), where Pn : H → Hn is the projection associated with the discretization. Provided the sampling map Ct(φ)1/2 is implemented in a manner compatible with Pn (Fourier truncation, KL truncation, or discretized SPDE solves), the resulting chain is consistent under refinement in the sense that it approximates the same continuum algorithm, rather than a dimension-dependent surrogate.
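A minimal sketch of such a discretization-compatible Gaussian draw, via a Karhunen–Loève truncation in an L2-orthonormal sine basis on [0, 1] (the basis, eigenvalue decay, and helper name are illustrative assumptions): the law of the retained modes is independent of the evaluation grid, so refining the grid does not change the continuum algorithm being approximated.

```python
import numpy as np

def sample_grf_kl(n_grid, n_modes, eigs, rng=None):
    """Draw N(0, Pn C Pn) by sampling KL coefficients sqrt(lambda_j) xi_j
    and evaluating the truncated expansion on a midpoint grid of [0, 1]."""
    rng = np.random.default_rng(rng)
    x = (np.arange(n_grid) + 0.5) / n_grid
    j = np.arange(1, n_modes + 1)
    phi = np.sqrt(2.0) * np.sin(np.pi * np.outer(j, x))  # orthonormal in L2(0,1)
    xi = rng.standard_normal(n_modes)
    return (np.sqrt(eigs) * xi) @ phi

eigs = np.arange(1, 33) ** -2.0                  # trace-class spectral decay
u_coarse = sample_grf_kl(64, 32, eigs, rng=0)    # same KL coefficients,
u_fine = sample_grf_kl(256, 32, eigs, rng=0)     # evaluated on a finer grid
```

With a fixed seed the two calls share their KL coefficients, so u_fine is a refinement of the same continuum sample rather than an unrelated draw.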
The purpose of this section is to make explicit the algorithmic object to be analyzed: an annealed, preconditioned Langevin chain whose preconditioner is learned but constrained to an admissible operator family. In the next section we justify that, under the standing spectral bounds on Ct(φ) and standard monotonicity/Lipschitz hypotheses on the drift, the underlying continuum dynamics is well posed and the measures appearing in the construction are mathematically meaningful, so that the training and sampling procedures are posed on H rather than tied to a particular discretization.
We justify that, under our standing admissibility and regularity hypotheses, the continuum objects underlying training and sampling are mathematically meaningful on H: (i) the reference measures μt = 𝒩(0, Ct(φ)) and the perturbed laws νt are well defined and satisfy νt ∼ μt; (ii) the log-density Φt = log (dνt/dμt) admits a Cameron–Martin derivative so that the preconditioned score drift F(⋅, t; φ) is defined νt-a.s.; (iii) the continuum dynamics with time-varying covariance is well posed; and (iv) the denoising objective is finite.
Our basic structural requirement is the uniform operator inequality
mC⋆ ≼ Ct(φ) ≼ MC⋆ for all t ∈ I,
for fixed 0 < m ≤ M < ∞ and trace-class C⋆. First, the upper bound implies each Ct(φ) is trace-class with the uniform bound Tr(Ct(φ)) ≤ M Tr(C⋆), and hence ηt ∼ 𝒩(0, Ct(φ)) is H-valued with 𝔼∥ηt∥2 = Tr(Ct(φ)).
Second, the equivalence yields a shared Cameron–Martin space: by standard operator monotonicity arguments (e.g. comparing Ct1/2 and C⋆1/2 via functional calculus), we have
Im (Ct(φ)1/2) = Im (C⋆1/2) =: Hμ,
and the corresponding Cameron–Martin norms are uniformly equivalent. In particular, translations by elements of Hμ preserve measure equivalence for all μt, uniformly in t.
A convenient sufficient mechanism for the uniform inequality is the congruence parameterization
Ct(φ) = C⋆1/2 St(φ) C⋆1/2,
where St(φ) is bounded, self-adjoint, positive, and σ(St(φ)) ⊂ [m, M]. The trace bound and the shared Cameron–Martin space then follow immediately from trace-class stability under bounded congruences and the spectral bounds on St(φ).
Fix t ∈ I. Under
assumption (A) we have vt = u + ηt,
while under (B) we have vt = Atu + ηt
with At(H) ⊆ Hμ.
In either case, conditioning on u yields that νt is a mixture
of Cameron–Martin shifts of μt:
νt(⋅) = ∫μt(⋅ − Atu) μ(du),
where in case (A) we interpret At = Id and
require μ(Hμ) = 1.
For each fixed x ∈ Hμ, the Cameron–Martin theorem gives
(dμt(⋅ − x)/dμt)(v) = exp (⟨x, v⟩Hμt − ½∥x∥Hμt2),
with the pairing and norm taken in the Cameron–Martin geometry of μt. Applying this with x = Atu ∈ Hμ and averaging over u yields the explicit Radon–Nikodym derivative
(dνt/dμt)(v) = ∫ exp (⟨Atu, v⟩Hμt − ½∥Atu∥Hμt2) μ(du).
The right-hand side is strictly positive μt-a.s. and finite provided, for instance, that 𝔼∥Atu∥Hμ2 < ∞, which we take as a mild integrability condition (automatic in many applications where At is smoothing and u ∈ H has finite second moment). Consequently νt ≪ μt, and since each component measure μt(⋅ − Atu) is itself equivalent to μt, the mixture satisfies μt ≪ νt as well; hence νt ∼ μt and Φt = log (dνt/dμt) is μt-a.s. (equivalently νt-a.s.) well defined.
Moreover, this Radon–Nikodym identity is formulated intrinsically on H using Hμ duality, so it is consistent under discretization: Galerkin projections simply replace μt and νt by their pushforwards to Hn, without altering the continuum identity.
The preconditioned drift is defined by
F(u, t; φ) := − u + R DHμΦt(u),
where DHμΦt denotes the Fréchet derivative along Hμ (i.e. directional differentiability in Cameron–Martin directions). In the present setting, existence of DHμΦt can be justified under standard regularity assumptions on the density dνt/dμt. One sufficient condition is that the map
$$
v \;\mapsto\; \frac{d\nu_t}{d\mu_t}(v)
$$
belongs to a Gaussian Sobolev class W1, 2(μt)
on H, in which case DHμΦt
exists μt-a.s. and is
square integrable; see the Malliavin/abstract Wiener space literature on
differentiability of log-densities with respect to Gaussian reference
measures. For our purposes, we record this as an assumption: DHμΦt
exists and is such that u ↦ F(u, t; φ)
is globally Lipschitz and dissipative uniformly in t, as required in Theorem~1.
We consider the annealed dynamics in which t is piecewise constant. Let 0 = s0 < s1 < ⋯ < sT be switching times and set t(s) = k for s ∈ [sk − 1, sk). We may realize the noise via a cylindrical Wiener process Bs on H and write
dus = F(us, t(s); φ) ds + √2 Ct(s)(φ)1/2 dBs.
By the uniform trace bound, each Ct(s)(φ)1/2 is Hilbert–Schmidt, so the stochastic integral above is well defined in H. If F(⋅, t; φ) is globally Lipschitz and satisfies a uniform dissipativity condition
⟨F(u, t; φ), u⟩ ≤ − α∥u∥2 + β for all u ∈ H,
then standard infinite-dimensional SDE/SPDE theory (e.g. via a fixed-point argument for mild solutions on each interval [sk − 1, sk) combined with Grönwall estimates and Itô isometry using the trace bound) yields existence and uniqueness of a strong solution with continuous H-valued paths, together with uniform moment bounds such as
sups ≤ sT 𝔼∥us∥2 < ∞,
where the constants depend on α, β, m, M, Tr(C⋆)
but not on any discretization dimension. The same statement applies to
the learned dynamics obtained by replacing F with Fθ, provided
Fθ(⋅, t)
obeys analogous Lipschitz/dissipativity bounds (which can be enforced by
architecture and/or spectral normalization and is compatible with our
regularization of φ).
Fix t and consider the autonomous fixed-scale dynamics. The definition of F(⋅, t; φ) as a
μt-preconditioned
score drift implies that νt is formally
invariant: for smooth cylindrical test functions f, the generator
ℒtf(u) = ⟨F(u, t; φ), Df(u)⟩ + Tr (Ct(φ)D2f(u))
satisfies ∫ℒtf dνt = 0.
This identity is the infinite-dimensional analogue of the
finite-dimensional fact that Langevin with drift equal to the score of
the target has the target as invariant law; here the
integration-by-parts is performed with respect to the Gaussian reference
μt and
transferred to νt using Φt. Under the
dissipativity condition above, one obtains a Feller Markov semigroup with at
most one invariant probability measure, hence νt is the unique
invariant law at scale t.
Finally, we verify that the denoising objective is finite under mild growth conditions. Since ηt ∼ 𝒩(0, Ct(φ)) is H-valued with 𝔼∥ηt∥2 = Tr(Ct(φ)), we have 𝔼∥ηt∥2 ≤ M Tr(C⋆) by the uniform trace bound. Moreover vt = Atu + ηt is H-valued, and if Fθ(⋅, t) has at most linear growth uniformly in t, namely ∥Fθ(u, t)∥ ≤ c0 + c1∥u∥, then
𝔼∥Fθ(vt, t)∥2 ≤ 2c02 + 2c12 𝔼∥vt∥2 < ∞,
whenever 𝔼∥u∥2 < ∞ and the trace bound holds.
Thus ℒ(θ, φ) is
finite for admissible φ and
such Fθ,
and the regularizer ℛ(φ) may be viewed as ensuring the standing operator bounds needed for admissibility and well-posedness. This
completes the well-posedness layer required before we turn to
quantitative sampling error bounds.
We now quantify the discrepancy between the distribution produced by the annealed Euler–Maruyama sampler driven by the learned drift Fθ and the terminal perturbed law νT (and, by a further triangle inequality, the target μ when νT ≈ μ). Our objective in this section is to exhibit an error decomposition whose constants depend on continuum quantities and on the uniform equivalence constants m, M, but not on any discretization resolution.
Fix t ∈ I and a step size ht > 0. We consider the (one-step) Euler kernels on H,
Kth : u ↦ u + ht F(u, t; φ) + √(2ht) ξt,  K̂th : u ↦ u + ht Fθ(u, t) + √(2ht) ξt,
where ξt ∼ 𝒩(0, Ct(φ)) is independent of u. For a probability measure ρ on H we write ρKth for its pushforward under the kernel. The full sampler is the composition of M steps per scale, hence the terminal law is
π̂ := ρ0 K̂1h M K̂2h M⋯K̂Th M, K̂th M := (K̂th)M,
with initialization ρ0 = μ1 = 𝒩(0, C1(φ)) for definiteness. We analogously define the ``ideal discrete-time’’ law
πh := ρ0 K1h M K2h M⋯KTh M.
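The composition defining π̂ can be sketched as a warm-started double loop in a shared spectral basis (illustrative function name; a toy drift stands in for Fθ):

```python
import numpy as np

def annealed_em_sampler(F_theta, ct_eigs, h, M, rng=None):
    """Annealed Euler-Maruyama: M inner steps per scale t with noise
    N(0, C_t(phi)) drawn mode-wise; the terminal state of scale t
    warm-starts scale t+1, and the final state is a draw from pi-hat."""
    rng = np.random.default_rng(rng)
    T, n = ct_eigs.shape
    u = np.sqrt(ct_eigs[0]) * rng.standard_normal(n)   # rho_0 = N(0, C_1(phi))
    for t in range(T):
        for _ in range(M):
            xi = np.sqrt(ct_eigs[t]) * rng.standard_normal(n)
            u = u + h * F_theta(u, t) + np.sqrt(2.0 * h) * xi
    return u
```

For the Ornstein–Uhlenbeck drift F(u, t) = −u the chain's stationary per-mode variance is ct/(1 − h/2), which converges to the correct value ct as h → 0 and gives a quick correctness check.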
The estimates below rest on a standard contractivity condition for
the fixed-scale dynamics. Concretely, we assume that for each t ∈ I the drift F(⋅, t; φ) is
globally Lipschitz on H with
constant L and is (strongly)
monotone with constant α > 0,
uniformly in t. In
preconditioned models the natural monotonicity is often formulated in
the Ct(φ)−1-geometry;
using mC⋆ ≼ Ct(φ) ≼ MC⋆,
one passes between the corresponding norms with constants depending only
on m, M. In
particular, any α stated in a Ct−1-metric yields an α′ in the ambient H-metric, with α′ depending on m, M and the baseline geometry, but not on discretization dimension.
We decompose the terminal error as
W2(π̂, νT) ≤ W2(π̂, πh) + W2(πh, νT).
The first term is the drift-approximation error; the second term splits into a discretization bias (Euler bias) and an annealing error (finite number of scales / imperfect bridging of {νt}).
Let δ satisfy the uniform mean-square error bound
supt ∈ I 𝔼vt ∼ νt ∥Fθ(vt, t) − F(vt, t; φ)∥2 ≤ δ.
To convert this mean-square bound into a Wasserstein bound for the Markov kernels, we use a synchronous coupling: evolve two chains from the same initial point and with the same Gaussian increment ξt at each step.
For one Euler step at scale t,
this gives
U+ − Û+ = ht(F(U, t; φ) − Fθ(U, t)),
hence
𝔼∥U+ − Û+∥2 ≤ ht2 𝔼∥F(U, t; φ) − Fθ(U, t)∥2.
Under the standing moment bounds from the previous section and the fact that, for contractive kernels, the chain remains close (in W2) to νt after a short burn-in, one obtains a perturbation estimate of the form
W2(ρK̂th, ρKth) ≤ c1 ht δ1/2,
with c1 depending on m, M and the uniform Lipschitz/dissipativity constants, but independent of any discretization. Iterating across M steps and T scales, and using the contractivity of Kth implied by strong monotonicity when ht ≤ 1/L, yields
W2(π̂, πh) ≲ c1 δ1/2 ∑t M ht.
In the common regime ht ≡ h
and fixed total ``algorithmic time’’ ∑tMht = O(1),
the drift-induced error is O(δ1/2).
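The coupling step above is easy to verify numerically: with shared Gaussian increments the noise cancels from the chain difference, leaving only the accumulated drift mismatch (a toy sketch; the helper name and the constant-bias perturbation are our own):

```python
import numpy as np

def coupled_gap(F_true, F_approx, u0, ct_eigs, h, n_steps, rng=None):
    """Synchronous coupling: two Euler chains from the same state, driven
    by the SAME increments; ||u - v|| isolates the drift approximation error."""
    rng = np.random.default_rng(rng)
    u, v = u0.copy(), u0.copy()
    for _ in range(n_steps):
        noise = np.sqrt(2.0 * h) * np.sqrt(ct_eigs) * rng.standard_normal(u0.shape)
        u = u + h * F_true(u) + noise
        v = v + h * F_approx(v) + noise
    return np.linalg.norm(u - v)

bias = np.zeros(8); bias[0] = 0.1            # drift error of norm delta = 0.1
gap = coupled_gap(lambda u: -u, lambda u: -u + bias,
                  np.ones(8), 0.5 * np.ones(8), h=0.1, n_steps=500, rng=1)
```

For this linear pair the gap recursion e+ = (1 − h)e − h·bias converges to norm 0.1, independently of the injected noise.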
Even with the true drift, the Euler kernel Kth has an invariant law νt, h which differs from the continuum invariant νt. Under standard regularity (global Lipschitz and monotonicity) and trace-class noise, one has a strong error estimate for the Euler scheme in Hilbert spaces, which implies a Wasserstein bias bound of the form
W2(νt, h, νt) ≤ c2 ht1/2,
again with constants depending on the continuum parameters and on m, M through the noise geometry, but not on any discretization resolution.
Moreover, because we run only M steps per scale, we incur a finite-time mixing error to νt, h. By contractivity of Kth in W2 (a discrete analogue of the continuum contraction), there exist γt > 0 and Cmix < ∞ such that for any initial ρ with finite second moment,
W2(ρ(Kth)M, νt, h) ≤ Cmix e−γtMht W2(ρ, νt, h),
with γt comparable to the dissipativity constant α up to factors controlled by m, M. Combining the bias and mixing estimates across scales yields an accumulated discretization-and-mixing contribution bounded by
∑t (c2 ht1/2 + Cmix e−γtMht Ξt),
where Ξt is a scale-dependent warm-start term (controlled by second moments, hence dimension-free under the standing moment bounds). In particular, taking Mht ≳ γt−1log (1/ε) makes the second term O(ε) uniformly over discretizations.
The remaining discrepancy is intrinsic to the annealing/bridging
design: even if we could sample from νt at each
scale, a finite number of scales and the chosen forward map u ↦ Atu
determine how closely νT approximates
μ (and how difficult the
transitions νt → νt + 1
are). We encode this as an additive term AnnealErr({νt}), for instance
AnnealErr({νt}) := W2(νT, μ) + ∑t < T W2(νt, νt + 1),
or any sharper telescoping surrogate reflecting the actual schedule and
the number of inner steps. This term depends on the modeling choice of
scales and corruption operators, but it is a continuum quantity and
therefore does not deteriorate with grid refinement.
Collecting , , , and the annealing contribution yields the following representative estimate, which is a refined version of Theorem~2 stated in the overview.
Since W2 metrizes weak convergence on sets with uniformly bounded second moments, the theorem above immediately implies a bounded-Lipschitz estimate: for ∥ ⋅ ∥BL the dual norm over functions with ∥f∥∞ ≤ 1 and Lip(f) ≤ 1, we have
∥π̂ − νT∥BL ≤ W1(π̂, νT) ≤ W2(π̂, νT),
hence the same decomposition applies (with the same dimension-free
constants) to bounded test-function errors.
Finally, we emphasize why the bounds are genuinely resolution-independent. The contraction parameters and Euler error constants are obtained at the level of the continuum Hilbert space H, using only (i) trace-class noise bounds and (ii) drift coercivity/Lipschitz properties measured in norms whose equivalence is controlled by m, M. For a Galerkin projection Pn : H → Hn, the projected dynamics inherit the same constants uniformly in n, and W2-stability under weak convergence allows passing to the n → ∞ limit. Thus, once δ, {ht}, and the annealing design are fixed, the number of steps TM needed to reach a prescribed accuracy bound does not blow up with the grid.
In this section we isolate a regime in which the dependence of mixing on the covariance choice Ct(φ) can be characterized by a single operator condition number, and in which optimizing Ct(φ) within an admissible family yields a provable (and essentially unimprovable) reduction in the number of inner Langevin steps required at each scale.
Fix t ∈ I. We assume that νt is absolutely continuous with respect to μt = 𝒩(0, Ct(φ)) and admits a log-concave density of the form
(dνt/dμt)(u) ∝ exp (− Ψt(u)),
where Ψt : H → ℝ is differentiable along the Cameron–Martin space Hμt, with derivative DHμtΨt(u) ∈ Hμt*. With Φt = log (dνt/dμt), we have Φt = −Ψt + const, hence the true drift reads
F(u, t; φ) = −u − RDHμtΨt(u).
We further assume that Ψt is α-strongly convex and L-smooth in the preconditioned geometry, uniformly in t, in the following sense: for all u, v ∈ H and all t,
α∥u − v∥Ct−12 ≤ ⟨DHμtΨt(u) − DHμtΨt(v), u − v⟩ ≤ L∥u − v∥Ct−12,
where ∥w∥Ct−1 := ∥Ct(φ)−1/2w∥ on the common image space (well-defined thanks to uniform equivalence). This condition is the function-space analogue of Hessian bounds in preconditioned coordinates; it implies monotonicity and Lipschitz properties of F(⋅, t; φ) with constants that depend on (α, L) and the equivalence constants (m, M), but not on discretization.
For a bounded, positive, self-adjoint operator C satisfying mC⋆ ≼ C ≼ MC⋆, define the preconditioned smoothness and convexity moduli by
Lt(C) := supu ≠ v ⟨DHμtΨt(u) − DHμtΨt(v), u − v⟩ / ∥u − v∥C−12,  αt(C) := infu ≠ v ⟨DHμtΨt(u) − DHμtΨt(v), u − v⟩ / ∥u − v∥C−12,
and the associated condition proxy
κt(C) := Lt(C)/αt(C).
When the two-sided bound above holds with C = Ct(φ), we have Lt(Ct(φ)) ≤ L and αt(Ct(φ)) ≥ α, hence κt(Ct(φ)) ≤ L/α.
More generally, if one compares two covariances C and C̃ in the admissible family, the
uniform inequalities mC⋆ ≼ C, C̃ ≼ MC⋆
imply
∥w∥C−12 ≃m, M ∥w∥C̃−12,
so κt(⋅)
varies in a controlled manner over the family, with constants depending
only on m, M.
Consider the fixed-scale Euler kernel Kth from the previous section with true drift F(⋅, t; φ). Under the strong convexity and smoothness assumptions above, one obtains a discrete-time contraction estimate in W2 whose rate is governed by αt(Ct(φ)), with admissible step sizes determined by Lt(Ct(φ)). A representative statement is as follows.
The salient point is that the dependence on the covariance enters through κt(Ct(φ)), and the bounds are dimension-free because αt(C) and Lt(C) are continuum quantities expressed in operator norms on H (with equivalence controlled by m, M).
Let 𝒞 be an admissible family of
covariances satisfying mC⋆ ≼ C ≼ MC⋆,
for instance the congruence family C = C⋆1/2SC⋆1/2
with S bounded, positive,
self-adjoint and σ(S) ⊂ [m, M].
For each scale t, define the
best achievable proxy in the family by
κt⋆ := infC ∈ 𝒞κt(C).
Since the proposition above shows that the number of inner steps Mt need only scale linearly with κt(C), any choice Ct(φ) that reduces κt(Ct(φ)) yields a commensurate reduction in the required number of inner steps.
In particular, comparing to the baseline C⋆, we obtain an
improvement factor bounded by κt(C⋆)/κt(Ct(φ)).
From the modeling perspective, the role of the covariance-learning component φ is precisely to search, within 𝒞, for covariances that better match the local anisotropy of Ψt as seen through the preconditioned moduli; this is the function-space analogue of preconditioning by an approximate inverse Hessian.
We now show that the κ-dependence above cannot, in general, be improved when the covariance is restricted to a prescribed family 𝒞. The sharpest statement is obtained on quadratic instances, for which the proxy κt(C) is exact.
Fix t and consider the
Gaussian target νt = 𝒩(0, Q)
with Q trace-class and
strictly positive. Suppose Q
commutes with all C ∈ 𝒞 (e.g.,
all are diagonal in a common Fourier/KL basis). Then νt ≪ μt = 𝒩(0, C) with Ψt(u) = ½⟨Bu, u⟩ for a bounded positive operator B diagonal in the same basis, and
the preconditioned Langevin dynamics decouple mode-wise. Writing {λi(C)}
and {λi(Q)}
for eigenvalues in the shared basis, one verifies that the relevant
preconditioned spectrum is {λi(C)/λi(Q)},
hence
$$
\kappa_t(C)
\;=\;
\frac{\max_i \lambda_i(C)/\lambda_i(Q)}{\min_i
\lambda_i(C)/\lambda_i(Q)}.
$$
For the Euler scheme with step size h ≤ 1/Lt(C),
the i-th mode evolves as a
one-dimensional AR(1) recursion whose contraction factor is 1 − hai
with ai ∈ [αt(C), Lt(C)].
Consequently, no choice of h
can make all modes contract faster than the slowest one without
violating stability on the fastest one, and the number of steps required
to reduce the W2
distance by a factor ε must
scale at least linearly in Lt(C)/αt(C).
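In this commuting setting both the proxy κt(C) and the implied iteration count are elementary functions of the shared spectra, which makes the linear-in-κ scaling easy to tabulate (a sketch; function names are ours):

```python
import numpy as np

def kappa(lam_C, lam_Q):
    """Condition proxy for commuting covariances: the preconditioned
    spectrum is {lambda_i(C) / lambda_i(Q)}."""
    r = lam_C / lam_Q
    return r.max() / r.min()

def steps_to_contract(lam_C, lam_Q, eps=1e-3):
    """With h = 1/L_t(C) (stability on the stiffest mode), the slowest AR(1)
    factor is 1 - alpha_t/L_t = 1 - 1/kappa, so steps scale linearly in kappa."""
    r = lam_C / lam_Q
    slowest = 1.0 - r.min() / r.max()
    return int(np.ceil(np.log(eps) / np.log(slowest)))
```

On a two-mode example with λ(Q) = (1, 0.01), an isotropic C gives κ = 100 and a near-matched C gives κ = 2, with the step counts differing by roughly the same factor.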
The two theorems together justify the interpretation of covariance learning as an essentially optimal acceleration mechanism within a constrained operator class: improvements in mixing are possible precisely to the extent that one can reduce κt(C) over C ∈ 𝒞, and for commuting Gaussian instances this reduction is both necessary and sufficient up to universal constants.
We present three experimental suites designed to isolate (i) whether
learning the covariance schedule Ct(φ)
produces measurable acceleration at fixed accuracy, (ii) whether this
acceleration is stable across discretizations (continuum-first
invariance), and (iii) whether the method remains effective in
conditional settings where per-sample throughput is the primary
constraint. In all experiments we train by the objective
ℒ(θ, φ) = 𝔼t 𝔼u, ηt ∥ηt + Fθ(Atu + ηt, t)∥2 + λ ℛ(φ),
with a schedule t ∈ [T] and inner sampling
steps M per scale in the
annealed Euler–Maruyama sampler. Unless stated otherwise, At = Id in
unconditional tasks and At is chosen as
a fixed low-pass smoother in tasks where μ(Hμt) < 1
is expected at fine scales. We parameterize Ct(φ)
in a Fourier/KL basis via a congruence Ct(φ) = C⋆1/2St(φ)C⋆1/2
with St(φ)
diagonal and σ(St) ⊂ [m, M]
enforced by projection after each optimizer step; this ensures a shared
Cameron–Martin space and eliminates discretization-dependent
admissibility failures.
We report (1) steps-to-accuracy: the smallest TM achieving a prescribed test metric threshold, (2) wall-clock cost: end-to-end runtime including GRF sampling and operator evaluation at fixed hardware and batch size, and (3) cross-resolution transfer: performance when training and sampling are carried out on different discretizations. For unconditional generative tasks we use sliced W2 (finite-dimensional projections), MMD with a Sobolev kernel, and domain-relevant summary statistics (spectral energy, structure functions). For conditional inverse problems we also report posterior predictive error, coverage of credible intervals, and effective sample size per second for downstream quantities of interest. Baselines include (i) fixed Ct ≡ C⋆ with identical Fθ architecture, (ii) hand-designed schedules (e.g. geometric scaling of C⋆), and (iii) classical function-space samplers (pCN and preconditioned Crank–Nicolson variants) where applicable.
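For reference, the sliced-W2 metric used here admits a short Monte Carlo estimator over random one-dimensional projections (a sketch assuming equal-size sample sets on a common grid; the function name is ours):

```python
import numpy as np

def sliced_w2(X, Y, n_proj=64, rng=None):
    """Sliced Wasserstein-2 between sample sets X, Y (rows = discretized
    fields): average the 1-D W2^2 costs, each computed exactly by sorting
    the projected samples, over random unit directions."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_proj):
        v = rng.standard_normal(X.shape[1])
        v /= np.linalg.norm(v)
        a, b = np.sort(X @ v), np.sort(Y @ v)
        total += np.mean((a - b) ** 2)
    return np.sqrt(total / n_proj)
```

For a pure translation Y = X + c·w the estimator concentrates around c∥w∥/√d, which gives a convenient sanity check.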
Our first suite uses synthetic measures μ on H = L2(𝕋d) constructed as mixtures of Gaussian random fields with known and intentionally mismatched spectral structure. Concretely, we sample a latent label z ∈ {1, …, K} and then draw u ∼ 𝒩(0, Qz), where each Qz is diagonal in the Fourier basis with eigenvalues λk(Qz) concentrated on disjoint bands and with prescribed anisotropy ratios across modes. This setting is designed so that (i) the optimal preconditioner in the commuting Gaussian case is explicit, and (ii) the κ-proxy varies over orders of magnitude depending on whether C matches the dominant bands of {Qz}.
We train PL-DDO with a Fourier neural operator for Fθ(⋅, t)
and compare learned Ct(φ)
against C⋆ chosen
as a Matérn-type prior with smooth spectrum. We observe that learned
schedules shift mass toward the active frequency bands and, as t increases, progressively sharpen
the spectrum while remaining within the admissible bounds [m, M]. The principal
quantitative outcome is that the inner step count M required to reach a fixed
sliced-W2 threshold
decreases proportionally to the reduction in the empirically estimated
condition proxy
$$
\widehat\kappa_t(C)
=
\frac{\widehat L_t(C)}{\widehat \alpha_t(C)},
\qquad
\widehat L_t(C),\widehat\alpha_t(C)
\ \text{estimated from finite differences along sampled directions.}
$$
In particular, at fixed T we
find that replacing C⋆ by the learned Ct(φ)
yields a consistent reduction in M across mixture weights and across
different anisotropy ratios, while the fixed-covariance baseline
requires small step sizes ht (stability
dictated by the fastest mode) and therefore more inner steps.
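The finite-difference estimator for κ̂t can be sketched as follows (a heuristic matching the display above; on a quadratic drift it recovers the exact spectral range of the Hessian):

```python
import numpy as np

def kappa_hat(F, states, eps=1e-4, n_dir=200, rng=None):
    """Empirical condition proxy kappa_hat = L_hat / alpha_hat: both moduli
    are estimated from directional finite differences of the drift F at
    sampled chain states (a diagnostic, not a rigorous bound)."""
    rng = np.random.default_rng(rng)
    stiffness = []
    for u in states:
        for _ in range(n_dir):
            d = rng.standard_normal(u.shape)
            d /= np.linalg.norm(d)
            # directional curvature of the negated drift along d
            stiffness.append(-np.dot(F(u + eps * d) - F(u), d) / eps)
    stiffness = np.array(stiffness)
    return stiffness.max() / stiffness.min()
```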
To test discretization robustness we train on a coarse grid, but sample and evaluate on finer grids by drawing ηt ∼ 𝒩(0, Ct(φ)) directly at the evaluation resolution. Because Ct(φ) is defined through a basis-consistent spectral rule and Fθ is discretization-consistent, the step reduction persists with only minor degradation in summary statistics. We also repeat evaluation under random regridding (nonuniform sample points followed by interpolation to the operator input format) and find that the learned covariance remains effective; in contrast, hand-tuned schedules exhibit nontrivial sensitivity to the discretization change, typically manifesting as either overly rough samples (excess high-frequency power) or slow mixing in the low-frequency modes.
Our second suite targets non-Gaussian measures obtained as pushforwards of Gaussian initial conditions through a PDE flow. Let u0 ∼ 𝒩(0, Q) be a divergence-free random field and define u = 𝒢(u0), where 𝒢 maps initial vorticity to a time-τ state of 2D Navier–Stokes on 𝕋2 at moderate Reynolds number. The resulting μ exhibits intermittency, nontrivial phase correlations, and heavy-tailed pointwise statistics, hence constitutes a stringent test of whether covariance learning (a purely second-order geometric mechanism) can nevertheless accelerate sampling when combined with a learned drift.
We use two evaluation modes. In the fixed-resolution mode we compare sample quality as a function of TM at a fixed resolution, reporting energy spectra E(k), vorticity kurtosis, and MMD. In the super-resolution mode we train Fθ on low-resolution observations of u but sample at a higher resolution; this probes whether the learned objects remain meaningful in the continuum limit. We employ a low-pass smoother At at early noise scales to ensure vt lies in Hμt while preserving the large-scale structures that govern long-range dependence.
Empirically, learning Ct(φ) yields a clear reduction in the number of inner steps required to match the spectral statistics: relative to C⋆, the learned schedule reallocates variance across modes in a manner consistent with the observed energy cascade, which in turn permits larger stable step sizes ht without destabilizing the high-frequency components. This translates to wall-clock improvements because each inner step includes both an operator evaluation and a GRF draw; decreasing M reduces both costs simultaneously. We emphasize that the measured acceleration is not an artifact of reduced accuracy: for each method we select TM to meet a fixed tolerance on a jointly chosen metric set (spectrum error, MMD, and a low-order moment vector), and then compare runtimes at that tolerance.
Resolution invariance is assessed by training on one discretization (with Fourier truncation n) and sampling on multiple larger truncations n′. We observe that the learned Ct(φ) continues to provide the same qualitative spectral shaping across n′, and that the required M to achieve a fixed tolerance does not grow with n′ within the tested range. In contrast, when Ct is hand-scaled without learning, we see either (i) a need to retune the schedule as n′ increases or (ii) degradation in high-frequency statistics. These behaviors are consistent with the theoretical emphasis on continuum-defined operator inequalities and shared Cameron–Martin geometry.
Our third suite addresses conditional sampling for elliptic inverse problems, with the goal of demonstrating many-sample throughput when repeatedly conditioning on new data. We consider a log-permeability field u ∈ H and observations y = 𝒪(𝒮(u)) + ξ, where 𝒮 solves the Darcy PDE and 𝒪 extracts sparse pressure measurements. The target is the posterior μ(⋅ ∣ y), which is non-Gaussian and anisotropic, and whose geometry varies with y. We train a conditional drift Fθ(⋅, t; y) by augmenting the neural operator input with a representation of y and the observation locations, while still learning a covariance schedule Ct(φ) shared across conditions; this isolates the effect of learning a global preconditioner that improves mixing uniformly over y.
We compare against (i) pCN initialized from the prior, (ii) pCN with hand-tuned step size, and (iii) PL-DDO with fixed C⋆. We report posterior predictive RMSE on held-out sensor locations, coverage of pointwise credible intervals, and effective sample size per second for low-dimensional quantities of interest (e.g. average flux through a boundary). The primary outcome is that learning Ct(φ) reduces the number of inner Langevin steps needed to reach stable posterior summaries, which directly increases throughput when many posterior samples are required per observation instance. In addition, the learned Ct(φ) decreases sensitivity to discretization: training on a moderate mesh and sampling on refined meshes yields comparable posterior summaries without retuning, provided that the PDE solver and the GRF sampler are consistent with the continuum operators.
Finally, we perform an ablation in which we freeze φ after pretraining on unconditional
data and then finetune only θ
conditionally. This separates ``geometry learning’’ from ``score
fitting’’ and shows that a substantial fraction of the speedup persists,
indicating that the covariance schedule captures reusable anisotropy
common to the posterior family. Across all three suites, the aggregate
evidence supports the central experimental claims: (i) covariance
learning yields consistent step and wall-clock reductions at fixed
accuracy, (ii) these gains are maintained across discretizations, and
(iii) the method remains effective in conditional regimes where sampling
throughput dominates training cost.
Our central design choice is to learn a covariance schedule t ↦ Ct(φ) while keeping all definitions in the continuum Hilbert space H. This choice yields tangible acceleration in the regimes where preconditioning is known to control contraction, but it also introduces failure modes that do not arise when Ct is fixed in advance. We summarize the main pitfalls and the mechanisms we use to avoid them, then relate the method to function-space MCMC and adaptive preconditioning, and finally outline limitations and future directions.
In infinite dimensions, ``covariance learning’’ is not an innocuous parametric extension: changing Ct changes the reference Gaussian measure μt = 𝒩(0, Ct), and in general two Gaussian measures on H can be mutually singular. Since our objective is formulated through the score term DHμtΦt with Φt = log (dνt/dμt), we require νt ≪ μt and a common Cameron–Martin geometry across t. The uniform equivalence constraint
mC⋆ ≼ Ct(φ) ≼ MC⋆
is precisely the condition that prevents ``admissibility collapse’’: it ensures Im (Ct1/2) = Im (C⋆1/2) and rules out degenerate limits in which Ct loses rank in directions that matter to μ.
Without such a constraint, stochastic gradient descent can drive some
eigenvalues of Ct to 0 (to reduce injected noise and artificially
simplify the denoising task) or to +∞
(to overwhelm the data term), and either behavior can render μt incompatible
with νt in
the Feldman–Hájek sense. In practice we therefore treat admissibility
as a hard constraint (projection/clipping of spectra in a fixed basis)
augmented by a soft regularizer ℛ(φ) that discourages repeated
saturation at the bounds m and
M. The projection is not
merely numerical hygiene; it enforces the measure-theoretic
prerequisites under which the continuum objective is well-defined.
Even under admissibility, learning Ct and Fθ jointly can admit spurious optima. The denoising target ηt is drawn from 𝒩(0, Ct(φ)), so the distribution of training pairs (vt, ηt) changes with φ. This creates a feedback loop: the optimizer may alter Ct to generate ``easier’’ corruptions for the current Fθ, rather than covariances that are optimal for sampling. Two mitigations are important. First, we constrain the parameterization of Ct(φ) to be sufficiently low-dimensional and basis-consistent (e.g. diagonal or structured in a Fourier/KL basis), which limits the ability of Ct to fit idiosyncrasies of a finite dataset. Second, we interpret the learned schedule as a rather than a direct model of data covariance; accordingly, we evaluate it by downstream sampling performance at fixed accuracy rather than by training loss alone. This distinction matters because a lower denoising loss does not necessarily imply improved Langevin contraction if, for instance, Ct concentrates variance in directions where Fθ is inaccurate.
A more subtle pitfall is overfitting φ to the training discretization. If Ct(φ) is parameterized directly on a grid (e.g. as a dense matrix or pixel-wise kernel), then changing the resolution changes the parameterization itself, and the learned covariance can encode grid artifacts (aliasing, boundary effects, or anisotropies tied to coordinate axes). Such overfitting may even reduce the training objective while deteriorating continuum performance. Our continuum-first formulation is intended to preclude this failure mode: we choose Ct(φ) through operator rules that commute with refinement (Fourier multipliers, KL truncations consistent with a fixed basis, or SPDE-based covariances). Nevertheless, discretization overfitting can reappear if the chosen basis is not aligned with the domain geometry or if the implementation uses inconsistent quadrature/interpolation across grids. A practical diagnostic is to re-evaluate the learned schedule under multiple discretizations and regridding operators; a principled direction is to explicitly enforce discretization-consistency by constraining Ct(φ) to lie in a family with a well-defined continuum limit (e.g. pseudodifferential operators of fixed order, or Matérn-type SPDE covariances with learned coefficients).
From the viewpoint of function-space MCMC, our sampler is a learned, annealed analogue of preconditioned Langevin dynamics. When F is replaced by the exact gradient of a log-density and Ct ≡ C⋆, one recovers the standard preconditioned Langevin SPDE studied in the Da Prato–Zabczyk setting. The insistence on a fixed Cameron–Martin space mirrors the design of pCN and related dimension-robust methods: pCN achieves mesh-invariant behavior precisely because its proposal preserves a reference Gaussian measure and avoids discretization-dependent drift terms. Our approach departs from classical pCN in two ways. First, we use a learned drift Fθ to approximate a score, thereby enabling non-Gaussian targets without explicit likelihood evaluations. Second, we allow the reference covariance to vary with t, but only within a uniformly equivalent family to preserve the function-space geometry needed for well-posedness and stability.
There is also a clear connection to adaptive MCMC and quasi-Newton preconditioning, where one tunes a covariance online based on empirical chain statistics. In finite dimensions, covariance adaptation can improve mixing but must be controlled to preserve ergodicity (e.g. diminishing adaptation). In function spaces, naive covariance adaptation is particularly delicate because empirical covariances are typically finite-rank and therefore singular with respect to the ambient Gaussian reference. Our method can be viewed as an analogue: we learn Ct(φ) from data (and optionally from conditional contexts y) under explicit operator inequalities that preserve equivalence to C⋆. This replaces online adaptation with a constrained optimization problem whose output can be reused across many sampling runs, a key advantage in conditional regimes where per-instance throughput dominates. A hybrid direction is to allow mild online adjustment within the admissible family (e.g. updating a small number of spectral parameters) while enforcing a diminishing adaptation schedule and maintaining mC⋆ ≼ Ct ≼ MC⋆ at all times.
A fundamental limitation is that our reference measures μt are Gaussian. While learning Ct can reshape second-order geometry, it cannot represent genuinely non-Gaussian reference structure (e.g. skewness, heavy tails, or nonlinear constraints). In highly non-log-concave or strongly multimodal settings, covariance learning may yield limited gains because the bottleneck is not a condition number but rather energy barriers and metastability. One possible extension is to replace μt by a non-Gaussian reference obtained via a learned transport map Tt applied to a Gaussian base, leading to μt = (Tt)#𝒩(0, C⋆). In infinite dimensions, however, constructing Tt with sufficient regularity and ensuring absolute continuity properties (and tractable computation of the analogue of DHμtΦt) is nontrivial. Any such extension would require a careful replacement of Cameron–Martin calculus by an appropriate differential structure on the transported measure.
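To make the pushforward construction μt = (Tt)#𝒩(0, C⋆) concrete in a finite-dimensional surrogate, one can take a coordinatewise, strictly monotone map as a stand-in for Tt. The specific map below is purely illustrative (its derivative 1 + α·exp(z) is strictly positive, so it is invertible); it shows how sampling from a transported reference works, while the regularity and Cameron–Martin issues discussed above are precisely what this toy model sidesteps.

```python
import numpy as np

rng = np.random.default_rng(1)
c_star = np.arange(1, 65) ** -2.0        # diagonal surrogate for C_star

def transport_map(z, alpha=0.2):
    """Illustrative stand-in for T_t: strictly increasing coordinatewise
    (derivative 1 + alpha*exp(z) > 0), hence invertible, so the pushforward
    (T_t)#N(0, C_star) is a well-defined non-Gaussian (skewed) measure."""
    return z + alpha * np.expm1(z)

# Sampling from the transported reference: draw from the Gaussian base,
# then push each sample forward through the map.
z = np.sqrt(c_star) * rng.standard_normal((20000, 64))
u = transport_map(z)
```

Even this simple map produces positive skewness in every mode, i.e. second-order (covariance) learning alone could not represent the resulting reference.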
Many targets of interest live on subsets of H or satisfy hard constraints: divergence-free vector fields, positivity constraints, unit-norm constraints, or conservation laws. A Gaussian reference on H is mismatched to such geometry, and unconstrained Langevin updates can violate constraints unless corrected. One direction is to incorporate constraints through projections (projected Langevin) or by formulating dynamics on a manifold (Riemannian Langevin in function spaces), where the preconditioner becomes a position-dependent metric. Learning a covariance Ct(u) would considerably increase expressivity but also raises substantial well-posedness questions for the resulting multiplicative-noise SPDE and for discretization-invariant implementation.
Beyond the extensions above, several theoretical and practical questions remain. On the theory side, it is natural to seek guarantees under weaker conditions than global Lipschitz/dissipativity, as many score models are only locally regular and many data measures induce drifts with polynomial growth. It is also of interest to quantify how covariance learning interacts with drift approximation error δ: while preconditioning can improve contraction, it may also amplify directions where Fθ is inaccurate, suggesting that joint learning should balance mixing proxies with approximation stability. On the practical side, one may enforce additional structure on t ↦ Ct(φ) (e.g. monotonicity of certain spectral quantities) to stabilize training, and one may co-design At with Ct when μ(Hμt) < 1 is expected. Finally, connecting covariance learning to bridge-based objectives (e.g. Schrödinger bridges) may yield alternative training criteria in which the role of Ct is explicit as an entropic regularization parameter, potentially clarifying when and why learned preconditioning provides the most benefit.
In summary, covariance learning is powerful precisely because it alters the geometry of the sampling problem; for the same reason it must be constrained at the level of operators and measures. The uniform-equivalence framework provides a principled safety rail: it prevents singularity, supports discretization-invariant implementation, and places learned preconditioning on the same function-space footing as classical dimension-robust MCMC, while still leaving room for meaningful acceleration within an admissible family.