Many contemporary inverse problems, generative models for physical fields, and Bayesian formulations of PDE-constrained learning lead naturally to probability measures on infinite-dimensional state spaces. In such settings the object of interest is not a vector in ℝd but a function u taking values in a separable Hilbert space H. Any implementable algorithm necessarily produces a finite-dimensional approximation u(n) ∈ Hn, obtained for instance by truncating a Fourier series or by using a grid-based discretization. A recurring practical observation is that some learned samplers—especially those based on neural operators and denoising objectives—appear to behave stably as the resolution n is refined: one trains at a certain discretization and samples at another, with only mild degradation. Our goal is to formulate and prove a precise form of this in a setting where both the analysis and the algorithm are naturally expressed at the function-space level.
The central difficulty is that the usual convergence theory for Markov chain Monte Carlo (MCMC) algorithms is typically dimension dependent. Even when one proves geometric ergodicity in finite dimensions, the constants often deteriorate with d, and naive limits as d → ∞ can be misleading. In the function-space context one must distinguish two sources of instability. First, the measure may become increasingly ill-conditioned under discretization (e.g., if the potential introduces high-frequency stiffness). Second, the algorithm may inject discretization artifacts (e.g., by using noise models that are not H-valued, or by training objectives whose risk necessarily grows with resolution). If either phenomenon occurs, uniform performance across n is impossible, and any apparent empirical stability must be explained by additional structure.
The present work is motivated by denoising diffusion operators (DDOs)
and related score-based constructions in which a learned operator
approximates a score or drift acting on functions. Empirically, such
models are often trained from samples corrupted by Gaussian fields and
then used to drive Langevin-type dynamics. While one can justify
individual components of this pipeline (e.g., consistency of denoising
estimators under suitable conditions), a complete statement of the
form
sampling error ≤ mixing error + learning error + time-discretization error,
with constants independent of spatial resolution, has been missing. This
gap is not merely technical: without an explicit function-space
statement, one cannot disentangle whether a method is robust because it
genuinely approximates a well-posed infinite-dimensional algorithm, or
because discretization-dependent effects happen to cancel at the tested
resolutions.
We therefore adopt a continuum-first viewpoint. We formulate the target distribution ν on H through a Gaussian reference measure μ0 = 𝒩(0, C) and a potential Φ, and we study the preconditioned Langevin dynamics whose invariant measure is ν. The algorithm we ultimately implement is Euler–Maruyama applied to a learned drift F̂θ, with additive trace-class Gaussian noise. The analysis proceeds in three steps: (i) contraction and mixing of the exact continuum Langevin semigroup in W2; (ii) stability of the law with respect to drift perturbations, quantified by an L2(ν; H) error level δ; and (iii) a W2 control of the Euler–Maruyama discretization error, scaling as h1/2 under minimal assumptions. The key point is that each step admits bounds whose constants are determined by function-space quantities (monotonicity and Lipschitz constants in the Cameron–Martin geometry, and Tr(C)), and hence remain uniform over Galerkin projections that commute with the covariance.
The scope of the theory is deliberately focused. We work in a strongly log-concave regime, where the relevant monotonicity and Lipschitz constants are uniform and where trace-class noise ensures that the driving Gaussian perturbations are H-valued. These hypotheses are restrictive relative to the most challenging diffusion-model applications, but they delineate a regime in which resolution-invariant guarantees are mathematically natural and provably attainable. They also clarify failure modes: if the corruption or driving noise is not trace class (e.g., effectively white in the discretization limit), then the learned drift error δ cannot, in general, remain bounded as n → ∞, and dimension-free sampling bounds are not available without additional regularity.
The remainder of the paper develops the necessary background and then formalizes the three-step argument above. In the next section we recall Gaussian reference measures and the Cameron–Martin structure, describe how scores appear as logarithmic derivatives, and explain how DDO denoising objectives lead to operator-valued score approximations compatible with the function-space Langevin dynamics.
We recall the objects that allow one to speak meaningfully about
"scores" and Langevin drifts on an infinite-dimensional Hilbert space.
Throughout, μ0 = 𝒩(0, C) is
a centered Gaussian measure on H with C self-adjoint, strictly positive,
and trace-class. The trace-class assumption implies that μ0(H) = 1 (in
particular, a μ0-distributed random
element is H-valued), and it
ensures that the Cameron–Martin space
Hμ0 := Im(C1/2), ∥h∥Hμ0 := ∥C−1/2h∥H,
is continuously embedded in H.
The Cameron–Martin theorem characterizes the quasi-invariance of μ0 under translations by
h ∈ Hμ0,
and hence provides the correct geometric notion of differentiability for
densities built from μ0.
A basic calculus tool is the Gaussian integration-by-parts formula
along Cameron–Martin directions. If f is a sufficiently smooth
cylindrical function on H and
h ∈ Hμ0,
then
$$
\int_H \partial_h f(u)\,\mu_0(du)
\;=\;
\int_H f(u)\,\langle C^{-1}h,\,u\rangle_H\,\mu_0(du),
$$
where ∂hf(u)
denotes the directional derivative along h and ⟨C−1h, u⟩H is understood as
the measurable linear functional associated with h ∈ Hμ0. In particular, this formula identifies the
logarithmic derivative of μ0 along Hμ0
as the map u ↦ −C−1u,
understood in the weak sense encoded by the identity above. This provides the
infinite-dimensional analogue of the finite-dimensional identity ∇log 𝒩(0, C)(u) = −C−1u.
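As a concrete finite-dimensional illustration (truncating H to ℝn purely for exposition; all numerical values below are illustrative), the following sketch checks the identity ∇log 𝒩(0, C)(u) = −C−1u against a central finite difference:

import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
C = A @ A.T + n * np.eye(n)            # symmetric positive-definite covariance
Cinv = np.linalg.inv(C)

def log_density(u):
    # log of the unnormalized Gaussian density exp(-<u, C^{-1} u>/2)
    return -0.5 * u @ Cinv @ u

u = rng.standard_normal(n)
eps = 1e-6
fd_grad = np.array([(log_density(u + eps * e) - log_density(u - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
assert np.allclose(fd_grad, -Cinv @ u, atol=1e-6)   # score of N(0, C) is -C^{-1} u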
Now let ν ≪ μ0 be given
by
ν(du) = Z−1exp ( − Φ(u)) μ0(du), Z := ∫He−Φ dμ0,
where Φ is differentiable
along Hμ0.
Differentiating the density along Cameron–Martin directions and
combining with the Gaussian integration-by-parts formula yields the analogous identity under ν:
$$
\int_H \partial_h f(u)\,\nu(du)
\;=\;
\int_H f(u)\,\bigl(\langle C^{-1}h,\,u\rangle_H + \partial_h\Phi(u)\bigr)\,\nu(du).
$$
Thus the logarithmic derivative of ν relative to the Gaussian reference μ0 is the Hμ0-valued
map
$$
u \longmapsto -\nabla_{H_{\mu_0}}\Phi(u)
=
\nabla_{H_{\mu_0}}\log\frac{d\nu}{d\mu_0}(u),
$$
whereas the full logarithmic derivative of ν in the sense of the identity above contains the
additional term C−1u inherited
from μ0. This
distinction is convenient: in function space, there is no Lebesgue
reference, and the object that is stable under discretization is
typically the μ0-relative score ∇Hμ0log (dν/dμ0).
The preconditioned Langevin SPDE is the diffusion on H whose generator encodes these logarithmic derivatives and for
which ν is invariant.
Writing
F(u) := −(u + C∇Hμ0Φ(u)),
the dynamics read
$$
du_t = F(u_t)\,dt + \sqrt{2}\,dW_t,
$$
where Wt
is a C-Wiener process.
Heuristically, F consists of
the Ornstein–Uhlenbeck drift −u associated with μ0 and the additional
term −C∇Hμ0Φ
that accounts for the potential. Equivalently,
$$
F(u) = -u + C\,\nabla_{H_{\mu_0}}\log\frac{d\nu}{d\mu_0}(u),
$$
so the drift is obtained by applying the preconditioner C to the μ0-relative score and
then adding the stabilizing linear term −u. Under standard dissipativity and
regularity conditions, one verifies invariance of ν by checking that the corresponding
Kolmogorov operator is symmetric (or at least has ν as a stationary measure) on
cylindrical test functions; the integration-by-parts formula under ν is precisely the identity
needed for this computation.
We next summarize how denoising diffusion operators (DDOs) produce
approximations of these score objects in a manner compatible with the
function-space setting. Fix a noise level τ > 0, and consider the
corruption model
$$
y = u + \sqrt{\tau}\,\xi,
\qquad
u\sim \nu,\ \ \xi\sim \mathcal{N}(0,C)\ \text{independent}.
$$
Since C is trace-class, ξ ∈ H almost surely and
hence y ∈ H almost
surely. A DDO is an operator-valued estimator Gθ( ⋅ , τ) : H → H
trained by the denoising objective
minθ 𝔼[∥Gθ(y, τ) − u∥H2],
or a variant in which the loss is measured in a Cameron–Martin norm. The
minimizer of this objective (over all measurable maps) is the
conditional mean G*(y, τ) = 𝔼[u ∣ y].
Under suitable regularity, this conditional mean determines the score of
the corrupted law via an infinite-dimensional Tweedie-type identity: the
difference G*(y, τ) − y
is a smoothed version of a gradient of the log-density of the corrupted
measure, and applying C−1 transfers this
difference into the Cameron–Martin space. Concretely, one may define a
score surrogate by
ŝθ(y, τ) := τ−1C−1(Gθ(y, τ) − y) ∈ Hμ0,
which, when Gθ ≈ G*,
approximates an appropriate logarithmic derivative of the corrupted law
along Hμ0.
In the small-noise limit (or in a multi-noise training scheme), such
estimators are used to approximate ∇Hμ0log (dν/dμ0) = −∇Hμ0Φ,
and hence to construct a drift approximation of the form
F̂θ(u) = −u + C ŝθ(u),
which matches the structure of the continuum drift F. The subsequent analysis treats
F̂θ
abstractly as a measurable approximation of F, with its quality summarized by an
L2(ν; H)
error level; this is the point of contact between statistical training
guarantees for DDOs and stability properties of Langevin sampling in
function space.
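To fix ideas, here is a minimal sketch of the corruption model, the denoising objective, and the score surrogate in a basis diagonalizing C. The toy data law is taken to be ν = μ0 (an assumption made so that the exact conditional mean G* is available in closed form); names such as sample_C_noise and s_surrogate are illustrative:

import numpy as np

rng = np.random.default_rng(1)
n, tau = 64, 0.1
lam = 1.0 / np.arange(1, n + 1) ** 2          # trace-class spectrum of C

def sample_C_noise(batch):
    return np.sqrt(lam) * rng.standard_normal((batch, n))   # draws from N(0, C)

u = sample_C_noise(512)                        # toy data u ~ nu (here nu = mu_0)
y = u + np.sqrt(tau) * sample_C_noise(512)     # corruption y = u + sqrt(tau) xi

def G_star(y, tau):
    # exact conditional mean E[u | y] for this toy nu: here y ~ N(0, (1 + tau) C)
    return y / (1.0 + tau)

denoising_loss = np.mean(np.sum((G_star(y, tau) - u) ** 2, axis=1))

def s_surrogate(y, tau):
    # score surrogate tau^{-1} C^{-1} (G(y, tau) - y); C^{-1} acts diagonally
    return (G_star(y, tau) - y) / (tau * lam)

# sanity check of the Tweedie-type identity in this toy case: the surrogate
# equals -C^{-1} y / (1 + tau), the full score of the corrupted law N(0, (1+tau) C)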
We study sampling from a target probability measure ν on H given in Radon–Nikodym form with
respect to the Gaussian reference μ0 = 𝒩(0, C),
ν(du) = Z−1exp ( − Φ(u)) μ0(du), Z := ∫He−Φ dμ0,
where Φ : H → ℝ is
differentiable along the Cameron–Martin space Hμ0 = Im(C1/2).
The function-space geometry is governed by Hμ0
and by the preconditioner C;
in particular, the drift associated with the preconditioned Langevin
dynamics is
F(u) = −(u + C∇Hμ0Φ(u)).
Our target class is the strongly log-concave regime in the Gaussian base
geometry: we assume that ∇Hμ0Φ
is globally L-Lipschitz and
m-strongly monotone in Hμ0,
i.e.,
∥∇Hμ0Φ(u) − ∇Hμ0Φ(v)∥Hμ0 ≤ L∥u − v∥Hμ0,
and
⟨∇Hμ0Φ(u) − ∇Hμ0Φ(v), u − v⟩Hμ0 ≥ m∥u − v∥Hμ02, u, v ∈ Hμ0.
These conditions are the function-space analogue of uniform convexity
and smoothness; they imply well-posedness of the continuum Langevin SPDE
and yield quantitative contraction of its Markov semigroup in transport
metrics. We emphasize that the constants m, L are assumed to be
intrinsic to the model and, in particular, do not depend on any spatial
discretization.
Our sampling algorithm will not typically access F exactly. Instead, we are given a
measurable approximation F̂θ : H → H
(e.g. induced by a learned score), whose quality is summarized by an
L2(ν; H)
error level
$$
\delta^2 \;:=\; \mathbb{E}_{u\sim\nu}\bigl\|\widehat{F}_\theta(u)-F(u)\bigr\|_H^2.
$$
We will treat δ as an
exogenous parameter that captures learning and model-mismatch errors;
the analysis below propagates δ through the sampling dynamics in a
stable way.
The notion of "closeness to ν" we use is the 2-Wasserstein distance on H. For Borel probability measures
ρ, ρ′ ∈ 𝒫2(H)
with finite second moments, we define
W2(ρ, ρ′)2 := infπ ∈ Π(ρ, ρ′)∫H × H∥x − y∥H2 π(dx, dy),
where Π(ρ, ρ′)
denotes the set of couplings of ρ and ρ′. Since H is separable, the infimum is
attained, and W2
metrizes weak convergence plus convergence of second moments on 𝒫2(H). The choice of
W2 is dictated by
two features that we will exploit repeatedly: first, it is compatible
with synchronous couplings of Langevin dynamics; second, it behaves well
under Lipschitz maps. In particular, if T : H → H is 1-Lipschitz, then
$$
W_2\bigl(T_\sharp\rho,\,T_\sharp\rho'\bigr) \;\le\; W_2(\rho,\rho'),
$$
a fact we will use with T = Pn
a projection.
We implement the sampler by Euler–Maruyama time stepping. Given h > 0 and T = Nh, we
consider
$$
u_{k+1} \;=\; u_k + h\,\widehat{F}_\theta(u_k) + \sqrt{2h}\,\xi_k,
\qquad \xi_k \stackrel{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0,C),
$$
with initialization u0 ∈ H (possibly
random). The trace-class assumption on C ensures that ξk ∈ H
almost surely, so the update defines an H-valued Markov chain under mild
growth conditions on F̂θ (in our
setting we will assume global Lipschitzness/dissipativity as needed for
the discretization estimates). Our primary output is the law ℒ(uN), and the
goal is to bound W2(ℒ(uN), ν)
in terms of the runtime T, the
step size h, and the drift
error δ.
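A minimal sketch of this chain in a basis diagonalizing C, with Φ = 0 so that the drift reduces to the Ornstein–Uhlenbeck case (an assumption for illustration only; F_hat stands in for the learned drift):

import numpy as np

rng = np.random.default_rng(2)
n, h, N = 128, 1e-2, 2000
lam = 1.0 / np.arange(1, n + 1) ** 2          # eigenvalues of C (trace class)

def F_hat(u):
    # drift -(u + C grad Phi(u)); with Phi = 0 this is the OU drift toward mu_0
    return -u

u = np.zeros(n)
for _ in range(N):
    xi = np.sqrt(lam) * rng.standard_normal(n)   # xi_k ~ N(0, C), H-valued
    u = u + h * F_hat(u) + np.sqrt(2 * h) * xi
# the invariant law here is (a small-h perturbation of) mu_0 = N(0, C)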
To connect the continuum-first analysis to practical discretizations,
we introduce a family of finite-dimensional subspaces {Hn}n ≥ 1
with orthogonal projections Pn : H → Hn.
We assume that Pn is
nonexpansive, ∥Pnx∥H ≤ ∥x∥H,
and that the discretization is consistent with the Gaussian structure in
the sense that
$$
P_n C \;=\; C P_n.
$$
This commutation is natural when Pn truncates in
an eigenbasis of C
(e.g. Fourier/Karhunen–Loève truncations), and it ensures that Pnξ ∼ 𝒩(0, PnCPn)
and that projected C-Wiener
processes are PnCPn-Wiener
processes on Hn.
The corresponding projected Euler–Maruyama chain is
$$
u^{(n)}_{k+1} \;=\; u^{(n)}_k + h\,P_n\widehat{F}_\theta\bigl(u^{(n)}_k\bigr) + \sqrt{2h}\,P_n\xi_k.
$$
By construction, this chain evolves in Hn. The natural
finite-dimensional target is the pushforward Pn♯ν,
i.e. the law of Pnu
when u ∼ ν. The
stability of the learning error under projection follows immediately
from nonexpansiveness:
𝔼u ∼ ν[∥PnF̂θ(u) − PnF(u)∥H2] ≤ 𝔼u ∼ ν[∥F̂θ(u) − F(u)∥H2] ≤ δ2.
We now formalize what we mean by a resolution-invariant sampling
bound. For each n, the
algorithm induces a law ℒ(uN(n)) ∈ 𝒫2(Hn),
which we view as a measure on H supported on Hn. A bound is
resolution-invariant if it takes the form
$$
W_2\bigl(\mathcal{L}(u_N^{(n)}),\,P_{n\sharp}\nu\bigr)
\;\le\;
\mathrm{Mix}(T) + \mathrm{Learn}(\delta) + \mathrm{Disc}(h),
$$
where the functions Mix, Learn, Disc
and all constants implicit in them depend only on intrinsic model
parameters (e.g. m, L, Tr(C) and
regularity of F̂θ) and not on n. In particular, we seek bounds of
the type
Mix(T) = e−κTW2(ℒ(u0), ν), Learn(δ) = Kδ, Disc(h) = Kh1/2,
with κ, K independent
of resolution. The next section establishes the key ingredient behind
Mix(T): a W2-contraction estimate
for the exact continuum Langevin dynamics, from which mixing-time bounds
follow without reference to any discretization.
We consider the continuum preconditioned Langevin dynamics on H,
driven by a C-Wiener process
Wt. Under
the standing Lipschitz and monotonicity assumptions on ∇Hμ0Φ,
the SPDE is well posed and defines a time-homogeneous Markov process
{ut}t ≥ 0
on H. We denote by utu
the solution started at u0u = u ∈ H.
The associated Markov semigroup {Pt}t ≥ 0
acts on bounded measurable test functions φ : H → ℝ by
(Ptφ)(u) := 𝔼[φ(utu)],
and, by duality, on probability measures ρ ∈ 𝒫2(H) via
ρPt := ℒ(utU0)
when U0 ∼ ρ. The
measure ν is invariant for this dynamics
(indeed, the SPDE is the standard μ0-preconditioned
Langevin diffusion targeting ν); we will use invariance only as
an identity νPt = ν
and not rely on any explicit formula for the transition kernels.
The key quantitative property we need is a contraction estimate for
the semigroup in W2. This follows from a
synchronous coupling argument. Fix u, v ∈ H, and let
ut := utu
and vt := utv
solve the SPDE driven by the same Wiener path Wt. Defining the
difference et := ut − vt,
we observe that the noise cancels and hence
$$
\dot e_t \;=\; F(u_t) - F(v_t).
$$
We estimate the time derivative of ∥et∥H2:
by the chain rule,
$$
\frac{d}{dt}\|e_t\|_H^2 \;=\; 2\,\bigl\langle e_t,\,F(u_t)-F(v_t)\bigr\rangle_H.
$$
The drift F is one-sided
dissipative in the H-inner
product. Indeed, for any x, y ∈ H such that
∇Hμ0Φ(x), ∇Hμ0Φ(y) ∈ Hμ0,
we compute
$$
\bigl\langle F(x)-F(y),\,x-y\bigr\rangle_H
\;=\;
-\|x-y\|_H^2
-\bigl\langle C\bigl(\nabla_{H_{\mu_0}}\Phi(x)-\nabla_{H_{\mu_0}}\Phi(y)\bigr),\,x-y\bigr\rangle_H
\;\le\;
-\|x-y\|_H^2,
$$
the potential term having a favorable sign by the monotonicity of ∇Hμ0Φ. Inserting this bound into the derivative identity yields
$$
\frac{d}{dt}\|e_t\|_H^2 \le -2\|e_t\|_H^2,
$$
and therefore, by Grönwall's inequality,
$$
\|e_t\|_H^2 \;\le\; e^{-2t}\,\|e_0\|_H^2.
$$
In particular, taking expectations (trivial here since the bound is pathwise)
shows that the synchronous coupling contracts second moments at rate
e−2t.
We translate this pathwise contraction into a W2-contraction bound for
the semigroup. Let ρ, ρ′ ∈ 𝒫2(H)
and let (U0, V0)
be any coupling with ℒ(U0) = ρ and
ℒ(V0) = ρ′.
Drive the coupled dynamics from (U0, V0)
using the same Wiener process, producing (Ut, Vt).
Then ℒ(Ut) = ρPt
and ℒ(Vt) = ρ′Pt,
and (Ut, Vt)
is a coupling of these laws. Hence,
W2(ρPt, ρ′Pt)2 ≤ 𝔼[∥Ut − Vt∥H2] ≤ e−2t 𝔼[∥U0 − V0∥H2].
Taking the infimum over all initial couplings (U0, V0)
gives the contraction estimate
$$
W_2(\rho P_t,\,\rho' P_t) \;\le\; e^{-t}\,W_2(\rho,\rho').
$$
We emphasize that this rate is intrinsic to the continuum drift
structure and does not depend on any spatial discretization parameter.
Moreover, the contraction implies uniqueness of an invariant measure in 𝒫2(H) whenever such a
measure exists; in our setting ν is invariant, so it is the unique
invariant measure with finite second moment.
Finally, choosing ρ′ = ν and using
νPt = ν
yields the mixing estimate
$$
W_2(\mathcal{L}(u_t),\,\nu) \;\le\; e^{-t}\,W_2(\mathcal{L}(u_0),\,\nu).
$$
Equivalently, for any ε > 0, it suffices to take t ≥ log (W2(ℒ(u0), ν)/ε)
to ensure W2(ℒ(ut), ν) ≤ ε.
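The synchronous-coupling mechanism behind these estimates is easy to exercise numerically. The following sketch, with an exactly dissipative toy drift (an assumption for illustration), estimates an effective contraction rate from the decay of the coupled gap:

import numpy as np

rng = np.random.default_rng(3)
n, h, N = 64, 1e-2, 500
lam = 1.0 / np.arange(1, n + 1) ** 2

def F(u):
    return -u                                  # toy one-sided dissipative drift

u, v = rng.standard_normal(n), -rng.standard_normal(n)   # distinct initializations
gaps = []
for _ in range(N):
    xi = np.sqrt(lam) * rng.standard_normal(n)  # shared noise: it cancels in u - v
    u = u + h * F(u) + np.sqrt(2 * h) * xi
    v = v + h * F(v) + np.sqrt(2 * h) * xi
    gaps.append(np.linalg.norm(u - v))
kappa_hat = -np.log(gaps[-1] / gaps[0]) / (N * h)   # should be close to 1 here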
This contraction/mixing estimate provides the Mix(T) term in the end-to-end bound. In the next section
we quantify how this picture degrades when the drift F is replaced by a learned
approximation F̂θ, leading to
an additive error contribution controlled by ∥F̂θ − F∥L2(ν; H).
We next quantify the effect of replacing the true drift F in the Langevin SPDE by a learned approximation
F̂θ. At the
continuum level we consider the perturbed dynamics
$$
d\widehat{u}_t \;=\; \widehat{F}_\theta(\widehat{u}_t)\,dt + \sqrt{2}\,dW_t,
$$
driven by the C-Wiener process
Wt. We
assume throughout that F̂θ : H → H
is globally Lipschitz (so the perturbed dynamics are well posed) and that its pointwise
discrepancy from F is
controlled in the target measure,
$$
\mathbb{E}_{u\sim\nu}\bigl\|\widehat{F}_\theta(u)-F(u)\bigr\|_H^2 \;\le\; \delta^2.
$$
The goal is an additive W2-error term
proportional to δ, uniform in
time after mixing.
The basic estimate is obtained by synchronous coupling and an energy
inequality for the error process. Let (ut)t ≥ 0
solve the exact dynamics and (ût)t ≥ 0
solve the perturbed dynamics, started from a coupling (u0, û0)
and driven by the same Wt. Define et := ût − ut.
Then the noise cancels and
ėt = F̂θ(ût) − F(ut) = (F(ût) − F(ut)) + (F̂θ(ût) − F(ût)).
Using the chain rule and the one-sided dissipativity of F,
$$
\frac{d}{dt}\|e_t\|_H^2
\;\le\;
-2\|e_t\|_H^2 + 2\,\bigl\langle e_t,\,\widehat{F}_\theta(\widehat{u}_t)-F(\widehat{u}_t)\bigr\rangle_H
\;\le\;
-\|e_t\|_H^2 + \bigl\|\widehat{F}_\theta(\widehat{u}_t)-F(\widehat{u}_t)\bigr\|_H^2,
$$
where the last step is Young’s inequality. Grönwall's inequality yields, for all
t ≥ 0,
$$
\mathbb{E}\|e_t\|_H^2
\;\le\;
e^{-t}\,\mathbb{E}\|e_0\|_H^2
+\int_0^t e^{-(t-s)}\,\mathbb{E}\bigl\|\widehat{F}_\theta(\widehat{u}_s)-F(\widehat{u}_s)\bigr\|_H^2\,ds.
$$
To turn this into a closed W2-bound, we relate the
integrand to the error level δ. In the strongly log-concave regime, ν has finite second moments with
constants depending only on (m, L, Tr(C)), and
by the contraction estimate the law of ûs remains close
to ν once s is moderately large. Under the
additional mild regularity that the discrepancy map G := F̂θ − F
is globally Lipschitz, one can bound
$$
\mathbb{E}\bigl\|G(\widehat{u}_s)\bigr\|_H^2
\;\le\;
2\delta^2 + K_0\,W_2\bigl(\mathcal{L}(\widehat{u}_s),\,\nu\bigr)^2,
$$
where K0 depends on
the Lipschitz constant of G
and on ν-moment bounds, hence
ultimately on (m, L, Tr(C)) and
the (dimension-free) Lipschitz controls imposed on F̂θ. Combining these estimates
with the definition of W2 via couplings, we
obtain a stability estimate of the form
$$
W_2\bigl(\mathcal{L}(\widehat{u}_t),\,\nu\bigr)
\;\le\;
e^{-\kappa t}\,W_2\bigl(\mathcal{L}(\widehat{u}_0),\,\nu\bigr) + K_1\,\delta,
$$
with constants κ > 0 and
K1 < ∞
independent of any spatial discretization. In particular, if ν̂θ ∈ 𝒫2(H)
denotes an invariant measure (when it exists) for the perturbed dynamics, then setting û0 ∼ ν̂θ
and using invariance in the stability estimate (letting t → ∞) gives
$$
W_2(\widehat{\nu}_\theta,\,\nu) \;\le\; K_1\,\delta.
$$
Thus, the learned drift incurs an irreducible bias controlled by the
L2(ν; H)
drift error.
Finally, we connect the error level δ to
score approximation error in denoising diffusion operators. In the μ0-preconditioned
Langevin setting the drift is F(u) = −(u + C∇Hμ0Φ(u)).
If a learned score model provides sθ(u) ≈ ∇Hμ0Φ(u)
and we set
F̂θ(u) := −(u + Csθ(u)),
then G(u) = F̂θ(u) − F(u) = −C(sθ(u) − ∇Hμ0Φ(u))
and hence
$$
\delta \;\le\; \|C\|_{\mathcal{L}(H)}\,\bigl\|s_\theta-\nabla_{H_{\mu_0}}\Phi\bigr\|_{L^2(\nu;H)}.
$$
Moreover, using the continuous embedding ∥h∥H ≤ ∥C1/2∥ℒ(H)∥h∥Hμ0
for h ∈ Hμ0,
we also have
δ ≤ ∥C∥ℒ(H) ∥C1/2∥ℒ(H) ∥sθ − ∇Hμ0Φ∥L2(ν; Hμ0).
Consequently, any DDO training guarantee yielding an L2(ν) score
error in either H or Hμ0
translates directly into the perturbation term K1δ in the stability estimate, and
therefore into the overall end-to-end W2 bound once
discretization error is added in the next section.
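Before adding discretization error, we note that δ is directly estimable by Monte Carlo through the identity G = −C(sθ − ∇Hμ0Φ). A minimal sketch under diagonal toy assumptions (grad_Phi, s_theta, and the surrogate draws from ν are illustrative stand-ins):

import numpy as np

rng = np.random.default_rng(4)
n, M = 64, 4096
lam = 1.0 / np.arange(1, n + 1) ** 2

def grad_Phi(u):
    return u                                   # toy Cameron-Martin gradient

def s_theta(u):
    return 1.05 * u                            # learned score with 5% relative error

u = np.sqrt(lam) * rng.standard_normal((M, n))    # surrogate draws u ~ nu (toy)
err = lam * (s_theta(u) - grad_Phi(u))            # C (s_theta - grad Phi), C diagonal
delta_hat = np.sqrt(np.mean(np.sum(err ** 2, axis=1)))   # Monte Carlo estimate of delta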
We now control the bias introduced by replacing the continuous-time
perturbed dynamics by its Euler–Maruyama discretization. Throughout this
section we fix a globally Lipschitz drift field F̂θ : H → H
with Lipschitz constant Lip(F̂θ), and we
assume a one-sided dissipativity condition ensuring uniform moment
bounds for both the SPDE and the chain: there exist α > 0 and b ≥ 0 such that
$$
\bigl\langle \widehat{F}_\theta(u),\,u\bigr\rangle_H \;\le\; -\alpha\|u\|_H^2 + b,
\qquad u \in H.
$$
(Under the strongly log-concave hypotheses on the target drift F, and for drift approximations
F̂θ that
preserve the same coercive structure up to a controlled perturbation, this condition is
natural; we only use it here to keep constants from growing in
time.)
Let ût
solve the perturbed dynamics with initial condition û0 ∈ L2(Ω; H).
Fix h ∈ (0, h0],
tk := kh,
T = Nh. The
Euler–Maruyama chain is
$$
u_{k+1} \;=\; u_k + h\,\widehat{F}_\theta(u_k) + \sqrt{2}\,\bigl(W_{t_{k+1}}-W_{t_k}\bigr),
$$
which is equivalent to the update with $\sqrt{2h}\,\xi_k$ and ξk ∼ 𝒩(0, C).
We couple {uk}k ≥ 0
and {ût}t ≥ 0
by taking the same C-Wiener
process Wt
in the SPDE and in the chain, and by setting u0 = û0.
This coupling directly yields a strong error estimate, and hence a W2-bound via the standard
inequality W2(ℒ(X), ℒ(Y))2 ≤ 𝔼∥X − Y∥H2.
To compare ûtk
and uk, we
write the exact increment of the SPDE on [tk, tk + 1]:
$$
\widehat{u}_{t_{k+1}}
=\widehat{u}_{t_k} +
\int_{t_k}^{t_{k+1}}\widehat{F}_\theta(\widehat{u}_s)\,ds
+ \sqrt{2}\,(W_{t_{k+1}}-W_{t_k}).
$$
Subtracting gives the recursion for the error ek := uk − ûtk:
$$
e_{k+1} \;=\; e_k + h\bigl(\widehat{F}_\theta(u_k)-\widehat{F}_\theta(\widehat{u}_{t_k})\bigr) + r_k,
\qquad
r_k := \int_{t_k}^{t_{k+1}} \bigl(\widehat{F}_\theta(\widehat{u}_{t_k})-\widehat{F}_\theta(\widehat{u}_s)\bigr)\,ds.
$$
The remainder term rk is controlled
by Lipschitz continuity:
$$
\mathbb{E}\|r_k\|_H^2
\;\le\;
\mathrm{Lip}(\widehat{F}_\theta)^2\,h \int_{t_k}^{t_{k+1}} \mathbb{E}\bigl\|\widehat{u}_s-\widehat{u}_{t_k}\bigr\|_H^2\,ds.
$$
We next bound 𝔼∥ûs − ûtk∥H2
for s ∈ [tk, tk + 1].
By the Itô isometry and (a + b)2 ≤ 2a2 + 2b2,
$$
\widehat{u}_s-\widehat{u}_{t_k}
=\int_{t_k}^{s}\widehat{F}_\theta(\widehat{u}_\tau)\,d\tau +
\sqrt{2}\,(W_s-W_{t_k}),
$$
hence
$$
\mathbb{E}\bigl\|\widehat{u}_s-\widehat{u}_{t_k}\bigr\|_H^2
\;\le\;
2\,(s-t_k)\int_{t_k}^{s}\mathbb{E}\bigl\|\widehat{F}_\theta(\widehat{u}_\tau)\bigr\|_H^2\,d\tau
+4\,(s-t_k)\,\mathrm{Tr}(C).
$$
Since Wt
is C-Wiener, 𝔼∥Ws − Wtk∥H2 = (s − tk) Tr(C).
Moreover, under the dissipativity condition and global Lipschitzness, standard Lyapunov estimates
yield a uniform-in-time bound supt ∈ [0, T]𝔼∥ût∥H2 ≤ MT
and thus supt ∈ [0, T]𝔼∥F̂θ(ût)∥H2 ≤ M̃T,
with M̃T
depending on Lip(F̂θ), (α, b), Tr(C), and 𝔼∥û0∥H2,
but not on any spatial resolution. Inserting these bounds into the preceding display shows
𝔼∥ûs − ûtk∥H2 ≤ c1(s − tk)
for s ∈ [tk, tk + 1],
where c1 depends on
(M̃T, Tr(C)).
Combining this with the remainder estimate yields
$$
\mathbb{E}\|r_k\|_H^2 \;\le\; c_2\,h^3,
$$
with c2 depending only on Lip(F̂θ) and c1.
We return to the error recursion. Squaring and taking expectations, using (a + b + c)2 ≤ (1 + η)a2 + (1 + η)b2 + (1 + 1/η)c2
and Lipschitzness of F̂θ, we obtain
for h ≤ h0
small enough (depending on Lip(F̂θ)) a
discrete Grönwall inequality of the form
$$
\mathbb{E}\|e_{k+1}\|_H^2 \;\le\; (1-\gamma h)\,\mathbb{E}\|e_k\|_H^2 + c_3\,h^2,
$$
with γ > 0 depending on the
dissipativity in (or, more generally, on a one-sided Lipschitz constant
for F̂θ).
Iterating this recursion yields 𝔼∥eN∥H2 ≤ KTh
for a constant KT < ∞
depending on Lip(F̂θ), (α, b), Tr(C), and T. Consequently,
$$
W_2\bigl(\mathcal{L}(u_N),\,\mathcal{L}(\widehat{u}_T)\bigr) \;\le\; K_2\,h^{1/2},
$$
for a constant K2
with the same parameter dependence.
The h1/2 rate here is the natural strong order of Euler–Maruyama for additive-noise stochastic evolution equations under global Lipschitz conditions. Under additional smoothness (e.g. F̂θ ∈ C1 with Lipschitz derivative and suitable trace/Hilbert–Schmidt compatibility conditions ensuring higher-order Itô–Taylor remainders are controlled in H), one can improve the strong rate and obtain W2(ℒ(uN), ℒ(ûT)) ≤ K̃2 h. We do not pursue such refinements, since the end-to-end sampling bound only requires a dimension-free control that vanishes as h → 0.
Finally, we note that an implicit or Crank–Nicolson-type variant may
be preferable numerically when F̂θ contains a
stiff linear component. For instance, splitting F̂θ(u) = −u + Ĝθ(u),
one may treat −u implicitly
and Ĝθ
explicitly:
$$
u_{k+1} = \frac{1}{1+h}u_k + \frac{h}{1+h}\widehat{G}_\theta(u_k) +
\frac{\sqrt{2}}{1+h}(W_{t_{k+1}}-W_{t_k}),
$$
which preserves contractivity properties uniformly in h and yields the same O(h1/2) strong
rate under analogous assumptions. The subsequent analysis is unchanged:
we always reduce W2-control to a
synchronous strong error bound and track constants through dissipativity
and the trace-class noise structure.
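A minimal sketch of this semi-implicit update in a basis diagonalizing C (G_hat is a placeholder for the learned component Ĝθ):

import numpy as np

rng = np.random.default_rng(5)
n, h, N = 64, 1e-1, 1000
lam = 1.0 / np.arange(1, n + 1) ** 2

def G_hat(u):
    return np.zeros_like(u)                    # placeholder learned component

u = rng.standard_normal(n)
for _ in range(N):
    dW = np.sqrt(h * lam) * rng.standard_normal(n)   # increment W_{t_{k+1}} - W_{t_k}
    u = (u + h * G_hat(u) + np.sqrt(2.0) * dW) / (1.0 + h)
# the linear factor 1/(1+h) < 1 is contractive uniformly in h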
We now formalize the "discretize-later" principle for our bounds: when the algorithm is implemented in a finite-dimensional subspace Hn ⊂ H via an orthogonal projection Pn : H → Hn, the constants appearing in the W2-error estimates can be chosen independently of n. The key point is that, under natural consistency assumptions on (Pn, C) and the implementation of the drift oracle, the projected dynamics inherit the same Lipschitz/dissipativity structure and the same trace-class noise control, with no dimension-dependent degradation.
We assume that Pn is the
orthogonal projection onto Hn, hence
nonexpansive:
$$
\|P_n x\|_H \;\le\; \|x\|_H, \qquad x \in H.
$$
We also assume (basis alignment):
$$
P_n C \;=\; C P_n.
$$
This is automatic when Pn is defined by
truncation in an eigenbasis of C, and it is precisely the condition
that allows noise sampling and preconditioning to be consistent across
resolutions.
Define the truncated covariance Cn := PnCPn
as an operator on Hn. By the basis alignment we have
Cn = CPn = PnC
and therefore
$$
\mathrm{Tr}(C_n) \;\le\; \mathrm{Tr}(C),
$$
so any estimate in which the noise enters only through Tr(C) remains valid uniformly in
n.
Let Wt
be a C-Wiener process on H and set Wt(n) := PnWt.
For any x, y ∈ Hn,
𝔼[⟨Wt(n), x⟩H⟨Wt(n), y⟩H] = 𝔼[⟨Wt, x⟩H⟨Wt, y⟩H] = t⟨Cx, y⟩H = t⟨Cnx, y⟩H,
where we used x = Pnx,
y = Pny,
and the basis alignment. Hence Wt(n)
is a Cn-Wiener
process on Hn, and the
projected noise increments satisfy
Wtk + 1(n) − Wtk(n) ∼ 𝒩(0, h Cn), equivalently Pnξk ∼ 𝒩(0, Cn).
In particular, we may and will couple all resolutions {Hn} and the
continuum by using the underlying Wt and then
projecting, which is the natural resolution-consistent coupling.
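This resolution-consistent coupling is straightforward to verify empirically: sampling ξ ∼ 𝒩(0, C) at a fine resolution and truncating modes yields draws distributed exactly as 𝒩(0, Cn). A minimal sketch (spectrum and sizes are illustrative):

import numpy as np

rng = np.random.default_rng(6)
n_fine, n_coarse, M = 256, 32, 20000
lam = 1.0 / np.arange(1, n_fine + 1) ** 2

xi = np.sqrt(lam) * rng.standard_normal((M, n_fine))   # fine-resolution draws of N(0, C)
xi_n = xi[:, :n_coarse]                                # P_n: keep the first n modes
emp_var = xi_n.var(axis=0)
assert np.allclose(emp_var, lam[:n_coarse], rtol=0.1)  # matches diag(C_n)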
Given a drift field G : H → H, we
define its Galerkin version by Gn := PnG.
If G is globally Lipschitz on
H, then by
so Lip(Gn) ≤ Lip(G)
with no dependence on n.
Likewise, if G satisfies the
one-sided dissipativity condition, then for u ∈ Hn,
$$
\bigl\langle G_n(u),\,u\bigr\rangle_H \;=\; \bigl\langle G(u),\,P_n u\bigr\rangle_H \;=\; \bigl\langle G(u),\,u\bigr\rangle_H \;\le\; -\alpha\|u\|_H^2 + b,
$$
so it holds for Gn with the same
(α, b).
Applying these observations to G = F̂θ
shows that the projected perturbed SPDE
$$
du_t^{(n)} \;=\; P_n\widehat{F}_\theta\bigl(u_t^{(n)}\bigr)\,dt + \sqrt{2}\,dW_t^{(n)}
$$
and its Euler–Maruyama discretization
$$
u^{(n)}_{k+1} \;=\; u^{(n)}_k + h\,P_n\widehat{F}_\theta\bigl(u^{(n)}_k\bigr) + \sqrt{2h}\,P_n\xi_k
$$
inherit the same global Lipschitz constant and the same dissipativity
parameters as their continuum counterparts. Consequently, all moment
bounds used in the time-discretization analysis (and hence the constant
K2 above) may be
chosen uniformly in n, with
Tr(Cn) bounded by Tr(C) as above.
The contractivity and stability estimates in W2 are likewise uniform.
Indeed, any bound obtained by synchronous coupling and estimates of the
form
⟨x − y, G(x) − G(y)⟩H ≤ −κ∥x − y∥H2
is preserved under projection on Hn: for x, y ∈ Hn,
⟨x − y, Gn(x) − Gn(y)⟩H = ⟨x − y, Pn(G(x) − G(y))⟩H = ⟨x − y, G(x) − G(y)⟩H.
Thus the same contraction rate κ applies to the projected exact
dynamics (with drift Fn := PnF)
and to the projected perturbed dynamics (with drift F̂θ, n := PnF̂θ).
Finally, the learned-drift error does not worsen under projection: by
nonexpansiveness,
𝔼u ∼ ν∥Fn(u) − F̂θ, n(u)∥H2 = 𝔼u ∼ ν∥Pn(F(u) − F̂θ(u))∥H2 ≤ 𝔼u ∼ ν∥F(u) − F̂θ(u)∥H2 ≤ δ2.
Moreover, since Pn is 1-Lipschitz, W2 is nonexpansive under
pushforward: for any H-valued
random variables X, Y,
$$
W_2\bigl(P_{n\sharp}\mathcal{L}(X),\,P_{n\sharp}\mathcal{L}(Y)\bigr) \;\le\; W_2\bigl(\mathcal{L}(X),\,\mathcal{L}(Y)\bigr).
$$
Combining these facts with the continuum bounds (mixing, drift perturbation
stability, and Euler–Maruyama error) yields the claimed
discretization-uniform estimate for the Galerkin chain targeting Pn♯ν:
W2(ℒ(uN(n)), Pn♯ν) ≤ e−κTW2(ℒ(u0(n)), Pn♯ν) + K(δ + h1/2),
where κ and K depend only on (m, L, Tr(C)) and
on the dissipativity/Lipschitz parameters imposed on F̂θ, but not on
n. This is the sense in which
the sampling guarantee is resolution-invariant: the number of steps required to reach a
prescribed accuracy is controlled by the statistical and temporal errors
(δ, h), and not by
the spatial discretization level.
In many applications the learned drift (or score-induced drift) is
available not only for the target ν, but for a family of
"noised" measures {νt}t ∈ [0, 1] which interpolates between an "easy"
reference at t = 1 and the
desired target at t = 0. The
typical score-based instantiation is a noise scale σ ↦ νσ,
where
$$
\nu_\sigma \;:=\; \mathcal{L}\bigl(u + \sigma\,\xi\bigr),
\qquad u \sim \nu,\ \ \xi \sim \mathcal{N}(0,C)\ \text{independent},
$$
so that νσ
becomes progressively more Gaussian as σ increases. We treat the general
situation in which for each t
we have a strongly log-concave invariant measure νt for a
preconditioned Langevin dynamics with drift
Ft(u) := −(u + C∇Hμ0Φt(u)),
and we have a learned approximation F̂θ, t
satisfying a uniform-in-resolution error control
$$
\mathbb{E}_{u\sim\nu_t}\bigl\|\widehat{F}_{\theta,t}(u)-F_t(u)\bigr\|_H^2 \;\le\; \delta_t^2,
$$
together with Lipschitz/dissipativity assumptions (uniform in t, or at least controlled along the
schedule) sufficient to apply the continuum perturbation and
Euler–Maruyama estimates at each level.
Fix a decreasing schedule 1 = t0 > t1 > ⋯ > tM = 0.
For each level j ∈ {0, …, M − 1}, we run
an Euler–Maruyama chain with step hj and Nj steps
(sampling time τj := Njhj)
targeting νtj + 1:
$$
u^{j}_{k+1} \;=\; u^{j}_k + h_j\,\widehat{F}_{\theta,t_{j+1}}\bigl(u^{j}_k\bigr) + \sqrt{2h_j}\,\xi_k,
\qquad \xi_k \stackrel{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0,C),
$$
initialized by u0j := uNj − 1j − 1
(with u00 given).
Denote by ν̂j := ℒ(uNjj)
the distribution after level j, so the final output law is ν̂M − 1, intended
to approximate νtM = ν0.
Iterating the end-to-end bound level-by-level yields a decomposition
into (i) residual mixing from the initial condition at the first level,
(ii) the accumulated learned-drift and time-discretization errors, and
(iii) a term measuring the Wasserstein distance between consecutive
targets. Concretely, under a uniform contraction rate κ > 0 and a uniform constant
K (depending on the same
structural parameters as before, but independent of any Galerkin
resolution), we obtain
$$
W_2\bigl(\widehat{\nu}_{M-1},\,\nu_0\bigr)
\;\le\;
e^{-\kappa\sum_{j}\tau_j}\,W_2\bigl(\mathcal{L}(u_0^0),\,\nu_{t_1}\bigr)
+\sum_{j=0}^{M-1} e^{-\kappa\sum_{i>j}\tau_i}\,W_2\bigl(\nu_{t_j},\,\nu_{t_{j+1}}\bigr)
+K\sum_{j=0}^{M-1} e^{-\kappa\sum_{i>j}\tau_i}\bigl(\delta_{t_{j+1}}+h_j^{1/2}\bigr).
$$
The middle sum is the annealing error: even if each level were sampled exactly from
νtj + 1,
a finite schedule cannot avoid paying for the cumulative discrepancy
between successive targets, unless additional time is spent to contract
these errors away (as reflected by the exponential weights).
In the convolutional noise model above, we can make this term fully
explicit. Let σ0 > σ1 > ⋯ > σM = 0
and write νσj
in place of νtj.
Coupling U + σjΞ
and U + σj + 1Ξ
with the same pair (U, Ξ)
gives
$$
W_2(\nu_{\sigma_j},\nu_{\sigma_{j+1}})
\le \bigl(\mathbb{E}\|(\sigma_j-\sigma_{j+1})\Xi\|_H^2\bigr)^{1/2}
=|\sigma_j-\sigma_{j+1}|\,\sqrt{\mathrm{Tr}(C)}.
$$
Thus the annealing error is controlled by the discrete total variation
of the noise schedule, scaled by $\sqrt{\mathrm{Tr}(C)}$, with no dependence
on spatial resolution.
To reach a prescribed accuracy ε, we balance the three contributions in the level-by-level bound.
First, we choose per-level sampling times τj so that the
exponential weights make earlier errors negligible, e.g.
$$
e^{-\kappa \tau_j}\approx \frac{1}{M}
\quad\Longleftrightarrow\quad
\tau_j \approx \kappa^{-1}\log M,
$$
or larger at the smallest-noise levels where one typically observes
larger learned error δσ.
Second, we choose the number of levels M and schedule increments so that
∑jW2(νtj, νtj + 1) ≲ ε.
In the convolutional case, it suffices to enforce
$$
\sum_{j=0}^{M-1} |\sigma_j-\sigma_{j+1}|\,\sqrt{\mathrm{Tr}(C)} \lesssim
\varepsilon,
$$
so a simple choice is a geometric ladder σj + 1 = γσj
with γ ∈ (0, 1) and σ0 fixed large enough
that νσ0
is easy to initialize. Then ∑j|σj − σj + 1| = σ0,
and increasing M primarily
improves the annealing term by allowing contraction between closer
consecutive targets.
Third, we select hj to control the discretization error hj1/2 (or hj under higher regularity) and to satisfy the stability constraint hj ≤ h0 required by the Euler–Maruyama analysis. If Lipschitz constants improve at larger noise (as is typical for DDO/NCSN training), one may take hj larger for large σj and decrease hj as σj ↓ 0, while keeping hj ≲ ε2 if the h1/2 rate is the operative one.
All statements above remain valid for the Galerkin-projected implementation, with the same constants, by combining them with the uniform-in-n arguments of the preceding sections.
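As a concrete illustration of the schedule choices above, the following sketch builds a geometric noise ladder and evaluates its annealing budget Σj|σj − σj+1|·√Tr(C) (the values of σ0, γ, M, and the spectrum are illustrative):

import numpy as np

sigma0, gamma, M = 2.0, 0.8, 20
lam = 1.0 / np.arange(1, 129) ** 2                 # trace-class spectrum of C
sigmas = sigma0 * gamma ** np.arange(M + 1)        # geometric ladder sigma_{j+1} = gamma * sigma_j
sigmas[-1] = 0.0                                   # final level targets nu itself
increments = np.abs(np.diff(sigmas))
annealing_budget = increments.sum() * np.sqrt(lam.sum())   # sum |dsigma| * sqrt(Tr C)
assert np.isclose(increments.sum(), sigma0)        # telescoping: total variation = sigma_0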
We record practical considerations for implementing the sampler and for empirically assessing the quantities that enter the dimension-free bound, with particular emphasis on estimating the drift error level δt and on verifying that performance does not degrade as the Galerkin resolution increases.
The bound depends on an L2(νt; H) quantity, so the drift error δt should be estimated in the same H-norm that enters the analysis. To preserve the commutation PnC = CPn and obtain resolution-invariant constants in practice, we implement in a basis {ei}i ≥ 1 diagonalizing C, i.e. Cei = λiei, and represent $u=\sum_{i=1}^n u_i e_i$. Then ξ ∼ 𝒩(0, C) is sampled by $\xi_i=\sqrt{\lambda_i}z_i$ with $z_i\stackrel{i.i.d.}{\sim}\mathcal{N}(0,1)$, and $P_n\xi=\sum_{i=1}^n \xi_i e_i$. When H = L2(D) and C is stationary, this is naturally realized by FFT-based sampling; otherwise, we precompute a truncated Karhunen–Loève expansion. We approximate ∥ ⋅ ∥H consistently with the basis (or with quadrature weights on a grid) and report all errors in this norm to avoid spurious resolution effects.
Beyond monitoring observables, we emphasize diagnostics aligned with the proof mechanism (contraction plus perturbation). We recommend: (i) coupled-chain contraction tests, where two chains started from different initializations but driven by the same Gaussian noises are used to estimate empirical decay of 𝔼∥uk − vk∥H2 and hence an effective κ; (ii) mode-wise marginal checks, comparing histograms and second moments of ⟨u, ei⟩H for the first r modes across resolutions; (iii) energy monitoring, tracking Φ(uk) (when computable) and ∥uk∥H2 to detect step-size instability; and (iv) projected Wasserstein diagnostics, computing W2 on a fixed low-dimensional projection (first r modes) via standard finite-dimensional solvers, which yields a resolution-comparable metric even when full W2 is impractical.
We consider a non-Gaussian target on H defined as a mixture of two
Gaussian measures with shared covariance C,
$$
\nu := \tfrac12 \mathcal{N}(m_1,C)+\tfrac12 \mathcal{N}(m_2,C),
$$
with m1, m2 ∈ Hμ0
supported on low frequencies. This target intentionally violates strong
log-concavity, and therefore serves to probe failure modes while keeping
the score analytically tractable (the mixture score can be evaluated in
closed form on each Hn). We train
F̂θ, σ
from the corruption model νσ and evaluate:
(i) how δ̂σ, n
varies with n; (ii) how many
steps N are needed for the
projected W2 error
on the first r modes to reach
a fixed tolerance; and (iii) whether empirical contraction deteriorates
with n. The expected outcome,
if the method is resolution-consistent, is that the required N is stable in n until representation error
(insufficient network capacity for high frequencies) becomes
dominant.
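For completeness, the closed-form score used in this benchmark: for ν = ½𝒩(m1, C) + ½𝒩(m2, C) with shared covariance, the full score is −C−1(u − m̄(u)), where m̄(u) is the responsibility-weighted mean of the component means. A minimal sketch in the diagonal basis (means and spectrum are illustrative):

import numpy as np

n = 64
lam = 1.0 / np.arange(1, n + 1) ** 2
m1 = np.zeros(n); m1[:4] = 1.0                 # means supported on low frequencies
m2 = -m1

def mixture_score(u):
    # responsibilities from component log-densities (shared covariance C)
    logp = np.array([-0.5 * np.sum((u - m) ** 2 / lam) for m in (m1, m2)])
    w = np.exp(logp - logp.max()); w /= w.sum()
    m_bar = w[0] * m1 + w[1] * m2
    return -(u - m_bar) / lam                  # -C^{-1}(u - m_bar), C diagonal

u = np.sqrt(lam) * np.random.default_rng(7).standard_normal(n)
s = mixture_score(u)                           # evaluable on any H_n, uniformly in n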
We consider an elliptic inverse problem with prior μ0 = 𝒩(0, C) on
the log-permeability field u,
and observations y = G(u) + η,
η ∼ 𝒩(0, Γ). The
posterior satisfies ν ≪ μ0
with
$$
\Phi(u)=\tfrac12\bigl\|\Gamma^{-1/2}(G(u)-y)\bigr\|^2.
$$
We obtain reference samples on a moderate discretization n⋆ using a standard
function-space MCMC method (e.g. pCN) and train F̂θ, σ
using the trace-class corruption νσ. We then run
the learned sampler on resolutions n ∈ {n⋆, 2n⋆, 4n⋆}
without retraining and compare: (i) posterior predictive statistics
(pushforwards through G and
sensor predictions), (ii) low-mode marginals against the reference chain
projected to the same modes, and (iii) the dependence of effective
sample size per unit compute on n. The central empirical claim
corresponding to the theory is that, once δσ is controlled
uniformly, the iteration complexity needed to stabilize these
diagnostics should not increase with n, up to the expected Õ(n log n) per-step
cost.
Our analysis is intentionally confined to a regime in which three ingredients interact cleanly: (i) trace-class, strictly positive noise with a fixed covariance operator C; (ii) strong log-concavity of ν along Hμ0, encoded by m-strong monotonicity and L-Lipschitzness of ∇Hμ0Φ; and (iii) a learned drift error controlled in an L2(ν; H) sense by δ. These assumptions are sufficient to make the proof mechanism essentially one-dimensional: synchronous coupling yields contraction, perturbation yields an additive bias O(δ), and Euler–Maruyama yields a discretization bias O(h1/2). The benefit is that the resulting constants are independent of Galerkin resolution, but the scope is correspondingly limited.
A first limitation is nonconvexity (or multimodality). If Φ is not strongly convex along the
Cameron–Martin directions, we generally lose global contractivity in
W2 for the
continuum semigroup and therefore lose the straightforward mixing term
e−κT.
Even when the invariant measure exists and the SPDE is well-posed, the
relevant convergence mechanism can be metastable and dominated by
barrier-crossing times, for which dimension-free bounds are
substantially more delicate. One can still pursue bounds of the
form
W2(ℒ(uT), ν) ≤ Mix(T; V) + Bias(δ, h),
where Mix(T; V) is
controlled via a Lyapunov function V and a minorization condition on
suitable sublevel sets. However, establishing such conditions in
function space, and doing so uniformly over Pn, typically
requires additional structure (e.g. a globally dissipative drift plus a
noise that is sufficiently nondegenerate on a "small set"). In the
nonconvex setting, we therefore view the present result as giving a
baseline: it characterizes when one can expect resolution-invariant
behavior, and it isolates precisely which part of the argument fails
when contractivity is absent.
A second limitation concerns small-noise or singular-noise regimes. Our setting assumes C is trace-class and fixed as the resolution increases, so ξ ∼ 𝒩(0, C) is H-valued almost surely and Tr(C) < ∞ enters the constants. If one instead considers ``whiter’’ noise (e.g. covariance approaching the identity on finer grids), then Tr(C) may diverge with n, and both the implementation and the bounds can degrade. This is not merely technical: if the corruption/noise injects energy at arbitrarily high frequencies, then learning a drift in an H-norm that remains uniformly accurate becomes information-theoretically and computationally harder, and the error level δ typically grows with resolution. This small-noise singularity also arises in annealing limits where one lets the effective diffusion scale go to zero; in such limits, the bias term induced by δ is amplified unless δ simultaneously vanishes at a compatible rate.
A third limitation is that the learning assumption is expressed under
the law ν:
𝔼u ∼ ν∥F(u) − F̂θ(u)∥H2 ≤ δ2,
whereas the sampler operates out of stationarity during burn-in and may
explore regions under its own transient law. In strongly contractive
regimes this mismatch is mitigated, but, strictly speaking, an
off-distribution control (e.g. uniform Lipschitzness of F̂θ and a
weighted L2 bound
under a Lyapunov measure) would be more satisfactory. This suggests an
open direction: replace the L2(ν; H)
error by a stability notion that can be verified from training data and
that is robust under the sampling dynamics, for example a bound of the
form
supρ ∈ 𝒫2(H): ρ(V) ≤ M 𝔼u ∼ ρ∥F(u) − F̂θ(u)∥H2 ≤ δM2,
for a coercive V. Proving an
end-to-end bound with δM would better
reflect practical training and would more naturally accommodate
non-log-concave targets.
It is also useful to place the present formulation among other
infinite-dimensional diffusion constructions. The SPDE
$$
du_t\ =\ -(u_t+C\nabla_{H_{\mu_0}}\Phi(u_t))\,dt+\sqrt{2}\,dW_t
$$
may be viewed as an overdamped Langevin dynamics preconditioned by the
prior covariance, with Wt a C-Wiener process; this is closely
related to function-space MALA and to the idea that the prior should set
the geometry of the proposal. From this perspective, our
resolution-invariant estimate can be read as a quantitative stability
result for a μ0-reversible diffusion
under simultaneous drift approximation and time discretization. Other
diffusion-based approaches (e.g. Schrödinger bridge formulations or
score-based reverse-time SDEs) can also be posed on Hilbert spaces, but
they often require either time-inhomogeneous drifts or additional
regularity to justify changes of measure. An interesting conceptual link
is that, in all cases, the trace-class nature of the noise is the key
property that prevents high-frequency divergences and makes the
continuum-first viewpoint meaningful.
Finally, we highlight two open directions that we expect to be central in applications. The first is to move beyond strong log-concavity while retaining resolution-invariant control, potentially by combining (a) contractivity on low modes, (b) coercivity of the prior on high modes, and (c) localized minorization driven by the nondegenerate directions of C. The second is to develop adaptive or learned preconditioners. Replacing C by an operator K (possibly state-dependent) could accelerate mixing, but it changes both the invariant measure and the discretization stability. Even in the linear case, choosing K to match posterior curvature resembles operator preconditioning; in the learned setting, one may attempt to parameterize K and the drift jointly while enforcing that the resulting diffusion preserves ν. Establishing dimension-free guarantees in such adaptive geometries appears feasible in principle, but it will require a careful extension of the contraction argument to nonconstant metrics and a corresponding notion of drift error aligned with the new geometry.