In contemporary diffusion training pipelines, the phenomenon colloquially called memorization is typically discussed as a property of the model class, the dataset, and the total optimization effort. We shall treat it instead as a consequence of a more primitive design choice that is usually left implicit: the noise multiplicity m used to form the denoising score matching (DSM) regression targets at a given diffusion time. Concretely, at time t one observes pairs (xi, x̃i, j) where $\tilde x_{i,j}=a_t x_i+\sqrt{h_t}\,z_{i,j}$ with zi, j ∼ 𝒩(0, Id), and one fits a score estimator to approximate ∇x̃log Pt(x̃) via a regression surrogate. The standard implementation choice is to draw one noise sample per datum per iteration, which corresponds (in our stylized per-time analysis) to m = 1. A seemingly innocuous alternative is to reuse data by drawing many independent noises per datum (large m, or an idealized m = ∞), thereby reducing Monte Carlo variability in the regression target. Since increasing m also increases the number of effective training constraints (and hence the compute spent evaluating the score loss), it is natural to regard m as an explicit compute-allocation variable rather than a fixed nuisance parameter.
Our starting point is that, in high-dimensional regimes where random-feature score models exhibit sharp interpolation transitions, m is not merely a variance-reduction knob: it changes which object the learned score is statistically encouraged to approximate. In particular, when the function class is rich enough to closely fit the DSM targets, the learned estimator can become closer to the empirical score (a kernel-density-type score built from the finite sample) than to the population-optimal score. This is the operational sense in which memorization arises in our analysis: the estimator aligns with the dataset-induced score field, producing strong attraction toward training points in the small-noise region. The key observation is that large m makes the empirical DSM objective closer to its conditional-expectation limit given the dataset, and therefore (in regimes allowing interpolation) can move the global minimizer toward this empirical score, even if that move is detrimental to out-of-sample DSM error. Thus, compute spent on increasing m may increase the propensity to memorize, rather than mitigate it.
This dependence is time-dependent and therefore inseparable from the diffusion geometry. At small diffusion times t, the forward corruption $\tilde x=a_t x+\sqrt{h_t}z$ preserves fine-scale information about individual samples (since at ≈ 1 and ht ≈ 0). In that window, the empirical score field is highly non-smooth at the scale relevant to generalization, and the DSM regression becomes capable of encoding sample-specific structure. Accordingly, if we are in an overparameterized regime (in the proportional limit, a representative condition is ψp > ψn), then increasing m can reduce the mismatch to the empirical optimal score while simultaneously worsening the DSM test error. In contrast, at larger times t the diffusion marginal Pt becomes closer to an isotropic Gaussian and the population score becomes near-linear, so that increasing m behaves as one expects from standard regression: it reduces target noise and improves generalization. This qualitative crossover suggests that any compute spent on noise multiplicity should not be uniform across time, and that ``more data augmentation’’ (in the form of more noise draws per datum) is not universally beneficial.
From these considerations it follows that the relevant optimization problem is not simply ``choose $m$'' but rather ``choose a schedule $t \mapsto m(t)$'' subject
to finite resources. Once we discretize diffusion time into bins t1, …, tK,
we may allocate a multiplicity mk to each time
tk. Doing
so induces three competing functionals: a fidelity proxy, measured by an integrated DSM test error across times; a compute proxy, measured by the total number of noise evaluations ∑k mk Δtk; and a memorization-risk proxy, measured by an increasing function of the estimator's proximity to the empirical score, but restricted to a small-noise window t ≤ tmin where memorization is mechanistically plausible. The problem therefore
becomes a constrained scheduling task: allocate noise multiplicity
across time so as to reduce global DSM error while ensuring that
early-time memorization risk stays within a prescribed budget and total
compute does not exceed an available budget.
The central message is that this scheduling problem acquires a tractable structure precisely because diffusion time imposes an ordering on the learning curves. Under mild monotonicity conditions (which are verified in the source material for the endpoints m ∈ {1, ∞} via exact proportional-asymptotic formulas), the optimization collapses to a near-threshold rule: keep mk minimal at sufficiently small times when the model is capable of interpolating, and allocate larger mk at moderate/large times where additional noise draws strictly improve test performance while having negligible effect on the memorization proxy. In the binary policy class mk ∈ {1, ∞}, the resulting solution is provably optimal via a standard exchange argument; in bounded-integer policies mk ∈ {1, …, mmax}, the same reasoning yields a greedy rule based on marginal gains that is near-optimal under natural diminishing-returns conditions. In either case, the schedule induces an explicit compute–privacy–fidelity Pareto frontier: increasing compute improves fidelity only insofar as it is spent at times where the gain is positive, while privacy (or anti-memorization) constraints restrict spending in the small-noise region.
We emphasize that the role of random features and proportional asymptotics is not merely technical convenience. The analytic learning curves provide closed-form access to the dependence of εtest(t; m) and the mismatch-to-empirical-score Mt(m) on (ψn, ψp, λ, 𝜚, t) at the extreme multiplicities m = 1 and m = ∞, which is exactly what is needed to compute marginal gains and to certify monotonicity regimes. This permits a clean separation between (i) a statistical statement about how multiplicity changes generalization and memorization as functions of diffusion time, and (ii) an algorithmic statement about how to allocate multiplicity under budgets. The remainder of the paper formalizes these statements, beginning with the OU diffusion and DSM setup, and then deriving the scheduling structure from the asymptotic formulas.
We briefly recall the OU forward process and the denoising score
matching (DSM) regression view in a form that makes the parameter m explicit. For t ≥ 0, the OU corruption of a clean
sample x ∼ P0 is
$$
\tilde x = a_t x + \sqrt{h_t}\,z,\qquad z\sim\mathcal{N}(0,I_d),
$$
where at = e−t
and ht = 1 − e−2t
satisfy at2 + ht = 1.
The conditional law is x̃ ∣ x ∼ 𝒩(atx, htId),
hence its (Stein) score with respect to x̃ is
$$
\nabla_{\tilde x}\log p(\tilde x\mid x)= -\frac{\tilde x-a_t x}{h_t} =
-\frac{1}{\sqrt{h_t}}\,z.
$$
The population target score at time t is s⋆(t, x̃) = ∇x̃log Pt(x̃),
where Pt
is the x̃-marginal induced by
x ∼ P0 and
the OU channel.
In DSM one learns a function s(⋅) ≈ s⋆(t, ⋅)
by regressing to the conditional score. For a fixed time t and a function class ℋ (here the RFNN score class), we consider
the regularized empirical objective built from a dataset S = {xi}i = 1n
and m independent noise draws
per datum,
$$
\hat L_{t,m}(s)
:= \frac{1}{nm}\sum_{i=1}^n\sum_{j=1}^m
\Bigl\|\, s(\tilde x_{i,j}) + \frac{\tilde x_{i,j}-a_t
x_i}{h_t}\Bigr\|_2^2
\;+\;\lambda \|s\|_{\mathcal{H}}^2,
\qquad
\tilde x_{i,j}=a_t x_i+\sqrt{h_t}\,z_{i,j}.
$$
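To fix ideas, here is a minimal NumPy sketch of this per-time objective, with a generic `score_fn` standing in for the RFNN score model and the model-specific ridge penalty λ∥s∥ℋ² omitted; the interface and names are assumptions made for illustration.

```python
import numpy as np

def dsm_loss(score_fn, X, t, m, rng=None):
    """Empirical DSM loss at a fixed time t with noise multiplicity m.

    score_fn : callable mapping an (n*m, d) array of noised inputs to score predictions
    X        : (n, d) array of clean training samples
    m        : number of independent noise draws per datum (m = 1 is the standard choice)
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    a_t, h_t = np.exp(-t), 1.0 - np.exp(-2.0 * t)        # OU parameters, a_t^2 + h_t = 1
    Z = rng.standard_normal((n, m, d))                    # m fresh noises per datum
    X_tilde = a_t * X[:, None, :] + np.sqrt(h_t) * Z      # noised inputs, shape (n, m, d)
    target = -Z / np.sqrt(h_t)                            # conditional score -(x_tilde - a_t x)/h_t
    pred = score_fn(X_tilde.reshape(n * m, d)).reshape(n, m, d)
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))
```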
The dependence on m is not
merely cosmetic: it controls how accurately the empirical loss
approximates its conditional expectation given the dataset. Indeed,
conditioning on S, we may view
L̂t, m
as an m-sample Monte Carlo
approximation to the functional
$$
L_{t,\infty}(s)
:= \frac{1}{n}\sum_{i=1}^n
\mathbb{E}_{z\sim\mathcal{N}(0,I_d)}
\Bigl[\Bigl\|\, s(a_t x_i+\sqrt{h_t}\,z) +
\frac{z}{\sqrt{h_t}}\Bigr\|_2^2\Bigr]
\;+\;\lambda \|s\|_{\mathcal{H}}^2,
$$
which corresponds to the idealized multiplicity m = ∞. In this sense, m interpolates between a ``single-noise'' stochastic regression problem ($m=1$) and a ``conditional-expectation'' regression problem (m = ∞) in which the only randomness comes from the finite dataset S itself.
A central object in the source analysis is the empirical optimal score se(t, ⋅), defined as the (unregularized) pointwise regression function obtained when we replace the model class by the class of all measurable functions.
Concretely, for a fixed t and
dataset S, consider the
empirical mixture distribution
$$
P_t^S := \frac{1}{n}\sum_{i=1}^n \mathcal{N}(a_t x_i,\,h_t I_d),
$$
which is exactly the law of x̃
when we first sample an index I uniformly in {1, …, n} and then sample x̃ ∼ 𝒩(atxI, htId).
A standard calculation shows that the corresponding score is
$$
\nabla_{\tilde x}\log P_t^S(\tilde x)
=
\sum_{i=1}^n w_i(\tilde x)\,\Bigl(-\frac{\tilde x-a_t x_i}{h_t}\Bigr),
\qquad
w_i(\tilde x):=\frac{\exp\!\bigl(-\frac{\|\tilde x-a_t
x_i\|_2^2}{2h_t}\bigr)}{\sum_{\ell=1}^n \exp\!\bigl(-\frac{\|\tilde
x-a_t x_\ell\|_2^2}{2h_t}\bigr)}.
$$
We denote this score by se(t, x̃);
it is a kernel-density-estimator (KDE) score field with Gaussian
bandwidth $\sqrt{h_t}$ and centers
{atxi}.
In particular, as t ↓ 0 one
has at → 1
and ht → 0, so se(t, ⋅)
develops sharp attraction basins around the training samples at a
spatial scale comparable to $\sqrt{h_t}$. Conversely, for large t the mixture PtS
is strongly smoothed and approaches a near-isotropic Gaussian in many
regimes, so se(t, ⋅)
becomes comparatively regular and close to the population score s⋆(t, ⋅).
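Since se(t, ⋅) is an explicit function of the training set, it can be evaluated directly, which is useful both for intuition and for auditing the mismatch Mt(m). A minimal NumPy sketch (brute force over all n centers, with a numerically stabilized softmax):

```python
import numpy as np

def empirical_score(x_tilde, X, t):
    """Score of the empirical Gaussian mixture P_t^S at a single query point x_tilde.

    X : (n, d) training samples.  Returns the length-d vector
        sum_i w_i(x_tilde) * ( -(x_tilde - a_t x_i) / h_t ).
    """
    a_t, h_t = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    diffs = x_tilde[None, :] - a_t * X                    # (n, d)
    logits = -np.sum(diffs ** 2, axis=1) / (2.0 * h_t)    # unnormalized log-responsibilities
    logits -= logits.max()                                # numerically stable softmax
    w = np.exp(logits)
    w /= w.sum()
    return -(w[:, None] * diffs).sum(axis=0) / h_t        # mixture of per-center scores
```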
The role of m may now be phrased as follows. When m is large, the empirical DSM objective is a high-accuracy approximation to the conditional-expectation objective given S, and in rich function classes the optimizer is driven toward the dataset-conditioned regression function—hence, toward se(t, ⋅) rather than s⋆(t, ⋅). When m is small (notably m = 1), the additional Monte Carlo variability in the target acts as an implicit noise injection in the regression problem; in interpolation-capable regimes this can prevent the estimator from locking onto the fine-scale structure of se(t, ⋅), especially at small t where that structure is most ``sample-specific.’’
The source material makes this qualitative picture quantitative for
RFNN score models in proportional asymptotics. In that setting, the
width p and sample size n scale linearly with d, and the ratio ψp/ψn
governs an interpolation transition: roughly, ψp > ψn
corresponds to a regime where the random-features design is expressive
enough to fit (or nearly fit) the DSM constraints, while ψp < ψn
corresponds to a constrained regime where the estimator cannot
interpolate. The novelty is that the sign of the benefit of increasing
m depends on which side of this transition we are on and on the diffusion time. In overparameterized
regimes ψp > ψn,
increasing m can worsen the
DSM test error at sufficiently small t even though it reduces the
mismatch to the empirical score, because it encourages convergence
toward the spiky se(t, ⋅).
At larger t, where the target
score is effectively low-complexity and less dataset-dependent,
increasing m behaves like
classical variance reduction and improves generalization. Around the
critical region p ≈ n
(i.e. ψp ≈ ψn),
the learning curves sharpen and the crossover in t becomes particularly pronounced:
small changes in effective constraint-to-parameter ratio, induced here
by changing m, can move the
estimator between ``population-like'' and ``empirical-like'' solutions.
This generalization–memorization crossover is the mechanism that motivates time-dependent allocation of m. The same multiplicity that is beneficial when the regression problem is well-posed (moderate/large t, or ψp < ψn) may be harmful when the estimator can track sample-specific score features (small t with ψp > ψn). The remainder of our development will treat the quantities induced by this dependence—test error and mismatch-to-empirical-score, evaluated at the endpoints m ∈ {1, ∞} and then scheduled over t—as the primitive inputs for a constrained allocation problem.
We now isolate the scheduling question induced by the dependence of the DSM estimator on the noise-multiplicity parameter m. Fix a quadrature rule on [0, T] given by times 0 < t1 < ⋯ < tK ≤ T
and nonnegative weights Δtk
(e.g. Riemann or Gauss–Legendre). At each tk we train an
RFNN score model to the global minimizer of the regularized DSM
objective with multiplicity mk, producing an
estimator ŝk ≡ ŝtk, mk.
The decision variable is therefore the schedule
m = (m1, …, mK), mk ∈ ℳ,
where we shall study two classes: (i) the class ℳ = {1, ∞}, and (ii) a class ℳ = {1, 2, …, mmax}
equipped with an interpolation/bounding rule connecting the endpoints
m = 1 and m = ∞.
The scheduling objective is stated in terms of the asymptotic test
error of the time-t regression
problem, denoted εtest(t; m),
evaluated in the proportional limit d, n, p → ∞ with
ψn = n/d
and ψp = p/d
fixed. We define the integrated test error $\mathcal{F}(\mathbf{m}) := \sum_{k=1}^K \Delta t_k\,\varepsilon_{\mathrm{test}}(t_k; m_k)$, which is the natural discrete analogue of ∫0Tεtest(t; m(t)) dt.
This quantity is the correct surrogate for end-to-end generative
accuracy because standard score-based diffusion arguments bound the
divergence between the true reverse-time dynamics and the learned one by
an integral of squared score error; in particular, under the usual
regularity hypotheses one has a bound of the form
$$
\mathrm{KL}\bigl(P_0 \,\|\, \hat P_0\bigr) \;\lesssim\; \int_0^T \varepsilon_{\mathrm{test}}\bigl(t; m(t)\bigr)\,dt \;+\; (\text{discretization and initialization terms}).
$$
Thus, minimizing ℱ is our proxy for maximizing sample quality.
We model computational cost as proportional to the number of noise
evaluations per datum. Since training at time tk uses mk noise draws
per xi,
the total cost (up to an overall constant depending on optimization
details) is
$$
\mathcal{C}(\mathbf{m}) \;:=\; \sum_{k=1}^K \Delta t_k\, m_k,
$$
and we impose the constraint $\mathcal{C}(\mathbf{m}) \le B$, where B is a prescribed budget. In the binary class, the symbol mk = ∞ is
interpreted as allocating a very large (effectively variance-free) Monte
Carlo pool at that time; operationally, one may replace ∞ by a large mmax chosen so that
endpoint learning curves are well approximated. Importantly, 𝒞 is
separable across times, which will enable greedy/threshold solutions
once monotonicity properties of εtest(t; m)
are established.
Compute alone does not capture the phenomenon highlighted above: in
regimes where the estimator can interpolate, increasing m can drive ŝt, m
toward the dataset-conditioned score se(t, ⋅),
which is known to become highly sample-specific as t ↓ 0. To quantify the propensity of
the learned score to encode such sample-specific structure, we introduce
a quantity
$$
M_t(m) \;:=\; \mathbb{E}\,\bigl\|\hat s_{t,m}(\tilde X) - s_e(t,\tilde X)\bigr\|_2^2,
\qquad \tilde X \sim P_t,
$$
whose appearance is motivated by the bias–variance style decomposition
underlying the source analysis (in the spirit of Lemma~3.5 therein).
Since se(t, ⋅)
is the score of the empirical Gaussian mixture PtS
and thus directly determined by S, small Mt(m)
indicates that the learned model is close to a score field that
concentrates around training points at scale $\sqrt{h_t}$. This motivates using Mt(m) as a primitive that correlates with memorization and, consequently, with extraction vulnerability.
To connect to extraction metrics, we pass from Mt(m)
to a scalar risk proxy Mem(t; m) via a fixed
monotone transform $\mathsf{Mem}(t;m) := \phi\bigl(M_t(m)\bigr)$, where ϕ is decreasing (so that being close to se corresponds to higher memorization risk). This abstraction accommodates multiple
downstream notions: for instance, a membership-inference advantage in a
score-based test is often monotone in a suitable discrepancy between the
model score and a population baseline, while nearest-neighbor style
extraction heuristics become stronger as the learned score field
develops sharp basins aligned with training samples. We do not commit to
a single operational metric here; rather, we treat Mem as an explicitly computable proxy derived
from the same asymptotic formulas as εtest, and we enforce it
only where it is meaningful: in a small-noise window [0, tmin].
Accordingly, we define the early-time memorization risk
$$
\mathcal{R}(\mathbf{m}) \;:=\; \sum_{k:\,t_k \le t_{\min}} \Delta t_k\,\mathsf{Mem}(t_k; m_k),
$$
and impose the constraint $\mathcal{R}(\mathbf{m}) \le \rho$, where tmin > 0
is chosen so that ht remains small
on [0, tmin] (so
that se
indeed exhibits sample-specific fine structure), and ρ is a prescribed risk budget. The
restriction to t ≤ tmin
formalizes the idea that extraction vulnerabilities are dominated by the
early reverse-time segment, where the learned score controls fine-scale
placement of probability mass.
Combining the fidelity, compute, and risk functionals, our formal problem is
$$
\min_{\mathbf{m}\in\mathcal{M}^K}\ \mathcal{F}(\mathbf{m})
\quad\text{subject to}\quad
\mathcal{C}(\mathbf{m})\le B,\qquad \mathcal{R}(\mathbf{m})\le\rho.
$$
The essential modeling choice is that the per-time estimators are
trained independently, so that the curves t ↦ εtest(t; m)
and t ↦ Mt(m)
are the only inputs required for the scheduling problem. In the binary class, these inputs
reduce to the endpoints m = 1
and m = ∞; in the
bounded-integer class, we will require a controlled interpolation or
upper bound for intermediate m. In the next section we therefore
turn to the asymptotic learning-curve formulas available at m ∈ {1, ∞} and to the construction
of usable surrogates for general m, which together render the scheduling problem algorithmically tractable.
The scheduling problem is algorithmically meaningful only insofar as we can
evaluate, for each time t, the
two primitives
m ↦ εtest(t; m) and m ↦ Mt(m),
at least for the values of m
allowed by the policy class. The analysis of the source supplies closed
asymptotic characterizations at the two endpoints m = 1 and m = ∞, which we treat as exact primitives throughout: for each fixed t, one obtains εtest(t; 1) from the ``single-noise'' learning curve (source Thm.~3.6) and $\varepsilon_{\mathrm{test}}(t;\infty)$ from the ``infinite-pool'' learning curve (source Thm.~3.2). Both are given in proportional
asymptotics by constant-size fixed-point systems for a small collection
of scalar order parameters (resolvent/traces of the random-features Gram
matrix and its ridge-regularized inverse), which depend on (ψn, ψp, λ),
on the activation 𝜚 through
its Hermite coefficients, and on t through the OU parameters (at, ht)
and the induced law Pt.
Concretely, at each (t, m) ∈ {tk} × {1, ∞},
the source reduces the DSM regression to a random-features ridge problem
with a teacher signal determined by the population score s⋆(t, ⋅) and an
effective label-noise component determined by the Monte Carlo
variability of the DSM target. In the m = ∞ case, this Monte Carlo
component vanishes and the regression is ``noise-free’’ relative to the
conditional expectation target; in the m = 1 case, it is maximal. The
resulting test error admits the usual decomposition
εtest(t; m) = εbias(t; m) + εvar(t; m),
where both terms are explicit functions of the fixed-point solution (and
thus can be evaluated without simulating high-dimensional random
matrices). An analogous statement holds for Mt(m):
in the source’s bias–variance decomposition, Mt(m)
is expressed in terms of the same order parameters together with an
overlap between the learned predictor and the empirical-score target
se(t, ⋅).
We emphasize that, for our scheduling purposes, it suffices that (εtest(t; 1), εtest(t; ∞), Mt(1), Mt(∞))
can be computed in O(1) time
per t up to numerical root
finding; thus, computing all endpoint curves over the grid costs O(K ⋅ Tsolve).
In the bounded-integer policy ℳ = {1, …, mmax} we additionally require a principled surrogate for εtest(t; m) and Mt(m) when 1 < m < ∞. The key modeling observation is that, at fixed t, using m independent noise draws per datum replaces the ideal conditional expectation in the DSM target by an average of m i.i.d. terms. Accordingly, conditional on the regressor (the diffused input), the deviation of the finite-m target from its m = ∞ counterpart is mean-zero with variance scaling as 1/m. This suggests treating the transition from m = ∞ to finite m as the introduction of an effective label-noise term with variance proportional to 1/m.
We therefore adopt the following interpolation rule, which is
compatible with standard random-features learning-curve formulas with
noisy labels: for each t we
posit an effective noise level σt2 such that $\varepsilon_{\mathrm{test}}(t;m) \approx \varepsilon^{(\mathrm{rf})}_{\mathrm{test}}\!\bigl(t;\sigma_t^2/m\bigr)$, where εtest(rf)(t; τ)
denotes the random-features ridge test error when the teacher target is
the m = ∞ DSM target and the
labels are corrupted by independent additive noise of variance τ. The map τ ↦ εtest(rf)(t; τ)
is explicit in the same order parameters as the endpoint formulas, with
τ entering as an additive
contribution to the effective noise term in the fixed-point system; in
particular, this interpolation rule recovers the endpoints by matching
εtest(t; ∞) = εtest(rf)(t; 0), εtest(t; 1) ≈ εtest(rf)(t; σt2).
In practice, we determine σt2
by solving the scalar equation εtest(rf)(t; σt2) = εtest(t; 1),
which is well-posed whenever τ ↦ εtest(rf)(t; τ)
is continuous and strictly increasing. This produces a one-parameter
family {εtest(t; m)}m ∈ [1, ∞]
consistent with the exact endpoints and with the 1/m Monte Carlo scaling. An entirely
analogous construction is used for Mt(m),
replacing the test-error functional by the mismatch functional in the noisy-label RF limit, $M_t(m) \approx M^{(\mathrm{rf})}_t\!\bigl(\sigma_t^2/m\bigr)$, again with σt2
fixed by matching Mt(1) (or, if
preferred, by matching εtest(t; 1) and
then predicting Mt(1) as a
consistency check).
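As an illustration of this calibration step, assuming access to a hypothetical callable `eps_rf_at_t` evaluating τ ↦ εtest(rf)(t; τ) from the fixed-point system, the scalar matching equation can be solved by standard bracketed root finding:

```python
from scipy.optimize import brentq

def calibrate_sigma2(eps_rf_at_t, eps_test_m1, tau_hi=1e3):
    """Solve eps_rf_at_t(sigma2) = eps_test_m1 for the effective label-noise variance at one time t.

    eps_rf_at_t : hypothetical callable tau -> RF ridge test error under additive label
                  noise of variance tau (assumed continuous and strictly increasing)
    eps_test_m1 : exact m = 1 endpoint value at this t
    """
    f = lambda tau: eps_rf_at_t(tau) - eps_test_m1
    # the bracket [0, tau_hi] must straddle the root under the monotonicity assumption
    return brentq(f, 0.0, tau_hi)
```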
When we do not wish to re-solve fixed-point equations for each
intermediate m, we may instead
use a cheaper, purely endpoint-based surrogate. A convenient choice is
interpolation in the inverse-multiplicity variable,
$$
\varepsilon_{\mathrm{test}}(t;m) \;\le\; \theta_m\,\varepsilon_{\mathrm{test}}(t;1) + (1-\theta_m)\,\varepsilon_{\mathrm{test}}(t;\infty),
\qquad \theta_m := 1/m,
$$
and similarly for Mt(m). This inequality can be justified as a conservative upper bound whenever the
learning curve is convex in the label-noise variance (a property that
holds for many ridge-type risks as functions of an additive noise
parameter), and it is exact at the endpoints. This bound suffices for
designing schedules that are robust to interpolation error, since it
never overstates the benefit of increasing m.
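A minimal sketch of this endpoint-only surrogate, with the plain θm = 1/m rule as the default and an optional exponent αt as in the empirical calibration described next:

```python
def eps_surrogate(eps_m1, eps_minf, m, alpha=1.0):
    """Endpoint-based surrogate for eps_test(t; m) at one time t.

    theta_m = m**(-alpha); alpha = 1 gives the plain 1/m rule.  The convex combination
    is exact at the endpoints and a conservative upper bound whenever the learning
    curve is convex in the label-noise variance.
    """
    theta = float(m) ** (-alpha)
    return theta * eps_m1 + (1.0 - theta) * eps_minf
```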
Because the preceding interpolation rules abstract away architectural and optimization effects (we assume global minimizers and the RF limit), we recommend a lightweight calibration step when deploying the schedule empirically. Fix a small subset of times {tkℓ}ℓ = 1L (e.g. logarithmically spaced, with extra density near t ≤ tmin). For each selected time, train the RFNN score model with multiplicities in a small set {1, 2, 4, …, mcal}, estimate εtest(tkℓ; m) on held-out diffused samples, and fit σtkℓ2 (or, in the endpoint-only rule, fit an exponent αt in θm = m−αt). Interpolate σt2 across t (e.g. by a spline in t or in ht), and then evaluate the calibrated curves εtest(t; m) and Mt(m) for all grid points. This produces calibrated curves that preserve the endpoint analytic structure while tracking finite-d effects.
Having thus reduced εtest(t; m) and Mt(m) to explicit, evaluable functions of t and m, we next study their dependence on m across diffusion time and across regimes of ψp/ψn. These monotonicity properties, proved from the endpoint formulas and transported to intermediate m through the interpolation/bounding rules above, are the mechanism by which the scheduling problem collapses to a greedy/threshold form.
Fix a diffusion time t and view the DSM regression at this time as a random-features ridge problem with a teacher function equal to the conditional expectation target (the m = ∞ DSM target) and with an additional label-noise component induced by finite Monte Carlo averaging. Under this viewpoint, increasing m has two competing effects. First, it reduces the effective label-noise variance (scaling as 1/m), which improves the regression when the estimator is variance-limited. Second, it forces the estimator toward the empirical conditional expectation associated with the training set, which can hurt generalization when the model is sufficiently expressive to lock onto sample-specific structure (memorization). The sign of the net effect is determined by the regime ψp/ψn together with the diffusion time t (which controls the signal-to-noise and the complexity of the score).
To formalize, we define the endpoint gaps
Δε(t) := εtest(t; ∞) − εtest(t; 1), ΔM(t) := Mt(∞) − Mt(1).
The quantity Δε(t)
measures whether increasing m
from 1 to ∞ helps (Δε(t) ≤ 0)
or hurts (Δε(t) ≥ 0)
fidelity at time t, while
ΔM(t)
encodes whether increasing m
pulls the learned score toward the empirical-optimal score se(t, ⋅)
(namely ΔM(t) ≤ 0)
or away from it.
We now isolate a sufficient condition ensuring that, for small diffusion times, increasing m worsens test error while decreasing mismatch to the empirical score. The condition is expressed at the endpoints m ∈ {1, ∞}, where the source provides closed-form proportional-asymptotic formulas. The relevant asymptotic is t ↓ 0, for which at = 1 + O(t) and ht = 2t + O(t2), hence Pt is a small Gaussian smoothing of P0. In this regime, the population score s⋆(t, ⋅) = ∇log Pt typically exhibits high curvature at the scale $\sqrt{h_t}$; correspondingly, the empirical score se(t, ⋅) behaves like the score of an ht-bandwidth KDE supported on the training set. When ψp > ψn, the RFNN has enough degrees of freedom (in the random-features limit) to fit sample-specific structure, so moving closer to se can harm generalization.
We compare the endpoint learning-curve systems at m = 1 and m = ∞ by expanding the fixed-point order parameters in the small-ht limit. In the m = ∞ system, the regression target is the conditional expectation, hence the effective label noise term vanishes; in the m = 1 system, an additional noise term persists, entering additively into the variance component of the risk. In the overparameterized regime, the ridge-regularized resolvent exhibits a variance amplification effect associated with near-interpolation (precisely: the variance term is decreasing in the label-noise level only beyond a critical effective dimension, and increasing when ψp exceeds ψn). At small t, the teacher component aligned with low-order Hermite modes is weak relative to the high-frequency component induced by the narrow smoothing (equivalently, the signal-to-noise ratio in the random-features channel is low). Consequently, removing label noise (passing from m = 1 to m = ∞) increases the estimator’s effective degrees of freedom devoted to fitting sample-specific fluctuations, yielding Δε(t) ≥ 0. Meanwhile, because the m = ∞ objective matches the empirical conditional expectation without Monte Carlo perturbation, the learned predictor is closer to the empirical score target se, yielding ΔM(t) ≤ 0. A sign analysis of the resulting expansions determines a positive t0 for which both inequalities hold.
The small-t deterioration is not global in t. As t increases, the OU forward diffusion drives Pt toward an approximately Gaussian law with score close to the linear function x ↦ −x (more precisely, higher-order components decay as at shrinks). In this regime the teacher is effectively low-complexity and aligns strongly with the leading Hermite components of the random-features map; regression becomes signal-dominated, and reducing label noise by increasing m improves test error. Thus, still under ψp > ψn, there exists a (possibly distinct) t1 ∈ (0, T] such that for all t ∈ [t1, T] one has Δε(t) ≤ 0. We emphasize that t0 and t1 need not coincide; the intermediate region is not assumed to be monotone without further conditions.
When ψp < ψn, the RFNN class is not rich enough (in the random-features limit) to interpolate the training constraints, and the dominant effect of increasing m is variance reduction in the DSM target. In this regime the learning-curve formulas imply that εtest(t; m) is nonincreasing in the label-noise variance parameter, and hence nonincreasing in 1/m; at the endpoints this yields Δε(t) ≤ 0 for all t ∈ (0, T]. Simultaneously, because the estimator cannot track the empirical score even when m = ∞, its proximity to se does not improve with m in the same way; indeed the endpoint comparison yields ΔM(t) ≥ 0 in this regime, reflecting that the m = 1 objective’s additional noise acts as an implicit regularizer toward population structure rather than toward the empirical score.
The preceding comparisons are stated at m ∈ {1, ∞} because those values admit exact asymptotic characterization. For bounded integer policies, we propagate the endpoint monotonicity through the 1/m noise-variance scaling surrogate: in the overparameterized small-t window, εtest(t; m) is increasing in m while Mt(m) is decreasing in m; in the underparameterized regime, εtest(t; m) is decreasing in m for all t. This is the structural input needed in the next section, where we convert the constrained schedule search into a greedy/threshold allocation over time bins.
Fix the discretization {tk, Δtk}k = 1K
and regard the schedule m = (m1, …, mK)
as the sole decision variable. Since each time bin is trained
independently in our stylized model, both the fidelity objective and the constraints are separable across bins:
$$
\mathcal{F}(\mathbf{m})=\sum_{k=1}^K \Delta
t_k\,\varepsilon_{\mathrm{test}}(t_k;m_k),\qquad
\mathcal{R}(\mathbf{m})=\sum_{k:t_k\le t_{\min}} \Delta
t_k\,\mathsf{Mem}(t_k;m_k),\qquad
\mathcal{C}(\mathbf{m})=\sum_{k=1}^K \Delta t_k\,m_k.
$$
Thus the scheduling problem is a resource allocation over bins with
per-bin ``value'' given by test-error reduction and per-bin ``weight'' given by compute, coupled to an additional early-time risk budget.
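For concreteness, these three functionals can be evaluated for any candidate schedule by direct summation; a minimal sketch, with `eps_test` and `mem` as assumed callables returning the per-time primitives:

```python
def schedule_functionals(m, t, dt, eps_test, mem, t_min):
    """Evaluate the separable fidelity/risk/compute functionals for a schedule m.

    m, t, dt : length-K lists of multiplicities, bin times, and quadrature weights
    eps_test : assumed callable (t_k, m_k) -> asymptotic DSM test error
    mem      : assumed callable (t_k, m_k) -> memorization-risk proxy Mem(t_k; m_k)
    """
    K = len(m)
    F = sum(dt[k] * eps_test(t[k], m[k]) for k in range(K))
    R = sum(dt[k] * mem(t[k], m[k]) for k in range(K) if t[k] <= t_min)
    C = sum(dt[k] * m[k] for k in range(K))
    return F, R, C
```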
We first consider the binary class mk ∈ {1, ∞}.
Define the per-bin fidelity gain from upgrading bin k,
gk := Δtk(εtest(tk; 1) − εtest(tk; ∞)),
and the associated early-time risk increment
rk := 1{tk ≤ tmin} Δtk(Mem(tk; ∞) − Mem(tk; 1)).
Upgrading bin k from 1 to ∞
changes (ℱ, ℛ, 𝒞) by (−gk, +rk, +Δtk(∞ − 1)).
In particular, when tk > tmin
the risk increment is negligible (by construction of ℛ), while in the small-noise window and in
the regime ψp > ψn
the monotonicity input implies rk ≥ 0 (higher
m increases memorization
risk). Hence, for ψp > ψn,
upgrades at early times are doubly disfavored: they consume risk budget
and, in the small-t window
where Δε(t) ≥ 0,
they also tend to have gk ≤ 0.
We use an exchange argument. Consider any feasible schedule that upgrades some k with tk ≤ tmin. By (i) this weakly increases ℛ and by the small-t deterioration input it does not improve ℱ (equivalently, gk ≤ 0 in the relevant window). Reverting that bin to mk = 1 weakly decreases both ℛ and ℱ while saving compute, so no optimum upgrades early bins. After fixing mk = 1 on {tk ≤ tmin}, constraint ℛ ≤ ρ becomes irrelevant for the remaining bins by (ii), and the problem reduces to selecting upgrades over tk > tmin to maximize total gain ∑k ∈ Sgk under a compute budget. Under uniform costs this is solved by sorting gk and taking the largest gains; with per-bin costs one obtains the standard greedy-by-efficiency solution under separability, as implemented in Algorithm .
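A minimal sketch of the resulting threshold/greedy rule in the binary class, with ∞ replaced by a large finite `m_big` as discussed above and `eps1`, `eps_inf` the precomputed endpoint curves; this is a stylized implementation under uniform per-bin costs:

```python
def binary_schedule(t, dt, eps1, eps_inf, t_min, budget, m_big=64):
    """Threshold/greedy schedule over the binary class, with m_big standing in for 'infinity'.

    eps1, eps_inf : length-K endpoint curves eps_test(t_k; 1) and eps_test(t_k; inf)
    budget        : bound on the compute proxy sum_k dt_k * m_k
    Keeps m_k = 1 on the small-noise window t_k <= t_min and upgrades later bins in
    descending order of fidelity gain g_k = dt_k * (eps1_k - eps_inf_k).
    """
    K = len(t)
    m = [1] * K
    cost = sum(dt)                                        # cost of the all-ones baseline
    gains = sorted(((dt[k] * (eps1[k] - eps_inf[k]), k) for k in range(K) if t[k] > t_min),
                   reverse=True)
    for g, k in gains:
        if g <= 0:
            break                                         # remaining bins cannot improve fidelity
        extra = dt[k] * (m_big - 1)
        if cost + extra <= budget:
            m[k], cost = m_big, cost + extra
    return m
```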
Now let mk ∈ {1, …, mmax}.
The scheduling problem becomes a (two-budget) integer program. To retain
tractability we invoke the same structural surrogate used to transport
endpoint monotonicity: for fixed t, both εtest(t; m)
and Mem(t; m) vary
smoothly with the effective label-noise variance ∝ 1/m. In particular, for t > tmin we typically observe diminishing returns in fidelity as m increases, i.e. the marginal gain
δk(m) := Δtk(εtest(tk; m) − εtest(tk; m + 1))
is nonincreasing in m. Under
such discrete concavity, a standard priority-queue greedy algorithm that
repeatedly allocates one additional noise draw to the bin with largest
current δk(m)/Δtk
yields a constant-factor approximation to the compute-constrained
optimum and is exact in the continuous relaxation (water-filling/KKT
form). The risk constraint is handled by forbidding increments on tk ≤ tmin
once the running sum ∑tk ≤ tminΔtk Mem(tk; mk)
reaches ρ; in the regime ψp > ψn
this typically enforces mk near its
minimum on early bins, while allowing aggressive growth for tk > tmin.
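A minimal sketch of this priority-queue greedy, again with `eps_test` and `mem` as assumed callables; each bin keeps a single live queue entry that is refreshed after every granted increment, so the rule relies on the diminishing-returns assumption for near-optimality:

```python
import heapq

def greedy_integer_schedule(t, dt, eps_test, mem, t_min, budget, rho, m_max):
    """Priority-queue greedy for bounded integer multiplicities m_k in {1, ..., m_max}.

    Repeatedly grants one extra noise draw to the bin with the largest marginal
    fidelity gain per unit compute, refusing increments that would exceed the
    compute budget or the early-time memorization-risk budget rho.
    """
    K = len(t)
    m = [1] * K
    cost = sum(dt)                                                   # all-ones baseline cost
    risk = sum(dt[k] * mem(t[k], 1) for k in range(K) if t[k] <= t_min)

    def gain_density(k):                                             # delta_k(m_k) / dt_k
        return eps_test(t[k], m[k]) - eps_test(t[k], m[k] + 1)

    heap = [(-gain_density(k), k) for k in range(K)]
    heapq.heapify(heap)
    while heap:
        neg_g, k = heapq.heappop(heap)
        if -neg_g <= 0 or m[k] >= m_max or cost + dt[k] > budget:
            continue                                                 # unprofitable or infeasible
        if t[k] <= t_min:
            extra_risk = dt[k] * (mem(t[k], m[k] + 1) - mem(t[k], m[k]))
            if risk + extra_risk > rho:
                continue                                             # would violate the risk budget
            risk += extra_risk
        m[k] += 1
        cost += dt[k]
        if m[k] < m_max:
            heapq.heappush(heap, (-gain_density(k), k))              # schedule the next increment
    return m
```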
In the binary class, the value function
ℱ⋆(B, ρ) := min {ℱ(m) : m ∈ {1, ∞}K, 𝒞(m) ≤ B, ℛ(m) ≤ ρ}
admits an explicit characterization under the hypotheses of Theorem~.
First, feasibility at a given ρ requires that the unavoidable
baseline risk ∑tk ≤ tminΔtk Mem(tk; 1)
does not exceed ρ. Conditional
on feasibility, the optimal schedule fixes mk = 1 for tk ≤ tmin
and spends compute only on tk > tmin,
upgrading bins in descending order of gk.
Consequently, as B increases,
ℱ⋆(B, ρ)
decreases by successive addition of the gains gk from the
sorted list, yielding a piecewise-linear, monotone frontier in B whose breakpoints occur at budgets
where an additional bin can be upgraded. In the continuous-time limit
(small maxkΔtk),
this becomes an integral threshold rule in t determined by a Lagrange
multiplier: we allocate high m
precisely on those times for which the instantaneous gain density
exceeds the multiplier, while enforcing m ≈ 1 in the small-noise window to
satisfy the privacy/memorization budget.
Our upper-bound schedules are guided by the small-noise monotonicity phenomenon: in the overparameterized regime, increasing noise multiplicity drives the learned score closer to the empirical optimal score se(t, ⋅) at small t, which is precisely the direction in which memorization-like effects worsen. We now record a complementary necessity statement showing that, once tmin lies inside the small-noise window from Theorem~1, any schedule that assigns large multiplicity to a nonnegligible portion of early times must incur a nontrivial memorization-risk cost.
To make the statement concrete, fix a threshold m0 ≥ 2 and write E(m; m0) := {k : tk ≤ tmin, mk ≥ m0} for the set of early bins that are ``high multiplicity.’’ Assume the proxy is of the form Mem(t; m) = ϕ(Mt(m)) with ϕ nonincreasing, so that shrinking Mt(m) (moving toward se) increases memorization risk. Define the early-time mass α(m; m0) := ∑k ∈ E(m; m0)Δtk.
The theorem formalizes the intuitive constraint that, in ψp > ψn,
one cannot ``spend'' substantial multiplicity at early diffusion times without paying a risk penalty: the best one can do under a fixed $\rho$ is to confine high multiplicity to a small-measure subset of $[0,t_{\min}]$. In the binary class $m\in\{1,\infty\}$ one may take $m_0=\infty$; the relevant gap is then $M_t(1)-M_t(\infty)=-\Delta_M(t)$, which is strictly positive for small $t$ under Theorem~1(i) unless the estimator already coincides with the empirical score. For bounded integers, one may view $m_0$ as the minimal level at which ``near-empirical''
behavior emerges; the conclusion then explains why a risk-feasible
optimizer places the majority of its compute in bins with t > tmin even
when the compute objective alone would encourage increasing m everywhere.
The preceding sections derive tractable, nearly closed-form schedules by exploiting diffusion-induced structure. Without such structure, even our separable formulation becomes computationally hard once one permits arbitrary per-bin responses to multiplicity. We record a standard reduction to emphasize that the present tractability is not generic.
Given a knapsack instance with items (vk, wk), we construct K bins and stipulate Δtk so that the compute increment equals ck = wk and the fidelity improvement equals gk = vk. Feasible schedules correspond bijectively to subsets of items, and achieving gain ≥ G corresponds to attaining knapsack value ≥ G. Since knapsack is NP-complete, the decision version of the schedule-selection problem is NP-complete.
The preceding reduction requires that per-bin gains and costs be freely specifiable. In our DSM setting, neither is arbitrary. First, costs are essentially proportional to Δtk (or to a fixed number of noise evaluations per bin), which eliminates the item-specific weights that drive knapsack hardness. Second, and more importantly, the learning-curve formulas impose a consistent sign structure: in ψp > ψn and for t ≤ tmin, increasing m both worsens test error (Theorem~1(ii)) and increases memorization risk via proximity to se (Theorem~1(i) combined with ϕ nonincreasing). For t > tmin, the risk term is by definition inactive and the gain gk is typically nonnegative (large-noise bins behave like well-specified regression onto an approximately linear score). This sign pattern is exactly what enables the exchange argument behind Theorem~: any upgrade at t ≤ tmin is dominated by either doing nothing or reallocating compute to a later bin.
Thus, our scheduling problem sits in a regime where the apparent
combinatorial complexity collapses to a threshold/greedy allocation
because the diffusion dynamics supply an ordering in which
``early'' decisions are uniformly discouraged by the risk budget, while ``late''
decisions are uniformly rewarded in fidelity. The lower bound of
Theorem~ and the hardness statement of Proposition~ together delineate
the scope of the result: small-t throttling is enforced not only by
the structure of optimal schedules but also by an intrinsic risk
penalty, and algorithmic tractability relies on the monotone
diffusion-time geometry rather than on any generic property of separable
integer programs.
The optimization problem produces a discrete schedule m = (m1, …, mK) intended to control, at each tk, the number of Gaussian perturbations paired with each training datum. In code, however, DSM training is typically organized around mini-batches, shuffling, and often a continuous-time sampling of t. We therefore require an implementation that (i) enforces a target multiplicity mk per (datum, time-bin) pair, (ii) respects the total budget 𝒞(m) ≈ ∑kmkΔtk, and (iii) permits post hoc auditing of the multiplicity actually realized.
If the model samples t ∼ Unif[0, T] (or a nonuniform density) at each iteration, we implement our schedule by introducing a binning map b(t) ∈ {1, …, K} and replacing m(t) by mb(t). Concretely, one may either (a) sample k directly with probability proportional to Δtk and then set t = tk (standard in discretized diffusion training), or (b) sample t continuously and then apply k = b(t) only for multiplicity control, keeping the original t for the target/weighting. The former matches our analysis most closely; the latter is sometimes preferable when the loss uses continuous-time weights, but one must then track compute and risk by bins.
Fix a time bin k. For each
datum index i, we aim to
restrict the set of noises {zi, j, k}j = 1mk
used to form noised inputs
$$
x_{i,j,k}^{(t_k)} \;:=\; a_{t_k}x_i + \sqrt{h_{t_k}}\,z_{i,j,k},
\qquad z_{i,j,k}\sim\mathcal{N}(0,I_d).
$$
A direct method is to associate to each pair (i, k) a noise pool of size mk consisting of seeds {σi, j, k}j = 1mk;
when the pipeline requests a noise sample for (i, k), we select j uniformly and generate z deterministically from σi, j, k.
This yields an exact multiplicity cap: across all epochs, no more than
mk
distinct z values are ever
paired with xi at time tk. The binary
policy is a special case: mk = 1 means a
singleton pool (full reuse), while mk = ∞ is
approximated by omitting the pool (fresh z each call) or by taking a very
large mmax.
In large-scale training, explicitly storing σi, j, k for all (i, k) is infeasible. A practical compromise is to store pools only for the bins where reuse is desired (typically tk ≤ tmin), and generate fresh noise elsewhere. Even in the worst case, the early-time bins are a small subset, so the memory footprint is manageable. When per-example indices i are not stable (e.g. streaming datasets), we may define i as a stable hash of the sample identifier (file path, shard offset, etc.), which suffices to make reuse reproducible.
When storage is still undesirable, we may emulate pools via a keyed
pseudorandom function. For mk = 1,
define
zi, k = Gauss(Hash(key, i, k)),
where Gauss(⋅) maps a seed to a
Gaussian vector (e.g. by a counter-based RNG). This enforces reuse
without storing anything. For general integer mk, we replace
k by (k, j) with j ∈ {1, …, mk}
and select j by a
deterministic but varying rule, e.g. j = 1 + (Hash(key′, step, i, k) mod mk).
This guarantees that the realized set of z values per (i, k) is contained in a
set of size mk, while still
permitting randomized mixing across steps.
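A minimal sketch of such a keyed construction, using SHA-256 as the hash and NumPy's seeded generator in place of a dedicated counter-based RNG; `key`, `sample_id`, and `step` are assumed identifiers supplied by the training loop:

```python
import hashlib
import numpy as np

def pooled_noise(key, sample_id, k, m_k, step, d):
    """Deterministic noise reuse for the pair (sample_id, time bin k) with pool size m_k.

    The pool member j is selected by a step-dependent hash, so across all training
    steps the realized set of noise vectors for (sample_id, k) contains at most m_k
    distinct values; m_k = 1 reproduces full reuse, and fresh noise is recovered by
    bypassing this function entirely.
    """
    digest = lambda s: int(hashlib.sha256(s.encode()).hexdigest(), 16)
    j = 1 + digest(f"{key}|step|{step}|{sample_id}|{k}") % m_k      # which pool member to use
    seed = digest(f"{key}|pool|{sample_id}|{k}|{j}") % (2 ** 63)    # seed identifying that member
    rng = np.random.default_rng(seed)                               # seeded generator
    return rng.standard_normal(d)
```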
Our compute proxy counts noise evaluations per datum, whereas wall-clock time in modern diffusion training is dominated by network forward/backward passes. We therefore implement the schedule by holding the number of optimizer steps and batch size fixed and altering only whether the same (xi, tk) pair is repeatedly coupled to identical or fresh noise. This makes comparisons at fixed compute meaningful when the noise generation is nontrivial (e.g. when augmentation/noising is expensive) and, even when it is not, isolates the effect of reuse. If one insists that 𝒞(m) track wall-clock time, a closer proxy is to keep the number of noised pairs $(x_{i,j,k}^{(t_k)}, t_k)$ per epoch proportional to mkΔtk, which can be realized by oversampling time bins with larger mk or by forming within-batch replicates of each example at the selected tk.
Given implementation details, we recommend explicitly measuring the
realized multiplicity. For each bin k, define the empirical effective
multiplicity
$$
\widehat m_k \;:=\; \frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}
\Big|\big\{ z \ \text{used with}\ (i,k)\ \text{during
training}\big\}\Big|,
$$
where ℐ is a monitored subset of sample
indices and the set cardinality counts distinct RNG seeds (or distinct
noise hashes) observed. In practice we log, for each monitored (i, k), a Bloom filter or a
small hash set of seen seeds, updated online with negligible overhead.
One should then verify that m̂k ≈ 1 for the
bins designated as ``reuse'' and that $\widehat m_k$ grows linearly with training steps for the bins designated as ``fresh noise'' (until limited by any imposed mmax).
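A minimal sketch of the online audit, tracking distinct seed hashes per monitored (i, k) with plain sets rather than Bloom filters, which suffices for a small monitored subset:

```python
from collections import defaultdict

class MultiplicityAuditor:
    """Tracks distinct noise seeds observed per monitored (sample_id, bin) pair."""

    def __init__(self, monitored_ids):
        self.monitored = set(monitored_ids)
        self.seen = defaultdict(set)                     # (sample_id, k) -> set of seed hashes

    def record(self, sample_id, k, seed_hash):
        """Call once per training-time noise draw for monitored samples."""
        if sample_id in self.monitored:
            self.seen[(sample_id, k)].add(seed_hash)

    def effective_multiplicity(self, k):
        """Empirical hat{m}_k, averaged over monitored samples that visited bin k."""
        counts = [len(s) for (i, b), s in self.seen.items() if b == k]
        return sum(counts) / len(counts) if counts else 0.0
```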
A second audit is to check that the realized time-bin occupancy matches the intended quadrature weights: letting Nk be the number of training calls in which b(t) = k, we expect Nk/∑ℓNℓ ≈ Δtk/∑ℓΔtℓ (or the chosen time-sampling density). This is essential: an apparently correct m̂k is irrelevant if the pipeline almost never visits bin k.
We have found the following patterns robust: (i) enforce deterministic mk = 1 reuse for tk ≤ tmin via hash-based noise; (ii) use unconstrained fresh noise for tk > tmin; (iii) keep the optimizer step count fixed across schedules and report both (m̂k)k and (Nk)k alongside fidelity metrics. This aligns the code-level behavior with the mathematical object m, and it prevents accidental ``leakage'' of high multiplicity into the small-noise window via, e.g., nondeterministic data augmentations or per-worker RNG divergence.
Our theory predicts that the noise-multiplicity schedule governs a privacy–fidelity tradeoff: in the overparameterized regime one should enforce strong reuse (small mk) in the small-noise window, while allocating larger mk to moderate/large-noise times to improve fidelity. We therefore recommend experiments in which the intervention is the schedule m = (m1, …, mK) used to generate the Gaussian perturbations paired with each training datum at each time bin. All other degrees of freedom—architecture, optimizer, learning-rate schedule, augmentations, time-sampling distribution, number of steps, and EMA settings—are held fixed.
We suggest two families to test generality: (i) a convolutional U-Net diffusion model on CIFAR-10 and/or ImageNet-64, and (ii) a transformer-based DiT-style model on ImageNet-64 or ImageNet-128 (compute permitting). These choices expose different inductive biases and parameterizations while keeping evaluation and training pipelines standardized. For each setup we fix the discretization {tk}k = 1K (or the corresponding continuous-time sampler with a binning map), and we record the realized bin occupancies Nk to ensure that the effective quadrature weights match the intended Δtk.
We recommend comparing at least the following schedules: uniform reuse (mk ≡ 1), uniform fresh noise (mk effectively unbounded), and the throttled schedule that enforces reuse only on tk ≤ tmin while refreshing noise elsewhere. When reporting, we include the audited effective multiplicities m̂k (on a monitored subset of examples) to verify that the implementation matches the intended schedule.
We propose two complementary protocols. In the first, we fix the number of optimizer steps and batch size across schedules, so that wall-clock compute is matched and the only change is the statistical effect of reusing versus refreshing noise. In the second, we explicitly match the proxy 𝒞(m) by adjusting the number of calls per bin: for example, oversample time bins with larger mk so that the total number of noised pairs per datum scales like ∑kmkΔtk. The two protocols answer different questions: the first isolates the privacy effect of reuse under standard training budgets; the second more directly instantiates the compute-constrained optimization problem.
We recommend standard unconditional generation metrics: FID and Inception Score for CIFAR-10, and FID and precision–recall for ImageNet resolutions. When feasible, we also report likelihood proxies (e.g. bits-per-dimension via a variational bound) and calibration metrics for classifier-free guidance sweeps, as reuse schedules may interact with guidance at sampling time. To connect to the time-resolved nature of the analysis, we additionally log per-bin denoising losses (or DSM losses) at evaluation time: for each k we estimate 𝔼∥ŝtk(x(tk)) − s⋆(tk, x(tk))∥2 using a held-out validation set and fresh evaluation noise. Although s⋆ is unavailable, one may use the standard DSM target estimator (fresh Monte Carlo at evaluation) as a consistent proxy; the important point is that evaluation uses fresh noise for all schedules.
A primary privacy diagnostic is nearest-neighbor (NN) retrieval between generated samples and the training set. For each generated image x̃ we compute its nearest neighbor in a feature space (e.g. CLIP or Inception embeddings) and report the distribution of distances, as well as the fraction of generations whose NN distance falls below a fixed threshold calibrated on validation data. We also report qualitative galleries of (x̃, NN(x̃)) pairs for the tail of closest matches. The theoretical expectation is that schedules with larger effective multiplicity at small t (closer to mk = ∞ for tk ≤ tmin) will produce more near-copies of training points in the overparameterized regime, while the throttled/optimized schedules suppress this phenomenon without sacrificing late-time fidelity.
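A minimal sketch of the NN-retrieval diagnostic, operating on precomputed feature embeddings; the chunking only bounds memory:

```python
import numpy as np

def nn_retrieval(gen_emb, train_emb, threshold, chunk=256):
    """Nearest-neighbor retrieval diagnostic in a feature embedding space.

    gen_emb   : (G, D) embeddings of generated samples (e.g. CLIP/Inception features)
    train_emb : (N, D) embeddings of training samples
    Returns (fraction of generations with NN distance below threshold, all NN distances).
    """
    mins = []
    for start in range(0, len(gen_emb), chunk):          # chunking only bounds memory
        g = gen_emb[start:start + chunk]                  # (c, D)
        d2 = ((g[:, None, :] - train_emb[None, :, :]) ** 2).sum(-1)
        mins.append(np.sqrt(d2.min(axis=1)))
    mins = np.concatenate(mins)
    return float((mins < threshold).mean()), mins
```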
We recommend membership inference (MI) evaluations tailored to diffusion models. A simple and reproducible attack uses the model’s denoising loss as a score: given a candidate example x, sample times t (with emphasis on t ≤ tmin) and fresh noises z, and compute the average squared error between the model prediction and the DSM target; the attacker classifies x as a member if this loss is unusually small relative to a calibration set. Stronger attacks train a logistic regressor on a vector of per-time losses {ℓ(tk; x)}k ≤ kmin using shadow models. We report the ROC-AUC and true-positive rate at fixed false-positive rate. We expect, consistent with the small-t monotonicity, that increasing multiplicity in the small-noise window improves the model’s fit to training idiosyncrasies and hence increases MI advantage, while allocating multiplicity to larger times yields smaller MI gains at comparable fidelity improvements.
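A minimal sketch of the loss-based attack score, with `model_loss` an assumed callable returning the squared DSM error for one fresh noise draw; thresholding against a calibration set and shadow-model variants are omitted:

```python
import numpy as np

def mi_score(model_loss, x, times, n_noise=16, rng=None):
    """Loss-based membership-inference score for one candidate example x.

    model_loss : assumed callable (x, t, z) -> squared error between the model's
                 prediction and the DSM target for a single fresh noise draw z
    times      : probe times, with emphasis on t <= t_min
    A higher score (smaller average denoising loss) is evidence of membership; the
    decision threshold is calibrated on known non-members.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[-1]
    losses = [model_loss(x, t, rng.standard_normal(d)) for t in times for _ in range(n_noise)]
    return -float(np.mean(losses))
```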
To map an empirical compute–privacy frontier, we sweep (B, ρ) (or, operationally, sweep tmin and the measure of bins upgraded to large mk) and plot fidelity metrics versus MI AUC and NN-retrieval rates. We recommend multiple random seeds and reporting mean±standard error. Finally, to connect with the role of overparameterization, we vary width (or parameter count) to move between ψp > ψn and ψp < ψn qualitatively, and we test whether the sign of the small-t effect reverses as predicted: in an underparameterized setting, higher multiplicity should monotonically improve test metrics with weaker memorization signatures, whereas in an overparameterized setting the benefit of high multiplicity should concentrate at larger t and come with increased small-t privacy risk.
Our scheduling conclusions are derived in a deliberately stylized, time-separable training model: for each tk we train an independent RFNN estimator ŝk to the global minimizer of a regularized DSM objective with multiplicity mk, and we evaluate its asymptotic test curve εtest(tk; mk) together with the mismatch Mtk(mk) to the empirical score. This abstraction is appropriate for isolating the statistical role of noise reuse, but it ignores algorithmic coupling across times (shared parameters, shared optimizer state) and across data points (finite-batch SGD correlations). Consequently, our formal optimality statements for the binary policy class should be interpreted as optimality within this stylized regime; transferring them to end-to-end diffusion training requires additional arguments controlling how far a practical training pipeline deviates from this separable limit.
Modern diffusion models use a single time-conditioned network sθ(t, x) trained on a mixture over t, so the effective estimator at time tk depends on the entire schedule m, not only on mk. In this setting the objective is no longer a sum of independent per-bin risks, and the scheduling problem may cease to reduce to a threshold/greedy selection even under monotonicity at the endpoints m ∈ {1, ∞}. Nevertheless, we expect the qualitative structure to persist when (i) the network has sufficient capacity to fit each time slice with limited interference and (ii) the time-sampling distribution makes the per-time gradients approximately orthogonal. A concrete route to formalization is to study a linearized (NTK) or random-features approximation for a feature map Φ(t, x), in which the Gram matrix acquires a block structure indexed by time. One would then replace the scalar learning-curve equations by a coupled fixed-point system whose off-diagonal blocks quantify time interference; small off-diagonal norms should recover our separable formulas perturbatively, yielding stability of the optimal schedule to moderate coupling.
Our analysis privileges an OU forward process and targets P0 that are Gaussian or close to Gaussian in the sense required by the source’s Hermite/rotational tools. Real data are neither Gaussian nor full-dimensional, and many pipelines employ variance-preserving or variance-exploding SDEs with nontrivial β(t). We view OU as a mathematically controlled proxy capturing the key small-noise phenomenon: for t ↓ 0, the conditional distribution Pt(⋅ ∣ x0) concentrates near x0, so the empirical score becomes increasingly sensitive to individual training points, and overparameterized estimators can track this sensitivity when the Monte Carlo target variance is sufficiently reduced. Extending theorems such as the small-t monotonicity to non-Gaussian P0 plausibly requires two ingredients: (i) local asymptotics of se(t, ⋅) as a smoothed empirical measure for general P0 (possibly supported near a manifold) and (ii) learning-curve control for RF/NTK regression with non-isotropic covariates induced by Pt. A particularly relevant intermediate case is a subspace Gaussian model, where P0 is Gaussian on a low-dimensional subspace plus ambient noise; here one can still hope to obtain explicit equations and to identify when the memorization window is governed by the intrinsic dimension rather than d.
While the binary class mk ∈ {1, ∞} exposes the clearest structure, practical training uses bounded integer multiplicities and complex reuse patterns (e.g. caching a finite pool of noises per datum). Our bounded-mmax variant assumes that εtest(t; m) and Mt(m) can be interpolated between endpoints in a manner respecting diminishing returns. Establishing such an interpolation from first principles is nontrivial: the Monte Carlo target variance scales like 1/m, but the induced effective noise in the regression problem depends on both t (through at, ht) and on the estimator class (through resolvents of random feature matrices). A natural next step is to prove explicit monotonicity and concavity properties of the learning curves as functions of 1/m in the relevant regimes, which would justify greedy incremental allocation (priority-queue style) with provable approximation guarantees.
Classifier-free guidance and related sampling heuristics change the effective reverse drift, often by amplifying score magnitudes in regions where the conditional and unconditional scores differ. Because our fidelity proxy ℱ(m) is tied to unconditional DSM test error, it may not predict guidance-conditioned artifacts or privacy leakage under aggressive guidance scales. Moreover, guidance tends to emphasize high-frequency details, which are precisely the aspects most susceptible to small-t memorization. A more complete account would incorporate (i) conditional scores, (ii) the guidance transformation, and (iii) a privacy metric defined on the output distribution. Analytically, this suggests studying how perturbations of the score field (in L2(Pt) or stronger norms) are magnified by the discretized reverse solver under guidance, and whether the amplification is concentrated at early times.
We have used the standard inequality relating KL(P0∥P̂0) to ∫0Tεtest(t; m(t)) dt as a motivation for ℱ(m). Turning this into a fully end-to-end guarantee for a discrete-time sampler requires additional control of (i) discretization error in time, (ii) numerical integration error in the reverse SDE/ODE solver, (iii) the approximation gap between the trained network and the population regression function under finite optimization, and (iv) the stability of the reverse-time flow to localized score errors. Each term is in principle tractable, but their combination is delicate: small errors at early times can propagate through the flow map, and the relevant stability constants may depend on tail behavior of Pt and on Lipschitz properties of the score. Thus, a complete theorem would likely assume quantitative regularity of Pt (e.g. log-Sobolev or bounded Hessian of log Pt) together with uniform-in-t control of ∥ŝt − s⋆(t, ⋅)∥L2(Pt) and a solver-specific stability bound.
Our risk functional ℛ(m) is intentionally operational: it penalizes proximity to the empirical score in the small-noise window, motivated by the empirical linkage between such proximity and memorization behaviors. This is not a formal differential privacy (DP) guarantee, and it does not preclude leakage through other channels (e.g. model inversion, prompt conditioning, or rare-mode exposure). Bridging to DP would require a sensitivity analysis of the training algorithm under the reuse schedule, including the effect of noise reuse on gradient variance and on the influence function of individual samples. We regard this as complementary: the present framework is designed to expose a lever (where to spend noise diversity in time) that can be combined with orthogonal privacy mechanisms (clipping, DP-SGD, data augmentation, or regularization).
Several extensions appear immediate. First, one can jointly optimize the time-sampling distribution and the reuse schedule, replacing fixed Δtk by decision variables subject to a total-mass constraint. Second, one can make the schedule adaptive, choosing mk online based on monitored indicators such as validation denoising error or an empirical memorization statistic, which would convert the present static knapsack-like problem into a feedback control problem over diffusion time. Third, one can study schedule transfer across architectures by expressing the gains gk and risk deltas rk in terms of measurable quantities (e.g. per-time gradient norms or effective signal-to-noise ratios) rather than oracle access to asymptotic curves. Finally, understanding whether the small-t throttling principle survives in fully nonlinear, finite-d regimes—and how it interacts with guidance and conditioning—remains the central empirical and theoretical question motivating this line of work.