In contemporary diffusion training pipelines, the phenomenon colloquially called memorization is typically discussed as a property of the model class, the dataset, and the total optimization effort. We shall treat it instead as a consequence of a more primitive design choice that is usually left implicit: the noise multiplicity m used to form the denoising score matching (DSM) regression targets at a given diffusion time. Concretely, at time t one observes pairs (xi, x̃i, j) where $\tilde x_{i,j}=a_t x_i+\sqrt{h_t}\,z_{i,j}$ with zi, j ∼ 𝒩(0, Id), and one fits a score estimator to approximate ∇x̃log Pt(x̃) via a regression surrogate. The standard implementation choice is to draw one noise sample per datum per iteration, which corresponds (in our stylized per-time analysis) to m = 1. A seemingly innocuous alternative is to reuse data by drawing many independent noises per datum (large m, or an idealized m = ∞), thereby reducing Monte Carlo variability in the regression target. Since increasing m also increases the number of effective training constraints (and hence the compute spent evaluating the score loss), it is natural to regard m as an explicit compute-allocation variable rather than a fixed nuisance parameter.
Our starting point is that, in high-dimensional regimes where random-feature score models exhibit sharp interpolation transitions, m is not merely a variance-reduction knob: it changes which object the learned score is statistically encouraged to approximate. In particular, when the function class is rich enough to closely fit the DSM targets, the learned estimator can become closer to the empirical score (a kernel-density-type score built from the finite sample) than to the population-optimal score. This is the operational sense in which memorization arises in our analysis: the estimator aligns with the dataset-induced score field, producing strong attraction toward training points in the small-noise region. The key observation is that large m makes the empirical DSM objective closer to its conditional-expectation limit given the dataset, and therefore (in regimes allowing interpolation) can move the global minimizer toward this empirical score, even if that move is detrimental to out-of-sample DSM error. Thus, compute spent on increasing m may increase the propensity to memorize, rather than mitigate it.
This dependence is time-dependent and therefore inseparable from the diffusion geometry. At small diffusion times t, the forward corruption $\tilde x=a_t x+\sqrt{h_t}z$ preserves fine-scale information about individual samples (since at ≈ 1 and ht ≈ 0). In that window, the empirical score field is highly non-smooth at the scale relevant to generalization, and the DSM regression becomes capable of encoding sample-specific structure. Accordingly, if we are in an overparameterized regime (in the proportional limit, a representative condition is ψp > ψn), then increasing m can reduce the mismatch to the empirical optimal score while simultaneously worsening the DSM test error. In contrast, at larger times t the diffusion marginal Pt becomes closer to an isotropic Gaussian and the population score becomes near-linear, so that increasing m behaves as one expects from standard regression: it reduces target noise and improves generalization. This qualitative crossover suggests that any compute spent on noise multiplicity should not be uniform across time, and that ``more data augmentation’’ (in the form of more noise draws per datum) is not universally beneficial.
From these considerations it follows that the relevant optimization problem is not simply ``choose $m$'' but rather ``choose a schedule $t \mapsto m(t)$'' subject
to finite resources. Once we discretize diffusion time into bins t1, …, tK,
we may allocate a multiplicity mk to each time
tk. Doing
so induces three competing functionals: a fidelity proxy, measured by an integrated DSM test error across times; a compute proxy, measured by the total number of noise evaluations ∑k mk Δtk; and a memorization-risk proxy, measured by an increasing function of the estimator's proximity to the empirical score, but restricted to a small-noise window t ≤ tmin where memorization is mechanistically plausible. The problem therefore
becomes a constrained scheduling task: allocate noise multiplicity
across time so as to reduce global DSM error while ensuring that
early-time memorization risk stays within a prescribed budget and total
compute does not exceed an available budget.
The central message is that this scheduling problem acquires a tractable structure precisely because diffusion time imposes an ordering on the learning curves. Under mild monotonicity conditions (which are verified in the source material for the endpoints m ∈ {1, ∞} via exact proportional-asymptotic formulas), the optimization collapses to a near-threshold rule: keep mk minimal at sufficiently small times when the model is capable of interpolating, and allocate larger mk at moderate/large times where additional noise draws strictly improve test performance while having negligible effect on the memorization proxy. In the binary policy class mk ∈ {1, ∞}, the resulting solution is provably optimal via a standard exchange argument; in bounded-integer policies mk ∈ {1, …, mmax}, the same reasoning yields a greedy rule based on marginal gains that is near-optimal under natural diminishing-returns conditions. In either case, the schedule induces an explicit compute–privacy–fidelity Pareto frontier: increasing compute improves fidelity only insofar as it is spent at times where the gain is positive, while privacy (or anti-memorization) constraints restrict spending in the small-noise region.
We emphasize that the role of random features and proportional asymptotics is not merely technical convenience. The analytic learning curves provide closed-form access to the dependence of εtest(t; m) and the mismatch-to-empirical-score Mt(m) on (ψn, ψp, λ, 𝜚, t) at the extreme multiplicities m = 1 and m = ∞, which is exactly what is needed to compute marginal gains and to certify monotonicity regimes. This permits a clean separation between (i) a statistical statement about how multiplicity changes generalization and memorization as functions of diffusion time, and (ii) an algorithmic statement about how to allocate multiplicity under budgets. The remainder of the paper formalizes these statements, beginning with the OU diffusion and DSM setup, and then deriving the scheduling structure from the asymptotic formulas.
We briefly recall the OU forward process and the denoising score
matching (DSM) regression view in a form that makes the parameter m explicit. For t ≥ 0, the OU corruption of a clean
sample x ∼ P0 is
$$
\tilde x = a_t x + \sqrt{h_t}\,z,\qquad z\sim\mathcal{N}(0,I_d),
$$
where at = e−t
and ht = 1 − e−2t
satisfy at2 + ht = 1.
The conditional law is x̃ ∣ x ∼ 𝒩(atx, htId),
hence its (Stein) score with respect to x̃ is
$$
\nabla_{\tilde x}\log p(\tilde x\mid x)= -\frac{\tilde x-a_t x}{h_t} =
-\frac{1}{\sqrt{h_t}}\,z.
$$
The population target score at time t is s⋆(t, x̃) = ∇x̃log Pt(x̃),
where Pt
is the x̃-marginal induced by
x ∼ P0 and
the OU channel.
In DSM one learns a function s(⋅) ≈ s⋆(t, ⋅)
by regressing to the conditional score. For a fixed time t and a function class ℋ (here the RFNN score class), we consider
the regularized empirical objective built from a dataset S = {xi}i = 1n
and m independent noise draws
per datum,
$$
\hat L_{t,m}(s)
:= \frac{1}{nm}\sum_{i=1}^n\sum_{j=1}^m
\Bigl\|\, s(\tilde x_{i,j}) + \frac{\tilde x_{i,j}-a_t
x_i}{h_t}\Bigr\|_2^2
\;+\;\lambda \|s\|_{\mathcal{H}}^2,
\qquad
\tilde x_{i,j}=a_t x_i+\sqrt{h_t}\,z_{i,j}.
$$
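To fix ideas, here is a minimal NumPy sketch of this per-time objective, with a generic `score_fn` standing in for the RFNN score model and the model-specific ridge penalty λ∥s∥ℋ² omitted; the interface and names are assumptions made for illustration.

```python
import numpy as np

def dsm_loss(score_fn, X, t, m, rng=None):
    """Empirical DSM loss at a fixed time t with noise multiplicity m.

    score_fn : callable mapping an (n*m, d) array of noised inputs to score predictions
    X        : (n, d) array of clean training samples
    m        : number of independent noise draws per datum (m = 1 is the standard choice)
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    a_t, h_t = np.exp(-t), 1.0 - np.exp(-2.0 * t)        # OU parameters, a_t^2 + h_t = 1
    Z = rng.standard_normal((n, m, d))                    # m fresh noises per datum
    X_tilde = a_t * X[:, None, :] + np.sqrt(h_t) * Z      # noised inputs, shape (n, m, d)
    target = -Z / np.sqrt(h_t)                            # conditional score -(x_tilde - a_t x)/h_t
    pred = score_fn(X_tilde.reshape(n * m, d)).reshape(n, m, d)
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))
```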
The dependence on m is not
merely cosmetic: it controls how accurately the empirical loss
approximates its conditional expectation given the dataset. Indeed,
conditioning on S, we may view
L̂t, m
as an m-sample Monte Carlo
approximation to the functional
$$
L_{t,\infty}(s)
:= \frac{1}{n}\sum_{i=1}^n
\mathbb{E}_{z\sim\mathcal{N}(0,I_d)}
\Bigl[\Bigl\|\, s(a_t x_i+\sqrt{h_t}\,z) +
\frac{z}{\sqrt{h_t}}\Bigr\|_2^2\Bigr]
\;+\;\lambda \|s\|_{\mathcal{H}}^2,
$$
which corresponds to the idealized multiplicity m = ∞. In this sense, m interpolates between a ``single-noise'' stochastic regression problem ($m=1$) and a ``conditional-expectation'' regression problem (m = ∞) in which the only randomness comes from the finite dataset S itself.
A central object in the source analysis is the empirical optimal score se(t, ⋅), defined as the (unregularized) pointwise regression function obtained when we replace the model class by the class of all measurable functions.
Concretely, for a fixed t and
dataset S, consider the
empirical mixture distribution
$$
P_t^S := \frac{1}{n}\sum_{i=1}^n \mathcal{N}(a_t x_i,\,h_t I_d),
$$
which is exactly the law of x̃
when we first sample an index I uniformly in {1, …, n} and then sample x̃ ∼ 𝒩(atxI, htId).
A standard calculation shows that the corresponding score is
$$
\nabla_{\tilde x}\log P_t^S(\tilde x)
=
\sum_{i=1}^n w_i(\tilde x)\,\Bigl(-\frac{\tilde x-a_t x_i}{h_t}\Bigr),
\qquad
w_i(\tilde x):=\frac{\exp\!\bigl(-\frac{\|\tilde x-a_t
x_i\|_2^2}{2h_t}\bigr)}{\sum_{\ell=1}^n \exp\!\bigl(-\frac{\|\tilde
x-a_t x_\ell\|_2^2}{2h_t}\bigr)}.
$$
We denote this score by se(t, x̃);
it is a kernel-density-estimator (KDE) score field with Gaussian
bandwidth $\sqrt{h_t}$ and centers
{atxi}.
In particular, as t ↓ 0 one
has at → 1
and ht → 0, so se(t, ⋅)
develops sharp attraction basins around the training samples at a
spatial scale comparable to $\sqrt{h_t}$. Conversely, for large t the mixture PtS
is strongly smoothed and approaches a near-isotropic Gaussian in many
regimes, so se(t, ⋅)
becomes comparatively regular and close to the population score s⋆(t, ⋅).
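Since se(t, ⋅) is an explicit function of the training set, it can be evaluated directly, which is useful both for intuition and for auditing the mismatch Mt(m). A minimal NumPy sketch (brute force over all n centers, with a numerically stabilized softmax):

```python
import numpy as np

def empirical_score(x_tilde, X, t):
    """Score of the empirical Gaussian mixture P_t^S at a single query point x_tilde.

    X : (n, d) training samples.  Returns the length-d vector
        sum_i w_i(x_tilde) * ( -(x_tilde - a_t x_i) / h_t ).
    """
    a_t, h_t = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    diffs = x_tilde[None, :] - a_t * X                    # (n, d)
    logits = -np.sum(diffs ** 2, axis=1) / (2.0 * h_t)    # unnormalized log-responsibilities
    logits -= logits.max()                                # numerically stable softmax
    w = np.exp(logits)
    w /= w.sum()
    return -(w[:, None] * diffs).sum(axis=0) / h_t        # mixture of per-center scores
```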
The role of m may now be phrased as follows. When m is large, the empirical DSM objective is a high-accuracy approximation to the conditional-expectation objective given S, and in rich function classes the optimizer is driven toward the dataset-conditioned regression function—hence, toward se(t, ⋅) rather than s⋆(t, ⋅). When m is small (notably m = 1), the additional Monte Carlo variability in the target acts as an implicit noise injection in the regression problem; in interpolation-capable regimes this can prevent the estimator from locking onto the fine-scale structure of se(t, ⋅), especially at small t where that structure is most ``sample-specific.’’
The source material makes this qualitative picture quantitative for
RFNN score models in proportional asymptotics. In that setting, the
width p and sample size n scale linearly with d, and the ratio ψp/ψn
governs an interpolation transition: roughly, ψp > ψn
corresponds to a regime where the random-features design is expressive
enough to fit (or nearly fit) the DSM constraints, while ψp < ψn
corresponds to a constrained regime where the estimator cannot
interpolate. The novelty is that the sign of the benefit of increasing
m depends on which side of this transition we are on and on the diffusion time. In overparameterized
regimes ψp > ψn,
increasing m can worsen the
DSM test error at sufficiently small t even though it reduces the
mismatch to the empirical score, because it encourages convergence
toward the spiky se(t, ⋅).
At larger t, where the target
score is effectively low-complexity and less dataset-dependent,
increasing m behaves like
classical variance reduction and improves generalization. Around the
critical region p ≈ n
(i.e. ψp ≈ ψn),
the learning curves sharpen and the crossover in t becomes particularly pronounced:
small changes in effective constraint-to-parameter ratio, induced here
by changing m, can move the
estimator between ``population-like'' and ``empirical-like'' solutions.
This generalization–memorization crossover is the mechanism that motivates time-dependent allocation of m. The same multiplicity that is beneficial when the regression problem is well-posed (moderate/large t, or ψp < ψn) may be harmful when the estimator can track sample-specific score features (small t with ψp > ψn). The remainder of our development will treat the quantities induced by this dependence—test error and mismatch-to-empirical-score, evaluated at the endpoints m ∈ {1, ∞} and then scheduled over t—as the primitive inputs for a constrained allocation problem.
We now isolate the scheduling question induced by the dependence of the DSM estimator on the noise-multiplicity parameter m. Fix a quadrature rule on [0, T] given by times 0 < t1 < ⋯ < tK ≤ T
and nonnegative weights Δtk
(e.g. Riemann or Gauss–Legendre). At each tk we train an
RFNN score model to the global minimizer of the regularized DSM
objective with multiplicity mk, producing an
estimator ŝk ≡ ŝtk, mk.
The decision variable is therefore the schedule
m = (m1, …, mK), mk ∈ ℳ,
where we shall study two classes: (i) the class ℳ = {1, ∞}, and (ii) a class ℳ = {1, 2, …, mmax}
equipped with an interpolation/bounding rule connecting the endpoints
m = 1 and m = ∞.
The scheduling objective is stated in terms of the asymptotic test
error of the time-t regression
problem, denoted εtest(t; m),
evaluated in the proportional limit d, n, p → ∞ with
ψn = n/d
and ψp = p/d
fixed. We define the integrated test error $\mathcal{F}(\mathbf{m}) := \sum_{k=1}^K \Delta t_k\,\varepsilon_{\mathrm{test}}(t_k; m_k)$, which is the natural discrete analogue of ∫0Tεtest(t; m(t)) dt.
This quantity is the correct surrogate for end-to-end generative
accuracy because standard score-based diffusion arguments bound the
divergence between the true reverse-time dynamics and the learned one by
an integral of squared score error; in particular, under the usual
regularity hypotheses one has a bound of the form
$$
\mathrm{KL}\bigl(P_0 \,\|\, \hat P_0\bigr) \;\lesssim\; \int_0^T \varepsilon_{\mathrm{test}}\bigl(t; m(t)\bigr)\,dt \;+\; (\text{discretization and initialization terms}).
$$
Thus, minimizing ℱ is our proxy for maximizing sample quality.
We model computational cost as proportional to the number of noise
evaluations per datum. Since training at time tk uses mk noise draws
per xi,
the total cost (up to an overall constant depending on optimization
details) is
$$
\mathcal{C}(\mathbf{m}) \;:=\; \sum_{k=1}^K \Delta t_k\, m_k,
$$
and we impose the constraint $\mathcal{C}(\mathbf{m}) \le B$, where B is a prescribed budget. In the binary class, the symbol mk = ∞ is
interpreted as allocating a very large (effectively variance-free) Monte
Carlo pool at that time; operationally, one may replace ∞ by a large mmax chosen so that
endpoint learning curves are well approximated. Importantly, 𝒞 is
separable across times, which will enable greedy/threshold solutions
once monotonicity properties of εtest(t; m)
are established.
Compute alone does not capture the phenomenon highlighted above: in
regimes where the estimator can interpolate, increasing m can drive ŝt, m
toward the dataset-conditioned score se(t, ⋅),
which is known to become highly sample-specific as t ↓ 0. To quantify the propensity of
the learned score to encode such sample-specific structure, we introduce
a quantity
$$
M_t(m) \;:=\; \mathbb{E}\,\bigl\|\hat s_{t,m}(\tilde X) - s_e(t,\tilde X)\bigr\|_2^2,
\qquad \tilde X \sim P_t,
$$
whose appearance is motivated by the bias–variance style decomposition
underlying the source analysis (in the spirit of Lemma~3.5 therein).
Since se(t, ⋅)
is the score of the empirical Gaussian mixture PtS
and thus directly determined by S, small Mt(m)
indicates that the learned model is close to a score field that
concentrates around training points at scale $\sqrt{h_t}$. This motivates using Mt(m) as a primitive that correlates with memorization and, consequently, with extraction vulnerability.
To connect to extraction metrics, we pass from Mt(m)
to a scalar risk proxy Mem(t; m) via a fixed
monotone transform $\mathsf{Mem}(t;m) := \phi\bigl(M_t(m)\bigr)$, where ϕ is decreasing (so that being close to se corresponds to higher memorization risk). This abstraction accommodates multiple
downstream notions: for instance, a membership-inference advantage in a
score-based test is often monotone in a suitable discrepancy between the
model score and a population baseline, while nearest-neighbor style
extraction heuristics become stronger as the learned score field
develops sharp basins aligned with training samples. We do not commit to
a single operational metric here; rather, we treat Mem as an explicitly computable proxy derived
from the same asymptotic formulas as εtest, and we enforce it
only where it is meaningful: in a small-noise window [0, tmin].
Accordingly, we define the early-time memorization risk
$$
\mathcal{R}(\mathbf{m}) \;:=\; \sum_{k:\,t_k \le t_{\min}} \Delta t_k\,\mathsf{Mem}(t_k; m_k),
$$
and impose the constraint $\mathcal{R}(\mathbf{m}) \le \rho$, where tmin > 0
is chosen so that ht remains small
on [0, tmin] (so
that se
indeed exhibits sample-specific fine structure), and ρ is a prescribed risk budget. The
restriction to t ≤ tmin
formalizes the idea that extraction vulnerabilities are dominated by the
early reverse-time segment, where the learned score controls fine-scale
placement of probability mass.
Combining the fidelity, compute, and risk functionals, our formal problem is
$$
\min_{\mathbf{m}\in\mathcal{M}^K}\ \mathcal{F}(\mathbf{m})
\quad\text{subject to}\quad
\mathcal{C}(\mathbf{m})\le B,\qquad \mathcal{R}(\mathbf{m})\le\rho.
$$
The essential modeling choice is that the per-time estimators are
trained independently, so that the curves t ↦ εtest(t; m)
and t ↦ Mt(m)
are the only inputs required for the scheduling problem. In the binary class, these inputs
reduce to the endpoints m = 1
and m = ∞; in the
bounded-integer class, we will require a controlled interpolation or
upper bound for intermediate m. In the next section we therefore
turn to the asymptotic learning-curve formulas available at m ∈ {1, ∞} and to the construction
of usable surrogates for general m, which together render the scheduling problem algorithmically tractable.
The scheduling problem is algorithmically meaningful only insofar as we can
evaluate, for each time t, the
two primitives
m ↦ εtest(t; m) and m ↦ Mt(m),
at least for the values of m
allowed by the policy class. The analysis of the source supplies closed
asymptotic characterizations at the two endpoints m = 1 and m = ∞, which we treat as exact primitives throughout: for each fixed t, one obtains εtest(t; 1) from the ``single-noise'' learning curve (source Thm.~3.6) and $\varepsilon_{\mathrm{test}}(t;\infty)$ from the ``infinite-pool'' learning curve (source Thm.~3.2). Both are given in proportional
asymptotics by constant-size fixed-point systems for a small collection
of scalar order parameters (resolvent/traces of the random-features Gram
matrix and its ridge-regularized inverse), which depend on (ψn, ψp, λ),
on the activation 𝜚 through
its Hermite coefficients, and on t through the OU parameters (at, ht)
and the induced law Pt.
Concretely, at each (t, m) ∈ {tk} × {1, ∞},
the source reduces the DSM regression to a random-features ridge problem
with a teacher signal determined by the population score s⋆(t, ⋅) and an
effective label-noise component determined by the Monte Carlo
variability of the DSM target. In the m = ∞ case, this Monte Carlo
component vanishes and the regression is ``noise-free’’ relative to the
conditional expectation target; in the m = 1 case, it is maximal. The
resulting test error admits the usual decomposition
εtest(t; m) = εbias(t; m) + εvar(t; m),
where both terms are explicit functions of the fixed-point solution (and
thus can be evaluated without simulating high-dimensional random
matrices). An analogous statement holds for Mt(m):
in the source’s bias–variance decomposition, Mt(m)
is expressed in terms of the same order parameters together with an
overlap between the learned predictor and the empirical-score target
se(t, ⋅).
We emphasize that, for our scheduling purposes, it suffices that (εtest(t; 1), εtest(t; ∞), Mt(1), Mt(∞))
can be computed in O(1) time
per t up to numerical root
finding; thus, computing all endpoint curves over the grid costs O(K ⋅ Tsolve).
In the bounded-integer policy ℳ = {1, …, mmax} we additionally require a principled surrogate for εtest(t; m) and Mt(m) when 1 < m < ∞. The key modeling observation is that, at fixed t, using m independent noise draws per datum replaces the ideal conditional expectation in the DSM target by an average of m i.i.d. terms. Accordingly, conditional on the regressor (the diffused input), the deviation of the finite-m target from its m = ∞ counterpart is mean-zero with variance scaling as 1/m. This suggests treating the transition from m = ∞ to finite m as the introduction of an effective label-noise term with variance proportional to 1/m.
We therefore adopt the following interpolation rule, which is
compatible with standard random-features learning-curve formulas with
noisy labels: for each t we
posit an effective noise level σt2 such that $\varepsilon_{\mathrm{test}}(t;m) \approx \varepsilon^{(\mathrm{rf})}_{\mathrm{test}}\!\bigl(t;\sigma_t^2/m\bigr)$, where εtest(rf)(t; τ)
denotes the random-features ridge test error when the teacher target is
the m = ∞ DSM target and the
labels are corrupted by independent additive noise of variance τ. The map τ ↦ εtest(rf)(t; τ)
is explicit in the same order parameters as the endpoint formulas, with
τ entering as an additive
contribution to the effective noise term in the fixed-point system; in
particular, this interpolation rule recovers the endpoints by matching
εtest(t; ∞) = εtest(rf)(t; 0), εtest(t; 1) ≈ εtest(rf)(t; σt2).
In practice, we determine σt2
by solving the scalar equation εtest(rf)(t; σt2) = εtest(t; 1),
which is well-posed whenever τ ↦ εtest(rf)(t; τ)
is continuous and strictly increasing. This produces a one-parameter
family {εtest(t; m)}m ∈ [1, ∞]
consistent with the exact endpoints and with the 1/m Monte Carlo scaling. An entirely
analogous construction is used for Mt(m),
replacing the test-error functional by the mismatch functional in the noisy-label RF limit, $M_t(m) \approx M^{(\mathrm{rf})}_t\!\bigl(\sigma_t^2/m\bigr)$, again with σt2
fixed by matching Mt(1) (or, if
preferred, by matching εtest(t; 1) and
then predicting Mt(1) as a
consistency check).
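As an illustration of this calibration step, assuming access to a hypothetical callable `eps_rf_at_t` evaluating τ ↦ εtest(rf)(t; τ) from the fixed-point system, the scalar matching equation can be solved by standard bracketed root finding:

```python
from scipy.optimize import brentq

def calibrate_sigma2(eps_rf_at_t, eps_test_m1, tau_hi=1e3):
    """Solve eps_rf_at_t(sigma2) = eps_test_m1 for the effective label-noise variance at one time t.

    eps_rf_at_t : hypothetical callable tau -> RF ridge test error under additive label
                  noise of variance tau (assumed continuous and strictly increasing)
    eps_test_m1 : exact m = 1 endpoint value at this t
    """
    f = lambda tau: eps_rf_at_t(tau) - eps_test_m1
    # the bracket [0, tau_hi] must straddle the root under the monotonicity assumption
    return brentq(f, 0.0, tau_hi)
```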
When we do not wish to re-solve fixed-point equations for each
intermediate m, we may instead
use a cheaper, purely endpoint-based surrogate. A convenient choice is
interpolation in the inverse-multiplicity variable,
$$
\varepsilon_{\mathrm{test}}(t;m) \;\le\; \theta_m\,\varepsilon_{\mathrm{test}}(t;1) + (1-\theta_m)\,\varepsilon_{\mathrm{test}}(t;\infty),
\qquad \theta_m := 1/m,
$$
and similarly for Mt(m). This inequality can be justified as a conservative upper bound whenever the
learning curve is convex in the label-noise variance (a property that
holds for many ridge-type risks as functions of an additive noise
parameter), and it is exact at the endpoints. This bound suffices for
designing schedules that are robust to interpolation error, since it
never overstates the benefit of increasing m.
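A minimal sketch of this endpoint-only surrogate, with the plain θm = 1/m rule as the default and an optional exponent αt as in the empirical calibration described next:

```python
def eps_surrogate(eps_m1, eps_minf, m, alpha=1.0):
    """Endpoint-based surrogate for eps_test(t; m) at one time t.

    theta_m = m**(-alpha); alpha = 1 gives the plain 1/m rule.  The convex combination
    is exact at the endpoints and a conservative upper bound whenever the learning
    curve is convex in the label-noise variance.
    """
    theta = float(m) ** (-alpha)
    return theta * eps_m1 + (1.0 - theta) * eps_minf
```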
Because the preceding interpolation rules abstract away architectural and optimization effects (we assume global minimizers and the RF limit), we recommend a lightweight calibration step when deploying the schedule empirically. Fix a small subset of times {tkℓ}ℓ = 1L (e.g. logarithmically spaced, with extra density near t ≤ tmin). For each selected time, train the RFNN score model with multiplicities in a small set {1, 2, 4, …, mcal}, estimate εtest(tkℓ; m) on held-out diffused samples, and fit σtkℓ2 (or, in the endpoint-only rule, fit an exponent αt in θm = m−αt). Interpolate σt2 across t (e.g. by a spline in t or in ht), and then evaluate the calibrated curves εtest(t; m) and Mt(m) for all grid points. This produces calibrated curves that preserve the endpoint analytic structure while tracking finite-d effects.
Having thus reduced εtest(t; m) and Mt(m) to explicit, evaluable functions of t and m, we next study their dependence on m across diffusion time and across regimes of ψp/ψn. These monotonicity properties, proved from the endpoint formulas and transported to intermediate m through the interpolation/bounding rules above, are the mechanism by which the scheduling problem collapses to a greedy/threshold form.
Fix a diffusion time t and view the DSM regression at this time as a random-features ridge problem with a teacher function equal to the conditional expectation target (the m = ∞ DSM target) and with an additional label-noise component induced by finite Monte Carlo averaging. Under this viewpoint, increasing m has two competing effects. First, it reduces the effective label-noise variance (scaling as 1/m), which improves the regression when the estimator is variance-limited. Second, it forces the estimator toward the empirical conditional expectation associated with the training set, which can hurt generalization when the model is sufficiently expressive to lock onto sample-specific structure (memorization). The sign of the net effect is determined by the regime ψp/ψn together with the diffusion time t (which controls the signal-to-noise and the complexity of the score).
To formalize, we define the endpoint gaps
Δε(t) := εtest(t; ∞) − εtest(t; 1), ΔM(t) := Mt(∞) − Mt(1).
The quantity Δε(t)
measures whether increasing m
from 1 to ∞ helps (Δε(t) ≤ 0)
or hurts (Δε(t) ≥ 0)
fidelity at time t, while
ΔM(t)
encodes whether increasing m
pulls the learned score toward the empirical-optimal score se(t, ⋅)
(namely ΔM(t) ≤ 0)
or away from it.
We now isolate a sufficient condition ensuring that, for small diffusion times, increasing m worsens test error while decreasing mismatch to the empirical score. The condition is expressed at the endpoints m ∈ {1, ∞}, where the source provides closed-form proportional-asymptotic formulas. The relevant asymptotic is t ↓ 0, for which at = 1 + O(t) and ht = 2t + O(t2), hence Pt is a small Gaussian smoothing of P0. In this regime, the population score s⋆(t, ⋅) = ∇log Pt typically exhibits high curvature at the scale $\sqrt{h_t}$; correspondingly, the empirical score se(t, ⋅) behaves like the score of an ht-bandwidth KDE supported on the training set. When ψp > ψn, the RFNN has enough degrees of freedom (in the random-features limit) to fit sample-specific structure, so moving closer to se can harm generalization.
We compare the endpoint learning-curve systems at m = 1 and m = ∞ by expanding the fixed-point order parameters in the small-ht limit. In the m = ∞ system, the regression target is the conditional expectation, hence the effective label noise term vanishes; in the m = 1 system, an additional noise term persists, entering additively into the variance component of the risk. In the overparameterized regime, the ridge-regularized resolvent exhibits a variance amplification effect associated with near-interpolation (precisely: the variance term is decreasing in the label-noise level only beyond a critical effective dimension, and increasing when ψp exceeds ψn). At small t, the teacher component aligned with low-order Hermite modes is weak relative to the high-frequency component induced by the narrow smoothing (equivalently, the signal-to-noise ratio in the random-features channel is low). Consequently, removing label noise (passing from m = 1 to m = ∞) increases the estimator’s effective degrees of freedom devoted to fitting sample-specific fluctuations, yielding Δε(t) ≥ 0. Meanwhile, because the m = ∞ objective matches the empirical conditional expectation without Monte Carlo perturbation, the learned predictor is closer to the empirical score target se, yielding ΔM(t) ≤ 0. A sign analysis of the resulting expansions determines a positive t0 for which both inequalities hold.
The small-t deterioration is not global in t. As t increases, the OU forward diffusion drives Pt toward an approximately Gaussian law with score close to the linear function x ↦ −x (more precisely, higher-order components decay as at shrinks). In this regime the teacher is effectively low-complexity and aligns strongly with the leading Hermite components of the random-features map; regression becomes signal-dominated, and reducing label noise by increasing m improves test error. Thus, still under ψp > ψn, there exists a (possibly distinct) t1 ∈ (0, T] such that for all t ∈ [t1, T] one has Δε(t) ≤ 0. We emphasize that t0 and t1 need not coincide; the intermediate region is not assumed to be monotone without further conditions.
When ψp < ψn, the RFNN class is not rich enough (in the random-features limit) to interpolate the training constraints, and the dominant effect of increasing m is variance reduction in the DSM target. In this regime the learning-curve formulas imply that εtest(t; m) is nonincreasing in the label-noise variance parameter, and hence nonincreasing in 1/m; at the endpoints this yields Δε(t) ≤ 0 for all t ∈ (0, T]. Simultaneously, because the estimator cannot track the empirical score even when m = ∞, its proximity to se does not improve with m in the same way; indeed the endpoint comparison yields ΔM(t) ≥ 0 in this regime, reflecting that the m = 1 objective’s additional noise acts as an implicit regularizer toward population structure rather than toward the empirical score.
The preceding comparisons are stated at m ∈ {1, ∞} because those values admit exact asymptotic characterization. For bounded integer policies, we propagate the endpoint monotonicity through the 1/m noise-variance scaling surrogate: in the overparameterized small-t window, εtest(t; m) is increasing in m while Mt(m) is decreasing in m; in the underparameterized regime, εtest(t; m) is decreasing in m for all t. This is the structural input needed in the next section, where we convert the constrained schedule search into a greedy/threshold allocation over time bins.
Fix the discretization {tk, Δtk}k = 1K
and regard the schedule m = (m1, …, mK)
as the sole decision variable. Since each time bin is trained
independently in our stylized model, both the fidelity objective and the constraints are separable across bins:
$$
\mathcal{F}(\mathbf{m})=\sum_{k=1}^K \Delta
t_k\,\varepsilon_{\mathrm{test}}(t_k;m_k),\qquad
\mathcal{R}(\mathbf{m})=\sum_{k:t_k\le t_{\min}} \Delta
t_k\,\mathsf{Mem}(t_k;m_k),\qquad
\mathcal{C}(\mathbf{m})=\sum_{k=1}^K \Delta t_k\,m_k.
$$
Thus the scheduling problem is a resource allocation over bins with
per-bin ``value'' given by test-error reduction and per-bin ``weight'' given by compute, coupled to an additional early-time risk budget.
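For concreteness, these three functionals can be evaluated for any candidate schedule by direct summation; a minimal sketch, with `eps_test` and `mem` as assumed callables returning the per-time primitives:

```python
def schedule_functionals(m, t, dt, eps_test, mem, t_min):
    """Evaluate the separable fidelity/risk/compute functionals for a schedule m.

    m, t, dt : length-K lists of multiplicities, bin times, and quadrature weights
    eps_test : assumed callable (t_k, m_k) -> asymptotic DSM test error
    mem      : assumed callable (t_k, m_k) -> memorization-risk proxy Mem(t_k; m_k)
    """
    K = len(m)
    F = sum(dt[k] * eps_test(t[k], m[k]) for k in range(K))
    R = sum(dt[k] * mem(t[k], m[k]) for k in range(K) if t[k] <= t_min)
    C = sum(dt[k] * m[k] for k in range(K))
    return F, R, C
```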
We first consider the binary class mk ∈ {1, ∞}.
Define the per-bin fidelity gain from upgrading bin k,
gk := Δtk(εtest(tk; 1) − εtest(tk; ∞)),
and the associated early-time risk increment
rk := 1{tk ≤ tmin} Δtk(Mem(tk; ∞) − Mem(tk; 1)).
Upgrading bin k from 1 to ∞
changes (ℱ, ℛ, 𝒞) by (−gk, +rk, +Δtk(∞ − 1)).
In particular, when tk > tmin
the risk increment is negligible (by construction of ℛ), while in the small-noise window and in
the regime ψp > ψn
the monotonicity input implies rk ≥ 0 (higher
m increases memorization
risk). Hence, for ψp > ψn,
upgrades at early times are doubly disfavored: they consume risk budget
and, in the small-t window
where Δε(t) ≥ 0,
they also tend to have gk ≤ 0.
We use an exchange argument. Consider any feasible schedule that upgrades some k with tk ≤ tmin. By (i) this weakly increases ℛ and by the small-t deterioration input it does not improve ℱ (equivalently, gk ≤ 0 in the relevant window). Reverting that bin to mk = 1 weakly decreases both ℛ and ℱ while saving compute, so no optimum upgrades early bins. After fixing mk = 1 on {tk ≤ tmin}, constraint ℛ ≤ ρ becomes irrelevant for the remaining bins by (ii), and the problem reduces to selecting upgrades over tk > tmin to maximize total gain ∑k ∈ Sgk under a compute budget. Under uniform costs this is solved by sorting gk and taking the largest gains; with per-bin costs one obtains the standard greedy-by-efficiency solution under separability, as implemented in Algorithm .
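A minimal sketch of the resulting threshold/greedy rule in the binary class, with ∞ replaced by a large finite `m_big` as discussed above and `eps1`, `eps_inf` the precomputed endpoint curves; this is a stylized implementation under uniform per-bin costs:

```python
def binary_schedule(t, dt, eps1, eps_inf, t_min, budget, m_big=64):
    """Threshold/greedy schedule over the binary class, with m_big standing in for 'infinity'.

    eps1, eps_inf : length-K endpoint curves eps_test(t_k; 1) and eps_test(t_k; inf)
    budget        : bound on the compute proxy sum_k dt_k * m_k
    Keeps m_k = 1 on the small-noise window t_k <= t_min and upgrades later bins in
    descending order of fidelity gain g_k = dt_k * (eps1_k - eps_inf_k).
    """
    K = len(t)
    m = [1] * K
    cost = sum(dt)                                        # cost of the all-ones baseline
    gains = sorted(((dt[k] * (eps1[k] - eps_inf[k]), k) for k in range(K) if t[k] > t_min),
                   reverse=True)
    for g, k in gains:
        if g <= 0:
            break                                         # remaining bins cannot improve fidelity
        extra = dt[k] * (m_big - 1)
        if cost + extra <= budget:
            m[k], cost = m_big, cost + extra
    return m
```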
Now let mk ∈ {1, …, mmax}.
The scheduling problem becomes a (two-budget) integer program. To retain
tractability we invoke the same structural surrogate used to transport
endpoint monotonicity: for fixed t, both εtest(t; m)
and Mem(t; m) vary
smoothly with the effective label-noise variance ∝ 1/m. In particular, for t > tmin we typically observe diminishing returns in fidelity as m increases, i.e. the marginal gain
δk(m) := Δtk(εtest(tk; m) − εtest(tk; m + 1))
is nonincreasing in m. Under
such discrete concavity, a standard priority-queue greedy algorithm that
repeatedly allocates one additional noise draw to the bin with largest
current δk(m)/Δtk
yields a constant-factor approximation to the compute-constrained
optimum and is exact in the continuous relaxation (water-filling/KKT
form). The risk constraint is handled by forbidding increments on tk ≤ tmin
once the running sum ∑tk ≤ tminΔtk Mem(tk; mk)
reaches ρ; in the regime ψp > ψn
this typically enforces mk near its
minimum on early bins, while allowing aggressive growth for tk > tmin.
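A minimal sketch of this priority-queue greedy, again with `eps_test` and `mem` as assumed callables; each bin keeps a single live queue entry that is refreshed after every granted increment, so the rule relies on the diminishing-returns assumption for near-optimality:

```python
import heapq

def greedy_integer_schedule(t, dt, eps_test, mem, t_min, budget, rho, m_max):
    """Priority-queue greedy for bounded integer multiplicities m_k in {1, ..., m_max}.

    Repeatedly grants one extra noise draw to the bin with the largest marginal
    fidelity gain per unit compute, refusing increments that would exceed the
    compute budget or the early-time memorization-risk budget rho.
    """
    K = len(t)
    m = [1] * K
    cost = sum(dt)                                                   # all-ones baseline cost
    risk = sum(dt[k] * mem(t[k], 1) for k in range(K) if t[k] <= t_min)

    def gain_density(k):                                             # delta_k(m_k) / dt_k
        return eps_test(t[k], m[k]) - eps_test(t[k], m[k] + 1)

    heap = [(-gain_density(k), k) for k in range(K)]
    heapq.heapify(heap)
    while heap:
        neg_g, k = heapq.heappop(heap)
        if -neg_g <= 0 or m[k] >= m_max or cost + dt[k] > budget:
            continue                                                 # unprofitable or infeasible
        if t[k] <= t_min:
            extra_risk = dt[k] * (mem(t[k], m[k] + 1) - mem(t[k], m[k]))
            if risk + extra_risk > rho:
                continue                                             # would violate the risk budget
            risk += extra_risk
        m[k] += 1
        cost += dt[k]
        if m[k] < m_max:
            heapq.heappush(heap, (-gain_density(k), k))              # schedule the next increment
    return m
```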
In the binary class, the value function
ℱ⋆(B, ρ) := min {ℱ(m) : m ∈ {1, ∞}K, 𝒞(m) ≤ B, ℛ(m) ≤ ρ}
admits an explicit characterization under the hypotheses of Theorem~.
First, feasibility at a given ρ requires that the unavoidable
baseline risk ∑tk ≤ tminΔtk Mem(tk; 1)
does not exceed ρ. Conditional
on feasibility, the optimal schedule fixes mk = 1 for tk ≤ tmin
and spends compute only on tk > tmin,
upgrading bins in descending order of gk.
Consequently, as B increases,
ℱ⋆(B, ρ)
decreases by successive addition of the gains gk from the
sorted list, yielding a piecewise-linear, monotone frontier in B whose breakpoints occur at budgets
where an additional bin can be upgraded. In the continuous-time limit
(small maxkΔtk),
this becomes an integral threshold rule in t determined by a Lagrange
multiplier: we allocate high m
precisely on those times for which the instantaneous gain density
exceeds the multiplier, while enforcing m ≈ 1 in the small-noise window to
satisfy the privacy/memorization budget.
Our upper-bound schedules are guided by the small-noise monotonicity phenomenon: in the overparameterized regime, increasing noise multiplicity drives the learned score closer to the empirical optimal score se(t, ⋅) at small t, which is precisely the direction in which memorization-like effects worsen. We now record a complementary necessity statement showing that, once tmin lies inside the small-noise window from Theorem~1, any schedule that assigns large multiplicity to a nonnegligible portion of early times must incur a nontrivial memorization-risk cost.
To make the statement concrete, fix a threshold m0 ≥ 2 and write E(m; m0) := {k : tk ≤ tmin, mk ≥ m0} for the set of early bins that are ``high multiplicity.’’ Assume the proxy is of the form Mem(t; m) = ϕ(Mt(m)) with ϕ nonincreasing, so that shrinking Mt(m) (moving toward se) increases memorization risk. Define the early-time mass α(m; m0) := ∑k ∈ E(m; m0)Δtk.
The theorem formalizes the intuitive constraint that, in ψp > ψn,
one cannot ``spend'' substantial multiplicity at early diffusion times without paying a risk penalty: the best one can do under a fixed $\rho$ is to confine high multiplicity to a small-measure subset of $[0,t_{\min}]$. In the binary class $m\in\{1,\infty\}$ one may take $m_0=\infty$; the relevant gap is then $M_t(1)-M_t(\infty)=-\Delta_M(t)$, which is strictly positive for small $t$ under Theorem~1(i) unless the estimator already coincides with the empirical score. For bounded integers, one may view $m_0$ as the minimal level at which ``near-empirical''
behavior emerges; the conclusion then explains why a risk-feasible
optimizer places the majority of its compute in bins with t > tmin even
when the compute objective alone would encourage increasing m everywhere.
The preceding sections derive tractable, nearly closed-form schedules by exploiting diffusion-induced structure. Without such structure, even our separable formulation becomes computationally hard once one permits arbitrary per-bin responses to multiplicity. We record a standard reduction to emphasize that the present tractability is not generic.
Given a knapsack instance with items (vk, wk), we construct K bins and stipulate Δtk so that the compute increment equals ck = wk and the fidelity improvement equals gk = vk. Feasible schedules correspond bijectively to subsets of items, and achieving gain ≥ G corresponds to attaining knapsack value ≥ G. Since knapsack is NP-complete, the decision version of the schedule-selection problem is NP-complete.
The preceding reduction requires that per-bin gains and costs be freely specifiable. In our DSM setting, neither is arbitrary. First, costs are essentially proportional to Δtk (or to a fixed number of noise evaluations per bin), which eliminates the item-specific weights that drive knapsack hardness. Second, and more importantly, the learning-curve formulas impose a consistent sign structure: in ψp > ψn and for t ≤ tmin, increasing m both worsens test error (Theorem~1(ii)) and increases memorization risk via proximity to se (Theorem~1(i) combined with ϕ nonincreasing). For t > tmin, the risk term is by definition inactive and the gain gk is typically nonnegative (large-noise bins behave like well-specified regression onto an approximately linear score). This sign pattern is exactly what enables the exchange argument behind Theorem~: any upgrade at t ≤ tmin is dominated by either doing nothing or reallocating compute to a later bin.
Thus, our scheduling problem sits in a regime where the apparent
combinatorial complexity collapses to a threshold/greedy allocation
because the diffusion dynamics supply an ordering in which
``early'' decisions are uniformly discouraged by the risk budget, while ``late''
decisions are uniformly rewarded in fidelity. The lower bound of
Theorem~ and the hardness statement of Proposition~ together delineate
the scope of the result: small-t throttling is enforced not only by
the structure of optimal schedules but also by an intrinsic risk
penalty, and algorithmic tractability relies on the monotone
diffusion-time geometry rather than on any generic property of separable
integer programs.
The optimization problem produces a discrete schedule m = (m1, …, mK) intended to control, at each tk, the number of Gaussian perturbations paired with each training datum. In code, however, DSM training is typically organized around mini-batches, shuffling, and often a continuous-time sampling of t. We therefore require an implementation that (i) enforces a target multiplicity mk per (datum, time-bin) pair, (ii) respects the total budget 𝒞(m) ≈ ∑kmkΔtk, and (iii) permits post hoc auditing of the multiplicity actually realized.
If the model samples t ∼ Unif[0, T] (or a nonuniform density) at each iteration, we implement our schedule by introducing a binning map b(t) ∈ {1, …, K} and replacing m(t) by mb(t). Concretely, one may either (a) sample k directly with probability proportional to Δtk and then set t = tk (standard in discretized diffusion training), or (b) sample t continuously and then apply k = b(t) only for multiplicity control, keeping the original t for the target/weighting. The former matches our analysis most closely; the latter is sometimes preferable when the loss uses continuous-time weights, but one must then track compute and risk by bins.
Fix a time bin k. For each
datum index i, we aim to
restrict the set of noises {zi, j, k}j = 1mk
used to form noised inputs
$$
x_{i,j,k}^{(t_k)} \;:=\; a_{t_k}x_i + \sqrt{h_{t_k}}\,z_{i,j,k},
\qquad z_{i,j,k}\sim\mathcal{N}(0,I_d).
$$
A direct method is to associate to each pair (i, k) a noise pool of size mk consisting of seeds {σi, j, k}j = 1mk;
when the pipeline requests a noise sample for (i, k), we select j uniformly and generate z deterministically from σi, j, k.
This yields an exact multiplicity cap: across all epochs, no more than
mk
distinct z values are ever
paired with xi at time tk. The binary
policy is a special case: mk = 1 means a
singleton pool (full reuse), while mk = ∞ is
approximated by omitting the pool (fresh z each call) or by taking a very
large mmax.
In large-scale training, explicitly storing σi, j, k for all (i, k) is infeasible. A practical compromise is to store pools only for the bins where reuse is desired (typically tk ≤ tmin), and generate fresh noise elsewhere. Even in the worst case, the early-time bins are a small subset, so the memory footprint is manageable. When per-example indices i are not stable (e.g. streaming datasets), we may define i as a stable hash of the sample identifier (file path, shard offset, etc.), which suffices to make reuse reproducible.
When storage is still undesirable, we may emulate pools via a keyed
pseudorandom function. For mk = 1,
define
zi, k = Gauss(Hash(key, i, k)),
where Gauss(⋅) maps a seed to a
Gaussian vector (e.g. by a counter-based RNG). This enforces reuse
without storing anything. For general integer mk, we replace
k by (k, j) with j ∈ {1, …, mk}
and select j by a
deterministic but varying rule, e.g. j = 1 + (Hash(key′, step, i, k) mod mk).
This guarantees that the realized set of z values per (i, k) is contained in a
set of size mk, while still
permitting randomized mixing across steps.
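A minimal sketch of such a keyed construction, using SHA-256 as the hash and NumPy's seeded generator in place of a dedicated counter-based RNG; `key`, `sample_id`, and `step` are assumed identifiers supplied by the training loop:

```python
import hashlib
import numpy as np

def pooled_noise(key, sample_id, k, m_k, step, d):
    """Deterministic noise reuse for the pair (sample_id, time bin k) with pool size m_k.

    The pool member j is selected by a step-dependent hash, so across all training
    steps the realized set of noise vectors for (sample_id, k) contains at most m_k
    distinct values; m_k = 1 reproduces full reuse, and fresh noise is recovered by
    bypassing this function entirely.
    """
    digest = lambda s: int(hashlib.sha256(s.encode()).hexdigest(), 16)
    j = 1 + digest(f"{key}|step|{step}|{sample_id}|{k}") % m_k      # which pool member to use
    seed = digest(f"{key}|pool|{sample_id}|{k}|{j}") % (2 ** 63)    # seed identifying that member
    rng = np.random.default_rng(seed)                               # seeded generator
    return rng.standard_normal(d)
```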
Our compute proxy counts noise evaluations per datum, whereas wall-clock time in modern diffusion training is dominated by network forward/backward passes. We therefore implement the schedule by holding the number of optimizer steps and batch size fixed and altering only whether the same (xi, tk) pair is repeatedly coupled to identical or fresh noise. This makes comparisons at fixed compute meaningful when the noise generation is nontrivial (e.g. when augmentation/noising is expensive) and, even when it is not, isolates the effect of reuse. If one insists that 𝒞(m) track wall-clock time, a closer proxy is to keep the number of noised pairs $(x_{i,j,k}^{(t_k)}, t_k)$ per epoch proportional to mkΔtk, which can be realized by oversampling time bins with larger mk or by forming within-batch replicates of each example at the selected tk.
Given implementation details, we recommend explicitly measuring the
realized multiplicity. For each bin k, define the empirical effective
multiplicity
$$
\widehat m_k \;:=\; \frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}
\Big|\big\{ z \ \text{used with}\ (i,k)\ \text{during
training}\big\}\Big|,
$$
where ℐ is a monitored subset of sample
indices and the set cardinality counts distinct RNG seeds (or distinct
noise hashes) observed. In practice we log, for each monitored (i, k), a Bloom filter or a
small hash set of seen seeds, updated online with negligible overhead.
One should then verify that m̂k ≈ 1 for the
bins designated as ``reuse'' and that $\widehat m_k$ grows linearly with training steps for the bins designated as ``fresh noise'' (until limited by any imposed mmax).
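A minimal sketch of the online audit, tracking distinct seed hashes per monitored (i, k) with plain sets rather than Bloom filters, which suffices for a small monitored subset:

```python
from collections import defaultdict

class MultiplicityAuditor:
    """Tracks distinct noise seeds observed per monitored (sample_id, bin) pair."""

    def __init__(self, monitored_ids):
        self.monitored = set(monitored_ids)
        self.seen = defaultdict(set)                     # (sample_id, k) -> set of seed hashes

    def record(self, sample_id, k, seed_hash):
        """Call once per training-time noise draw for monitored samples."""
        if sample_id in self.monitored:
            self.seen[(sample_id, k)].add(seed_hash)

    def effective_multiplicity(self, k):
        """Empirical hat{m}_k, averaged over monitored samples that visited bin k."""
        counts = [len(s) for (i, b), s in self.seen.items() if b == k]
        return sum(counts) / len(counts) if counts else 0.0
```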
A second audit is to check that the realized time-bin occupancy matches the intended quadrature weights: letting Nk be the number of training calls in which b(t) = k, we expect Nk/∑ℓNℓ ≈ Δtk/∑ℓΔtℓ (or the chosen time-sampling density). This is essential: an apparently correct m̂k is irrelevant if the pipeline almost never visits bin k.
We have found the following patterns robust: (i) enforce deterministic mk = 1 reuse for tk ≤ tmin via hash-based noise; (ii) use unconstrained fresh noise for tk > tmin; (iii) keep the optimizer step count fixed across schedules and report both (m̂k)k and (Nk)k alongside fidelity metrics. This aligns the code-level behavior with the mathematical object m, and it prevents accidental ``leakage'' of high multiplicity into the small-noise window via, e.g., nondeterministic data augmentations or per-worker RNG divergence.
Our theory predicts that the noise-multiplicity schedule governs a privacy–fidelity tradeoff: in the overparameterized regime one should enforce strong reuse (small mk) in the small-noise window, while allocating larger mk to moderate/large-noise times to improve fidelity. We therefore recommend experiments in which the intervention is the schedule m = (m1, …, mK) used to generate the Gaussian perturbations paired with each training datum at each time bin. All other degrees of freedom—architecture, optimizer, learning-rate schedule, augmentations, time-sampling distribution, number of steps, and EMA settings—are held fixed.
We suggest two families to test generality: (i) a convolutional U-Net diffusion model on CIFAR-10 and/or ImageNet-64, and (ii) a transformer-based DiT-style model on ImageNet-64 or ImageNet-128 (compute permitting). These choices expose different inductive biases and parameterizations while keeping evaluation and training pipelines standardized. For each setup we fix the discretization {tk}k = 1K (or the corresponding continuous-time sampler with a binning map), and we record the realized bin occupancies Nk to ensure that the effective quadrature weights match the intended Δtk.
We recommend comparing at least the following schedules: uniform reuse (mk ≡ 1), uniform fresh noise (mk effectively unbounded), and the throttled schedule that enforces reuse only on tk ≤ tmin while refreshing noise elsewhere. When reporting, we include the audited effective multiplicities m̂k (on a monitored subset of examples) to verify that the implementation matches the intended schedule.
We propose two complementary protocols. In the first, we fix the number of optimizer steps and batch size across schedules, so that wall-clock compute is matched and the only change is the statistical effect of reusing versus refreshing noise. In the second, we explicitly match the proxy 𝒞(m) by adjusting the number of calls per bin: for example, oversample time bins with larger mk so that the total number of noised pairs per datum scales like ∑kmkΔtk. The two protocols answer different questions: the first isolates the privacy effect of reuse under standard training budgets; the second more directly instantiates the compute-constrained optimization problem.
We recommend standard unconditional generation metrics: FID and Inception Score for CIFAR-10, and FID and precision–recall for ImageNet resolutions. When feasible, we also report likelihood proxies (e.g. bits-per-dimension via a variational bound) and calibration metrics for classifier-free guidance sweeps, as reuse schedules may interact with guidance at sampling time. To connect to the time-resolved nature of the analysis, we additionally log per-bin denoising losses (or DSM losses) at evaluation time: for each k we estimate 𝔼∥ŝtk(x(tk)) − s⋆(tk, x(tk))∥2 using a held-out validation set and fresh evaluation noise. Although s⋆ is unavailable, one may use the standard DSM target estimator (fresh Monte Carlo at evaluation) as a consistent proxy; the important point is that evaluation uses fresh noise for all schedules.
A primary privacy diagnostic is nearest-neighbor (NN) retrieval between generated samples and the training set. For each generated image x̃ we compute its nearest neighbor in a feature space (e.g. CLIP or Inception embeddings) and report the distribution of distances, as well as the fraction of generations whose NN distance falls below a fixed threshold calibrated on validation data. We also report qualitative galleries of (x̃, NN(x̃)) pairs for the tail of closest matches. The theoretical expectation is that schedules with larger effective multiplicity at small t (closer to mk = ∞ for tk ≤ tmin) will produce more near-copies of training points in the overparameterized regime, while the throttled/optimized schedules suppress this phenomenon without sacrificing late-time fidelity.
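A minimal sketch of the NN-retrieval diagnostic, operating on precomputed feature embeddings; the chunking only bounds memory:

```python
import numpy as np

def nn_retrieval(gen_emb, train_emb, threshold, chunk=256):
    """Nearest-neighbor retrieval diagnostic in a feature embedding space.

    gen_emb   : (G, D) embeddings of generated samples (e.g. CLIP/Inception features)
    train_emb : (N, D) embeddings of training samples
    Returns (fraction of generations with NN distance below threshold, all NN distances).
    """
    mins = []
    for start in range(0, len(gen_emb), chunk):          # chunking only bounds memory
        g = gen_emb[start:start + chunk]                  # (c, D)
        d2 = ((g[:, None, :] - train_emb[None, :, :]) ** 2).sum(-1)
        mins.append(np.sqrt(d2.min(axis=1)))
    mins = np.concatenate(mins)
    return float((mins < threshold).mean()), mins
```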
We recommend membership inference (MI) evaluations tailored to diffusion models. A simple and reproducible attack uses the model’s denoising loss as a score: given a candidate example x, sample times t (with emphasis on t ≤ tmin) and fresh noises z, and compute the average squared error between the model prediction and the DSM target; the attacker classifies x as a member if this loss is unusually small relative to a calibration set. Stronger attacks train a logistic regressor on a vector of per-time losses {ℓ(tk; x)}k ≤ kmin using shadow models. We report the ROC-AUC and true-positive rate at fixed false-positive rate. We expect, consistent with the small-t monotonicity, that increasing multiplicity in the small-noise window improves the model’s fit to training idiosyncrasies and hence increases MI advantage, while allocating multiplicity to larger times yields smaller MI gains at comparable fidelity improvements.
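A minimal sketch of the loss-based attack score, with `model_loss` an assumed callable returning the squared DSM error for one fresh noise draw; thresholding against a calibration set and shadow-model variants are omitted:

```python
import numpy as np

def mi_score(model_loss, x, times, n_noise=16, rng=None):
    """Loss-based membership-inference score for one candidate example x.

    model_loss : assumed callable (x, t, z) -> squared error between the model's
                 prediction and the DSM target for a single fresh noise draw z
    times      : probe times, with emphasis on t <= t_min
    A higher score (smaller average denoising loss) is evidence of membership; the
    decision threshold is calibrated on known non-members.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[-1]
    losses = [model_loss(x, t, rng.standard_normal(d)) for t in times for _ in range(n_noise)]
    return -float(np.mean(losses))
```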
To map an empirical compute–privacy frontier, we sweep (B, ρ) (or, operationally, sweep tmin and the measure of bins upgraded to large mk) and plot fidelity metrics versus MI AUC and NN-retrieval rates. We recommend multiple random seeds and reporting mean±standard error. Finally, to connect with the role of overparameterization, we vary width (or parameter count) to move between ψp > ψn and ψp < ψn qualitatively, and we test whether the sign of the small-t effect reverses as predicted: in an underparameterized setting, higher multiplicity should monotonically improve test metrics with weaker memorization signatures, whereas in an overparameterized setting the benefit of high multiplicity should concentrate at larger t and come with increased small-t privacy risk.
Our scheduling conclusions are derived in a deliberately stylized, time-separable training model: for each tk we train an independent RFNN estimator ŝk to the global minimizer of a regularized DSM objective with multiplicity mk, and we evaluate its asymptotic test curve εtest(tk; mk) together with the mismatch Mtk(mk) to the empirical score. This abstraction is appropriate for isolating the statistical role of noise reuse, but it ignores algorithmic coupling across times (shared parameters, shared optimizer state) and across data points (finite-batch SGD correlations). Consequently, our formal optimality statements for the binary policy class should be interpreted as optimality within this stylized regime; transferring them to end-to-end diffusion training requires additional arguments controlling how far a practical training pipeline deviates from this separable limit.
Modern diffusion models use a single time-conditioned network sθ(t, x) trained on a mixture over t, so the effective estimator at time tk depends on the entire schedule m, not only on mk. In this setting the objective is no longer a sum of independent per-bin risks, and the scheduling problem may cease to reduce to a threshold/greedy selection even under monotonicity at the endpoints m ∈ {1, ∞}. Nevertheless, we expect the qualitative structure to persist when (i) the network has sufficient capacity to fit each time slice with limited interference and (ii) the time-sampling distribution makes the per-time gradients approximately orthogonal. A concrete route to formalization is to study a linearized (NTK) or random-features approximation for a feature map Φ(t, x), in which the Gram matrix acquires a block structure indexed by time. One would then replace the scalar learning-curve equations by a coupled fixed-point system whose off-diagonal blocks quantify time interference; small off-diagonal norms should recover our separable formulas perturbatively, yielding stability of the optimal schedule to moderate coupling.
Our analysis privileges an OU forward process and targets P0 that are Gaussian or close to Gaussian in the sense required by the source’s Hermite/rotational tools. Real data are neither Gaussian nor full-dimensional, and many pipelines employ variance-preserving or variance-exploding SDEs with nontrivial β(t). We view OU as a mathematically controlled proxy capturing the key small-noise phenomenon: for t ↓ 0, the conditional distribution Pt(⋅ ∣ x0) concentrates near x0, so the empirical score becomes increasingly sensitive to individual training points, and overparameterized estimators can track this sensitivity when the Monte Carlo target variance is sufficiently reduced. Extending theorems such as the small-t monotonicity to non-Gaussian P0 plausibly requires two ingredients: (i) local asymptotics of se(t, ⋅) as a smoothed empirical measure for general P0 (possibly supported near a manifold) and (ii) learning-curve control for RF/NTK regression with non-isotropic covariates induced by Pt. A particularly relevant intermediate case is a subspace Gaussian model, where P0 is Gaussian on a low-dimensional subspace plus ambient noise; here one can still hope to obtain explicit equations and to identify when the memorization window is governed by the intrinsic dimension rather than d.
While the binary class mk ∈ {1, ∞} exposes the clearest structure, practical training uses bounded integer multiplicities and complex reuse patterns (e.g. caching a finite pool of noises per datum). Our bounded-mmax variant assumes that εtest(t; m) and Mt(m) can be interpolated between endpoints in a manner respecting diminishing returns. Establishing such an interpolation from first principles is nontrivial: the Monte Carlo target variance scales like 1/m, but the induced effective noise in the regression problem depends on both t (through at, ht) and on the estimator class (through resolvents of random feature matrices). A natural next step is to prove explicit monotonicity and concavity properties of the learning curves as functions of 1/m in the relevant regimes, which would justify greedy incremental allocation (priority-queue style) with provable approximation guarantees.
Classifier-free guidance and related sampling heuristics change the effective reverse drift, often by amplifying score magnitudes in regions where the conditional and unconditional scores differ. Because our fidelity proxy ℱ(m) is tied to unconditional DSM test error, it may not predict guidance-conditioned artifacts or privacy leakage under aggressive guidance scales. Moreover, guidance tends to emphasize high-frequency details, which are precisely the aspects most susceptible to small-t memorization. A more complete account would incorporate (i) conditional scores, (ii) the guidance transformation, and (iii) a privacy metric defined on the output distribution. Analytically, this suggests studying how perturbations of the score field (in L2(Pt) or stronger norms) are magnified by the discretized reverse solver under guidance, and whether the amplification is concentrated at early times.
We have used the standard inequality relating KL(P0∥P̂0) to ∫0Tεtest(t; m(t)) dt as a motivation for ℱ(m). Turning this into a fully end-to-end guarantee for a discrete-time sampler requires additional control of (i) discretization error in time, (ii) numerical integration error in the reverse SDE/ODE solver, (iii) the approximation gap between the trained network and the population regression function under finite optimization, and (iv) the stability of the reverse-time flow to localized score errors. Each term is in principle tractable, but their combination is delicate: small errors at early times can propagate through the flow map, and the relevant stability constants may depend on tail behavior of Pt and on Lipschitz properties of the score. Thus, a complete theorem would likely assume quantitative regularity of Pt (e.g. log-Sobolev or bounded Hessian of log Pt) together with uniform-in-t control of ∥ŝt − s⋆(t, ⋅)∥L2(Pt) and a solver-specific stability bound.
Our risk functional ℛ(m) is intentionally operational: it penalizes proximity to the empirical score in the small-noise window, motivated by the empirical linkage between such proximity and memorization behaviors. This is not a formal differential privacy (DP) guarantee, and it does not preclude leakage through other channels (e.g. model inversion, prompt conditioning, or rare-mode exposure). Bridging to DP would require a sensitivity analysis of the training algorithm under the reuse schedule, including the effect of noise reuse on gradient variance and on the influence function of individual samples. We regard this as complementary: the present framework is designed to expose a lever (where to spend noise diversity in time) that can be combined with orthogonal privacy mechanisms (clipping, DP-SGD, data augmentation, or regularization).
Several extensions appear immediate. First, one can jointly optimize the time-sampling distribution and the reuse schedule, replacing fixed Δtk by decision variables subject to a total-mass constraint. Second, one can make the schedule adaptive, choosing mk online based on monitored indicators such as validation denoising error or an empirical memorization statistic, which would convert the present static knapsack-like problem into a feedback control problem over diffusion time. Third, one can study schedule transfer across architectures by expressing the gains gk and risk deltas rk in terms of measurable quantities (e.g. per-time gradient norms or effective signal-to-noise ratios) rather than oracle access to asymptotic curves. Finally, understanding whether the small-t throttling principle survives in fully nonlinear, finite-d regimes—and how it interacts with guidance and conditioning—remains the central empirical and theoretical question motivating this line of work.