
Time-Conditioned Denoising Score Matching as Multi-Task Interpolation: Precise Learning Curves and Memorization Thresholds


Table of Contents

  1. Introduction: memorization in time-conditioned diffusion; why single-time theory is insufficient; summary of contributions and phase diagram with n_eff.
  2. Background: OU/VP diffusion, DSM objectives, empirical optimal score and memorization mechanism; recap of random-features/NTK learning-curve tools from the source material (GEP, linear pencils).
  3. Multi-time DSM with shared parameters: formal definition of the joint objective over {t_k}, noise reuse {m_k}, time embeddings τ(t), and shared score model s_A(x,t); definitions of per-time and integrated train/test errors and memorization order parameters.
  4. Main technical reduction: Gaussian equivalence for time-augmented features; identification of the coupled random matrix whose resolvent controls learning curves; statement of the block linear-pencil construction.
  5. The K=2 case in full detail: explicit 2×2 block resolvent equations; closed-form characterization of limiting train/test errors; definition and computation of ψ_eff; recovery of the p ≈ n boundary under time-decoupling.
  6. General finite K: operator-/block-valued fixed-point equations; conditions under which ψ_eff exists and is unique; qualitative monotonicity in m_k and in time-feature correlation.
  7. Phase boundary and (near-)matching bounds: sufficient conditions (upper) for generalization when p < n_eff and unavoidable alignment (lower) when p > n_eff within the shared linear-in-parameters model class.
  8. Experiments: NTK-linearized U-Net/DiT and finite-width checks; measuring effective sample size and predicting threshold; ablations over time grid, embedding dimension, and per-time noise reuse; link to nearest-neighbor retrieval under reverse diffusion.
  9. Discussion: implications for training schedules and architecture design; connections to early stopping, guidance, and latent diffusion; open problems (continuous time, feature learning, non-Gaussian data).
  10. Appendices: full linear-pencil derivations; numerical solver details; additional plots; proofs of auxiliary lemmas; experimental protocols and reproducibility.

Content

1. Introduction: memorization in time-conditioned diffusion; why single-time theory is insufficient; summary of contributions and phase diagram with n_eff.

We consider score learning for an Ornstein–Uhlenbeck (OU) diffusion at a finite set of time points {tk}k = 1K, where each forward perturbation takes the form
$$ x_{t_k}=a_k x_0+\sqrt{h_k}\,z,\qquad a_k=e^{-t_k},\qquad h_k=1-e^{-2t_k}, $$
with x0 ∼ P0 and z ∼ 𝒩(0, Id). Our focus is on the high-dimensional regime in which the ambient dimension d grows proportionally with the number of training samples n and the number of random features (equivalently, linearized parameters) p, while the number of training times K is fixed. The central phenomenon we address is memorization in time-conditioned diffusion score training: when a single model is trained across times with shared parameters, the small-noise times can undergo an abrupt transition from population-level generalization to an estimator that tracks an empirical, kernel-density-like score field concentrated around the training data.

The memorization mechanism is most transparent at times with hk ≪ 1. At such times, xtk is close to x0, and the score s*(⋅, tk) = ∇log ptk encodes fine-scale structure of the data distribution. In denoising score matching (DSM), one trains a model to predict $-z/\sqrt{h_k}$ from xtk, and thus the supervision becomes increasingly "sharp" as hk → 0. When the model class is sufficiently expressive relative to the effective number of constraints supplied by the dataset and the noise sampling budget, the learned score can match the empirical optimal score semp(⋅, tk), i.e. the score of the finite-sample mixture distribution induced by the training points and the OU forward kernel. This empirical score is accurate on the training set but can deviate from the population score in a way that does not vanish with d, yielding a nontrivial excess test error at small noise even as the training error continues to decrease.

Existing high-dimensional learning-curve theory for random-features regression provides a precise account of interpolation thresholds and excess risk in score regression problems. However, modern diffusion training does not fit a single-time abstraction: a single score network is trained jointly across a family of times, and this joint training introduces two structural differences that invalidate a naive extension of single-time formulas.

First, parameter sharing couples the regression problems across times. Concretely, a shared second-layer weight matrix A ∈ ℝd × p must simultaneously fit K regression tasks indexed by tk, each with its own forward distribution Ptk and label field $-z/\sqrt{h_k}$. Even when the time embedding τ(t) and the random time weights Wt are fixed, the resulting feature design matrices at different tk are neither independent nor orthogonal in general. Consequently, the effective degrees of freedom available to fit small-noise constraints depend not only on ψp = p/d and ψn = n/d but also on cross-time correlations induced by τ(⋅) and on the distribution of noise budgets {mk}.

Second, multi-time training can shift the interpolation boundary for the small-noise task away from the classical p ≈ n threshold. Intuitively, a large-noise time tℓ (with hℓ = Θ(1)) supplies abundant, low-variance regression constraints that are easier to fit and may partially "use up" shared capacity, whereas a small-noise time ts (with hs ≪ 1) supplies highly localized constraints that can trigger memorization once the shared parameterization becomes expressive enough to represent the empirical score component. Thus, the correct control parameter is not simply ψn but an effective ratio ψeff that aggregates, through the coupling structure, how many independent constraints the joint design imposes on the shared model for the purpose of fitting the small-noise score.

The goal of this work is to provide a mathematically explicit characterization of this coupled-time transition in a tractable surrogate model that still retains the essential coupling induced by time-conditioning. We analyze a random-features/NTK-linearized score model of the form
$$ s_A(x,t)=\frac{A}{\sqrt{p}}\,\varphi\!\left(\frac{W_x x}{\sqrt{d}},\frac{W_t\tau(t)}{\sqrt{d_t}}\right), $$
trained by a multi-time DSM objective with ridge regularization λ > 0 and with mk independent noise draws per datum at each time tk. We work under proportional asymptotics d, n, p → ∞ with fixed ratios (ψn, ψp) and fixed intrinsic ratio ψD = D/d for the subspace Gaussian data model P0 = 𝒩(0, Π). The restriction to finite, fixed K is deliberate: it reflects the fact that practice uses a discretized time grid, and it yields a closed-form random-matrix characterization in terms of a constant-size system of equations.
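To fix ideas, the following minimal NumPy sketch instantiates the linearized score model above with an illustrative elementwise activation φ(u, v) = tanh(u + v) and a sinusoidal time embedding τ(t); both choices, and all dimensions, are placeholders rather than the settings analyzed here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, d_t = 64, 256, 16              # ambient, feature, and embedding dimensions

W_x = rng.standard_normal((p, d))    # fixed random data weights
W_t = rng.standard_normal((p, d_t))  # fixed random time weights
A = rng.standard_normal((d, p))      # trainable second layer, shared across times

def tau(t):
    # Illustrative sinusoidal embedding; the model only requires some fixed tau(t).
    freqs = np.arange(1, d_t // 2 + 1)
    return np.concatenate([np.sin(freqs * t), np.cos(freqs * t)])

def phi(u, v):
    # Illustrative elementwise activation phi(u, v).
    return np.tanh(u + v)

def s_A(x, t):
    # s_A(x, t) = (A / sqrt(p)) * phi(W_x x / sqrt(d), W_t tau(t) / sqrt(d_t))
    u = W_x @ x / np.sqrt(d)
    v = W_t @ tau(t) / np.sqrt(d_t)
    return (A / np.sqrt(p)) @ phi(u, v)

x = rng.standard_normal(d)
print(s_A(x, 0.1).shape)             # -> (d,)
```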

Our contributions can be summarized as follows.

We show that the expected per-time training and test errors converge to deterministic limits that can be expressed using traces of resolvents of a random matrix built from the joint feature design across all times. The multi-time coupling requires a matrix-valued Stieltjes transform rather than a scalar one: cross-time correlations appear as off-diagonal blocks, and the quantities controlling generalization are block traces of the inverse of a structured linear pencil. The final outcome is a finite-dimensional algebraic fixed-point system whose unknowns are normalized block traces; solving this system yields the limiting errors Etrain(tk) and Etest(tk), as well as any aggregate error $\overline{E}_{\mathrm{test}}$ over the time grid.

In the regime where at least one time satisfies hk ≪ 1, we identify a computable scalar order parameter ψeff such that the small-noise learning curve exhibits a sharp crossover at ψp ≈ ψeff. Below this boundary, the estimator remains close (in the appropriate population sense) to the Bayes score and the small-noise test error is controlled. Above this boundary, the estimator aligns with the empirical optimal score on the data manifold, and a nonvanishing excess test error emerges. Importantly, ψeff depends on the full multi-time design, including the noise budgets {mk} and the time embedding correlations, and it reduces to ψn only in a decoupled-time limit.

For K = 2 with a small-noise time ts and a larger-noise time tℓ, we derive an explicit 2 × 2 block linear pencil and obtain closed resolvent equations that quantify how cross-time coupling shifts the memorization threshold. This yields a provable characterization of ψeff(ts, tℓ, ms, mℓ, τ, φ, ψn, ψD, λ) and demonstrates how the classical boundary p ≈ n is recovered when time features are orthogonal so that cross-time correlations vanish.

We interpret the transition as occurring at an effective sample size neff := dψeff, which plays the role of the number of independent constraints that the joint multi-time regression imposes on the shared parameters for the small-noise task. In particular, in the overparameterized regime ψp ≫ ψeff, the shared model has sufficient capacity to interpolate the multi-time DSM constraints in a manner that necessarily introduces an empirical-score component at small noise. Conversely, when ψp ≪ ψeff, the estimator is forced to remain in a population-aligned regime and does not collapse to the empirical score.

The remainder of the paper develops the background and the technical tools needed to support these statements. We first formalize the OU/variance-preserving diffusion setting and the DSM objectives, and we make precise the distinction between the population score s* and the empirical optimal score semp that governs memorization. We then recall the random-features learning-curve machinery required for the multi-time setting, including the Gaussian equivalence principle and the linear-pencil linearization that converts coupled rational trace expressions into a tractable block-resolvent problem.


2. Background: OU/VP diffusion, DSM objectives, empirical optimal score and memorization mechanism; recap of random-features/NTK learning-curve tools from the source material (GEP, linear pencils).

For each time t > 0, the OU forward map defines a Markov kernel Qt(dx ∣ x0) given by
$$ x_t = a\,x_0+\sqrt{h}\,z,\qquad a=e^{-t},\qquad h=1-e^{-2t},\qquad z\sim\mathcal{N}(0,I_d), $$
so that Qt(⋅ ∣ x0) = 𝒩(ax0, hId) and the marginal is the convolution
Pt = P0 ∗ Qt,   pt(x) = ∫ϕh(x − ax0) P0(dx0),
where ϕh(u) = (2πh)^{−d/2} exp (−∥u∥22/(2h)). The population score field at time t is
s*(x, t) = ∇xlog pt(x),
which is well-defined under mild conditions (e.g. if P0 has a density, or more generally as a weak gradient of the log-convolution). In the subspace Gaussian model used throughout, P0 = 𝒩(0, Π) with Π an orthogonal projection of rank D, and Pt is again Gaussian with covariance
Σt = a2Π + hId,   so that   s*(x, t) = −Σt−1x.
This explicit form is convenient for computing Bayes benchmarks and for separating behavior along Im(Π) and its orthogonal complement.
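As a concrete check of this benchmark, the sketch below builds a rank-D projection Π, forms Σt = a2Π + hId, and verifies that the population score −Σt−1x acts as −1/(a2 + h) on Im(Π) and as −1/h on its orthogonal complement; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, t = 32, 8, 0.3
a, h = np.exp(-t), 1.0 - np.exp(-2.0 * t)

# Orthogonal projection Pi of rank D onto a random D-dimensional subspace.
Q, _ = np.linalg.qr(rng.standard_normal((d, D)))
Pi = Q @ Q.T

# Covariance of P_t and the population score s*(x, t) = -Sigma_t^{-1} x.
Sigma_t = a**2 * Pi + h * np.eye(d)

def population_score(x):
    return -np.linalg.solve(Sigma_t, x)

x = rng.standard_normal(d)
s = population_score(x)
# Along Im(Pi) the score scales as -1/(a^2 + h); orthogonal to it, as -1/h.
print(np.allclose(Pi @ s, -(Pi @ x) / (a**2 + h)))   # True
print(np.allclose(s - Pi @ s, -(x - Pi @ x) / h))    # True
```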

The (population) denoising score matching objective at time t is the regression risk obtained by observing xt and predicting $-z/\sqrt{h}$:
$$ \mathcal{R}_t(s)\;:=\;\frac{1}{d}\,\mathbb{E}\Bigl[\bigl\|\sqrt{h}\,s(x_t,t)+z\bigr\|_2^2\Bigr], $$
where the expectation is over (x0, z). A standard integration-by-parts identity (equivalently, Stein’s lemma under the conditional Gaussian xt ∣ x0) implies the basic equivalence
$$ \mathcal{R}_t(s)\;=\;\frac{h}{d}\,\mathbb{E}_{x\sim P_t}\Bigl[\bigl\|s(x,t)-s^{*}(x,t)\bigr\|_2^2-\bigl\|s^{*}(x,t)\bigr\|_2^2\Bigr]\;+\;1; $$
hence, at each fixed t, minimizing DSM is equivalent to L2(Pt) regression toward the population score s*(⋅, t), up to the additive constant 1 and a term that does not depend on s.

The memorization phenomenon is not captured by this population-level identity alone because practical training does not minimize ℛt over all measurable functions: it minimizes an empirical average over finitely many data points and finitely many noises. Fix a dataset {xi}i = 1n and (for the moment) consider the idealized regime m = ∞ where we average over the noise conditional on each xi. The empirical forward distribution at time t is then the mixture
$$ P_t^{\mathrm{emp}} := \frac{1}{n}\sum_{i=1}^n \mathcal{N}(a x_i, h I_d), \qquad p_t^{\mathrm{emp}}(x)=\frac{1}{n}\sum_{i=1}^n \phi_h(x-a x_i), $$
and the corresponding empirical optimal score is
$$ s^{\mathrm{emp}}(x,t):=\nabla_x \log p_t^{\mathrm{emp}}(x) =\frac{\sum_{i=1}^n \phi_h(x-a x_i)\,\frac{a x_i-x}{h}}{\sum_{i=1}^n \phi_h(x-a x_i)}. $$
This field coincides with the population score of the finite-sample mixture Ptemp, and it is the exact minimizer of the conditional-noise-averaged empirical DSM risk over all measurable functions. In particular, semp is a kernel-density-like object at bandwidth $\sqrt{h}$, and its deviation from s* is controlled by the usual bias–variance tradeoffs of KDE, except that we evaluate it in the high-dimensional scaling relevant to learning curves.
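A minimal implementation of the empirical optimal score is a softmax-weighted average of the per-sample directions (axi − x)/h; the sketch below computes it with a log-sum-exp normalization for numerical stability at small h (sizes and the standard-normal training data are illustrative).

```python
import numpy as np

def empirical_score(x, train_X, t):
    # s_emp(x, t) = sum_i w_i(x) (a x_i - x) / h, with w_i the softmax of
    # -||x - a x_i||^2 / (2h) (the Gaussian kernel weights phi_h(x - a x_i)).
    a, h = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    diffs = a * train_X - x[None, :]                 # rows: a x_i - x
    log_w = -np.sum(diffs ** 2, axis=1) / (2.0 * h)
    log_w -= log_w.max()                             # log-sum-exp stabilization
    w = np.exp(log_w)
    w /= w.sum()
    return (w[:, None] * diffs).sum(axis=0) / h

rng = np.random.default_rng(2)
n, d, t = 200, 16, 0.05
train_X = rng.standard_normal((n, d))
a, h = np.exp(-t), 1.0 - np.exp(-2.0 * t)
x_t = a * train_X[0] + np.sqrt(h) * rng.standard_normal(d)
print(empirical_score(x_t, train_X, t)[:3])
```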

The "sharpness" of small-noise constraints is visible directly in semp. When h ≪ 1, the mixture components concentrate in balls of radius $\sqrt{h}$ around the points axi, and near such a point the score behaves like (axi − x)/h, i.e. it has magnitude on the order of $1/\sqrt{h}$ at typical perturbations. Thus, for small h, the empirical target contains narrow, high-curvature structure tied to individual training samples. A sufficiently expressive model trained on the empirical objective can represent this sample-specific structure; doing so yields low training error while potentially increasing the population test error measured against s* (equivalently, the L2(Pt) error in the identity above). The crossover we study arises when the shared parameterization becomes capable of fitting the empirical score component at the small-noise times.

For finite m < ∞, the empirical loss incorporates additional label noise due to Monte Carlo sampling of z. At fixed t, this is the standard random-design regression setting with nm samples; however, in multi-time training the relevant constraint count is not simply ∑k n mk because all times share the same parameter matrix and their feature designs are correlated.

Once we fix the first-layer random weights and the time embedding, the score model is linear in the trainable parameters. At a given time tk, each perturbed input xtk(i, j) produces a feature vector in ℝp via
$$ \phi_{k}^{(i,j)}:=\varphi\!\left(\frac{W_x x_{t_k}^{(i,j)}}{\sqrt{d}},\frac{W_t \tau(t_k)}{\sqrt{d_t}}\right), $$
and the model output is $s_A(x_{t_k}^{(i,j)},t_k)=(A/\sqrt{p})\,\phi_{k}^{(i,j)}$. The DSM training problem is therefore a ridge regression with vector-valued labels $-z_k^{(i,j)}/\sqrt{h_k}$ and a design matrix assembled from the feature vectors across all times. High-dimensional learning-curve theory reduces train/test metrics to traces of rational functions of empirical Gram matrices (or, equivalently, sample covariances in feature space).

The primary obstacle is that ϕk(i, j) is nonlinear in Gaussian weights and high-dimensional inputs. We address this using a Gaussian equivalence principle (GEP): under the proportional limit and mild moment conditions on φ, the feature matrix can be replaced, for the purpose of computing normalized traces of resolvents and related quantities, by a Gaussian design with matched first- and second-order statistics. Concretely, only a finite set of activation-dependent coefficients (e.g. Hermite coefficients, or equivalently the induced mean and covariance parameters) enter the limiting learning curves. In the multi-time setting, we require the joint covariance across times; the time embedding induces correlations
𝔼[ϕk(i, j)(ϕℓ(i, j))⊤]
that remain nontrivial even when the data/noises are independent across (k, i, j), and these correlations are the mechanism by which parameter sharing couples the effective regression problems.

To compute the limiting traces, we use a linear-pencil linearization. In its simplest form, for a (centered) feature matrix Φ ∈ ℝN × p one can embed the resolvent of Φ⊤Φ/p into the inverse of a block matrix
$$ L(z)=\begin{pmatrix} -z I_p & \Phi^\top/\sqrt{p}\\ \Phi/\sqrt{p} & -I_N \end{pmatrix}, \qquad L(z)^{-1}=\begin{pmatrix} (\Phi^\top \Phi/p-z I_p)^{-1} & \ast\\ \ast & \ast \end{pmatrix}, $$
so that normalized traces of (Φ⊤Φ/p + λI)−1 and related objects become block traces of L(−λ)−1. In the coupled multi-time problem, Φ is replaced by a structured block design collecting all times, and one must track block traces corresponding to each time (and cross-time) component. The resulting deterministic equivalents are obtained by solving a finite-dimensional Dyson (self-consistent) equation for these block traces. This machinery is the technical backbone that converts a high-dimensional, nonlinear random-features DSM problem into a closed algebraic system whose solution yields the limiting training and test errors, and ultimately the effective sample-size parameter governing memorization.
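The block identity underlying the linearization is easy to check numerically: for a small random Φ, the top-left block of L(z)−1 coincides with the resolvent (Φ⊤Φ/p − zIp)−1. The sketch below performs exactly this sanity check at a negative spectral parameter.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, z = 40, 30, -0.7           # evaluate at z = -lambda < 0

Phi = rng.standard_normal((N, p))
L = np.block([
    [-z * np.eye(p),     Phi.T / np.sqrt(p)],
    [Phi / np.sqrt(p),   -np.eye(N)        ],
])
L_inv = np.linalg.inv(L)

# Top-left p x p block of L(z)^{-1} equals (Phi^T Phi / p - z I_p)^{-1}.
resolvent = np.linalg.inv(Phi.T @ Phi / p - z * np.eye(p))
print(np.allclose(L_inv[:p, :p], resolvent))   # True
```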


3. Multi-time DSM with shared parameters: formal definition of the joint objective over {t_k}, noise reuse {m_k}, time embeddings τ(t), and shared score model s_A(x,t); definitions of per-time and integrated train/test errors and memorization order parameters.

We fix a finite time grid {tk}k = 1K ⊂ (0, ∞) and write
ak = e−tk,   hk = 1 − e−2tk.
Given training samples {xi}i = 1n ⊂ ℝd drawn i.i.d. from P0, we generate, for each time tk and each datum xi, a (possibly time-dependent) number mk ∈ {1, 2, …} ∪ {∞} of i.i.d. Gaussian noises {zk(i, j)}j = 1mk with zk(i, j) ∼ 𝒩(0, Id), and we form perturbed inputs
$$ x_{t_k}^{(i,j)} \;:=\; a_k x_i + \sqrt{h_k}\, z_k^{(i,j)}. $$
The case mk = ∞ is interpreted as exact conditional-noise averaging, i.e. replacing empirical sums over j by an expectation over z ∼ 𝒩(0, Id) conditional on xi at that time.
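The following sketch generates such a multi-time training set for a subspace Gaussian P0: for each time tk it draws mk fresh noises per datum and stores the perturbed inputs together with the noises used as DSM targets (grid, budgets, and dimensions are illustrative).

```python
import numpy as np

rng = np.random.default_rng(4)
d, D, n = 32, 8, 200
times = [0.05, 0.5, 1.5]          # time grid {t_k}
budgets = [4, 2, 1]               # per-time noise budgets {m_k}

# Subspace Gaussian data: x_i ~ N(0, Pi) with Pi a rank-D orthogonal projection.
Q, _ = np.linalg.qr(rng.standard_normal((d, D)))
X0 = rng.standard_normal((n, D)) @ Q.T

dataset = {}                      # k -> (perturbed inputs, noises), both (n*m_k, d)
for k, (t, m_k) in enumerate(zip(times, budgets)):
    a_k, h_k = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    Z = rng.standard_normal((n, m_k, d))                  # z_k^{(i,j)}
    X_t = a_k * X0[:, None, :] + np.sqrt(h_k) * Z         # x_{t_k}^{(i,j)}
    dataset[k] = (X_t.reshape(n * m_k, d), Z.reshape(n * m_k, d))
    print(f"k={k}: a_k={a_k:.3f}, h_k={h_k:.3f}, N_k={n * m_k}")
```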

We train a score model shared across all times. Concretely, after sampling first-layer random weights (Wx, Wt) and fixing a time embedding τ(⋅), we consider the linear-in-parameters class
$$ s_A(x,t) \;=\; \frac{A}{\sqrt{p}}\,\varphi\!\left(\frac{W_x x}{\sqrt{d}},\,\frac{W_t\tau(t)}{\sqrt{d_t}}\right)\in\mathbb{R}^d, \qquad A\in\mathbb{R}^{d\times p}, $$
where φ acts elementwise on the p coordinates. For each training triple (k, i, j), we denote the corresponding feature vector by
$$ \phi_k^{(i,j)}\;:=\;\varphi\!\left(\frac{W_x x_{t_k}^{(i,j)}}{\sqrt{d}},\,\frac{W_t\tau(t_k)}{\sqrt{d_t}}\right)\in\mathbb{R}^p, \qquad s_A(x_{t_k}^{(i,j)},t_k)=\frac{A}{\sqrt{p}}\,\phi_k^{(i,j)}. $$

The multi-time denoising score matching (DSM) objective with ridge regularization λ > 0 is then
$$ \mathcal{L}(A)\;=\;\sum_{k=1}^{K}\frac{1}{d\,N_k}\sum_{i=1}^{n}\sum_{j=1}^{m_k}\Bigl\|\sqrt{h_k}\,s_A\bigl(x_{t_k}^{(i,j)},t_k\bigr)+z_k^{(i,j)}\Bigr\|_2^2\;+\;\frac{\lambda}{d\,p}\sum_{k=1}^{K}h_k\,\|A\|_F^2, $$
with the convention that when mk = ∞ the inner average is 𝔼z ∼ 𝒩(0, Id)[⋅ ∣ xi]. The factor hk in the regularizer is chosen so that each time contributes at the same scale as the corresponding DSM loss, and so that the small-noise regime hk ≪ 1 does not trivially force over-regularization. Since A is shared, the objective is not a sum of independent single-time problems: even if the noise draws are independent across times, the same parameter matrix must simultaneously fit all time-indexed regression tasks.

It is convenient to make the regression structure explicit. Let Nk := nmk be the number of training pairs at time tk (finite if mk < ∞), and form the design matrix Φk ∈ ℝNk × p whose rows are the feature vectors {ϕk(i, j)}, ordered arbitrarily. Likewise let Zk ∈ ℝNk × d be the matrix whose rows are the noise vectors {zk(i, j)}. Writing the prediction matrix as $\widehat{S}_k:=\Phi_k A^\top/\sqrt{p}\in\mathbb{R}^{N_k\times d}$, we may rewrite the data-fit part of the objective as
$$ \frac{1}{dN_k}\bigl\|\sqrt{h_k}\,\widehat{S}_k+Z_k\bigr\|_F^2, $$
so that Â = arg minA ℒ(A) is a ridge regression estimator with a stacked response and a block design across times. Importantly, the effective coupling is controlled by the correlations between ϕk(i, j) and ϕℓ(i, j) induced jointly by the shared weights Wx, the shared time weights Wt, and the embedding inner products ⟨τ(tk), τ(tℓ)⟩.

To evaluate performance, we distinguish empirical (training) DSM errors from population (test) DSM errors at each time. For any fitted parameter Â, we define the per-time training error (excluding the ridge term) by
$$ E_{\mathrm{train}}(t_k)\;:=\;\frac{1}{d\,N_k}\sum_{i=1}^{n}\sum_{j=1}^{m_k}\Bigl\|\sqrt{h_k}\,s_{\widehat{A}}\bigl(x_{t_k}^{(i,j)},t_k\bigr)+z_k^{(i,j)}\Bigr\|_2^2, $$
again with the conditional expectation convention when mk = ∞. The corresponding per-time test error is the DSM risk evaluated on an independent draw (x0, z):
$$ E_{\mathrm{test}}(t_k)\;:=\;\frac{1}{d}\,\mathbb{E}\Bigl[\bigl\|\sqrt{h_k}\,s_{\widehat{A}}\bigl(a_k x_0+\sqrt{h_k}\,z,\;t_k\bigr)+z\bigr\|_2^2\Bigr], $$
where the expectation is over (x0, z) ∼ P0 ⊗ 𝒩(0, Id) and is conditional on the training data and random features through Â.

By the single-time identity above applied at each tk, we may equivalently view Etest(tk) as a scaled L2(Ptk) score estimation error:
$$ E_{\mathrm{test}}(t_k)\;=\;\frac{h_k}{d}\,\mathbb{E}_{x\sim P_{t_k}}\Bigl[\bigl\|s_{\widehat{A}}(x,t_k)-s^{*}(x,t_k)\bigr\|_2^2-\bigl\|s^{*}(x,t_k)\bigr\|_2^2\Bigr]\;+\;1. $$
Thus the natural score error is Etest(tk) − 1, and the factor hk emphasizes that small-noise times encode more stringent (higher-curvature) constraints per unit L2(Ptk) error.

Since K is fixed, we aggregate across times by a finite weighted average. Fix weights {wk}k = 1K with wk ≥ 0 and kwk = 1 (e.g. uniform weights wk = 1/K), and define
$$ \overline{E}_{\mathrm{train}}:=\sum_{k=1}^K w_k\,E_{\mathrm{train}}(t_k), \qquad \overline{E}_{\mathrm{test}}:=\sum_{k=1}^K w_k\,E_{\mathrm{test}}(t_k). $$
In applications one may interpret $\overline{E}_{\mathrm{test}}$ as a discretized integral along the diffusion time axis; in our asymptotic analysis it is simply a fixed linear functional of the per-time learning curves.

Finally, to formalize the "memorization" behavior at small noise, we introduce order parameters that compare the learned score to the empirical optimal score semp(⋅, tk) when mk = ∞. For such times, a natural empirical alignment metric is
$$ \mathrm{Mem}(t_k)\;:=\;\frac{h_k}{d}\,\mathbb{E}_{x\sim P_{t_k}^{\mathrm{emp}}}\Bigl[\bigl\|s_{\widehat{A}}(x,t_k)-s^{\mathrm{emp}}(x,t_k)\bigr\|_2^2\Bigr], $$
where $P_{t_k}^{\mathrm{emp}}=\frac{1}{n}\sum_{i=1}^n \mathcal{N}(a_k x_i,h_k I_d)$ is the conditional-noise-averaged empirical mixture. The scaling by hk matches the DSM normalization and ensures that Mem(tk) is commensurate with DSM excess risks. Informally, Mem(tk) → 0 indicates that the estimator has enough capacity (relative to the effective number of constraints induced by parameter sharing across times) to reproduce the KDE-like, sample-specific score field at bandwidth $\sqrt{h_k}$, whereas Mem(tk) bounded away from 0 indicates that the estimator remains closer to the population score s*(⋅, tk) in the sense of the test-error identity above. In the sequel, these per-time quantities, together with the train–test gap Etest(tk) − Etrain(tk) at small hk, will be the observables through which the effective sample-size ratio ψeff and the associated generalization–memorization crossover are identified.


4. Main technical reduction: Gaussian equivalence for time-augmented features; identification of the coupled random matrix whose resolvent controls learning curves; statement of the block linear-pencil construction.

Our main technical step is to reduce the multi-time DSM problem to the analysis of a coupled random matrix resolvent under a Gaussian surrogate for the time-augmented random features. We proceed in three stages: (i) rewrite the objective as a single ridge problem with a weighted, stacked design; (ii) invoke a Gaussian equivalence principle (GEP) to replace the nonlinear time-conditioned features by an explicitly correlated Gaussian design with matching low-order moments; and (iii) express all train/test observables as normalized traces of rational functions of the corresponding Gram operator, which we then linearize via a block linear pencil.

Let Nk := nmk (finite when mk < ∞) and write Φk ∈ ℝNk × p for the feature matrix at time tk, with rows {ϕk(i, j)}, and Zk ∈ ℝNk × d for the corresponding noise matrix with rows {(zk(i, j))}. Define the weighted stacked design and response
$$ \widetilde{\Phi} \;:=\; \begin{bmatrix} \sqrt{\frac{h_1}{N_1}}\Phi_1\\ \vdots\\ \sqrt{\frac{h_K}{N_K}}\Phi_K \end{bmatrix} \in\mathbb{R}^{(\sum_k N_k)\times p}, \qquad \widetilde{Z} \;:=\; \begin{bmatrix} \frac{1}{\sqrt{N_1}}Z_1\\ \vdots\\ \frac{1}{\sqrt{N_K}}Z_K \end{bmatrix} \in\mathbb{R}^{(\sum_k N_k)\times d}, $$
and set $\Lambda:=\lambda\sum_{k=1}^K h_k$ for the aggregate ridge coefficient induced by the per-time regularizers. Up to an additive constant independent of A, the objective becomes
$$ \mathcal{L}(A) \equiv \frac{1}{d}\Bigl\|\widetilde{\Phi}\frac{A^\top}{\sqrt{p}}+\widetilde{Z}\Bigr\|_F^2 +\frac{\Lambda}{dp}\|A\|_F^2. $$
Hence the minimizer admits the usual closed form
$$ \widehat{A}\;=\;-\sqrt{p}\;\widetilde{Z}^{\top}\widetilde{\Phi}\,Q,\qquad Q:=\bigl(\widetilde{\Phi}^{\top}\widetilde{\Phi}+\Lambda I_p\bigr)^{-1}. $$
Because the response is isotropic across output coordinates, all per-time train/test quantities reduce to scalar ridge regression identities involving Q and certain cross-covariances between (time-dependent) test features and the training design. In particular, the deterministic limits of the per-time training and test errors are governed by finitely many normalized traces such as $\frac{1}{p}\mathrm{Tr}(Q)$, $\frac{1}{p}\mathrm{Tr}(Q\widetilde{\Phi}^\top\widetilde{\Phi}Q)$, and (for test risk) quadratic forms $\frac{1}{p}\phi_k(x)^\top Q\,\phi_k(x)$ averaged over x ∼ Ptk, all of which can be expressed through resolvents of a coupled Gram operator.
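The closed form and the associated resolvent observables are straightforward to evaluate at finite size; the sketch below assembles Φ̃ and Z̃ from per-time blocks (drawn at random here purely to exercise the algebra, in place of the actual feature map), forms Q = (U + ΛIp)−1, and reads off Â together with two of the normalized traces mentioned above.

```python
import numpy as np

rng = np.random.default_rng(5)
p, d, lam = 60, 20, 0.1
h = [1 - np.exp(-2 * t) for t in (0.05, 1.0)]
N = [80, 40]                                   # N_k = n * m_k at each time

# Stand-in feature and noise blocks Phi_k (N_k x p), Z_k (N_k x d).
Phi = [rng.standard_normal((Nk, p)) for Nk in N]
Z = [rng.standard_normal((Nk, d)) for Nk in N]

# Weighted stacked design/response and aggregate ridge Lambda = lam * sum_k h_k.
Phi_tilde = np.vstack([np.sqrt(hk / Nk) * Pk for hk, Nk, Pk in zip(h, N, Phi)])
Z_tilde = np.vstack([Zk / np.sqrt(Nk) for Nk, Zk in zip(N, Z)])
Lam = lam * sum(h)

U = Phi_tilde.T @ Phi_tilde                    # Gram operator on the feature side
Q = np.linalg.inv(U + Lam * np.eye(p))         # resolvent at z = -Lambda
A_hat = -np.sqrt(p) * Z_tilde.T @ Phi_tilde @ Q

# Normalized traces of the kind entering the deterministic limits.
print(np.trace(Q) / p, np.trace(Q @ U @ Q) / p)
```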

The feature vectors are of the form
$$ \phi_k^{(i,j)} = \varphi\!\left(u_k^{(i,j)},v_k\right), \qquad u_k^{(i,j)}:=\frac{W_x x_{t_k}^{(i,j)}}{\sqrt{d}}\in\mathbb{R}^p, \qquad v_k:=\frac{W_t\tau(t_k)}{\sqrt{d_t}}\in\mathbb{R}^p, $$
with φ applied elementwise. By construction, for each coordinate α ∈ {1, …, p}, the pair (uk, α(i, j), vk, α) is jointly Gaussian conditional on (xtk(i, j), τ(tk)), and its covariance structure is explicit. Under our proportional asymptotics and the subspace-Gaussian assumption P0 = 𝒩(0, Π), we have concentration of ∥xtk∥22/d and explicit limits for cross-time inner products; in particular, for independent draws x0, x0′,
$$ \frac{1}{d}\,\mathbb{E}\bigl[x_{t_k}^\top x_{t_\ell}\bigr] = a_k a_\ell\,\frac{1}{d}\,\mathbb{E}\bigl[x_0^\top x_0'\bigr] = 0, \qquad \frac{1}{d}\,\mathbb{E}\bigl[\|x_{t_k}\|_2^2\bigr] = a_k^2\psi_D+h_k, $$
while for the underlying datum xi the common xi induces a nontrivial correlation between xtk(i, j) and xtℓ(i, j′) through akaℓxi. This is the source of coupling across times even when noises are independent across k.

We now invoke a Gaussian equivalence principle tailored to random features with Hermite-expandable activations: we replace {ϕk(i, j)} by a Gaussian surrogate {ϕk, G(i, j)} such that (a) each coordinate has matching first and second moments under the joint law of (uk(i, j), vk), and (b) cross-time correlations induced by the shared random weights (Wx, Wt) and time embedding inner products ⟨τ(tk), τ(tℓ)⟩ are preserved. Concretely, for each time k we define activation-dependent coefficients
μ0, k := 𝔼[φ(g, ηk)],   μ1, k := 𝔼[gφ(g, ηk)],   vk := 𝔼[φ(g, ηk)2] − μ0, k2 − μ1, k2,
where (g, ηk) is a centered Gaussian pair with covariance chosen to match the limiting variance of uk, α(i, j) and the limiting variance of vk, α. Likewise, for each pair (k, ℓ) we define cross-time coefficients
κkℓ := 𝔼[φ(gk, ηk)φ(gℓ, ηℓ)],   γkℓ := 𝔼[gkφ(gℓ, ηℓ)],
with (gk, gℓ, ηk, ηℓ) jointly Gaussian and with correlations determined by the OU overlap akaℓ on the data subspace and the embedding overlap ⟨τ(tk), τ(tℓ)⟩/dt. The GEP then asserts that, for the purpose of computing normalized traces appearing in the learning curves, we may work with a Gaussian design whose row/column covariance is explicitly parametrized by {μ0, k, μ1, k, vk, κkℓ, γkℓ} (and by the finite mk replication structure), with an error that vanishes as d, n, p → ∞ at fixed ratios. This yields a Gaussian surrogate design in which the only nontrivial dependence on (tk, mk, τ, φ) enters through a finite collection of scalars.
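The activation coefficients entering the surrogate are low-dimensional Gaussian expectations and can be estimated by Monte Carlo (or Gauss–Hermite quadrature); the sketch below does so for the illustrative activation tanh(u + v), assuming an independent (g, η) pair and a normalized embedding, which are simplifying assumptions of the sketch rather than of the analysis.

```python
import numpy as np

def gep_coefficients(phi, var_u, var_v, n_mc=1_000_000, seed=6):
    # Monte Carlo estimates of mu0 = E[phi(g, eta)], mu1 = E[g * phi(g, eta)],
    # v = E[phi(g, eta)^2] - mu0^2 - mu1^2, for a centered Gaussian pair with
    # g ~ N(0, var_u) and eta ~ N(0, var_v) drawn independently (an assumption
    # of this sketch; correlated pairs are handled the same way with a joint draw).
    rng = np.random.default_rng(seed)
    g = np.sqrt(var_u) * rng.standard_normal(n_mc)
    eta = np.sqrt(var_v) * rng.standard_normal(n_mc)
    f = phi(g, eta)
    mu0 = f.mean()
    mu1 = (g * f).mean()
    v = (f ** 2).mean() - mu0 ** 2 - mu1 ** 2
    return mu0, mu1, v

phi = lambda u, w: np.tanh(u + w)              # illustrative activation
t, psi_D = 0.05, 0.25
a, h = np.exp(-t), 1.0 - np.exp(-2.0 * t)
var_u = a ** 2 * psi_D + h     # limiting variance of (W_x x_{t_k})_alpha / sqrt(d)
var_v = 1.0                    # assumes ||tau(t)||^2 / d_t -> 1 (normalized embedding)
print(gep_coefficients(phi, var_u, var_v))
```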

Under the Gaussian surrogate, all quantities of interest are functions of the weighted Gram matrix
U := Φ̃⊤Φ̃ ∈ ℝp × p,   R(z) := (U − zIp)−1,
evaluated at z = −Λ and at small perturbations thereof (to access derivatives entering variance terms). The multi-time aspect is not encoded by block structure in U (since A is shared), but rather by the fact that U itself is built from a weighted stack of time-indexed feature blocks with nontrivial cross-time correlations; equivalently, U can be represented as a sum of finitely many independent Wishart-type components plus low-rank terms reflecting the shared xi and shared time features. This is precisely the regime in which standard resolvent methods apply: for fixed K, the limiting learning curves are determined by a finite-dimensional system of equations for a finite set of block traces associated with an augmented resolvent.

To make this closure explicit, we linearize the rational functions of U that arise in the error formulas. The basic device is a block linear pencil LK(z, q) (depending on spectral parameter z and an auxiliary scalar perturbation q) whose inverse contains, as Schur complements, the resolvents and mixed resolvents needed for train/test risk. In its minimal form, LK(z, q) is a block matrix with O(K) blocks that couples (i) the parameter-side resolvent (U − zIp)−1, (ii) the sample/time-side covariance operators induced by the Gaussian surrogate, and (iii) the low-rank signal directions associated with the data subspace Π and the time embedding overlaps. The defining property is that for appropriate block selection matrices E, we have identities of the form
E⊤LK(z, q)−1E = (U(z, q) − zIp)−1,
where U(z, q) is the perturbed Gram operator whose q-derivative generates the quadratic trace functionals appearing in Etrain(tk) and Etest(tk). Consequently, it suffices to control normalized traces of blocks of LK(z, q)−1. By standard Dyson-equation arguments for block-Gaussian linearizations, these block traces converge to deterministic limits characterized by a closed algebraic fixed-point system whose dimension is polynomial in K (indeed O(K2) in a symmetric parametrization).

This reduction isolates the genuinely multi-time difficulty: the fixed-point system depends on cross-time correlations computed from (ak, hk), the noise budgets mk, and the embedding overlaps ⟨τ(tk), τ(tℓ)⟩. In the next section we instantiate this scheme for K = 2, where the pencil becomes 2 × 2 at the level of time blocks, the fixed-point system is fully explicit, and the effective threshold ψeff emerges as the point at which the relevant resolvent denominator approaches its interpolation singularity for the small-noise time.


5. The K=2 case in full detail: explicit 2×2 block resolvent equations; closed-form characterization of limiting train/test errors; definition and computation of ψ_eff; recovery of the p ≈ n boundary under time-decoupling.

We now specialize to K = 2 with a small-noise time ts and a larger-noise time tℓ, writing
as = e−ts,  hs = 1 − e−2ts,   aℓ = e−tℓ,  hℓ = 1 − e−2tℓ,
and Ns := nms, Nℓ := nmℓ. Under the Gaussian surrogate described previously, the stacked weighted design Φ̃ has a row covariance (on the sample/time side) which is exchangeable across noise indices j at each fixed time and exchangeable across data indices i. Consequently, the companion resolvent on the sample/time side is asymptotically supported on the 2-dimensional span generated by the noise-averaging directions within each time block, and all train/test observables close on a 2 × 2 matrix of block traces.

Let μ1, s, vs and μ1, ℓ, vℓ be the (time-dependent) Hermite/moment coefficients from the surrogate, and let ρτ := limdt → ∞⟨τ(ts), τ(tℓ)⟩/dt denote the normalized embedding overlap. The only additional scalar needed is the OU-induced overlap through the shared datum xi, which at the level of normalized inner products concentrates to
$$ \lim_{d\to\infty}\frac{1}{d}\,(x_{t_s}^{(i,j)})^\top x_{t_\ell}^{(i,j')} = a_s a_\ell\,\psi_D, \qquad (j,j'\ \text{arbitrary}). $$
Collecting the exchangeable (noise-idiosyncratic) contribution and the shared-datum contribution yields an effective 2 × 2 covariance matrix C with entries css, csℓ, cℓℓ: the diagonal entries carry the within-time idiosyncratic contributions vs/ms and vℓ/mℓ, while the off-diagonal entry csℓ collects the shared-datum overlap μ1, sμ1, ℓ asaℓψD together with the embedding overlap ρτ.

(We emphasize that C is the minimal parametrization after averaging over the noise-exchangeable directions; more refined surrogate parametrizations merely alter these entries by additional explicit terms depending on (κkℓ, γkℓ) and do not change the 2 × 2 closure.)

Let U = Φ̃⊤Φ̃ ∈ ℝp × p and R(z) = (U − zIp)−1. We introduce the scalar Stieltjes transform
$$ g(z):=\lim_{d\to\infty}\frac{1}{p}\mathrm{Tr}\,R(z), $$
and the 2 × 2 matrix-valued Stieltjes transform G(z) obtained by taking normalized traces of the corresponding 2 × 2 blocks of the pencil inverse (equivalently, of the companion resolvent on the sample/time side restricted to the span of noise-averaging directions). The block Gaussian linearization yields a closed system: the matrix relation
$$ \mathbf{G}(z)\;=\;\bigl(-zI_2+\psi_p\,g(z)\,C\bigr)^{-1}, $$
coupled with a companion scalar equation that determines g(z) from the block traces of the same pencil inverse.

These relations are the explicit 2 × 2 analogue of the single-time scalar fixed point: the entire two-time coupling enters only through C and the 2 × 2 matrix inversion defining G(z). In particular, writing
Δ(z) := det ( − zI2 + ψpg(z)C),
we have the closed expressions
$$ [\mathbf{G}(z)]_{ss}=\frac{-z+\psi_p\,g(z)\,c_{\ell\ell}}{\Delta(z)},\qquad [\mathbf{G}(z)]_{\ell\ell}=\frac{-z+\psi_p\,g(z)\,c_{ss}}{\Delta(z)},\qquad [\mathbf{G}(z)]_{s\ell}=[\mathbf{G}(z)]_{\ell s}=-\frac{\psi_p\,g(z)\,c_{s\ell}}{\Delta(z)}. $$
Fix k ∈ {s, ℓ} and consider either the per-time training error or the per-time test error defined earlier (as a normalized quadratic functional of the fitted score). By the ridge identities from the stacked form and the GEP reduction, every such quantity can be written as a linear combination of the following finite set of resolvent observables evaluated at z = −Λ:
g(−Λ),   [G(−Λ)]ss, [G(−Λ)]ℓℓ, [G(−Λ)]sℓ,   ∂zg(z)|z = −Λ, ∂zG(z)|z = −Λ.
To make the dependence explicit, we use the standard perturbation device: for each time k, introduce a scalar q multiplying the corresponding block in the pencil (equivalently, $U\mapsto U(q):=U+q\,\frac{h_k}{N_k}\Phi_k^\top\Phi_k$), so that
$$ \frac{\partial}{\partial q}\Big|_{q=0}\frac{1}{p}\mathrm{Tr}\,(U(q)+\Lambda I_p)^{-1} = -\frac{1}{p}\mathrm{Tr}\Bigl(Q\,\frac{h_k}{N_k}\Phi_k^\top\Phi_k\,Q\Bigr), \qquad Q:=(U+\Lambda I_p)^{-1}. $$
Under the 2 × 2 closure, this derivative is obtained by differentiating the fixed-point relations with C replaced by C + q ckk ek ek⊤, hence it is an explicit rational function of the entries of C and of G(−Λ). Substituting these derivatives into the ridge expressions for Etrain(tk) and Etest(tk) yields deterministic limits of the schematic form
$$ E_{\mathrm{train}}(t_k)\xrightarrow[d\to\infty]{}\epsilon_{\mathrm{train}}(t_k)=\mathcal{F}^{\mathrm{tr}}_{k}\bigl(g(-\Lambda),\mathbf{G}(-\Lambda),\partial_z g,\partial_z\mathbf{G}\bigr),\qquad E_{\mathrm{test}}(t_k)\xrightarrow[d\to\infty]{}\epsilon_{\mathrm{test}}(t_k)=\mathcal{F}^{\mathrm{te}}_{k}\bigl(g(-\Lambda),\mathbf{G}(-\Lambda),\partial_z g,\partial_z\mathbf{G}\bigr), $$
where ℱktr, ℱkte are explicit (activation- and OU-dependent) rational functions determined by the Gaussian surrogate coefficients. Importantly, no additional random-matrix degrees of freedom remain: once the fixed-point system is solved numerically at z = −Λ, the limiting errors are obtained by direct evaluation together with a 2 × 2 differentiation step.

In the small-noise regime hs ≪ 1, the sharp change in ϵtest(ts) is governed by the amplification of the variance-type terms in the limiting test error, which are controlled by the magnitude of [G(−Λ)]ss and its derivatives. From the closed expressions above, these terms blow up precisely when the denominator
$$ \Delta(-\Lambda)\;=\;\det\bigl(\Lambda I_2+\psi_p\,g(-\Lambda)\,C\bigr) $$
approaches zero. We therefore define the effective sample-size ratio ψeff as the smallest ψp (at fixed (ts, tℓ, ms, mℓ, ρτ, φ, ψn, ψD, λ)) for which the coupled fixed point satisfies the criticality condition
$$ |\Delta(-\Lambda)|\;\le\;\varepsilon \qquad\text{for a fixed small tolerance }\varepsilon>0, $$
with ψeff computed in practice by monitoring the sign change (or minimal absolute value) of Δ(−Λ) along ψp after solving the coupled equations. This characterization matches the observed crossover: as ψp crosses ψeff, the resolvent entries develop a large factor 1/Δ(−Λ), and the estimator transitions toward the empirical-optimal (KDE-like) score component on the data manifold.

Finally, suppose the times decouple in the sense that the cross-time feature correlation vanishes, which in our parametrization is csℓ = 0 (e.g. ρτ = 0, or more generally orthogonal time embeddings making the cross-time Hermite contractions vanish). Then C is diagonal and the fixed point splits into two independent scalar relations. In particular,
Δ(−Λ) = (Λ + ψpg(−Λ)css)(Λ + ψpg(−Λ)cℓℓ),
so the small-noise task depends only on the first factor, and the criticality condition reduces to the single-time interpolation threshold. In this limit we recover ψeff = ψn (equivalently p ≃ n) for the small-noise time, in agreement with the uncoupled theory and providing the required consistency check for the coupled two-time analysis.


6. General finite K: operator-/block-valued fixed-point equations; conditions under which ψ_eff exists and is unique; qualitative monotonicity in m_k and in time-feature correlation.

For fixed K we form the stacked weighted design
$$ \widetilde{\Phi} = \begin{pmatrix} \sqrt{\frac{h_1}{N_1}}\Phi_1\\ \vdots\\ \sqrt{\frac{h_K}{N_K}}\Phi_K \end{pmatrix}, \qquad N_k:=nm_k, $$
so that the (feature-side) Gram matrix is U = Φ̃⊤Φ̃ ∈ ℝp × p. Under the Gaussian surrogate, the row covariance on the sample/time side is exchangeable within each time block (over the mk noises) and exchangeable across data indices i. As in the two-time analysis, this implies that the companion resolvent on the row side is asymptotically supported on the K-dimensional span of the K noise-averaging directions (one per time block). Consequently, all limiting train/test observables close on a collection of block traces which we organize into a K × K matrix-valued Stieltjes transform.

Concretely, for z ∈ ℂ \ ℝ+ we define
$$ g(z):=\lim_{d\to\infty}\frac{1}{p}\mathrm{Tr}\,(U-zI_p)^{-1}, \qquad \mathbf{G}(z):=\lim_{d\to\infty}\Bigl[\text{normalized block-trace of the row-resolvent restricted to the averaging span}\Bigr]\in\mathbb{C}^{K\times K}. $$
The only model-dependent input at this level is the effective K × K time-covariance C induced by (i) the OU overlap through the shared datum and (ii) the time-embedding overlap through τ(⋅), together with (iii) the within-time noise idiosyncrasy controlled by mk. In the minimal Hermite truncation used throughout (keeping only the first-order contraction in the data input and the second-moment correction), the entries ckℓ combine the shared-datum overlap μ1, kμ1, ℓ akaℓψD with the embedding overlap ρτ, kℓ, and the diagonal entries ckk additionally carry the idiosyncratic contribution vk/mk, where ρτ, kℓ := limdt → ∞⟨τ(tk), τ(tℓ)⟩/dt. More refined surrogates (including additional Hermite contractions between time and data channels) simply modify these entries by explicit corrections; the structural point is that all such corrections remain deterministic and yield a symmetric matrix C depending only on ({tk, mk}, τ, φ, ψD).

After building the K-time linear pencil and applying the block-trace map, we obtain an operator-/block-valued Dyson system. In its simplest (single-self-energy) form, it takes the same schematic shape as in the 2 × 2 case,
$$ \mathbf{G}(z)\;=\;\bigl(-zI_K+\psi_p\,g(z)\,C\bigr)^{-1}, $$
while in the fully general finite-K surrogate the right-hand side of this closure is replaced by an explicit self-energy map 𝒮[G(z)] whose entries are polynomials in the K2 block traces of the pencil inverse. In either case, the number of unknowns is O(K2) and therefore constant in the proportional limit. Differentiating the fixed-point system with respect to z and with respect to block perturbations (the q-device used previously) yields the additional resolvent derivatives needed to evaluate each ϵtrain(tk) and ϵtest(tk) as explicit rational functions of these block traces.

We emphasize that for ℑz > 0 the system obtained from the pencil is a standard matrix Dyson equation (MDE) in the sense of operator-valued free probability: it is a self-consistent equation for a matrix-valued Stieltjes transform. Under our standing assumptions (bounded second moments of the activation under Gaussian inputs, fixed K, and ridge λ > 0 so that we ultimately evaluate at z = −Λ with Λ > 0), the MDE admits a unique solution (g(z), G(z)) satisfying the Stieltjes sign conditions
ℑg(z) > 0,   ℑG(z) ≺ 0  (ℑz > 0),
and this solution extends continuously to the negative real axis. In particular, the evaluation at z = −Λ is well-defined and the fixed point is locally stable, with stability quantified by the invertibility of the linearization (Jacobian) of the map defining the fixed point. This stability operator is precisely the object that controls the blow-up of the derivatives ∂zG and ∂qG, and therefore controls the steepness of the learning curves in the small-noise regime.

Let S ⊂ {1, …, K} denote the subset of small-noise times (those with hk ≪ 1). For k ∈ S, the dominant contributions to ϵtest(tk) are variance-type terms whose prefactors involve entries of G(−Λ) and, more sharply, the sensitivity of the fixed point through ∂zG(−Λ) and the block-perturbation derivatives. We therefore define the effective sample-size ratio ψeff by a stability threshold:
$$ \psi_{\mathrm{eff}}\;:=\;\inf\Bigl\{\psi_p>0\;:\;\min_{k\in S}\,\sigma_{\min}\bigl(\mathcal{J}_k(\psi_p;-\Lambda)\bigr)\le\varepsilon\Bigr\}, $$
where 𝒥k(ψp; −Λ) denotes the Jacobian (restricted to perturbations that act on the kth time block) of the MDE at z = −Λ, σmin is its smallest singular value, and ε > 0 is a fixed small tolerance (vanishing with Λ if we take a ridgeless limit). In the simplified closure above, the same criterion may be monitored equivalently via the minimal eigenvalue of the K × K denominator
𝒟(ψp; z) := −zIK + ψpg(z)C,
since G(z) = 𝒟(ψp; z)−1 and the linearization is dominated by 𝒟−1 whenever the self-energy is low-rank.
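For orientation, the simplified closure is a one-line computation once g(−Λ) is available; the sketch below evaluates 𝒟(ψp; −Λ) and G = 𝒟−1 and reports the stability margin λmin(𝒟), treating the value of g(−Λ) and the entries of C as externally supplied placeholders rather than outputs of the actual fixed-point solve.

```python
import numpy as np

def simplified_closure(psi_p, g_val, C, Lam):
    # Evaluate the simplified K x K closure at z = -Lambda:
    #   D(psi_p; -Lambda) = Lambda * I_K + psi_p * g(-Lambda) * C,  G = D^{-1}.
    # g_val stands for g(-Lambda), assumed already obtained from the scalar
    # fixed-point equation (not reproduced in this sketch).
    K = C.shape[0]
    D = Lam * np.eye(K) + psi_p * g_val * C
    G = np.linalg.inv(D)
    stability = np.linalg.eigvalsh(D).min()   # margin monitored along a psi_p grid
    return G, stability

# Illustrative inputs: a 3-time covariance C and a placeholder value for g(-Lambda).
C = np.array([[1.0, 0.3, 0.1],
              [0.3, 2.0, 0.5],
              [0.1, 0.5, 3.0]])
G, margin = simplified_closure(psi_p=2.0, g_val=0.4, C=C, Lam=1e-3)
print(np.round(G, 3), margin)
```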

A sufficient condition for existence of a nontrivial ψeff is that there is at least one small-noise time k ∈ S for which the corresponding direction in the averaging span is not orthogonal (through C) to the remaining tasks; otherwise the kth task is effectively decoupled and its threshold reduces to the single-time boundary. A sufficient condition for uniqueness is that the stability margin in the threshold criterion is strictly decreasing in ψp. This holds, for example, whenever the scalar order parameter
θ(ψp) := ψpg(−Λ)
is strictly increasing and C is positive definite on the span relevant to S; then λmin(𝒟(ψp; −Λ)) varies monotonically and crosses any fixed tolerance at most once. In practice, for fixed K we simply solve the MDE along a grid in ψp and observe that the stability metric is unimodal in all regimes we consider, yielding an unambiguous ψeff.

The dependence of ψeff on the noise budgets and correlations is transparent at the level of C. First, increasing mk reduces the idiosyncratic within-time contribution vk/mk in ckk, hence decreases ckk while leaving the cross-time terms unchanged. Since the small-noise instability is triggered by approaching a poorly conditioned regime in the averaging span, reducing diagonal mass in a small-noise block requires a larger ψp to reach the same stability tolerance. Thus, all else fixed, ψeff is (weakly) increasing in each mk for k ∈ S, matching the interpretation that additional noise draws impose more effective constraints and push the memorization transition to larger parameter ratios.

Second, increasing time-feature correlation (e.g. increasing |ρτ, kℓ| for pairs involving S) increases the magnitude of the off-diagonal couplings ckℓ and therefore enlarges the extent to which the large-noise tasks constrain the small-noise estimator through the shared parameters. At the level of the MDE, stronger coupling reduces the stability margin in the small-noise directions and therefore decreases ψeff (the transition occurs earlier in ψp). In particular, in the time-decoupled limit ρτ, kℓ → 0 for k ≠ ℓ, the matrix C becomes diagonal, the MDE splits into K independent scalar equations, and the small-noise boundary reduces to the single-time threshold.


7. Phase boundary and (near-)matching bounds: sufficient conditions (upper) for generalization when p < n_eff and unavoidable alignment (lower) when p > n_eff within the shared linear-in-parameters model class.

We now formalize the sense in which the stability threshold defining ψeff is not merely a diagnostic of steep learning curves, but indeed marks the boundary between (i) a regime in which the shared estimator behaves like a population-score learner on the small-noise tasks and (ii) a regime in which, within the shared linear-in-parameters class, the estimator is forced to align with the empirical optimal score and thereby incur a non-vanishing small-noise test error.

Fix λ > 0 and recall that under the Gaussian surrogate the ridge minimizer admits the usual closed form
$$ \widehat{A} = \frac{1}{\sqrt{p}}\Bigl(\sum_{k=1}^K \frac{h_k}{N_k} Y_k \Phi_k^\top\Bigr)\Bigl(U+\lambda I_p\Bigr)^{-1}, \qquad U=\sum_{k=1}^K \frac{h_k}{N_k}\Phi_k\Phi_k^\top, $$
where Yk ∈ ℝd × Nk stacks the regression targets $-\frac{1}{\sqrt{h_k}}z_k^{(i,j)}$ and Φk ∈ ℝp × Nk stacks the corresponding random features at time tk.
Consequently, for any fixed time index k, the prediction sÂ(⋅, tk) is a linear functional of the targets {Yℓ}ℓ = 1K through the shared resolvent (U + λIp)−1. This induces a decomposition of the per-time test error into a squared-bias term (measuring mismatch to the population score component representable by the feature map) and a variance term (measuring amplification of the label noise through the resolvent). The small-noise regime hk ≪ 1 is precisely the regime in which the label noise has a large effective magnitude after the DSM reweighting, and therefore the variance term is the quantity whose control (or blow-up) determines the qualitative behavior.

Let S ⊂ {1, …, K} denote the set of small-noise times. Fix δ > 0 and suppose that for some ψp the corresponding MDE solution satisfies a stability margin condition of the form min_{k ∈ S} σmin(𝒥k(ψp; −Λ)) ≥ cδ, with cδ independent of d and bounded away from 0 as d → ∞. By the implicit function theorem applied to the finite-dimensional fixed-point system defining the block traces, this margin implies that all required derivatives of the limiting resolvent observables (in particular the z-derivatives and the block-perturbation derivatives that enter the variance terms of ϵtest(tk)) remain uniformly bounded. Translating this into the learning-curve expressions yields, for each k ∈ S, a bound in which ϵtest(tk) is controlled by a feature-mismatch (approximation) term plus residual terms carrying a prefactor Cδ < ∞ that depends on cδ but not on d. In particular, when the first-order Hermite contraction dominates (the regime in which our surrogate is accurate), the feature-mismatch term is controlled by the deviation between the representable linear score component and the true population score, and the remaining terms vanish as mk → ∞ and/or λ ↓ 0 along any path that preserves the margin. Thus, for ψp strictly below the stability threshold (equivalently ψp ≤ ψeff − δ for the definition of ψeff based on the first failure of the margin condition), the small-noise tasks remain in a population-aligned regime: the shared estimator does not amplify the DSM label noise, and the test error stays close to the Bayes score error up to the approximation error of the linearized model class.

The preceding argument is qualitative but sharp in its mechanism: below ψeff, the MDE solution varies smoothly with ψp and the block-resolvent does not develop a near-singularity in the small-noise directions. In this regime the large-noise tasks (those with hℓ = Θ(1)) contribute additional effective regularization through the shared parameters, improving conditioning in the averaging span and thereby suppressing the variance amplification that would otherwise occur at small noise. The resulting estimator is therefore constrained to track a population-level object rather than an empirical, point-mass object.

We turn to the complementary regime ψp ≥ ψeff + δ. The fixed-point characterization implies that in this regime the stability operator (or, in simplified closures, the denominator 𝒟(ψp; −Λ)) becomes poorly conditioned in at least one small-noise direction. Operationally, this corresponds to an almost-interpolation phenomenon on the small-noise time blocks: the mapping from targets to predictions has an operator norm that grows like the inverse stability margin, and for λ ↓ 0 this growth forces the ridge solution to fit the empirical constraints increasingly tightly.

To make this precise within the shared linear-in-parameters class, fix a small-noise time k ∈ S with mk = ∞ (so the within-time idiosyncratic fluctuation in the design has been averaged out) and consider a sequence λ = λd ↓ 0. In the overparameterized regime where the stability margin collapses, the representer form of the ridge solution together with standard random-features interpolation arguments imply that
$$ \mathrm{Mem}(t_k)\;=\;\frac{h_k}{d}\,\mathbb{E}_{x\sim P_{t_k}^{\mathrm{emp}}}\Bigl[\bigl\|s_{\widehat{A}}(x,t_k)-s^{\mathrm{emp}}(x,t_k)\bigr\|_2^2\Bigr]\;\longrightarrow\;0, $$
i.e. the learned score converges to the empirical optimal score (the score of the empirical mixture induced by the training samples under the forward OU perturbation). The point is not that the model explicitly computes a KDE, but that once the shared design reaches the near-interpolating regime, the minimum-norm (or vanishing-ridge) solution is the unique linear functional that matches the empirical constraints, and these constraints coincide with those defining semp under the DSM loss.

The alignment implies a lower bound on the population-score error because, for hk ≪ 1, the empirical score differs from the population score by an O(1) amount on the data manifold: the empirical mixture retains atomistic structure at the scale $\sqrt{h_k}$, whereas the population Ptk retains the smoothed geometry of P0. Quantitatively, one can show (by comparing the empirical and population log-densities, or by standard lower bounds for KDE score estimation at fixed bandwidth) that there exists c(δ) > 0 such that
$$ \frac{h_k}{d}\,\mathbb{E}_{x\sim P_{t_k}}\Bigl[\bigl\|s_{\widehat{A}}(x,t_k)-s^{*}(x,t_k)\bigr\|_2^2\Bigr]\;\ge\;c(\delta), $$
uniformly over all empirical risk minimizers in the shared class as long as the near-interpolation condition (equivalently the stability collapse defining ψeff) holds. This yields a model-class lower bound of the type stated in Proposition 4: once ψp exceeds ψeff by a fixed margin, the shared estimator cannot remain close to the Bayes score on the small-noise tasks.

The upper and lower mechanisms are governed by the same object: the conditioning of the MDE fixed point (or of 𝒟(ψp; −Λ) in simplified closures) restricted to the small-noise directions. Below ψeff this conditioning is bounded and enforces population-score behavior; above ψeff it collapses and forces empirical-score alignment. The bounds therefore match at the level of a common instability criterion, up to the tolerance used in the definition of ψeff and the choice of the ridgeless path λ ↓ 0. This is the sense in which ψeff is a genuine phase boundary for generalization versus memorization in the small-noise tasks under shared random-feature parametrization.


8. Experiments: NTK-linearized U-Net/DiT and finite-width checks; measuring effective sample size and predicting threshold; ablations over time grid, embedding dimension, and per-time noise reuse; link to nearest-neighbor retrieval under reverse diffusion.

We complement the asymptotic characterization with empirical checks in two regimes: (i) a controlled, NTK-linearized regime, where training is well-approximated by kernel ridge regression with a fixed NTK and hence is directly comparable to the Gaussian-surrogate predictions; and (ii) a finite-width regime, where feature learning is present but we can still probe whether the same effective-threshold phenomenology persists. Across both regimes, the target observation is the existence of a sharp crossover, visible primarily at small-noise times, as a function of the parameter ratio ψp = p/d (or an empirical proxy for p in convolutional architectures), and the dependence of that crossover on the time grid, time embedding, and per-time noise budgets.

Let fθ(x, t) ∈ ℝd denote either a U-Net-like or DiT-like score network (operating in pixel space or in a latent space), trained with the same discrete-time DSM objective used in the theory. We linearize around an initialization θ0:
fθ(x, t) ≈ fθ0(x, t) + Jθ0(x, t)(θ − θ0),
and we train only the linear parameter Δθ := θ − θ0 by ridge-regularized least squares. This induces a block kernel matrix over the index set of examples and time points, with blocks determined by the time-conditioned NTK
K((x, t), (x′, t′)) := Jθ0(x, t) Jθ0(x′, t′)⊤.
Because the same Δθ is shared across all times, the training problem is intrinsically coupled even when the loss is a sum over times. In particular, the per-time predictor is a linear functional of all targets across all times through the inverse of a coupled Gram matrix, which is the empirical analogue of the coupled resolvent appearing in the fixed-point description.
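The sketch below illustrates the linearization mechanics on a toy two-layer network: it computes finite-difference Jacobians with respect to the parameters at initialization and forms one d × d block of the time-conditioned NTK; the architecture, embedding, and sizes are stand-ins, not the U-Net/DiT used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(7)
d, d_t, width = 8, 4, 32

# Toy two-layer score network f_theta(x, t) in R^d, with the time embedding
# concatenated to the input; only the linearization mechanics matter here.
A1 = 0.1 * rng.standard_normal((width, d + d_t))
A2 = 0.1 * rng.standard_normal((d, width))
theta0 = np.concatenate([A1.ravel(), A2.ravel()])

def tau(t):
    k = np.arange(1, d_t // 2 + 1)
    return np.concatenate([np.sin(k * t), np.cos(k * t)])

def f(theta, x, t):
    B1 = theta[:width * (d + d_t)].reshape(width, d + d_t)
    B2 = theta[width * (d + d_t):].reshape(d, width)
    return B2 @ np.tanh(B1 @ np.concatenate([x, tau(t)]))

def jacobian(x, t, eps=1e-5):
    # Finite-difference Jacobian J_{theta0}(x, t) of shape (d, |theta|).
    J = np.zeros((d, theta0.size))
    for a in range(theta0.size):
        e = np.zeros_like(theta0)
        e[a] = eps
        J[:, a] = (f(theta0 + e, x, t) - f(theta0 - e, x, t)) / (2 * eps)
    return J

# One d x d block of the time-conditioned NTK between two (input, time) pairs.
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
K_block = jacobian(x1, 0.05) @ jacobian(x2, 1.0).T
print(K_block.shape)   # -> (d, d)
```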

To align the experimental setting with the assumptions, we first take P0 = 𝒩(0, Π) with rank(Π) = D and choose (d, D, n) at moderate scale with ratios near the proportional regime. For a fixed time grid {tk}k = 1K, we construct {xtk(i, j)} with prescribed mk, train the linearized model, and report (a) per-time DSM error Etest(tk), computed on fresh (x0, z) pairs not used in training; (b) per-time training error Etrain(tk); and (c) an aggregate error obtained by averaging over k with weights proportional to hk (to match the DSM scaling). The qualitative signature we seek is that, for the smallest-noise time(s), Etest(tk) remains near a baseline for smaller width and then increases rapidly past a critical width, while Etrain(tk) continues to decrease.

Operationally, we extract an empirical ψ̂eff from either of the following equivalent criteria (which agree within measurement noise in the linearized regime). First, we locate the point of maximal slope of the small-noise test curve as a function of width:
$$ \widehat{\psi}_{\mathrm{eff}} \in \arg\max_{\psi_p}\ \frac{\mathrm{d}}{\mathrm{d}\psi_p} E_{\mathrm{test}}(t_s). $$
Second, we compute an interpolation proxy from the coupled kernel matrix: letting Gλ be the ridge-regularized Gram operator on the joint dataset-time index set, we monitor the minimum eigenvalue of Gλ and define ψ̂eff as the width at which the effective condition number crosses a fixed threshold. In the linearized setting, both procedures are stable under resampling and produce a single transition point for the small-noise times, while large-noise times exhibit a smoother dependence on width.
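The second criterion is easy to emulate with random features standing in for Jacobian features: build the joint dataset-time feature matrix, form the ridge-regularized Gram operator, and track its minimum eigenvalue and condition number as the width p grows, as in the sketch below (the crude time conditioning and all sizes are illustrative).

```python
import numpy as np

rng = np.random.default_rng(8)
d, n, lam = 32, 48, 1e-4
times, m = [0.05, 1.0], [1, 1]
X0 = rng.standard_normal((n, d))

def joint_features(p):
    # Rows index all (k, i, j) pairs; a scalar random direction wt gives a crude
    # stand-in for time conditioning of the features.
    Wx = rng.standard_normal((p, d))
    wt = rng.standard_normal(p)
    rows = []
    for t, m_k in zip(times, m):
        a, h = np.exp(-t), 1.0 - np.exp(-2.0 * t)
        for _ in range(m_k):
            Xt = a * X0 + np.sqrt(h) * rng.standard_normal((n, d))
            rows.append(np.tanh(Xt @ Wx.T / np.sqrt(d) + t * wt))
    return np.vstack(rows) / np.sqrt(p)

# Conditioning of the ridge-regularized coupled Gram operator as width grows.
for p in [32, 64, 128, 256, 512]:
    F = joint_features(p)
    G_lam = F @ F.T + lam * np.eye(F.shape[0])
    w = np.linalg.eigvalsh(G_lam)
    print(f"p={p:4d}  lambda_min={w[0]:.3e}  cond={w[-1] / w[0]:.3e}")
```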

To compare against the theory without re-deriving architecture-specific Hermite coefficients, we use a semi-empirical plug-in approach: we measure the cross-time correlations induced by the time embedding and the linearized features by estimating
$$ \widehat{C}_{k\ell} := \frac{1}{p}\,\mathrm{Tr}\Bigl(\Phi_k\Phi_\ell^\top\Bigr), $$
where Φk denotes the feature matrix at time tk (either random features in the surrogate or Jacobian features in the NTK linearization). We then feed {Ĉkℓ}k, ℓ ≤ K, along with (ψn, ψD, λ, {hk, mk}), into the fixed-point solver to obtain a predicted phase boundary ψeffpred. On the synthetic model, we observe that ψeffpred tracks ψ̂eff across variations in K and in embedding design, and that the agreement improves as (i) width increases (improving the linearization) and (ii) mk increases (reducing within-time Monte Carlo noise in the design).
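The plug-in correlations are computed directly from paired feature rows; the sketch below estimates Ĉkℓ for a small random-features model with one noise draw per datum (pairing rows by the shared datum index), again with illustrative sizes and an illustrative feature map.

```python
import numpy as np

rng = np.random.default_rng(9)
d, n, p = 32, 64, 128
times = [0.05, 0.5, 1.5]

X0 = rng.standard_normal((n, d))
Wx = rng.standard_normal((p, d))
wt = rng.standard_normal(p)

def features(t):
    a, h = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    Xt = a * X0 + np.sqrt(h) * rng.standard_normal((n, d))  # one noise draw per datum
    return np.tanh(Xt @ Wx.T / np.sqrt(d) + t * wt)         # Phi_k, shape (n, p)

Phi = [features(t) for t in times]
K = len(times)
C_hat = np.zeros((K, K))
for k in range(K):
    for l in range(K):
        # (1/p) Tr(Phi_k Phi_l^T) pairs rows sharing the same datum index i.
        C_hat[k, l] = np.trace(Phi[k] @ Phi[l].T) / p
print(np.round(C_hat / n, 3))     # further normalized per sample for readability
```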

We vary the time grid in two directions while keeping the smallest time ts fixed. (i) Adding additional large-noise times (larger hk) tends to increase the critical width for small-noise memorization: empirically, Etest(ts) remains stable up to a larger width, consistent with the interpretation that additional times improve conditioning of the shared inverse problem and delay near-interpolation in the small-noise block. (ii) Adding additional small-noise times (multiple tk with hk ≪ 1) has the opposite effect unless mk is increased accordingly: the small-noise tasks collectively introduce more high-variance constraints, and the transition can occur at smaller width if the model can fit those constraints jointly. These trends are captured by the same plug-in prediction through changes in the measured cross-time correlation structure and in the effective sample weighting through {hk}.

We next vary the time-embedding map τ(t) by changing its dimension dt and by optionally orthogonalizing {τ(tk)}k = 1K in ℝdt. As dt increases and the embeddings become closer to orthogonal, the empirical cross-time feature correlations Ĉkℓ decrease, and the observed ψ̂eff moves toward the decoupled baseline consistent with ψeff ≈ ψn. Conversely, low-dimensional or highly correlated embeddings strengthen coupling and shift ψ̂eff away from ψn, in the direction predicted by the coupled fixed-point solver. This ablation isolates the architectural mechanism by which "sharing across times" can be tuned continuously: not only by sharing weights, but by controlling whether the time-conditioned features actually span distinct directions.

We finally vary {mk}, with particular emphasis on the smallest-noise time(s). For fixed n and width, increasing ms (new noise draws per datum at ts) reduces the variance of the empirical design and makes the training objective closer to its conditional expectation; correspondingly, the small-noise test curve becomes smoother and the transition point stabilizes. In contrast, reusing a single noise draw (small ms) increases stochasticity and can produce an ``early’’ appearance of memorization as diagnosed by nearest-neighbor phenomena (below), even when the averaged learning curve has not yet reached its steepest point. In the linearized regime, these effects are quantitatively explained by the same kernel perspective: mk controls the effective number of distinct constraints at time tk and hence the rank and conditioning of the coupled Gram matrix.

We repeat the above diagnostics with finite-width training (no linearization), keeping the optimizer and schedule fixed and scaling width over a range in which training remains stable. While the precise location of the transition drifts (consistent with feature learning modifying the effective kernel), the qualitative structure persists: small-noise test error exhibits a sharp change as width increases, and the direction of the shift under the three ablations (time grid, embedding correlation, and mk) matches the linearized predictions. The main systematic deviation is that feature learning tends to reduce the apparent coupling (empirically lowering Ĉkℓ at convergence relative to initialization), which moves the observed threshold toward the decoupled baseline; this is consistent with the expectation that learned representations can partially separate times even when the initial embedding is correlated.

To connect the small-noise alignment phenomenon to an observable in sampling, we run the reverse diffusion (or probability flow ODE) using the trained score and examine the terminal samples. In the regime where Etest(ts) has crossed the transition, the generated samples exhibit markedly reduced novelty under standard nearest-neighbor metrics: for each generated output 0, the distance to the closest training point decreases sharply, and the identity of the nearest neighbor stabilizes across small perturbations of the initial noise. This behavior is consistent with the interpretation that, at small noise, the learned vector field inherits an empirical-mixture structure and thereby induces an attraction toward training atoms along reverse-time trajectories. Below the transition, the same retrieval statistics are diffuse and do not concentrate on single training exemplars, matching the regime in which the small-noise test error remains near its population-score baseline.
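The retrieval diagnostic itself is a nearest-neighbor computation; the sketch below implements it and contrasts a diffuse sampler with a near-copy sampler on synthetic data to show how the two regimes separate (no reverse diffusion is run here; the generated sets are mocked for illustration).

```python
import numpy as np

def retrieval_stats(generated, train_X):
    # For each generated sample, the distance to (and index of) its closest
    # training point, via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = (np.sum(generated**2, axis=1, keepdims=True)
          - 2.0 * generated @ train_X.T
          + np.sum(train_X**2, axis=1))
    nn_idx = d2.argmin(axis=1)
    nn_dist = np.sqrt(np.maximum(d2[np.arange(len(generated)), nn_idx], 0.0))
    return nn_dist, nn_idx

rng = np.random.default_rng(10)
train_X = rng.standard_normal((200, 16))
diffuse = rng.standard_normal((50, 16))                       # non-memorizing proxy
copies = train_X[rng.integers(0, 200, size=50)] \
         + 0.01 * rng.standard_normal((50, 16))               # near-copy proxy
for name, gen in [("diffuse", diffuse), ("near-copy", copies)]:
    dist, idx = retrieval_stats(gen, train_X)
    print(name, float(dist.mean()), len(np.unique(idx)))
```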


9. Discussion: implications for training schedules and architecture design; connections to early stopping, guidance, and latent diffusion; open problems (continuous time, feature learning, non-Gaussian data).

Our asymptotic characterization isolates a single scalar order parameter, the effective sample-size ratio ψeff, which governs whether small-noise times behave as a population-score regression problem or collapse toward the empirical optimal score. We emphasize that ψeff is not merely a reparametrization of ψn: it is a functional of the entire coupled training specification (time grid, embedding correlations, and per-time noise budgets), and it is therefore a quantity one can in principle control through design choices. The most immediate implication is that, once one trains with shared parameters across times, it is generally insufficient to reason about ``capacity’’ via a single-time interpolation threshold p ≈ n; rather, the relevant threshold for memorization at a given small-noise time is p ≈ dψeff, and ψeff may shift substantially as we change the remainder of the schedule.

In practice, diffusion schedules place substantial emphasis on small-noise times, both because those times are used heavily during reverse-time integration and because the DSM target $-z/\sqrt{h_k}$ grows in magnitude as hk → 0 in our normalization. Our theory suggests a tension: small-noise constraints are the most informative for refining samples, but they are also the constraints whose empirical counterpart is most sensitive to finite-sample effects (the empirical score field resembles a mixture with sharp local structure). In the coupled multi-time problem, adding additional times can either stabilize or destabilize the small-noise task depending on whether the added times increase the effective conditioning of the shared inverse problem or instead introduce additional near-singular constraints. This motivates a schedule-level principle: when we include multiple small-noise points (multiple k with hk ≪ 1), we should anticipate that ψeff decreases unless we correspondingly increase the effective constraints via n or mk, or weaken coupling through the time embedding. Conversely, adding moderate-to-large noise points (with hk = Θ(1)) can increase ψeff and delay the small-noise transition, because those times provide constraints that are less sensitive to the data manifold and thus regularize the shared fit.

A complementary lever is explicit loss reweighting across k. While our objective uses the canonical DSM scaling, one may introduce weights {wk} and consider $\sum_k w_k \| \sqrt{h_k}s(\cdot,t_k)+z\|^2$. Within the Gaussian-surrogate viewpoint, this rescales the per-time block contributions to the coupled Gram operator and thereby shifts ψeff. In particular, downweighting the smallest hk effectively reduces the influence of the most sample-sensitive constraints and can be interpreted as increasing ψeff for the small-noise task (i.e., requiring larger width before memorization-like behavior appears). This provides a quantitative lens on heuristics that ``avoid overfitting the last timesteps’’: such heuristics are not merely stabilizing optimization, but can be understood as modifying the effective interpolation geometry.
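A Monte Carlo estimator of this reweighted objective is sketched below (illustrative signature; score_fn stands in for the shared model s(·, t), and the averaging over noise draws mirrors the per-time budget m).

```python
import numpy as np

def weighted_multitime_dsm_loss(score_fn, x0, ts, weights, m, rng):
    """Illustrative estimator of the reweighted multi-time DSM objective
    sum_k w_k E || sqrt(h_k) s(x_{t_k}, t_k) + z ||^2.

    score_fn : callable (x, t) -> score estimate, same shape as x
    x0       : (n, d) training points
    ts, weights : length-K sequences of times t_k and weights w_k
    m        : number of noise draws per datum and time
    """
    loss = 0.0
    for t_k, w_k in zip(ts, weights):
        a_k, h_k = np.exp(-t_k), 1.0 - np.exp(-2.0 * t_k)
        for _ in range(m):
            z = rng.standard_normal(x0.shape)
            x_t = a_k * x0 + np.sqrt(h_k) * z
            resid = np.sqrt(h_k) * score_fn(x_t, t_k) + z
            loss += w_k * np.mean(np.sum(resid**2, axis=1)) / m
    return loss
```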

The coupling mechanism in our analysis is explicit: the shared second-layer weights (or, in the NTK view, the shared linearized parameters) induce a joint kernel over the product space of data and times, with off-diagonal blocks determined by the time embedding and time-conditioned features. Therefore, architectural choices that alter cross-time feature correlations alter ψeff. Two design principles follow.

First, increasing the effective dimensionality of the time representation (larger dt, richer embeddings, or architectural pathways that separate time conditioning) tends to reduce off-diagonal correlations and push the system toward the decoupled limit in which ψeff ≈ ψn. This suggests that, if one wishes to preserve the favorable single-time generalization behavior at small noise, it is beneficial to ensure that features at distinct times are not nearly collinear. Second, if one instead desires aggressive parameter sharing (for computational or statistical reasons), then one should anticipate that strong coupling can shift ψeff away from ψn and may either delay or accelerate small-noise memorization depending on the sign and structure of correlations. In this sense, ``time conditioning’’ is not a purely semantic input; it is an operator that determines whether the multi-time regression decomposes or becomes a genuinely coupled system with nontrivial phase behavior.

Our results are stated for ridge-regularized empirical risk minimizers, but the same resolvent-based quantities that determine test error also govern gradient-flow dynamics in linear models. In particular, for linearized training, early stopping is closely related to a spectral filter on the coupled Gram operator, and thus plays a role analogous to increasing λ in a data-dependent manner. Heuristically, early stopping reduces the contribution of small-eigenvalue directions of the joint kernel (directions that are most associated with near-interpolation and with alignment to the empirical score field). Consequently, early stopping can increase the apparent ψeff (postponing the transition in ψp) without changing width. This provides a concrete interpretation of why early stopping and explicit weight decay often mitigate memorization-like artifacts in diffusion models: they prevent the shared fit from entering the regime in which the small-noise block effectively interpolates the empirical constraints.
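The analogy can be made quantitative in the standard linear-model calculation: gradient flow run for time T applies, to each eigendirection of the coupled Gram operator with eigenvalue σ, the filter on the left below, while ridge applies the filter on the right; the two agree on small eigenvalues under the identification λ ≈ 1/(ηT), which is the sense in which early stopping acts as a data-dependent regularizer here.
$$ f^{\mathrm{GF}}_{T}(\sigma) = 1 - e^{-\eta T \sigma}, \qquad f^{\mathrm{ridge}}_{\lambda}(\sigma) = \frac{\sigma}{\sigma+\lambda}, \qquad 1 - e^{-\eta T \sigma} \;\approx\; \frac{\sigma}{\sigma + 1/(\eta T)} \;\approx\; \eta T \sigma \quad \text{for } \sigma \ll 1/(\eta T). $$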

Classifier-free guidance and related conditioning methods alter the reverse-time vector field by effectively scaling or combining scores. Even when guidance is applied only at sampling time, it can amplify any discrepancy between the learned and population scores at small noise. Under our phase picture, once ψp crosses ψeff and the learned field aligns with the empirical optimal score near the data manifold, guidance can magnify the attraction toward training atoms, thereby increasing nearest-neighbor retrieval phenomena. This suggests that guidance strength should be treated as coupled to schedule design: if one uses strong guidance, it is prudent either to remain below the small-noise transition (by limiting effective capacity or strengthening regularization) or to mitigate coupling mechanisms that reduce ψeff.

Latent diffusion changes the relevant proportional-asymptotic ratios by replacing the pixel-space ambient dimension d with a smaller latent dimension. For fixed dataset size n and model width proxy p, decreasing d increases ψn = n/d and ψp = p/d simultaneously. Therefore, whether latent diffusion moves the system toward or away from the memorization boundary depends on the relative scaling of p with the latent dimension. Our framework suggests that latent compression can be beneficial insofar as it increases ψn faster than it increases the effective degrees of freedom of the score model, thereby increasing the margin ψeff − ψp at the smallest-noise times. However, if compression is accompanied by a proportional increase in model width, the system may remain near the boundary, and the same small-noise transition can reappear in latent space.
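As a purely illustrative calculation (the resolutions below are hypothetical and chosen only to make the ratios concrete): with pixel dimension d = 256·256·3 ≈ 2×10^5 and latent dimension d′ = 32·32·4 = 4096, both ratios are multiplied by d/d′ = 48 at fixed n and p,
$$ \psi_n' = \frac{n}{d'} = \frac{d}{d'}\,\psi_n, \qquad \psi_p' = \frac{p}{d'} = \frac{d}{d'}\,\psi_p, $$
so whether the margin at the smallest-noise times actually widens depends entirely on how p, and the coupling that enters ψeff, are rescaled along with the latent dimension.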

Several extensions are mathematically natural but technically nontrivial. (i) Our characterization is for fixed K. Passing to a dense time grid requires controlling the growth of the block-resolvent system as K → ∞ and identifying an operator-valued limit. One expects a functional fixed-point equation indexed by t, with ψeff becoming a functional of a time-correlation kernel; establishing this rigorously would connect the present analysis to continuous-time score matching and to the numerical stability of reverse-time solvers.

(ii) Finite-width training modifies the kernel during optimization and can, empirically, reduce cross-time correlations by learning representations that partially disentangle times. Proving such a decorrelation effect, or characterizing when it fails, remains open. A plausible route is to couple mean-field limits for feature learning with the random-matrix resolvent analysis used here, but this requires new tools because the design is no longer independent of the targets.

(iii) We have focused on (subspace) Gaussian models to make the coupled resolvent explicit. Real data introduce curvature, multimodality, and heavy tails; in such settings, the empirical score field may be even more ``atomic’’ at small noise, and the consequences of crossing ψeff may be more pronounced. It would be valuable to identify universality classes in which ψeff depends only on low-order moments and on time-feature correlations, as well as regimes where higher-order structure dominates and new order parameters are needed.

These directions are not merely refinements: they determine whether ψeff can be used as a predictive design quantity for modern diffusion pipelines. The present work establishes, at least in a controlled high-dimensional limit, that the multi-time coupling inherent in shared-parameter score models induces a genuine phase boundary for small-noise generalization, and that this boundary is manipulable through schedule and architecture choices.


Appendices: full linear-pencil derivations; numerical solver details; additional plots; proofs of auxiliary lemmas; experimental protocols and reproducibility.

We collect the technical components supporting the main statements in a set of appendices with two goals: (i) to make the block-resolvent method fully checkable, including the linear-pencil constructions and the closure of the associated trace system; and (ii) to provide sufficient numerical and experimental detail so that the limiting predictions and the finite-dimensional simulations can be reproduced without ambiguity.

A first appendix presents the full linearization of the rational matrix expressions that arise when we write the ridge solution and substitute it into the per-time train/test errors. Starting from the joint time-stacked design matrix (with blocks indexed by k ∈ {1, …, K}), we explicitly express the relevant quantities as traces of functions of coupled Gram operators, including off-diagonal blocks induced by time features. We then introduce a block linear pencil LK(z, q) whose inverse contains, as Schur complements, the resolvents of interest. The appendix gives the precise block structure, the bookkeeping conventions for block indices, and the step-by-step verification that the desired traces are obtained as normalized traces of specific diagonal blocks of LK(z, q)−1 and, when needed, of its derivatives with respect to z and auxiliary perturbation parameters q. For the case K = 2, we provide a minimal pencil with a fully explicit block layout and show how the coupled system reduces to a small set of scalar Stieltjes-transform variables.
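For readers unfamiliar with the method, the principle is visible in a textbook one-block example (this is not the pencil LK used in the appendix, whose block structure is richer): taking
$$ L(z) \;=\; \begin{pmatrix} -z I_n & X \\ X^{\top} & -I_p \end{pmatrix} \qquad\Longrightarrow\qquad \bigl(L(z)^{-1}\bigr)_{11} \;=\; \bigl(XX^{\top} - zI_n\bigr)^{-1}, $$
so the normalized trace of the top-left block of L(z)^{-1} is exactly the Stieltjes transform of the spectrum of XX^{⊤}. The appendix applies the same Schur-complement identity blockwise, with the time structure entering through additional rows and columns of the pencil.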

A second appendix records the time-augmented Gaussian equivalence reduction used to replace nonlinear random features by an analytically tractable Gaussian-linear surrogate. We state the conditions on φ (Hermite expansion, finite second moment) under which the surrogate matches the first two moments of the features and preserves the relevant cross-time correlations induced by τ(tk). The outcome is a list of coefficients (e.g. $\mu_{0,k}$, $\mu_{1,k}$, $v_k$ and their cross-time analogues) that parametrize the surrogate covariance operator. We include the derivations for these coefficients in the subspace-Gaussian model P0 = 𝒩(0, Π), highlighting how the Π structure enters only through a small number of scalars (via ψD in the proportional limit) and how the OU parameters (ak, hk) enter through simple rescalings. The appendix also provides consistency checks (such as the decoupled-time limit) that serve as debugging tools for implementations.
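Under one common convention in the Gaussian-equivalence literature (the per-time coefficients in the appendix may be normalized slightly differently; the appendix fixes the exact definitions), the single-time coefficients are the low-order Hermite moments of the activation,
$$ \mu_{0} = \mathbb{E}\bigl[\varphi(g)\bigr], \qquad \mu_{1} = \mathbb{E}\bigl[g\,\varphi(g)\bigr], \qquad v^{2} = \mathbb{E}\bigl[\varphi(g)^{2}\bigr] - \mu_{0}^{2} - \mu_{1}^{2}, \qquad g \sim \mathcal{N}(0,1), $$
with the per-time versions obtained by evaluating these moments at the pre-activation scale set by (ak, hk) and by the time embedding.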

A third appendix derives the finite-dimensional fixed-point system that characterizes the deterministic limits of the block traces. We write the Dyson (self-consistent) equation for the pencil resolvent and apply the block trace map to obtain a closed algebraic system for the trace variables {ζr}. We emphasize the precise symmetry reductions that allow one to replace matrix-valued unknowns by O(K²) scalars, and we track carefully where the proportional-asymptotic ratios (ψn, ψp, ψD) enter. Since different equivalent parameterizations of the unknowns are possible (e.g. in terms of block Stieltjes transforms versus Schur-complement traces), we provide a translation table between the parameterizations used in different parts of the manuscript, along with a list of the trace identities used to convert the solved fixed-point variables into predicted train/test errors at each time.
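Schematically (the exact notation and block bookkeeping are fixed in the appendix), the self-consistent equation takes the standard operator-valued form
$$ G(z) \;=\; \bigl(\Lambda(z) - \eta\!\left[G(z)\right]\bigr)^{-1}, $$
where G is the block resolvent of the pencil, Λ(z) collects its deterministic blocks, and η is the block-covariance map of the random blocks; applying the block trace to both sides and exploiting the symmetry reductions yields the closed system in {ζr} referred to above.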

Several lemmas used implicitly in the main line are proved in full. These include: concentration statements ensuring that empirical block Gram operators are close to their deterministic equivalents at scale $d^{-1/2}$ (sufficient for the deterministic limit claims); resolvent identities controlling derivatives with respect to z and small perturbations (needed for variance terms and for interpolation-denominator diagnostics); and small-noise asymptotics that justify the separation between population-score behavior and empirical-score alignment in the hk ≪ 1 regime. We also include a short proof that the time-decoupled limit reduces to the single-time formulas by showing that the off-diagonal blocks vanish and the Dyson system factorizes. Where a step relies on standard random-matrix results, we cite a canonical reference and spell out the correspondence of assumptions (in particular, bounded moments and independence structure) to avoid hidden regularity requirements.

An implementation-focused appendix describes how we solve the fixed-point equations robustly for finite K and how we compute the derived error quantities. We specify the unknown vector, the residual map, and the Jacobian used in Newton or quasi-Newton iterations. To handle regimes near the memorization transition (where denominators become small and naive Newton updates can be unstable), we describe a continuation strategy in ψp or λ, damped Newton steps, and a stopping criterion based on both residual norm and relative change in the trace variables. Derivatives needed for error expressions are obtained either by automatic differentiation of the fixed-point map (when implemented in a differentiable framework) or by finite differences with step sizes chosen relative to λ to balance truncation and roundoff. We also include sanity checks: verifying positivity of the implied Stieltjes transforms, monitoring monotonicity in λ, and comparing K = 1 solutions to known single-time formulas.
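A stripped-down version of the driver described here is sketched below (illustrative names; the residual map encapsulating the fixed-point equations is problem-specific, and the actual implementation additionally uses quasi-Newton updates, λ-continuation, and the sanity checks listed above).

```python
import numpy as np

def solve_fixed_point(residual, zeta0, psi_p_grid, damping=0.5, tol=1e-10, max_iter=200):
    """Continuation + damped-Newton driver for the trace fixed point (sketch).

    residual   : callable (zeta, psi_p) -> residual vector F(zeta; psi_p)
    zeta0      : initial guess for the trace variables at the first psi_p
    psi_p_grid : increasing grid of psi_p values; each solution warm-starts the next
    """
    zeta = np.asarray(zeta0, dtype=float)
    solutions = []
    for psi_p in psi_p_grid:
        for _ in range(max_iter):
            F = residual(zeta, psi_p)
            if np.linalg.norm(F) < tol:
                break
            # Finite-difference Jacobian (fine for small systems; use autodiff if available).
            eps = 1e-7
            J = np.empty((len(F), len(zeta)))
            for j in range(len(zeta)):
                dz = np.zeros_like(zeta)
                dz[j] = eps
                J[:, j] = (residual(zeta + dz, psi_p) - F) / eps
            step = np.linalg.solve(J, -F)
            zeta = zeta + damping * step   # damped update for stability near the transition
        solutions.append(zeta.copy())
    return np.array(solutions)
```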

We provide supplementary figures that visualize (i) per-time learning curves across multiple tk for varying (ψn, ψp); (ii) the effect of changing noise budgets mk at fixed width; (iii) the sensitivity of ψeff to the correlation structure of the time embedding (including both nearly collinear and nearly orthogonal regimes); and (iv) comparisons between finite-d simulations and deterministic-equivalent predictions, including observed convergence rates as d increases. Each plot is accompanied by a brief caption that states the parameter values, the averaging procedure across random seeds, and the specific error functional being displayed (train, test, or aggregate). We also include diagnostic plots of the fixed-point residual versus iteration and of the smallest eigenvalue of the relevant Gram blocks as a function of ψp, which help identify numerical artifacts versus genuine transitions.

Finally, we document the simulation pipeline used for all finite-dimensional experiments. We specify how we generate the training points $x_i$ under the subspace Gaussian model (construction of Π, normalization conventions, and sampling), how we generate OU-perturbed samples $x_{t_k}^{(i,j)}$, and how we draw and store the random feature weights (Wx, Wt) and the time embeddings τ(tk). We record the exact ridge objective solved in finite dimensions, including the scaling of the design matrix and the regularization term, and we describe the linear algebra routines used (direct solvers versus iterative methods) together with their tolerances. We list default seeds, the number of independent trials used for reported averages, and the hardware/software environment assumed. Where appropriate, we provide pseudocode for data generation, training, and evaluation, and we state how to reproduce each figure by a single command or script entry point. This appendix is intended to eliminate degrees of freedom that commonly affect learning-curve experiments (normalizations, hidden rescalings of λ, and inconsistent definitions of error), and thereby to make the empirical portion of the work verifiable in a straightforward manner.
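The following end-to-end sketch illustrates the shape of one such experiment (the names, the one-hot time embedding, the ReLU features, and the normalizations are illustrative stand-ins; the appendix fixes the exact conventions used for the reported figures).

```python
import numpy as np

def simulate_learning_curve_point(n, d, p, D, ts, ms, lam, seed=0):
    """One finite-dimensional experiment (sketch): subspace-Gaussian data,
    OU perturbations at times {t_k} with m_k noise draws each, random features
    over (x, tau(t)), and a shared ridge fit of the stacked DSM regression."""
    rng = np.random.default_rng(seed)
    # Subspace Gaussian data: x = U c with U a random d x D orthonormal frame.
    U, _ = np.linalg.qr(rng.standard_normal((d, D)))
    x0 = rng.standard_normal((n, D)) @ U.T
    # Random feature weights for data and time embedding (one-hot embedding for brevity).
    Wx = rng.standard_normal((p, d)) / np.sqrt(d)
    Wt = rng.standard_normal((p, len(ts)))
    tau = np.eye(len(ts))
    feats, targets = [], []
    for k, (t_k, m_k) in enumerate(zip(ts, ms)):
        a_k, h_k = np.exp(-t_k), 1.0 - np.exp(-2.0 * t_k)
        for _ in range(m_k):
            z = rng.standard_normal((n, d))
            x_t = a_k * x0 + np.sqrt(h_k) * z
            feats.append(np.maximum(x_t @ Wx.T + tau[k] @ Wt.T, 0.0))  # ReLU features
            targets.append(-z / np.sqrt(h_k))                          # DSM targets
    F, Y = np.vstack(feats), np.vstack(targets)
    # Ridge solution shared across all times: one weight matrix A of shape (p, d).
    A = np.linalg.solve(F.T @ F / len(F) + lam * np.eye(p), F.T @ Y / len(F))
    return A, (U, Wx, Wt)
```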