
Time-Conditioned Denoising Score Matching as Multi-Task Interpolation: Precise Learning Curves and Memorization Thresholds


Table of Contents

  1. Introduction: memorization in time-conditioned diffusion; why single-time theory is insufficient; summary of contributions and phase diagram with n_eff.
  2. Background: OU/VP diffusion, DSM objectives, empirical optimal score and memorization mechanism; recap of random-features/NTK learning-curve tools from the source material (GEP, linear pencils).
  3. Multi-time DSM with shared parameters: formal definition of the joint objective over {t_k}, noise reuse {m_k}, time embeddings τ(t), and shared score model s_A(x,t); definitions of per-time and integrated train/test errors and memorization order parameters.
  4. Main technical reduction: Gaussian equivalence for time-augmented features; identification of the coupled random matrix whose resolvent controls learning curves; statement of the block linear-pencil construction.
  5. The K=2 case in full detail: explicit 2×2 block resolvent equations; closed-form characterization of limiting train/test errors; definition and computation of ψ_eff; recovery of the p ≈ n boundary under time-decoupling.
  6. General finite K: operator-/block-valued fixed-point equations; conditions under which ψ_eff exists and is unique; qualitative monotonicity in m_k and in time-feature correlation.
  7. Phase boundary and (near-)matching bounds: sufficient conditions (upper) for generalization when p < n_eff and unavoidable alignment (lower) when p > n_eff within the shared linear-in-parameters model class.
  8. Experiments: NTK-linearized U-Net/DiT and finite-width checks; measuring effective sample size and predicting threshold; ablations over time grid, embedding dimension, and per-time noise reuse; link to nearest-neighbor retrieval under reverse diffusion.
  9. Discussion: implications for training schedules and architecture design; connections to early stopping, guidance, and latent diffusion; open problems (continuous time, feature learning, non-Gaussian data).
  10. Appendices: full linear-pencil derivations; numerical solver details; additional plots; proofs of auxiliary lemmas; experimental protocols and reproducibility.

Content

1. Introduction: memorization in time-conditioned diffusion; why single-time theory is insufficient; summary of contributions and phase diagram with n_eff.

We consider score learning for an Ornstein–Uhlenbeck (OU) diffusion at a finite set of time points {tk}k = 1K, where each forward perturbation takes the form
$$ x_{t_k}=a_k x_0+\sqrt{h_k}\,z,\qquad a_k=e^{-t_k},\qquad h_k=1-e^{-2t_k}, $$
with x0 ∼ P0 and z ∼ 𝒩(0, Id). Our focus is on the high-dimensional regime in which the ambient dimension d grows proportionally with the number of training samples n and the number of random features (equivalently, linearized parameters) p, while the number of training times K is fixed. The central phenomenon we address is memorization in time-conditioned diffusion score training: when a single model is trained across times with shared parameters, the small-noise times can undergo an abrupt transition from population-level generalization to an estimator that tracks an empirical, kernel-density-like score field concentrated around the training data.

The memorization mechanism is most transparent at times with hk ≪ 1. At such times, xtk is close to x0, and the score s*(⋅, tk) = ∇log ptk encodes fine-scale structure of the data distribution. In denoising score matching (DSM), one trains a model to predict $-z/\sqrt{h_k}$ from xtk, and thus the supervision becomes increasingly "sharp" as hk → 0. When the model class is sufficiently expressive relative to the effective number of constraints supplied by the dataset and the noise sampling budget, the learned score can match the empirical optimal score semp(⋅, tk), i.e. the score of the finite-sample mixture distribution induced by the training points and the OU forward kernel. This empirical score is accurate on the training set but can deviate from the population score in a way that does not vanish with d, yielding a nontrivial excess test error at small noise even as the training error continues to decrease.

Existing high-dimensional learning-curve theory for random-features regression provides a precise account of interpolation thresholds and excess risk in score regression problems. However, modern diffusion training does not fit a single-time abstraction: a single score network is trained jointly across a family of times, and this joint training introduces two structural differences that invalidate a naive extension of single-time formulas.

First, parameter sharing couples the regression problems across times. Concretely, a shared second-layer weight matrix A ∈ ℝd × p must simultaneously fit K regression tasks indexed by tk, each with its own forward distribution Ptk and label field $-z/\sqrt{h_k}$. Even when the time embedding τ(t) and the random time weights Wt are fixed, the resulting feature design matrices at different tk are neither independent nor orthogonal in general. Consequently, the effective degrees of freedom available to fit small-noise constraints depend not only on ψp = p/d and ψn = n/d but also on cross-time correlations induced by τ(⋅) and on the distribution of noise budgets {mk}.

Second, multi-time training can shift the interpolation boundary for the small-noise task away from the classical p ≈ n threshold. Intuitively, a large-noise time tℓ (with hℓ = Θ(1)) supplies abundant, low-variance regression constraints that are easier to fit and may partially "use up" shared capacity, whereas a small-noise time ts (with hs ≪ 1) supplies highly localized constraints that can trigger memorization once the shared parameterization becomes expressive enough to represent the empirical score component. Thus, the correct control parameter is not simply ψn but an effective ratio ψeff that aggregates, through the coupling structure, how many independent constraints the joint design imposes on the shared model for the purpose of fitting the small-noise score.

The goal of this work is to provide a mathematically explicit characterization of this coupled-time transition in a tractable surrogate model that still retains the essential coupling induced by time-conditioning. We analyze a random-features/NTK-linearized score model of the form
$$ s_A(x,t)=\frac{A}{\sqrt{p}}\,\varphi\!\left(\frac{W_x x}{\sqrt{d}},\frac{W_t\tau(t)}{\sqrt{d_t}}\right), $$
trained by a multi-time DSM objective with ridge regularization λ > 0 and with mk independent noise draws per datum at each time tk. We work under proportional asymptotics d, n, p → ∞ with fixed ratios (ψn, ψp) and fixed intrinsic ratio ψD = D/d for the subspace Gaussian data model P0 = 𝒩(0, Π). The restriction to finite, fixed K is deliberate: it reflects the fact that practice uses a discretized time grid, and it yields a closed-form random-matrix characterization in terms of a constant-size system of equations.
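To fix ideas, the following minimal NumPy sketch instantiates the linearized score model above with an illustrative elementwise activation φ(u, v) = tanh(u + v) and a sinusoidal time embedding τ(t); both choices, and all dimensions, are placeholders rather than the settings analyzed here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, d_t = 64, 256, 16              # ambient, feature, and embedding dimensions

W_x = rng.standard_normal((p, d))    # fixed random data weights
W_t = rng.standard_normal((p, d_t))  # fixed random time weights
A = rng.standard_normal((d, p))      # trainable second layer, shared across times

def tau(t):
    # Illustrative sinusoidal embedding; the model only requires some fixed tau(t).
    freqs = np.arange(1, d_t // 2 + 1)
    return np.concatenate([np.sin(freqs * t), np.cos(freqs * t)])

def phi(u, v):
    # Illustrative elementwise activation phi(u, v).
    return np.tanh(u + v)

def s_A(x, t):
    # s_A(x, t) = (A / sqrt(p)) * phi(W_x x / sqrt(d), W_t tau(t) / sqrt(d_t))
    u = W_x @ x / np.sqrt(d)
    v = W_t @ tau(t) / np.sqrt(d_t)
    return (A / np.sqrt(p)) @ phi(u, v)

x = rng.standard_normal(d)
print(s_A(x, 0.1).shape)             # -> (d,)
```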

Our contributions can be summarized as follows.

We show that the expected per-time training and test errors converge to deterministic limits that can be expressed using traces of resolvents of a random matrix built from the joint feature design across all times. The multi-time coupling requires a matrix-valued Stieltjes transform rather than a scalar one: cross-time correlations appear as off-diagonal blocks, and the quantities controlling generalization are block traces of the inverse of a structured linear pencil. The final outcome is a finite-dimensional algebraic fixed-point system whose unknowns are normalized block traces; solving this system yields the limiting errors Etrain(tk) and Etest(tk), as well as any aggregate error $\overline{E}_{\mathrm{test}}$ over the time grid.

In the regime where at least one time satisfies hk ≪ 1, we identify a computable scalar order parameter ψeff such that the small-noise learning curve exhibits a sharp crossover at ψp ≈ ψeff. Below this boundary, the estimator remains close (in the appropriate population sense) to the Bayes score and the small-noise test error is controlled. Above this boundary, the estimator aligns with the empirical optimal score on the data manifold, and a nonvanishing excess test error emerges. Importantly, ψeff depends on the full multi-time design, including the noise budgets {mk} and the time embedding correlations, and it reduces to ψn only in a decoupled-time limit.

For K = 2 with a small-noise time ts and a larger-noise time tℓ, we derive an explicit 2 × 2 block linear pencil and obtain closed resolvent equations that quantify how cross-time coupling shifts the memorization threshold. This yields a provable characterization of ψeff(ts, tℓ, ms, mℓ, τ, φ, ψn, ψD, λ) and demonstrates how the classical boundary p ≈ n is recovered when time features are orthogonal so that cross-time correlations vanish.

We interpret the transition as occurring at an effective sample size neff := dψeff, which plays the role of the number of independent constraints that the joint multi-time regression imposes on the shared parameters for the small-noise task. In particular, in the overparameterized regime ψp ≫ ψeff, the shared model has sufficient capacity to interpolate the multi-time DSM constraints in a manner that necessarily introduces an empirical-score component at small noise. Conversely, when ψp ≪ ψeff, the estimator is forced to remain in a population-aligned regime and does not collapse to the empirical score.

The remainder of the paper develops the background and the technical tools needed to support these statements. We first formalize the OU/variance-preserving diffusion setting and the DSM objectives, and we make precise the distinction between the population score s* and the empirical optimal score semp that governs memorization. We then recall the random-features learning-curve machinery required for the multi-time setting, including the Gaussian equivalence principle and the linear-pencil linearization that converts coupled rational trace expressions into a tractable block-resolvent problem.


2. Background: OU/VP diffusion, DSM objectives, empirical optimal score and memorization mechanism; recap of random-features/NTK learning-curve tools from the source material (GEP, linear pencils).

For each time t > 0, the OU forward map defines a Markov kernel Qt(dx ∣ x0) given by
$$ x_t = a\,x_0+\sqrt{h}\,z,\qquad a=e^{-t},\qquad h=1-e^{-2t},\qquad z\sim\mathcal{N}(0,I_d), $$
so that Qt(⋅ ∣ x0) = 𝒩(ax0, hId) and the marginal is the convolution
Pt = P0 ∗ Qt,   pt(x) = ∫ϕh(x − ax0) P0(dx0),
where ϕh(u) = (2πh)^{−d/2} exp (−∥u∥22/(2h)). The population score field at time t is
s*(x, t) = ∇xlog pt(x),
which is well-defined under mild conditions (e.g. if P0 has a density, or more generally as a weak gradient of the log-convolution). In the subspace Gaussian model used throughout, P0 = 𝒩(0, Π) with Π an orthogonal projection of rank D, and Pt is again Gaussian with covariance
Σt = a2Π + hId,   so that   s*(x, t) = −Σt−1x.
This explicit form is convenient for computing Bayes benchmarks and for separating behavior along Im(Π) and its orthogonal complement.
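As a concrete check of this benchmark, the sketch below builds a rank-D projection Π, forms Σt = a2Π + hId, and verifies that the population score −Σt−1x acts as −1/(a2 + h) on Im(Π) and as −1/h on its orthogonal complement; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, t = 32, 8, 0.3
a, h = np.exp(-t), 1.0 - np.exp(-2.0 * t)

# Orthogonal projection Pi of rank D onto a random D-dimensional subspace.
Q, _ = np.linalg.qr(rng.standard_normal((d, D)))
Pi = Q @ Q.T

# Covariance of P_t and the population score s*(x, t) = -Sigma_t^{-1} x.
Sigma_t = a**2 * Pi + h * np.eye(d)

def population_score(x):
    return -np.linalg.solve(Sigma_t, x)

x = rng.standard_normal(d)
s = population_score(x)
# Along Im(Pi) the score scales as -1/(a^2 + h); orthogonal to it, as -1/h.
print(np.allclose(Pi @ s, -(Pi @ x) / (a**2 + h)))   # True
print(np.allclose(s - Pi @ s, -(x - Pi @ x) / h))    # True
```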

The (population) denoising score matching objective at time t is the regression risk obtained by observing xt and predicting $-z/\sqrt{h}$:
$$ \mathcal{R}_t(s)\;:=\;\frac{1}{d}\,\mathbb{E}\Bigl[\bigl\|\sqrt{h}\,s(x_t,t)+z\bigr\|_2^2\Bigr], $$
where the expectation is over (x0, z). A standard integration-by-parts identity (equivalently, Stein’s lemma under the conditional Gaussian xt ∣ x0) implies the basic equivalence
$$ \mathcal{R}_t(s)\;=\;\frac{h}{d}\,\mathbb{E}_{x\sim P_t}\Bigl[\bigl\|s(x,t)-s^{*}(x,t)\bigr\|_2^2-\bigl\|s^{*}(x,t)\bigr\|_2^2\Bigr]\;+\;1; $$
hence, at each fixed t, minimizing DSM is equivalent to L2(Pt) regression toward the population score s*(⋅, t), up to the additive constant 1 and a term that does not depend on s.

The memorization phenomenon is not captured by this population-level identity alone because practical training does not minimize ℛt over all measurable functions: it minimizes an empirical average over finitely many data points and finitely many noises. Fix a dataset {xi}i = 1n and (for the moment) consider the idealized regime m = ∞ where we average over the noise conditional on each xi. The empirical forward distribution at time t is then the mixture
$$ P_t^{\mathrm{emp}} := \frac{1}{n}\sum_{i=1}^n \mathcal{N}(a x_i, h I_d), \qquad p_t^{\mathrm{emp}}(x)=\frac{1}{n}\sum_{i=1}^n \phi_h(x-a x_i), $$
and the corresponding empirical optimal score is
$$ s^{\mathrm{emp}}(x,t):=\nabla_x \log p_t^{\mathrm{emp}}(x) =\frac{\sum_{i=1}^n \phi_h(x-a x_i)\,\frac{a x_i-x}{h}}{\sum_{i=1}^n \phi_h(x-a x_i)}. $$
This field coincides with the population score of the finite-sample mixture Ptemp, and it is the exact minimizer of the conditional-noise-averaged empirical DSM risk over all measurable functions. In particular, semp is a kernel-density-like object at bandwidth $\sqrt{h}$, and its deviation from s* is controlled by the usual bias–variance tradeoffs of KDE, except that we evaluate it in the high-dimensional scaling relevant to learning curves.
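A minimal implementation of the empirical optimal score is a softmax-weighted average of the per-sample directions (axi − x)/h; the sketch below computes it with a log-sum-exp normalization for numerical stability at small h (sizes and the standard-normal training data are illustrative).

```python
import numpy as np

def empirical_score(x, train_X, t):
    # s_emp(x, t) = sum_i w_i(x) (a x_i - x) / h, with w_i the softmax of
    # -||x - a x_i||^2 / (2h) (the Gaussian kernel weights phi_h(x - a x_i)).
    a, h = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    diffs = a * train_X - x[None, :]                 # rows: a x_i - x
    log_w = -np.sum(diffs ** 2, axis=1) / (2.0 * h)
    log_w -= log_w.max()                             # log-sum-exp stabilization
    w = np.exp(log_w)
    w /= w.sum()
    return (w[:, None] * diffs).sum(axis=0) / h

rng = np.random.default_rng(2)
n, d, t = 200, 16, 0.05
train_X = rng.standard_normal((n, d))
a, h = np.exp(-t), 1.0 - np.exp(-2.0 * t)
x_t = a * train_X[0] + np.sqrt(h) * rng.standard_normal(d)
print(empirical_score(x_t, train_X, t)[:3])
```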

The "sharpness" of small-noise constraints is visible directly in semp. When h ≪ 1, the mixture components concentrate in balls of radius $\sqrt{h}$ around the points axi, and near such a point the score behaves like (axi − x)/h, i.e. it has magnitude on the order of $1/\sqrt{h}$ at typical perturbations. Thus, for small h, the empirical target contains narrow, high-curvature structure tied to individual training samples. A sufficiently expressive model trained on the empirical objective can represent this sample-specific structure; doing so yields low training error while potentially increasing the population test error measured against s* (equivalently, the L2(Pt) error in the identity above). The crossover we study arises when the shared parameterization becomes capable of fitting the empirical score component at the small-noise times.

For finite m < ∞, the empirical loss incorporates additional label noise due to Monte Carlo sampling of z. At fixed t, this is the standard random-design regression setting with nm samples; however, in multi-time training the relevant constraint count is not simply ∑k n mk because all times share the same parameter matrix and their feature designs are correlated.

Once we fix the first-layer random weights and the time embedding, the score model is linear in the trainable parameters. At a given time tk, each perturbed input xtk(i, j) produces a feature vector in ℝp via
$$ \phi_{k}^{(i,j)}:=\varphi\!\left(\frac{W_x x_{t_k}^{(i,j)}}{\sqrt{d}},\frac{W_t \tau(t_k)}{\sqrt{d_t}}\right), $$
and the model output is $s_A(x_{t_k}^{(i,j)},t_k)=(A/\sqrt{p})\,\phi_{k}^{(i,j)}$. The DSM training problem is therefore a ridge regression with vector-valued labels $-z_k^{(i,j)}/\sqrt{h_k}$ and a design matrix assembled from the feature vectors across all times. High-dimensional learning-curve theory reduces train/test metrics to traces of rational functions of empirical Gram matrices (or, equivalently, sample covariances in feature space).

The primary obstacle is that ϕk(i, j) is nonlinear in Gaussian weights and high-dimensional inputs. We address this using a Gaussian equivalence principle (GEP): under the proportional limit and mild moment conditions on φ, the feature matrix can be replaced, for the purpose of computing normalized traces of resolvents and related quantities, by a Gaussian design with matched first- and second-order statistics. Concretely, only a finite set of activation-dependent coefficients (e.g. Hermite coefficients, or equivalently the induced mean and covariance parameters) enter the limiting learning curves. In the multi-time setting, we require the joint covariance across times; the time embedding induces correlations
𝔼[ϕk(i, j)(ϕℓ(i, j))⊤]
that remain nontrivial even when the data/noises are independent across (k, i, j), and these correlations are the mechanism by which parameter sharing couples the effective regression problems.

To compute the limiting traces, we use a linear-pencil linearization. In its simplest form, for a (centered) feature matrix Φ ∈ ℝN × p one can embed the resolvent of Φ⊤Φ/p into the inverse of a block matrix
$$ L(z)=\begin{pmatrix} -z I_p & \Phi^\top/\sqrt{p}\\ \Phi/\sqrt{p} & -I_N \end{pmatrix}, \qquad L(z)^{-1}=\begin{pmatrix} (\Phi^\top \Phi/p-z I_p)^{-1} & \ast\\ \ast & \ast \end{pmatrix}, $$
so that normalized traces of (Φ⊤Φ/p + λI)−1 and related objects become block traces of L(−λ)−1. In the coupled multi-time problem, Φ is replaced by a structured block design collecting all times, and one must track block traces corresponding to each time (and cross-time) component. The resulting deterministic equivalents are obtained by solving a finite-dimensional Dyson (self-consistent) equation for these block traces. This machinery is the technical backbone that converts a high-dimensional, nonlinear random-features DSM problem into a closed algebraic system whose solution yields the limiting training and test errors, and ultimately the effective sample-size parameter governing memorization.
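The block identity underlying the linearization is easy to check numerically: for a small random Φ, the top-left block of L(z)−1 coincides with the resolvent (Φ⊤Φ/p − zIp)−1. The sketch below performs exactly this sanity check at a negative spectral parameter.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, z = 40, 30, -0.7           # evaluate at z = -lambda < 0

Phi = rng.standard_normal((N, p))
L = np.block([
    [-z * np.eye(p),     Phi.T / np.sqrt(p)],
    [Phi / np.sqrt(p),   -np.eye(N)        ],
])
L_inv = np.linalg.inv(L)

# Top-left p x p block of L(z)^{-1} equals (Phi^T Phi / p - z I_p)^{-1}.
resolvent = np.linalg.inv(Phi.T @ Phi / p - z * np.eye(p))
print(np.allclose(L_inv[:p, :p], resolvent))   # True
```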


3. Multi-time DSM with shared parameters: formal definition of the joint objective over {t_k}, noise reuse {m_k}, time embeddings τ(t), and shared score model s_A(x,t); definitions of per-time and integrated train/test errors and memorization order parameters.

We fix a finite time grid {tk}k = 1K ⊂ (0, ∞) and write
ak = e−tk,   hk = 1 − e−2tk.
Given training samples {xi}i = 1n ⊂ ℝd drawn i.i.d. from P0, we generate, for each time tk and each datum xi, a (possibly time-dependent) number mk ∈ {1, 2, …} ∪ {∞} of i.i.d. Gaussian noises {zk(i, j)}j = 1mk with zk(i, j) ∼ 𝒩(0, Id), and we form perturbed inputs
$$ x_{t_k}^{(i,j)} \;:=\; a_k x_i + \sqrt{h_k}\, z_k^{(i,j)}. $$
The case mk = ∞ is interpreted as exact conditional-noise averaging, i.e. replacing empirical sums over j by an expectation over z ∼ 𝒩(0, Id) conditional on xi at that time.
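The following sketch generates such a multi-time training set for a subspace Gaussian P0: for each time tk it draws mk fresh noises per datum and stores the perturbed inputs together with the noises used as DSM targets (grid, budgets, and dimensions are illustrative).

```python
import numpy as np

rng = np.random.default_rng(4)
d, D, n = 32, 8, 200
times = [0.05, 0.5, 1.5]          # time grid {t_k}
budgets = [4, 2, 1]               # per-time noise budgets {m_k}

# Subspace Gaussian data: x_i ~ N(0, Pi) with Pi a rank-D orthogonal projection.
Q, _ = np.linalg.qr(rng.standard_normal((d, D)))
X0 = rng.standard_normal((n, D)) @ Q.T

dataset = {}                      # k -> (perturbed inputs, noises), both (n*m_k, d)
for k, (t, m_k) in enumerate(zip(times, budgets)):
    a_k, h_k = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    Z = rng.standard_normal((n, m_k, d))                  # z_k^{(i,j)}
    X_t = a_k * X0[:, None, :] + np.sqrt(h_k) * Z         # x_{t_k}^{(i,j)}
    dataset[k] = (X_t.reshape(n * m_k, d), Z.reshape(n * m_k, d))
    print(f"k={k}: a_k={a_k:.3f}, h_k={h_k:.3f}, N_k={n * m_k}")
```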

We train a score model shared across all times. Concretely, after sampling first-layer random weights (Wx, Wt) and fixing a time embedding τ(⋅), we consider the linear-in-parameters class
$$ s_A(x,t) \;=\; \frac{A}{\sqrt{p}}\,\varphi\!\left(\frac{W_x x}{\sqrt{d}},\,\frac{W_t\tau(t)}{\sqrt{d_t}}\right)\in\mathbb{R}^d, \qquad A\in\mathbb{R}^{d\times p}, $$
where φ acts elementwise on the p coordinates. For each training triple (k, i, j), we denote the corresponding feature vector by
$$ \phi_k^{(i,j)}\;:=\;\varphi\!\left(\frac{W_x x_{t_k}^{(i,j)}}{\sqrt{d}},\,\frac{W_t\tau(t_k)}{\sqrt{d_t}}\right)\in\mathbb{R}^p, \qquad s_A(x_{t_k}^{(i,j)},t_k)=\frac{A}{\sqrt{p}}\,\phi_k^{(i,j)}. $$

The multi-time denoising score matching (DSM) objective with ridge regularization λ > 0 is then
$$ \mathcal{L}(A)\;=\;\sum_{k=1}^{K}\frac{1}{d\,N_k}\sum_{i=1}^{n}\sum_{j=1}^{m_k}\Bigl\|\sqrt{h_k}\,s_A\bigl(x_{t_k}^{(i,j)},t_k\bigr)+z_k^{(i,j)}\Bigr\|_2^2\;+\;\frac{\lambda}{d\,p}\sum_{k=1}^{K}h_k\,\|A\|_F^2, $$
with the convention that when mk = ∞ the inner average is 𝔼z ∼ 𝒩(0, Id)[⋅ ∣ xi]. The factor hk in the regularizer is chosen so that each time contributes at the same scale as the corresponding DSM loss, and so that the small-noise regime hk ≪ 1 does not trivially force over-regularization. Since A is shared, the objective is not a sum of independent single-time problems: even if the noise draws are independent across times, the same parameter matrix must simultaneously fit all time-indexed regression tasks.

It is convenient to make the regression structure explicit. Let Nk := nmk be the number of training pairs at time tk (finite if mk < ∞), and form the design matrix Φk ∈ ℝNk × p whose rows are the feature vectors {ϕk(i, j)}, ordered arbitrarily. Likewise let Zk ∈ ℝNk × d be the matrix whose rows are the noise vectors {zk(i, j)}. Writing the prediction matrix as $\widehat{S}_k:=\Phi_k A^\top/\sqrt{p}\in\mathbb{R}^{N_k\times d}$, we may rewrite the data-fit part of the objective as
$$ \frac{1}{dN_k}\bigl\|\sqrt{h_k}\,\widehat{S}_k+Z_k\bigr\|_F^2, $$
so that Â = arg minA ℒ(A) is a ridge regression estimator with a stacked response and a block design across times. Importantly, the effective coupling is controlled by the correlations between ϕk(i, j) and ϕℓ(i, j) induced jointly by the shared weights Wx, the shared time weights Wt, and the embedding inner products ⟨τ(tk), τ(tℓ)⟩.

To evaluate performance, we distinguish empirical (training) DSM errors from population (test) DSM errors at each time. For any fitted parameter Â, we define the per-time training error (excluding the ridge term) by
$$ E_{\mathrm{train}}(t_k)\;:=\;\frac{1}{d\,N_k}\sum_{i=1}^{n}\sum_{j=1}^{m_k}\Bigl\|\sqrt{h_k}\,s_{\widehat{A}}\bigl(x_{t_k}^{(i,j)},t_k\bigr)+z_k^{(i,j)}\Bigr\|_2^2, $$
again with the conditional expectation convention when mk = ∞. The corresponding per-time test error is the DSM risk evaluated on an independent draw (x0, z):
$$ E_{\mathrm{test}}(t_k)\;:=\;\frac{1}{d}\,\mathbb{E}\Bigl[\bigl\|\sqrt{h_k}\,s_{\widehat{A}}\bigl(a_k x_0+\sqrt{h_k}\,z,\;t_k\bigr)+z\bigr\|_2^2\Bigr], $$
where the expectation is over (x0, z) ∼ P0 ⊗ 𝒩(0, Id) and is conditional on the training data and random features through Â.

By the single-time identity above applied at each tk, we may equivalently view Etest(tk) as a scaled L2(Ptk) score estimation error:
$$ E_{\mathrm{test}}(t_k)\;=\;\frac{h_k}{d}\,\mathbb{E}_{x\sim P_{t_k}}\Bigl[\bigl\|s_{\widehat{A}}(x,t_k)-s^{*}(x,t_k)\bigr\|_2^2-\bigl\|s^{*}(x,t_k)\bigr\|_2^2\Bigr]\;+\;1. $$
Thus the natural score error is Etest(tk) − 1, and the factor hk emphasizes that small-noise times encode more stringent (higher-curvature) constraints per unit L2(Ptk) error.

Since K is fixed, we aggregate across times by a finite weighted average. Fix weights {wk}k = 1K with wk ≥ 0 and kwk = 1 (e.g. uniform weights wk = 1/K), and define
$$ \overline{E}_{\mathrm{train}}:=\sum_{k=1}^K w_k\,E_{\mathrm{train}}(t_k), \qquad \overline{E}_{\mathrm{test}}:=\sum_{k=1}^K w_k\,E_{\mathrm{test}}(t_k). $$
In applications one may interpret $\overline{E}_{\mathrm{test}}$ as a discretized integral along the diffusion time axis; in our asymptotic analysis it is simply a fixed linear functional of the per-time learning curves.

Finally, to formalize the "memorization" behavior at small noise, we introduce order parameters that compare the learned score to the empirical optimal score semp(⋅, tk) when mk = ∞. For such times, a natural empirical alignment metric is
$$ \mathrm{Mem}(t_k)\;:=\;\frac{h_k}{d}\,\mathbb{E}_{x\sim P_{t_k}^{\mathrm{emp}}}\Bigl[\bigl\|s_{\widehat{A}}(x,t_k)-s^{\mathrm{emp}}(x,t_k)\bigr\|_2^2\Bigr], $$
where $P_{t_k}^{\mathrm{emp}}=\frac{1}{n}\sum_{i=1}^n \mathcal{N}(a_k x_i,h_k I_d)$ is the conditional-noise-averaged empirical mixture. The scaling by hk matches the DSM normalization and ensures that Mem(tk) is commensurate with DSM excess risks. Informally, Mem(tk) → 0 indicates that the estimator has enough capacity (relative to the effective number of constraints induced by parameter sharing across times) to reproduce the KDE-like, sample-specific score field at bandwidth $\sqrt{h_k}$, whereas Mem(tk) bounded away from 0 indicates that the estimator remains closer to the population score s*(⋅, tk) in the sense of the test-error identity above. In the sequel, these per-time quantities, together with the train–test gap Etest(tk) − Etrain(tk) at small hk, will be the observables through which the effective sample-size ratio ψeff and the associated generalization–memorization crossover are identified.


4. Main technical reduction: Gaussian equivalence for time-augmented features; identification of the coupled random matrix whose resolvent controls learning curves; statement of the block linear-pencil construction.

Our main technical step is to reduce the multi-time DSM problem to the analysis of a coupled random matrix resolvent under a Gaussian surrogate for the time-augmented random features. We proceed in three stages: (i) rewrite the objective as a single ridge problem with a weighted, stacked design; (ii) invoke a Gaussian equivalence principle (GEP) to replace the nonlinear time-conditioned features by an explicitly correlated Gaussian design with matching low-order moments; and (iii) express all train/test observables as normalized traces of rational functions of the corresponding Gram operator, which we then linearize via a block linear pencil.

Let Nk := nmk (finite when mk < ∞) and write Φk ∈ ℝNk × p for the feature matrix at time tk, with rows {ϕk(i, j)}, and Zk ∈ ℝNk × d for the corresponding noise matrix with rows {(zk(i, j))}. Define the weighted stacked design and response
$$ \widetilde{\Phi} \;:=\; \begin{bmatrix} \sqrt{\frac{h_1}{N_1}}\Phi_1\\ \vdots\\ \sqrt{\frac{h_K}{N_K}}\Phi_K \end{bmatrix} \in\mathbb{R}^{(\sum_k N_k)\times p}, \qquad \widetilde{Z} \;:=\; \begin{bmatrix} \frac{1}{\sqrt{N_1}}Z_1\\ \vdots\\ \frac{1}{\sqrt{N_K}}Z_K \end{bmatrix} \in\mathbb{R}^{(\sum_k N_k)\times d}, $$
and set $\Lambda:=\lambda\sum_{k=1}^K h_k$ for the aggregate ridge coefficient induced by the per-time regularizers. Up to an additive constant independent of A, the objective becomes
$$ \mathcal{L}(A) \equiv \frac{1}{d}\Bigl\|\widetilde{\Phi}\frac{A^\top}{\sqrt{p}}+\widetilde{Z}\Bigr\|_F^2 +\frac{\Lambda}{dp}\|A\|_F^2. $$
Hence the minimizer admits the usual closed form
$$ \widehat{A}\;=\;-\sqrt{p}\;\widetilde{Z}^{\top}\widetilde{\Phi}\,Q,\qquad Q:=\bigl(\widetilde{\Phi}^{\top}\widetilde{\Phi}+\Lambda I_p\bigr)^{-1}. $$
Because the response is isotropic across output coordinates, all per-time train/test quantities reduce to scalar ridge regression identities involving Q and certain cross-covariances between (time-dependent) test features and the training design. In particular, the deterministic limits of the per-time training and test errors are governed by finitely many normalized traces such as $\frac{1}{p}\mathrm{Tr}(Q)$, $\frac{1}{p}\mathrm{Tr}(Q\widetilde{\Phi}^\top\widetilde{\Phi}Q)$, and (for test risk) quadratic forms $\frac{1}{p}\phi_k(x)^\top Q\,\phi_k(x)$ averaged over x ∼ Ptk, all of which can be expressed through resolvents of a coupled Gram operator.
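The closed form and the associated resolvent observables are straightforward to evaluate at finite size; the sketch below assembles Φ̃ and Z̃ from per-time blocks (drawn at random here purely to exercise the algebra, in place of the actual feature map), forms Q = (U + ΛIp)−1, and reads off Â together with two of the normalized traces mentioned above.

```python
import numpy as np

rng = np.random.default_rng(5)
p, d, lam = 60, 20, 0.1
h = [1 - np.exp(-2 * t) for t in (0.05, 1.0)]
N = [80, 40]                                   # N_k = n * m_k at each time

# Stand-in feature and noise blocks Phi_k (N_k x p), Z_k (N_k x d).
Phi = [rng.standard_normal((Nk, p)) for Nk in N]
Z = [rng.standard_normal((Nk, d)) for Nk in N]

# Weighted stacked design/response and aggregate ridge Lambda = lam * sum_k h_k.
Phi_tilde = np.vstack([np.sqrt(hk / Nk) * Pk for hk, Nk, Pk in zip(h, N, Phi)])
Z_tilde = np.vstack([Zk / np.sqrt(Nk) for Nk, Zk in zip(N, Z)])
Lam = lam * sum(h)

U = Phi_tilde.T @ Phi_tilde                    # Gram operator on the feature side
Q = np.linalg.inv(U + Lam * np.eye(p))         # resolvent at z = -Lambda
A_hat = -np.sqrt(p) * Z_tilde.T @ Phi_tilde @ Q

# Normalized traces of the kind entering the deterministic limits.
print(np.trace(Q) / p, np.trace(Q @ U @ Q) / p)
```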

The feature vectors are of the form
$$ \phi_k^{(i,j)} = \varphi\!\left(u_k^{(i,j)},v_k\right), \qquad u_k^{(i,j)}:=\frac{W_x x_{t_k}^{(i,j)}}{\sqrt{d}}\in\mathbb{R}^p, \qquad v_k:=\frac{W_t\tau(t_k)}{\sqrt{d_t}}\in\mathbb{R}^p, $$
with φ applied elementwise. By construction, for each coordinate α ∈ {1, …, p}, the pair (uk, α(i, j), vk, α) is jointly Gaussian conditional on (xtk(i, j), τ(tk)), and its covariance structure is explicit. Under our proportional asymptotics and the subspace-Gaussian assumption P0 = 𝒩(0, Π), we have concentration of ∥xtk∥22/d and explicit limits for cross-time inner products; in particular, for independent draws x0, x0′,
$$ \frac{1}{d}\,\mathbb{E}\bigl[x_{t_k}^\top x_{t_\ell}\bigr] = a_k a_\ell\,\frac{1}{d}\,\mathbb{E}\bigl[x_0^\top x_0'\bigr] = 0, \qquad \frac{1}{d}\,\mathbb{E}\bigl[\|x_{t_k}\|_2^2\bigr] = a_k^2\psi_D+h_k, $$
while for the underlying datum xi the common xi induces a nontrivial correlation between xtk(i, j) and xtℓ(i, j′) through akaℓxi. This is the source of coupling across times even when noises are independent across k.

We now invoke a Gaussian equivalence principle tailored to random features with Hermite-expandable activations: we replace {ϕk(i, j)} by a Gaussian surrogate {ϕk, G(i, j)} such that (a) each coordinate has matching first and second moments under the joint law of (uk(i, j), vk), and (b) cross-time correlations induced by the shared random weights (Wx, Wt) and time embedding inner products ⟨τ(tk), τ(tℓ)⟩ are preserved. Concretely, for each time k we define activation-dependent coefficients
μ0, k := 𝔼[φ(g, ηk)],   μ1, k := 𝔼[gφ(g, ηk)],   vk := 𝔼[φ(g, ηk)2] − μ0, k2 − μ1, k2,
where (g, ηk) is a centered Gaussian pair with covariance chosen to match the limiting variance of uk, α(i, j) and the limiting variance of vk, α. Likewise, for each pair (k, ℓ) we define cross-time coefficients
κkℓ := 𝔼[φ(gk, ηk)φ(gℓ, ηℓ)],   γkℓ := 𝔼[gkφ(gℓ, ηℓ)],
with (gk, gℓ, ηk, ηℓ) jointly Gaussian and with correlations determined by the OU overlap akaℓ on the data subspace and the embedding overlap ⟨τ(tk), τ(tℓ)⟩/dt. The GEP then asserts that, for the purpose of computing normalized traces appearing in the learning curves, we may work with a Gaussian design whose row/column covariance is explicitly parametrized by {μ0, k, μ1, k, vk, κkℓ, γkℓ} (and by the finite mk replication structure), with an error that vanishes as d, n, p → ∞ at fixed ratios. This yields a Gaussian surrogate design in which the only nontrivial dependence on (tk, mk, τ, φ) enters through a finite collection of scalars.
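The activation coefficients entering the surrogate are low-dimensional Gaussian expectations and can be estimated by Monte Carlo (or Gauss–Hermite quadrature); the sketch below does so for the illustrative activation tanh(u + v), assuming an independent (g, η) pair and a normalized embedding, which are simplifying assumptions of the sketch rather than of the analysis.

```python
import numpy as np

def gep_coefficients(phi, var_u, var_v, n_mc=1_000_000, seed=6):
    # Monte Carlo estimates of mu0 = E[phi(g, eta)], mu1 = E[g * phi(g, eta)],
    # v = E[phi(g, eta)^2] - mu0^2 - mu1^2, for a centered Gaussian pair with
    # g ~ N(0, var_u) and eta ~ N(0, var_v) drawn independently (an assumption
    # of this sketch; correlated pairs are handled the same way with a joint draw).
    rng = np.random.default_rng(seed)
    g = np.sqrt(var_u) * rng.standard_normal(n_mc)
    eta = np.sqrt(var_v) * rng.standard_normal(n_mc)
    f = phi(g, eta)
    mu0 = f.mean()
    mu1 = (g * f).mean()
    v = (f ** 2).mean() - mu0 ** 2 - mu1 ** 2
    return mu0, mu1, v

phi = lambda u, w: np.tanh(u + w)              # illustrative activation
t, psi_D = 0.05, 0.25
a, h = np.exp(-t), 1.0 - np.exp(-2.0 * t)
var_u = a ** 2 * psi_D + h     # limiting variance of (W_x x_{t_k})_alpha / sqrt(d)
var_v = 1.0                    # assumes ||tau(t)||^2 / d_t -> 1 (normalized embedding)
print(gep_coefficients(phi, var_u, var_v))
```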

Under the Gaussian surrogate, all quantities of interest are functions of the weighted Gram matrix
U := Φ̃⊤Φ̃ ∈ ℝp × p,   R(z) := (U − zIp)−1,
evaluated at z = −Λ and at small perturbations thereof (to access derivatives entering variance terms). The multi-time aspect is not encoded by block structure in U (since A is shared), but rather by the fact that U itself is built from a weighted stack of time-indexed feature blocks with nontrivial cross-time correlations; equivalently, U can be represented as a sum of finitely many independent Wishart-type components plus low-rank terms reflecting the shared xi and shared time features. This is precisely the regime in which standard resolvent methods apply: for fixed K, the limiting learning curves are determined by a finite-dimensional system of equations for a finite set of block traces associated with an augmented resolvent.

To make this closure explicit, we linearize the rational functions of U that arise in the error formulas. The basic device is a block linear pencil LK(z, q) (depending on spectral parameter z and an auxiliary scalar perturbation q) whose inverse contains, as Schur complements, the resolvents and mixed resolvents needed for train/test risk. In its minimal form, LK(z, q) is a block matrix with O(K) blocks that couples (i) the parameter-side resolvent (U − zIp)−1, (ii) the sample/time-side covariance operators induced by the Gaussian surrogate, and (iii) the low-rank signal directions associated with the data subspace Π and the time embedding overlaps. The defining property is that for appropriate block selection matrices E, we have identities of the form
E⊤LK(z, q)−1E = (U(z, q) − zIp)−1,
where U(z, q) is the perturbed Gram operator whose q-derivative generates the quadratic trace functionals appearing in Etrain(tk) and Etest(tk). Consequently, it suffices to control normalized traces of blocks of LK(z, q)−1. By standard Dyson-equation arguments for block-Gaussian linearizations, these block traces converge to deterministic limits characterized by a closed algebraic fixed-point system whose dimension is polynomial in K (indeed O(K2) in a symmetric parametrization).

This reduction isolates the genuinely multi-time difficulty: the fixed-point system depends on cross-time correlations computed from (ak, hk), the noise budgets mk, and the embedding overlaps ⟨τ(tk), τ(tℓ)⟩. In the next section we instantiate this scheme for K = 2, where the pencil becomes 2 × 2 at the level of time blocks, the fixed-point system is fully explicit, and the effective threshold ψeff emerges as the point at which the relevant resolvent denominator approaches its interpolation singularity for the small-noise time.


5. The K=2 case in full detail: explicit 2×2 block resolvent equations; closed-form characterization of limiting train/test errors; definition and computation of ψ_eff; recovery of the p ≈ n boundary under time-decoupling.

We now specialize to K = 2 with a small-noise time ts and a larger-noise time tℓ, writing
as = e−ts,  hs = 1 − e−2ts,   aℓ = e−tℓ,  hℓ = 1 − e−2tℓ,
and Ns := nms, Nℓ := nmℓ. Under the Gaussian surrogate described previously, the stacked weighted design Φ̃ has a row covariance (on the sample/time side) which is exchangeable across noise indices j at each fixed time and exchangeable across data indices i. Consequently, the companion resolvent on the sample/time side is asymptotically supported on the 2-dimensional span generated by the noise-averaging directions within each time block, and all train/test observables close on a 2 × 2 matrix of block traces.

Let μ1, s, vs and μ1, ℓ, vℓ be the (time-dependent) Hermite/moment coefficients from the surrogate, and let ρτ := limdt → ∞⟨τ(ts), τ(tℓ)⟩/dt denote the normalized embedding overlap. The only additional scalar needed is the OU-induced overlap through the shared datum xi, which at the level of normalized inner products concentrates to
$$ \lim_{d\to\infty}\frac{1}{d}\,(x_{t_s}^{(i,j)})^\top x_{t_\ell}^{(i,j')} = a_s a_\ell\,\psi_D, \qquad (j,j'\ \text{arbitrary}). $$
Collecting the exchangeable (noise-idiosyncratic) contribution and the shared-datum contribution yields an effective 2 × 2 covariance matrix C with entries css, csℓ, cℓℓ: the diagonal entries carry the within-time idiosyncratic contributions vs/ms and vℓ/mℓ, while the off-diagonal entry csℓ collects the shared-datum overlap μ1, sμ1, ℓ asaℓψD together with the embedding overlap ρτ.

(We emphasize that C is the minimal parametrization after averaging over the noise-exchangeable directions; more refined surrogate parametrizations merely alter these entries by additional explicit terms depending on (κkℓ, γkℓ) and do not change the 2 × 2 closure.)

Let U = Φ̃⊤Φ̃ ∈ ℝp × p and R(z) = (U − zIp)−1. We introduce the scalar Stieltjes transform
$$ g(z):=\lim_{d\to\infty}\frac{1}{p}\mathrm{Tr}\,R(z), $$
and the 2 × 2 matrix-valued Stieltjes transform G(z) obtained by taking normalized traces of the corresponding 2 × 2 blocks of the pencil inverse (equivalently, of the companion resolvent on the sample/time side restricted to the span of noise-averaging directions). The block Gaussian linearization yields a closed system: the matrix relation
$$ \mathbf{G}(z)\;=\;\bigl(-zI_2+\psi_p\,g(z)\,C\bigr)^{-1}, $$
coupled with a companion scalar equation that determines g(z) from the block traces of the same pencil inverse.

These relations are the explicit 2 × 2 analogue of the single-time scalar fixed point: the entire two-time coupling enters only through C and the 2 × 2 matrix inversion defining G(z). In particular, writing
Δ(z) := det ( − zI2 + ψpg(z)C),
we have the closed expressions
$$ [\mathbf{G}(z)]_{ss}=\frac{-z+\psi_p\,g(z)\,c_{\ell\ell}}{\Delta(z)},\qquad [\mathbf{G}(z)]_{\ell\ell}=\frac{-z+\psi_p\,g(z)\,c_{ss}}{\Delta(z)},\qquad [\mathbf{G}(z)]_{s\ell}=[\mathbf{G}(z)]_{\ell s}=-\frac{\psi_p\,g(z)\,c_{s\ell}}{\Delta(z)}. $$
Fix k ∈ {s, ℓ} and consider either the per-time training error or the per-time test error defined earlier (as a normalized quadratic functional of the fitted score). By the ridge identities from the stacked form and the GEP reduction, every such quantity can be written as a linear combination of the following finite set of resolvent observables evaluated at z = −Λ:
g(−Λ),   [G(−Λ)]ss, [G(−Λ)]ℓℓ, [G(−Λ)]sℓ,   ∂zg(z)|z = −Λ, ∂zG(z)|z = −Λ.
To make the dependence explicit, we use the standard perturbation device: for each time k, introduce a scalar q multiplying the corresponding block in the pencil (equivalently, $U\mapsto U(q):=U+q\,\frac{h_k}{N_k}\Phi_k^\top\Phi_k$), so that
$$ \frac{\partial}{\partial q}\Big|_{q=0}\frac{1}{p}\mathrm{Tr}\,(U(q)+\Lambda I_p)^{-1} = -\frac{1}{p}\mathrm{Tr}\Bigl(Q\,\frac{h_k}{N_k}\Phi_k^\top\Phi_k\,Q\Bigr), \qquad Q:=(U+\Lambda I_p)^{-1}. $$
Under the 2 × 2 closure, this derivative is obtained by differentiating the fixed-point relations with C replaced by C + q ckk ek ek⊤, hence it is an explicit rational function of the entries of C and of G(−Λ). Substituting these derivatives into the ridge expressions for Etrain(tk) and Etest(tk) yields deterministic limits of the schematic form
$$ E_{\mathrm{train}}(t_k)\xrightarrow[d\to\infty]{}\epsilon_{\mathrm{train}}(t_k)=\mathcal{F}^{\mathrm{tr}}_{k}\bigl(g(-\Lambda),\mathbf{G}(-\Lambda),\partial_z g,\partial_z\mathbf{G}\bigr),\qquad E_{\mathrm{test}}(t_k)\xrightarrow[d\to\infty]{}\epsilon_{\mathrm{test}}(t_k)=\mathcal{F}^{\mathrm{te}}_{k}\bigl(g(-\Lambda),\mathbf{G}(-\Lambda),\partial_z g,\partial_z\mathbf{G}\bigr), $$
where ℱktr, ℱkte are explicit (activation- and OU-dependent) rational functions determined by the Gaussian surrogate coefficients. Importantly, no additional random-matrix degrees of freedom remain: once the fixed-point system is solved numerically at z = −Λ, the limiting errors are obtained by direct evaluation together with a 2 × 2 differentiation step.

In the small-noise regime hs ≪ 1, the sharp change in ϵtest(ts) is governed by the amplification of the variance-type terms in the limiting test error, which are controlled by the magnitude of [G(−Λ)]ss and its derivatives. From the closed expressions above, these terms blow up precisely when the denominator
$$ \Delta(-\Lambda)\;=\;\det\bigl(\Lambda I_2+\psi_p\,g(-\Lambda)\,C\bigr) $$
approaches zero. We therefore define the effective sample-size ratio ψeff as the smallest ψp (at fixed (ts, tℓ, ms, mℓ, ρτ, φ, ψn, ψD, λ)) for which the coupled fixed point satisfies the criticality condition
$$ |\Delta(-\Lambda)|\;\le\;\varepsilon \qquad\text{for a fixed small tolerance }\varepsilon>0, $$
with ψeff computed in practice by monitoring the sign change (or minimal absolute value) of Δ(−Λ) along ψp after solving the coupled equations. This characterization matches the observed crossover: as ψp crosses ψeff, the resolvent entries develop a large factor 1/Δ(−Λ), and the estimator transitions toward the empirical-optimal (KDE-like) score component on the data manifold.

Finally, suppose the times decouple in the sense that the cross-time feature correlation vanishes, which in our parametrization is csℓ = 0 (e.g. ρτ = 0, or more generally orthogonal time embeddings making the cross-time Hermite contractions vanish). Then C is diagonal and the fixed point splits into two independent scalar relations. In particular,
Δ(−Λ) = (Λ + ψpg(−Λ)css)(Λ + ψpg(−Λ)cℓℓ),
so the small-noise task depends only on the first factor, and the criticality condition reduces to the single-time interpolation threshold. In this limit we recover ψeff = ψn (equivalently p ≃ n) for the small-noise time, in agreement with the uncoupled theory and providing the required consistency check for the coupled two-time analysis.


6. General finite K: operator-/block-valued fixed-point equations; conditions under which ψ_eff exists and is unique; qualitative monotonicity in m_k and in time-feature correlation.

For fixed K we form the stacked weighted design
$$ \widetilde{\Phi} = \begin{pmatrix} \sqrt{\frac{h_1}{N_1}}\Phi_1\\ \vdots\\ \sqrt{\frac{h_K}{N_K}}\Phi_K \end{pmatrix}, \qquad N_k:=nm_k, $$
so that the (feature-side) Gram matrix is U = Φ̃⊤Φ̃ ∈ ℝp × p. Under the Gaussian surrogate, the row covariance on the sample/time side is exchangeable within each time block (over the mk noises) and exchangeable across data indices i. As in the two-time analysis, this implies that the companion resolvent on the row side is asymptotically supported on the K-dimensional span of the K noise-averaging directions (one per time block). Consequently, all limiting train/test observables close on a collection of block traces which we organize into a K × K matrix-valued Stieltjes transform.

Concretely, for z ∈ ℂ \ ℝ+ we define
$$ g(z):=\lim_{d\to\infty}\frac{1}{p}\mathrm{Tr}\,(U-zI_p)^{-1}, \qquad \mathbf{G}(z):=\lim_{d\to\infty}\Bigl[\text{normalized block-trace of the row-resolvent restricted to the averaging span}\Bigr]\in\mathbb{C}^{K\times K}. $$
The only model-dependent input at this level is the effective K × K time-covariance C induced by (i) the OU overlap through the shared datum and (ii) the time-embedding overlap through τ(⋅), together with (iii) the within-time noise idiosyncrasy controlled by mk. In the minimal Hermite truncation used throughout (keeping only the first-order contraction in the data input and the second-moment correction), the entries ckℓ combine the shared-datum overlap μ1, kμ1, ℓ akaℓψD with the embedding overlap ρτ, kℓ, and the diagonal entries ckk additionally carry the idiosyncratic contribution vk/mk, where ρτ, kℓ := limdt → ∞⟨τ(tk), τ(tℓ)⟩/dt. More refined surrogates (including additional Hermite contractions between time and data channels) simply modify these entries by explicit corrections; the structural point is that all such corrections remain deterministic and yield a symmetric matrix C depending only on ({tk, mk}, τ, φ, ψD).

After building the K-time linear pencil and applying the block-trace map, we obtain an operator-/block-valued Dyson system. In its simplest (single-self-energy) form, it takes the same schematic shape as in the 2 × 2 case,
$$ \mathbf{G}(z)\;=\;\bigl(-zI_K+\psi_p\,g(z)\,C\bigr)^{-1}, $$
while in the fully general finite-K surrogate the right-hand side of this closure is replaced by an explicit self-energy map 𝒮[G(z)] whose entries are polynomials in the K2 block traces of the pencil inverse. In either case, the number of unknowns is O(K2) and therefore constant in the proportional limit. Differentiating the fixed-point system with respect to z and with respect to block perturbations (the q-device used previously) yields the additional resolvent derivatives needed to evaluate each ϵtrain(tk) and ϵtest(tk) as explicit rational functions of these block traces.

We emphasize that for ℑz > 0 the system obtained from the pencil is a standard matrix Dyson equation (MDE) in the sense of operator-valued free probability: it is a self-consistent equation for a matrix-valued Stieltjes transform. Under our standing assumptions (bounded second moments of the activation under Gaussian inputs, fixed K, and ridge λ > 0 so that we ultimately evaluate at z = −Λ with Λ > 0), the MDE admits a unique solution (g(z), G(z)) satisfying the Stieltjes sign conditions
ℑg(z) > 0,   ℑG(z) ≺ 0  (ℑz > 0),
and this solution extends continuously to the negative real axis. In particular, the evaluation at z = −Λ is well-defined and the fixed point is locally stable, with stability quantified by the invertibility of the linearization (Jacobian) of the map defining the fixed point. This stability operator is precisely the object that controls the blow-up of the derivatives ∂zG and ∂qG, and therefore controls the steepness of the learning curves in the small-noise regime.

Let S ⊂ {1, …, K} denote the subset of small-noise times (those with hk ≪ 1). For k ∈ S, the dominant contributions to ϵtest(tk) are variance-type terms whose prefactors involve entries of G(−Λ) and, more sharply, the sensitivity of the fixed point through ∂zG(−Λ) and the block-perturbation derivatives. We therefore define the effective sample-size ratio ψeff by a stability threshold:
$$ \psi_{\mathrm{eff}}\;:=\;\inf\Bigl\{\psi_p>0\;:\;\min_{k\in S}\,\sigma_{\min}\bigl(\mathcal{J}_k(\psi_p;-\Lambda)\bigr)\le\varepsilon\Bigr\}, $$
where 𝒥k(ψp; −Λ) denotes the Jacobian (restricted to perturbations that act on the kth time block) of the MDE at z = −Λ, σmin is its smallest singular value, and ε > 0 is a fixed small tolerance (vanishing with Λ if we take a ridgeless limit). In the simplified closure above, the same criterion may be monitored equivalently via the minimal eigenvalue of the K × K denominator
𝒟(ψp; z) := −zIK + ψpg(z)C,
since G(z) = 𝒟(ψp; z)−1 and the linearization is dominated by 𝒟−1 whenever the self-energy is low-rank.
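For orientation, the simplified closure is a one-line computation once g(−Λ) is available; the sketch below evaluates 𝒟(ψp; −Λ) and G = 𝒟−1 and reports the stability margin λmin(𝒟), treating the value of g(−Λ) and the entries of C as externally supplied placeholders rather than outputs of the actual fixed-point solve.

```python
import numpy as np

def simplified_closure(psi_p, g_val, C, Lam):
    # Evaluate the simplified K x K closure at z = -Lambda:
    #   D(psi_p; -Lambda) = Lambda * I_K + psi_p * g(-Lambda) * C,  G = D^{-1}.
    # g_val stands for g(-Lambda), assumed already obtained from the scalar
    # fixed-point equation (not reproduced in this sketch).
    K = C.shape[0]
    D = Lam * np.eye(K) + psi_p * g_val * C
    G = np.linalg.inv(D)
    stability = np.linalg.eigvalsh(D).min()   # margin monitored along a psi_p grid
    return G, stability

# Illustrative inputs: a 3-time covariance C and a placeholder value for g(-Lambda).
C = np.array([[1.0, 0.3, 0.1],
              [0.3, 2.0, 0.5],
              [0.1, 0.5, 3.0]])
G, margin = simplified_closure(psi_p=2.0, g_val=0.4, C=C, Lam=1e-3)
print(np.round(G, 3), margin)
```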

A sufficient condition for existence of a nontrivial ψeff is that there is at least one small-noise time k ∈ S for which the corresponding direction in the averaging span is not orthogonal (through C) to the remaining tasks; otherwise the kth task is effectively decoupled and its threshold reduces to the single-time boundary. A sufficient condition for uniqueness is that the stability margin in the threshold criterion is strictly decreasing in ψp. This holds, for example, whenever the scalar order parameter
θ(ψp) := ψpg(−Λ)
is strictly increasing and C is positive definite on the span relevant to S; then λmin(𝒟(ψp; −Λ)) varies monotonically and crosses any fixed tolerance at most once. In practice, for fixed K we simply solve the MDE along a grid in ψp and observe that the stability metric is unimodal in all regimes we consider, yielding an unambiguous ψeff.

The dependence of ψeff on the noise budgets and correlations is transparent at the level of C. First, increasing mk reduces the idiosyncratic within-time contribution vk/mk in ckk, hence decreases ckk while leaving the cross-time terms unchanged. Since the small-noise instability is triggered by approaching a poorly conditioned regime in the averaging span, reducing diagonal mass in a small-noise block requires a larger ψp to reach the same stability tolerance. Thus, all else fixed, ψeff is (weakly) increasing in each mk for k ∈ S, matching the interpretation that additional noise draws impose more effective constraints and push the memorization transition to larger parameter ratios.

Second, increasing time-feature correlation (e.g. increasing |ρτ, kℓ| for pairs involving S) increases the magnitude of the off-diagonal couplings ckℓ and therefore enlarges the extent to which the large-noise tasks constrain the small-noise estimator through the shared parameters. At the level of the MDE, stronger coupling reduces the stability margin in the small-noise directions and therefore decreases ψeff (the transition occurs earlier in ψp). In particular, in the time-decoupled limit ρτ, kℓ → 0 for k ≠ ℓ, the matrix C becomes diagonal, the MDE splits into K independent scalar equations, and the small-noise boundary reduces to the single-time threshold.


7. Phase boundary and (near-)matching bounds: sufficient conditions (upper) for generalization when p < n_eff and unavoidable alignment (lower) when p > n_eff within the shared linear-in-parameters model class.

We now formalize the sense in which the stability threshold defining ψeff is not merely a diagnostic of steep learning curves, but indeed marks the boundary between (i) a regime in which the shared estimator behaves like a population-score learner on the small-noise tasks and (ii) a regime in which, within the shared linear-in-parameters class, the estimator is forced to align with the empirical optimal score and thereby incur a non-vanishing small-noise test error.

Fix λ > 0 and recall that under the Gaussian surrogate the ridge minimizer admits the usual closed form
$$ \widehat{A} = \frac{1}{\sqrt{p}}\Bigl(\sum_{k=1}^K \frac{h_k}{N_k} Y_k \Phi_k^\top\Bigr)\Bigl(U+\lambda I_p\Bigr)^{-1}, \qquad U=\sum_{k=1}^K \frac{h_k}{N_k}\Phi_k\Phi_k^\top, $$
where Yk ∈ ℝd × Nk stacks the regression targets $-\frac{1}{\sqrt{h_k}}z_k^{(i,j)}$ and Φk ∈ ℝp × Nk stacks the corresponding random features at time tk.
Consequently, for any fixed time index k, the prediction sÂ(⋅, tk) is a linear functional of the targets {Yℓ}ℓ = 1K through the shared resolvent (U + λIp)−1. This induces a decomposition of the per-time test error into a squared-bias term (measuring mismatch to the population score component representable by the feature map) and a variance term (measuring amplification of the label noise through the resolvent). The small-noise regime hk ≪ 1 is precisely the regime in which the label noise has a large effective magnitude after the DSM reweighting, and therefore the variance term is the quantity whose control (or blow-up) determines the qualitative behavior.

Let S ⊂ {1, …, K} denote the set of small-noise times. Fix δ > 0 and suppose that for some ψp the corresponding MDE solution satisfies a stability margin condition of the form min_{k ∈ S} σmin(𝒥k(ψp; −Λ)) ≥ cδ, with cδ independent of d and bounded away from 0 as d → ∞. By the implicit function theorem applied to the finite-dimensional fixed-point system defining the block traces, this margin implies that all required derivatives of the limiting resolvent observables (in particular the z-derivatives and the block-perturbation derivatives that enter the variance terms of ϵtest(tk)) remain uniformly bounded. Translating this into the learning-curve expressions yields, for each k ∈ S, a bound in which ϵtest(tk) is controlled by a feature-mismatch (approximation) term plus residual terms carrying a prefactor Cδ < ∞ that depends on cδ but not on d. In particular, when the first-order Hermite contraction dominates (the regime in which our surrogate is accurate), the feature-mismatch term is controlled by the deviation between the representable linear score component and the true population score, and the remaining terms vanish as mk → ∞ and/or λ ↓ 0 along any path that preserves the margin. Thus, for ψp strictly below the stability threshold (equivalently ψp ≤ ψeff − δ for the definition of ψeff based on the first failure of the margin condition), the small-noise tasks remain in a population-aligned regime: the shared estimator does not amplify the DSM label noise, and the test error stays close to the Bayes score error up to the approximation error of the linearized model class.

The preceding argument is qualitative but sharp in its mechanism: below ψeff, the MDE solution varies smoothly with ψp and the block-resolvent does not develop a near-singularity in the small-noise directions. In this regime the large-noise tasks (those with hℓ = Θ(1)) contribute additional effective regularization through the shared parameters, improving conditioning in the averaging span and thereby suppressing the variance amplification that would otherwise occur at small noise. The resulting estimator is therefore constrained to track a population-level object rather than an empirical, point-mass object.

We turn to the complementary regime ψp ≥ ψeff + δ. The fixed-point characterization implies that in this regime the stability operator (or, in simplified closures, the denominator 𝒟(ψp; −Λ)) becomes poorly conditioned in at least one small-noise direction. Operationally, this corresponds to an almost-interpolation phenomenon on the small-noise time blocks: the mapping from targets to predictions has an operator norm that grows like the inverse stability margin, and for λ ↓ 0 this growth forces the ridge solution to fit the empirical constraints increasingly tightly.

To make this precise within the shared linear-in-parameters class, fix a small-noise time k ∈ S with mk = ∞ (so the within-time idiosyncratic fluctuation in the design has been averaged out) and consider a sequence λ = λd ↓ 0. In the overparameterized regime where the stability margin collapses, the representer form of the ridge solution together with standard random-features interpolation arguments imply that
$$ \mathrm{Mem}(t_k)\;=\;\frac{h_k}{d}\,\mathbb{E}_{x\sim P_{t_k}^{\mathrm{emp}}}\Bigl[\bigl\|s_{\widehat{A}}(x,t_k)-s^{\mathrm{emp}}(x,t_k)\bigr\|_2^2\Bigr]\;\longrightarrow\;0, $$
i.e. the learned score converges to the empirical optimal score (the score of the empirical mixture induced by the training samples under the forward OU perturbation). The point is not that the model explicitly computes a KDE, but that once the shared design reaches the near-interpolating regime, the minimum-norm (or vanishing-ridge) solution is the unique linear functional that matches the empirical constraints, and these constraints coincide with those defining semp under the DSM loss.

The alignment implies a lower bound on the population-score error because, for hk ≪ 1, the empirical score differs from the population score by an O(1) amount on the data manifold: the empirical mixture retains atomistic structure at the scale $\sqrt{h_k}$, whereas the population Ptk retains the smoothed geometry of P0. Quantitatively, one can show (by comparing the empirical and population log-densities, or by standard lower bounds for KDE score estimation at fixed bandwidth) that there exists c(δ) > 0 such that
$$ \frac{h_k}{d}\,\mathbb{E}_{x\sim P_{t_k}}\Bigl[\bigl\|s_{\widehat{A}}(x,t_k)-s^{*}(x,t_k)\bigr\|_2^2\Bigr]\;\ge\;c(\delta), $$
uniformly over all empirical risk minimizers in the shared class as long as the near-interpolation condition (equivalently the stability collapse defining ψeff) holds. This yields a model-class lower bound of the type stated in Proposition 4: once ψp exceeds ψeff by a fixed margin, the shared estimator cannot remain close to the Bayes score on the small-noise tasks.

The upper and lower mechanisms are governed by the same object: the conditioning of the MDE fixed point (or of 𝒟(ψp; −Λ) in simplified closures) restricted to the small-noise directions. Below ψeff this conditioning is bounded and enforces population-score behavior; above ψeff it collapses and forces empirical-score alignment. The bounds therefore match at the level of a common instability criterion, up to the tolerance used in the definition of ψeff and the choice of the ridgeless path λ ↓ 0. This is the sense in which ψeff is a genuine phase boundary for generalization versus memorization in the small-noise tasks under shared random-feature parametrization.


8. Experiments: NTK-linearized U-Net/DiT and finite-width checks; measuring effective sample size and predicting threshold; ablations over time grid, embedding dimension, and per-time noise reuse; link to nearest-neighbor retrieval under reverse diffusion.

We complement the asymptotic characterization with empirical checks in two regimes: (i) a controlled, NTK-linearized regime, where training is well-approximated by kernel ridge regression with a fixed NTK and hence is directly comparable to the Gaussian-surrogate predictions; and (ii) a finite-width regime, where feature learning is present but we can still probe whether the same effective-threshold phenomenology persists. Across both regimes, the target observation is the existence of a sharp crossover, visible primarily at small-noise times, as a function of the parameter ratio ψp = p/d (or an empirical proxy for p in convolutional architectures), and the dependence of that crossover on the time grid, time embedding, and per-time noise budgets.

Let fθ(x, t) ∈ ℝd denote either a U-Net-like or DiT-like score network (operating in pixel space or in a latent space), trained with the same discrete-time DSM objective used in the theory. We linearize around an initialization θ0:
fθ(x, t) ≈ fθ0(x, t) + Jθ0(x, t)(θ − θ0),
and we train only the linear parameter Δθ := θ − θ0 by ridge-regularized least squares. This induces a block kernel matrix over the index set of examples and time points, with blocks determined by the time-conditioned NTK
K((x, t), (x′, t′)) := Jθ0(x, t) Jθ0(x′, t′)⊤.
Because the same Δθ is shared across all times, the training problem is intrinsically coupled even when the loss is a sum over times. In particular, the per-time predictor is a linear functional of all targets across all times through the inverse of a coupled Gram matrix, which is the empirical analogue of the coupled resolvent appearing in the fixed-point description.
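The sketch below illustrates the linearization mechanics on a toy two-layer network: it computes finite-difference Jacobians with respect to the parameters at initialization and forms one d × d block of the time-conditioned NTK; the architecture, embedding, and sizes are stand-ins, not the U-Net/DiT used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(7)
d, d_t, width = 8, 4, 32

# Toy two-layer score network f_theta(x, t) in R^d, with the time embedding
# concatenated to the input; only the linearization mechanics matter here.
A1 = 0.1 * rng.standard_normal((width, d + d_t))
A2 = 0.1 * rng.standard_normal((d, width))
theta0 = np.concatenate([A1.ravel(), A2.ravel()])

def tau(t):
    k = np.arange(1, d_t // 2 + 1)
    return np.concatenate([np.sin(k * t), np.cos(k * t)])

def f(theta, x, t):
    B1 = theta[:width * (d + d_t)].reshape(width, d + d_t)
    B2 = theta[width * (d + d_t):].reshape(d, width)
    return B2 @ np.tanh(B1 @ np.concatenate([x, tau(t)]))

def jacobian(x, t, eps=1e-5):
    # Finite-difference Jacobian J_{theta0}(x, t) of shape (d, |theta|).
    J = np.zeros((d, theta0.size))
    for a in range(theta0.size):
        e = np.zeros_like(theta0)
        e[a] = eps
        J[:, a] = (f(theta0 + e, x, t) - f(theta0 - e, x, t)) / (2 * eps)
    return J

# One d x d block of the time-conditioned NTK between two (input, time) pairs.
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
K_block = jacobian(x1, 0.05) @ jacobian(x2, 1.0).T
print(K_block.shape)   # -> (d, d)
```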

To align the experimental setting with the assumptions, we first take P0 = 𝒩(0, Π) with rank(Π) = D and choose (d, D, n) at moderate scale with ratios near the proportional regime. For a fixed time grid {tk}k = 1K, we construct {xtk(i, j)} with prescribed mk, train the linearized model, and report (a) per-time DSM error Etest(tk), computed on fresh (x0, z) pairs not used in training; (b) per-time training error Etrain(tk); and (c) an aggregate error obtained by averaging over k with weights proportional to hk (to match the DSM scaling). The qualitative signature we seek is that, for the smallest-noise time(s), Etest(tk) remains near a baseline for smaller width and then increases rapidly past a critical width, while Etrain(tk) continues to decrease.

Operationally, we extract an empirical ψ̂eff from either of the following equivalent criteria (which agree within measurement noise in the linearized regime). First, we locate the point of maximal slope of the small-noise test curve as a function of width:
$$ \widehat{\psi}_{\mathrm{eff}} \in \arg\max_{\psi_p}\ \frac{\mathrm{d}}{\mathrm{d}\psi_p} E_{\mathrm{test}}(t_s). $$
Second, we compute an interpolation proxy from the coupled kernel matrix: letting Gλ be the ridge-regularized Gram operator on the joint dataset-time index set, we monitor the minimum eigenvalue of Gλ and define ψ̂eff as the width at which the effective condition number crosses a fixed threshold. In the linearized setting, both procedures are stable under resampling and produce a single transition point for the small-noise times, while large-noise times exhibit a smoother dependence on width.
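The second criterion is easy to emulate with random features standing in for Jacobian features: build the joint dataset-time feature matrix, form the ridge-regularized Gram operator, and track its minimum eigenvalue and condition number as the width p grows, as in the sketch below (the crude time conditioning and all sizes are illustrative).

```python
import numpy as np

rng = np.random.default_rng(8)
d, n, lam = 32, 48, 1e-4
times, m = [0.05, 1.0], [1, 1]
X0 = rng.standard_normal((n, d))

def joint_features(p):
    # Rows index all (k, i, j) pairs; a scalar random direction wt gives a crude
    # stand-in for time conditioning of the features.
    Wx = rng.standard_normal((p, d))
    wt = rng.standard_normal(p)
    rows = []
    for t, m_k in zip(times, m):
        a, h = np.exp(-t), 1.0 - np.exp(-2.0 * t)
        for _ in range(m_k):
            Xt = a * X0 + np.sqrt(h) * rng.standard_normal((n, d))
            rows.append(np.tanh(Xt @ Wx.T / np.sqrt(d) + t * wt))
    return np.vstack(rows) / np.sqrt(p)

# Conditioning of the ridge-regularized coupled Gram operator as width grows.
for p in [32, 64, 128, 256, 512]:
    F = joint_features(p)
    G_lam = F @ F.T + lam * np.eye(F.shape[0])
    w = np.linalg.eigvalsh(G_lam)
    print(f"p={p:4d}  lambda_min={w[0]:.3e}  cond={w[-1] / w[0]:.3e}")
```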

To compare against the theory without re-deriving architecture-specific Hermite coefficients, we use a semi-empirical plug-in approach: we measure the cross-time correlations induced by the time embedding and the linearized features by estimating
$$ \widehat{C}_{k\ell} := \frac{1}{p}\,\mathrm{Tr}\Bigl(\Phi_k\Phi_\ell^\top\Bigr), $$
where Φk denotes the feature matrix at time tk (either random features in the surrogate or Jacobian features in the NTK linearization). We then feed {Ĉkℓ}k, ℓ ≤ K, along with (ψn, ψD, λ, {hk, mk}), into the fixed-point solver to obtain a predicted phase boundary ψeffpred. On the synthetic model, we observe that ψeffpred tracks ψ̂eff across variations in K and in embedding design, and that the agreement improves as (i) width increases (improving the linearization) and (ii) mk increases (reducing within-time Monte Carlo noise in the design).
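The plug-in correlations are computed directly from paired feature rows; the sketch below estimates Ĉkℓ for a small random-features model with one noise draw per datum (pairing rows by the shared datum index), again with illustrative sizes and an illustrative feature map.

```python
import numpy as np

rng = np.random.default_rng(9)
d, n, p = 32, 64, 128
times = [0.05, 0.5, 1.5]

X0 = rng.standard_normal((n, d))
Wx = rng.standard_normal((p, d))
wt = rng.standard_normal(p)

def features(t):
    a, h = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    Xt = a * X0 + np.sqrt(h) * rng.standard_normal((n, d))  # one noise draw per datum
    return np.tanh(Xt @ Wx.T / np.sqrt(d) + t * wt)         # Phi_k, shape (n, p)

Phi = [features(t) for t in times]
K = len(times)
C_hat = np.zeros((K, K))
for k in range(K):
    for l in range(K):
        # (1/p) Tr(Phi_k Phi_l^T) pairs rows sharing the same datum index i.
        C_hat[k, l] = np.trace(Phi[k] @ Phi[l].T) / p
print(np.round(C_hat / n, 3))     # further normalized per sample for readability
```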

We vary the time grid in two directions while keeping the smallest time ts fixed. (i) Adding additional large-noise times (larger hk) tends to increase the critical width for small-noise memorization: empirically, Etest(ts) remains stable up to a larger width, consistent with the interpretation that additional times improve conditioning of the shared inverse problem and delay near-interpolation in the small-noise block. (ii) Adding additional small-noise times (multiple tk with hk ≪ 1) has the opposite effect unless mk is increased accordingly: the small-noise tasks collectively introduce more high-variance constraints, and the transition can occur at smaller width if the model can fit those constraints jointly. These trends are captured by the same plug-in prediction through changes in the measured cross-time correlation structure and in the effective sample weighting through {hk}.

We next vary the time-embedding map τ(t) by changing its dimension dt and by optionally orthogonalizing {τ(tk)}k = 1K in ℝdt. As dt increases and the embeddings become closer to orthogonal, the empirical cross-time feature correlations Ĉkℓ decrease, and the observed ψ̂eff moves toward the decoupled baseline consistent with ψeff ≈ ψn. Conversely, low-dimensional or highly correlated embeddings strengthen coupling and shift ψ̂eff away from ψn, in the direction predicted by the coupled fixed-point solver. This ablation isolates the architectural mechanism by which "sharing across times" can be tuned continuously: not only by sharing weights, but by controlling whether the time-conditioned features actually span distinct directions.

We finally vary {mk}, with particular emphasis on the smallest-noise time(s). For fixed n and width, increasing ms (new noise draws per datum at ts) reduces the variance of the empirical design and makes the training objective closer to its conditional expectation; correspondingly, the small-noise test curve becomes smoother and the transition point stabilizes. In contrast, reusing a single noise draw (small ms) increases stochasticity and can produce an ``early’’ appearance of memorization as diagnosed by nearest-neighbor phenomena (below), even when the averaged learning curve has not yet reached its steepest point. In the linearized regime, these effects are quantitatively explained by the same kernel perspective: mk controls the effective number of distinct constraints at time tk and hence the rank and conditioning of the coupled Gram matrix.

We repeat the above diagnostics with finite-width training (no linearization), keeping the optimizer and schedule fixed and scaling width over a range in which training remains stable. While the precise location of the transition drifts (consistent with feature learning modifying the effective kernel), the qualitative structure persists: small-noise test error exhibits a sharp change as width increases, and the direction of the shift under the three ablations (time grid, embedding correlation, and mk) matches the linearized predictions. The main systematic deviation is that feature learning tends to reduce the apparent coupling (empirically lowering Ĉkℓ at convergence relative to initialization), which moves the observed threshold toward the decoupled baseline; this is consistent with the expectation that learned representations can partially separate times even when the initial embedding is correlated.

To connect the small-noise alignment phenomenon to an observable in sampling, we run the reverse diffusion (or probability flow ODE) using the trained score and examine the terminal samples. In the regime where Etest(ts) has crossed the transition, the generated samples exhibit markedly reduced novelty under standard nearest-neighbor metrics: for each generated output 0, the distance to the closest training point decreases sharply, and the identity of the nearest neighbor stabilizes across small perturbations of the initial noise. This behavior is consistent with the interpretation that, at small noise, the learned vector field inherits an empirical-mixture structure and thereby induces an attraction toward training atoms along reverse-time trajectories. Below the transition, the same retrieval statistics are diffuse and do not concentrate on single training exemplars, matching the regime in which the small-noise test error remains near its population-score baseline.
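The retrieval diagnostic itself is a nearest-neighbor computation; the sketch below implements it and contrasts a diffuse sampler with a near-copy sampler on synthetic data to show how the two regimes separate (no reverse diffusion is run here; the generated sets are mocked for illustration).

```python
import numpy as np

def retrieval_stats(generated, train_X):
    # For each generated sample, the distance to (and index of) its closest
    # training point, via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    d2 = (np.sum(generated**2, axis=1, keepdims=True)
          - 2.0 * generated @ train_X.T
          + np.sum(train_X**2, axis=1))
    nn_idx = d2.argmin(axis=1)
    nn_dist = np.sqrt(np.maximum(d2[np.arange(len(generated)), nn_idx], 0.0))
    return nn_dist, nn_idx

rng = np.random.default_rng(10)
train_X = rng.standard_normal((200, 16))
diffuse = rng.standard_normal((50, 16))                       # non-memorizing proxy
copies = train_X[rng.integers(0, 200, size=50)] \
         + 0.01 * rng.standard_normal((50, 16))               # near-copy proxy
for name, gen in [("diffuse", diffuse), ("near-copy", copies)]:
    dist, idx = retrieval_stats(gen, train_X)
    print(name, float(dist.mean()), len(np.unique(idx)))
```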


9. Discussion: implications for training schedules and architecture design; connections to early stopping, guidance, and latent diffusion; open problems (continuous time, feature learning, non-Gaussian data).

Our asymptotic characterization isolates a single scalar order parameter, the effective sample-size ratio ψeff, which governs whether small-noise times behave as a population-score regression problem or collapse toward the empirical optimal score. We emphasize that ψeff is not merely a reparametrization of ψn: it is a functional of the entire coupled training specification (time grid, embedding correlations, and per-time noise budgets), and it is therefore a quantity one can in principle control through design choices. The most immediate implication is that, once one trains with shared parameters across times, it is generally insufficient to reason about ``capacity’’ via a single-time interpolation threshold p ≈ n; rather, the relevant threshold for memorization at a given small-noise time is p ≈ dψeff, and ψeff may shift substantially as we change the remainder of the schedule.

In practice, diffusion schedules place substantial emphasis on small-noise times, both because those times are used heavily during reverse-time integration and because the DSM target $-z/\sqrt{h_k}$ grows in magnitude as hk → 0 in our normalization. Our theory suggests a tension: small-noise constraints are the most informative for refining samples, but they are also the constraints whose empirical counterpart is most sensitive to finite-sample effects (the empirical score field resembles a mixture with sharp local structure). In the coupled multi-time problem, adding additional times can either stabilize or destabilize the small-noise task depending on whether the added times increase the effective conditioning of the shared inverse problem or instead introduce additional near-singular constraints. This motivates a schedule-level principle: when we include multiple small-noise points (multiple k with hk ≪ 1), we should anticipate that ψeff decreases unless we correspondingly increase the effective constraints via n or mk, or weaken coupling through the time embedding. Conversely, adding moderate-to-large noise points (with hk = Θ(1)) can increase ψeff and delay the small-noise transition, because those times provide constraints that are less sensitive to the data manifold and thus regularize the shared fit.

A complementary lever is explicit loss reweighting across k. While our objective uses the canonical DSM scaling, one may introduce weights {wk} and consider $\sum_k w_k \| \sqrt{h_k}s(\cdot,t_k)+z\|^2$. Within the Gaussian-surrogate viewpoint, this rescales the per-time block contributions to the coupled Gram operator and thereby shifts ψeff. In particular, downweighting the smallest hk effectively reduces the influence of the most sample-sensitive constraints and can be interpreted as increasing ψeff for the small-noise task (i.e., requiring larger width before memorization-like behavior appears). This provides a quantitative lens on heuristics that ``avoid overfitting the last timesteps’’: such heuristics are not merely stabilizing optimization, but can be understood as modifying the effective interpolation geometry.
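A Monte Carlo estimator of this reweighted objective is sketched below (illustrative signature; score_fn stands in for the shared model s(·, t), and the averaging over noise draws mirrors the per-time budget m).

```python
import numpy as np

def weighted_multitime_dsm_loss(score_fn, x0, ts, weights, m, rng):
    """Illustrative estimator of the reweighted multi-time DSM objective
    sum_k w_k E || sqrt(h_k) s(x_{t_k}, t_k) + z ||^2.

    score_fn : callable (x, t) -> score estimate, same shape as x
    x0       : (n, d) training points
    ts, weights : length-K sequences of times t_k and weights w_k
    m        : number of noise draws per datum and time
    """
    loss = 0.0
    for t_k, w_k in zip(ts, weights):
        a_k, h_k = np.exp(-t_k), 1.0 - np.exp(-2.0 * t_k)
        for _ in range(m):
            z = rng.standard_normal(x0.shape)
            x_t = a_k * x0 + np.sqrt(h_k) * z
            resid = np.sqrt(h_k) * score_fn(x_t, t_k) + z
            loss += w_k * np.mean(np.sum(resid**2, axis=1)) / m
    return loss
```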

The coupling mechanism in our analysis is explicit: the shared second-layer weights (or, in the NTK view, the shared linearized parameters) induce a joint kernel over the product space of data and times, with off-diagonal blocks determined by the time embedding and time-conditioned features. Therefore, architectural choices that alter cross-time feature correlations alter ψeff. Two design principles follow.

First, increasing the effective dimensionality of the time representation (larger dt, richer embeddings, or architectural pathways that separate time conditioning) tends to reduce off-diagonal correlations and push the system toward the decoupled limit in which ψeff ≈ ψn. This suggests that, if one wishes to preserve the favorable single-time generalization behavior at small noise, it is beneficial to ensure that features at distinct times are not nearly collinear. Second, if one instead desires aggressive parameter sharing (for computational or statistical reasons), then one should anticipate that strong coupling can shift ψeff away from ψn and may either delay or accelerate small-noise memorization depending on the sign and structure of correlations. In this sense, ``time conditioning’’ is not a purely semantic input; it is an operator that determines whether the multi-time regression decomposes or becomes a genuinely coupled system with nontrivial phase behavior.

Our results are stated for ridge-regularized empirical risk minimizers, but the same resolvent-based quantities that determine test error also govern gradient-flow dynamics in linear models. In particular, for linearized training, early stopping is closely related to a spectral filter on the coupled Gram operator, and thus plays a role analogous to increasing λ in a data-dependent manner. Heuristically, early stopping reduces the contribution of small-eigenvalue directions of the joint kernel (directions that are most associated with near-interpolation and with alignment to the empirical score field). Consequently, early stopping can increase the apparent ψeff (postponing the transition in ψp) without changing width. This provides a concrete interpretation of why early stopping and explicit weight decay often mitigate memorization-like artifacts in diffusion models: they prevent the shared fit from entering the regime in which the small-noise block effectively interpolates the empirical constraints.
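The analogy can be made quantitative in the standard linear-model calculation: gradient flow run for time T applies, to each eigendirection of the coupled Gram operator with eigenvalue σ, the filter on the left below, while ridge applies the filter on the right; the two agree on small eigenvalues under the identification λ ≈ 1/(ηT), which is the sense in which early stopping acts as a data-dependent regularizer here.
$$ f^{\mathrm{GF}}_{T}(\sigma) = 1 - e^{-\eta T \sigma}, \qquad f^{\mathrm{ridge}}_{\lambda}(\sigma) = \frac{\sigma}{\sigma+\lambda}, \qquad 1 - e^{-\eta T \sigma} \;\approx\; \frac{\sigma}{\sigma + 1/(\eta T)} \;\approx\; \eta T \sigma \quad \text{for } \sigma \ll 1/(\eta T). $$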

Classifier-free guidance and related conditioning methods alter the reverse-time vector field by effectively scaling or combining scores. Even when guidance is applied only at sampling time, it can amplify any discrepancy between the learned and population scores at small noise. Under our phase picture, once ψp crosses ψeff and the learned field aligns with the empirical optimal score near the data manifold, guidance can magnify the attraction toward training atoms, thereby increasing nearest-neighbor retrieval phenomena. This suggests that guidance strength should be treated as coupled to schedule design: if one uses strong guidance, it is prudent either to remain below the small-noise transition (by limiting effective capacity or strengthening regularization) or to mitigate coupling mechanisms that reduce ψeff.

Latent diffusion changes the relevant proportional-asymptotic ratios by replacing the pixel-space ambient dimension d with a smaller latent dimension. For fixed dataset size n and model width proxy p, decreasing d increases ψn = n/d and ψp = p/d simultaneously. Therefore, whether latent diffusion moves the system toward or away from the memorization boundary depends on the relative scaling of p with the latent dimension. Our framework suggests that latent compression can be beneficial insofar as it increases ψn faster than it increases the effective degrees of freedom of the score model, thereby increasing the margin ψeff − ψp at the smallest-noise times. However, if compression is accompanied by a proportional increase in model width, the system may remain near the boundary, and the same small-noise transition can reappear in latent space.
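As a purely illustrative calculation (the resolutions below are hypothetical and chosen only to make the ratios concrete): with pixel dimension d = 256·256·3 ≈ 2×10^5 and latent dimension d′ = 32·32·4 = 4096, both ratios are multiplied by d/d′ = 48 at fixed n and p,
$$ \psi_n' = \frac{n}{d'} = \frac{d}{d'}\,\psi_n, \qquad \psi_p' = \frac{p}{d'} = \frac{d}{d'}\,\psi_p, $$
so whether the margin at the smallest-noise times actually widens depends entirely on how p, and the coupling that enters ψeff, are rescaled along with the latent dimension.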

Several extensions are mathematically natural but technically nontrivial. (i) Our characterization is for fixed K. Passing to a dense time grid requires controlling the growth of the block-resolvent system as K → ∞ and identifying an operator-valued limit. One expects a functional fixed-point equation indexed by t, with ψeff becoming a functional of a time-correlation kernel; establishing this rigorously would connect the present analysis to continuous-time score matching and to the numerical stability of reverse-time solvers.

(ii) Finite-width training modifies the kernel during optimization and can, empirically, reduce cross-time correlations by learning representations that partially disentangle times. Proving such a decorrelation effect, or characterizing when it fails, remains open. A plausible route is to couple mean-field limits for feature learning with the random-matrix resolvent analysis used here, but this requires new tools because the design is no longer independent of the targets.

(iii) We have focused on (subspace) Gaussian models to make the coupled resolvent explicit. Real data introduce curvature, multimodality, and heavy tails; in such settings, the empirical score field may be even more ``atomic’’ at small noise, and the consequences of crossing ψeff may be more pronounced. It would be valuable to identify universality classes in which ψeff depends only on low-order moments and on time-feature correlations, as well as regimes where higher-order structure dominates and new order parameters are needed.

These directions are not merely refinements: they determine whether ψeff can be used as a predictive design quantity for modern diffusion pipelines. The present work establishes, at least in a controlled high-dimensional limit, that the multi-time coupling inherent in shared-parameter score models induces a genuine phase boundary for small-noise generalization, and that this boundary is manipulable through schedule and architecture choices.


Appendices: full linear-pencil derivations; numerical solver details; additional plots; proofs of auxiliary lemmas; experimental protocols and reproducibility.

We collect the technical components supporting the main statements in a set of appendices with two goals: (i) to make the block-resolvent method fully checkable, including the linear-pencil constructions and the closure of the associated trace system; and (ii) to provide sufficient numerical and experimental detail so that the limiting predictions and the finite-dimensional simulations can be reproduced without ambiguity.

A first appendix presents the full linearization of the rational matrix expressions that arise when we write the ridge solution and substitute it into the per-time train/test errors. Starting from the joint time-stacked design matrix (with blocks indexed by k ∈ {1, …, K}), we explicitly express the relevant quantities as traces of functions of coupled Gram operators, including off-diagonal blocks induced by time features. We then introduce a block linear pencil LK(z, q) whose inverse contains, as Schur complements, the resolvents of interest. The appendix gives the precise block structure, the bookkeeping conventions for block indices, and the step-by-step verification that the desired traces are obtained as normalized traces of specific diagonal blocks of LK(z, q)−1 and, when needed, of its derivatives with respect to z and auxiliary perturbation parameters q. For the case K = 2, we provide a minimal pencil with a fully explicit block layout and show how the coupled system reduces to a small set of scalar Stieltjes-transform variables.
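For readers unfamiliar with the method, the principle is visible in a textbook one-block example (this is not the pencil LK used in the appendix, whose block structure is richer): taking
$$ L(z) \;=\; \begin{pmatrix} -z I_n & X \\ X^{\top} & -I_p \end{pmatrix} \qquad\Longrightarrow\qquad \bigl(L(z)^{-1}\bigr)_{11} \;=\; \bigl(XX^{\top} - zI_n\bigr)^{-1}, $$
so the normalized trace of the top-left block of L(z)^{-1} is exactly the Stieltjes transform of the spectrum of XX^{⊤}. The appendix applies the same Schur-complement identity blockwise, with the time structure entering through additional rows and columns of the pencil.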

A second appendix records the time-augmented Gaussian equivalence reduction used to replace nonlinear random features by an analytically tractable Gaussian-linear surrogate. We state the conditions on φ (Hermite expansion, finite second moment) under which the surrogate matches the first two moments of the features and preserves the relevant cross-time correlations induced by τ(tk). The outcome is a list of coefficients (e.g. $\mu_{0,k}$, $\mu_{1,k}$, $v_k$ and their cross-time analogues) that parametrize the surrogate covariance operator. We include the derivations for these coefficients in the subspace-Gaussian model P0 = 𝒩(0, Π), highlighting how the Π structure enters only through a small number of scalars (via ψD in the proportional limit) and how the OU parameters (ak, hk) enter through simple rescalings. The appendix also provides consistency checks (such as the decoupled-time limit) that serve as debugging tools for implementations.
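Under one common convention in the Gaussian-equivalence literature (the per-time coefficients in the appendix may be normalized slightly differently; the appendix fixes the exact definitions), the single-time coefficients are the low-order Hermite moments of the activation,
$$ \mu_{0} = \mathbb{E}\bigl[\varphi(g)\bigr], \qquad \mu_{1} = \mathbb{E}\bigl[g\,\varphi(g)\bigr], \qquad v^{2} = \mathbb{E}\bigl[\varphi(g)^{2}\bigr] - \mu_{0}^{2} - \mu_{1}^{2}, \qquad g \sim \mathcal{N}(0,1), $$
with the per-time versions obtained by evaluating these moments at the pre-activation scale set by (ak, hk) and by the time embedding.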

A third appendix derives the finite-dimensional fixed-point system that characterizes the deterministic limits of the block traces. We write the Dyson (self-consistent) equation for the pencil resolvent and apply the block trace map to obtain a closed algebraic system for the trace variables {ζr}. We emphasize the precise symmetry reductions that allow one to replace matrix-valued unknowns by O(K²) scalars, and we track carefully where the proportional-asymptotic ratios (ψn, ψp, ψD) enter. Since different equivalent parameterizations of the unknowns are possible (e.g. in terms of block Stieltjes transforms versus Schur-complement traces), we provide a translation table between the parameterizations used in different parts of the manuscript, along with a list of the trace identities used to convert the solved fixed-point variables into predicted train/test errors at each time.
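Schematically (the exact notation and block bookkeeping are fixed in the appendix), the self-consistent equation takes the standard operator-valued form
$$ G(z) \;=\; \bigl(\Lambda(z) - \eta\!\left[G(z)\right]\bigr)^{-1}, $$
where G is the block resolvent of the pencil, Λ(z) collects its deterministic blocks, and η is the block-covariance map of the random blocks; applying the block trace to both sides and exploiting the symmetry reductions yields the closed system in {ζr} referred to above.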

Several lemmas used implicitly in the main line are proved in full. These include: concentration statements ensuring that empirical block Gram operators are close to their deterministic equivalents at scale $d^{-1/2}$ (sufficient for the deterministic limit claims); resolvent identities controlling derivatives with respect to z and small perturbations (needed for variance terms and for interpolation-denominator diagnostics); and small-noise asymptotics that justify the separation between population-score behavior and empirical-score alignment in the hk ≪ 1 regime. We also include a short proof that the time-decoupled limit reduces to the single-time formulas by showing that the off-diagonal blocks vanish and the Dyson system factorizes. Where a step relies on standard random-matrix results, we cite a canonical reference and spell out the correspondence of assumptions (in particular, bounded moments and independence structure) to avoid hidden regularity requirements.

An implementation-focused appendix describes how we solve the fixed-point equations robustly for finite K and how we compute the derived error quantities. We specify the unknown vector, the residual map, and the Jacobian used in Newton or quasi-Newton iterations. To handle regimes near the memorization transition (where denominators become small and naive Newton updates can be unstable), we describe a continuation strategy in ψp or λ, damped Newton steps, and a stopping criterion based on both residual norm and relative change in the trace variables. Derivatives needed for error expressions are obtained either by automatic differentiation of the fixed-point map (when implemented in a differentiable framework) or by finite differences with step sizes chosen relative to λ to balance truncation and roundoff. We also include sanity checks: verifying positivity of the implied Stieltjes transforms, monitoring monotonicity in λ, and comparing K = 1 solutions to known single-time formulas.
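A stripped-down version of the driver described here is sketched below (illustrative names; the residual map encapsulating the fixed-point equations is problem-specific, and the actual implementation additionally uses quasi-Newton updates, λ-continuation, and the sanity checks listed above).

```python
import numpy as np

def solve_fixed_point(residual, zeta0, psi_p_grid, damping=0.5, tol=1e-10, max_iter=200):
    """Continuation + damped-Newton driver for the trace fixed point (sketch).

    residual   : callable (zeta, psi_p) -> residual vector F(zeta; psi_p)
    zeta0      : initial guess for the trace variables at the first psi_p
    psi_p_grid : increasing grid of psi_p values; each solution warm-starts the next
    """
    zeta = np.asarray(zeta0, dtype=float)
    solutions = []
    for psi_p in psi_p_grid:
        for _ in range(max_iter):
            F = residual(zeta, psi_p)
            if np.linalg.norm(F) < tol:
                break
            # Finite-difference Jacobian (fine for small systems; use autodiff if available).
            eps = 1e-7
            J = np.empty((len(F), len(zeta)))
            for j in range(len(zeta)):
                dz = np.zeros_like(zeta)
                dz[j] = eps
                J[:, j] = (residual(zeta + dz, psi_p) - F) / eps
            step = np.linalg.solve(J, -F)
            zeta = zeta + damping * step   # damped update for stability near the transition
        solutions.append(zeta.copy())
    return np.array(solutions)
```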

We provide supplementary figures that visualize (i) per-time learning curves across multiple tk for varying (ψn, ψp); (ii) the effect of changing noise budgets mk at fixed width; (iii) the sensitivity of ψeff to the correlation structure of the time embedding (including both nearly collinear and nearly orthogonal regimes); and (iv) comparisons between finite-d simulations and deterministic-equivalent predictions, including observed convergence rates as d increases. Each plot is accompanied by a brief caption that states the parameter values, the averaging procedure across random seeds, and the specific error functional being displayed (train, test, or aggregate). We also include diagnostic plots of the fixed-point residual versus iteration and of the smallest eigenvalue of the relevant Gram blocks as a function of ψp, which help identify numerical artifacts versus genuine transitions.

Finally, we document the simulation pipeline used for all finite-dimensional experiments. We specify how we generate the training points $x_i$ under the subspace Gaussian model (construction of Π, normalization conventions, and sampling), how we generate OU-perturbed samples $x_{t_k}^{(i,j)}$, and how we draw and store the random feature weights (Wx, Wt) and the time embeddings τ(tk). We record the exact ridge objective solved in finite dimensions, including the scaling of the design matrix and the regularization term, and we describe the linear algebra routines used (direct solvers versus iterative methods) together with their tolerances. We list default seeds, the number of independent trials used for reported averages, and the hardware/software environment assumed. Where appropriate, we provide pseudocode for data generation, training, and evaluation, and we state how to reproduce each figure by a single command or script entry point. This appendix is intended to eliminate degrees of freedom that commonly affect learning-curve experiments (normalizations, hidden rescalings of λ, and inconsistent definitions of error), and thereby to make the empirical portion of the work verifiable in a straightforward manner.
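The following end-to-end sketch illustrates the shape of one such experiment (the names, the one-hot time embedding, the ReLU features, and the normalizations are illustrative stand-ins; the appendix fixes the exact conventions used for the reported figures).

```python
import numpy as np

def simulate_learning_curve_point(n, d, p, D, ts, ms, lam, seed=0):
    """One finite-dimensional experiment (sketch): subspace-Gaussian data,
    OU perturbations at times {t_k} with m_k noise draws each, random features
    over (x, tau(t)), and a shared ridge fit of the stacked DSM regression."""
    rng = np.random.default_rng(seed)
    # Subspace Gaussian data: x = U c with U a random d x D orthonormal frame.
    U, _ = np.linalg.qr(rng.standard_normal((d, D)))
    x0 = rng.standard_normal((n, D)) @ U.T
    # Random feature weights for data and time embedding (one-hot embedding for brevity).
    Wx = rng.standard_normal((p, d)) / np.sqrt(d)
    Wt = rng.standard_normal((p, len(ts)))
    tau = np.eye(len(ts))
    feats, targets = [], []
    for k, (t_k, m_k) in enumerate(zip(ts, ms)):
        a_k, h_k = np.exp(-t_k), 1.0 - np.exp(-2.0 * t_k)
        for _ in range(m_k):
            z = rng.standard_normal((n, d))
            x_t = a_k * x0 + np.sqrt(h_k) * z
            feats.append(np.maximum(x_t @ Wx.T + tau[k] @ Wt.T, 0.0))  # ReLU features
            targets.append(-z / np.sqrt(h_k))                          # DSM targets
    F, Y = np.vstack(feats), np.vstack(targets)
    # Ridge solution shared across all times: one weight matrix A of shape (p, d).
    A = np.linalg.solve(F.T @ F / len(F) + lam * np.eye(p), F.T @ Y / len(F))
    return A, (U, Wx, Wt)
```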