We study when diffusion models trained in a learned representation space memorize their training set. Concretely, given data x ∈ ℝd and an encoder E : ℝd → ℝD, we work with latent codes z = E(x) and fit a latent score model intended to approximate the score of the forward noised latent distribution at time t, namely s⋆(t, ⋅) = ∇log Ptz(⋅) for $Z_t=a_t Z_0+\sqrt{h_t}\,\xi$. Our focus is the regime where D is large and the learning algorithm is sufficiently expressive that it can interpolate empirical structure. In that regime, one must distinguish genuine generalization of the population score from a different phenomenon: accurately matching the score associated with the finite training set, which in turn can enable extraction through nearest-neighbor effects in latent space.
A first principle is that memorization is not controlled by the
ambient pixel dimension d but
by the effective dimension of the representation on which score matching
is performed. If the encoded distribution P0z
is concentrated on a low-dimensional subspace or has a highly
anisotropic covariance, then the statistical difficulty of score
estimation (and the propensity to overfit) is governed by the spectrum
of the latent covariance C
rather than by d. In
particular, if P0z
is well-approximated by a Gaussian 𝒩(0, C), then the forward diffusion
distribution is also Gaussian, Ptz = 𝒩(0, at2C + htI),
and the true score is linear:
$$
s^\star(t,y) \;=\; -\bigl(a_t^2 C + h_t I\bigr)^{-1} y .
$$
Even in this simplified setting, the learning problem is nontrivial: the
learner observes only finitely many latent samples {zi}i = 1n
and is trained by denoising score matching (DSM), which replaces
population expectations by finite-sample averages and introduces
auxiliary noise draws. The combination of finite data and
high-dimensional function classes creates a sharp tradeoff between
approximating s⋆
and fitting the empirical idiosyncrasies of the sample.
Our analysis is in proportional asymptotics, with n, p, D → ∞ at
fixed ratios ψn = n/D
and ψp = p/D,
for a random-features score class of the form
$$
s_A(t,y) \;=\; \frac{A}{\sqrt{p}}\,\varphi\!\Bigl(\frac{Wy}{\sqrt{D}}\Bigr),
\qquad A\in\mathbb{R}^{D\times p},\ \ W\in\mathbb{R}^{p\times D},
$$
trained at each time t by
ridge-regularized DSM with regularization parameter λ and with m independent noise samples per
datum. This setting is deliberately aligned with the mechanism by which
modern score networks can be viewed, locally in parameter space, as
kernel or random-feature regressors; it also isolates the dependence of
generalization on the ratios (ψn, ψp)
and on the noise budget m.
The central qualitative claim is that in latent DSM there is a generalization–memorization crossover controlled primarily by the comparison of p and n along directions where the data have non-negligible variance. In the small-time regime (where at ≈ 1 and ht ≈ 0), the denoising task is hardest and the score is most sensitive to the fine structure of P0z. In this regime, when m is large (including the idealized m = ∞ objective), we find a sharp rise of a memorization order parameter when p crosses n, reflecting the ability of the random-features learner to match the empirical score semp(t, ⋅) rather than the population score s⋆(t, ⋅). When m = 1, this effect is substantially mitigated: one still observes high-dimensional double-descent phenomena near the interpolation threshold ψp = ψn, but the extreme small-t blow-up characteristic of m = ∞ is absent. The dependence on m is thus not a minor constant-factor issue; it changes which failure mode dominates.
A second claim concerns representation compression under architecture
scaling. In many latent diffusion pipelines, model capacity scales with
the representation width, and it is natural to consider a feature
density κ such that p = κD. Holding
the number of training examples n fixed, reducing D necessarily reduces p, and therefore moves the system
across the boundary p ≈ n at a predictable
location:
$$
p=\kappa D \;\simeq\; n
\quad\Longleftrightarrow\quad
D \;\simeq\; \frac{n}{\kappa}.
$$
Thus, even if the pixel dimension d is unchanged, a smaller latent
dimension can provably decrease memorization propensity by pushing the
learner into a regime where it cannot represent the empirical score with
small training error and small ridge penalty simultaneously. This
``compression law’’ is not asserted as a universal design rule; rather,
it is a concrete consequence of the asymptotic learning curves in a
setting where both the data distribution in latent space and the
estimator are analytically tractable.
A third claim is that anisotropy and representation collapse are
naturally expressed through an intrinsic dimension proxy. If C is approximately rank-r, or more generally has a spectrum
with a small number of dominant directions, then the relevant effective
dimension is closer to the participation ratio (tr C)2/tr(C2)
(or exactly r in the
rank-r projector model) than
to the ambient D. Replacing
D by Deff in the ratios ψn, ψp
yields an immediately interpretable prediction: memorization risk
increases when ψpeff = p/Deff
exceeds ψneff = n/Deff
in the small-t, large-m regime. This observation suggests
a representation-space auditing procedure that uses only latent samples
to estimate spectral functionals of C and then reports a regime label
together with predicted train/test behavior.
The practical implications are therefore of the following type. If a latent diffusion model exhibits unwanted training-set leakage, then there are three levers that our theory separates: (i) the sample-to-dimension ratio in the representation, (ii) the score-model capacity relative to D (via p and architectural scaling), and (iii) the DSM noise budget m induced by the training procedure. Conversely, if one seeks high fidelity without leakage, then it is necessary to increase n or decrease effective capacity (in the relevant directions) in a manner that is cognizant of the latent covariance structure. The remainder of the paper formalizes these claims by deriving asymptotically exact learning curves for latent DSM under Gaussian (and subspace) models for P0z, and by connecting these curves to memorization proxies that correlate with extraction by nearest-neighbor retrieval in latent space.
We recall the denoising score matching (DSM) principle for diffusion models and isolate the mechanism by which finite-sample objectives admit ``empirical’’ optima that may differ substantially from the population score. We then summarize why, for the random-features (or NTK-linearized) estimators used throughout, the resulting learning curves in high dimension are governed by a small number of ratios and spectral statistics, and why passing to a learned latent space modifies these ratios through the latent covariance.
Consider a forward noising operator of Ornstein–Uhlenbeck /
variance-preserving type,
with at = e−t
and ht = 1 − e−2t
(so at2 + ht = 1).
Writing Pt
for the law of Zt, the score at
time t is s⋆(t, y) = ∇ylog Pt(y).
DSM replaces direct score estimation by regression: for each observed
``clean’’ sample $z$, one draws $y\sim \mathcal{N}(a_t z, h_t I)$ and regresses a function $s(t,\cdot)$ toward the conditional score $\nabla_y \log p(y\mid z)$, yielding the standard objective
$$
\mathcal{L}_t(s) \;=\; \mathbb{E}_{z\sim P_0}\,\mathbb{E}_{y\mid z}\,\Big\| s(t,y) - \nabla_y \log p(y\mid z)\Big\|_2^2 .
$$
For Gaussian perturbations one has $\nabla_y \log p(y\mid z)=-(y-a_t z)/h_t$, hence DSM is a least-squares problem with a particularly explicit ``label.’’
Moreover, the population minimizer over all measurable s is the marginal score s⋆(t, y),
since 𝔼[∇log p(y ∣ z) ∣ y] = ∇log p(y).
In practice, P0
is replaced by the empirical measure $\widehat
P_0=(1/n)\sum_{i=1}^n \delta_{z_i}$ and the outer expectation is
replaced by a finite average. If we additionally idealize the inner
expectation over the corruption noise as exact (corresponding to a large
noise budget per datum), then the unconstrained minimizer of the
empirical DSM objective is the score of the empirical noised distribution
$$
\widehat P_t \;=\; \frac1n\sum_{i=1}^n \mathcal{N}\bigl(a_t z_i,\ h_t I\bigr).
$$
This function is generally not equal to s⋆(t, ⋅) even
when the zi
are i.i.d. from P0.
When ht is
small, P̂t
is a mixture of narrow Gaussians centered at the training points (up to
the factor at), and the
score of such a mixture admits the familiar ``responsibility’’
form
$$
s^{\mathrm{emp}}(t,y)
\;=\;
\frac{1}{h_t}\sum_{i=1}^n w_i(y)\,\bigl(a_t z_i-y\bigr),
\qquad
w_i(y)
\;=\;
\frac{\exp\bigl(-\|y-a_t z_i\|^2/(2h_t)\bigr)}{\sum_{j=1}^n \exp\bigl(-\|y-a_t z_j\|^2/(2h_t)\bigr)} .
$$
In the small-noise regime, wi(y)
concentrates near the nearest neighbor of y among {atzi},
so semp(t, ⋅)
becomes a piecewise-nearest-neighbor vector field pulling samples toward
individual training codes. Consequently, a sufficiently expressive score
model can achieve small empirical risk by approximating semp rather than s⋆, providing a concrete
channel for training-set retrieval in reverse diffusion.
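For concreteness, the following minimal NumPy sketch (function and variable names are ours, purely illustrative) evaluates semp(t, ⋅) through the responsibility weights; for small ht the softmax concentrates on the nearest training code, exhibiting the nearest-neighbor pull described above.

```python
import numpy as np

def empirical_mixture_score(y, z_train, a_t, h_t):
    """Score of the empirical noised mixture (1/n) sum_i N(a_t z_i, h_t I)
    evaluated at query points y.  Shapes: y (N, D), z_train (n, D)."""
    centers = a_t * z_train                                          # (n, D)
    # Squared distances ||y - a_t z_i||^2 for every query/center pair.
    sq_dists = ((y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, n)
    # Responsibilities w_i(y) = softmax_i( -||y - a_t z_i||^2 / (2 h_t) ).
    logits = -sq_dists / (2.0 * h_t)
    logits -= logits.max(axis=1, keepdims=True)                      # stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                                # (N, n)
    # s_emp(t, y) = (1/h_t) sum_i w_i(y) (a_t z_i - y).
    return (w @ centers - y) / h_t
```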
The preceding discussion implicitly takes an infinite number of noise draws per training point. With only m noise samples per datum, the DSM objective becomes a random-design, noisy-label regression problem: the covariates are the perturbed points $y_{i\ell}=a_t z_i+\sqrt{h_t}\,\xi_{i\ell}$ and the labels are −(yiℓ − atzi)/ht. The case m = 1 is therefore qualitatively different from the m = ∞ idealization: the learner cannot average out the label noise induced by ξ, and the effective regression problem includes an additional source of randomness even conditional on {zi}. One should thus expect m to affect not only constants but also the location and nature of interpolation phenomena, since the target itself is observed with different signal-to-noise ratios.
We focus on score models of the form $s_A(y)=(A/\sqrt{p})\varphi(Wy/\sqrt{D})$, which are linear in the second-layer weights A and random in the first-layer weights W. At fixed t, training with squared loss and ridge regularization reduces to a (vector-valued) ridge regression with feature matrix $\Phi(y)=\varphi(Wy/\sqrt{D})\in\mathbb{R}^p$. This class captures, in an analytically tractable setting, the kernel/NTK viewpoint: in the large-p limit with suitable scaling one recovers kernel ridge regression, while for p proportional to D one can track the full random-feature regime. In proportional asymptotics, test and train errors concentrate and admit deterministic limits that depend on the sample ratio ψn = n/D, the feature ratio ψp = p/D, and the spectrum of the covariance of the covariates. These limits exhibit the now-standard ``double descent’’ behavior near the interpolation threshold p ≈ n and, in our setting, can also separate fitting the population score from fitting the empirical score.
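As a concrete illustration of this estimator class, here is a minimal NumPy sketch of the per-time ridge-DSM fit (a sketch under the model above, with illustrative names and an unspecified ridge normalization; it is not the exact estimator analyzed in the theory):

```python
import numpy as np

def fit_ridge_dsm(z_train, t, p, lam, m=1, phi=np.tanh, seed=0):
    """Random-features ridge DSM at a fixed time t.
    Model: s_A(y) = (A / sqrt(p)) * phi(W y / sqrt(D)), A trainable, W frozen."""
    rng = np.random.default_rng(seed)
    n, D = z_train.shape
    a_t, h_t = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    W = rng.standard_normal((p, D))                     # frozen first layer

    # DSM regression design: m corrupted copies per latent code,
    # covariates y_il = a_t z_i + sqrt(h_t) xi_il, labels -(y_il - a_t z_i)/h_t.
    xi = rng.standard_normal((n, m, D))
    Y = a_t * z_train[:, None, :] + np.sqrt(h_t) * xi
    U = -xi / np.sqrt(h_t)
    Y, U = Y.reshape(n * m, D), U.reshape(n * m, D)

    F = phi(Y @ W.T / np.sqrt(D)) / np.sqrt(p)          # features, (nm, p)
    # Ridge solution: A^T = (F^T F + lam I)^{-1} F^T U, shared across outputs.
    A = np.linalg.solve(F.T @ F + lam * np.eye(p), F.T @ U).T

    def score(y):
        """Evaluate the fitted score s_hat(t, y) on a batch y of shape (N, D)."""
        return phi(y @ W.T / np.sqrt(D)) / np.sqrt(p) @ A.T
    return score
```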
When score matching is carried out in a learned representation z = E(x) ∈ ℝD,
the ambient pixel dimension d
is no longer the relevant scale. The forward noising is performed in
latent space and hence the covariates presented to the regressor are
distributed according to Ptz,
whose geometry is controlled by the latent covariance
Therefore, both the statistical difficulty of score estimation and the
interpolation threshold for random-feature regression depend on the
eigenvalue distribution of C
rather than on d. If C is highly anisotropic or
approximately low rank, then only a small number of directions carry
substantial signal, and the relevant effective dimension is better
summarized by intrinsic proxies such as (tr C)2/tr(C2)
(or exactly the rank in subspace models). In particular, the comparison
between model size p and
sample size n should be
interpreted at the level of the effective dimension: a model may be overparameterized relative to an r-dimensional latent subspace even
if p ≪ D, and this is
precisely the regime in which matching semp(t, ⋅)
becomes feasible.
These observations motivate the setup and quantities studied next: we formalize the latent OU forward process, the DSM objectives for general m, and the resulting notions of test error, training error, and an order parameter measuring alignment to the empirical optimal score.
We fix an encoder E : ℝd → ℝD
and write zi = E(xi), i = 1, …, n, for the latent training codes.
Our analysis is carried out directly in representation space and is
therefore parameterized by the latent dimension D rather than the ambient pixel
dimension d. Throughout we
impose proportional asymptotics
$$
n,\,p,\,D \to \infty,
\qquad
\psi_n=\frac{n}{D},
\qquad
\psi_p=\frac{p}{D},
$$
with ψn, ψp ∈ (0, ∞)
fixed.
We model the latent codes as i.i.d. draws from an approximately
Gaussian distribution, P0z = 𝒩(0, C),
where C ∈ ℝD × D
is positive semidefinite and may be anisotropic and/or approximately low
rank. We assume that (1/D)tr C → τ ∈ (0, ∞)
and that the empirical spectral distribution of C converges as D → ∞ (allowing, as special cases,
isotropic latents C = I and
subspace/collapsed representations C = Π∥ with
rank(Π∥) = r).
We denote by $\widehat
P_0^z=(1/n)\sum_{i=1}^n\delta_{z_i}$ the empirical latent
distribution.
For each t ≥ 0, we consider
the variance-preserving Ornstein–Uhlenbeck forward process in latent
space
$$
Z_t \;=\; a_t Z_0+\sqrt{h_t}\,\xi,
\qquad a_t=e^{-t},\quad h_t=1-e^{-2t},\qquad \xi\sim\mathcal{N}(0,I_D),
$$
so that at2 + ht = 1.
Writing Ptz
for the law of Zt when Z0 ∼ P0z,
we have in particular Ptz = 𝒩(0, at2C + htID) under the Gaussian model,
and we denote the time-t
population score by s⋆(t, y) := ∇ylog Ptz(y).
At a fixed time t, the DSM
regression task is generated as follows. For each latent code zi we draw m independent corruption noises
$\{\xi_{i\ell}\}_{\ell=1}^m\stackrel{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,I_D)$
and form corrupted samples $y_{i\ell}=a_t z_i+\sqrt{h_t}\,\xi_{i\ell}$, ℓ = 1, …, m.
The DSM ``label’’ associated to (zi, yiℓ)
is the conditional score
$$
\nabla_y \log p\bigl(y_{i\ell}\mid z_i\bigr)
\;=\;
-\frac{y_{i\ell}-a_t z_i}{h_t}
\;=\;
-\frac{\xi_{i\ell}}{\sqrt{h_t}}\,.
$$
Given a candidate score function s(t, ⋅) : ℝD → ℝD,
the empirical DSM objective at time t with noise budget m is
$$
\widehat{\mathcal{L}}_{t,m}(s)
\;=\;
\frac{1}{nm}\sum_{i=1}^n\sum_{\ell=1}^m
\Bigl\| s(t,y_{i\ell}) + \frac{y_{i\ell}-a_t z_i}{h_t}\Bigr\|_2^2 .
$$
We emphasize that, conditional on {zi}, the
regression covariates {yiℓ}
have covariance at2C + htID,
so the eigen-structure of C
enters the learning curves through the design distribution already at
the level of the forward process.
In the idealized limit m = ∞ (or, equivalently, replacing
the average over ξ in the objective above by its
conditional expectation given zi), the
unconstrained empirical minimizer is the score of the empirical noised
distribution
$$
\widehat P_t^{\,z} \;=\; \frac1n\sum_{i=1}^n \mathcal{N}\bigl(a_t z_i,\ h_t I_D\bigr).
$$
We record the explicit mixture-score identity
$$
s^{\mathrm{emp}}(t,y)
\;=\;
\nabla_y \log \widehat P_t^{\,z}(y)
\;=\;
\frac{1}{h_t}\sum_{i=1}^n \omega_i(y)\,\bigl(a_t z_i-y\bigr),
\qquad
\omega_i(y)
\;=\;
\frac{\exp\bigl(-\|y-a_t z_i\|^2/(2h_t)\bigr)}{\sum_{j=1}^n \exp\bigl(-\|y-a_t z_j\|^2/(2h_t)\bigr)} .
$$
This object will serve as the reference ``memorizing’’ vector field in
our quantitative diagnostics.
We restrict s(t, ⋅) to a
random-features class that is linear in the trainable weights. Let W ∈ ℝp × D
have i.i.d. 𝒩(0, 1) entries and let
φ : ℝ → ℝ be applied
entrywise. For each t we
consider estimators of the form
$$
s_A(t,y) \;=\; \frac{A}{\sqrt{p}}\,\varphi\!\Bigl(\frac{Wy}{\sqrt{D}}\Bigr),
\qquad A\in\mathbb{R}^{D\times p},
$$
and we fit A by
ridge-regularized least squares:
$$
\widehat A
\;=\;
\operatorname*{arg\,min}_{A\in\mathbb{R}^{D\times p}}
\Bigl\{\,\widehat{\mathcal{L}}_{t,m}\bigl(s_A(t,\cdot)\bigr) + \lambda\,\|A\|_F^2\,\Bigr\},
\qquad
\hat s(t,\cdot) := s_{\widehat A}(t,\cdot).
$$
We treat t as fixed and train
independently at each time (extensions to time-discretized training may
be accommodated but are not needed for the present asymptotics).
Our primary performance metric is the latent-space DSM test error at
time t,
$$
\mathcal{E}_{\mathrm{test}}(t)
\;:=\;
\mathbb{E}\,\bigl\| \hat s\bigl(t,\ a_t Z_0+\sqrt{h_t}\,\xi\bigr) - s^\star\bigl(t,\ a_t Z_0+\sqrt{h_t}\,\xi\bigr) \bigr\|_2^2 ,
$$
where the expectation is with respect to an independent test draw Z0 ∼ P0z
and forward noise ξ. The
corresponding train error is the achieved empirical objective,
$$
\mathcal{E}_{\mathrm{train}}(t)
\;:=\;
\widehat{\mathcal{L}}_{t,m}\bigl(\hat s(t,\cdot)\bigr).
$$
To quantify the extent to which ŝ aligns with the empirical-mixture
score (and hence with the nearest-neighbor channel implicit in the mixture-score identity above), we
introduce the order parameter
$$
\mathcal{M}(t)
\;:=\;
\mathbb{E}_{Y\sim \widehat P_t^{\,z}}\,
\bigl\| \hat s(t,Y) - s^{\mathrm{emp}}(t,Y) \bigr\|_2^2 .
$$
Small ℳ(t) indicates that the
learned score is close, on average under the empirical noised
distribution, to the score of the empirical mixture itself; in the
small-noise regime this constitutes an operational proxy for
memorization and extraction risk. In the sequel we will compute
asymptotically exact limits for ℰtest(t), ℰtrain(t), and ℳ(t) as functions of (ψn, ψp, m, λ, t)
and the spectrum of C.
We now state the high-dimensional limits of the ridge-DSM estimator defined above. Throughout this section we work at a fixed time t ≥ 0 and suppress the dependence on t in the notation when no confusion arises. The key point is that, under proportional asymptotics, all the quantities of interest—test error ℰtest(t), train error ℰtrain(t), and the order parameter ℳ(t)—converge to deterministic limits that depend on the latent geometry only through the limiting spectral distribution of C (and, in particular, through intrinsic-dimension surrogates in structured cases).
Let μC denote the empirical spectral distribution of C and assume μC ⇒ μ weakly as D → ∞. The random-features design at time t is generated by the covariates $y_{i\ell}=a_t z_i+\sqrt{h_t}\xi_{i\ell}$, whose covariance is at2C + htID. Under the Gaussian-equivalence principle, the random matrix describing the linearized features admits an asymptotically equivalent Gaussian surrogate whose covariance is an explicit functional of μ and of the activation through a finite set of scalar moments (e.g. the first nontrivial Hermite coefficients of φ). Consequently, the learning curves are governed by a finite system of coupled algebraic fixed-point equations for a small number of resolvent traces (Stieltjes transforms evaluated at shifts involving λ). We do not reproduce the full system here; it is obtained by the same linear-pencil argument used for the ambient-space analysis, with D in place of d and with C inserted in the population covariance.
In the idealized regime m = ∞, the DSM objective replaces the average over corruption noise by a conditional expectation, so that the effective regression targets coincide with the empirical-mixture score semp(t, ⋅). This is the regime in which the nearest-neighbor structure of the empirical mixture is most directly available to the learner when t is small.
The qualitative content of the theorem above is that the latent learning curves admit a closed-form characterization up to the solution of a small fixed-point system, and this characterization is sensitive to overparameterization in a way that is sharply controlled by the ratios ψn and ψp. In particular, for small t (so that at ≈ 1 and ht ≪ 1) the empirical-mixture score semp(t, ⋅) becomes increasingly localized around individual latent codes, and the estimator can enter a regime in which it closely tracks semp under P̂tz (small ℳ∞(t)) while simultaneously suffering a large population test error against s⋆ (large ℰtest∞(t)). This is the operational manifestation, in our formalism, of memorization and extraction risk.
For finite noise budget, and in particular for m = 1, the regression task is genuinely noisy even conditional on {zi}, which modifies both the interpolation properties and the alignment of ŝ with semp.
A salient consequence is that the extreme small-t ``memorization blow-up’’ present at m = ∞ is not generic at m = 1: the finite-noise regression labels $u_{i\ell}=-(1/\sqrt{h_t})\xi_{i\ell}$ inject label noise that persists even when the covariates become close to the training points. At the level of the limiting equations, this appears as additional variance terms that regularize the effective signal-to-noise ratio and prevent the strong alignment with semp that drives ℳ∞(t) downward in the overparameterized, small-t regime.
The general resolvent characterization specializes cleanly in two cases that will organize our phase diagrams.
When C = ID, the limits in the preceding theorems reduce to closed-form expressions depending only on (ψn, ψp, λ, t) and activation moments. In this setting, the dominant transition is controlled by the ratio p/n = ψp/ψn. For m = ∞ and small t, the limiting curves exhibit a sharp crossover at p ≃ n: when p < n the estimator is constrained and cannot represent the empirical-mixture score, leading to comparatively larger ℳ∞(t) and smaller memorization risk; when p > n the estimator can track semp closely under P̂tz, and ℰtest∞(t) correspondingly increases in a manner consistent with memorization. For m = 1, one still observes a classical double-descent phenomenon in the test curve near p ≃ n, but without the same small-t divergence.
When C is (approximately) a
rank-r projector, the theory
decouples into ``parallel’’ (data) directions and ``orthogonal’’
directions dominated by diffusion noise. In this case, the relevant
effective dimension for memorization is r rather than D, and the p ≃ n transition occurs in
the parallel component: the estimator memorizes primarily along the
subspace supporting the data, while the orthogonal component contributes
a separate, noise-controlled baseline to ℰtest. This decomposition
formalizes the intuition that representation collapse changes the
capacity available along data directions without changing the ambient
latent dimension.
We emphasize that, across these cases, the phase behavior is expressed in terms of the ratios (ψn, ψp) (or their effective variants under intrinsic-dimension reductions). This ratio-level characterization is precisely what will allow us, in the next section, to translate the phase diagram into explicit and practically checkable threshold conditions under compression and architecture scaling.
The phase behavior described above is most naturally expressed in
terms of the proportional ratios
$$
\psi_n=\frac{n}{D},
\qquad
\psi_p=\frac{p}{D},
$$
since these are the parameters that remain O(1) in the asymptotic analysis. In
applications, however, one typically varies the latent dimension D (by changing the encoder
bottleneck) while keeping the dataset size n fixed, and the score network size
is coupled to D by
architectural design constraints. We therefore translate the ratio-level
picture into explicit scaling laws in D, emphasizing when compression
provably reduces memorization risk and when it can instead be neutral or
even adverse.
A convenient abstraction of many width-scaling rules is that the
effective number of random features is proportional to the latent
dimension,
for some feature density κ > 0 determined by channel
counts and block multiplicities (or, more generally, by an approximately
linear scaling of the score-model capacity with D). Under this scaling we have ψp = κ
and ψn = n/D,
so varying D moves the
operating point vertically in the (ψn, ψp)
plane. Since the dominant small-t memorization transition for m = ∞ is controlled by the
overparameterization ratio
$$
\frac{\psi_p}{\psi_n}=\frac{p}{n},
$$
the practical boundary occurs when p ≃ n, i.e. when
$$
\kappa D \;\simeq\; n
\quad\Longleftrightarrow\quad
D \;\simeq\; \frac{n}{\kappa}.
$$
Thus, in this scaling regime, decreasing D decreases p proportionally and can move the
learner from the regime p > n (in which the
estimator can closely track semp at small t) to the regime p < n (in which such
tracking is representationally constrained). This yields an explicit
compression law in D.
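A hypothetical numerical instance of this boundary (the values of n and κ below are chosen only for illustration):

```python
# Feature-density scaling p = kappa * D with n fixed: the p = n boundary
# sits at D ≈ n / kappa.  All numbers here are hypothetical.
n, kappa = 50_000, 4.0
print("boundary at D ≈", n / kappa)                    # 12500.0
for D in (32_768, 16_384, 8_192):
    p = kappa * D
    regime = ("p > n: can track s_emp at small t (memorization-prone)"
              if p > n else "p < n: representationally constrained")
    print(f"D={D:>6}  p={int(p):>7}  {regime}")
```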
We stress that this compression law is a statement about the capacity relative to the available latent samples, not about the pixel dimension d. In particular, holding n fixed, the compression benefit is largest when κ is not simultaneously increased to compensate: the relevant quantity is p/n = κD/n. Ridge regularization λ and the time t shift the precise location and height of peaks in ℰtest(t), but they do not alter the leading-order scaling of the capacity-driven transition.
A different and common experimental practice is to change the encoder and thus D while leaving the score architecture unchanged in a way that keeps p (or an effective width proxy) approximately constant. In our proportional formalism, this corresponds to moving along rays in which ψp = p/D varies with D, but the ratio ψp/ψn = p/n remains constant when n is fixed. Consequently, in the isotropic latent model (and more generally when the phase boundary depends primarily on p/n along the data directions), changing D at fixed p does not systematically move the system across the principal p ≃ n boundary: if p > n (resp. p < n) before compression, then p > n (resp. p < n) after compression. In this sense, purely representational compression with a decoupled score-model size may be largely neutral with respect to the dominant overparameterization-driven memorization mechanism highlighted above.
This neutrality should be interpreted carefully: the learning curves still depend on ψn = n/D and on the latent spectrum through C, and hence compression can affect constants (e.g. effective signal-to-noise ratios through at2C + htI). However, the sharp crossover phenomenon associated to interpolation of the empirical-mixture score at small t is governed by the availability of degrees of freedom relative to n, and that availability is not changed by varying D alone if p is held fixed.
Compression can become adverse when architectural choices implicitly
increase p as D decreases. One idealized way this
arises is if one keeps a fixed total parameter budget rather than a fixed feature count. In the
random-features model, the trainable second-layer weights A ∈ ℝD × p
contribute Dp
parameters, so fixing a budget P ≈ Dp yields the
inverse scaling
$$
p \;\approx\; \frac{P}{D}.
$$
If n is fixed, then p/n ≈ P/(Dn)
grows as D decreases, and
compression can move the system into the overparameterized regime.
Indeed, under this scaling, the condition p > n becomes
$$
\frac{P}{D} > n
\quad\Longleftrightarrow\quad
D < \frac{P}{n},
$$
so sufficiently aggressive compression forces p to exceed n, increasing the propensity to fit
semp and hence to
exhibit memorization-like behavior in the small-t, large-m regime. The same effect appears
whenever one ``compensates’’ for smaller D by widening the score model so
that its effective feature count grows faster than linearly in 1/D.
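The adverse fixed-budget scaling can be illustrated in the same way (again with hypothetical numbers):

```python
# Fixed trainable-parameter budget P ≈ D * p: compressing D increases p = P / D,
# and p exceeds n once D < P / n.  All numbers here are hypothetical.
n, P = 50_000, 2 ** 30                                  # ~1.07e9 parameters
print("adverse below D ≈", P / n)                       # ≈ 21475
for D in (65_536, 32_768, 16_384):
    p = P / D
    side = "overparameterized (p > n)" if p > n else "underparameterized (p < n)"
    print(f"D={D:>6}  p≈{int(p):>6}  p/n≈{p / n:5.2f}  {side}")
```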
This observation explains an apparent paradox: although reducing D can reduce memorization under the natural scaling p = κD, it can have the opposite effect under a fixed compute/parameter envelope if that envelope is used to increase width as the latent representation is compressed. The relevant control variable remains p/n (or, in structured cases, its intrinsic-dimension analogue), and any design that increases p/n tends to move toward the memorization-prone side of the diagram.
The preceding translation is most operational in the m = ∞ regime, where small t exposes the empirical-mixture structure and the main risk is strong alignment with semp. For m = 1, the same architectural scalings still control classical interpolation phenomena (e.g. the location of double descent near p ≃ n), but the extreme small-t blow-up is suppressed by label noise, so compression is less tightly coupled to extraction risk. Nevertheless, p/n remains the first-order indicator of whether the estimator has sufficient flexibility to match training-specific structure.
In summary, the phase diagram yields explicit design rules once one specifies how p scales with D: under feature-density scaling p = κD, compression decreases p and can move the model below the p ≃ n boundary; under fixed p, compression is largely neutral with respect to that boundary; and under fixed parameter budgets (or compensatory widening) leading to p ∝ 1/D, compression can increase p/n and thereby worsen memorization risk. In the next section we refine these statements beyond isotropy by replacing D with an intrinsic effective dimension determined by the spectrum of C, which captures representation collapse and anisotropy.
The compression discussion above implicitly treated the latent distribution as essentially isotropic, so that the phase behavior could be summarized in terms of the nominal latent dimension D. In the general latent Gaussian model P0z = 𝒩(0, C), however, the learning curves of the preceding theorems depend on C only through its limiting spectrum, and the relevant notion of ``dimension’’ is therefore spectral rather than nominal. We now isolate the appropriate intrinsic-dimension proxies and explain how anisotropy induces directional generalization and directional memorization.
Let C = UΛU⊤
with Λ = diag(λ1, …, λD)
and empirical spectral distribution $\mu_C=\frac1D\sum_{j=1}^D
\delta_{\lambda_j}$. At time t, the forward OU perturbation
yields
Zt ∼ 𝒩 (0, Σt), Σt := at2C + htI.
In particular, along the eigen-direction uj the marginal
variance is at2λj + ht,
and the true score is explicit:
$$
s^\star(t,y)
=
-\Sigma_t^{-1}y
=
-\sum_{j=1}^D \frac{\langle y,u_j\rangle}{a_t^2\lambda_j+h_t}\,u_j.
$$
Thus the target varies across directions according to the
inverse-variance weights (at2λj + ht)−1.
More importantly for memorization, the structure of Ptz
depends on the direction-wise signal-to-noise ratios
$$
\mathrm{SNR}_j(t) := \frac{a_t^2\lambda_j}{h_t}.
$$
When SNRj(t) ≫ 1, the
time-t distribution remains
strongly influenced by the discrete empirical support of P0z
along uj
(hence semp(t, ⋅)
exhibits the sharp sample-localized behavior characteristic of small
t); when SNRj(t) ≪ 1, the
diffusion noise dominates and the mixture becomes close to a single
Gaussian along that direction, so that semp(t, ⋅) is
comparatively smooth and less sample-specific. Anisotropy of C therefore produces directional memorization:
the score learner may have sufficient degrees of freedom to interpolate
the empirical-mixture score in a subset of directions, while remaining
effectively regularized in others.
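A short sketch of this direction-wise diagnostic (illustrative spectrum and time, not taken from any experiment):

```python
import numpy as np

def direction_snr(eigvals, t):
    """Per-direction SNR_j(t) = a_t^2 * lambda_j / h_t for the latent OU process."""
    a_t, h_t = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    return a_t ** 2 * np.asarray(eigvals) / h_t

# A spiked spectrum keeps a handful of directions data-dominated (SNR >> 1)
# at small t while the bulk is already noise-dominated.
eigvals = np.concatenate([np.full(5, 50.0), np.full(995, 0.05)])
snr = direction_snr(eigvals, t=0.05)
print((snr > 1.0).sum(), "of", len(snr), "directions have SNR_j(t) > 1")
```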
A clean instance is the rank-r projector model C = Π∥, which
captures representation collapse to an r-dimensional subspace. Writing
Π⟂ = I − Π∥,
we have
Σt = at2Π∥ + htI = (at2 + ht)Π∥ + htΠ⟂ = Π∥ + htΠ⟂,
and hence
$$
s^\star(t,y)=-\Pi_{\parallel}y-\frac{1}{h_t}\Pi_{\perp}y.
$$
This decomposition exhibits two distinct components. The ∥ part carries the data-dependent structure
(it is the only component where P0z
is nontrivial), and it is precisely in this r-dimensional component that the
small-t, large-m memorization transition identified above
manifests. The ⟂ part is ``pure noise’’: it is independent of $P_0^z$ and determined solely by the diffusion schedule. Accordingly, the limiting test error naturally decomposes as
$$
\mathcal{E}_{\mathrm{test}}(t)=\mathcal{E}_{\mathrm{test}}^{\parallel}(t)+\mathcal{E}_{\mathrm{test}}^{\perp}(t),
$$
with the sharp rise associated to matching $s^{\mathrm{emp}}$ occurring in $\mathcal{E}_{\mathrm{test}}^{\parallel}(t)$ as a function of the effective ratios $n/r$ and $p/r$ (equivalently, with $D$ replaced by $r$ in the proportional parameters along the data subspace). This is the mechanism behind the statement that collapse changes the ``dimension
controlling memorization along data directions’’: shrinking r increases the effective sample
ratio n/r and can
mitigate the ability to fit sample-specific structure, even if the
ambient latent size D is
unchanged.
Outside exact subspace models, a convenient scalar proxy for
intrinsic dimension is the participation ratio
$$
D_{\mathrm{eff}} := \frac{(\mathrm{tr}\,C)^2}{\mathrm{tr}(C^2)}
=
\frac{\left(\sum_{j=1}^D \lambda_j\right)^2}{\sum_{j=1}^D \lambda_j^2},
$$
which interpolates between 1 (fully
collapsed to one direction) and D (isotropic). This functional is
invariant to orthogonal changes of basis and depends only on μC, matching the
way C enters the resolvent
characterizations of the learning curves. In anisotropic settings, Deff captures the
intuition that a few large eigenvalues (a ``spiked’’ spectrum) behave
like a low-dimensional signal with many nuisance directions, even when
D is large.
Since memorization at small t is driven by the mixture geometry
at time t, it is often useful
to refine Deff into
a time-dependent effective dimension that discounts directions drowned by diffusion
noise. One natural choice is to weight each direction by its
contribution to the score-relevant geometry, for instance via the
matrix
$$
G_t := C^{1/2}\Sigma_t^{-1}C^{1/2}
=
U\,\mathrm{diag}\!\left(\frac{\lambda_j}{a_t^2\lambda_j+h_t}\right)U^\top,
$$
and define
$$
D_{\mathrm{eff}}(t)
:=
\frac{(\mathrm{tr}\,G_t)^2}{\mathrm{tr}(G_t^2)}.
$$
When t is small, Deff(t)
emphasizes the directions with SNRj(t) ≳ 1
(those that preserve sample separation); when t is large, Σt ≈ I
and Deff(t)
approaches the usual participation ratio of C. While we do not require a
specific proxy in the formal statements, any such scalar summary is
intended to replace D in
regime diagnostics whenever the spectrum is far from flat.
These spectral notions correspond to common representation pathologies. If an encoder exhibits collapsed variance (e.g. dead units or overly aggressive bottleneck regularization), then C develops many near-zero eigenvalues, Deff decreases, and SNRj(t) becomes small in most directions; consequently the time-t distribution becomes closer to a single Gaussian in those directions and the empirical-mixture score is less sharp, reducing the opportunity (and incentive) for the score model to align with semp there. Conversely, spiked spectra (a few very large eigenvalues) can concentrate mixture geometry into a small set of directions: even if Deff is moderate, the learner may effectively overparameterize the spiked subspace, fitting sample-specific structure primarily along those directions and thereby increasing directional extraction risk.
We therefore interpret anisotropy as producing a direction-resolved phase diagram: the same global (n, p, m) may place different eigenspaces of C on different sides of the generalization–memorization crossover. The next section operationalizes this viewpoint by estimating the latent spectrum (or Deff-type proxies) from samples and converting the resulting effective ratios into actionable audit predictions for memorization and score–empirical-score alignment.
We now describe a representation-space audit whose inputs are the latent codes {zi}i = 1n ⊂ ℝD (or streaming access thereto), architectural scalings (p, λ) of the score model class, and a characterization of the effective DSM noise budget m (including noise reuse across epochs). The audit produces (i) a spectral summary of the representation via Ĉ and intrinsic-dimension proxies, (ii) predicted regime indicators for generalization versus memorization at small t, and (iii) practical alignment proxies that correlate with extraction-style behaviors (nearest-neighbor retrieval and membership inference).
Given latent samples, we form the empirical covariance $\hat C=\frac1n\sum_{i=1}^n z_i z_i^\top$.
When D is moderate, we may
compute the full eigenspectrum of Ĉ; when D is large, we instead compute the
top-k eigenpairs via
randomized SVD and approximate the remainder by a trace estimator.
Concretely, let {λ̂j}j = 1k
denote the leading eigenvalues, and estimate tr(Ĉ) and tr(Ĉ2) by Hutchinson-type
sketches when needed. We then report
$$
\widehat D_{\mathrm{eff}}=\frac{\mathrm{tr}(\hat C)^2}{\mathrm{tr}(\hat
C^2)},
\qquad
\widehat D_{\mathrm{eff}}(t)=\frac{\mathrm{tr}(\hat
G_t)^2}{\mathrm{tr}(\hat G_t^2)},
\qquad
\hat G_t:=\hat C^{1/2}\hat\Sigma_t^{-1}\hat C^{1/2},\ \
\hat\Sigma_t:=a_t^2\hat C+h_t I,
$$
as well as the empirical distribution of the scalars ĝj(t) := λ̂j/(at2λ̂j + ht).
The latter serves as a diagnostic for mixture geometry: directions with
ĝj(t) ≈ 1
retain data separation at time t, whereas directions with ĝj(t) ≈ 0
are noise-dominated. In practice, we emphasize that any scalar summary
necessarily discards some anisotropic detail; accordingly, we recommend
reporting both D̂eff-type aggregates and
a small set of quantiles of {ĝj(t)}.
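A minimal sketch of this estimation step, using only matrix–vector products with the empirical covariance (Hutchinson probes for the traces and a few rounds of subspace iteration for the leading eigenvalues; all names are illustrative):

```python
import numpy as np

def audit_spectral_summaries(Z, k=64, n_probes=32, n_iters=10, seed=0):
    """Estimate tr(C_hat), tr(C_hat^2), the participation ratio, and the top-k
    eigenvalues from latent codes Z in R^{n x D}, without forming C_hat."""
    rng = np.random.default_rng(seed)
    n, D = Z.shape

    def cov_matvec(V):                      # C_hat V = (1/n) Z^T (Z V)
        return Z.T @ (Z @ V) / n

    # Hutchinson: E[v^T C v] = tr(C),  E[||C v||^2] = tr(C^2) for Rademacher v.
    V = rng.choice([-1.0, 1.0], size=(D, n_probes))
    CV = cov_matvec(V)
    tr_C = np.mean(np.sum(V * CV, axis=0))
    tr_C2 = np.mean(np.sum(CV * CV, axis=0))

    # Top-k eigenvalues via randomized subspace iteration.
    Q = np.linalg.qr(rng.standard_normal((D, k)))[0]
    for _ in range(n_iters):
        Q = np.linalg.qr(cov_matvec(Q))[0]
    top_eigs = np.sort(np.linalg.eigvalsh(Q.T @ cov_matvec(Q)))[::-1]

    return {"tr_C": tr_C, "tr_C2": tr_C2,
            "D_eff": tr_C ** 2 / tr_C2, "top_eigs": top_eigs}
```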
The principal output of the audit is a predicted risk score R(t) built from
proportional parameters
$$
\hat\psi_n(t):=\frac{n}{\widehat D_{\mathrm{eff}}(t)},\qquad
\hat\psi_p(t):=\frac{p}{\widehat D_{\mathrm{eff}}(t)},
$$
together with the DSM noise budget m and ridge λ. The simplest indicator, motivated
by the small-t behavior in the
large-noise regime, is the sign of
$$
\Delta(t):=\frac{\hat\psi_p(t)}{\hat\psi_n(t)}-1=\frac{p}{n}-1,
$$
with the understanding that D̂eff(t)
determines which directions behave as the effective data subspace but that the
sharp interpolation boundary, when it manifests, is controlled by p/n. We therefore output a
regime label (low versus high memorization risk) given by a calibrated
threshold rule of the form
R(t) = 1{Δ(t) > δ0} ⋅ wm(t) ⋅ wλ(t) ⋅ wanis(t),
where wm(t)
increases with an estimate of the noise multiplicity meff(t), wλ(t)
decreases as λ grows, and
wanis(t)
increases when a small fraction of directions dominates tr(Ĝt) (a spiked
{ĝj(t)}
profile). In settings where one is willing to numerically instantiate
the limiting trace equations, the audit refines R(t) by plugging the
empirical spectral measure of Ĉ into the corresponding fixed-point
system, yielding predicted ℰtest(t) and ℰtrain(t) curves as a
function of (p, n, m, λ).
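One possible concrete form of the threshold rule, with illustrative (uncalibrated) monotone weights and a hypothetical spike summary passed in by the caller, is sketched below:

```python
def regime_indicator(n, p, m_eff, lam, top_decile_mass, delta0=0.0):
    """Audit regime label from the threshold rule R(t); the weight functions and
    the 0.25 cutoff are illustrative placeholders, not calibrated values."""
    delta = p / n - 1.0                     # overparameterization gap Delta(t)
    w_m = m_eff / (1.0 + m_eff)             # increases with noise multiplicity
    w_lam = 1.0 / (1.0 + lam)               # decreases with ridge strength
    w_anis = top_decile_mass                # fraction of tr(G_t) in top 10% of g_j(t)
    risk = float(delta > delta0) * w_m * w_lam * w_anis
    label = "high memorization risk" if risk > 0.25 else "low memorization risk"
    return {"delta": delta, "risk": risk, "label": label}
```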
In many training pipelines, the nominal m differs from the number of noise draws per datum (e.g. due to reusing the same noise across epochs, or sharing noises across times). We therefore define a bookkeeping quantity meff as the number of statistically independent (t, ξ) perturbations per training point that actually enter the regression objective. When direct instrumentation is possible, meff is measured; otherwise, it is bounded using the number of epochs, batch schedule, and random seed management. This matters because large meff drives the objective closer to its population-noise limit, amplifying the small-t sample-localized structure that the score model can potentially fit.
To connect regime prediction to extraction behavior, we estimate
alignment with the empirical mixture score semp(t, ⋅),
which is available in closed form at a fixed t given {zi}:
$$
s^{\mathrm{emp}}(t,y)=\nabla_y \log\Bigl(\frac1n\sum_{i=1}^n
\exp\bigl(-\|y-a_t z_i\|^2/(2h_t)\bigr)\Bigr)
=\frac{1}{h_t}\Bigl(\sum_{i=1}^n \omega_i(y)\,(a_t z_i-y)\Bigr),
$$
where ωi(y)
are the normalized Gaussian-kernel weights. We then define an empirical
alignment functional on query points y ∼ Ptz
(or on held-out perturbed latents)
$$
\widehat{\mathcal{M}}(t)=\frac{1}{N}\sum_{\ell=1}^N \bigl\|\hat
s(t,y_\ell)-s^{\mathrm{emp}}(t,y_\ell)\bigr\|_2^2,
$$
and approximate semp by truncating the
sum to the top-k nearest
neighbors of y among {atzi}
(with k ≪ n), since
for small t the weights ωi(y)
are sharply localized. In addition, we record the entropy proxy
$$
H_k(t;y) \;:=\; -\sum_{i\in \mathrm{NN}_k(y)} \omega_i^{(k)}(y)\,\log \omega_i^{(k)}(y),
$$
whose small values indicate near-single-sample dominance and empirically
correlate with nearest-neighbor retrieval and membership inference
susceptibility.
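A minimal sketch of these two diagnostics (top-k truncated alignment and the weight-entropy proxy), with illustrative names and a user-supplied score function:

```python
import numpy as np

def alignment_and_entropy(score_fn, queries, z_train, a_t, h_t, k=32):
    """Monte Carlo M_hat(t) with a top-k truncation of s_emp, plus H_k(t; y).
    Shapes: queries (N, D), z_train (n, D); score_fn maps (N, D) -> (N, D)."""
    centers = a_t * z_train
    sq = ((queries[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (N, n)
    idx = np.argsort(sq, axis=1)[:, :k]                               # k nearest
    logits = -np.take_along_axis(sq, idx, axis=1) / (2.0 * h_t)
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                                 # (N, k)
    nn_centers = centers[idx]                                         # (N, k, D)
    s_emp = (w[..., None] * (nn_centers - queries[:, None, :])).sum(1) / h_t
    m_hat = np.mean(np.sum((score_fn(queries) - s_emp) ** 2, axis=1))
    h_k = -np.sum(w * np.log(w + 1e-12), axis=1)                      # entropy proxy
    return m_hat, h_k
```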
The audit finally reports a compute-aware privacy diagnostic: (i)
whether the training configuration sits in a regime where $\widehat{\mathcal{M}}(t)$ is predicted to be
small (generalizing) or large (memorizing), and (ii) whether this
alignment is concentrated in a small set of spectral directions
(spike-dominated). Operationally, we output a tuple
$$
\Bigl(\widehat D_{\mathrm{eff}},\widehat
D_{\mathrm{eff}}(t),\hat\psi_n(t),\hat\psi_p(t),m_{\mathrm{eff}},\widehat{\mathcal{M}}(t),\mathrm{Quantiles}(\hat
g_j(t))\Bigr),
$$
together with a recommended mitigation axis when risk is high: decrease
p (or feature density),
increase λ, reduce meff by refreshing noise
less aggressively, or reduce the representation dimension along the
high-ĝj(t)
directions (e.g. by whitening and truncation). These outputs are
designed to be compared directly against downstream extraction metrics
in the experimental protocol that follows.
We evaluate the theory and the associated audit by training latent score models across a grid of latent dimensions D, while holding the raw dataset and encoder family fixed. Concretely, we select a dataset {xi}i = 1n ⊂ ℝd (with a fixed n across all runs) and train or reuse an encoder E : ℝd → ℝD that outputs latent codes zi = E(xi). We then train denoising score matching (DSM) estimators in latent space at a collection of small times t ∈ Tsmall (and optionally a broader grid for completeness), using the same forward OU/VP coefficients (at, ht). Our primary experimental axis is D, and we emphasize two scaling regimes for the score model width: (i) p = κD for several fixed feature densities κ (as in Corollary~3), and (ii) p = p0 independent of D, which decouples representation compression from score-model scaling and serves as a robustness check.
For each (D, p) configuration we train a random-features score model ŝ(t, ⋅) at each t by ridge regression with regularization λ, using DSM targets implied by the perturbation $y=a_t z+\sqrt{h_t}\,\xi$ with ξ ∼ 𝒩(0, I). We implement both m = 1 (one perturbation per datum per epoch) and large-noise settings approaching m = ∞ by increasing the number of independent perturbations. Since the operative quantity in practice is not the nominal m but the number of noises used throughout optimization, we instrument training to record the effective noise multiplicity meff(t): the number of independent (t, ξ) draws per zi that contribute to the objective (accounting for reuse across epochs and any caching). In addition to the idealized extremes m = 1 and m = ∞, we include a controlled ablation where we fix a bank of noises per datum and reuse it over epochs, thereby interpolating between low and high meff at fixed compute.
For each encoder and latent size D, we estimate $\hat C=\frac1n\sum_{i=1}^n z_i z_i^\top$ and compute spectral summaries used by the audit: tr(Ĉ), tr(Ĉ2), D̂eff = (tr(Ĉ))2/tr(Ĉ2), and the time-dependent quantities built from Σ̂t = at2Ĉ + htI, namely Ĝt = Ĉ1/2Σ̂t−1Ĉ1/2 and D̂eff(t) = (tr(Ĝt))2/tr(Ĝt2). We also record quantiles of ĝj(t) = λ̂j/(at2λ̂j + ht) to diagnose whether only a small set of directions remains data-dominated at the relevant small times. When D is large we compute only the top-k eigenpairs and approximate traces by randomized sketches; we verify that the resulting D̂eff is stable in k and in the number of probe vectors.
Given (n, p, λ, meff)
and the estimated spectrum, we compute the audit’s predicted regime
indicator R(t) (and,
when feasible, predicted test/train DSM errors obtained by numerically
instantiating the limiting fixed-point system with the empirical
spectral measure). We then measure empirical latent DSM errors using a
held-out set of latents {zitest},
reporting
$$
\widehat{\mathcal{E}}_{\mathrm{test}}(t):=\mathbb{E}_{z\sim
\mathrm{test},\,\xi}\bigl\|\hat s(t,a_t z+\sqrt{h_t}\xi)-s^\star(t,a_t
z+\sqrt{h_t}\xi)\bigr\|_2^2,
$$
approximated by Monte Carlo. Since s⋆ is unknown, we also
report proxy errors against the empirical-mixture score semp computed from the
training latents, which is the relevant object for extraction-style
behavior at small t. In
particular, we estimate the alignment functional $\widehat{\mathcal{M}}(t)$ on y ∼ Ptz
using a nearest-neighbor truncation of semp(t, y)
for computational tractability.
To connect the audit to concrete privacy-relevant behaviors, we consider two classes of extraction metrics. First, nearest-neighbor retrieval: we generate query points y by perturbing either training or held-out latents and compute the top-1 nearest neighbor index among {atzi}i = 1n (in Euclidean distance) and compare it to the maximal-weight index under the kernel weights ωi(y) defining semp(t, y). We report (i) the probability that the nearest neighbor is unique and dominates the weights (e.g. maxiωi(y) exceeding a threshold), and (ii) the entropy proxy Hk(t; y) averaged over queries. Second, membership inference: we train a classifier to distinguish whether a query y was generated from a training latent or a held-out latent using statistics derived from the learned score, such as ∥ŝ(t, y)∥2, local score divergence estimates, and the top-k weight concentration of the induced kernel neighborhood. We summarize membership inference by AUC and advantage at fixed false-positive rate. Our hypothesis, aligned with the theory, is that these extraction metrics increase sharply when (p/n) crosses the interpolation boundary in the relevant effective subspace and when meff is large at small t.
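Two minimal sketches of these metrics (weight-dominance retrieval and a rank-based AUC for a scalar membership statistic such as ∥ŝ(t, y)∥2; names and thresholds are illustrative):

```python
import numpy as np

def retrieval_dominance(queries, z_train, a_t, h_t, thresh=0.9):
    """Fraction of queries whose top mixture weight max_i w_i(y) exceeds thresh."""
    centers = a_t * z_train
    sq = ((queries[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    logits = -sq / (2.0 * h_t)
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return float(np.mean(w.max(axis=1) > thresh))

def membership_auc(stat_train, stat_heldout):
    """Rank-based (Mann-Whitney) AUC of a scalar membership statistic,
    with larger values predicting 'training member'. Ties are ignored."""
    scores = np.concatenate([stat_train, stat_heldout])
    labels = np.concatenate([np.ones(len(stat_train)), np.zeros(len(stat_heldout))])
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = len(stat_train), len(stat_heldout)
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```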
We systematically vary λ to test the predicted monotone mitigation effect of ridge regularization on both alignment and extraction metrics, holding (n, p, D, meff) fixed. Separately, we vary encoder quality to induce controlled changes in the latent spectrum: (i) using encoders at different training checkpoints, (ii) introducing explicit bottlenecks or orthogonality/whitening penalties to modify anisotropy, and (iii) inducing partial collapse by reducing decoder pressure or adding strong latent regularizers. For each encoder we compare the audit’s D̂eff(t) and ĝj(t) quantiles against the observed extraction metrics, with the expectation that spike-dominated spectra (small effective dimension along data-dominated directions) concentrate risk and change the relevant threshold when expressed in terms of Deff.
Across all configurations we report tuples
$$
(D,p,n,\lambda,m_{\mathrm{eff}};\ \widehat D_{\mathrm{eff}},\widehat
D_{\mathrm{eff}}(t),\ \widehat{\mathcal{M}}(t);\ \text{retrieval},\
\text{membership-AUC}),
$$
and we plot these against the audit’s predicted regime boundary as
functions of p/n and
of the spectral concentration diagnostics. We additionally verify that,
under architecture scaling p = κD with n fixed, decreasing D moves the system across the
predicted threshold D ≈ n/κ in tandem
with a reduction in alignment to semp and a concomitant
drop in retrieval and membership inference success.
Our analysis is organized around the surrogate P0z ≃ 𝒩(0, C) (or structured degenerations thereof), which is both mathematically convenient and empirically plausible after aggressive representation learning and whitening. Nevertheless, many practical encoders produce latent distributions that are visibly non-Gaussian: heavy tails, skewness, and (most importantly) class-conditional clustering that is better modeled by mixtures. At a qualitative level, the OU perturbation $Z_t=a_t Z_0+\sqrt{h_t}\xi$ mollifies such non-Gaussianity as t grows, so one expects the Gaussian proxy to become more accurate for moderate t, while the small-t regime—the regime in which our memorization boundary is sharpest—remains sensitive to mixture structure. For a mixture $P_0^z=\sum_{k=1}^K \pi_k \mathcal{N}(\mu_k,C_k)$, the exact score s⋆(t, ⋅) is a soft assignment over components with weights proportional to πkφt(⋅; atμk, at2Ck + htI), hence the ``data directions’’ and effective dimension become point-dependent. We view a rigorous extension to mixtures as requiring new tools: the random-design matrices become heteroscedastic and correlated with the regression target through the posterior responsibilities, so the present resolvent reductions do not directly apply.
Even when P0z is not Gaussian, two mechanisms suggest partial robustness of the phase diagram. First, in proportional asymptotics, random-features regression exhibits a degree of universality: many test/train risk expressions depend primarily on low-order moments of the design under appropriate Lindeberg conditions. Second, the OU convolution renders Ptz smoother and closer to a single Gaussian in total variation as t increases, and one may hope to control the discrepancy between s⋆(t, ⋅) and the Gaussian-score surrogate using stability estimates for score matching under perturbations of the underlying density. Making either statement precise is nontrivial: score errors depend on derivatives of log pt, and thus on non-asymptotic lower bounds for pt and regularity of ∇log pt. We therefore restrict ourselves to using C (or an intrinsic proxy derived from it) as a principled surrogate rather than claiming distribution-free guarantees. A natural intermediate model class is elliptical P0z with a deterministic radial distribution, for which one might retain tractable resolvent formulas while capturing tail effects.
We analyze (and audit) each t separately, which matches the form of the theoretical DSM objective and isolates the dependence on (at, ht). Practical diffusion models, however, learn a single time-conditioned network ŝθ(t, ⋅) with shared parameters across t, typically trained with a non-uniform time sampling distribution. This coupling can move the effective capacity away from the nominal p at a fixed t: the model may allocate more degrees of freedom to times that dominate the training measure, and may be under-parameterized at some t even if p ≫ n globally. In a random-features analogue, one may model time conditioning by augmenting the features as $\phi(W z/\sqrt{D},\, u(t))$ for a deterministic embedding u(t), leading to a multi-task ridge regression with a block-structured Gram matrix indexed by time. The corresponding limiting risk should be governed by a matrix-valued fixed-point system (involving the joint spectral distribution of the per-time covariances Σt = at2C + htI and the time kernel induced by u). We expect the sharp boundary in terms of an effective (p/n) ratio to persist, but with a boundary that depends on the time-sampling distribution and on the ``task overlap’’ across nearby times.
Our predictions are stated at the level of the learned score at a fixed diffusion time t, whereas privacy-relevant extraction is typically mediated by a discrete-time reverse sampler (Euler–Maruyama for the reverse SDE, or a probability-flow ODE integrator). Discretization introduces two additional layers of approximation: (i) the sampler queries ŝ only on a time grid and thus may never visit the smallest times where alignment to semp is maximal; (ii) numerical error can amplify score inaccuracies in a time-accumulated manner. Accordingly, a complete risk characterization should weight score error not only by Ptz but also by the visitation distribution of the sampler. One plausible extension is to analyze a discretized reverse recursion whose drift uses the RF score, and to propagate the score approximation error through the recursion using stability bounds for stochastic numerical schemes. In particular, one may obtain an error decomposition into a score-approximation term (our ℰtest-type quantities) and a discretization term depending on step size and on Lipschitz constants of the learned drift.
The audit replaces D by an estimated Deff (or Deff(t)), which requires accurate estimation of spectral functionals of C. When n is only moderately larger than Deff, both tr(Ĉ2) and the top eigenvalues can be biased, and the participation ratio can be unstable under heavy tails or strong anisotropy. This interacts with the very phenomenon we seek to detect: near the boundary, small perturbations of ψn and ψp can flip the regime label. We therefore regard the audit as a screening tool whose reliability improves when (a) n is sufficiently large to estimate the spectrum at the needed resolution, and (b) the latent spectrum has a pronounced gap or otherwise yields a stable effective dimension. A principled refinement would attach confidence intervals to Deff via bootstrap or via non-asymptotic concentration bounds for trace estimators, and propagate these intervals through the phase-boundary rule.
A further limitation is that the latent codes zi = E(xi) are not exogenous: the encoder is itself trained on {xi}i = 1n, so C and the latent distribution depend on the sample. Our asymptotics treat zi as i.i.d. from a fixed P0z, which is appropriate when E is frozen and the downstream score model is trained on new data, but is only approximate in end-to-end settings. Extending the theory to incorporate encoder-induced dependence would require a joint limit in which E is a random object correlated with {xi}, and one tracks how this correlation alters both the effective spectrum and the DSM targets. We expect the leading-order lesson to remain: collapse or strong anisotropy reduces the number of data-dominated directions and can concentrate memorization risk; however, quantifying this effect requires modeling the encoder training dynamics or imposing stability assumptions on E.
We close by listing concrete extensions suggested by our results. (i) Establish a rigorous universality theorem for latent DSM with non-Gaussian P0z, ideally identifying minimal moment and smoothness conditions under which the Gaussian learning curves remain asymptotically valid. (ii) Develop a mixture-aware phase diagram in which the boundary is expressed in terms of a local or component-wise effective dimension, and relate it to cluster-wise extraction phenomena. (iii) Analyze time-conditioned training as a coupled multi-task regression problem and determine how time sampling and parameter sharing shift the boundary. (iv) Integrate discretized sampling into the risk metric, producing an end-to-end prediction for retrieval/membership behavior under a specified sampler. (v) Replace the random-features class with deeper architectures while retaining a tractable asymptotic characterization, for example via linearization at initialization or via kernel limits, thereby connecting representational capacity, memorization, and diffusion-time conditioning in a single framework.