Intrinsic-Dimension Scaling Laws for Memorization in Latent Diffusion via Precise DSM Learning Curves

Table of Contents

  1. 1. Introduction: latent diffusion memorization; why intrinsic dimension matters; summary of scaling-law claims and practical implications.
  2. 2. Background: DSM for diffusion, empirical optimal score and memorization mechanism, random features/NTK learning curves, and how latent diffusion changes the effective dimension and covariance structure.
  3. 3. Problem setup in latent space: encoder-induced latent distribution assumptions; OU/VP forward process in ℝ^D; DSM with m noise samples; definitions of test/train errors and a memorization order parameter (score alignment to empirical optimal score).
  4. 4. Main theorems (learning curves): asymptotically exact test/train errors for m=∞ and m=1 with general latent covariance C; specialization to isotropic C=I and subspace C=Π_∥; phase diagram and the p ≈ n crossover in latent directions.
  5. 5. Compression and architecture scaling: translate the phase diagram from ratios (ψ_n, ψ_p) to practical scalings p = κ D (or parameter-count scaling) and derive explicit threshold conditions in D; discuss when compression reduces memorization and when it can worsen it (if p is held fixed).
  6. 6. Representation collapse and anisotropy: define effective intrinsic dimension (rank, participation ratio); show how anisotropy modifies learning curves and directional generalization; connect to encoder pathologies (collapsed variance, spiked spectra).
  7. 7. Representation-space audit algorithms: estimating latent spectra and predicting memorization risk; practical proxies for score–empirical-score alignment; compute–privacy diagnostics.
  8. 8. Experimental plan: latent diffusion training across latent sizes; covariance spectrum estimation; extraction metrics (nearest-neighbor retrieval, membership inference) vs predicted risk; ablations on m (effective noise reuse), regularization, and encoder quality.
  9. 9. Discussion and limitations: beyond Gaussian latents (mixtures), time-conditioned networks, discretization effects; open problems and how the theory could be extended.

Content

1. Introduction: latent diffusion memorization; why intrinsic dimension matters; summary of scaling-law claims and practical implications.

We study when diffusion models trained in a learned representation space memorize their training set. Concretely, given data x ∈ ℝ^d and an encoder E : ℝ^d → ℝ^D, we work with latent codes z = E(x) and fit a latent score model intended to approximate the score of the forward noised latent distribution at time t, namely s(t, ⋅) = ∇log P_t^z(⋅) for $Z_t=a_t Z_0+\sqrt{h_t}\,\xi$. Our focus is the regime where D is large and the learning algorithm is sufficiently expressive that it can interpolate empirical structure. In that regime, one must distinguish genuine generalization of the population score from a different phenomenon: accurately matching the score associated with the finite training set, which in turn can enable extraction through nearest-neighbor effects in latent space.

A first principle is that memorization is not controlled by the ambient pixel dimension d but by the effective dimension of the representation on which score matching is performed. If the encoded distribution P_0^z is concentrated on a low-dimensional subspace or has a highly anisotropic covariance, then the statistical difficulty of score estimation (and the propensity to overfit) is governed by the spectrum of the latent covariance C rather than by d. In particular, if P_0^z is well-approximated by a Gaussian 𝒩(0, C), then the forward diffusion distribution is also Gaussian, P_t^z = 𝒩(0, a_t^2 C + h_t I), and the true score is linear:

$$ s(t, y) \;=\; -\bigl(a_t^2 C + h_t I\bigr)^{-1} y. $$

Even in this simplified setting, the learning problem is nontrivial: the learner observes only finitely many latent samples $\{z_i\}_{i=1}^n$ and is trained by denoising score matching (DSM), which replaces population expectations by finite-sample averages and introduces auxiliary noise draws. The combination of finite data and high-dimensional function classes creates a sharp tradeoff between approximating s and fitting the empirical idiosyncrasies of the sample.

Our analysis is in proportional asymptotics, with n, p, D → ∞ at fixed ratios ψ_n = n/D and ψ_p = p/D, for a random-features score class of the form

$$ s_A(t, y) \;=\; \frac{A}{\sqrt{p}}\,\varphi\!\left(\frac{W y}{\sqrt{D}}\right), \qquad A \in \mathbb{R}^{D\times p},\ W \in \mathbb{R}^{p\times D}, $$

trained at each time t by ridge-regularized DSM with regularization parameter λ and with m independent noise samples per datum. This setting is deliberately aligned with the mechanism by which modern score networks can be viewed, locally in parameter space, as kernel or random-feature regressors; it also isolates the dependence of generalization on the ratios (ψ_n, ψ_p) and on the noise budget m.

The central qualitative claim is that in latent DSM there is a generalization–memorization crossover controlled primarily by the comparison of p and n along directions where the data have non-negligible variance. In the small-time regime (where a_t ≈ 1 and h_t ≈ 0), the denoising task is hardest and the score is most sensitive to the fine structure of P_0^z. In this regime, when m is large (including the idealized m = ∞ objective), we find a sharp rise of a memorization order parameter when p crosses n, reflecting the ability of the random-features learner to match the empirical score s^emp(t, ⋅) rather than the population score s(t, ⋅). When m = 1, this effect is substantially mitigated: one still observes high-dimensional double-descent phenomena near the interpolation threshold ψ_p = ψ_n, but the extreme small-t blow-up characteristic of m = ∞ is absent. The dependence on m is thus not a minor constant-factor issue; it changes which failure mode dominates.

A second claim concerns representation compression under architecture scaling. In many latent diffusion pipelines, model capacity scales with the representation width, and it is natural to consider a feature density κ such that p = κD. Holding the number of training examples n fixed, reducing D necessarily reduces p, and therefore moves the system across the boundary p ≈ n at a predictable location:

$$ \kappa D \;\approx\; n \quad\Longleftrightarrow\quad D \;\approx\; \frac{n}{\kappa}. $$

Thus, even if the pixel dimension d is unchanged, a smaller latent dimension can provably decrease memorization propensity by pushing the learner into a regime where it cannot represent the empirical score with small training error and small ridge penalty simultaneously. This ``compression law’’ is not asserted as a universal design rule; rather, it is a concrete consequence of the asymptotic learning curves in a setting where both the data distribution in latent space and the estimator are analytically tractable.

A third claim is that anisotropy and representation collapse are naturally expressed through an intrinsic dimension proxy. If C is approximately rank-r, or more generally has a spectrum with a small number of dominant directions, then the relevant effective dimension is closer to

$$ D_{\mathrm{eff}} \;=\; \frac{(\mathrm{tr}\,C)^2}{\mathrm{tr}(C^2)} $$

(or exactly r in the rank-r projector model) than to the ambient D. Replacing D by D_eff in the ratios ψ_n, ψ_p yields an immediately interpretable prediction: memorization risk increases when ψ_p^eff = p/D_eff exceeds ψ_n^eff = n/D_eff in the small-t, large-m regime. This observation suggests a representation-space auditing procedure that uses only latent samples to estimate spectral functionals of C and then reports a regime label together with predicted train/test behavior.

The practical implications are therefore of the following type. If a latent diffusion model exhibits unwanted training-set leakage, then there are three levers that our theory separates: (i) the sample-to-dimension ratio in the representation, (ii) the score-model capacity relative to D (via p and architectural scaling), and (iii) the DSM noise budget m induced by the training procedure. Conversely, if one seeks high fidelity without leakage, then it is necessary to increase n or decrease effective capacity (in the relevant directions) in a manner that is cognizant of the latent covariance structure. The remainder of the paper formalizes these claims by deriving asymptotically exact learning curves for latent DSM under Gaussian (and subspace) models for P0z, and by connecting these curves to memorization proxies that correlate with extraction by nearest-neighbor retrieval in latent space.


2. Background: DSM for diffusion, empirical optimal score and memorization mechanism, random features/NTK learning curves, and how latent diffusion changes the effective dimension and covariance structure.

We recall the denoising score matching (DSM) principle for diffusion models and isolate the mechanism by which finite-sample objectives admit ``empirical’’ optima that may differ substantially from the population score. We then summarize why, for the random-features (or NTK-linearized) estimators used throughout, the resulting learning curves in high dimension are governed by a small number of ratios and spectral statistics, and why passing to a learned latent space modifies these ratios through the latent covariance.

Consider a forward noising operator of Ornstein–Uhlenbeck / variance-preserving type,

$$ Z_t \;=\; a_t Z_0 + \sqrt{h_t}\,\xi, \qquad \xi\sim\mathcal{N}(0,I), $$

with a_t = e^{−t} and h_t = 1 − e^{−2t} (so a_t^2 + h_t = 1). Writing P_t for the law of Z_t, the score at time t is s(t, y) = ∇_y log P_t(y). DSM replaces direct score estimation by regression: for each observed ``clean'' sample z, one draws y ∼ 𝒩(a_t z, h_t I) and regresses a function s(t,·) toward the conditional score ∇_y log p(y ∣ z), yielding the standard objective

$$ \mathcal{L}_t(s) \;=\; \mathbb{E}_{z\sim P_0}\,\mathbb{E}_{y\mid z}\,\Big\| s(t,y) - \nabla_y \log p(y\mid z)\Big\|_2^2 . $$

For Gaussian perturbations one has ∇_y log p(y ∣ z) = −(y − a_t z)/h_t, hence DSM is a least-squares problem with a particularly explicit ``label.'' Moreover, the population minimizer over all measurable s is the marginal score s(t, y), since 𝔼[∇_y log p(y ∣ z) ∣ y] = ∇_y log p_t(y).
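
To make the regression view concrete, here is a minimal numpy sketch (the helper name `dsm_pairs` and its interface are ours, not the paper's) that forms the DSM covariates and conditional-score labels for the Gaussian corruption above.

```python
import numpy as np

def dsm_pairs(z, t, m=1, rng=None):
    """DSM covariates y and labels u = -(y - a_t z)/h_t = -xi/sqrt(h_t) for the OU/VP corruption.

    z : (n, D) clean latent codes; a_t = exp(-t), h_t = 1 - exp(-2t) so a_t^2 + h_t = 1.
    Returns arrays of shape (n*m, D).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, D = z.shape
    a_t, h_t = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    xi = rng.standard_normal((n, m, D))
    y = a_t * z[:, None, :] + np.sqrt(h_t) * xi        # corrupted samples y_{i,l}
    u = -xi / np.sqrt(h_t)                             # conditional-score labels
    return y.reshape(n * m, D), u.reshape(n * m, D)
```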

In practice, P_0 is replaced by the empirical measure $\widehat P_0=(1/n)\sum_{i=1}^n \delta_{z_i}$ and the outer expectation is replaced by a finite average. If we additionally idealize the inner expectation over the corruption noise as exact (corresponding to a large noise budget per datum), then the unconstrained minimizer of the empirical DSM objective is the score of the empirical noised distribution

$$ \widehat P_t \;=\; \frac1n\sum_{i=1}^n \mathcal{N}(a_t z_i,\, h_t I), \qquad s^{\mathrm{emp}}(t,\cdot) \;=\; \nabla \log \widehat p_t(\cdot). $$

This function is generally not equal to s(t, ⋅) even when the z_i are i.i.d. from P_0. When h_t is small, $\widehat P_t$ is a mixture of narrow Gaussians centered at the training points (up to the factor a_t), and the score of such a mixture admits the familiar ``responsibility'' form

$$ s^{\mathrm{emp}}(t,y) \;=\; \frac{1}{h_t}\sum_{i=1}^n w_i(y)\,(a_t z_i - y), \qquad w_i(y) \;=\; \frac{\exp\!\bigl(-\|y-a_t z_i\|^2/(2h_t)\bigr)}{\sum_{j=1}^n \exp\!\bigl(-\|y-a_t z_j\|^2/(2h_t)\bigr)}. $$
In the small-noise regime, w_i(y) concentrates near the nearest neighbor of y among {a_t z_i}, so s^emp(t, ⋅) becomes a piecewise-nearest-neighbor vector field pulling samples toward individual training codes. Consequently, a sufficiently expressive score model can achieve small empirical risk by approximating s^emp rather than s, providing a concrete channel for training-set retrieval in reverse diffusion.
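
The responsibility form can be evaluated directly from the training codes; the following sketch (a hypothetical helper, assuming the same a_t, h_t schedule as above) computes s^emp(t, y) with a numerically stabilized softmax over the n kernel weights.

```python
import numpy as np

def empirical_mixture_score(y, z_train, t):
    """Score of the empirical noised mixture (1/n) sum_i N(a_t z_i, h_t I) at the rows of y,
    via the responsibility form s_emp(t, y) = (1/h_t) sum_i w_i(y) (a_t z_i - y)."""
    a_t, h_t = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    centers = a_t * z_train                                         # (n, D)
    d2 = ((y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)       # (N, n) squared distances
    logits = -d2 / (2.0 * h_t)
    logits -= logits.max(axis=1, keepdims=True)                     # log-sum-exp stabilization
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                               # responsibilities w_i(y)
    return (w @ centers - y) / h_t
```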

The preceding discussion implicitly takes an infinite number of noise draws per training point. With only m noise samples per datum, the DSM objective becomes a random-design, noisy-label regression problem: the covariates are the perturbed points $y_{i\ell}=a_t z_i+\sqrt{h_t}\,\xi_{i\ell}$ and the labels are −(y_{iℓ} − a_t z_i)/h_t. The case m = 1 is therefore qualitatively different from the m = ∞ idealization: the learner cannot average out the label noise induced by ξ, and the effective regression problem includes an additional source of randomness even conditional on {z_i}. One should thus expect m to affect not only constants but also the location and nature of interpolation phenomena, since the target itself is observed with different signal-to-noise ratios.

We focus on score models of the form $s_A(y)=(A/\sqrt{p})\varphi(Wy/\sqrt{D})$, which are linear in the second-layer weights A and random in the first-layer weights W. At fixed t, training with squared loss and ridge regularization reduces to a (vector-valued) ridge regression with feature matrix $\Phi(y)=\varphi(Wy/\sqrt{D})\in\mathbb{R}^p$. This class captures, in an analytically tractable setting, the kernel/NTK viewpoint: in the large-p limit with suitable scaling one recovers kernel ridge regression, while for p proportional to D one can track the full random-feature regime. In proportional asymptotics, test and train errors concentrate and admit deterministic limits that depend on the sample ratio ψ_n = n/D, the feature ratio ψ_p = p/D, and the spectrum of the covariance of the covariates. These limits exhibit the now-standard ``double descent'' behavior near the interpolation threshold p ≈ n and, in our setting, can also separate fitting the population score from fitting the empirical score.
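
As an illustration of the reduction to vector-valued ridge regression, a minimal sketch is given below; the penalty normalization λ‖A‖_F² is a convention we assume (the text does not fix it), and the helper names are ours.

```python
import numpy as np

def fit_rf_ridge(y, u, W, lam, act=np.tanh):
    """Ridge-regularized random-features fit of DSM targets.

    y : (N, D) covariates, u : (N, D) labels, W : (p, D) frozen first-layer weights.
    Minimizes (1/N) || U - Phi A^T / sqrt(p) ||_F^2 + lam ||A||_F^2 over A in R^{D x p}.
    """
    N, D = y.shape
    p = W.shape[0]
    Phi = act(y @ W.T / np.sqrt(D))                    # (N, p) random features
    G = Phi.T @ Phi / N + lam * p * np.eye(p)          # rescaled regularized Gram matrix
    A = np.linalg.solve(G, Phi.T @ u / N).T * np.sqrt(p)
    return A

def rf_score(y, A, W, act=np.tanh):
    """Evaluate s_A(y) = (A / sqrt(p)) phi(W y / sqrt(D)) row-wise."""
    p, D = W.shape
    return act(y @ W.T / np.sqrt(D)) @ A.T / np.sqrt(p)
```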

When score matching is carried out in a learned representation z = E(x) ∈ ℝ^D, the ambient pixel dimension d is no longer the relevant scale. The forward noising is performed in latent space and hence the covariates presented to the regressor are distributed according to P_t^z, whose geometry is controlled by the latent covariance

$$ C \;:=\; \mathrm{Cov}(z) \;=\; \mathbb{E}\bigl[E(x)\,E(x)^\top\bigr] \quad\text{(taking latents to be centered).} $$

Therefore, both the statistical difficulty of score estimation and the interpolation threshold for random-feature regression depend on the eigenvalue distribution of C rather than on d. If C is highly anisotropic or approximately low rank, then only a small number of directions carry substantial signal, and the relevant effective dimension is better summarized by intrinsic proxies such as (tr C)^2/tr(C^2) (or exactly the rank in subspace models). In particular, the comparison between model size p and sample size n should be interpreted relative to this effective dimension rather than to D: a model may be overparameterized relative to an r-dimensional latent subspace even if p ≪ D, and this is precisely the regime in which matching s^emp(t, ⋅) becomes feasible.

These observations motivate the setup and quantities studied next: we formalize the latent OU forward process, the DSM objectives for general m, and the resulting notions of test error, training error, and an order parameter measuring alignment to the empirical optimal score.


3. Problem setup in latent space: encoder-induced latent distribution assumptions; OU/VP forward process in ℝ^D; DSM with m noise samples; definitions of test/train errors and a memorization order parameter (score alignment to empirical optimal score).

We fix an encoder E : ℝ^d → ℝ^D and write

$$ z_i \;=\; E(x_i) \in \mathbb{R}^D, \qquad i=1,\dots,n, $$

for the latent codes of the training examples.

Our analysis is carried out directly in representation space and is therefore parameterized by the latent dimension D rather than the ambient pixel dimension d. Throughout we impose proportional asymptotics

$$ n,\,p,\,D \to \infty, \qquad \frac{n}{D}\to\psi_n, \qquad \frac{p}{D}\to\psi_p, $$

with ψ_n, ψ_p ∈ (0, ∞) fixed.

We model the latent codes as i.i.d. draws from an approximately Gaussian distribution

$$ z_i \stackrel{\mathrm{i.i.d.}}{\sim} P_0^z \;=\; \mathcal{N}(0, C), $$

where C ∈ ℝ^{D × D} is positive semidefinite and may be anisotropic and/or approximately low rank. We assume that (1/D) tr C → τ ∈ (0, ∞) and that the empirical spectral distribution of C converges as D → ∞ (allowing, as special cases, isotropic latents C = I and subspace/collapsed representations C = Π_∥ with rank(Π_∥) = r). We denote by $\widehat P_0^z=(1/n)\sum_{i=1}^n\delta_{z_i}$ the empirical latent distribution.

For each t ≥ 0, we consider the variance-preserving Ornstein–Uhlenbeck forward process in latent space

$$ Z_t \;=\; a_t Z_0 + \sqrt{h_t}\,\xi, \qquad a_t = e^{-t}, \quad h_t = 1-e^{-2t}, \quad \xi\sim\mathcal{N}(0,I_D), $$

so that a_t^2 + h_t = 1. Writing P_t^z for the law of Z_t when Z_0 ∼ P_0^z, we have in particular

$$ P_t^z \;=\; \mathcal{N}\bigl(0,\; a_t^2 C + h_t I_D\bigr), $$

and we denote the time-t population score by

$$ s^\star(t,y) \;=\; \nabla_y \log p_t^z(y) \;=\; -\bigl(a_t^2 C + h_t I_D\bigr)^{-1} y. $$

At a fixed time t, the DSM regression task is generated as follows. For each latent code z_i we draw m independent corruption noises $\{\xi_{i\ell}\}_{\ell=1}^m\stackrel{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,I_D)$ and form corrupted samples

$$ y_{i\ell} \;=\; a_t z_i + \sqrt{h_t}\,\xi_{i\ell}, \qquad \ell=1,\dots,m. $$

The DSM ``label'' associated to (z_i, y_{iℓ}) is the conditional score

$$ u_{i\ell} \;=\; \nabla_y \log p(y_{i\ell}\mid z_i) \;=\; -\frac{y_{i\ell}-a_t z_i}{h_t} \;=\; -\frac{\xi_{i\ell}}{\sqrt{h_t}}. $$

Given a candidate score function s(t, ⋅) : ℝ^D → ℝ^D, the empirical DSM objective at time t with noise budget m is

$$ \widehat{\mathcal{L}}_{t,m}(s) \;=\; \frac{1}{nm}\sum_{i=1}^n\sum_{\ell=1}^m \bigl\| s(t, y_{i\ell}) - u_{i\ell} \bigr\|_2^2. $$

We emphasize that, conditional on {z_i}, the regression covariates {y_{iℓ}} have covariance a_t^2 C + h_t I_D, so the eigen-structure of C enters the learning curves through the design distribution already at the level of the forward process.

In the idealized limit m = ∞ (or, equivalently, replacing the average over ξ in the objective above by its conditional expectation given z_i), the unconstrained empirical minimizer is the score of the empirical noised distribution

$$ \widehat P_t^z \;=\; \frac1n\sum_{i=1}^n \mathcal{N}\bigl(a_t z_i,\, h_t I_D\bigr). $$

We record the explicit mixture-score identity

$$ s^{\mathrm{emp}}(t,y) \;=\; \nabla_y \log\Bigl(\frac1n\sum_{i=1}^n \exp\bigl(-\|y-a_t z_i\|^2/(2h_t)\bigr)\Bigr) \;=\; \frac{1}{h_t}\sum_{i=1}^n \omega_i(y)\,(a_t z_i-y), $$

where ω_i(y) are the normalized Gaussian-kernel weights.

This object will serve as the reference ``memorizing’’ vector field in our quantitative diagnostics.

We restrict s(t, ⋅) to a random-features class that is linear in the trainable weights. Let W ∈ ℝ^{p × D} have i.i.d. 𝒩(0, 1) entries and let φ : ℝ → ℝ be applied entrywise. For each t we consider estimators of the form

$$ s_A(t,y) \;=\; \frac{A}{\sqrt{p}}\,\varphi\!\left(\frac{W y}{\sqrt{D}}\right), \qquad A \in \mathbb{R}^{D\times p}, $$

and we fit A by ridge-regularized least squares:

$$ \widehat A_t \;\in\; \arg\min_{A\in\mathbb{R}^{D\times p}} \;\widehat{\mathcal{L}}_{t,m}\bigl(s_A\bigr) \;+\; \lambda\,\|A\|_F^2 . $$

We treat t as fixed and train independently at each time (extensions to time-discretized training may be accommodated but are not needed for the present asymptotics).

Our primary performance metric is the latent-space DSM test error at time t,

$$ \mathcal{E}_{\mathrm{test}}(t) \;:=\; \mathbb{E}\,\bigl\| \hat s(t, a_t Z_0 + \sqrt{h_t}\,\xi) - s^\star(t, a_t Z_0 + \sqrt{h_t}\,\xi) \bigr\|_2^2, $$

where the expectation is with respect to an independent test draw Z_0 ∼ P_0^z and forward noise ξ. The corresponding train error is the achieved empirical objective,

$$ \mathcal{E}_{\mathrm{train}}(t) \;:=\; \frac{1}{nm}\sum_{i=1}^n\sum_{\ell=1}^m \bigl\| \hat s(t, y_{i\ell}) - u_{i\ell} \bigr\|_2^2 . $$

To quantify the extent to which $\hat s(t,\cdot)$ aligns with the empirical-mixture score (and hence with the nearest-neighbor channel implicit in it), we introduce the order parameter

$$ \mathcal{M}(t) \;:=\; \mathbb{E}_{y\sim \widehat P_t^z}\,\bigl\| \hat s(t, y) - s^{\mathrm{emp}}(t, y) \bigr\|_2^2 . $$

Small ℳ(t) indicates that the learned score is close, on average under the empirical noised distribution, to the score of the empirical mixture itself; in the small-noise regime this constitutes an operational proxy for memorization and extraction risk. In the sequel we will compute asymptotically exact limits for ℰ_test(t), ℰ_train(t), and ℳ(t) as functions of (ψ_n, ψ_p, m, λ, t) and the spectrum of C.
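
Under the Gaussian latent model, all three quantities can be estimated by straightforward Monte Carlo; the sketch below (reusing the hypothetical dsm_pairs, fit_rf_ridge, rf_score, and empirical_mixture_score helpers from the earlier sketches) is one possible instantiation, not the paper's code.

```python
import numpy as np

def dsm_metrics(z_train, C, t, p, lam, m=1, n_test=2000, rng=None):
    """Monte Carlo estimates of E_train(t), E_test(t) and the order parameter M(t)
    under the Gaussian latent model P_0^z = N(0, C); reuses the helper sketches above."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, D = z_train.shape
    a_t, h_t = np.exp(-t), 1.0 - np.exp(-2.0 * t)

    # Ridge-regularized random-features DSM fit at time t.
    W = rng.standard_normal((p, D))
    y_tr, u_tr = dsm_pairs(z_train, t, m=m, rng=rng)
    A = fit_rf_ridge(y_tr, u_tr, W, lam)
    e_train = np.mean(np.sum((rf_score(y_tr, A, W) - u_tr) ** 2, axis=1))

    # Test error against the population score s*(t, y) = -(a_t^2 C + h_t I)^{-1} y.
    Sigma_t = a_t ** 2 * C + h_t * np.eye(D)
    z_te = rng.multivariate_normal(np.zeros(D), C, size=n_test)
    y_te, _ = dsm_pairs(z_te, t, m=1, rng=rng)
    s_star = -np.linalg.solve(Sigma_t, y_te.T).T
    e_test = np.mean(np.sum((rf_score(y_te, A, W) - s_star) ** 2, axis=1))

    # Order parameter: alignment with the empirical mixture score under \hat P_t^z.
    y_q, _ = dsm_pairs(z_train, t, m=1, rng=rng)
    s_emp = empirical_mixture_score(y_q, z_train, t)
    mem = np.mean(np.sum((rf_score(y_q, A, W) - s_emp) ** 2, axis=1))
    return e_train, e_test, mem
```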


4. Main theorems (learning curves): asymptotically exact test/train errors for m=∞ and m=1 with general latent covariance C; specialization to isotropic C=I and subspace C=Π_∥; phase diagram and the p ≈ n crossover in latent directions.

We now state the high-dimensional limits of the ridge-DSM estimator defined above. Throughout this section we work at a fixed time t ≥ 0 and suppress the dependence on t in the notation when no confusion arises. The key point is that, under proportional asymptotics, all the quantities of interest—test error ℰ_test(t), train error ℰ_train(t), and the order parameter ℳ(t)—converge to deterministic limits that depend on the latent geometry only through the limiting spectral distribution of C (and, in particular, through intrinsic-dimension surrogates in structured cases).

Let μ_C denote the empirical spectral distribution of C and assume μ_C ⇒ μ weakly as D → ∞. The random-features design at time t is generated by the covariates $y_{i\ell}=a_t z_i+\sqrt{h_t}\xi_{i\ell}$, whose covariance is a_t^2 C + h_t I_D. Under the Gaussian-equivalence principle, the random matrix describing the linearized features admits an asymptotically equivalent Gaussian surrogate whose covariance is an explicit functional of μ and of the activation φ through a finite set of scalar moments (e.g. the first nontrivial Hermite coefficients of φ). Consequently, the learning curves are governed by a finite system of coupled algebraic fixed-point equations for a small number of resolvent traces (Stieltjes transforms evaluated at shifts involving λ). We do not reproduce the full system here; it is obtained by the same linear-pencil argument used for the ambient-space analysis, with D in place of d and with C inserted in the population covariance.

In the idealized regime m = ∞, the DSM objective replaces the average over corruption noise by a conditional expectation, so that the effective regression targets coincide with the empirical-mixture score s^emp(t, ⋅). This is the regime in which nearest-neighbor structure in the empirical latent distribution $\widehat P_0^z$ is most directly available to the learner when t is small.

The qualitative content of the m = ∞ theorem is that the latent learning curves admit a closed-form characterization up to the solution of a small fixed-point system, and this characterization is sensitive to overparameterization in a way that is sharply controlled by the ratios ψ_n and ψ_p. In particular, for small t (so that a_t ≈ 1 and h_t ≪ 1) the empirical-mixture score s^emp(t, ⋅) becomes increasingly localized around individual latent codes, and the estimator can enter a regime in which it closely tracks s^emp under $\widehat P_t^z$ (small ℳ(t)) while simultaneously suffering a large population test error against s^⋆ (large ℰ_test(t)). This is the operational manifestation, in our formalism, of memorization and extraction risk.

For finite noise budget, and in particular for m = 1, the regression task is genuinely noisy even conditional on {z_i}, which modifies both the interpolation properties and the alignment of $\hat s$ with s^emp.

A salient consequence is that the extreme small-t ``memorization blow-up'' present at m = ∞ is not generic at m = 1: the finite-noise regression labels $u_{i\ell}=-(1/\sqrt{h_t})\xi_{i\ell}$ inject label noise that persists even when the covariates become close to the training points. At the level of the limiting equations, this appears as additional variance terms that regularize the effective signal-to-noise ratio and prevent the strong alignment with s^emp that drives ℳ(t) downward in the overparameterized, small-t regime.

The general resolvent characterization specializes cleanly in two cases that will organize our phase diagrams.

When C = I_D, the limits in the theorems above reduce to closed-form expressions depending only on (ψ_n, ψ_p, λ, t) and activation moments. In this setting, the dominant transition is controlled by the ratio p/n = ψ_p/ψ_n. For m = ∞ and small t, the limiting curves exhibit a sharp crossover at p ≃ n: when p < n the estimator is constrained and cannot represent the empirical-mixture score, leading to comparatively larger ℳ(t) and smaller memorization risk; when p > n the estimator can track s^emp closely under $\widehat P_t^z$, and ℰ_test(t) correspondingly increases in a manner consistent with memorization. For m = 1, one still observes a classical double-descent phenomenon in the test curve near p ≃ n, but without the same small-t divergence.
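
For intuition only, a finite-size simulation along the lines below can be used to probe the predicted crossover; it reuses the hypothetical dsm_metrics helper sketched in Section 3, and the specific sizes and seed are arbitrary choices, not results.

```python
import numpy as np

# Illustrative finite-size sweep (assumed setup): isotropic latents C = I, small t,
# moderately large m.  The prediction being probed is that M(t) drops and E_test(t)
# rises once p/n exceeds 1.
rng = np.random.default_rng(1)
D, n, t, lam, m = 64, 128, 0.05, 1e-4, 32
C = np.eye(D)
z_train = rng.multivariate_normal(np.zeros(D), C, size=n)
for ratio in [0.25, 0.5, 1.0, 2.0, 4.0]:
    p = int(ratio * n)
    e_tr, e_te, mem = dsm_metrics(z_train, C, t, p, lam, m=m, rng=rng)
    print(f"p/n={ratio:4.2f}  train={e_tr:9.3f}  test={e_te:9.3f}  M(t)={mem:9.3f}")
```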

When C is (approximately) a rank-r projector, the theory decouples into ``parallel'' (data) directions and ``orthogonal'' directions dominated by diffusion noise. In this case, the relevant effective dimension for memorization is r rather than D, and the p ≃ n transition occurs in the parallel component: the estimator memorizes primarily along the subspace supporting the data, while the orthogonal component contributes a separate, noise-controlled baseline to ℰ_test(t). This decomposition formalizes the intuition that representation collapse changes the capacity available along data directions without changing the ambient latent dimension.

We emphasize that, across these cases, the phase behavior is expressed in terms of the ratios (ψ_n, ψ_p) (or their effective variants under intrinsic-dimension reductions). This ratio-level characterization is precisely what will allow us, in the next section, to translate the phase diagram into explicit and practically checkable threshold conditions under compression and architecture scaling.


5. Compression and architecture scaling: translate the phase diagram from ratios (ψ_n, ψ_p) to practical scalings p = κ D (or parameter-count scaling) and derive explicit threshold conditions in D; discuss when compression reduces memorization and when it can worsen it (if p is held fixed).

The phase behavior described above is most naturally expressed in terms of the proportional ratios
$$ \psi_n=\frac{n}{D}, \qquad \psi_p=\frac{p}{D}, $$
since these are the parameters that remain O(1) in the asymptotic analysis. In applications, however, one typically varies the latent dimension D (by changing the encoder bottleneck) while keeping the dataset size n fixed, and the score network size is coupled to D by architectural design constraints. We therefore translate the ratio-level picture into explicit scaling laws in D, emphasizing when compression provably reduces memorization risk and when it can instead be neutral or even adverse.

A convenient abstraction of many width-scaling rules is that the effective number of random features is proportional to the latent dimension,

$$ p \;=\; \kappa D, $$

for some feature density κ > 0 determined by channel counts and block multiplicities (or, more generally, by an approximately linear scaling of the score-model capacity with D). Under this scaling we have ψ_p = κ and ψ_n = n/D, so varying D moves the operating point along the ψ_n axis at fixed ψ_p = κ in the (ψ_n, ψ_p) plane. Since the dominant small-t memorization transition for m = ∞ is controlled by the overparameterization ratio
$$ \frac{\psi_p}{\psi_n}=\frac{p}{n}, $$
the practical boundary occurs when p ≃ n, i.e. when

$$ \kappa D \;\simeq\; n \quad\Longleftrightarrow\quad D \;\simeq\; \frac{n}{\kappa}. $$

Thus, in this scaling regime, decreasing D decreases p proportionally and can move the learner from the regime p > n (in which the estimator can closely track semp at small t) to the regime p < n (in which such tracking is representationally constrained). This yields an explicit compression law in D.

We stress that this threshold is a statement about the capacity relative to the available latent samples, not about the pixel dimension d. In particular, holding n fixed, the compression benefit is largest when κ is not simultaneously increased to compensate: the relevant quantity is p/n = κD/n. Ridge regularization λ and the time t shift the precise location and height of peaks in ℰ_test(t), but they do not alter the leading-order scaling of the capacity-driven transition.

A different and common experimental practice is to change the encoder and thus D while leaving the score architecture unchanged in a way that keeps p (or an effective width proxy) approximately constant. In our proportional formalism, this corresponds to moving along rays in which ψ_p = p/D varies with D, but the ratio ψ_p/ψ_n = p/n remains constant when n is fixed. Consequently, in the isotropic latent model (and more generally when the phase boundary depends primarily on p/n along the data directions), changing D at fixed p does not by itself move the system across the principal p ≃ n boundary: if p > n (resp. p < n) before compression, then p > n (resp. p < n) after compression. In this sense, purely representational compression with a decoupled score-model size may be largely neutral with respect to the dominant overparameterization-driven memorization mechanism highlighted above.

This neutrality should be interpreted carefully: the learning curves still depend on ψ_n = n/D and on the latent spectrum through C, and hence compression can affect constants (e.g. effective signal-to-noise ratios through a_t^2 C + h_t I). However, the sharp crossover phenomenon associated to interpolation of the empirical-mixture score at small t is governed by the availability of degrees of freedom relative to n, and that availability is not changed by varying D alone if p is held fixed.

Compression can become adverse when architectural choices implicitly increase p as D decreases. One idealized way this arises is if one keeps a fixed total parameter budget rather than a fixed feature count. In the random-features model, the trainable second-layer weights A ∈ ℝ^{D × p} contribute Dp parameters, so fixing a budget P ≈ Dp yields the inverse scaling

$$ p \;\approx\; \frac{P}{D}. $$

If n is fixed, then p/n ≈ P/(Dn) grows as D decreases, and compression can move the system into the overparameterized regime. Indeed, under this scaling, the condition p > n becomes
$$ \frac{P}{D} > n \quad\Longleftrightarrow\quad D < \frac{P}{n}, $$
so sufficiently aggressive compression forces p to exceed n, increasing the propensity to fit s^emp and hence to exhibit memorization-like behavior in the small-t, large-m regime. The same effect appears whenever one ``compensates'' for smaller D by widening the score model so that its effective feature count grows faster than linearly in 1/D.

This observation explains an apparent paradox: although reducing D can reduce memorization under the natural scaling p = κD, it can have the opposite effect under a fixed compute/parameter envelope if that envelope is used to increase width as the latent representation is compressed. The relevant control variable remains p/n (or, in structured cases, its intrinsic-dimension analogue), and any design that increases p/n tends to move toward the memorization-prone side of the diagram.

The preceding translation is most operational in the m = ∞ regime, where small t exposes the empirical-mixture structure and the main risk is strong alignment with s^emp. For m = 1, the same architectural scalings still control classical interpolation phenomena (e.g. the location of double descent near p ≃ n), but the extreme small-t blow-up is suppressed by label noise, so compression is less tightly coupled to extraction risk. Nevertheless, p/n remains the first-order indicator of whether the estimator has sufficient flexibility to match training-specific structure.

In summary, the phase diagram yields explicit design rules once one specifies how p scales with D: under feature-density scaling p = κD, compression decreases p and can move the model below the p ≃ n boundary; under fixed p, compression is largely neutral with respect to that boundary; and under fixed parameter budgets (or compensatory widening) leading to p ∝ 1/D, compression can increase p/n and thereby worsen memorization risk. In the next section we refine these statements beyond isotropy by replacing D with an intrinsic effective dimension determined by the spectrum of C, which captures representation collapse and anisotropy.
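
A compact way to keep these three regimes straight is a small helper that maps a scaling rule to the resulting p and p/n; the function and rule names below are ours, chosen for illustration.

```python
def overparam_ratio(n, D, scaling, kappa=None, p0=None, P=None):
    """Map a width-scaling rule to (p, p/n); rule names follow the discussion above.

    "feature_density": p = kappa * D  -> compression lowers p/n
    "fixed_width":     p = p0         -> compression is neutral for p/n
    "fixed_budget":    p ~ P / D      -> compression raises p/n (second layer has ~ D*p parameters)
    """
    if scaling == "feature_density":
        p = int(kappa * D)
    elif scaling == "fixed_width":
        p = int(p0)
    elif scaling == "fixed_budget":
        p = int(P / D)
    else:
        raise ValueError(f"unknown scaling rule: {scaling}")
    return p, p / n
```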


6. Representation collapse and anisotropy: define effective intrinsic dimension (rank, participation ratio); show how anisotropy modifies learning curves and directional generalization; connect to encoder pathologies (collapsed variance, spiked spectra).

The compression discussion above implicitly treated the latent distribution as essentially isotropic, so that the phase behavior could be summarized in terms of the nominal latent dimension D. In the general latent Gaussian model P_0^z = 𝒩(0, C), however, the learning-curve theorems of Section 4 depend on C only through its limiting spectrum, and the relevant notion of ``dimension'' is therefore spectral rather than nominal. We now isolate the appropriate intrinsic-dimension proxies and explain how anisotropy induces directional generalization and directional memorization.

Let C = UΛU^⊤ with Λ = diag(λ_1, …, λ_D) and empirical spectral distribution $\mu_C=\frac1D\sum_{j=1}^D \delta_{\lambda_j}$. At time t, the forward OU perturbation yields
$$ Z_t \sim \mathcal{N}(0, \Sigma_t), \qquad \Sigma_t := a_t^2 C + h_t I. $$
In particular, along the eigen-direction u_j the marginal variance is a_t^2 λ_j + h_t, and the true score is explicit:
$$ s^\star(t,y) = -\Sigma_t^{-1}y = -\sum_{j=1}^D \frac{\langle y,u_j\rangle}{a_t^2\lambda_j+h_t}\,u_j. $$
Thus the target varies across directions according to the inverse-variance weights (a_t^2 λ_j + h_t)^{−1}. More importantly for memorization, the structure of P_t^z depends on the direction-wise signal-to-noise ratios
$$ \mathrm{SNR}_j(t) := \frac{a_t^2\lambda_j}{h_t}. $$
When SNR_j(t) ≫ 1, the time-t distribution remains strongly influenced by the discrete empirical support of P_0^z along u_j (hence s^emp(t, ⋅) exhibits the sharp sample-localized behavior characteristic of small t); when SNR_j(t) ≪ 1, the diffusion noise dominates and the mixture becomes close to a single Gaussian along that direction, so that s^emp(t, ⋅) is comparatively smooth and less sample-specific. Anisotropy of C therefore produces directional memorization: the score learner may have sufficient degrees of freedom to interpolate the empirical-mixture score in a subset of directions, while remaining effectively regularized in others.

A clean instance is the rank-r projector model C = Π_∥, which captures representation collapse to an r-dimensional subspace. Writing Π_⊥ = I − Π_∥, we have

$$ \Sigma_t \;=\; a_t^2\,\Pi_{\parallel} + h_t I \;=\; (a_t^2+h_t)\,\Pi_{\parallel} + h_t\,\Pi_{\perp} \;=\; \Pi_{\parallel} + h_t\,\Pi_{\perp}, $$
and hence
$$ s^\star(t,y)=-\Pi_{\parallel}y-\frac{1}{h_t}\Pi_{\perp}y. $$
This decomposition exhibits two distinct components. The parallel component carries the data-dependent structure (it is the only component where P_0^z is nontrivial), and it is precisely in this r-dimensional component that the small-t, large-m memorization transition described above manifests. The orthogonal component is ``pure noise'': it is independent of $P_0^z$ and determined solely by the diffusion schedule. Accordingly, the limiting test error naturally decomposes as

$$ \mathcal{E}_{\mathrm{test}}(t)=\mathcal{E}_{\mathrm{test}}^{\parallel}(t)+\mathcal{E}_{\mathrm{test}}^{\perp}(t), $$

with the sharp rise associated to matching $s^{\mathrm{emp}}$ occurring in $\mathcal{E}_{\mathrm{test}}^{\parallel}(t)$ as a function of the effective ratios n/r and p/r (equivalently, with D replaced by r in the proportional parameters along the data subspace). This is the mechanism behind the statement that collapse changes the ``dimension controlling memorization along data directions'': shrinking r increases the effective sample ratio n/r and can mitigate the ability to fit sample-specific structure, even if the ambient latent size D is unchanged.

Outside exact subspace models, a convenient scalar proxy for intrinsic dimension is the participation ratio
$$ D_{\mathrm{eff}} := \frac{(\mathrm{tr}\,C)^2}{\mathrm{tr}(C^2)} = \frac{\left(\sum_{j=1}^D \lambda_j\right)^2}{\sum_{j=1}^D \lambda_j^2}, $$
which interpolates between 1 (fully collapsed to one direction) and D (isotropic). This functional is invariant to orthogonal changes of basis and depends only on μ_C, matching the way C enters the resolvent characterizations of the learning curves. In anisotropic settings, D_eff captures the intuition that a few large eigenvalues (a ``spiked'' spectrum) behave like a low-dimensional signal with many nuisance directions, even when D is large.

Since memorization at small t is driven by the mixture geometry at time t, it is often useful to refine D_eff into a time-dependent effective dimension that discounts directions drowned by diffusion noise. One natural choice is to weight each direction by its contribution to the score-relevant geometry, for instance via the matrix
$$ G_t := C^{1/2}\Sigma_t^{-1}C^{1/2} = U\,\mathrm{diag}\!\left(\frac{\lambda_j}{a_t^2\lambda_j+h_t}\right)U^\top, $$
and define
$$ D_{\mathrm{eff}}(t) := \frac{(\mathrm{tr}\,G_t)^2}{\mathrm{tr}(G_t^2)}. $$
When t is small, D_eff(t) emphasizes the directions with SNR_j(t) ≳ 1 (those that preserve sample separation); when t is large, Σ_t ≈ I and D_eff(t) approaches the usual participation ratio of C. While we do not require a specific proxy in the formal statements, any such scalar summary is intended to replace D in regime diagnostics whenever the spectrum is far from flat.
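
Both proxies are one-line computations given an (estimated) spectrum; a minimal sketch, assuming the eigenvalues are available as an array, follows.

```python
import numpy as np

def participation_ratio(eigs):
    """D_eff = (sum_j lambda_j)^2 / sum_j lambda_j^2 for a covariance spectrum."""
    eigs = np.asarray(eigs, dtype=float)
    return eigs.sum() ** 2 / (eigs ** 2).sum()

def participation_ratio_t(eigs, t):
    """D_eff(t) from G_t, whose eigenvalues are lambda_j / (a_t^2 lambda_j + h_t)."""
    eigs = np.asarray(eigs, dtype=float)
    a_t, h_t = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    g = eigs / (a_t ** 2 * eigs + h_t)
    return g.sum() ** 2 / (g ** 2).sum()
```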

These spectral notions correspond to common representation pathologies. If an encoder exhibits collapsed variance (e.g. dead units or overly aggressive bottleneck regularization), then C develops many near-zero eigenvalues, D_eff decreases, and SNR_j(t) becomes small in most directions; consequently the time-t distribution becomes closer to a single Gaussian in those directions and the empirical-mixture score is less sharp, reducing the opportunity (and incentive) for the score model to align with s^emp there. Conversely, spiked spectra (a few very large eigenvalues) can concentrate mixture geometry into a small set of directions: even if D_eff is moderate, the learner may effectively overparameterize the spiked subspace, fitting sample-specific structure primarily along those directions and thereby increasing directional extraction risk.

We therefore interpret anisotropy as producing a direction-resolved phase diagram: the same global (n, p, m) may place different eigenspaces of C on different sides of the generalization–memorization crossover. The next section operationalizes this viewpoint by estimating the latent spectrum (or D_eff-type proxies) from samples and converting the resulting effective ratios into actionable audit predictions for memorization and score–empirical-score alignment.


7. Representation-space audit algorithms: estimating latent spectra and predicting memorization risk; practical proxies for score–empirical-score alignment; compute–privacy diagnostics.

We now describe a representation-space audit whose inputs are the latent codes {z_i}_{i=1}^n ⊂ ℝ^D (or streaming access thereto), architectural scalings (p, λ) of the score model class, and a characterization of the effective DSM noise budget m (including noise reuse across epochs). The audit produces (i) a spectral summary of the representation via the estimated latent covariance and intrinsic-dimension proxies, (ii) predicted regime indicators for generalization versus memorization at small t, and (iii) practical alignment proxies that correlate with extraction-style behaviors (nearest-neighbor retrieval and membership inference).

Given latent samples, we form the empirical covariance $\hat C=\frac1n\sum_{i=1}^n z_i z_i^\top$. When D is moderate, we may compute the full eigenspectrum of $\hat C$; when D is large, we instead compute the top-k eigenpairs via randomized SVD and approximate the remainder by a trace estimator. Concretely, let $\{\hat\lambda_j\}_{j=1}^k$ denote the leading eigenvalues, and estimate tr($\hat C$) and tr($\hat C^2$) by Hutchinson-type sketches when needed. We then report
$$ \widehat D_{\mathrm{eff}}=\frac{\mathrm{tr}(\hat C)^2}{\mathrm{tr}(\hat C^2)}, \qquad \widehat D_{\mathrm{eff}}(t)=\frac{\mathrm{tr}(\hat G_t)^2}{\mathrm{tr}(\hat G_t^2)}, \qquad \hat G_t:=\hat C^{1/2}\hat\Sigma_t^{-1}\hat C^{1/2},\ \ \hat\Sigma_t:=a_t^2\hat C+h_t I, $$
as well as the empirical distribution of the scalars $\hat g_j(t) := \hat\lambda_j/(a_t^2\hat\lambda_j + h_t)$. The latter serves as a diagnostic for mixture geometry: directions with ĝ_j(t) ≈ 1 retain data separation at time t, whereas directions with ĝ_j(t) ≈ 0 are noise-dominated. In practice, we emphasize that any scalar summary necessarily discards some anisotropic detail; accordingly, we recommend reporting both $\widehat D_{\mathrm{eff}}$-type aggregates and a small set of quantiles of {ĝ_j(t)}.
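
A possible implementation of this estimation step, using an exact SVD for the top eigenvalues and Hutchinson probes for the trace functionals (a randomized range-finder would replace the SVD at very large D), is sketched below; all names are ours.

```python
import numpy as np

def audit_spectrum(Z, k=50, n_probes=32, rng=None):
    """Spectral summaries of the latent covariance from samples Z of shape (n, D).

    Returns the top-k eigenvalues of the centered sample covariance together with
    Hutchinson estimates of tr(C_hat), tr(C_hat^2) and the participation ratio.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, D = Z.shape
    Zc = Z - Z.mean(axis=0, keepdims=True)
    # Top-k eigenvalues of C_hat = Zc^T Zc / n via the singular values of Zc.
    svals = np.linalg.svd(Zc, compute_uv=False)
    top_eigs = (svals[: min(k, svals.size)] ** 2) / n
    # Hutchinson probes: tr(M) ~ average of v^T M v over Rademacher vectors v.
    V = rng.choice([-1.0, 1.0], size=(D, n_probes))
    CV = Zc.T @ (Zc @ V) / n                       # C_hat V without forming C_hat
    tr_C = np.mean(np.sum(V * CV, axis=0))
    CCV = Zc.T @ (Zc @ CV) / n                     # C_hat^2 V
    tr_C2 = np.mean(np.sum(V * CCV, axis=0))
    d_eff = tr_C ** 2 / tr_C2
    return top_eigs, tr_C, tr_C2, d_eff
```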

The principal output of the audit is a predicted risk score R(t) built from proportional parameters
$$ \hat\psi_n(t):=\frac{n}{\widehat D_{\mathrm{eff}}(t)},\qquad \hat\psi_p(t):=\frac{p}{\widehat D_{\mathrm{eff}}(t)}, $$
together with the DSM noise budget m and ridge λ. The simplest indicator, motivated by the small-t behavior in the large-noise regime, is the sign of
$$ \Delta(t):=\frac{\hat\psi_p(t)}{\hat\psi_n(t)}-1=\frac{p}{n}-1, $$
with the understanding that $\widehat D_{\mathrm{eff}}(t)$ determines which directions behave as the effective data subspace but that the sharp interpolation boundary, when it manifests, is controlled by p/n. We therefore output a regime label (low versus high memorization risk) given by a calibrated threshold rule of the form
$$ R(t) \;=\; \mathbf{1}\{\Delta(t) > \delta_0\}\cdot w_m(t)\cdot w_\lambda(t)\cdot w_{\mathrm{anis}}(t), $$
where w_m(t) increases with an estimate of the noise multiplicity m_eff(t), w_λ(t) decreases as λ grows, and w_anis(t) increases when a small fraction of directions dominates tr($\hat G_t$) (a spiked {ĝ_j(t)} profile). In settings where one is willing to numerically instantiate the limiting trace equations, the audit refines R(t) by plugging the empirical spectral measure of $\hat C$ into the corresponding fixed-point system, yielding predicted ℰ_test(t) and ℰ_train(t) curves as a function of (p, n, m, λ).
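
One way to assemble these ingredients into a regime label is sketched below; the specific functional forms chosen for w_m, w_λ, and w_anis are assumptions made for illustration, not prescriptions from the theory.

```python
import numpy as np

def regime_indicator(n, p, t, eigs_hat, m_eff, lam, delta0=0.0, spike_frac=0.1):
    """Heuristic audit label R(t): the p/n comparison modulated by the weights
    w_m, w_lambda, w_anis from the text (their functional forms here are assumptions)."""
    eigs_hat = np.asarray(eigs_hat, dtype=float)
    a_t, h_t = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    g = eigs_hat / (a_t ** 2 * eigs_hat + h_t)              # direction-wise weights g_j(t)
    d_eff_t = g.sum() ** 2 / (g ** 2).sum()                 # time-dependent effective dimension
    delta = (p / n) - 1.0                                   # = psi_p(t)/psi_n(t) - 1
    w_m = 1.0 - 1.0 / max(float(m_eff), 1.0)                # grows with the effective noise budget
    w_lam = 1.0 / (1.0 + lam)                               # shrinks with stronger ridge
    k = max(1, int(spike_frac * g.size))
    w_anis = np.sort(g)[::-1][:k].sum() / g.sum()           # mass carried by the leading directions
    risk = float(delta > delta0) * w_m * w_lam * w_anis
    return {"Delta": delta, "D_eff_t": float(d_eff_t), "risk": float(risk),
            "label": "high" if risk > 0.5 else "low"}
```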

In many training pipelines, the nominal m differs from the number of noise draws per datum (e.g. due to reusing the same noise across epochs, or sharing noises across times). We therefore define a bookkeeping quantity m_eff as the number of statistically independent (t, ξ) perturbations per training point that actually enter the regression objective. When direct instrumentation is possible, m_eff is measured; otherwise, it is bounded using the number of epochs, batch schedule, and random seed management. This matters because large m_eff drives the objective closer to its population-noise limit, amplifying the small-t sample-localized structure that the score model can potentially fit.

To connect regime prediction to extraction behavior, we estimate alignment with the empirical mixture score s^emp(t, ⋅), which is available in closed form at a fixed t given {z_i}:
$$ s^{\mathrm{emp}}(t,y)=\nabla_y \log\Bigl(\frac1n\sum_{i=1}^n \exp\bigl(-\|y-a_t z_i\|^2/(2h_t)\bigr)\Bigr) =\frac{1}{h_t}\Bigl(\sum_{i=1}^n \omega_i(y)\,(a_t z_i-y)\Bigr), $$
where ω_i(y) are the normalized Gaussian-kernel weights. We then define an empirical alignment functional on query points y ∼ P_t^z (or on held-out perturbed latents)
$$ \widehat{\mathcal{M}}(t)=\frac{1}{N}\sum_{\ell=1}^N \bigl\|\hat s(t,y_\ell)-s^{\mathrm{emp}}(t,y_\ell)\bigr\|_2^2, $$
and approximate s^emp by truncating the sum to the top-k nearest neighbors of y among {a_t z_i} (with k ≪ n), since for small t the weights ω_i(y) are sharply localized. In addition, we record the proxy
$$ H_k(t; y) \;:=\; -\sum_{i\in \mathrm{NN}_k(y)} \omega_i^{(k)}(y)\,\log \omega_i^{(k)}(y), $$
whose small values indicate near-single-sample dominance and empirically correlate with nearest-neighbor retrieval and membership inference susceptibility.
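
A minimal sketch of the truncated alignment estimate and the entropy proxy, assuming the learned score has already been evaluated at the query points, is given below.

```python
import numpy as np

def alignment_and_entropy(s_hat_vals, y, z_train, t, k=16):
    """Truncated alignment estimate M_hat(t) and entropy proxy H_k(t; y).

    s_hat_vals : (N, D) learned score evaluated at the query rows of y
    y          : (N, D) query points drawn from the empirical noised distribution
    z_train    : (n, D) training latents defining the empirical mixture
    """
    a_t, h_t = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    centers = a_t * z_train
    d2 = ((y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)     # (N, n) squared distances
    idx = np.argsort(d2, axis=1)[:, :k]                           # k nearest mixture centers
    d2k = np.take_along_axis(d2, idx, axis=1)
    logits = -d2k / (2.0 * h_t)
    logits -= logits.max(axis=1, keepdims=True)                   # stabilize the softmax
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                             # truncated responsibilities
    s_emp_k = (np.einsum("nk,nkd->nd", w, centers[idx]) - y) / h_t
    M_hat = np.mean(np.sum((s_hat_vals - s_emp_k) ** 2, axis=1))
    H_k = -np.sum(w * np.log(np.clip(w, 1e-12, None)), axis=1)    # per-query weight entropy
    return M_hat, H_k
```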

The audit finally reports a compute-aware privacy diagnostic: (i) whether the training configuration sits in a regime where $\widehat{\mathcal{M}}(t)$ is predicted to be small (generalizing) or large (memorizing), and (ii) whether this alignment is concentrated in a small set of spectral directions (spike-dominated). Operationally, we output a tuple
$$ \Bigl(\widehat D_{\mathrm{eff}},\widehat D_{\mathrm{eff}}(t),\hat\psi_n(t),\hat\psi_p(t),m_{\mathrm{eff}},\widehat{\mathcal{M}}(t),\mathrm{Quantiles}(\hat g_j(t))\Bigr), $$
together with a recommended mitigation axis when risk is high: decrease p (or feature density), increase λ, reduce m_eff by refreshing noise less aggressively, or reduce the representation dimension along the high-ĝ_j(t) directions (e.g. by whitening and truncation). These outputs are designed to be compared directly against downstream extraction metrics in the experimental protocol that follows.


8. Experimental plan: latent diffusion training across latent sizes; covariance spectrum estimation; extraction metrics (nearest-neighbor retrieval, membership inference) vs predicted risk; ablations on m (effective noise reuse), regularization, and encoder quality.

We evaluate the theory and the associated audit by training latent score models across a grid of latent dimensions D, while holding the raw dataset and encoder family fixed. Concretely, we select a dataset {x_i}_{i=1}^n ⊂ ℝ^d (with a fixed n across all runs) and train or reuse an encoder E : ℝ^d → ℝ^D that outputs latent codes z_i = E(x_i). We then train denoising score matching (DSM) estimators in latent space at a collection of small times t ∈ T_small (and optionally a broader grid for completeness), using the same forward OU/VP coefficients (a_t, h_t). Our primary experimental axis is D, and we emphasize two scaling regimes for the score model width: (i) p = κD for several fixed feature densities κ (as in Corollary 3), and (ii) p = p_0 independent of D, which decouples representation compression from score-model scaling and serves as a robustness check.

For each (D, p) configuration we train a random-features score model ŝ(t, ⋅) at each t by ridge regression with regularization λ, using DSM targets implied by the perturbation $y=a_t z+\sqrt{h_t}\,\xi$ with ξ ∼ 𝒩(0, I). We implement both m = 1 (one perturbation per datum per epoch) and large-noise settings approaching m = ∞ by increasing the number of independent perturbations. Since the operative quantity in practice is not the nominal m but the number of noises used throughout optimization, we instrument training to record the effective noise multiplicity m_eff(t): the number of independent (t, ξ) draws per z_i that contribute to the objective (accounting for reuse across epochs and any caching). In addition to the idealized extremes m = 1 and m = ∞, we include a controlled ablation where we fix a bank of noises per datum and reuse it over epochs, thereby interpolating between low and high m_eff at fixed compute.

For each encoder and latent size D, we estimate $\hat C=\frac1n\sum_{i=1}^n z_i z_i^\top$ and compute spectral summaries used by the audit: tr($\hat C$), tr($\hat C^2$), $\widehat D_{\mathrm{eff}} = \mathrm{tr}(\hat C)^2/\mathrm{tr}(\hat C^2)$, and the time-dependent quantities built from $\hat\Sigma_t = a_t^2\hat C + h_t I$, namely $\hat G_t = \hat C^{1/2}\hat\Sigma_t^{-1}\hat C^{1/2}$ and $\widehat D_{\mathrm{eff}}(t) = (\mathrm{tr}\,\hat G_t)^2/\mathrm{tr}(\hat G_t^2)$. We also record quantiles of $\hat g_j(t) = \hat\lambda_j/(a_t^2\hat\lambda_j + h_t)$ to diagnose whether only a small set of directions remains data-dominated at the relevant small times. When D is large we compute only the top-k eigenpairs and approximate traces by randomized sketches; we verify that the resulting $\widehat D_{\mathrm{eff}}$ is stable in k and in the number of probe vectors.

Given (n, p, λ, m_eff) and the estimated spectrum, we compute the audit's predicted regime indicator R(t) (and, when feasible, predicted test/train DSM errors obtained by numerically instantiating the limiting fixed-point system with the empirical spectral measure). We then measure empirical latent DSM errors using a held-out set of latents $\{z_i^{\mathrm{test}}\}$, reporting
$$ \widehat{\mathcal{E}}_{\mathrm{test}}(t):=\mathbb{E}_{z\sim \mathrm{test},\,\xi}\bigl\|\hat s(t,a_t z+\sqrt{h_t}\xi)-s^\star(t,a_t z+\sqrt{h_t}\xi)\bigr\|_2^2, $$
approximated by Monte Carlo. Since s^⋆ is unknown, we also report proxy errors against the empirical-mixture score s^emp computed from the training latents, which is the relevant object for extraction-style behavior at small t. In particular, we estimate the alignment functional $\widehat{\mathcal{M}}(t)$ on y ∼ P_t^z using a nearest-neighbor truncation of s^emp(t, y) for computational tractability.

To connect the audit to concrete privacy-relevant behaviors, we consider two classes of extraction metrics. First, nearest-neighbor retrieval: we generate query points y by perturbing either training or held-out latents and compute the top-1 nearest neighbor index among $\{a_t z_i\}_{i=1}^n$ (in Euclidean distance) and compare it to the maximal-weight index under the kernel weights ω_i(y) defining s^emp(t, y). We report (i) the probability that the nearest neighbor is unique and dominates the weights (e.g. max_i ω_i(y) exceeding a threshold), and (ii) the entropy proxy H_k(t; y) averaged over queries. Second, membership inference: we train a classifier to distinguish whether a query y was generated from a training latent or a held-out latent using statistics derived from the learned score, such as ∥ŝ(t, y)∥_2, local score divergence estimates, and the top-k weight concentration of the induced kernel neighborhood. We summarize membership inference by AUC and advantage at fixed false-positive rate. Our hypothesis, aligned with the theory, is that these extraction metrics increase sharply when p/n crosses the interpolation boundary in the relevant effective subspace and when m_eff is large at small t.
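
For the membership-inference summary, any attack statistic derived from the learned score can be converted to an AUC with a rank-based estimator; the dependency-free sketch below is one such implementation, with names of our choosing.

```python
import numpy as np

def rank_auc(scores_members, scores_nonmembers):
    """Rank-based AUC (Mann-Whitney U / (n1*n0)): probability that a member query
    receives a higher attack score than a non-member query; ties broken by sort order."""
    s = np.concatenate([scores_members, scores_nonmembers])
    is_member = np.concatenate([np.ones(len(scores_members), dtype=bool),
                                np.zeros(len(scores_nonmembers), dtype=bool)])
    order = np.argsort(s, kind="mergesort")
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)                # rank 1 = smallest score
    n1, n0 = len(scores_members), len(scores_nonmembers)
    u_stat = ranks[is_member].sum() - n1 * (n1 + 1) / 2.0  # Mann-Whitney U statistic
    return u_stat / (n1 * n0)
```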

We systematically vary λ to test the predicted monotone mitigation effect of ridge regularization on both alignment and extraction metrics, holding (n, p, D, m_eff) fixed. Separately, we vary encoder quality to induce controlled changes in the latent spectrum: (i) using encoders at different training checkpoints, (ii) introducing explicit bottlenecks or orthogonality/whitening penalties to modify anisotropy, and (iii) inducing partial collapse by reducing decoder pressure or adding strong latent regularizers. For each encoder we compare the audit's $\widehat D_{\mathrm{eff}}(t)$ and ĝ_j(t) quantiles against the observed extraction metrics, with the expectation that spike-dominated spectra (small effective dimension along data-dominated directions) concentrate risk and change the relevant threshold when expressed in terms of D_eff.

Across all configurations we report tuples
$$ (D,p,n,\lambda,m_{\mathrm{eff}};\ \widehat D_{\mathrm{eff}},\widehat D_{\mathrm{eff}}(t),\ \widehat{\mathcal{M}}(t);\ \text{retrieval},\ \text{membership-AUC}), $$
and we plot these against the audit's predicted regime boundary as functions of p/n and of the spectral concentration diagnostics. We additionally verify that, under architecture scaling p = κD with n fixed, decreasing D moves the system across the predicted threshold D ≈ n/κ in tandem with a reduction in alignment to s^emp and a concomitant drop in retrieval and membership inference success.


9. Discussion and limitations: beyond Gaussian latents (mixtures), time-conditioned networks, discretization effects; open problems and how the theory could be extended.

Our analysis is organized around the surrogate P_0^z ≃ 𝒩(0, C) (or structured degenerations thereof), which is both mathematically convenient and empirically plausible after aggressive representation learning and whitening. Nevertheless, many practical encoders produce latent distributions that are visibly non-Gaussian: heavy tails, skewness, and (most importantly) class-conditional clustering that is better modeled by mixtures. At a qualitative level, the OU perturbation $Z_t=a_t Z_0+\sqrt{h_t}\xi$ mollifies such non-Gaussianity as t grows, so one expects the Gaussian proxy to become more accurate for moderate t, while the small-t regime—the regime in which our memorization boundary is sharpest—remains sensitive to mixture structure. For a mixture $P_0^z=\sum_{k=1}^K \pi_k \mathcal{N}(\mu_k,C_k)$, the exact score s(t, ⋅) is a soft assignment over components with weights proportional to $\pi_k\,\varphi_t(\cdot;\, a_t\mu_k,\, a_t^2 C_k + h_t I)$, hence the ``data directions'' and effective dimension become point-dependent. We view a rigorous extension to mixtures as requiring new tools: the random-design matrices become heteroscedastic and correlated with the regression target through the posterior responsibilities, so the present resolvent reductions do not directly apply.

Even when P_0^z is not Gaussian, two mechanisms suggest partial robustness of the phase diagram. First, in proportional asymptotics, random-features regression exhibits a degree of universality: many test/train risk expressions depend primarily on low-order moments of the design under appropriate Lindeberg conditions. Second, the OU convolution renders P_t^z smoother and closer to a single Gaussian in total variation as t increases, and one may hope to control the discrepancy between s(t, ⋅) and the Gaussian-score surrogate using stability estimates for score matching under perturbations of the underlying density. Making either statement precise is nontrivial: score errors depend on derivatives of log p_t, and thus on non-asymptotic lower bounds for p_t and regularity of ∇log p_t. We therefore restrict ourselves to using C (or an intrinsic proxy derived from it) as a principled first-order summary rather than claiming distribution-free guarantees. A natural intermediate model class is elliptical P_0^z with a deterministic radial distribution, for which one might retain tractable resolvent formulas while capturing tail effects.

We analyze (and audit) each t separately, which matches the form of the theoretical DSM objective and isolates the dependence on (a_t, h_t). Practical diffusion models, however, learn a single time-conditioned network s_θ(t, ⋅) with shared parameters across t, typically trained with a non-uniform time sampling distribution. This coupling can move the effective capacity away from the nominal p at a fixed t: the model may allocate more degrees of freedom to times that dominate the training measure, and may be under-parameterized at some t even if p ≫ n globally. In a random-features analogue, one may model time conditioning by augmenting the features as $\phi(W z/\sqrt{D},\, u(t))$ for a deterministic embedding u(t), leading to a multi-task ridge regression with a block-structured Gram matrix indexed by time. The corresponding limiting risk should be governed by a matrix-valued fixed-point system (involving the joint spectral distribution of the per-time covariances Σ_t = a_t^2 C + h_t I and the time kernel induced by u). We expect the sharp boundary in terms of an effective p/n ratio to persist, but with a boundary that depends on the time-sampling distribution and on the ``task overlap'' across nearby times.

Our predictions are stated at the level of the learned score at a fixed diffusion time t, whereas privacy-relevant extraction is typically mediated by a discrete-time reverse sampler (Euler–Maruyama for the reverse SDE, or a probability-flow ODE integrator). Discretization introduces two additional layers of approximation: (i) the sampler queries the score only on a time grid and thus may never visit the smallest times where alignment to s^emp is maximal; (ii) numerical error can amplify score inaccuracies in a time-accumulated manner. Accordingly, a complete risk characterization should weight score error not only by P_t^z but also by the visitation distribution of the sampler. One plausible extension is to analyze a discretized reverse recursion whose drift uses the RF score, and to propagate the score approximation error through the recursion using stability bounds for stochastic numerical schemes. In particular, one may obtain an error decomposition into a statistical term (our test-type quantities) and a discretization term depending on step size and on Lipschitz constants of the learned drift.

The audit replaces D by an estimated D_eff (or D_eff(t)), which requires accurate estimation of spectral functionals of C. When n is only moderately larger than D_eff, both tr($\hat C^2$) and the top eigenvalues can be biased, and the participation ratio can be unstable under heavy tails or strong anisotropy. This interacts with the very phenomenon we seek to detect: near the boundary, small perturbations of ψ_n and ψ_p can flip the regime label. We therefore regard the audit as a screening tool whose reliability improves when (a) n is sufficiently large to estimate the spectrum at the needed resolution, and (b) the latent spectrum has a pronounced gap or otherwise yields a stable effective dimension. A principled refinement would attach confidence intervals to D_eff via bootstrap or via non-asymptotic concentration bounds for trace estimators, and propagate these intervals through the phase-boundary rule.

A further limitation is that the latent codes z_i = E(x_i) are not exogenous: the encoder is itself trained on {x_i}_{i=1}^n, so C and the latent distribution depend on the sample. Our asymptotics treat z_i as i.i.d. from a fixed P_0^z, which is appropriate when E is frozen and the downstream score model is trained on new data, but is only approximate in end-to-end settings. Extending the theory to incorporate encoder-induced dependence would require a joint limit in which E is a random object correlated with {x_i}, and one tracks how this correlation alters both the effective spectrum and the DSM targets. We expect the leading-order lesson to remain: collapse or strong anisotropy reduces the number of data-dominated directions and can concentrate memorization risk; however, quantifying this effect requires modeling the encoder training dynamics or imposing stability assumptions on E.

We close by listing concrete extensions suggested by our results. (i) Establish a rigorous universality theorem for latent DSM with non-Gaussian P_0^z, ideally identifying minimal moment and smoothness conditions under which the Gaussian learning curves remain asymptotically valid. (ii) Develop a mixture-aware phase diagram in which the boundary is expressed in terms of a local or component-wise effective dimension, and relate it to cluster-wise extraction phenomena. (iii) Analyze time-conditioned training as a coupled multi-task regression problem and determine how time sampling and parameter sharing shift the boundary. (iv) Integrate discretized sampling into the risk metric, producing an end-to-end prediction for retrieval/membership behavior under a specified sampler. (v) Replace the random-features class with deeper architectures while retaining a tractable asymptotic characterization, for example via linearization at initialization or via kernel limits, thereby connecting representational capacity, memorization, and diffusion-time conditioning in a single framework.