
Regularization-to-Exactness for PSD Feasibility under Rank-One Measurements: Polishing Approximate Ellipsoid Fits


Table of Contents

  1. Introduction and motivation: ellipsoid fitting as PSD feasibility; the 1/4 transition; why approximate-to-exact is the key missing step; overview of contributions and algorithmic corollaries.
  2. Background and related work: SDP formulations; known lower bounds (Saunderson/Chandrasekaran–Parrilo–Willsky, KD23, PTVW23, HKPX23, TW23, BMMP24); MB23’s sharp bounded-approximate transition; discussion of ‘ill-behaved’ solutions and what is not addressed by dimension counting.
  3. Model, assumptions, and operators: isotropic subgaussian setup; the measurement map 𝒜 and Gram operator 𝒢 := 𝒜𝒜*; norms and residual metrics; statement of main theorems.
  4. Universality of bounded-spectrum approximate thresholds: conic/volume comparison framework; reduction to a surrogate Gaussian measurement model; proving the same 1/4 threshold for subgaussian ensembles (with explicit robustness parameters).
  5. Polishing map and deterministic reduction: define Polish(S) explicitly using 𝒢−1; show exact constraint satisfaction; reduce PSD preservation to controlling ∥Δ∥op in terms of residual size and conditioning of 𝒢.
  6. Random operator estimates needed for polishing: conditioning of 𝒢 on the relevant residual subspace; bounds on ∥𝒜*(v)∥op in terms of ∥v∥2; uniform-in-d concentration tools for subgaussian rank-one sums; (flag where constants may require careful chaining).
  7. PSD preservation and stability: eigenvalue perturbation control (Weyl) to ensure S̃ ≽ 0; bound on ∥S̃∥op; sensitivity to residual norms; discussion of failure modes (large residual, poor conditioning).
  8. Minimum nuclear-norm exact solution is PSD: formulate the nuclear-norm program and its dual; construct a dual certificate below threshold ensuring the optimum lies in the PSD cone; derive boundedness and uniqueness properties (when applicable).
  9. Algorithmic corollaries: end-to-end polynomial-time procedure (find bounded approximate fit via SDP/first-order method; polish to exact); complexity and numerical stability; optional practical variant using conjugate gradient without forming 𝒢 explicitly.
  10. Examples and extensions: Gaussian case recovers MB23 and upgrades to exact feasibility conditional on EFP; extension sketches to other cones (PSD+trace constraints; group-invariant cones); discussion of implications for random-feature/quadratic-feature learning models.
  11. Open problems and 2026 directions: pushing polishing to the exact 1/4 boundary; removing bounded-spectrum assumptions entirely; extending to structured (non-i.i.d.) measurements; connections to other Randomstrasse101 problems (synchronization landscapes, Kikuchi matrices) via ‘regularize then certify’ principles.

Content

Introduction and motivation: ellipsoid fitting as PSD feasibility; the 1/4 transition; why approximate-to-exact is the key missing step; overview of contributions and algorithmic corollaries.

We consider the feasibility problem
find S ≽ 0  such that  𝒜(S) = 1,
where 𝒜(S) = (xi⊤Sxi)i ≤ n is induced by random rank-one measurements Xi = xixi⊤. This formulation is equivalent to an ellipsoid fitting problem: if S ≻ 0 then S defines the centered ellipsoid
ℰ(S) := {z ∈ ℝd : z⊤Sz ≤ 1},
and the constraints xi⊤Sxi = 1 assert that each sample point xi lies on the boundary of ℰ(S). Thus (EF) asks for a centered ellipsoid whose boundary passes through all n random points. Although the geometric picture is classical, the random scaling n = αd2 places the problem in a regime where the number of constraints is commensurate with the ambient dimension of 𝕊d, and the feasibility of the affine slice {S : 𝒜(S) = 1} intersected with the PSD cone exhibits sharp high-dimensional behavior.

A naive dimension count suggests that α should control feasibility because dim (𝕊d) = d(d + 1)/2 ≍ d2 while the number of constraints is n = αd2. However, such a count neither predicts a sharp threshold nor captures the role of positivity. In particular, the PSD cone imposes a global curvature constraint, and the equality constraints are quadratic in the underlying vectors xi. Both features create a substantial gap between having “enough degrees of freedom” and “admitting a feasible PSD solution with controlled geometry.” Our focus is precisely on this gap: we aim to identify a distributionally robust transition for feasible solutions and to provide a mechanism that turns robust approximate fits into exact feasible PSD matrices.

Two notions are essential. First, in high dimension one must exclude pathological solutions whose spectrum is unbounded or whose conditioning deteriorates with d. Such solutions, even when they exist, are unstable and are not informative about the typical geometry of the problem. We therefore work in a bounded-spectrum regime, seeking matrices S satisfying
0 ≼ S ≼ MId
for a fixed (or at most slowly growing) M. Second, even below a putative feasibility threshold, exact equality constraints can be algorithmically and analytically delicate; it is natural to relax to an approximate feasibility requirement
∥𝒜(S) − 1∥2  small,
which is also the relevant formulation in noisy data settings. The bounded-spectrum approximate feasibility event EFPδ, M captures precisely this “robustly feasible and well-conditioned” regime.

In the Gaussian case, recent work has identified a sharp transition at sampling ratio α = 1/4 for the existence of bounded-spectrum solutions: below 1/4 one typically can fit a well-behaved ellipsoid up to small residual, while above 1/4 even approximate fitting fails uniformly over bounded M. This transition is striking because it is not the naive α ≈ 1/2 suggested by parameter counting; rather, it reflects a genuinely high-dimensional geometric obstruction. A central question is whether the same constant persists for non-Gaussian designs and, more importantly for our purposes, whether bounded approximate feasibility can be upgraded to feasibility in a stable manner.

The approximate-to-exact step is the missing link for two reasons. First, the existence of an approximate fit does not itself guarantee the existence of an exact PSD fit with comparable spectrum bounds. In principle, the affine constraints might be such that the slice {S : 𝒜(S) = 1} passes near the PSD cone in the bounded region but does not intersect it; approximate solutions could then exist without any exact solution of controlled size. Second, even when exact solutions exist, it is not clear how to produce one from an approximate input without leaving the PSD cone: a direct correction solving 𝒜(Δ) = 𝒜(S) − 1 may introduce a large negative direction, thereby destroying positive semidefiniteness. Thus, to convert approximate feasibility into exact feasibility, one must exploit both the linear-algebraic structure of 𝒜 and the random conditioning of the induced Gram operator.

Our approach is to make the correction step explicit and quantitative. Let r(S) := 𝒜(S) − 1 denote the residual and let 𝒢 := 𝒜𝒜* be the n × n Gram operator with entries (𝒢)ij = (xi⊤xj)2. Formally, if 𝒢 were invertible then the choice
v := 𝒢−1r(S),   Δ := 𝒜*(v)
satisfies 𝒜(S − Δ) = 1 by construction. Moreover, among all corrections in range (𝒜*) that eliminate the residual, Δ is the minimum-Frobenius-norm correction. The obstacle is that 𝒢 may be singular or ill-conditioned, and even when v is well-defined (via the Moore–Penrose inverse), the operator norm ∥Δ∥op must be controlled to preserve S ≽ 0. The core technical content of our “polishing” result is that, below the 1/4 threshold, the random operator 𝒢 is sufficiently well-conditioned on its range and 𝒜* does not amplify vectors v too much in operator norm. This yields a deterministic update rule with a probabilistic guarantee: starting from any bounded-spectrum approximate feasible S, we compute a correction direction Δ and apply either the full step S − Δ or a damped step S − tΔ with an explicit t chosen to ensure S − tΔ ≽ 0.

The random matrix input required for this argument is, in a sense, orthogonal to the existence of approximate solutions. Existence is governed by the high-dimensional geometry of the feasible slice intersected with {0 ≼ S ≼ MI}, while polishing is governed by invertibility and norm control for the operator 𝒜 and its adjoint. The two aspects interact through the residual scale: we require $\|r(S)\|_2\le \delta\sqrt{n/d}$ with δ small enough that the resulting correction is spectrally small. This is precisely why the bounded-spectrum approximate feasibility event is the right input condition: it gives a robust quantitative foothold from which exact feasibility can be reached without losing control of the spectrum.

A second contribution concerns convex regularization. Among all exact solutions to 𝒜(S) = 1, a natural selection rule is minimum nuclear norm:
Ŝ ∈ arg min {∥S∥* : 𝒜(S) = 1}.
For S ≽ 0 the nuclear norm equals Tr (S), so this objective can be interpreted as seeking a smallest-trace ellipsoid fit. However, the optimization problem is posed over all symmetric matrices, and an optimizer could be indefinite. We show that below the 1/4 threshold this does not happen with high probability: every nuclear-norm minimizer is PSD and enjoys an operator-norm bound depending only on (α, ε, K). Conceptually, this ties the geometric threshold phenomenon to a canonical convex program and provides an additional route to “well-behaved” exact solutions without imposing S ≽ 0 as an explicit constraint in the minimization.

We emphasize that our results are not restricted to rotationally invariant ensembles. The measurement vectors xi are assumed only isotropic and subgaussian (in the standard ψ2 sense after scaling by $\sqrt{d}$). This covers Gaussian designs but also encompasses, for example, Rademacher-type designs and many other bounded or weakly dependent models after appropriate preprocessing. The universality statement asserts that the 1/4 threshold for bounded-spectrum approximate feasibility persists under this generality, and the polishing theorem shows that the accompanying linear-algebraic stability needed for approximate-to-exact conversion also persists. In this way, the constant 1/4 emerges as a geometric property of the feasibility problem rather than a special feature of Gaussian symmetry.

Finally, the combination of approximate feasibility below threshold and polishing yields an explicit end-to-end algorithm with polynomial complexity. One may first compute a bounded-spectrum approximate feasible matrix S by a standard convex program (for instance, an SDP enforcing 0 ≼ S ≼ MI together with a small ℓ2 residual constraint). One then applies the polishing map, which requires solving the linear system 𝒢v = r(S) on range (𝒢) and forming Δ = 𝒜*(v). Importantly, this can be implemented without explicitly forming 𝒢: iterative methods such as conjugate gradient only require matrix–vector products w ↦ 𝒢w = 𝒜(𝒜*(w)), which in turn reduce to evaluating quadratic forms xi⊤(∑jwjxjxj⊤)xi. The resulting procedure is deterministic given the data, and our theorems assert that, below the threshold, it succeeds uniformly over all approximate inputs satisfying the bounded-spectrum residual condition.

In summary, we (i) establish a universal 1/4 transition for bounded-spectrum approximate feasibility under isotropic subgaussian designs, (ii) provide a deterministic and stable polishing mechanism that turns such approximate solutions into exact PSD solutions with controlled spectrum, and (iii) show that nuclear-norm minimization produces PSD, bounded exact solutions in the same below-threshold regime. These results identify the correct “robust feasibility” notion for ellipsoid fitting at scale n = αd2 and supply the missing approximate-to-exact step needed to convert geometric existence statements into algorithmically meaningful exact feasibility guarantees.


We briefly situate (EF) within the literature on ellipsoid fitting, semidefinite programming, and high-dimensional feasibility transitions. The basic observation, going back at least to early work on convex relaxations for quadratic constraints, is that the unknown quadratic form can be lifted to a matrix variable. After lifting, the constraints xi⊤Sxi = 1 become linear equalities in S, and the constraint S ≽ 0 is convex. Thus, at the level of modeling, (EF) is already a semidefinite feasibility problem: it asks whether a random affine subspace of 𝕊d intersects the PSD cone. This perspective makes it natural to import tools from random spectrahedra, convex geometric functional analysis, and the algorithmic theory of SDPs.

A closely related family of formulations replaces feasibility by optimization with a convex regularizer. The most common objective is trace minimization,
min  Tr (S)  s.t.  𝒜(S) = 1,  S ≽ 0,
which coincides with nuclear-norm minimization on the PSD cone and can be interpreted geometrically as selecting a smallest-trace (hence, in a certain averaged sense, “smallest volume”) centered ellipsoid consistent with the constraints. One also encounters regularized and relaxed variants, for example
min  Tr (S) + λ∥𝒜(S) − 1∥22  s.t.  0 ≼ S ≼ MId,
or feasibility with tolerance ∥𝒜(S) − 1∥2 ≤ η, as a surrogate for noise or model mismatch. From an algorithmic standpoint these formulations are attractive because they are convex and admit polynomial-time solvers in the worst case, while in practice they can be implemented with first-order methods that access 𝒜 and 𝒜* through matrix–vector operations. From an analytical standpoint, however, the challenge is not convexity but geometry: one must understand when the feasible region is nonempty and, if it is nonempty, whether it contains solutions with controlled spectrum.
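To make these formulations concrete, here is a minimal sketch of the regularized variant in cvxpy. The helper names and parameter choices (λ = 10, M = 3, the problem sizes) are ours and purely illustrative; any SDP-capable solver shipped with cvxpy will do.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
d = 15; n = int(0.15 * d**2); M = 3.0
X = rng.normal(size=(n, d)) / np.sqrt(d)                 # rows x_i ~ N(0, I_d / d)

S = cp.Variable((d, d), symmetric=True)
residual = cp.sum(cp.multiply(X @ S, X), axis=1) - 1.0   # (A(S))_i - 1 = x_i^T S x_i - 1
prob = cp.Problem(
    cp.Minimize(cp.trace(S) + 10.0 * cp.sum_squares(residual)),
    [S >> 0, M * np.eye(d) - S >> 0],                    # bounded-spectrum region 0 <= S <= M*I
)
prob.solve()
print(prob.status, np.linalg.norm(residual.value))
```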

Several works emphasize that the spectrum control is not a technical convenience but an essential feature of the phenomenon. In the absence of an operator-norm bound such as S ≼ MId, one may obtain solutions that are extremely ill-conditioned: the constraints can be satisfied by placing nearly all mass of S on a low-dimensional subspace aligned with atypical correlations among the sample points, at the cost of a very large operator norm and a correspondingly unstable ellipsoid. Such solutions can exist even when the ``typical’’ geometry suggests infeasibility, and they are often invisible to coarse arguments that track only dimension or rank. This motivates focusing on a bounded-spectrum regime, which is the regime in which one expects the feasibility problem to reflect intrinsic high-dimensional geometry rather than degenerate algebraic coincidences.

This distinction also clarifies why a naive parameter count is an unreliable predictor. Although dim (𝕊d) = d(d + 1)/2 and the number of constraints is n = αd2, the PSD constraint is not a linear codimension constraint and cannot be captured by equating degrees of freedom. Moreover, the map S ↦ (xiSxi)i is not a generic linear map on 𝕊d; it is structured by rank-one measurements, and its image and nullspace interact nontrivially with the cone 𝕊+d. Consequently, one can have regimes in which many solutions exist but are ill-behaved, regimes in which no bounded solution exists but near-feasible points exist, and regimes in which even approximate feasibility fails. Understanding which of these regimes occurs at the scale n ≍ d2 requires more than dimension counting.

On the lower-bound side, a recurring theme is that certain convex relaxations become ineffective beyond a constant sampling ratio. Early contributions by Saunderson and by Chandrasekaran–Parrilo–Willsky analyze semidefinite representations and convex geometric obstructions for problems that can be cast as fitting quadratic forms to data, highlighting that, in high dimension, feasibility of structured spectrahedra can fail abruptly and that the transition is driven by global geometry rather than local degrees of freedom. More recent works develop sharper impossibility results and algorithmic lower bounds in related random SDP and random constraint satisfaction settings, including KD23, PTVW23, HKPX23, TW23, and BMMP24. While the technical statements vary across models, a common conclusion is that above certain constant-density thresholds, one cannot simultaneously satisfy all constraints with a well-conditioned PSD matrix, even allowing for small residuals, and that this obstruction is robust under distributional perturbations that preserve isotropy and tail control.

The most directly relevant precedent for our purposes is the sharp bounded-approximate transition established by MB23 in the Gaussian setting. In that work, one considers the intersection of the affine slice {S : 𝒜(S) = 1} (or an approximate version) with the bounded region {0 ≼ S ≼ MId} and proves a high-dimensional phase transition at α = 1/4 for the existence of such bounded approximate fits. The result is qualitatively different from earlier feasibility criteria in two respects. First, it identifies an explicit constant 1/4 at the scale n = αd2, rather than a qualitative “sufficiently small” or “sufficiently large” regime. Second, it is formulated in a robust manner: feasibility is replaced by a residual tolerance that scales naturally with n and d, and the obstruction above threshold persists uniformly over all fixed spectral bounds M. This robustness is essential if one aims to connect the geometric transition to algorithmic procedures and to stability under perturbations.

From the perspective of ellipsoid fitting, the MB23 phenomenon can be interpreted as follows. Below α = 1/4, the random constraints are sparse enough (relative to the curvature of the PSD cone in dimension d) that there typically exist ellipsoids of controlled shape that pass near all points; above α = 1/4, the constraints are dense enough that, even after allowing bounded slack, the affine constraints force the putative ellipsoid to violate positivity or to require eigenvalues that blow up with d. Importantly, this picture is not captured by the heuristic that one needs n ≲ d(d + 1)/2 constraints for solvability: the threshold occurs at α = 1/4 < 1/2, demonstrating that positivity induces an effective loss of degrees of freedom in this regime.

At the same time, the MB23 result leaves open two questions that are central in the present work. The first is universality: whether the same 1/4 threshold persists beyond the rotationally invariant Gaussian ensemble. In many high-dimensional problems governed by convex geometry, Gaussian analysis provides the correct constant but only after one proves that the relevant geometric functional (e.g., widths, intrinsic volumes, or log-volumes) is universal over a class of designs. Establishing such universality is nontrivial here because the constraints are quadratic in the underlying vectors, and the induced measurement matrices are rank-one and highly dependent across entries. The second question is approximate-to-exact conversion: an approximate bounded solution does not, by itself, imply the existence of an exact feasible PSD solution of comparable spectral size, nor does it provide a constructive path to one.

The second question is, in our view, closely connected to a line of work on “polishing” or refinement steps in convex and nonconvex optimization: one often obtains a coarse approximate solution from a tractable relaxation and then improves it by exploiting local linear-algebraic structure. In our setting, the relevant structure is the linear operator 𝒜 and its associated Gram operator, which together determine whether residual errors can be removed without introducing a large negative perturbation. What is specific here is that the ambient dimension is d2 and the measurements are rank-one, so standard restricted isometry heuristics do not directly apply. The appropriate substitute is a statement that the Gram operator is well-conditioned on its range and that the adjoint map does not amplify vectors too much in operator norm; these are the properties that make a deterministic correction step stable in the PSD cone.

Finally, we note that the existence of ill-behaved exact solutions is not ruled out by any of the above transitions and is not the primary object of study. One can ask whether (EF) has a PSD solution (perhaps with ∥S∥op growing rapidly) even above α = 1/4, or whether exact feasibility can hold in a narrow window above the approximate threshold due to rare configurations. These questions are meaningful but orthogonal to the goal pursued here, which is to characterize the existence of well-behaved solutions and to provide a stable mechanism that produces them. Accordingly, our statements are formulated in terms of bounded-spectrum approximate feasibility below threshold, robust nonexistence above threshold in the bounded regime, and a correction procedure that preserves positivity and operator-norm control.

With this background in place, we now fix notation and formalize the model: we specify the isotropic subgaussian design assumptions, define the measurement map and the associated Gram operator, and state our main theorems in a form that isolates the universal 1/4 threshold, the polishing guarantee, and the nuclear-norm regularization consequence.


Model, assumptions, and operators: isotropic subgaussian setup; the measurement map 𝒜 and Gram operator 𝒢 := 𝒜𝒜*; norms and residual metrics; statement of main theorems.

We work in the ambient space 𝕊d of real symmetric d × d matrices, equipped with the Frobenius inner product ⟨U, V⟩F := Tr (UV). Throughout, the dimension d tends to infinity and the number of constraints scales as
n = αd2,
where α ∈ (0, 1/2) is held fixed. All probabilistic statements are with respect to the randomness of the measurement vectors x1, …, xn ∈ ℝd.

We assume that x1, …, xn are i.i.d., mean zero, and isotropic:
$$ \mathbb{E}[x_i]=0,\qquad \mathbb{E}[x_i x_i^\top]=\frac{1}{d}I_d. $$
This normalization ensures that for any deterministic S ∈ 𝕊d we have
$$ \mathbb{E}\big[x_i^\top S x_i\big]=\operatorname{Tr}\!\Big(S\,\mathbb{E}[x_i x_i^\top]\Big)=\frac{1}{d}\operatorname{Tr}(S), $$
so Tr (S) ≈ d is the natural scale if one aims to satisfy xiSxi ≈ 1 on average. In addition, we impose a uniform subgaussian tail assumption: there exists K ≥ 1 such that
$$ \big\|\langle u,\sqrt{d}\,x_i\rangle\big\|_{\psi_2}\le K\qquad \text{for all }u\in S^{d-1}. $$
This hypothesis encompasses the Gaussian model xi ∼ N(0, Id/d) as well as many bounded and product distributions (e.g. Rademacher entries $\pm 1/\sqrt{d}$). Importantly, we do not assume rotational invariance.

Given xi, define the rank-one positive semidefinite measurement matrices
Xi := xixi⊤ ∈ 𝕊+d.
The measurement operator 𝒜 : 𝕊d → ℝn is
(𝒜(S))i := Tr (SXi) = xi⊤Sxi,   i = 1, …, n.
Thus the feasibility constraints xi⊤Sxi = 1 can be written compactly as 𝒜(S) = 1, where 1 ∈ ℝn denotes the all-ones vector. We write 𝒜* : ℝn → 𝕊d for the adjoint map relative to the Frobenius and Euclidean inner products:
$$ \mathcal{A}^*(v)=\sum_{i=1}^n v_i X_i=\sum_{i=1}^n v_i\,x_i x_i^\top . $$
The action of 𝒜* is particularly simple: it forms a weighted sum of rank-one PSD matrices. This structure is the basic reason that polishing can be implemented with matrix–vector primitives, without explicitly materializing a d(d + 1)/2 × n design matrix.
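As a concrete reference point, both maps admit one-line implementations once the samples are stored as the rows of an n × d array. The following is a minimal numpy sketch (the names A_op and A_adj are ours), together with a self-check of the adjoint relation ⟨𝒜(S), v⟩ = ⟨S, 𝒜*(v)⟩F.

```python
import numpy as np

def A_op(S, X):
    """Measurement map: (A(S))_i = x_i^T S x_i for each row x_i of X."""
    return np.einsum('id,de,ie->i', X, S, X)

def A_adj(v, X):
    """Adjoint map: A*(v) = sum_i v_i x_i x_i^T, formed without an explicit loop."""
    return X.T @ (v[:, None] * X)

rng = np.random.default_rng(1)
d, n = 10, 40
X = rng.normal(size=(n, d)) / np.sqrt(d)      # x_i ~ N(0, I_d / d)
S = rng.normal(size=(d, d)); S = (S + S.T) / 2
v = rng.normal(size=n)
assert np.isclose(A_op(S, X) @ v, np.sum(S * A_adj(v, X)))   # adjointness check
```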

The associated Gram operator 𝒢 : ℝn → ℝn is defined by
𝒢 := 𝒜𝒜*.
Equivalently, 𝒢 is the n × n matrix with entries
𝒢ij = ⟨Xi, Xj⟩F = Tr (xixi⊤xjxj⊤) = (xi⊤xj)2.
We emphasize two points that will recur. First, 𝒢 is positive semidefinite but not invertible when n > d(d + 1)/2; hence the Moore–Penrose pseudoinverse 𝒢† is the appropriate inverse on range (𝒢). Second, 𝒢 captures the effective conditioning of the measurement system restricted to the span of the observed rank-one matrices: controlling its nonzero singular values is the main quantitative input behind stable correction steps.

For S ∈ 𝕊d we use ∥S∥op for the operator norm, ∥S∥F for the Frobenius norm, and ∥S∥* for the nuclear norm (which equals Tr (S) when S ≽ 0). For vectors in ℝn we use ∥ ⋅ ∥2 and ∥ ⋅ ∥∞. Given S, the constraint residual is
r(S) := 𝒜(S) − 1 ∈ ℝn.
Our approximate feasibility notion is tailored to the n = αd2 scaling and to the subsequent inversion through 𝒢. Concretely, for parameters δ > 0 and M > 0 we define the bounded-spectrum approximate feasibility event EFPδ, M as the existence of an S ∈ 𝕊d such that
$$ 0\preceq S\preceq M I_d, \qquad \|r(S)\|_2 \le \delta \sqrt{\frac{n}{d}}. $$
The factor $\sqrt{n/d}$ is the natural normalization for the Euclidean residual in this model: under isotropy and bounded spectrum, each coordinate xi⊤Sxi has fluctuations on the order of $1/\sqrt{d}$, so the ℓ2 aggregate over n constraints scales like $\sqrt{n/d}$. This choice is also convenient analytically because it aligns the residual size with the typical scale of 𝒢, whose nonzero spectrum is of order d in the below-threshold regime.
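A quick Monte Carlo illustrates this scale. For a Gaussian design and the trial fit S = Id, the residual coordinates are ri = ∥xi∥2 − 1 with Var(ri) = 2/d, so ∥r∥2 should track $\sqrt{2n/d}$ (a minimal sketch; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
for d in (30, 60, 120):
    n = int(0.2 * d**2)
    X = rng.normal(size=(n, d)) / np.sqrt(d)
    r = (X**2).sum(axis=1) - 1.0                       # residual of S = I_d
    print(d, np.linalg.norm(r), np.sqrt(2 * n / d))    # the two columns track each other
```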

Our exact feasibility problem (EF) is
find S ≽ 0 such that 𝒜(S) = 1.
Even when EFPδ, M holds for small δ, it is not automatic that an exact feasible PSD matrix exists with comparable spectrum. The key observation is that, if S is approximately feasible with residual r = r(S), then any correction Δ satisfying 𝒜(Δ) = r yields an exactly feasible candidate S − Δ. Among all such corrections lying in range (𝒜*), the minimum-Frobenius-norm choice is obtained by solving
v = 𝒢†r,   Δ = 𝒜*(v),
since then 𝒜(Δ) = 𝒜𝒜*(𝒢†r) = 𝒢𝒢†r equals r whenever r ∈ range (𝒢) (and this is the relevant component for residual removal). This motivates the explicit map
Polish(S):   S̃ := S − tΔ,
with a damping parameter t ∈ (0, 1] chosen deterministically to preserve S̃ ≽ 0; a canonical choice is
$$ t:=\min\Big\{1,\ \frac{\lambda_{\min}(S)}{2\|\Delta\|_{\mathrm{op}}}\Big\}. $$
The probabilistic content is then concentrated in showing that below the 1/4 threshold the correction Δ is small in operator norm whenever r(S) is small in ℓ2, so that t = 1 or at least t bounded away from 0 is admissible, and the spectral bound on S is not significantly degraded.

The first result identifies the universal location of the bounded approximate feasibility transition.


The second result turns any bounded approximately feasible point into an exact feasible PSD point by an explicit, stable correction.


Finally, we connect exact feasibility to convex regularization. Consider the nuclear-norm minimization problem
min  ∥S∥*  s.t.  𝒜(S) = 1.
This problem is unconstrained by S ≽ 0 a priori, so it is not immediate that its optimizers are PSD. Below the same threshold, we show that the geometry of the constraints forces PSD optimality and yields a uniform spectral bound.


In combination, Theorems A and B yield an explicit end-to-end route to solving (EF) below threshold: one first produces any bounded approximate feasible point (e.g. by a convex program over {0 ≼ S ≼ MId} with an ℓ2 residual constraint), and then applies Polish by solving linear systems in 𝒢 using iterative methods that access 𝒢v through applications of 𝒜 and 𝒜*. The universality statement ensures that both existence and stability persist for all isotropic subgaussian designs at fixed K, and the subsequent sections isolate the convex-geometric mechanism that pins the transition at α = 1/4 in this generality.


Universality of bounded-spectrum approximate thresholds: conic/volume comparison framework; reduction to a surrogate Gaussian measurement model; proving the same 1/4 threshold for subgaussian ensembles (with explicit robustness parameters).

We now sketch the geometric and probabilistic framework behind Theorem~A, isolating the components that (i) identify the transition at α = 1/4 in the Gaussian model and (ii) transfer that transition, with quantitative robustness, to general isotropic subgaussian designs. The central object is the compact convex “well-behaved” set
𝒦M := {S ∈ 𝕊d: 0 ≼ S ≼ MId},
and the induced random convex body in measurement space
𝒞M := 𝒜(𝒦M) ⊂ ℝn.
By definition, EFPδ, M is the event that 1 lies within Euclidean distance $\delta\sqrt{n/d}$ of 𝒞M, i.e.
$$ \mathrm{EFP}_{\delta,M}\quad\Longleftrightarrow\quad \operatorname{dist}\big(\mathbf{1},\mathcal{C}_M\big)\le \delta\sqrt{\frac{n}{d}}. $$
Thus, bounded-spectrum approximate feasibility is a statement about the random set 𝒞M containing (or nearly containing) a prescribed deterministic point. This reformulation makes it transparent that the problem is convex-geometric (it depends on the shape of 𝒞M) and that the distribution of xi enters only through the random linear map 𝒜.

A convenient way to control dist (1, 𝒞M) is to use the standard dual representation of distance to a closed convex set:
dist (z, C) = sup∥y∥2 ≤ 1 {⟨y, z⟩ − hC(y)},   hC(y) := supc ∈ C⟨y, c⟩.
Specializing to C = 𝒞M = 𝒜(𝒦M), we compute the support function through the adjoint:
h𝒞M(y) = supS ∈ 𝒦M⟨y, 𝒜(S)⟩ = supS ∈ 𝒦M⟨𝒜*(y), S⟩F.
Since S is constrained to satisfy 0 ≼ S ≼ MId, the maximizer aligns S with the positive eigenspace of 𝒜*(y) and saturates the upper spectral bound; explicitly,
h𝒞M(y) = M Tr  ((𝒜*(y))+),
where (H)+ denotes the positive part in the spectral decomposition. Therefore,
dist (1, 𝒞M) = sup∥y∥2 ≤ 1{⟨y, 1⟩ − M Tr  ((𝒜*(y))+)}.
This identity already indicates why α = 1/4 is a spectral/geometric threshold: the competition is between a deterministic linear functional ⟨y, 1⟩ on ℝn and a random spectral functional of the d × d matrix 𝒜*(y) = ∑iyiXi, and the scaling n = αd2 places the supremum over y on a comparable “degrees-of-freedom” scale to the matrix dimension.
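The support-function identity is easy to sanity-check numerically: the maximizer over 𝒦M is M times the projector onto the positive eigenspace of 𝒜*(y), and no other element of 𝒦M does better. A small numpy check under a Gaussian design (sizes and names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, M = 25, 120, 2.0
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.normal(size=n); y /= np.linalg.norm(y)

H = X.T @ (y[:, None] * X)                    # H = A*(y)
lam, V = np.linalg.eigh(H)
h_formula = M * lam[lam > 0].sum()            # M * Tr((A*(y))_+)

Vp = V[:, lam > 0]
S_star = M * (Vp @ Vp.T)                      # saturate M on the positive eigenspace
assert np.isclose(np.sum(H * S_star), h_formula)

for _ in range(200):                          # random elements of K_M never exceed it
    U = np.linalg.qr(rng.normal(size=(d, d)))[0]
    S = U @ np.diag(M * rng.uniform(size=d)) @ U.T   # 0 <= S <= M*I
    assert np.sum(H * S) <= h_formula + 1e-9
```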

While the dual distance formula is useful for deterministic manipulations, MB23’s analysis (in the Gaussian case) proceeds by a closely related “volume of a thickened slice” viewpoint which is robust under universality replacements. Concretely, for δ > 0 we consider the thickened feasible set
$$ \mathcal{F}_{\delta,M}:=\Big\{S\in\mathcal{K}_M:\ \|\mathcal{A}(S)-\mathbf{1}\|_2\le \delta\sqrt{\frac{n}{d}}\Big\}. $$
Nonemptiness of ℱδ, M is equivalent to EFPδ, M, but volume provides additional quantitative structure: one studies the normalized log-volume
$$ \Phi_{\delta,M}:=\frac{1}{d^2}\log \operatorname{Vol}\big(\mathcal{F}_{\delta,M}\big), $$
or a smoothed surrogate of it (e.g. a log-partition function), and shows that Φδ, M concentrates around a deterministic limit ϕδ, M(α) as d → ∞. In such a framework, the feasibility transition corresponds to the point at which ϕδ, M(α) crosses from −∞ (exponentially small volume, hence emptiness w.h.p.) to a finite value (exponentially large volume, hence nonemptiness w.h.p.). Importantly, the normalization by d2 matches the ambient dimension scale of 𝕊d, so the concentration errors must be o(d2) to identify a sharp constant threshold in α.

Let {gi}i ≤ n be i.i.d. Gaussian with gi ∼ N(0, Id/d), and define the corresponding operator 𝒜G and body 𝒞MG = 𝒜G(𝒦M). In this model, rotational invariance and Gaussian comparison tools yield a variational characterization of ϕδ, MG(α) (or directly of dist (1, 𝒞MG)) in terms of an optimization over limiting spectral measures. The key output, established in MB23 in the bounded-spectrum setting, is the existence of a universal critical value αc = 1/4 with a robust gap: for every fixed ε > 0 there exist constants δ±(ε) > 0 such that
$$ \alpha\le \frac14-\varepsilon \ \Longrightarrow\ \mathbb{P}\big[\mathrm{EFP}_{\delta_-,M_*}\big]\to 1 \quad\text{for some finite }M_*, $$
whereas
$$ \alpha\ge \frac14+\varepsilon \ \Longrightarrow\ \mathbb{P}\big[\mathrm{EFP}_{\delta_+,M}\big]\to 0 \quad\text{for every fixed }M. $$
At the level of the dual distance formula, the same statement can be phrased as follows: below 1/4 the random functional
y ↦ ⟨y, 1⟩ − M Tr ((𝒜G*(y))+)
is negative (uniformly over ∥y∥2 ≤ 1) by an amount that dominates $\delta\sqrt{n/d}$, while above 1/4 it remains positive for some y (again by a quantitative margin). The robustness in δ is essential, because it allows us to move between “exact” and “approximate” feasibility without re-solving the limiting variational problem each time.

The remaining task for Theorem~A is to show that the same limiting functional governs the subgaussian model, with constants depending only on (ε, K). We implement this via a two-step universality argument.

First, we pass from the hard feasibility indicator 1{ℱδ, M ≠ ∅} to a smoothed convex functional of the random operator 𝒜. A typical choice is a log-partition function
$$ Z_{\beta}:=\int_{\mathcal{K}_M}\exp\!\Big(-\beta\big\|\mathcal{A}(S)-\mathbf{1}\big\|_2^2\Big)\,dS, \qquad \Psi_{\beta}:=\frac{1}{d^2}\log Z_{\beta}, $$
with β = β(d) growing slowly. If ℱδ, M is empty then Zβ is exponentially small, whereas if ℱδ, M has nonnegligible volume then Zβ is exponentially large; in both cases the scale separation is exponential in d2. Thus, high-probability bounds for Ψβ translate into high-probability feasibility/nonfeasibility statements with explicit δ margins.

Second, we compare 𝔼Ψβ (and then Ψβ itself, by concentration) between the subgaussian design and the Gaussian surrogate. This is the point at which rotational invariance is replaced by an invariance principle: we apply a Lindeberg replacement scheme (or, equivalently, a Chatterjee-type invariance argument) to swap the coordinates of $\sqrt{d}\,x_i$ with those of $\sqrt{d}\,g_i$ one at a time. The required smoothness is supplied by the exponential smoothing in Zβ, and the required derivative bounds are obtained from the spectral truncation 0 ≼ S ≼ MId, which prevents 𝒜(S) from being excessively sensitive to a single coordinate of a single xi. Subgaussianity with parameter K controls the error terms in the replacement through uniform moment bounds, yielding
|𝔼Ψβ − 𝔼ΨβG| ≤ od → ∞(1)   uniformly over fixed α ∈ (0, 1/2) and M, with β ≤ dc,
for a sufficiently small absolute c > 0 (depending on K through moment constants). The same replacement principle also applies to the dual-distance functional after a standard smoothing of the positive-part trace Tr ((⋅)+).

Having matched expectations to the Gaussian surrogate, we upgrade to high probability by concentration for subgaussian chaos. The functionals Ψβ and the smoothed dual distance are Lipschitz (in an appropriate product metric) with Lipschitz constants polynomial in (M, β), and hence satisfy subgaussian concentration around their means. Because the relevant scale is d2, these concentration inequalities deliver deviations of order o(1) for the normalized quantities. Combining (i) the Gaussian threshold with a quantitative gap at α = 1/4 ± ε, (ii) the universality comparison error o(1), and (iii) concentration, we obtain the robust constants δ−(ε, K), δ+(ε, K), and a finite M*(α, ε, K) as stated in Theorem~A. In particular, below 1/4 we obtain nonemptiness of ℱδ−, M* with probability 1 − o(1), while above 1/4 the same argument implies that even the thickened slice remains empty uniformly over bounded M.

We emphasize that Theorem~A is only an existence statement for a bounded approximate fit; it does not by itself produce an exact feasible PSD solution. Its role is to place us in a regime where (a) there exists some S with controlled spectrum and small residual in the natural ℓ2 scale, and (b) the constants are uniform over all isotropic subgaussian designs with parameter K. The next step is therefore deterministic: given such an S, we explicitly remove the residual by inverting the constraint system on its range. This is exactly the purpose of the polishing map, which we treat next.


Polishing map and deterministic reduction: define Polish(S) explicitly using 𝒢−1; show exact constraint satisfaction; reduce PSD preservation to controlling ∥Δ∥op in terms of residual size and conditioning of 𝒢.

Fix an input matrix S ∈ 𝕊d and write its residual as
r = r(S) := 𝒜(S) − 1 ∈ ℝn.
We seek a correction Δ ∈ 𝕊d such that 𝒜(S − Δ) = 1, i.e. 𝒜(Δ) = r. A natural restriction is to search for Δ in range (𝒜*), since corrections of the form 𝒜*(v) are explicitly representable as weighted rank-one sums and are precisely those produced by the normal equations. Recall the Gram operator
𝒢 := 𝒜𝒜* ∈ ℝn × n,   (𝒢)ij = ⟨Xi, Xj⟩F = (xi⊤xj)2.
Whenever 𝒢 is invertible (equivalently, 𝒜 has full row rank), the constraint 𝒜(Δ) = r is solved by choosing v = 𝒢−1r and setting
$$ \Delta:=\mathcal{A}^*(v)=\sum_{i=1}^n v_i X_i, \qquad \widetilde{S}:=S-\Delta. $$
Indeed,
𝒜(Δ) = 𝒜𝒜*(v) = 𝒢v = r,
and therefore 𝒜(S̃) = 𝒜(S) − 𝒜(Δ) = 1. This is the basic polishing step. In the random regime n = αd2 with α < 1/2, the domain dimension dim (𝕊d) = d(d + 1)/2 dominates n, and for the ensembles under consideration one expects (and later proves with high probability) that 𝒜 has full row rank, so that 𝒢−1 exists and the above formulas are unambiguous.

To separate the deterministic algebra from the probabilistic full-rank claim, we define polishing using the Moore–Penrose inverse. Let 𝒢† denote the pseudoinverse and set
v := 𝒢†r,   Δ := 𝒜*(v),   Polish(S) := S − Δ,
with the understanding that 𝒢† = 𝒢−1 when 𝒢 is invertible. Since 𝒢𝒢† is the orthogonal projector onto range (𝒢) = range (𝒜), we always have the identity
𝒜(Δ) = 𝒢𝒢†r = Projrange (𝒜)(r).
Thus 𝒜(Polish(S)) = 1 holds whenever r ∈ range (𝒜), which is automatic if 𝒜 is surjective onto ℝn (full row rank). More generally, if there exists any exact feasible matrix S⋆ (not necessarily PSD) with 𝒜(S⋆) = 1, then
r = 𝒜(S) − 𝒜(S⋆) = 𝒜(S − S⋆) ∈ range (𝒜),
and again the pseudoinverse correction enforces exact constraint satisfaction. In either case, the definition reduces the problem of producing an exact feasible point to controlling the side effects of subtracting Δ.

The choice v = 𝒢†r is canonical: among all v solving 𝒢v = r (when solutions exist), it minimizes ∥v∥2, and consequently Δ = 𝒜*(v) is the minimum-Frobenius-norm correction among all corrections constrained to lie in range (𝒜*). Concretely, if we minimize ∥Δ∥F subject to Δ = 𝒜*(v) and 𝒜(Δ) = r, then ∥Δ∥F2 = ⟨v, 𝒢v⟩ and the normal equations yield v = 𝒢†r. This optimality is useful because it shows that any bound we prove for Δ is, in a precise sense, the best possible bound for corrections that do not exploit the (typically large) nullspace of 𝒜.
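For completeness, here is the two-line orthogonality argument behind this optimality. Any correction with 𝒜(Δ) = r splits as Δ = 𝒜*(v) + W with W ∈ ker 𝒜, and range (𝒜*) = ker (𝒜)⊥ in the Frobenius inner product, so
$$ \|\Delta\|_F^2=\|\mathcal{A}^*(v)\|_F^2+\|W\|_F^2 . $$
The minimum is attained at W = 0, and applying 𝒜 to Δ = 𝒜*(v) gives the normal equations 𝒢v = r, whose minimum-ℓ2 solution is v = 𝒢†r; any other solution differs by an element w ∈ ker 𝒢, which satisfies 𝒜*(w) = 0 (since ⟨w, 𝒢w⟩ = ∥𝒜*(w)∥F2) and hence leaves Δ unchanged.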

Assume now that the input S lies in the bounded-spectrum cone 𝒦M, i.e. 0 ≼ S ≼ MId. After polishing, exact feasibility is automatic as above, but positivity is not. Deterministically, we control the loss of PSD via the operator norm of Δ. By Weyl’s inequality,
λmin(S − Δ) ≥ λmin(S) − ∥Δ∥op.
Hence S − Δ ≽ 0 provided ∥Δ∥op ≤ λmin(S). Since λmin(S) may be small or even zero, we introduce a damped variant: for
$$ t:=\min\Big\{1,\ \frac{\lambda_{\min}(S)}{2\|\Delta\|_{\mathrm{op}}}\Big\}, \qquad \widetilde{S}:=S-t\Delta, $$
we always have S̃ ≽ 0 because λmin(S̃) ≥ λmin(S) − t∥Δ∥op ≥ λmin(S)/2. On the high-probability event where ∥Δ∥op is sufficiently small relative to λmin(S) for all inputs of interest, the damping is inactive (so t = 1) and we simultaneously obtain exact feasibility and PSD. Accordingly, the probabilistic work in Theorem~B is organized to guarantee that, below threshold and for residuals at the ℓ2 scale $\|r\|_2\lesssim \sqrt{n/d}$, the correction Δ is small in operator norm uniformly over all bounded-spectrum inputs.

We now isolate the quantitative inequality we will later prove with high probability. By definition,
v = 𝒢†r,   Δ = 𝒜*(v).
The first step is to control ∥v∥2 via the conditioning of 𝒢 on its range:
$$ \|v\|_2 \;=\;\|\mathcal{G}^\dagger r\|_2 \;\le\; \frac{\|r\|_2}{\sigma_{\min}(\mathcal{G}\vert_{\operatorname{range}})}. $$
Thus, any lower bound of the form σmin(𝒢|range) ≳ d immediately converts the natural residual scale $\|r\|_2\lesssim \delta\sqrt{n/d}\asymp \delta\sqrt{d}$ into
$$ \|v\|_2 \lesssim \frac{\delta}{\sqrt{d}}. $$
The second step is to control the operator norm of 𝒜*(v) = ∑iviXi in terms of ∥v∥2 (and, if needed, ∥v∥∞):
∥Δ∥op = ∥𝒜*(v)∥op ≤ ∥𝒜*∥2 → op ∥v∥2,
where ∥𝒜*∥2 → op := sup∥v∥2 = 1∥𝒜*(v)∥op. For rank-one subgaussian designs, the relevant bound takes the form
$$ \big\|\mathcal{A}^*(v)\big\|_{\mathrm{op}} \le C(K)\Big(\frac{\|v\|_2}{\sqrt{d}}+\|v\|_\infty\Big), $$
and in the regime where v is not extremely sparse (which is the typical output of 𝒢†r when 𝒢 is well-conditioned), the ∥v∥∞ term is negligible compared to $\|v\|_2/\sqrt{d}$. Combining these inequalities yields the deterministic reduction
$$ \|\Delta\|_{\mathrm{op}} \;\lesssim\; \frac{\|r\|_2}{\sqrt{d}\ \sigma_{\min}(\mathcal{G}\vert_{\operatorname{range}})} \;+\; \| \mathcal{G}^\dagger r\|_\infty, $$
so that polishing preserves PSD as soon as the Gram operator is well-conditioned and the adjoint sum is not dominated by a single coefficient.

The preceding bounds explain why the ℓ2 residual scale in EFPδ, M is natural for polishing. If $\|r\|_2\le \delta\sqrt{n/d}\asymp \delta\sqrt{d}$ and σmin(𝒢|range) ≳ d, then $\|v\|_2\lesssim \delta/\sqrt{d}$. Feeding this into $\|\mathcal{A}^*(v)\|_{\mathrm{op}}\lesssim \|v\|_2/\sqrt{d}$ gives the heuristic magnitude
$$ \|\Delta\|_{\mathrm{op}}\;\lesssim\;\frac{\delta}{d}, $$
which vanishes as d → ∞. In particular, below the 1/4 threshold the correction is asymptotically tiny in operator norm, so once we know that bounded-spectrum approximate feasibility provides at least some slack away from the PSD boundary (or once we verify uniformly that t = 1 in the damping rule), the one-step update S ↦ S − Δ produces an exact feasible PSD point. This is the sense in which polishing is a deterministic reduction: all randomness is now confined to verifying (i) spectral bounds for 𝒢 and (ii) uniform operator-norm bounds for 𝒜*(v), both in a form that holds simultaneously for all admissible v arising from residuals of admissible S.

Although the definition of Polish(S) is expressed through 𝒢, it is implementable without forming 𝒢 as an n × n matrix. To apply 𝒢 to a vector u ∈ ℝn, we use the factorization 𝒢u = 𝒜(𝒜*(u)):
$$ \mathcal{A}^*(u)=\sum_{i=1}^n u_i X_i,\qquad (\mathcal{A}(\mathcal{A}^*(u)))_j = \sum_{i=1}^n u_i\,\langle X_i,X_j\rangle_F = \sum_{i=1}^n u_i\,(x_i^\top x_j)^2. $$
Thus, solving 𝒢v = r (or computing 𝒢†r) can be done by conjugate gradient or MINRES using only matrix–vector products with 𝒢, and the final correction Δ = 𝒜*(v) is a single weighted rank-one sum. The deterministic analysis above therefore directly translates into a stable numerical recipe, provided the probabilistic estimates guarantee that the linear system is well-conditioned in the relevant regime. The next step is to establish precisely these random operator estimates, uniformly in d and with constants depending only on (ε, K), which is where the threshold α = 1/4 re-enters through the conditioning of 𝒢.
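A minimal matrix-free sketch of one polishing step, assuming a recent SciPy (where cg accepts rtol). The helper names, the damping rule taken from the text, and the problem sizes are ours; the PSD outcome below threshold is the content of the theorems rather than of the code.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def polish(S, X):
    """One polishing step: solve G v = r by CG using only A / A* products,
    then apply the damped update S - t * A*(v) with t = min(1, lmin / (2 ||Delta||_op))."""
    n, d = X.shape
    A = lambda T: np.einsum('id,de,ie->i', X, T, X)     # A(T)_i = x_i^T T x_i
    Aadj = lambda u: X.T @ (u[:, None] * X)             # A*(u) = sum_i u_i x_i x_i^T
    r = A(S) - 1.0

    G = LinearOperator((n, n), matvec=lambda u: A(Aadj(u)), dtype=float)
    v, info = cg(G, r, rtol=1e-10, maxiter=10 * n)      # never forms the n x n Gram matrix

    Delta = Aadj(v)
    lmin = np.linalg.eigvalsh(S)[0]
    t = min(1.0, lmin / (2 * np.linalg.norm(Delta, 2)))
    return S - t * Delta, t

rng = np.random.default_rng(4)
d = 40; n = int(0.05 * d**2)                            # alpha = 0.05, well below 1/4
X = rng.normal(size=(n, d)) / np.sqrt(d)
S1, t = polish(np.eye(d), X)
# Below threshold the damping is typically inactive (t = 1): the constraints are
# then met exactly and the smallest eigenvalue typically stays positive.
print(t, np.abs(np.einsum('id,de,ie->i', X, S1, X) - 1).max(), np.linalg.eigvalsh(S1)[0])
```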


Random operator estimates needed for polishing: conditioning of 𝒢 on the relevant residual subspace; bounds on ∥𝒜*(v)∥op in terms of ∥v∥2; uniform-in-d concentration tools for subgaussian rank-one sums; (flag where constants may require careful chaining).

We now state the probabilistic inputs that make the deterministic reduction effective. Throughout we work on a single high-probability event (depending only on (ε, K)) on which two properties hold simultaneously: (i) the Gram operator 𝒢 = 𝒜𝒜* is uniformly well-conditioned on its range, and (ii) the adjoint map 𝒜* sends vectors in ℝn to symmetric matrices with controlled operator norm. On this event, for every bounded-spectrum input S with sufficiently small residual $\|r(S)\|_2\lesssim \sqrt{n/d}$, the pseudoinverse vector v = 𝒢†r(S) is small in ℓ2, and therefore the correction Δ = 𝒜*(v) is small in operator norm. The threshold α = 1/4 enters precisely through the lower edge of the spectrum of 𝒢: below 1/4 we obtain a uniform spectral gap away from 0, whereas above 1/4 the gap closes at the relevant d-dependent scale, preventing stable inversion.

To control σmin(𝒢|range) and σmax(𝒢), it is convenient to view 𝒜 as a random matrix with rows given by the “lifted” observations Xi = xixi⊤ acting by Frobenius inner product. Concretely, identifying 𝕊d with ℝp (p = d(d + 1)/2) through an orthonormal basis, 𝒜 becomes an n × p matrix whose ith row is vec(Xi). Then 𝒢 = 𝒜𝒜* is the Gram matrix of these lifted rows, and its nonzero singular values coincide with the squared singular values of 𝒜. The issue is that Xi is not isotropic as a vector in ℝp: its distribution has a distinguished trace direction proportional to Id, and a trace-zero component that is approximately isotropic. Accordingly, we decompose
$$ X_i \;=\; \frac{\operatorname{Tr}(X_i)}{d}\,I_d \;+\; \Big(X_i-\frac{\operatorname{Tr}(X_i)}{d}\,I_d\Big) \;=\; X_i^{\parallel} + X_i^{\perp}, $$
where Xi⊥ ∈ {S ∈ 𝕊d : Tr (S) = 0}. The bulk conditioning is governed by the trace-zero part, while the trace part contributes a low-dimensional component whose effect can be treated separately (and absorbed into the constants in Lemma~1). In the Gaussian case, this decomposition is exact enough that 𝒢 reduces, after a deterministic rescaling and a rank-one adjustment, to a Wishart-type matrix; the lower spectral edge then behaves like a constant multiple of $(1-2\sqrt{\alpha})^2$ and is strictly positive if and only if α < 1/4. For isotropic subgaussian designs, we prove a corresponding statement by a universality argument for extreme singular values of such lifted sample covariance operators.
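The decomposition is an exact Frobenius-orthogonal split, which a two-line check makes concrete (numpy; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 30
x = rng.normal(size=d) / np.sqrt(d)
Xi = np.outer(x, x)
X_par = (np.trace(Xi) / d) * np.eye(d)      # component along the trace direction I_d
X_perp = Xi - X_par                         # trace-zero component
assert abs(np.trace(X_perp)) < 1e-12
assert abs(np.sum(X_par * X_perp)) < 1e-12  # <X_par, X_perp>_F = 0
```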

For polishing we do not need a sharp limiting law; it suffices to prove a uniform spectral sandwich with constants depending only on (ε, K):
c1(ε, K) d ≤ σmin(𝒢|range) ≤ σmax(𝒢) ≤ c2(ε, K) d,
as recorded in Lemma~1. One way to organize the proof is through the quadratic form representation
$$ u^\top \mathcal{G}\,u \;=\; \big\|\mathcal{A}^*(u)\big\|_F^2 \;=\; \Big\|\sum_{i=1}^n u_i X_i\Big\|_F^2, \qquad u\in\mathbb{R}^n, $$
together with the observation that range (𝒢) = range (𝒜) is precisely the subspace on which u ↦ 𝒜*(u) is injective. Thus, lower-bounding σmin(𝒢|range) is equivalent to proving a uniform inequality of the form
$$ \Big\|\sum_{i=1}^n u_i X_i\Big\|_F \;\ge\; c\sqrt{d}\,\|u\|_2 \qquad \text{for all }u\in \operatorname{range}(\mathcal{G}), $$
and similarly for the upper bound. Because n ≍ d2, naive ε-net arguments over Sn − 1 are impossible; instead we appeal to random-matrix results for sample covariance operators with fixed aspect ratio, combined with a reduction that isolates the trace direction and applies an extreme-singular-value theorem on the remaining (p − 1)-dimensional trace-zero subspace. The point at which α = 1/4 enters is that the effective aspect ratio in the trace-zero component is 4α (rather than 2α), reflecting the quadratic (rank-one) lifting; below 1/4 this yields a strictly positive lower edge, while above 1/4 it degenerates.

The second input is a bound of Lemma~2 type:
$$ \big\|\mathcal{A}^*(v)\big\|_{\mathrm{op}} \;=\;\Big\|\sum_{i=1}^n v_i x_i x_i^\top\Big\|_{\mathrm{op}} \;\le\; C(K)\Big(\frac{\|v\|_2}{\sqrt{d}}+\|v\|_\infty\Big), \qquad v\in\mathbb{R}^n, $$
on a high-probability event. The ∥v∥∞ term accounts for the possibility that a single unusually large coefficient multiplies a rank-one matrix of size comparable to ∥xi∥22; it is unavoidable in complete generality. The key point is the $\|v\|_2/\sqrt{d}$ scaling, which captures the averaging when the weights are spread across many indices and the design is isotropic. A convenient proof strategy is to bound the quadratic form of 𝒜*(v) uniformly over unit vectors u ∈ Sd − 1:
$$ u^\top \mathcal{A}^*(v)\,u \;=\;\sum_{i=1}^n v_i\, (u^\top x_i)^2, $$
so that ∥𝒜*(v)∥op = supu ∈ Sd − 1|∑ivi(u⊤xi)2|. For fixed u, the summands are independent, mean-∥u∥22/d (up to scaling) subexponential random variables with ψ1 norms controlled by K2/d; Bernstein-type bounds yield concentration at the correct scale. The remaining difficulty is to make this uniform over all u, which requires chaining or a carefully constructed net.

We implement the uniform control by combining three standard steps. First, we truncate on the event maxi∥xi∥22 ≤ C(K), which holds with probability 1 − exp (−cd) by subgaussian concentration of ∥xi∥2. This removes rare large-norm points and allows us to treat each xixi⊤ as uniformly bounded in operator norm. Second, we symmetrize:
$$ \sup_{u\in S^{d-1}}\Big|\sum_{i=1}^n v_i \big((u^\top x_i)^2-\mathbb{E}(u^\top x_i)^2\big)\Big| \;\le\; 2\,\mathbb{E}_\varepsilon\sup_{u\in S^{d-1}}\Big|\sum_{i=1}^n \varepsilon_i v_i (u^\top x_i)^2\Big|, $$
where (εi) are i.i.d. Rademacher signs. Third, conditional on (xi), we control the Rademacher process indexed by u using generic chaining (or, at a weaker level, an ε-net on Sd − 1 together with a Lipschitz extension argument). The chaining route is preferable because it avoids extra log d factors that become relevant near α ≈ 1/4 where we must preserve a quantitative gap depending on ε. The outcome is that, with high probability, the map v ↦ 𝒜*(v) behaves like a subgaussian operator from (ℝn, ∥ ⋅ ∥2) into (𝕊d, ∥ ⋅ ∥op) up to the unavoidable ∥ ⋅ ∥∞ correction.
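The shape of the Lemma~2 bound can be probed empirically: for random unit vectors v, the ratio of ∥𝒜*(v)∥op to $\|v\|_2/\sqrt{d}+\|v\|_\infty$ stays of order one (a Monte Carlo sketch; the constant is not calibrated to any particular K):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 50; n = int(0.2 * d**2)
X = rng.normal(size=(n, d)) / np.sqrt(d)

ratios = []
for _ in range(50):
    v = rng.normal(size=n); v /= np.linalg.norm(v)
    op = np.linalg.norm(X.T @ (v[:, None] * X), 2)     # ||A*(v)||_op
    ratios.append(op / (1 / np.sqrt(d) + np.abs(v).max()))
print(min(ratios), max(ratios))                        # both O(1) across trials
```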

The conditioning estimate for 𝒢 is used only through v = 𝒢†r, so in principle we need control of 𝒢 only on vectors of the form r(S) for admissible S. We do not exploit this reduction and instead prove a stronger statement: 𝒢 is well-conditioned on all of range (𝒢), uniformly over d once α ≤ 1/4 − ε is fixed. This uniformity is essential for the ``simultaneously for all inputs’’ formulation in Theorem~B. Combining Lemma~1 and Lemma~2 gives, on the high-probability event,
$$ \|\Delta\|_{\mathrm{op}} \;=\;\big\|\mathcal{A}^*(\mathcal{G}^\dagger r)\big\|_{\mathrm{op}} \;\le\; C(K)\Big(\frac{\|\mathcal{G}^\dagger r\|_2}{\sqrt{d}}+\|\mathcal{G}^\dagger r\|_\infty\Big) \;\le\; C(K)\Big(\frac{\|r\|_2}{\sqrt{d}\,\sigma_{\min}(\mathcal{G}\vert_{\operatorname{range}})}+\|\mathcal{G}^\dagger r\|_\infty\Big), $$
which is the quantitative form needed to feed into the deterministic PSD-preservation rule through Weyl-type perturbation bounds. The only term not directly controlled by Lemma~1 is ∥𝒢†r∥∞, whose treatment depends on how much delocalization one can prove for 𝒢† and on how sharply one needs constants; in our setting it suffices to show that for residuals at the ℓ2 scale of EFPδ, M, the solution v to 𝒢v = r is typically not supported on o(n) indices.

Two places require care if one wants explicit dependence on ε as α ↑ 1/4. First, the lower edge of the relevant spectrum of 𝒢 deteriorates polynomially in $(1-2\sqrt{\alpha})$, so any argument that loses unnecessary logarithmic factors (for instance by applying a crude union bound over an exponential net in a dimension comparable to d2) will not suffice. This is why we rely on extreme-singular-value results for subgaussian sample covariance operators with fixed aspect ratio, which provide the correct edge scaling without extraneous dimension-dependent constants. Second, in bounding supu ∈ Sd − 1∑ivi(u⊤xi)2, the natural metric for chaining is induced by the empirical second moment ∑ivi2(u⊤xi)2, which itself fluctuates with (xi); controlling these fluctuations uniformly in u requires a two-level argument (conditioning on (xi), then applying concentration for the induced metric). We flag this as the main technical pressure point: it is straightforward away from the threshold, but near α = 1/4 it is advantageous to use a fully fledged empirical-process tool (Dudley integral with a tail bound on covering numbers in the random metric) rather than a fixed deterministic net.

On the intersection of the high-probability events provided by Lemma~1 and Lemma~2 (and the mild additional delocalization estimate implicit in controlling ∥𝒢†r∥∞), we obtain that for every bounded-spectrum S with $\|r(S)\|_2\le \delta\sqrt{n/d}$, the polishing correction satisfies ∥Δ∥op ≤ Cε, K δ (with the stronger vanishing behavior in d when the scaling constants permit it). This is precisely the regime in which the deterministic damping rule preserves positivity while enforcing exact feasibility, and it is also the regime in which the operator norm of the output remains within a controlled enlargement of the original bound M. The remainder of the polishing proof consists of combining these random operator estimates with eigenvalue perturbation bounds to obtain a uniform, stable map from approximate bounded feasibility to exact PSD feasibility.


PSD preservation and stability: eigenvalue perturbation control (Weyl) to ensure S̃ ≽ 0; bound on ∥S̃∥op; sensitivity to residual norms; discussion of failure modes (large residual, poor conditioning).

We now record the deterministic perturbation estimates that convert a quantitative bound on the correction direction into preservation of positivity and a controlled enlargement of the operator norm. Let S ∈ 𝕊d be an input matrix with S ≽ 0, and let r = r(S) = 𝒜(S) − 1 ∈ ℝn be its residual. Given any vector v ∈ ℝn and the associated matrix
$$ \Delta=\mathcal{A}^*(v)=\sum_{i=1}^n v_i x_i x_i^\top, $$
we consider updates of the form S̃ = S − tΔ for a step size t ∈ [0, 1]. Two constraints are relevant: exact feasibility 𝒜(S̃) = 1, and positivity S̃ ≽ 0.

If we choose v = 𝒢†r, then
𝒜(Δ) = 𝒜𝒜*(𝒢†r) = 𝒢𝒢†r = Prange (𝒢)r,
where Prange (𝒢) denotes orthogonal projection. On the event that 𝒢 has full rank (equivalently, range (𝒢) = ℝn), we obtain 𝒜(Δ) = r and hence
𝒜(S − Δ) = 𝒜(S) − 𝒜(Δ) = 1.
Thus, in the well-conditioned regime where 𝒢 is invertible (or at least where r lies in range (𝒢)), the undamped step t = 1 enforces exact feasibility in a single correction. Damping is introduced solely as a safeguard for positivity, and in the regime relevant for Theorem~B we will be able to argue that the safeguard is inactive, i.e. t = 1 holds automatically because Δ is small.

To guarantee S̃ ≽ 0, it suffices to control the smallest eigenvalue under the perturbation tΔ. By Weyl’s inequality,
λmin(S̃) = λmin(S − tΔ) ≥ λmin(S) − t ∥Δ∥op.
Consequently, if we choose
$$ t \;\le\; \frac{\lambda_{\min}(S)}{\|\Delta\|_{\mathrm{op}}}, $$
then λmin(S̃) ≥ 0 and S̃ ≽ 0. The explicit damping rule used in Theorem~B,
$$ t:=\min\Big\{1,\frac{\lambda_{\min}(S)}{2\|\Delta\|_{\mathrm{op}}}\Big\}, $$
has the convenient consequence that either t = 1 (so feasibility is exact) or else t < 1, in which case λmin(S̃) ≥ λmin(S)/2, which gives a quantitative buffer away from the boundary of the PSD cone. This buffer is useful if one implements polishing iteratively (recomputing v from the new residual), since it prevents accumulation of numerical errors from pushing the iterate into indefiniteness.
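The buffer claim is pure linear algebra and can be checked directly (numpy; here D is an arbitrary symmetric perturbation rather than the polishing correction):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 20
S = np.diag(rng.uniform(0.2, 1.0, size=d))             # PSD input with a spectral margin
D = rng.normal(size=(d, d)); D = (D + D.T) / 2
D /= np.linalg.norm(D, 2)                              # normalize ||D||_op = 1

lmin = np.linalg.eigvalsh(S)[0]
t = min(1.0, lmin / (2 * np.linalg.norm(D, 2)))        # damping rule from the text
# Weyl: lambda_min(S - t D) >= lambda_min(S) - t ||D||_op >= lmin / 2.
assert np.linalg.eigvalsh(S - t * D)[0] >= lmin / 2 - 1e-12
```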

We also require that the output not acquire an uncontrolled large eigenvalue. Again by Weyl’s inequality,
λmax(S̃) = λmax(S − tΔ) ≤ λmax(S) + t ∥Δ∥op ≤ ∥S∥op + ∥Δ∥op,
since t ≤ 1. In particular, if the input is bounded as 0 ≼ S ≼ MId, then we obtain the deterministic bound
∥S̃∥op ≤ M + ∥Δ∥op.
Thus, all stability properties of the map S ↦ S̃ reduce to bounding ∥Δ∥op in terms of the residual size ∥r(S)∥2 (and the random operator parameters controlling 𝒢 and 𝒜*).

We now make explicit the dependence of ∥Δ∥op on ∥r∥2 in the regime where Lemma~1 and Lemma~2 hold. Write v = 𝒢†r, so that on range (𝒢) we have
$$ \|v\|_2 \;\le\; \frac{\|r\|_2}{\sigma_{\min}(\mathcal{G}\vert_{\operatorname{range}})}. $$
If σmin(𝒢|range) ≳ d, then ∥v∥2 ≲ ∥r∥2/d. Feeding this into Lemma~2 yields
$$ \|\Delta\|_{\mathrm{op}} =\|\mathcal{A}^*(v)\|_{\mathrm{op}} \;\le\; C(K)\Big(\frac{\|v\|_2}{\sqrt{d}}+\|v\|_\infty\Big) \;\lesssim\; C(K)\Big(\frac{\|r\|_2}{d^{3/2}}+\|v\|_\infty\Big). $$
Under the scaling of EFPδ, M, namely $\|r\|_2\le \delta\sqrt{n/d}$ and n = αd2, we have $\|r\|_2\lesssim \delta \sqrt{\alpha}\, d^{1/2}$, hence the contribution of the ℓ2 term alone is of order δ/d:
$$ \frac{\|r\|_2}{d^{3/2}} \;\lesssim\; \frac{\delta}{d}. $$
This is the quantitative reason that, below the 1/4 threshold, polishing is stable: the required correction is not merely small, but vanishing in d at the natural residual scale.

Assume in addition that the input matrix has a spectral margin λmin(S) ≥ η for some η > 0 (which may itself be a function of (ε, K) but not of d). If ∥Δ∥op ≤ η/2, then the damping rule gives t = 1 and the one-step update S̃ = S − Δ is simultaneously (i) exactly feasible and (ii) PSD. In the bounded-spectrum approximate feasibility setting, S may lie on the boundary of the PSD cone, so we cannot assume a deterministic η. Nevertheless, the preceding residual-to-correction transfer shows that for δ sufficiently small (depending only on (ε, K)) and d sufficiently large, ∥Δ∥op is uniformly small for all admissible inputs S. In this regime, even if λmin(S) = 0, the inequality
λmin(S − Δ) ≥ −∥Δ∥op
shows that we are at worst slightly indefinite before damping; the deterministic damping then ensures S̃ ≽ 0 while keeping t close to 1. In implementations, one can equivalently view this as a safeguard against rare inputs whose residual is at the admissible threshold but whose spectrum is unusually ill-posed.

The same inequalities quantify sensitivity: if S and S′ are two bounded-spectrum inputs with residuals r and r′ and corresponding corrections Δ and Δ′, then on the event that 𝒢 is well-conditioned and 𝒜* satisfies Lemma~2 uniformly,
$$ \|\Delta-\Delta'\|_{\mathrm{op}} =\big\|\mathcal{A}^*\big(\mathcal{G}^\dagger(r-r')\big)\big\|_{\mathrm{op}} \;\le\; C(K)\Big(\frac{\|\mathcal{G}^\dagger(r-r')\|_2}{\sqrt{d}}+\|\mathcal{G}^\dagger(r-r')\|_\infty\Big). $$
In particular, ignoring the ℓ∞ term (or controlling it by delocalization), we obtain a Lipschitz-type bound
$$ \|\Delta-\Delta'\|_{\mathrm{op}} \;\lesssim\; \frac{1}{d^{3/2}}\,\|r-r'\|_2. $$
This shows that the polishing map is not only stable but strongly contractive in the regime where the inverse Gram scaling is 1/d and the adjoint scaling is $1/\sqrt{d}$. Such contractivity is what allows us to apply the map "simultaneously for all inputs" once we restrict to bounded-spectrum matrices with residuals at the $\sqrt{n/d}$ scale.

The preceding discussion also isolates the only two mechanisms by which polishing can fail.

First, if the residual is too large (e.g. ∥r(S)∥2 is not $O(\sqrt{n/d})$ but rather comparable to $\sqrt{n}$), then even a well-conditioned 𝒢 yields ∥v∥2 of order ∥r∥2/d, which may be large enough that ∥Δ∥op is no longer negligible relative to λmin(S) or to the bound M. In this regime, damping may force t ≪ 1, and the one-step update no longer produces exact feasibility. From the geometric viewpoint, this is expected: a large residual means we are far from the affine slice {𝒜(S) = 1} in the measurement space, so the minimum-norm correction in range (𝒜*) need not remain within a small perturbation of the PSD cone.

Second, even with small residual, polishing becomes unstable if 𝒢 is poorly conditioned on its range. Indeed, if σmin(𝒢|range) is of smaller order than d, then ∥v∥2 = ∥𝒢†r∥2 can blow up, and hence so can ∥Δ∥op. This is precisely the obstruction at and above the threshold: as α ↑ 1/4 the relevant lower spectral edge deteriorates, and for α > 1/4 it collapses at the scale needed to control ∥Δ∥op uniformly over all admissible inputs. In that regime, the correction direction can be large even when ∥r∥2 is at the nominal $\sqrt{n/d}$ scale, so Weyl’s inequality no longer guarantees PSD preservation without taking t so small that exact feasibility is lost.

A third, more technical instability is encoded in the ∥v∥∞ term: if v is highly localized (so that a small number of indices carry most of its mass), then 𝒜*(v) may be dominated by a few rank-one terms and its operator norm can be large compared to the $\|v\|_2/\sqrt{d}$ prediction. This is why delocalization (or a priori control of ∥v∥∞ for the relevant residual subspace) is helpful: it rules out a situation where 𝒢† produces a vector v with a few very large coordinates, which would negate the averaging effect exploited by Lemma~2.

Summarizing, once we have placed ourselves on the high-probability event where (i) 𝒢 has a uniform spectral gap on its range and (ii) 𝒜* satisfies the uniform operator-norm bound, the polishing map has the following deterministic behavior: small residual implies small correction in operator norm; small operator-norm correction implies PSD preservation by Weyl’s inequality; and the operator norm of the output increases by at most ∥Δ∥op. The remaining probabilistic work is therefore exhausted by ensuring that, below α = 1/4, the chain
∥r∥2 → ∥𝒢†r∥2 → ∥𝒜*(𝒢†r)∥op
is uniformly contracting at the EFPδ, M scale, so that the undamped step t = 1 is valid for all admissible inputs and yields an exactly feasible PSD matrix with controlled spectrum.


Minimum nuclear-norm exact solution is PSD: formulate the nuclear-norm program and its dual; construct a dual certificate below threshold ensuring the optimum lies in the PSD cone; derive boundedness and uniqueness properties (when applicable).

We consider the convex program (NN):
$$ \min_{S\in\mathbb{S}^d}\ \|S\|_* \quad\text{subject to}\quad \mathcal{A}(S)=\mathbf{1}, $$
where ∥ ⋅ ∥* denotes the nuclear norm (sum of singular values). Since S is symmetric, $\|S\|_*=\sum_{j=1}^d|\lambda_j(S)|$. In particular, on the PSD cone we have ∥S∥* = Tr (S). The motivation for (NN) is that it is the canonical convex regularizer for "simplicity" among all exact fits; our aim is to show that below the 1/4 threshold, the optimizer is in fact PSD and quantitatively well-behaved.
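For small instances, (NN) can be posed directly in a modeling language; the following cvxpy sketch (our illustration, not part of the analysis) may be useful for numerical experiments. Note that PSD-ness is deliberately not imposed, since it is the conclusion rather than a constraint:

```python
import cvxpy as cp
import numpy as np

def fit_nuclear_norm(X):
    """Solve (NN): minimize ||S||_* subject to x_i^T S x_i = 1 for all i.

    X has shape (d, n) with columns x_i.  S is a free symmetric matrix;
    PSD-ness of the optimizer is a conclusion, not a constraint.
    """
    d, n = X.shape
    S = cp.Variable((d, d), symmetric=True)
    constraints = [cp.quad_form(X[:, i], S) == 1 for i in range(n)]
    prob = cp.Problem(cp.Minimize(cp.normNuc(S)), constraints)
    prob.solve()
    return S.value
```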

Using the duality $\|S\|_*=\sup_{\|Z\|_{\mathrm{op}}\le 1}\langle Z,S\rangle_F$ and the definition of 𝒜*, the Lagrange dual of (NN) takes the standard form (NN-D):
$$ \max_{y\in\mathbb{R}^n}\ \langle \mathbf{1},y\rangle \quad\text{subject to}\quad \|\mathcal{A}^*(y)\|_{\mathrm{op}}\le 1. $$
Weak duality is immediate: for any feasible S of (NN) and any feasible y of (NN-D),
∥S∥* ≥ ⟨𝒜*(y), S⟩F = ⟨y, 𝒜(S)⟩ = ⟨1, y⟩.
Thus any y with ∥𝒜*(y)∥op ≤ 1 certifies a lower bound on the primal optimum value.

Let S be any minimizer of (NN). Under standard constraint qualifications (which we invoke only at a high level here, and which can be verified on the below-threshold event by the existence of strictly feasible perturbations in ker (𝒜)), strong duality holds and there exists a dual optimizer y attaining (NN-D) such that
$$ \mathcal{A}^*(y)\ \in\ \partial\|S\|_*. $$
The key structural point is the following: if we can arrange (or prove) that there is an optimal dual y for which 𝒜*(y) ≽ 0, then this subgradient relation forces S ≽ 0. Indeed, every element of the subdifferential ∂∥S∥* at a symmetric matrix S agrees with the matrix sign of S on the range of S; if S has a negative spectral subspace, any element of ∂∥S∥* must act as minus the identity on it. A PSD matrix 𝒜*(y) ≽ 0 cannot serve as a nuclear-norm subgradient unless the negative spectral projector of S is zero. This is the conceptual route by which a dual certificate with nonnegative spectrum enforces that the primal optimum lies in the PSD cone.

We now describe how the below-threshold regime enables such a certificate. Fix α ≤ 1/4 − ε. By Theorem~B (polishing), on a probability 1 − o(1) event there exists at least one exactly feasible matrix S0 ≽ 0 with bounded spectrum, and moreover we can take S0 with an a priori operator norm bound ∥S0∥op ≤ M*(α, ε, K). We use this bounded feasible point as the anchor for dual certification.

Consider the auxiliary SDP restricted to the PSD cone, (PSD-Tr):
$$ \min_{S\in\mathbb{S}^d}\ \operatorname{Tr}(S)\quad\text{subject to}\quad \mathcal{A}(S)=\mathbf{1},\ \ S\succeq 0. $$
Any feasible point of (PSD-Tr) is feasible for (NN) and has objective value equal to its nuclear norm, hence
OPT_NN ≤ OPT_PSD-Tr ≤ Tr (S0).
The dual of (PSD-Tr) is (PSD-Tr-D):
$$ \max_{y\in\mathbb{R}^n}\ \langle\mathbf{1},y\rangle\quad\text{subject to}\quad \mathcal{A}^*(y)\preceq I_d. $$
Observe that if in addition to 𝒜*(y) ≼ Id we also have 𝒜*(y) ≽ 0, then y is automatically feasible for the nuclear-norm dual (NN-D), because ∥𝒜*(y)∥op ≤ 1 follows from 0 ≼ 𝒜*(y) ≼ Id.

Below the threshold, we exploit the slack in the measurement geometry to select a dual optimizer for (PSD-Tr) whose adjoint image is PSD. Concretely, we first solve (PSD-Tr-D) (conceptually; in the proof one may equivalently argue existence by compactness on a bounded event), obtaining some dual maximizer ytr with 𝒜*(ytr) ≼ Id and ⟨1, ytr⟩ = OPT_PSD-Tr. We then modify ytr along directions in ker (𝒜*) (which is nontrivial in the underdetermined regime n < dim 𝕊d) to eliminate any negative spectral part of 𝒜*(ytr) while preserving the dual objective. The feasibility margin provided by α ≤ 1/4 − ε and the uniform operator control for adjoint sums (Lemma~2) ensure that this "spectral cleaning" can be performed without violating the upper constraint 𝒜*(y) ≼ Id. This yields a certificate y satisfying
$$ 0\ \preceq\ \mathcal{A}^*(y)\ \preceq\ I_d, \qquad \langle\mathbf{1},y\rangle=\mathrm{OPT}_{\text{PSD-Tr}}. $$
At this stage, y is feasible for the nuclear-norm dual (NN-D) and attains the PSD-trace dual optimum value. Therefore,
OPT_NN ≥ ⟨1, y⟩ = OPT_PSD-Tr ≥ OPT_NN,
so equality holds throughout and y is also dual-optimal for (NN). In particular, any nuclear-norm minimizer must satisfy the KKT conditions with a dual matrix 𝒜*(y) ≽ 0, which forces PSD optimality by the subgradient mechanism discussed above. This is the content formalized by Lemma~4.

We conclude that, on the same high-probability event (depending on ε and K) on which the certificate 0 ≼ 𝒜*(y) ≼ Id can be arranged, every minimizer Ŝ of (NN) satisfies Ŝ ≽ 0 and hence ∥Ŝ∥* = Tr (Ŝ). Moreover, since Ŝ is optimal for (PSD-Tr), we have
Tr (Ŝ) = OPT_PSD-Tr ≤ Tr (S0).
To pass from trace control to operator-norm control, we combine exact feasibility with a bounded-spectrum comparison. The polishing theorem provides a feasible S0 ≽ 0 with ∥S0∥op ≤ M*(α, ε, K); the dual certificate construction above ensures that Ŝ cannot concentrate mass into a few eigen-directions without increasing Tr (Ŝ) relative to S0 (equivalently: the feasible slice intersects the PSD cone in a region whose extremal widths are uniformly bounded below the threshold). As a result, one obtains the claimed bound
∥Ŝ∥op ≤ M*(α, ε, K),
possibly after enlarging M* by a factor depending only on (α, ε, K); this is the quantitative boundedness statement in Theorem~C.

Finally, while uniqueness is not needed for our main conclusions, it is useful to record the standard sufficient condition. If, in addition, the dual optimizer y can be chosen so that 𝒜*(y) is contractive off the range of Ŝ (i.e. 0 ≼ 𝒜*(y) ≼ Id with ∥𝒜*(y)∥op = 1 and a strict spectral gap, all eigenvalues being < 1 on the orthogonal complement of range (Ŝ)), then complementary slackness implies that any other feasible PSD matrix must have strictly larger trace, hence Ŝ is the unique minimizer of both (NN) and (PSD-Tr). In the random design setting, such strict complementarity is typical away from the boundary α ↑ 1/4, and can be verified by combining the certificate construction with a nondegeneracy condition for 𝒜 restricted to the tangent space at Ŝ.


Algorithmic corollaries: end-to-end polynomial-time procedure (find bounded approximate fit via SDP/first-order method; polish to exact); complexity and numerical stability; optional practical variant using conjugate gradient without forming explicitly.

Assume throughout that α ≤ 1/4 − ε and we are on the high-probability event on which the conclusions of Theorems~A and~B hold (in particular, 𝒢 is well-conditioned on its range and the polishing bounds are valid uniformly over bounded-spectrum inputs). Then we obtain an explicit polynomial-time procedure which, given the measurements (xi)i ≤ n, outputs an exactly feasible matrix Ŝ ≽ 0 satisfying 𝒜(Ŝ) = 1. The procedure is modular:
(i) compute any bounded-spectrum approximately feasible S (i.e. a witness for EFPδ, M for sufficiently small δ and some M), and
(ii) apply the polishing map Polish(S), implemented via iterative linear algebra.

We emphasize that the existence part (Step (i)) is probabilistic (Theorem~A), while Step (ii) is an explicit correction with deterministic correctness once the linear solve is performed; randomness enters Step (ii) only through the conditioning and norm-control needed to preserve PSD without excessive damping.

There are several standard convex formulations that can be used to obtain an S with 0 ≼ S ≼ MId and $\|\mathcal{A}(S)-\mathbf{1}\|_2\le \delta\sqrt{n/d}$. One convenient choice is the feasibility-regularized SDP
$$ \text{find}\ S\in\mathbb{S}^d\quad\text{such that}\quad 0\preceq S\preceq M I_d,\qquad \|\mathcal{A}(S)-\mathbf{1}\|_2\le \delta\sqrt{n/d}. $$
Below threshold, Theorem~A guarantees feasibility for suitable parameters (with M possibly enlarged to a data-dependent value controlled in probability), and any feasible output of this program is admissible input to polishing. Alternatively, one may replace the hard residual constraint by a penalty and run projected first-order methods:
$$ \min_{0\preceq S\preceq M I_d}\ \tfrac12\,\|\mathcal{A}(S)-\mathbf{1}\|_2^2, $$
which admits projected gradient or accelerated variants. Projection onto {0 ≼ S ≼ MId} is effected by an eigenvalue decomposition and clipping eigenvalues to [0, M], which is polynomial-time in d and numerically stable. The particular choice of algorithm in Step (i) is inessential for our analysis: we only require a bounded-spectrum approximate witness at residual level $\delta\sqrt{n/d}$, and below threshold such witnesses exist with high probability.
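A minimal sketch of the projection and of one projected-gradient step (our illustrative code; the step size and stopping rule are not specified by the analysis):

```python
import numpy as np

def project_spectral_box(S, M):
    """Project a symmetric matrix onto {0 <= S <= M*I_d}: eigendecompose
    and clip the eigenvalues to [0, M]."""
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, 0.0, M)) @ V.T

def projected_gradient_step(S, X, M, step):
    """One step of projected gradient on f(S) = 0.5 * ||A(S) - 1||_2^2,
    using grad f(S) = A*(A(S) - 1) = sum_i r_i x_i x_i^T."""
    r = np.einsum('di,di->i', X, S @ X) - 1.0   # residual A(S) - 1
    grad = (X * r) @ X.T                        # adjoint A*(r)
    return project_spectral_box(S - step * grad, M)
```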

Given an input S from Step (i), define the residual r = r(S) = 𝒜(S) − 1 ∈ ℝn. The polishing map is
$$ \mathrm{Polish}(S)\ :=\ \widehat{S}\ =\ S - t\,\Delta,\qquad \Delta=\mathcal{A}^*(v),\qquad v=\mathcal{G}^\dagger r, $$
with t ∈ (0, 1] chosen to ensure Ŝ ≽ 0 (e.g. t = min {1, λmin(S)/(2∥Δ∥op)} as in Theorem~B). The defining identity 𝒜(Δ) = 𝒜𝒜*(v) = 𝒢v implies that if v solves 𝒢v = r exactly and t = 1, then 𝒜(Ŝ) = 1 exactly. In practice we either (a) solve 𝒢v = r to sufficient accuracy and, if needed, apply polishing iteratively, or (b) solve a slightly regularized system (𝒢 + ρI)v = r and correct the resulting small bias by a final refinement; both variants remain polynomial-time and stable under the conditioning bounds from Lemma~1.
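On small instances one can realize this map directly with dense linear algebra; the following sketch (ours; it forms 𝒢 explicitly, which is only viable at validation scale since 𝒢 is n × n) combines the least-squares solve with the damping rule:

```python
import numpy as np

def polish_dense(S, X):
    """One polishing step, forming G explicitly (validation-scale only).

    Solves G v = r in the minimum-norm least-squares sense, sets
    Delta = A*(v), and applies the damping rule from Theorem B."""
    r = np.einsum('di,di->i', X, S @ X) - 1.0   # residual A(S) - 1
    G = (X.T @ X) ** 2                          # Gram: G_ij = (x_i^T x_j)^2
    v = np.linalg.lstsq(G, r, rcond=None)[0]    # minimum-norm solution
    Delta = (X * v) @ X.T                       # A*(v)
    lam_min = np.linalg.eigvalsh(S)[0]
    delta_op = np.linalg.norm(Delta, 2)
    t = 1.0 if delta_op == 0 else min(1.0, lam_min / (2 * delta_op))
    return S - t * Delta
```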

The Gram matrix 𝒢 ∈ ℝn × n is dense and prohibitive to form explicitly when n = αd2 is large. However, iterative solvers (conjugate gradient for symmetric PSD systems, MINRES, or LSQR) require only matrix–vector products v ↦ 𝒢v, which can be computed implicitly using the factorization 𝒢 = 𝒜𝒜*. Concretely, for a given v ∈ ℝn we compute
$$ M\ :=\ \mathcal{A}^*(v)\ =\ \sum_{j=1}^n v_j x_j x_j^\top\in\mathbb{S}^d, \qquad (\mathcal{G}v)_i\ =\ (\mathcal{A}(M))_i\ =\ x_i^\top M x_i,\ \ i\le n. $$
Thus one Gram product reduces to one adjoint accumulation (a weighted sum of rank-one matrices) followed by evaluation of n quadratic forms. If we represent X = [x1 ⋯ xn] ∈ ℝd × n, then M = X diag(v) X⊤, and 𝒢v = diag(X⊤MX). This can be implemented via dense linear algebra in time polynomial in (d, n), without storing 𝒢.
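In code, one implicit Gram product is two matrix multiplications and a contraction; a minimal numpy sketch (ours):

```python
import numpy as np

def gram_matvec(X, v):
    """Implicit Gram product (G v)_i = x_i^T M x_i with M = X diag(v) X^T.

    Never forms the n x n matrix G; one product costs O(n d^2) dense flops.
    X has shape (d, n), v has shape (n,)."""
    M = (X * v) @ X.T                         # adjoint accumulation A*(v)
    return np.einsum('di,di->i', X, M @ X)    # quadratic forms x_i^T M x_i
```

On validation-scale instances the output can be checked against the dense kernel matrix (X.T @ X) ** 2 applied to v.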

We record a crude but explicit complexity upper bound sufficient for the theoretical claim. Since n = Θ(d2), forming $M=\sum_{j=1}^n v_j x_j x_j^\top$ naively costs O(nd2) = O(d4) arithmetic operations, and computing all quadratic forms xiMxi also costs O(nd2) = O(d4). Hence a single implicit 𝒢–vector product costs O(d4) in the dense model. Conjugate gradient requires $O(\sqrt{\kappa}\log(1/\eta))$ such products to reach relative residual η, where κ is the condition number of 𝒢 on its range. By Lemma~1, below threshold we have κ = Oε, K(1) with high probability after scaling, so the iteration count is Oε, K(log (1/η)). Consequently, the total polishing cost is Oε, K(d4log (1/η)), polynomial in the input size. Step (i) is also polynomial-time, whether via interior-point methods for SDPs (with worst-case polynomial complexity) or via first-order methods whose per-iteration cost is dominated by repeated applications of 𝒜 and 𝒜* (each of which can be implemented in O(nd2) = O(d4) time in the dense model).

This accounting is intentionally conservative; it is meant to certify polynomial-time solvability rather than to optimize constants. Any structural speedups (e.g. exploiting sparsity in xi, block structure, or hardware-accelerated matrix multiplication) improve the practical runtime but are not needed for the theoretical corollary.

Two numerical issues are central: conditioning of the linear solve 𝒢v = r, and preservation of PSD in the update Ŝ = S − tΔ. The first is governed by Lemma~1: when α ≤ 1/4 − ε, the nonzero spectrum of 𝒢 lies in [c1d, c2d] with high probability, so the effective conditioning is bounded by c2/c1, independent of d. In particular, iterative solvers do not suffer dimension-driven slowdown. Moreover, since 𝒢 is PSD, conjugate gradient is backward stable in the usual sense: the computed iterate v(k) solves a nearby system (𝒢 + E)v = r with E controlled by roundoff, provided the matrix–vector products are computed stably.

For the second issue, we either compute t from spectral estimates of λmin(S) and ∥Δ∥op, or we adopt a conservative line-search. Since S is PSD and bounded by construction, λmin(S) can be approximated by a few iterations of Lanczos (polynomial-time), and ∥Δ∥op can be estimated similarly by power iteration applied to Δ. In many regimes, Theorem~B yields that ∥Δ∥op is sufficiently small that t = 1 is admissible, and the damping is only a safeguard near the boundary of the PSD cone. If damping is applied, exact feasibility is preserved when t = 1, whereas for t < 1 we lose exact feasibility; in that case we can restore exactness by either (a) applying the full correction in multiple damped steps while updating the residual, or (b) performing one final undamped correction after shifting S away from the PSD boundary (e.g. replacing S by (1 − γ)S + γτId for small γ, τ > 0), both of which keep the spectrum controlled under the operator-norm bounds from Theorem~B.
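Both spectral quantities are accessible by standard iterative eigensolvers; a sketch using scipy's Lanczos routine (the wrapper name damping_parameter is ours):

```python
from scipy.sparse.linalg import eigsh

def damping_parameter(S, Delta):
    """Estimate lambda_min(S) and ||Delta||_op by Lanczos, then set t.

    which='SA' targets the smallest algebraic eigenvalue; for symmetric
    Delta the operator norm is its largest-magnitude eigenvalue ('LM')."""
    lam_min = eigsh(S, k=1, which='SA', return_eigenvectors=False)[0]
    delta_op = abs(eigsh(Delta, k=1, which='LM', return_eigenvectors=False)[0])
    return min(1.0, lam_min / (2.0 * delta_op))
```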

Because 𝒢 may have a nontrivial kernel (indeed rank (𝒢) ≤ d(d + 1)/2), it is natural to interpret v = 𝒢†r as the minimum-ℓ2 solution of minv ∥𝒢v − r∥2. Conjugate gradient applied to the normal equations or to 𝒢v = r with initialization v(0) = 0 produces the minimum-norm solution whenever r ∈ range (𝒢). In floating-point arithmetic or when r is only approximately in the range (as can occur if S is obtained from inexact optimization in Step (i)), it is often preferable to solve the regularized system
(𝒢 + ρI)v = r
for a small ridge ρ > 0, and then set Δ = 𝒜*(v). This yields a correction direction that is well-defined and smooth in r, and the resulting residual becomes (I − 𝒢(𝒢 + ρI)−1)r, which can be made arbitrarily small by taking ρ small and iterating. The operator-norm control from Lemma~2 continues to apply to Δ = 𝒜*(v), with constants that degrade continuously with ρ.
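Putting the pieces together, a matrix-free sketch of this regularized variant (ours; the ridge ρ is an illustrative default, and the damping safeguard is omitted for brevity):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def polish_ridge(S, X, rho=1e-8):
    """Regularized polishing: solve (G + rho*I) v = r by conjugate gradient
    using only implicit products with G, then apply Delta = A*(v)."""
    d, n = X.shape
    r = np.einsum('di,di->i', X, S @ X) - 1.0    # residual A(S) - 1
    def matvec(v):
        M = (X * v) @ X.T                        # A*(v)
        return np.einsum('di,di->i', X, M @ X) + rho * v
    G_op = LinearOperator((n, n), matvec=matvec, dtype=float)
    v, info = cg(G_op, r)                        # default CG tolerances
    Delta = (X * v) @ X.T
    return S - Delta
```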

Below the threshold α ≤ 1/4 − ε, the combination of (a) existence of bounded-spectrum approximate fits, (b) uniform conditioning of the Gram operator, and (c) the explicit correction formula Δ = 𝒜*(𝒢†r) yields an end-to-end polynomial-time route to exact feasibility in the PSD cone. The solver is explicit, relies only on applications of 𝒜 and 𝒜* (hence can be implemented without forming 𝒢), and admits standard iterative linear-algebra primitives whose stability is supported by the same probabilistic estimates used in the analysis of Theorem~B.


Examples and extensions: Gaussian case recovers MB23 and upgrades to exact feasibility conditional on EFP; extension sketches to other cones (PSD+trace constraints; group-invariant cones); discussion of implications for random-feature/quadratic-feature learning models.

We briefly record several instantiations of the general framework, and indicate directions in which the arguments admit reasonably direct modifications. The emphasis here is on clarifying what is already contained in the preceding theorems (often implicitly), rather than on introducing new technical statements.

In the Gaussian case xi ∼ N(0, Id/d), the ensemble is rotationally invariant and the Gram entries satisfy
$$ \mathcal{G}_{ij}=(x_i^\top x_j)^2,\qquad \mathbb{E}[\mathcal{G}_{ii}]=\mathbb{E}\|x_i\|_2^4=1+O(1/d),\qquad \mathbb{E}[\mathcal{G}_{ij}]=\mathbb{E}(x_i^\top x_j)^2=\frac{1}{d}\ \ (i\neq j). $$
Theorem~A specializes to the bounded-spectrum approximate feasibility transition at α = 1/4 established in MB23, but with two changes in formulation that are useful for downstream algorithmics: (a) we measure approximate feasibility in an ℓ2 residual $\|r(S)\|_2\le \delta\sqrt{n/d}$ rather than coordinatewise tolerances, and (b) we keep track of a bounded-spectrum witness 0 ≼ S ≼ MId as the basic regularity regime. Once such an approximate witness exists, Theorem~B yields an explicit correction Δ = 𝒜*(𝒢†r(S)) that enforces 𝒜(S − Δ) = 1 at the level of linear algebra, and then controls ∥Δ∥op in terms of ∥r(S)∥2 so that PSD is preserved (possibly after damping). Thus, in the Gaussian setting, MB23’s approximate existence result combines with our polishing mechanism to give a route from bounded approximate feasibility to exact feasibility, conditional only on the availability of an approximate witness with sufficiently small residual.

The polynomial-time correction map is essentially the minimum-Frobenius-norm update within range (𝒜*) that removes the residual. Concretely, for any input S and residual r = r(S), the choice v = 𝒢†r satisfies 𝒢v = Πrange (𝒢)r, hence
𝒜(S − 𝒜*(v)) = 1 + r − 𝒢v = 1 + (I − Πrange (𝒢))r.
Therefore exact feasibility is immediate whenever r ∈ range (𝒢), which holds in particular if the residual is computed from an exactly representable vector in range (𝒜) (or if Step (i) is solved to sufficient accuracy so that the residual is numerically in-range). Below threshold, Lemma~1 yields uniform conditioning on range (𝒢) and Lemma~2 yields ∥𝒜*(v)∥op control in terms of ∥v∥2, so the correction is stable. In the Gaussian case one can often argue that t = 1 is admissible with high probability for a broad class of approximate witnesses, because typical bounded-spectrum S have λmin(S) bounded away from 0 in many constructions and ∥Δ∥op concentrates at the scale $O(\|r\|_2/\sqrt{d})$; we do not need such refinements for the general theorem, but they help interpret polishing as a one-shot exactification rather than as an iterative safeguard.

Many formulations of ellipsoid fitting and related SDP relaxations impose an additional linear normalization, such as Tr (S) = 1 (or Tr (S) = d depending on scaling), or constrain a subset of diagonal entries. Such constraints can be incorporated by augmenting the measurement operator. For example, define
$$ \widetilde{\mathcal{A}}:\mathbb{S}^d\to\mathbb{R}^{n+1},\qquad \widetilde{\mathcal{A}}(S):=\bigl(\mathcal{A}(S),\, \operatorname{Tr}(S)\bigr), $$
with target right-hand side (1, τ) for a prescribed τ > 0. Then $\widetilde{\mathcal{A}}^*(v,s)=\mathcal{A}^*(v)+s I_d$ and the corresponding Gram operator becomes a block matrix
$$ \widetilde{\mathcal{G}}=\widetilde{\mathcal{A}}\widetilde{\mathcal{A}}^* = \begin{pmatrix} \mathcal{G} & \mathcal{A}(I_d)\\ \mathcal{A}(I_d)^\top & \|I_d\|_F^2 \end{pmatrix} = \begin{pmatrix} \mathcal{G} & (\|x_i\|_2^2)_{i\le n}\\ (\|x_i\|_2^2)_{i\le n}^\top & d \end{pmatrix}. $$
Below threshold, the same strategy applies provided we have a uniform conditioning bound for $\widetilde{\mathcal{G}}$ on its range, and an operator-norm control for $\widetilde{\mathcal{A}}^*(v,s)$ analogous to Lemma~2. Since Id is deterministic and ∥xi∥22 concentrates around 1, these modifications are typically benign: one can view the extra constraint as adding one more well-behaved direction to the linear system, and the polishing map becomes $S\mapsto S-\widetilde{\mathcal{A}}^*(\widetilde{\mathcal{G}}^\dagger \widetilde{r})$ with $\widetilde{r}=\widetilde{\mathcal{A}}(S)-(\mathbf{1},\tau)$. Similar remarks apply to multiple linear side constraints (e.g. several trace-like moments), at the cost of increasing the Gram dimension by a constant factor.
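The augmented implicit product is a one-line extension of the plain one; a sketch (ours), following the block formula above:

```python
import numpy as np

def augmented_gram_matvec(X, v, s):
    """Implicit product with the augmented Gram operator of
    A~(S) = (A(S), Tr(S)): forms M = A~*(v, s) = A*(v) + s*I_d and
    returns (A(M), Tr(M))."""
    d = X.shape[0]
    M = (X * v) @ X.T + s * np.eye(d)           # A~*(v, s)
    quad = np.einsum('di,di->i', X, M @ X)      # A(M)
    return quad, np.trace(M)
```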

The core probabilistic inputs in Theorems~A and~B are (i) a universality statement controlling the approximate-feasibility volume/width transition, and (ii) a uniform invertibility statement for 𝒢 on its range together with an 𝒜* operator-norm bound. These inputs are not intrinsically tied to the full PSD cone 𝕊+d; rather, they depend on how the affine slice intersects a convex cone in a space of dimension Θ(d2). If one replaces 𝕊+d by a cone 𝒦 ⊆ 𝕊d that is invariant under a group action S ↦ USU⊤ for U in a subgroup G ≤ O(d) (e.g. block-diagonal PSD cones, G-equivariant matrix algebras, or cones defined by commuting constraints), then one expects the threshold to depend on the number of degrees of freedom (statistical dimension or Gaussian width of 𝒦 ∩ {∥S∥F = 1}), rather than on d(d + 1)/2. At a formal level, if the feasible set is restricted to a G-invariant linear subspace ℒ ⊆ 𝕊d of dimension m ≪ d2, then 𝒜 is replaced by 𝒜 ∘ Πℒ, and the natural scaling becomes n = αm; correspondingly, one anticipates a threshold at α = 1/4 with d2 replaced by m, provided the design remains isotropic in the sense that 𝔼⟨X, H⟩F2 is proportional to ∥H∥F2 for H ∈ ℒ. Establishing such variants would require re-running the universality and conditioning arguments with m as the ambient dimension, but the polishing mechanism itself is unchanged once the relevant Gram operator is well-conditioned on its range.

The bounded-spectrum hypothesis 0 ≼ S ≼ MId is a convenient way to exclude pathologies near the PSD boundary and to obtain uniformity in Lemma~2. In applications, the same effect can be achieved by convex regularization rather than explicit spectral truncation. For instance, one can compute an approximate witness by
$$ \min_{S\succeq 0}\ \frac12\|\mathcal{A}(S)-\mathbf{1}\|_2^2+\lambda\,\operatorname{Tr}(S)+\mu\,\|S\|_F^2, $$
with μ > 0 enforcing strict convexity and spectral control, and then polish the output. The point is not that this particular objective is canonical, but that many regularizers imply bounds of the form ∥S∥op ≤ M (possibly with M depending on λ, μ and the data) and yield small residuals. Once such a bounded approximate witness is available, polishing provides an explicit final certificate Ŝ ≽ 0 with exact constraints. This separation of concerns—first enforce "well-behavedness" via regularization, then certify exact feasibility by linear correction—is robust across algorithmic choices.

The measurements 𝒜(S)i = xiSxi can be interpreted as evaluating a quadratic model fS(x) = xSx on the training inputs x1, …, xn. Exact feasibility 𝒜(S) = 1 is then interpolation of the constant label 1 by a PSD quadratic form. The matrix 𝒢 is precisely the (degree-2 homogeneous) polynomial kernel Gram matrix on the training set: 𝒢ij = k(xi, xj) with k(u, v) = (u⊤v)2. Thus the linear system 𝒢v = r appearing in polishing is the normal-equation system for least-squares regression in the quadratic feature space, and the conditioning statement in Lemma~1 may be read as a statement about the stability of kernel regression below the n ≍ d2 scaling, at least on the range determined by the feature map.
Moreover, the nuclear norm ∥S∥* = Tr (S) for PSD S corresponds to the natural RKHS norm for quadratic forms under our scaling, so Theorem~C can be interpreted as saying that, below threshold, the minimum-norm interpolant in the quadratic feature class is automatically PSD and has controlled operator norm with high probability. This is consistent with the heuristic that, in an underconstrained regime, minimum-norm interpolation selects a "benign" solution. While our results do not directly yield generalization bounds (which would require a separate distributional model for test points and labels), they provide a structural guarantee: below α = 1/4 − ε, interpolation of the constant target can be realized by a bounded PSD quadratic model, and the canonical convex regularizer does not leave the PSD cone.
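The kernel identification is easy to verify numerically; a small self-contained check (ours, with illustrative dimensions):

```python
import numpy as np

# Check that the Gram operator of the measurement map coincides with the
# degree-2 homogeneous polynomial kernel matrix: G_ij = (x_i^T x_j)^2.
rng = np.random.default_rng(0)
d, n = 20, 50
X = rng.normal(size=(d, n)) / np.sqrt(d)     # columns x_i ~ N(0, I_d / d)
G_kernel = (X.T @ X) ** 2                    # k(x_i, x_j) = (x_i^T x_j)^2
e0 = np.zeros(n); e0[0] = 1.0
M = (X * e0) @ X.T                           # A*(e_0) = x_0 x_0^T
col0 = np.einsum('di,di->i', X, M @ X)       # column G e_0 via A A*
assert np.allclose(col0, G_kernel[:, 0])
```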

From a kernel perspective, existence of S ≽ 0 with 𝒜(S) = 1 is a representability statement: the constant vector 1 lies in the range of the feature map induced by S ↦ (xiSxi)i ≤ n restricted to the PSD cone. Polishing adds a certification statement: given any approximate representer with bounded spectrum, one can algorithmically convert it into an exact representer by solving a Gram linear system and applying a controlled correction in parameter space. In particular, one does not need to solve the conic feasibility problem to high accuracy ab initio; it suffices to enter the basin of small residual and bounded spectrum. This perspective is compatible with numerical practice in random-feature models, where one frequently observes that moderate-accuracy optimization combined with an explicit linear-algebraic refinement yields exact constraint satisfaction, provided the relevant Gram operator is not ill-conditioned.


Open problems and 2026 directions: pushing polishing to the exact 1/4 boundary; removing bounded-spectrum assumptions entirely; extending to structured (non-i.i.d.) measurements; connections to other Randomstrasse101 problems (synchronization landscapes, Kikuchi matrices) via ‘regularize then certify’ principles.

We conclude by isolating several problems that appear to be the natural next steps if one aims to turn the present picture into a sharp and broadly applicable theory. The common theme is that our arguments are presently separated into (i) existence of an approximate witness (bounded spectrum and small residual) and (ii) a deterministic correction that certifies exact feasibility once such a witness is available. It is plausible that both components can be strengthened: the first toward the sharp boundary α = 1/4 and toward weaker regularity hypotheses, and the second toward greater robustness under model misspecification and dependence. We state these directions in a way that makes clear which parts are geometric (thresholds, widths, statistical dimension) and which are analytic (conditioning of Gram operators, stability of inverse maps, control of operator norms).

Our polishing theorem is stated with a fixed gap α ≤ 1/4 − ε, and this gap is used in two places: to obtain a uniform lower bound on the nonzero singular values of 𝒢 on its range, and to ensure that the 𝒜* operator-norm bound can be applied uniformly over the relevant vectors v = 𝒢†r. It is natural to ask whether one can let ε = εd ↓ 0 and still certify exact feasibility for residuals of size δ = δd that may also shrink with d. A minimal quantitative goal would be a statement of the form: if $\alpha\le 1/4-c\,d^{-\beta}$ for some β > 0 then, with high probability, σmin(𝒢|range) ≳ d ⋅ εd (or another explicit scale) and polishing remains stable for residuals ∥r∥2 of order at most d1/2εd (up to constants). Achieving such "near-critical" control appears to require a refined spectral analysis of 𝒢 beyond the coarse conditioning bounds used here, perhaps in the spirit of local laws for kernel Gram matrices. Even in the Gaussian case, the matrix 𝒢 is neither a Wigner matrix nor a standard sample covariance; its entries are quadratic functions of inner products, and the near-critical regime likely exhibits a nontrivial edge behavior. A second, closely related problem is to understand whether the damping step can be made essentially unnecessary at the boundary, i.e. whether typical approximate witnesses in the feasible regime have λmin(S) bounded below at the scale on which ∥Δ∥op fluctuates. This would turn polishing into a genuinely one-shot exactification with t = 1 for broad classes of initializations.

The current stability mechanism bounds ∥Δ∥op via ∥𝒜*(v)∥op together with ∥v∥2 ≤ ∥𝒢†∥op∥r∥2. In near-critical regimes this can be pessimistic because ∥𝒢†∥op is governed by the smallest nonzero singular value, while the residual may concentrate in more stable directions. A plausible improvement is an estimate of the form
∥𝒜*(𝒢†r)∥op ≤ Φ(r)
where Φ is controlled by the spectral measure of 𝒢 and the alignment of r with near-null directions. Such a bound would amount to a refined norm control on the map 𝒜*𝒢 restricted to residuals that arise from bounded-spectrum inputs. Establishing it may require coupling the geometry of the feasible slice (which constrains what residuals are attainable by bounded-spectrum matrices) with spectral information about 𝒢. This is a concrete place where one might hope to exploit the specific structure of residuals r(S) = 𝒜(S) − 1 rather than bounding over all r ∈ ℝn.

A more conceptual open question is whether the bounded-spectrum hypothesis is genuinely needed for exact feasibility below threshold, or merely a technical device that simplifies uniformity. One extreme conjecture would be: for α < 1/4, with high probability there exists an exactly feasible solution S ≽ 0 with ∥S∥op = O(1) (or at least Tr (S) = O(d)), without imposing any a priori spectral cap. On the algorithmic side, one would like a statement of the form: if one finds an approximately feasible S ≽ 0 with sufficiently small residual (but possibly large ∥S∥op), then there is a controlled way to "regularize and then polish" without first projecting onto {S ≼ MId}, which can be numerically delicate when S is ill-conditioned. One route is to prove that approximate feasibility itself enforces spectral control: for instance, show that if 𝒜(S) ≈ 1 and S ≽ 0 then necessarily Tr (S) ≈ d and ∥S∥op cannot be too large, with high probability over the design. Such an implication would be a form of restricted injectivity of the map S ↦ 𝒜(S) on rays of the PSD cone, and would reduce bounded-spectrum feasibility to plain feasibility plus residual control. Another route is to incorporate an interior-point style barrier (e.g. −log det (S + ηI)) to ensure strict positivity and then to quantify how the barrier parameter η can be sent to 0 while maintaining a stable polishing step.

Our universality and conditioning inputs are formulated for i.i.d. isotropic subgaussian xi. In many applications the xi are neither independent nor identically distributed: they may be drawn without replacement from a finite set, they may form a Markov chain, or they may be generated by a random feature pipeline with shared randomness across samples. A first extension is to allow heterogeneous covariances 𝔼[xixi] = Σi/d with Σi close to Id, and ask whether the same 1/4 boundary persists after an appropriate prewhitening. Here one expects the threshold to depend on the effective condition number of the average covariance and on the uniformity of tails, and the key technical point becomes a stability analysis of 𝒢 under inhomogeneous kernel entries (xixj)2. A second extension is dependence across i. Even mild dependence can break the simple concentration tools used to control σmin(𝒢|range), so one would need either a martingale/mixing replacement for the Gram conditioning lemma, or a leave-one-out/self-consistent analysis that controls the resolvent of 𝒢. We emphasize that polishing itself is deterministic; what is needed is a probabilistic guarantee that the linear system defining v is solvable stably in the regime of interest. In structured settings, it may be more natural to certify stability by bounding the spectrum of an idealized Gram operator plus a perturbation term that is controlled in operator norm, rather than appealing to i.i.d. concentration.

Another direction is robustness when the right-hand side is not exactly 1 or when the constraints are noisy. In such settings exact feasibility may be impossible, but one can still ask for a certified near-feasible PSD solution with an explicit residual bound and controlled spectrum. The polishing map can be adapted by solving a Tikhonov-regularized system (𝒢 + γI)v = r and setting Ŝ = S − 𝒜*(v), which is the minimum-norm correction in a smoothed metric. An open problem is to develop a theorem stating that, below α = 1/4, this regularized polishing is optimally stable (in a minimax sense) over a natural noise model, with γ chosen at the scale of the smallest nonzero singular value of 𝒢. Such a result would connect the feasibility phase transition to the stability phase transition for kernel ridge regression with the quadratic kernel, and would clarify what aspects of our proofs are genuinely about feasibility versus about conditioning.

The pattern "find a regularized approximate solution, then certify or exactify via an explicit linear-algebraic correction" appears in several problems that are adjacent to our setting. Two examples that seem particularly close in spirit are synchronization-type SDPs (where one optimizes over a PSD matrix with structured linear constraints encoding group consistency) and relaxations based on Kikuchi or cluster-variational matrices (where PSD constraints are imposed on collections of local marginals). In these problems one often observes empirically that first-order methods find points that nearly satisfy the linear constraints but are not exactly feasible, and that enforcing exact constraints via projection can destroy PSD structure. Our polishing mechanism suggests an alternative: if the constraint operator has a well-conditioned Gram map on its range, then one can correct in the range of the adjoint while quantifying how much the PSD cone is perturbed. Formalizing this analogy requires identifying, in each problem, the analogue of 𝒢 = 𝒜𝒜* and proving a below-threshold (or below-density) conditioning statement. For synchronization landscapes, one expects the relevant effective dimension to be the number of degrees of freedom in the lifted variable (often Θ(d2) per node, with additional structure), and the analogue of α = 1/4 might manifest as a sampling density threshold for stable constraint enforcement. For Kikuchi-type matrices, the local-consistency operator typically has a sparse incidence structure; here the main challenge is that 𝒢 is no longer a dense kernel Gram matrix, and conditioning must be analyzed via graph expansion or spectral properties of the constraint graph rather than via i.i.d. concentration.

Even in the present i.i.d. model, the practical implementation of polishing depends on solving 𝒢v = r efficiently. While conjugate gradient with matrix–vector products by 𝒢 is polynomial-time and often effective, the near-critical regime will be sensitive to preconditioning. An open direction is to design preconditioners exploiting the structure 𝒢ij = (xixj)2, for example by decomposing 𝒢 into a low-rank mean component plus a fluctuation term, or by using randomized sketches that preserve the range. On the theoretical side, it would be useful to prove that a small number of iterative polishing steps (possibly with re-damping) converges to an exact feasible PSD matrix under weaker conditions than a single step, mirroring Newton refinement in nonlinear programming. Such a result would shift the burden from requiring strong one-shot bounds on ∥Δ∥op to requiring only that successive corrections remain controlled on average.

We view the following as the most concrete 2026-level targets suggested by the present work: (i) a near-critical conditioning theorem for 𝒢 that yields explicit rates as α ↑ 1/4; (ii) a self-regularization result showing that approximate feasibility forces bounded trace and moderate operator norm for PSD solutions, thereby eliminating explicit spectral caps; (iii) an extension of the universality and polishing guarantees to non-i.i.d. or weakly dependent designs via resolvent/leave-one-out techniques; and (iv) a transfer principle that packages "regularize then certify" as a reusable lemma for other SDP landscapes where one can prove an appropriate Gram conditioning statement. Each of these targets is, in our view, technically plausible and would substantially strengthen the conceptual message: below a universal sampling boundary, feasibility is not only an existential geometric property but also an algorithmically certifiable one.