We consider the feasibility problem
(EF): find S ≽ 0 such that 𝒜(S) = 1,
where 𝒜(S) = (xi⊤Sxi)i ≤ n
is induced by random rank-one measurements Xi = xixi⊤.
This formulation is equivalent to an ellipsoid fitting problem: if S ≻ 0 then S defines the centered
ellipsoid
ℰ(S) := {z ∈ ℝd: z⊤Sz ≤ 1},
and the constraints xi⊤Sxi = 1
assert that each sample point xi lies on the
boundary of ℰ(S). Thus (EF)
asks for a centered ellipsoid whose boundary passes through all n random points. Although the
geometric picture is classical, the random scaling n = αd2
places the problem in a regime where the number of constraints is
commensurate with the ambient dimension of 𝕊d, and the feasibility
of the affine slice {S : 𝒜(S) = 1}
intersected with the PSD cone exhibits sharp high-dimensional
behavior.
A naive dimension count suggests that α should control feasibility because
dim (𝕊d) = d(d + 1)/2 ≍ d2
while the number of constraints is n = αd2.
However, such a count neither predicts a sharp threshold nor captures
the role of positivity. In particular, the PSD cone imposes a global
curvature constraint, and the equality constraints are quadratic in the
underlying vectors xi. Both
features create a substantial gap between
``having enough degrees of freedom'' and ``admitting a feasible
PSD solution with controlled geometry.'' Our focus is precisely on this
gap: we aim to identify a distributionally robust transition for
feasible solutions and to provide a mechanism that turns robust
approximate fits into exact feasible PSD matrices.
Two notions are essential. First, in high dimension one must exclude
pathological solutions whose spectrum is unbounded or whose conditioning
deteriorates with d. Such
solutions, even when they exist, are unstable and are not informative
about the typical geometry of the problem. We therefore work in a
bounded-spectrum regime, seeking matrices S satisfying
0 ≼ S ≼ MId
for a fixed (or at most slowly growing) M. Second, even below a putative
feasibility threshold, exact equality constraints can be algorithmically
and analytically delicate; it is natural to relax to an approximate
feasibility requirement
∥𝒜(S) − 1∥2
small,
which is also the relevant formulation in noisy data settings. The
bounded-spectrum approximate feasibility event EFPδ, M captures
precisely this ``robustly feasible and well-conditioned’’ regime.
In the Gaussian case, recent work has identified a sharp transition at sampling ratio α = 1/4 for the existence of bounded-spectrum solutions: below 1/4 one typically can fit a well-behaved ellipsoid up to small residual, while above 1/4 even approximate fitting fails uniformly over bounded M. This transition is striking because it is not the naive α ≈ 1/2 suggested by parameter counting; rather, it reflects a genuinely high-dimensional geometric obstruction. A central question is whether the same constant persists for non-Gaussian designs and, more importantly for our purposes, whether bounded approximate feasibility can be upgraded to feasibility in a stable manner.
The approximate-to-exact step is the missing link for two reasons. First, the existence of an approximate fit does not itself guarantee the existence of an exact PSD fit with comparable spectrum bounds. In principle, the affine constraints might be such that the slice {S : 𝒜(S) = 1} passes near the PSD cone in the bounded region but does not intersect it; approximate solutions could then exist without any exact solution of controlled size. Second, even when exact solutions exist, it is not clear how to produce one from an approximate input without leaving the PSD cone: a direct correction solving 𝒜(Δ) = 𝒜(S) − 1 may introduce a large negative direction, thereby destroying positive semidefiniteness. Thus, to convert approximate feasibility into exact feasibility, one must exploit both the linear-algebraic structure of 𝒜 and the random conditioning of the induced Gram operator.
Our approach is to make the correction step explicit and
quantitative. Let r(S) := 𝒜(S) − 1
denote the residual and let 𝒢 := 𝒜𝒜* be the n × n Gram operator with
entries (𝒢)ij = (xi⊤xj)2.
Formally, if 𝒢 were invertible then the
choice
v := 𝒢−1r(S), Δ := 𝒜*(v)
satisfies 𝒜(S − Δ) = 1
by construction. Moreover, among all corrections in range (𝒜*) that eliminate the
residual, Δ is the
minimum-Frobenius-norm correction. The obstacle is that 𝒢 may be singular or ill-conditioned, and
even when v is well-defined
(via the Moore–Penrose inverse), the operator norm ∥Δ∥op must be controlled
to preserve S ≽ 0. The core
technical content of our ``polishing’’ result is that, below the 1/4 threshold, the random operator 𝒢 is sufficiently well-conditioned on its
range and 𝒜* does not
amplify vectors v too much in
operator norm. This yields a deterministic update rule with a
probabilistic guarantee: starting from any bounded-spectrum approximate
feasible S, we compute a
correction direction Δ and
apply either the full step S − Δ or a damped step
S − tΔ with
an explicit t chosen to ensure
S − tΔ ≽ 0.
The random matrix input required for this argument is, in a sense, orthogonal to the existence of approximate solutions. Existence is governed by the high-dimensional geometry of the feasible slice intersected with {0 ≼ S ≼ MI}, while polishing is governed by invertibility and norm control for the operator 𝒜 and its adjoint. The two aspects interact through the residual scale: we require $\|r(S)\|_2\le \delta\sqrt{n/d}$ with δ small enough that the resulting correction is spectrally small. This is precisely why the bounded-spectrum approximate feasibility event is the right input condition: it gives a robust quantitative foothold from which exact feasibility can be reached without losing control of the spectrum.
A second contribution concerns convex regularization. Among all exact
solutions to 𝒜(S) = 1, a natural
selection rule is minimum nuclear norm:
Ŝ ∈ arg min {∥S∥*: 𝒜(S) = 1}.
For S ≽ 0 the nuclear norm
equals Tr (S), so this
objective can be interpreted as seeking a smallest-trace ellipsoid fit.
However, the optimization problem is posed over all symmetric matrices,
and an optimizer could be indefinite. We show that below the 1/4 threshold this does not happen with high
probability: every nuclear-norm minimizer is PSD and enjoys an
operator-norm bound depending only on (α, ε, K).
Conceptually, this ties the geometric threshold phenomenon to a
canonical convex program and provides an additional route to
``well-behaved’’ exact solutions without imposing S ≽ 0 as an explicit constraint in
the minimization.
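To make this claim concrete on a small instance, the following sketch (an illustration on our part, not the procedure analyzed in our theorems; it assumes numpy and the cvxpy package with an SDP-capable solver such as SCS, and all variable names are ours) solves the nuclear-norm program for a Gaussian design below the threshold and reports the extreme eigenvalues of the computed minimizer. At such small d the PSD conclusion holds only up to solver tolerance and finite-size effects.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
d, alpha = 20, 0.15                              # sampling ratio below 1/4
n = int(alpha * d**2)
X = rng.standard_normal((n, d)) / np.sqrt(d)     # rows x_i ~ N(0, I_d / d)

S = cp.Variable((d, d), symmetric=True)
# linear constraints x_i^T S x_i = 1 for all i, written in vectorized form
constraints = [cp.sum(cp.multiply(X @ S, X), axis=1) == np.ones(n)]
prob = cp.Problem(cp.Minimize(cp.normNuc(S)), constraints)
prob.solve(solver=cp.SCS)

eigs = np.linalg.eigvalsh(S.value)
print("lambda_min =", eigs[0], " lambda_max =", eigs[-1])
# below the threshold one expects lambda_min >= -tol, i.e. a PSD minimizer
```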
We emphasize that our results are not restricted to rotationally invariant ensembles. The measurement vectors xi are assumed only isotropic and subgaussian (in the standard ψ2 sense after scaling by $\sqrt{d}$). This covers Gaussian designs but also encompasses, for example, Rademacher-type designs and many other bounded or weakly dependent models after appropriate preprocessing. The universality statement asserts that the 1/4 threshold for bounded-spectrum approximate feasibility persists under this generality, and the polishing theorem shows that the accompanying linear-algebraic stability needed for approximate-to-exact conversion also persists. In this way, the constant 1/4 emerges as a geometric property of the feasibility problem rather than a special feature of Gaussian symmetry.
Finally, the combination of approximate feasibility below threshold and polishing yields an explicit end-to-end algorithm with polynomial complexity. One may first compute a bounded-spectrum approximate feasible matrix S by a standard convex program (for instance, an SDP enforcing 0 ≼ S ≼ MI together with a small ℓ2 residual constraint). One then applies the polishing map, which requires solving the linear system 𝒢v = r(S) on range (𝒢) and forming Δ = 𝒜*(v). Importantly, this can be implemented without explicitly forming 𝒢: iterative methods such as conjugate gradient only require matrix–vector products w ↦ 𝒢w = 𝒜(𝒜*(w)), which in turn reduce to evaluating quadratic forms xi⊤(∑jwjxjxj⊤)xi. The resulting procedure is deterministic given the data, and our theorems assert that, below the threshold, it succeeds uniformly over all approximate inputs satisfying the bounded-spectrum residual condition.
In summary, we (i) establish a universal 1/4 transition for bounded-spectrum approximate feasibility under isotropic subgaussian designs, (ii) provide a deterministic and stable polishing mechanism that turns such approximate solutions into exact PSD solutions with controlled spectrum, and (iii) show that nuclear-norm minimization produces PSD, bounded exact solutions in the same below-threshold regime. These results identify the correct ``robust feasibility’’ notion for ellipsoid fitting at scale n = αd2 and supply the missing approximate-to-exact step needed to convert geometric existence statements into algorithmically meaningful exact feasibility guarantees.
We briefly situate (EF) within the literature on ellipsoid fitting, semidefinite programming, and high-dimensional feasibility transitions. The basic observation, going back at least to early work on convex relaxations for quadratic constraints, is that the unknown quadratic form can be lifted to a matrix variable. After lifting, the constraints xi⊤Sxi = 1 become linear equalities in S, and the constraint S ≽ 0 is convex. Thus, at the level of modeling, (EF) is already a semidefinite feasibility problem: it asks whether a random affine subspace of 𝕊d intersects the PSD cone. This perspective makes it natural to import tools from random spectrahedra, convex geometric functional analysis, and the algorithmic theory of SDPs.
A closely related family of formulations replaces feasibility by
optimization with a convex regularizer. The most common objective is
trace minimization,
min Tr (S) s.t. 𝒜(S) = 1, S ≽ 0,
which coincides with nuclear-norm minimization on the PSD cone and can
be interpreted geometrically as selecting a smallest-trace (hence, in a
certain averaged sense, ``smallest volume’’) centered ellipsoid
consistent with the constraints. One also encounters regularized and
relaxed variants, for example
min Tr (S) + λ∥𝒜(S) − 1∥22 s.t. 0 ≼ S ≼ MId,
or feasibility with tolerance ∥𝒜(S) − 1∥2 ≤ η,
as a surrogate for noise or model mismatch. From an algorithmic
standpoint these formulations are attractive because they are convex and
admit polynomial-time solvers in the worst case, while in practice they
can be implemented with first-order methods that access 𝒜 and 𝒜* through matrix–vector
operations. From an analytical standpoint, however, the challenge is not
convexity but geometry: one must understand when the feasible region is
nonempty and, if it is nonempty, whether it contains solutions with
controlled spectrum.
Several works emphasize that the spectrum control is not a technical convenience but an essential feature of the phenomenon. In the absence of a bound such as S ≼ MId, one may obtain solutions that are extremely ill-conditioned: the constraints can be satisfied by placing nearly all mass of S on a low-dimensional subspace aligned with atypical correlations among the sample points, at the cost of a very large operator norm and a correspondingly unstable ellipsoid. Such solutions can exist even when the ``typical'' geometry suggests infeasibility, and they are often invisible to coarse arguments that track only dimension or rank. This motivates focusing on a bounded-spectrum regime, which is the regime in which one expects the feasibility problem to reflect intrinsic high-dimensional geometry rather than degenerate algebraic coincidences.
This distinction also clarifies why a naive parameter count is an unreliable predictor. Although dim (𝕊d) = d(d + 1)/2 and the number of constraints is n = αd2, the PSD constraint is not a linear codimension constraint and cannot be captured by equating degrees of freedom. Moreover, the map S ↦ (xi⊤Sxi)i is not a generic linear map on 𝕊d; it is structured by rank-one measurements, and its image and nullspace interact nontrivially with the cone 𝕊+d. Consequently, one can have regimes in which many solutions exist but are ill-behaved, regimes in which no bounded solution exists but near-feasible points exist, and regimes in which even approximate feasibility fails. Understanding which of these regimes occurs at the scale n ≍ d2 requires more than dimension counting.
On the lower-bound side, a recurring theme is that certain convex relaxations become ineffective beyond a constant sampling ratio. Early contributions by Saunderson and by Chandrasekaran–Parrilo–Willsky analyze semidefinite representations and convex geometric obstructions for problems that can be cast as fitting quadratic forms to data, highlighting that, in high dimension, feasibility of structured spectrahedra can fail abruptly and that the transition is driven by global geometry rather than local degrees of freedom. More recent works develop sharper impossibility results and algorithmic lower bounds in related random SDP and random constraint satisfaction settings. While the technical statements vary across models, a common conclusion is that above certain constant-density thresholds, one cannot simultaneously satisfy all constraints with a well-conditioned PSD matrix, even allowing for small residuals, and that this obstruction is robust under distributional perturbations that preserve isotropy and tail control.
The most directly relevant precedent for our purposes is the sharp
bounded-approximate transition established by MB23 in the Gaussian
setting. In that work, one considers the intersection of the affine
slice {S : 𝒜(S) = 1}
(or an approximate version) with the bounded region {0 ≼ S ≼ MId}
and proves a high-dimensional phase transition at α = 1/4 for the existence of such
bounded approximate fits. The result is qualitatively different from
earlier feasibility criteria in two respects. First, it identifies an
explicit constant 1/4 at the scale
n = αd2,
rather than a qualitative
``sufficiently small'' or ``sufficiently large'' regime. Second,
it is formulated in a robust manner: feasibility is replaced by a
residual tolerance that scales naturally with n and d, and the obstruction above
threshold persists uniformly over all fixed spectral bounds M. This robustness is essential if
one aims to connect the geometric transition to algorithmic procedures
and to stability under perturbations.
From the perspective of ellipsoid fitting, the MB23 phenomenon can be interpreted as follows. Below α = 1/4, the random constraints are sparse enough (relative to the curvature of the PSD cone in dimension d) that there typically exist ellipsoids of controlled shape that pass near all points; above α = 1/4, the constraints are dense enough that, even after allowing bounded slack, the affine constraints force the putative ellipsoid to violate positivity or to require eigenvalues that blow up with d. Importantly, this picture is not captured by the heuristic that one needs n ≲ d(d + 1)/2 constraints for solvability: the threshold occurs at α = 1/4 < 1/2, demonstrating that positivity induces an effective loss of degrees of freedom in this regime.
At the same time, the MB23 result leaves open two questions that are central in the present work. The first is universality: whether the same 1/4 threshold persists beyond the rotationally invariant Gaussian ensemble. In many high-dimensional problems governed by convex geometry, Gaussian analysis provides the correct constant but only after one proves that the relevant geometric functional (e.g., widths, intrinsic volumes, or log-volumes) is universal over a class of designs. Establishing such universality is nontrivial here because the constraints are quadratic in the underlying vectors, and the induced measurement matrices are rank-one and highly dependent across entries. The second question is the approximate-to-exact step: an approximate bounded solution does not, by itself, imply the existence of an exact feasible PSD solution of comparable spectral size, nor does it provide a constructive path to one.
The second question is, in our view, closely connected to a line of work on ``polishing’’ or refinement steps in convex and nonconvex optimization: one often obtains a coarse approximate solution from a tractable relaxation and then improves it by exploiting local linear-algebraic structure. In our setting, the relevant structure is the linear operator 𝒜 and its associated Gram operator, which together determine whether residual errors can be removed without introducing a large negative perturbation. What is specific here is that the ambient dimension is d2 and the measurements are rank-one, so standard restricted isometry heuristics do not directly apply. The appropriate substitute is a statement that the Gram operator is well-conditioned on its range and that the adjoint map does not amplify vectors too much in operator norm; these are the properties that make a deterministic correction step stable in the PSD cone.
Finally, we note that the existence of ill-behaved exact solutions is not ruled out by any of the above transitions and is not the primary object of study. One can ask whether (EF) has a PSD solution (perhaps with ∥S∥op growing rapidly) even above α = 1/4, or whether exact feasibility can hold in a narrow window above the approximate threshold due to rare configurations. These questions are meaningful but orthogonal to the goal pursued here, which is to characterize the existence of well-behaved solutions and to provide a stable mechanism that produces them. Accordingly, our statements are formulated in terms of bounded-spectrum approximate feasibility below threshold, robust nonexistence above threshold in the bounded regime, and a correction procedure that preserves positivity and operator-norm control.
With this background in place, we now fix notation and formalize the model: we specify the isotropic subgaussian design assumptions, define the measurement map and the associated Gram operator, and state our main theorems in a form that isolates the universal 1/4 threshold, the polishing guarantee, and the nuclear-norm regularization consequence.
We work in the ambient space 𝕊d of real symmetric
d × d matrices,
equipped with the Frobenius inner product ⟨U, V⟩F := Tr (UV).
Throughout, the dimension d
tends to infinity and the number of constraints scales as
n = αd2,
where α ∈ (0, 1/2) is held
fixed. All probabilistic statements are with respect to the randomness
of the measurement vectors x1, …, xn ∈ ℝd.
We assume that x1, …, xn
are i.i.d., mean zero, and isotropic:
$$
\mathbb{E}[x_i]=0,\qquad \mathbb{E}[x_i x_i^\top]=\frac{1}{d}I_d.
$$
This normalization ensures that for any deterministic S ∈ 𝕊d we
have
$$
\mathbb{E}\big[x_i^\top S
x_i\big]=\operatorname{Tr}\!\Big(S\,\mathbb{E}[x_i
x_i^\top]\Big)=\frac{1}{d}\operatorname{Tr}(S),
$$
so Tr (S) ≈ d is the
natural scale if one aims to satisfy xi⊤Sxi ≈ 1
on average. In addition, we impose a uniform subgaussian tail
assumption: there exists K ≥ 1
such that
$$
\big\|\langle u,\sqrt{d}\,x_i\rangle\big\|_{\psi_2}\le K\qquad \text{for
all }u\in S^{d-1}.
$$
This hypothesis encompasses the Gaussian model xi ∼ N(0, Id/d)
as well as many bounded and product distributions (e.g. Rademacher
entries $\pm 1/\sqrt{d}$). Importantly,
we do not assume rotational invariance.
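For concreteness, the following short sketch (an illustration on our part, assuming numpy; all names are ours) draws the two designs just mentioned and checks the normalization E[x_i x_i^⊤] = I_d/d empirically.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 20000

# Gaussian design: x_i ~ N(0, I_d / d)
X_gauss = rng.standard_normal((n, d)) / np.sqrt(d)
# Rademacher design: i.i.d. entries +/- 1/sqrt(d)
X_rad = rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)

for name, X in [("gaussian", X_gauss), ("rademacher", X_rad)]:
    second_moment = X.T @ X / n                  # empirical E[x x^T]
    err = np.linalg.norm(second_moment - np.eye(d) / d, ord=2)
    print(f"{name}: ||empirical second moment - I/d||_op = {err:.2e}"
          f"  (target scale 1/d = {1/d:.2e})")
```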
Given xi, define the
rank-one positive semidefinite measurement matrices
Xi := xixi⊤ ∈ 𝕊+d.
The measurement operator 𝒜 : 𝕊d → ℝn
is
(𝒜(S))i := Tr (SXi) = xi⊤Sxi, i = 1, …, n.
Thus the feasibility constraints xi⊤Sxi = 1
can be written compactly as 𝒜(S) = 1, where
1 ∈ ℝn
denotes the all-ones vector. We write 𝒜* : ℝn → 𝕊d
for the adjoint map relative to the Frobenius and Euclidean inner
products:
$$
\mathcal{A}^*(v)=\sum_{i=1}^n v_i X_i=\sum_{i=1}^n v_i\,x_i x_i^\top .
$$
The action of 𝒜* is
particularly simple: it forms a weighted sum of rank-one PSD matrices.
This structure is the basic reason that polishing can be implemented
with matrix–vector primitives, without explicitly materializing a d(d + 1)/2 × n
design matrix.
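To fix ideas computationally, the following sketch (an illustration on our part, assuming numpy; the helper names A_op and A_adj are ours) implements 𝒜 and 𝒜* directly from the n × d data matrix and verifies the adjointness relation ⟨𝒜(S), v⟩ = ⟨S, 𝒜*(v)⟩F numerically.

```python
import numpy as np

def A_op(S, X):
    """(A(S))_i = x_i^T S x_i, computed row by row from the data matrix X."""
    return np.einsum("ij,jk,ik->i", X, S, X)

def A_adj(v, X):
    """A*(v) = sum_i v_i x_i x_i^T, a weighted sum of rank-one matrices."""
    return X.T @ (v[:, None] * X)

rng = np.random.default_rng(2)
d, n = 30, 200
X = rng.standard_normal((n, d)) / np.sqrt(d)

S = rng.standard_normal((d, d)); S = (S + S.T) / 2   # symmetric test matrix
v = rng.standard_normal(n)
lhs = A_op(S, X) @ v                                  # <A(S), v>
rhs = np.trace(S @ A_adj(v, X))                       # <S, A*(v)>_F
print("adjointness gap:", abs(lhs - rhs))
```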
The associated Gram operator 𝒢 : ℝn → ℝn
is defined by
𝒢 := 𝒜𝒜*.
Equivalently, 𝒢 is the n × n matrix with
entries
𝒢ij = ⟨Xi, Xj⟩F = Tr (xixi⊤xjxj⊤) = (xi⊤xj)2.
We emphasize two points that will recur. First, 𝒢 is positive semidefinite but typically not
invertible when n ≫ d(d + 1)/2;
hence the Moore–Penrose pseudoinverse 𝒢† is the appropriate inverse on
range (𝒢). Second, 𝒢 captures the effective conditioning of the
measurement system restricted to the span of the observed rank-one
matrices: controlling its nonzero singular values is the main
quantitative input behind stable correction steps.
For S ∈ 𝕊d we use
∥S∥op for the
operator norm, ∥S∥F for the
Frobenius norm, and ∥S∥* for the nuclear norm
(which equals Tr (S) when
S ≽ 0). For vectors in ℝn we use ∥ ⋅ ∥2 and ∥ ⋅ ∥∞. Given an S, the constraint residual is
r(S) := 𝒜(S) − 1 ∈ ℝn.
Our approximate feasibility notion is tailored to the n = αd2
scaling and to the subsequent inversion through 𝒢. Concretely, for parameters δ > 0 and M > 0 we define the
bounded-spectrum approximate feasibility event EFPδ, M as the
existence of an S ∈ 𝕊d such
that
$$
0\preceq S\preceq M I_d,
\qquad
\|r(S)\|_2 \le \delta \sqrt{\frac{n}{d}}.
$$
The factor $\sqrt{n/d}$ is the natural
normalization for the Euclidean residual in this model: under isotropy
and bounded spectrum, each coordinate xi⊤Sxi
has fluctuations on the order of $1/\sqrt{d}$, so the ℓ2 aggregate over n constraints scales like $\sqrt{n/d}$. This choice is also convenient
analytically because it aligns the residual size with the typical scale
of 𝒢, whose nonzero spectrum is of
order d in the below-threshold
regime.
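A quick numerical sanity check of this scaling (an illustration on our part, assuming numpy), using the bounded-spectrum test matrix S = Id, for which each coordinate xi⊤Sxi = ∥xi∥₂² concentrates around 1:

```python
import numpy as np

rng = np.random.default_rng(3)
d, alpha = 100, 0.2
n = int(alpha * d**2)
X = rng.standard_normal((n, d)) / np.sqrt(d)     # isotropic Gaussian design

S = np.eye(d)                                    # bounded-spectrum test matrix
r = np.einsum("ij,jk,ik->i", X, S, X) - 1.0      # residual A(S) - 1
print("||r||_2         =", np.linalg.norm(r))
print("sqrt(n/d) scale =", np.sqrt(n / d))
# the two numbers are of the same order: each coordinate fluctuates at ~ 1/sqrt(d)
```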
Our exact feasibility problem (EF) is
find S ≽ 0 such that
𝒜(S) = 1.
Even when EFPδ, M holds
for small δ, it is not
automatic that an exact feasible PSD matrix exists with comparable
spectrum. The key observation is that, if S is approximately feasible with
residual r = r(S), then any
correction Δ satisfying 𝒜(Δ) = r yields an exactly
feasible candidate S − Δ. Among all such
corrections lying in range (𝒜*), the
minimum-Frobenius-norm choice is obtained by solving
v = 𝒢†r, Δ = 𝒜*(v),
since then 𝒜(Δ) = 𝒜𝒜*(𝒢†r) = 𝒢𝒢†r
equals r whenever r ∈ range (𝒢) (and this is the
relevant component for residual removal). This motivates the explicit
map
Polish(S): S̃ := S − tΔ,
with a damping parameter t ∈ (0, 1] chosen deterministically
to preserve S̃ ≽ 0; a canonical
choice is
$$
t:=\min\Big\{1,\
\frac{\lambda_{\min}(S)}{2\|\Delta\|_{\mathrm{op}}}\Big\}.
$$
The probabilistic content is then concentrated in showing that below the
1/4 threshold the correction Δ is small in operator norm whenever
r(S) is small in
ℓ2, so that t = 1 or at least t bounded away from 0 is admissible, and the spectral bound on
S is not significantly
degraded.
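The following dense-algebra sketch of the Polish map (an illustration on our part, assuming numpy; for clarity it forms 𝒢 explicitly and uses a least-squares solve, which is only reasonable at small n, whereas the matrix-free variant is discussed later) implements exactly the damped update above. All function names are ours, and at small d the damping may or may not be inactive on a given draw.

```python
import numpy as np

def polish(S, X):
    """One damped polishing step S -> S - t * Delta with Delta = A*(G^+ r)."""
    r = np.einsum("ij,jk,ik->i", X, S, X) - 1.0           # residual A(S) - 1
    G = (X @ X.T) ** 2                                    # Gram matrix, G_ij = (x_i^T x_j)^2
    v = np.linalg.lstsq(G, r, rcond=None)[0]              # v = G^+ r (minimum-norm solution)
    Delta = X.T @ (v[:, None] * X)                        # A*(v) = sum_i v_i x_i x_i^T
    lam_min = np.linalg.eigvalsh(S)[0]
    norm_D = np.linalg.norm(Delta, ord=2)
    t = 1.0 if norm_D == 0 else min(1.0, lam_min / (2 * norm_D))  # damping rule
    return S - t * Delta, t

rng = np.random.default_rng(4)
d, alpha = 40, 0.15
n = int(alpha * d**2)
X = rng.standard_normal((n, d)) / np.sqrt(d)

S1, t = polish(np.eye(d), X)                              # crude approximate fit as input
r1 = np.einsum("ij,jk,ik->i", X, S1, X) - 1.0
print("t =", t, " residual after polish =", np.linalg.norm(r1),
      " lambda_min =", np.linalg.eigvalsh(S1)[0])
```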
The first result (Theorem~A) identifies the universal location of the bounded approximate feasibility transition.
The second result (Theorem~B) turns any bounded approximately feasible point into an exact feasible PSD point by an explicit, stable correction.
Finally, we connect exact feasibility to convex regularization.
Consider the nuclear-norm minimization problem
min ∥S∥* s.t. 𝒜(S) = 1.
This problem is unconstrained by S ≽ 0 a priori, so it is not
immediate that its optimizers are PSD. Below the same threshold, we show
that the geometry of the constraints forces PSD optimality and yields a
uniform spectral bound.
In combination, Theorems A and B yield an explicit end-to-end route to solving (EF) below threshold: one first produces any bounded approximate feasible point (e.g. by a convex program over {0 ≼ S ≼ MId} with an ℓ2 residual constraint), and then applies Polish by solving linear systems in 𝒢 using iterative methods that access 𝒢v through applications of 𝒜 and 𝒜*. The universality statement ensures that both existence and stability persist for all isotropic subgaussian designs at fixed K, and the subsequent sections isolate the convex-geometric mechanism that pins the transition at α = 1/4 in this generality.
We now sketch the geometric and probabilistic framework behind
Theorem~A, isolating the components that (i) identify the transition at
α = 1/4 in the Gaussian model
and (ii) transfer that transition, with quantitative robustness, to
general isotropic subgaussian designs. The central object is the compact
convex ``well-behaved’’ set
𝒦M := {S ∈ 𝕊d: 0 ≼ S ≼ MId},
and the induced random convex body in measurement space
𝒞M := 𝒜(𝒦M) ⊂ ℝn.
By definition, EFPδ, M is the
event that 1 lies
within Euclidean distance $\delta\sqrt{n/d}$ of 𝒞M, i.e.
$$
\mathrm{EFP}_{\delta,M}\quad\Longleftrightarrow\quad
\operatorname{dist}\big(\mathbf{1},\mathcal{C}_M\big)\le
\delta\sqrt{\frac{n}{d}}.
$$
Thus, bounded-spectrum approximate feasibility is a statement about the
random set 𝒞M
containing (or nearly containing) a prescribed deterministic point. This
reformulation makes it transparent that the problem is convex-geometric
(it depends on the shape of 𝒞M) and that the
distribution of xi enters only
through the random linear map 𝒜.
A convenient way to control dist (1, 𝒞M)
is to use the standard dual representation of distance to a closed
convex set:
dist (z, C) = sup∥y∥2 ≤ 1 {⟨y, z⟩ − hC(y)}, hC(y) := supc ∈ C⟨y, c⟩.
Specializing to C = 𝒞M = 𝒜(𝒦M),
we compute the support function through the adjoint:
h𝒞M(y) = supS ∈ 𝒦M⟨y, 𝒜(S)⟩ = supS ∈ 𝒦M⟨𝒜*(y), S⟩F.
Since S is constrained to
satisfy 0 ≼ S ≼ MId,
the maximizer aligns S with
the positive eigenspace of 𝒜*(y) and saturates the
upper spectral bound; explicitly,
h𝒞M(y) = M Tr ((𝒜*(y))+),
where (H)+ denotes
the positive part in the spectral decomposition. Therefore,
dist (1, 𝒞M) = sup∥y∥2 ≤ 1{⟨y, 1⟩ − M Tr ((𝒜*(y))+)}.
This identity already indicates why α = 1/4 is a spectral/geometric
threshold: the competition is between a deterministic linear functional
⟨y, 1⟩ in
ℝn and a random
spectral functional of the d × d matrix 𝒜*(y) = ∑iyiXi,
and the scaling n = αd2
places the supremum over y on
a comparable ``degrees-of-freedom’’ scale to the matrix dimension.
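As a small numerical illustration of this identity (assuming numpy; this probes the dual functional only at the single test direction y = 1/√n, which lower-bounds the supremum rather than computing it):

```python
import numpy as np

rng = np.random.default_rng(5)
d, alpha, M = 60, 0.2, 2.0
n = int(alpha * d**2)
X = rng.standard_normal((n, d)) / np.sqrt(d)

y = np.ones(n) / np.sqrt(n)                      # one test direction, ||y||_2 = 1
A_adj_y = X.T @ (y[:, None] * X)                 # A*(y)
pos_trace = np.linalg.eigvalsh(A_adj_y).clip(min=0).sum()   # Tr((A*(y))_+)
value = y.sum() - M * pos_trace                  # <y, 1> - M * Tr((A*(y))_+)
print("dual functional at y = 1/sqrt(n):", value)
# a single probe only lower-bounds the supremum defining dist(1, C_M)
```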
While the dual distance formula is useful for deterministic
manipulations, MB23’s analysis (in the Gaussian case) proceeds by a
closely related ``volume of a thickened slice’’ viewpoint which is
robust under universality replacements. Concretely, for δ > 0 we consider the thickened
feasible set
$$
\mathcal{F}_{\delta,M}:=\Big\{S\in\mathcal{K}_M:\
\|\mathcal{A}(S)-\mathbf{1}\|_2\le \delta\sqrt{\frac{n}{d}}\Big\}.
$$
Nonemptiness of ℱδ, M is
equivalent to EFPδ, M, but
volume provides additional quantitative structure: one studies the
normalized log-volume
$$
\Phi_{\delta,M}:=\frac{1}{d^2}\log
\operatorname{Vol}\big(\mathcal{F}_{\delta,M}\big),
$$
or a smoothed surrogate of it (e.g. a log-partition function), and shows
that Φδ, M
concentrates around a deterministic limit ϕδ, M(α)
as d → ∞. In such a framework,
the feasibility transition corresponds to the point at which ϕδ, M(α)
crosses from −∞ (exponentially small
volume, hence emptiness w.h.p.) to a finite value (exponentially large
volume, hence nonemptiness w.h.p.). Importantly, the normalization by
d2 matches the
ambient dimension scale of 𝕊d, so the concentration
errors must be o(d2) to
identify a sharp constant threshold in α.
Let {gi}i ≤ n
be i.i.d. Gaussian with gi ∼ N(0, Id/d),
and define the corresponding operator 𝒜G and body 𝒞MG = 𝒜G(𝒦M).
In this model, rotational invariance and Gaussian comparison tools yield
a variational characterization of ϕδ, MG(α)
(or directly of dist (1, 𝒞MG))
in terms of an optimization over limiting spectral measures. The key
output, established in MB23 in the bounded-spectrum setting, is the
existence of a universal critical value αc = 1/4 with a
robust gap: for every fixed ε > 0 there exist constants δ±(ε) > 0
such that
$$
\alpha\le \frac14-\varepsilon
\ \Longrightarrow\
\mathbb{P}\big[\mathrm{EFP}_{\delta_-,M_*}\big]\to 1
\quad\text{for some finite }M_*,
$$
whereas
$$
\alpha\ge \frac14+\varepsilon
\ \Longrightarrow\
\mathbb{P}\big[\mathrm{EFP}_{\delta_+,M}\big]\to 0
\quad\text{for every fixed }M.
$$
At the level of the dual distance formula, the same statement can be
phrased as follows: below 1/4 the
random functional
y ↦ ⟨y, 1⟩ − M Tr ((𝒜G*(y))+)
is negative (uniformly over ∥y∥2 ≤ 1) by an amount
that dominates $\delta\sqrt{n/d}$,
while above 1/4 it remains positive for
some y (again by a
quantitative margin). The robustness in δ is essential, because it allows us
to move between ``exact'' and ``approximate'' feasibility
without re-solving the limiting variational problem each time.
The remaining task for Theorem~A is to show that the same limiting functional governs the subgaussian model, with constants depending only on (ε, K). We implement this via a two-step universality argument.
First, we pass from the hard feasibility indicator 1{ℱδ, M ≠ ∅}
to a smoothed convex functional of the random operator 𝒜. A typical choice is a log-partition
function
$$
Z_{\beta}:=\int_{\mathcal{K}_M}\exp\!\Big(-\beta\big\|\mathcal{A}(S)-\mathbf{1}\big\|_2^2\Big)\,dS,
\qquad
\Psi_{\beta}:=\frac{1}{d^2}\log Z_{\beta},
$$
with β = β(d) growing
slowly. If ℱδ, M is empty
then Zβ is
exponentially small, whereas if ℱδ, M has
nonnegligible volume then Zβ is
exponentially large; in both cases the scale separation is exponential
in d2. Thus,
high-probability bounds for Ψβ translate
into high-probability feasibility/nonfeasibility statements with
explicit δ margins.
Second, we compare 𝔼Ψβ (and then
Ψβ itself,
by concentration) between the subgaussian design and the Gaussian
surrogate. This is the point at which rotational invariance is replaced
by an invariance principle: we apply a Lindeberg replacement scheme (or,
equivalently, a Chatterjee-type invariance argument) to swap the
coordinates of $\sqrt{d}\,x_i$ with
those of $\sqrt{d}\,g_i$ one at a time.
The required smoothness is supplied by the exponential smoothing in
Zβ, and
the required derivative bounds are obtained from the spectral truncation
0 ≼ S ≼ MId,
which prevents 𝒜(S) from being
excessively sensitive to a single coordinate of a single xi.
Subgaussianity with parameter K controls the error terms in the
replacement through uniform moment bounds, yielding
|𝔼Ψβ − 𝔼ΨβG| ≤ od → ∞(1) uniformly
over fixed
α ∈ (0, 1/2), M, β ≤ dc,
for a sufficiently small absolute c > 0 (depending on K through moment constants). The
same replacement principle also applies to the dual-distance functional
after a standard smoothing of the positive-part trace Tr ((⋅)+).
Having matched expectations to the Gaussian surrogate, we upgrade to high probability by concentration for subgaussian chaos. The functionals Ψβ and the smoothed dual distance are Lipschitz (in an appropriate product metric) with Lipschitz constants polynomial in (M, β), and hence satisfy subgaussian concentration around their means. Because the relevant scale is d2, these concentration inequalities deliver deviations of order o(1) for the normalized quantities. Combining (i) the Gaussian threshold with a quantitative gap at α = 1/4 ± ε, (ii) the universality comparison error o(1), and (iii) concentration, we obtain the robust constants δ−(ε, K), δ+(ε, K), and a finite M*(α, ε, K) as stated in Theorem~A. In particular, below 1/4 we obtain nonemptiness of ℱδ−, M* with probability 1 − o(1), while above 1/4 the same argument implies that even the thickened slice remains empty uniformly over bounded M.
We emphasize that Theorem~A is only an existence statement for a bounded approximate fit; it does not by itself produce an exact feasible PSD solution. Its role is to place us in a regime where (a) there exists some S with controlled spectrum and small residual in the natural ℓ2 scale, and (b) the constants are uniform over all isotropic subgaussian designs with parameter K. The next step is therefore deterministic: given such an S, we explicitly remove the residual by inverting the constraint system on its range. This is exactly the purpose of the polishing map, which we treat next.
Fix an input matrix S ∈ 𝕊d and write
its residual as
r = r(S) := 𝒜(S) − 1 ∈ ℝn.
We seek a correction Δ ∈ 𝕊d such that
𝒜(S − Δ) = 1,
i.e. 𝒜(Δ) = r. A
natural restriction is to search for Δ in range (𝒜*), since corrections of
the form 𝒜*(v) are
explicitly representable as weighted rank-one sums and are precisely
those produced by the normal equations. Recall the Gram operator
𝒢 := 𝒜𝒜* ∈ ℝn × n, (𝒢)ij = ⟨Xi, Xj⟩F = (xi⊤xj)2.
Whenever 𝒢 is invertible (equivalently,
𝒜 has full row rank), the constraint
𝒜(Δ) = r is solved by
choosing v = 𝒢−1r and
setting
$$
\Delta:=\mathcal{A}^*(v)=\sum_{i=1}^n v_i X_i,
\qquad
\widetilde{S}:=S-\Delta.
$$
Indeed,
𝒜(Δ) = 𝒜𝒜*(v) = 𝒢v = r,
and therefore 𝒜(S̃) = 𝒜(S) − 𝒜(Δ) = 1.
This is the basic polishing step. In the random regime n = αd2
with α < 1/2, the domain
dimension dim (𝕊d) = d(d + 1)/2
dominates n, and for the
ensembles under consideration one expects (and later proves with high
probability) that 𝒜 has full row rank,
so that 𝒢−1 exists and the
above formulas are unambiguous.
To separate the deterministic algebra from the probabilistic
full-rank claim, we define polishing using the Moore–Penrose inverse.
Let 𝒢† denote the
pseudoinverse and set
v := 𝒢†r, Δ := 𝒜*(v), Polish(S) := S − Δ,
with the understanding that 𝒢† = 𝒢−1 when 𝒢 is invertible. Since 𝒢𝒢† is the orthogonal projector
onto range (𝒢) = range (𝒜), we always
have the identity
𝒜(Δ) = 𝒢𝒢†r = Projrange (𝒜)(r).
Thus 𝒜(Polish(S)) = 1
holds whenever r ∈ range (𝒜),
which is automatic if 𝒜 is surjective
onto ℝn (full row
rank). More generally, if there exists any exact feasible matrix S⋆ (not necessarily PSD)
with 𝒜(S⋆) = 1,
then
r = 𝒜(S) − 𝒜(S⋆) = 𝒜(S − S⋆) ∈ range (𝒜),
and again the pseudoinverse correction enforces exact constraint
satisfaction. In either case, the definition reduces the problem of
producing an exact feasible point to controlling the side effects of
subtracting Δ.
The choice v = 𝒢†r is canonical: among all v solving 𝒢v = r (when solutions exist), it minimizes ∥v∥2, and consequently Δ = 𝒜*(v) is the minimum-Frobenius-norm correction among all corrections constrained to lie in range (𝒜*). Concretely, if we minimize ∥Δ∥F subject to Δ = 𝒜*(v) and 𝒜(Δ) = r, then ∥Δ∥F2 = ⟨v, 𝒢v⟩ and the normal equations yield v = 𝒢†r. This optimality is useful because it shows that any bound we prove for Δ is, in a precise sense, the best possible bound for corrections that do not exploit the (typically large) nullspace of 𝒜.
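For completeness, here is a short verification of this minimum-norm property in our notation (a Pythagorean argument; it in fact shows that Δ has least Frobenius norm among all corrections solving 𝒜(Δ′) = r, not only those in range (𝒜*)): if 𝒜(Δ′) = r as well, then Δ′ − Δ ∈ ker (𝒜), which is orthogonal to range (𝒜*) ∋ Δ, so
$$
\|\Delta'\|_F^2=\|\Delta\|_F^2+\|\Delta'-\Delta\|_F^2\ \ge\ \|\Delta\|_F^2,
\qquad
\|\Delta\|_F^2=\langle \mathcal{A}^*(v),\mathcal{A}^*(v)\rangle_F=\langle v,\mathcal{G}v\rangle .
$$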
Assume now that the input S
lies in the bounded-spectrum cone 𝒦M, i.e. 0 ≼ S ≼ MId.
After polishing, exact feasibility is automatic as above, but positivity
is not. Deterministically, we control the loss of PSD via the operator
norm of Δ. By Weyl’s
inequality,
λmin(S − Δ) ≥ λmin(S)−∥Δ∥op.
Hence S − Δ ≽ 0
provided ∥Δ∥op ≤ λmin(S).
Since λmin(S) may be
small or even zero, we introduce a damped variant: for
$$
t:=\min\Big\{1,\
\frac{\lambda_{\min}(S)}{2\|\Delta\|_{\mathrm{op}}}\Big\},
\qquad
\widetilde{S}:=S-t\Delta,
$$
we always have S̃ ≽ 0 because
λmin(S̃) ≥ λmin(S) − t∥Δ∥op ≥ λmin(S)/2.
On the high-probability event where ∥Δ∥op is sufficiently
small relative to λmin(S) for all
inputs of interest, the damping is inactive (so t = 1) and we simultaneously obtain
exact feasibility and PSD. Accordingly, the probabilistic work in
Theorem~B is organized to guarantee that, below threshold and for
residuals at the ℓ2
scale $\|r\|_2\lesssim \sqrt{n/d}$, the
correction Δ is small in
operator norm uniformly over all bounded-spectrum inputs.
We now isolate the quantitative inequality we will later prove with
high probability. By definition,
v = 𝒢†r, Δ = 𝒜*(v).
The first step is to control ∥v∥2 by conditioning of
𝒢 on its range:
$$
\|v\|_2 \;=\;\|\mathcal{G}^\dagger r\|_2
\;\le\;
\frac{\|r\|_2}{\sigma_{\min}(\mathcal{G}\vert_{\operatorname{range}})}.
$$
Thus, any lower bound of the form σmin(𝒢|range) ≳ d
immediately converts the natural residual scale $\|r\|_2\lesssim \delta\sqrt{n/d}\asymp
\delta\sqrt{d}$ into
$$
\|v\|_2 \lesssim \frac{\delta}{\sqrt{d}}.
$$
The second step is to control the operator norm of 𝒜*(v) = ∑iviXi
in terms of ∥v∥2
(and, if needed, ∥v∥∞):
∥Δ∥op = ∥𝒜*(v)∥op ≤ ∥𝒜*∥2 → op ∥v∥2,
where ∥𝒜*∥2 → op := sup∥v∥2 = 1∥𝒜*(v)∥op.
For rank-one subgaussian designs, the relevant bound takes the
form
$$
\big\|\mathcal{A}^*(v)\big\|_{\mathrm{op}}
\le C(K)\Big(\frac{\|v\|_2}{\sqrt{d}}+\|v\|_\infty\Big),
$$
and in the regime where v is
not extremely sparse (which is the typical output of 𝒢†r when 𝒢 is well-conditioned), the ∥v∥∞ term is negligible
compared to $\|v\|_2/\sqrt{d}$.
Combining these inequalities yields the deterministic reduction
$$
\|\Delta\|_{\mathrm{op}}
\;\lesssim\;
\frac{\|r\|_2}{\sqrt{d}\
\sigma_{\min}(\mathcal{G}\vert_{\operatorname{range}})}
\;+\;
\| \mathcal{G}^\dagger r\|_\infty,
$$
so that polishing preserves PSD as soon as the Gram operator is
well-conditioned and the adjoint sum is not dominated by a single
coefficient.
The preceding bounds explain why the ℓ2 residual scale in
EFPδ, M is
natural for polishing. If $\|r\|_2\le
\delta\sqrt{n/d}\asymp \delta\sqrt{d}$ and σmin(𝒢|range) ≳ d,
then $\|v\|_2\lesssim \delta/\sqrt{d}$.
Feeding this into $\|\mathcal{A}^*(v)\|_{\mathrm{op}}\lesssim
\|v\|_2/\sqrt{d}$ gives the heuristic magnitude
$$
\|\Delta\|_{\mathrm{op}}\;\lesssim\;\frac{\delta}{d},
$$
which vanishes as d → ∞. In
particular, below the 1/4 threshold the
correction is asymptotically tiny in operator norm, so once we know that
bounded-spectrum approximate feasibility provides at least some slack
away from the PSD boundary (or once we verify uniformly that t = 1 in the damping rule), the
one-step update S ↦ S − Δ produces
an exact feasible PSD point. This is the sense in which polishing is a
deterministic reduction: all randomness is now confined to verifying (i)
spectral bounds for 𝒢 and (ii) uniform
operator-norm bounds for 𝒜*(v), both in a form
that holds simultaneously for all admissible v arising from residuals of
admissible S.
Although the definition of Polish(S) is expressed through 𝒢†, it is implementable without
forming 𝒢 as an n × n matrix. To apply
𝒢 to a vector u ∈ ℝn, we use
the factorization 𝒢u = 𝒜(𝒜*(u)):
$$
\mathcal{A}^*(u)=\sum_{i=1}^n u_i X_i,\qquad
(\mathcal{A}(\mathcal{A}^*(u)))_j
= \sum_{i=1}^n u_i\,\langle X_i,X_j\rangle_F
= \sum_{i=1}^n u_i\,(x_i^\top x_j)^2.
$$
Thus, solving 𝒢v = r
(or computing 𝒢†r)
can be done by conjugate gradient or MINRES using only matrix–vector
products with 𝒢, and the final
correction Δ = 𝒜*(v) is a
single weighted rank-one sum. The deterministic analysis above therefore
directly translates into a stable numerical recipe, provided the
probabilistic estimates guarantee that the linear system is
well-conditioned in the relevant regime. The next step is to establish
precisely these random operator estimates, uniformly in d and with constants depending only
on (ε, K), which is
where the threshold α = 1/4
re-enters through the conditioning of 𝒢.
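A matrix-free sketch of this recipe (an illustration on our part, assuming numpy and scipy; we use scipy.sparse.linalg.cg with a LinearOperator, and all helper names are ours):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(6)
d, alpha = 80, 0.15
n = int(alpha * d**2)
X = rng.standard_normal((n, d)) / np.sqrt(d)

S = np.eye(d)                                    # approximate feasible input
r = np.einsum("ij,jk,ik->i", X, S, X) - 1.0      # residual A(S) - 1

def G_matvec(u):
    """u -> G u = A(A*(u)) without forming the n x n Gram matrix."""
    M = X.T @ (u[:, None] * X)                   # A*(u) = sum_i u_i x_i x_i^T
    return np.einsum("ij,jk,ik->i", X, M, X)     # A(A*(u))

G = LinearOperator((n, n), matvec=G_matvec)
v, info = cg(G, r, maxiter=1000)                 # solve G v = r iteratively
Delta = X.T @ (v[:, None] * X)                   # correction A*(v)
S_new = S - Delta

res = np.einsum("ij,jk,ik->i", X, S_new, X) - 1.0
print("cg exit flag:", info, " residual after correction:", np.linalg.norm(res),
      " ||Delta||_op:", np.linalg.norm(Delta, ord=2))
```

Each matrix–vector product costs O(nd²) arithmetic operations and touches only the n × d data matrix, never the n × n Gram matrix or the lifted design.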
We now state the probabilistic inputs that make the deterministic reduction effective. Throughout we work on a single high-probability event (depending only on (ε, K)) on which two properties hold simultaneously: (i) the Gram operator 𝒢 = 𝒜𝒜* is uniformly well-conditioned on its range, and (ii) the adjoint map 𝒜* sends vectors in ℝn to symmetric matrices with controlled operator norm. On this event, for every bounded-spectrum input S with sufficiently small residual $\|r(S)\|_2\lesssim \sqrt{n/d}$, the pseudoinverse vector v = 𝒢†r(S) is small in ℓ2, and therefore the correction Δ = 𝒜*(v) is small in operator norm. The threshold α = 1/4 enters precisely through the lower edge of the spectrum of 𝒢: below 1/4 we obtain a uniform spectral gap away from 0, whereas above 1/4 the gap closes at the relevant d-dependent scale, preventing stable inversion.
To control σmin(𝒢|range)
and σmax(𝒢), it is
convenient to view 𝒜 as a random matrix
with rows given by the ``lifted’’ observations Xi = xixi⊤
acting by Frobenius inner product. Concretely, identifying 𝕊d with ℝd(d + 1)/2
through an orthonormal basis, 𝒜 becomes
an n × p matrix (with
p = d(d + 1)/2)
whose ith row is vec(Xi)⊤.
Then 𝒢 = 𝒜𝒜* is the Gram
matrix of these lifted rows, and its nonzero singular values coincide
with the squared singular values of 𝒜.
The issue is that Xi is not
isotropic as a vector in ℝp: its distribution has
a distinguished trace direction proportional to Id, and a
trace-zero component that is approximately isotropic. Accordingly, we
decompose
$$
X_i \;=\; \frac{\operatorname{Tr}(X_i)}{d}\,I_d \;+\;
\Big(X_i-\frac{\operatorname{Tr}(X_i)}{d}\,I_d\Big)
\;=\; X_i^{\parallel} + X_i^{\perp},
$$
where Xi⟂ ∈ {S ∈ 𝕊d : Tr (S) = 0}.
The bulk conditioning is governed by the trace-zero part, while the
trace part contributes a low-dimensional component whose effect can be
treated separately (and absorbed into the constants in Lemma~1). In the
Gaussian case, this decomposition is exact enough that 𝒢 reduces, after a deterministic rescaling
and a rank-one adjustment, to a Wishart-type matrix; the lower spectral
edge then behaves like a constant multiple of $(1-2\sqrt{\alpha})^2$ and is strictly
positive if and only if α < 1/4. For isotropic
subgaussian designs, we prove a corresponding statement by a
universality argument for extreme singular values of such lifted sample
covariance operators.
For polishing we do not need a sharp limiting law; it suffices to
prove a uniform spectral sandwich with constants depending only on (ε, K):
c1(ε, K) d ≤ σmin(𝒢|range) ≤ σmax(𝒢) ≤ c2(ε, K) d,
as recorded in Lemma~1. One way to organize the proof is through the
quadratic form representation
$$
u^\top \mathcal{G}\,u \;=\; \big\|\mathcal{A}^*(u)\big\|_F^2
\;=\; \Big\|\sum_{i=1}^n u_i X_i\Big\|_F^2,
\qquad u\in\mathbb{R}^n,
$$
together with the observation that range (𝒢) = range (𝒜) is precisely the
subspace on which u ↦ 𝒜*(u) is
injective. Thus, lower-bounding σmin(𝒢|range)
is equivalent to proving a uniform inequality of the form
$$
\Big\|\sum_{i=1}^n u_i X_i\Big\|_F \;\ge\; c\sqrt{d}\,\|u\|_2
\qquad \text{for all }u\in \operatorname{range}(\mathcal{G}),
$$
and similarly for the upper bound. Because n ≍ d2, naive
ε-net arguments over Sn − 1 are
impossible; instead we appeal to random-matrix results for sample
covariance operators with fixed aspect ratio, combined with a reduction
that isolates the trace direction and applies an extreme-singular-value
theorem on the remaining (p − 1)-dimensional trace-zero
subspace. The point at which α = 1/4 enters is that the effective
aspect ratio in the trace-zero component is 4α (rather than 2α), reflecting the quadratic
(rank-one) lifting; below 1/4 this
yields a strictly positive lower edge, while above 1/4 it degenerates.
The second input is a bound of Lemma~2 type:
$$
\big\|\mathcal{A}^*(v)\big\|_{\mathrm{op}}
\;=\;\Big\|\sum_{i=1}^n v_i x_i x_i^\top\Big\|_{\mathrm{op}}
\;\le\; C(K)\Big(\frac{\|v\|_2}{\sqrt{d}}+\|v\|_\infty\Big),
\qquad v\in\mathbb{R}^n,
$$
on a high-probability event. The ∥v∥∞ term accounts for
the possibility that a single unusually large coefficient multiplies a
rank-one matrix of size comparable to ∥xi∥22;
it is unavoidable in complete generality. The key point is the $\|v\|_2/\sqrt{d}$ scaling, which captures
the averaging when the weights are spread across many indices and the
design is isotropic. A convenient proof strategy is to bound the
quadratic form of 𝒜*(v) uniformly over unit
vectors u ∈ Sd − 1:
$$
u^\top \mathcal{A}^*(v)\,u \;=\;\sum_{i=1}^n v_i\, (u^\top x_i)^2,
$$
so that ∥𝒜*(v)∥op = supu ∈ Sd − 1|∑ivi(u⊤xi)2|.
For fixed u, the summands are
independent, mean-∥u∥22/d
(up to scaling) subexponential random variables with ψ1 norms controlled by
K2/d;
Bernstein-type bounds yield concentration at the correct scale. The
remaining difficulty is to make this uniform over all u, which requires chaining or a
carefully constructed net.
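A quick Monte Carlo illustration of the shape of this bound (assuming numpy; the constant C(K) is not calibrated here, so the comparison is only of orders of magnitude), contrasting a spread-out v with a fully localized one:

```python
import numpy as np

rng = np.random.default_rng(8)
d, alpha = 100, 0.2
n = int(alpha * d**2)
X = rng.standard_normal((n, d)) / np.sqrt(d)

def adjoint_opnorm(v):
    return np.linalg.norm(X.T @ (v[:, None] * X), ord=2)    # ||A*(v)||_op

v_spread = rng.standard_normal(n); v_spread /= np.linalg.norm(v_spread)
v_local = np.zeros(n); v_local[0] = 1.0                      # fully localized weight

for name, v in [("spread", v_spread), ("localized", v_local)]:
    proxy = np.linalg.norm(v) / np.sqrt(d) + np.abs(v).max() # ||v||_2/sqrt(d) + ||v||_inf
    print(f"{name:9s}: ||A*(v)||_op = {adjoint_opnorm(v):.4f},"
          f"  ||v||_2/sqrt(d) + ||v||_inf = {proxy:.4f}")
```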
We implement the uniform control by combining three standard steps.
First, we truncate the event maxi∥xi∥22 ≤ C(K),
which holds with probability 1 − exp (−cd) by
subgaussian concentration of ∥xi∥2.
This removes rare large-norm points and allows us to treat each xixi⊤
as uniformly bounded in operator norm. Second, we symmetrize:
$$
\sup_{u\in S^{d-1}}\Big|\sum_{i=1}^n v_i \big((u^\top
x_i)^2-\mathbb{E}(u^\top x_i)^2\big)\Big|
\;\le\; 2\,\mathbb{E}_\varepsilon\sup_{u\in S^{d-1}}\Big|\sum_{i=1}^n
\varepsilon_i v_i (u^\top x_i)^2\Big|,
$$
where (εi)
are i.i.d. Rademacher signs. Third, conditional on (xi), we control
the Rademacher process indexed by u using generic chaining (or, at a
weaker level, an ε-net on
Sd − 1
together with a Lipschitz extension argument). The chaining route is
preferable because it avoids extra log d factors that become relevant
near α ≈ 1/4 where we must
preserve a quantitative gap depending on ε. The outcome is that, with high
probability, the map v ↦ 𝒜*(v)
behaves like a subgaussian operator from (ℝn, ∥ ⋅ ∥2)
into (𝕊d, ∥ ⋅ ∥op)
up to the unavoidable ∥ ⋅ ∥∞
correction.
The conditioning estimate for 𝒢 is
used only through v = 𝒢†r, so in
principle we need control of 𝒢 only on
vectors of the form r(S) for admissible S. We do not exploit this reduction
and instead prove a stronger statement: 𝒢 is well-conditioned on all of range (𝒢), uniformly over d once α ≤ 1/4 − ε is fixed. This
uniformity is essential for the ``simultaneously for all inputs’’
formulation in Theorem~B. Combining Lemma~1 and Lemma~2 gives, on the
high-probability event,
$$
\|\Delta\|_{\mathrm{op}}
\;=\;\big\|\mathcal{A}^*(\mathcal{G}^\dagger r)\big\|_{\mathrm{op}}
\;\le\; C(K)\Big(\frac{\|\mathcal{G}^\dagger
r\|_2}{\sqrt{d}}+\|\mathcal{G}^\dagger r\|_\infty\Big)
\;\le\;
C(K)\Big(\frac{\|r\|_2}{\sqrt{d}\,\sigma_{\min}(\mathcal{G}\vert_{\operatorname{range}})}+\|\mathcal{G}^\dagger
r\|_\infty\Big),
$$
which is the quantitative form needed to feed into the deterministic
PSD-preservation rule through Weyl-type perturbation bounds. The only
term not directly controlled by Lemma~1 is ∥𝒢†r∥∞, whose
treatment depends on how much delocalization one can prove for 𝒢 and on how sharply one needs constants; in
our setting it suffices to show that for residuals at the ℓ2 scale of EFPδ, M, the
solution v to 𝒢v = r is typically not
supported on o(n)
indices.
Two places require care if one wants explicit dependence on ε as α ↑ 1/4. First, the lower edge of the relevant spectrum of 𝒢 deteriorates polynomially in $(1-2\sqrt{\alpha})$, so any argument that loses unnecessary logarithmic factors (for instance by applying a crude union bound over an exponential net in a dimension comparable to d2) will not suffice. This is why we rely on extreme-singular-value results for subgaussian sample covariance operators with fixed aspect ratio, which provide the correct edge scaling without extraneous dimension-dependent constants. Second, in bounding supu ∈ Sd − 1∑ivi(u⊤xi)2, the natural metric for chaining is induced by the empirical second moment ∑ivi2(u⊤xi)2, which itself fluctuates with (xi); controlling these fluctuations uniformly in u requires a two-level argument (conditioning on (xi), then applying concentration for the induced metric). We flag this as the main technical pressure point: it is straightforward away from the threshold, but near α = 1/4 it is advantageous to use a fully fledged empirical-process tool (Dudley integral with a tail bound on covering numbers in the random metric) rather than a fixed deterministic net.
On the intersection of the high-probability events provided by Lemma~1 and Lemma~2 (and the mild additional delocalization estimate implicit in controlling ∥𝒢†r∥∞), we obtain that for every bounded-spectrum S with $\|r(S)\|_2\le \delta\sqrt{n/d}$, the polishing correction satisfies ∥Δ∥op ≤ Cε, Kδ (with the stronger vanishing behavior in d when the scaling constants permit it). This is precisely the regime in which the deterministic damping rule preserves positivity while enforcing exact feasibility, and it is also the regime in which the operator norm of the output remains within a controlled enlargement of the original bound M. The remainder of the polishing proof consists of combining these random operator estimates with eigenvalue perturbation bounds to obtain a uniform, stable map from approximate bounded feasibility to exact PSD feasibility.
We now record the deterministic perturbation estimates that convert a
quantitative bound on the correction direction into preservation of
positivity and a controlled enlargement of the operator norm. Let S ∈ 𝕊d be an
input matrix with S ≽ 0, and
let r = r(S) = 𝒜(S) − 1 ∈ ℝn
be its residual. Given any vector v ∈ ℝn and the
associated matrix
$$
\Delta=\mathcal{A}^*(v)=\sum_{i=1}^n v_i x_i x_i^\top,
$$
we consider updates of the form S̃ = S − tΔ
for a step size t ∈ [0, 1].
Two constraints are relevant: exact feasibility 𝒜(S̃) = 1, and
positivity S̃ ≽ 0.
If we choose v = 𝒢†r,
then
𝒜(Δ) = 𝒜𝒜*(𝒢†r) = 𝒢𝒢†r = Prange (𝒢) r,
where Prange (𝒢)
denotes orthogonal projection. On the event that 𝒢 has full rank (equivalently, range (𝒢) = ℝn), we
obtain 𝒜(Δ) = r and
hence
𝒜(S − Δ) = 𝒜(S) − 𝒜(Δ) = 1.
Thus, in the well-conditioned regime where 𝒢 is invertible (or at least where r lies in range (𝒢)), the undamped step t = 1 enforces exact feasibility in
a single correction. Damping is introduced solely as a safeguard for
positivity, and in the regime relevant for Theorem~B we will be able to
argue that the safeguard is inactive, i.e. t = 1 holds automatically because
Δ is small.
To guarantee S̃ ≽ 0, it
suffices to control the smallest eigenvalue under the perturbation −tΔ. By Weyl’s
inequality,
λmin(S̃) = λmin(S − tΔ) ≥ λmin(S) − t ∥Δ∥op.
Consequently, if we choose
$$
t \;\le\; \frac{\lambda_{\min}(S)}{\|\Delta\|_{\mathrm{op}}},
$$
then λmin(S̃) ≥ 0 and
S̃ ≽ 0. The explicit damping
rule used in Theorem~B,
$$
t:=\min\Big\{1,\frac{\lambda_{\min}(S)}{2\|\Delta\|_{\mathrm{op}}}\Big\},
$$
has the convenient consequence that either t = 1 (so feasibility is exact) or
else t < 1 but then λmin(S̃) ≥ λmin(S)/2,
which gives a quantitative buffer away from the boundary of the PSD
cone. This buffer is useful if one implements polishing iteratively
(recomputing v from the new
residual), since it prevents accumulation of numerical errors from
pushing the iterate into indefiniteness.
We also require that the output not acquire an uncontrolled large
eigenvalue. Again by Weyl’s inequality,
λmax(S̃) = λmax(S − tΔ) ≤ λmax(S) + t ∥Δ∥op ≤ ∥S∥op+∥Δ∥op,
since t ≤ 1. In particular, if
the input is bounded as 0 ≼ S ≼ MId,
then we obtain the deterministic bound
∥S̃∥op ≤ M+∥Δ∥op.
Thus, all stability properties of the map S ↦ S̃ reduce to bounding
∥Δ∥op in terms of
the residual size ∥r(S)∥2 (and the
random operator parameters controlling 𝒢† and 𝒜*).
We now make explicit the dependence of ∥Δ∥op on ∥r∥2 in the regime where
Lemma~1 and Lemma~2 hold. Write v = 𝒢†r, so that
on range (𝒢) we have
$$
\|v\|_2 \;\le\;
\frac{\|r\|_2}{\sigma_{\min}(\mathcal{G}\vert_{\operatorname{range}})}.
$$
If σmin(𝒢|range) ≳ d,
then ∥v∥2 ≲ ∥r∥2/d.
Feeding this into Lemma~2 yields
$$
\|\Delta\|_{\mathrm{op}}
=\|\mathcal{A}^*(v)\|_{\mathrm{op}}
\;\le\; C(K)\Big(\frac{\|v\|_2}{\sqrt{d}}+\|v\|_\infty\Big)
\;\lesssim\; C(K)\Big(\frac{\|r\|_2}{d^{3/2}}+\|v\|_\infty\Big).
$$
Under the scaling of EFPδ, M, namely
$\|r\|_2\le \delta\sqrt{n/d}$ and n = αd2,
we have $\|r\|_2\lesssim \delta
\sqrt{\alpha}\, d^{1/2}$, hence the contribution of the ℓ2 term alone is of order
δ/d:
$$
\frac{\|r\|_2}{d^{3/2}} \;\lesssim\; \frac{\delta}{d}.
$$
This is the quantitative reason that, below the 1/4 threshold, polishing is stable: the
required correction is not merely small, but vanishing in d at the natural residual scale.
Assume in addition that the input matrix has a spectral margin λmin(S) ≥ η
for some η > 0 (which may
itself be a function of (ε, K) but not of d). If ∥Δ∥op ≤ η/2,
then the damping rule gives t = 1 and the one-step update S̃ = S − Δ is
simultaneously (i) exactly feasible and (ii) PSD. In the
bounded-spectrum approximate feasibility setting, S may lie on the boundary of the PSD
cone, so we cannot assume a deterministic η. Nevertheless, the preceding
residual-to-correction transfer shows that for δ sufficiently small (depending only
on (ε, K)) and d sufficiently large, ∥Δ∥op is uniformly small
for all admissible inputs S.
In this regime, even if λmin(S) = 0, the
inequality
λmin(S − Δ) ≥ −∥Δ∥op
shows that we are at worst slightly indefinite before damping; the
deterministic damping then ensures S̃ ≽ 0 while keeping t close to 1. In implementations, one can equivalently
view this as a safeguard against rare inputs whose residual is at the
admissible threshold but whose spectrum is unusually ill-posed.
The same inequalities quantify sensitivity: if S and S′ are two
bounded-spectrum inputs with residuals r and r′ and corresponding
corrections Δ and Δ′, then on the event
that 𝒢 is well-conditioned and 𝒜* satisfies Lemma~2
uniformly,
$$
\|\Delta-\Delta'\|_{\mathrm{op}}
=\big\|\mathcal{A}^*\big(\mathcal{G}^\dagger(r-r')\big)\big\|_{\mathrm{op}}
\;\le\;
C(K)\Big(\frac{\|\mathcal{G}^\dagger(r-r')\|_2}{\sqrt{d}}+\|\mathcal{G}^\dagger(r-r')\|_\infty\Big).
$$
In particular, ignoring the ℓ∞ term (or controlling
it by delocalization), we obtain a Lipschitz-type bound
$$
\|\Delta-\Delta'\|_{\mathrm{op}} \;\lesssim\;
\frac{1}{d^{3/2}}\,\|r-r'\|_2.
$$
This shows that the polishing map is not only stable but strongly
contractive in the regime where the inverse Gram scaling is 1/d and the adjoint scaling is $1/\sqrt{d}$. Such a contractivity is what
allows us to apply the map ``simultaneously for all inputs’’ once we
restrict to bounded-spectrum matrices with residuals at the $\sqrt{n/d}$ scale.
The preceding discussion also isolates the only two mechanisms by which polishing can fail.
First, if the residual is too large (e.g. ∥r(S)∥2 is not $O(\sqrt{n/d})$ but rather comparable to $\sqrt{n}$), then even a well-conditioned 𝒢 yields ∥v∥2 of order ∥r∥2/d, which may be large enough that ∥Δ∥op is no longer negligible relative to λmin(S) or to the bound M. In this regime, damping may force t ≪ 1, and the one-step update no longer produces exact feasibility. From the geometric viewpoint, this is expected: a large residual means we are far from the affine slice {𝒜(S) = 1} in the measurement space, so the minimum-norm correction in range (𝒜*) need not remain within a small perturbation of the PSD cone.
Second, even with small residual, polishing becomes unstable if 𝒢 is poorly conditioned on its range. Indeed, if σmin(𝒢|range) is of smaller order than d, then ∥v∥2 = ∥𝒢†r∥2 can blow up, and hence so can ∥Δ∥op. This is precisely the obstruction at and above the threshold: as α ↑ 1/4 the relevant lower spectral edge deteriorates, and for α > 1/4 it collapses at the scale needed to control ∥Δ∥op uniformly over all admissible inputs. In that regime, the correction direction can be large even when ∥r∥2 is at the nominal $\sqrt{n/d}$ scale, so Weyl’s inequality no longer guarantees PSD preservation without taking t so small that exact feasibility is lost.
A third, more technical instability is encoded in the ∥v∥∞ term: if v is highly localized (so that a small number of indices carry most of its mass), then 𝒜*(v) may be dominated by a few rank-one terms and its operator norm can be large compared to the $\|v\|_2/\sqrt{d}$ prediction. This is why delocalization (or a priori control of ∥v∥∞ for the relevant residual subspace) is helpful: it rules out a situation where 𝒢† produces a vector v with a few very large coordinates, which would negate the averaging effect exploited by Lemma~2.
Summarizing, once we have placed ourselves on the high-probability
event where (i) 𝒢 has a uniform
spectral gap on its range and (ii) 𝒜* satisfies the uniform
operator-norm bound, the polishing map has the following deterministic
behavior: small residual implies small correction in operator norm;
small operator-norm correction implies PSD preservation by Weyl’s
inequality; and the operator norm of the output increases by at most
∥Δ∥op. The
remaining probabilistic work is therefore exhausted by ensuring that,
below α = 1/4, the chain
∥r∥2 → ∥𝒢†r∥2 → ∥𝒜*(𝒢†r)∥op
is uniformly contracting at the EFPδ, M scale,
so that the undamped step t = 1 is valid for all admissible
inputs and yields an exact feasible PSD matrix with controlled
spectrum.
We consider the convex program
min ∥S∥* s.t. 𝒜(S) = 1,
where ∥ ⋅ ∥* denotes the
nuclear norm (sum of singular values). Since S is symmetric, $\|S\|_*=\sum_{j=1}^d|\lambda_j(S)|$. In
particular, on the PSD cone we have ∥S∥* = Tr (S).
The motivation for this formulation is that it is the canonical convex regularizer for
``simplicity’’ among all exact fits; our aim is to show that below the
1/4 threshold, the optimizer is in fact
PSD and quantitatively well-behaved.
Using the duality ∥S∥* = sup∥Z∥op ≤ 1⟨Z, S⟩F
and the definition of 𝒜*,
the Lagrange dual of the nuclear-norm problem takes the standard form
max ⟨1, y⟩ s.t. ∥𝒜*(y)∥op ≤ 1.
Weak duality is immediate: for any S feasible for the primal and any y feasible for this dual,
∥S∥* ≥ ⟨𝒜*(y), S⟩F = ⟨y, 𝒜(S)⟩ = ⟨1, y⟩.
Thus any y with ∥𝒜*(y)∥op ≤ 1
certifies a lower bound on the primal optimum value.
Let S⋆ be any
minimizer of (NN). Under standard constraint qualifications (which we invoke
only at a high level here, and which can be verified on the
below-threshold event by the existence of strictly feasible
perturbations in ker (𝒜)), strong
duality holds and there exists a dual optimizer y⋆ attaining (NN-D) such
that
$$
\mathcal{A}^*(y_\star)\ \in\ \partial\|S_\star\|_*.
$$
The key structural point is the following: if we can arrange (or prove)
that there is an optimal dual y⋆ for which 𝒜*(y⋆) ≽ 0,
then the subgradient relation above forces S⋆ ≽ 0. Indeed, the
subdifferential ∂∥S∥* at a symmetric
matrix S consists of matrices
that agree with the matrix sign of S on the range of S; if S has a negative spectral subspace,
any element of ∂∥S∥* must have a
corresponding negative direction (an eigenvalue −1 on that subspace). A PSD matrix 𝒜*(y⋆) ≽ 0
cannot serve as a nuclear-norm subgradient unless the negative spectral
projector of S⋆ is
zero. This is the conceptual route by which a dual certificate with
nonnegative spectrum enforces that the primal optimum lies in the PSD
cone.
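For reference, the standard description of the nuclear-norm subdifferential at a symmetric matrix, in the form used in this argument, is the following (with P± the orthogonal projectors onto the positive and negative eigenspaces of S, and P0 the projector onto its kernel):
$$
\partial\|S\|_*\ =\ \bigl\{\,P_+ - P_- + W\ :\ W=W^\top,\ \ W = P_0 W P_0,\ \ \|W\|_{\mathrm{op}}\le 1\,\bigr\}.
$$
If 𝒜*(y⋆) ≽ 0 belongs to this set, the −P− term forces P− = 0, since a PSD matrix cannot have an eigenvalue −1; this is exactly the statement that S⋆ has no negative spectral subspace.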
We now describe how the below-threshold regime enables such a certificate. Fix α ≤ 1/4 − ε. By Theorem~B (polishing), on a probability-1 − o(1) event there exists at least one exact feasible matrix S0 ≽ 0 with bounded spectrum, and moreover we can take S0 with an a priori operator norm bound ∥S0∥op ≤ M*(α, ε, K). We use this bounded feasible point as the anchor for dual certification.
Consider the auxiliary SDP restricted to the PSD cone,
$$
\text{(PSD-Tr)}\qquad \min_{S\succeq 0}\ \operatorname{Tr}(S)\quad\text{subject to}\quad \mathcal{A}(S)=\mathbf{1}.
$$
Any feasible point of (PSD-Tr) is feasible for (NN) and has objective value equal to
its nuclear norm, hence
OPTNN ≤ OPTPSD-Tr ≤ Tr (S0).
The dual of (PSD-Tr) is
$$
\text{(PSD-Tr-D)}\qquad \max_{y\in\mathbb{R}^n}\ \langle\mathbf{1},y\rangle\quad\text{subject to}\quad \mathcal{A}^*(y)\preceq I_d.
$$
Observe that if in addition to 𝒜*(y) ≼ Id
we also have 𝒜*(y) ≽ 0, then y is automatically feasible for the
nuclear-norm dual (NN-D), because ∥𝒜*(y)∥op ≤ 1
follows from 0 ≼ 𝒜*(y) ≼ Id.
Below the threshold, we exploit the slack in the measurement geometry
to select a dual optimizer for (PSD-Tr) whose adjoint image is PSD. Concretely,
we first solve (PSD-Tr-D) (conceptually; in the proof one may equivalently argue
existence by compactness on a bounded event), obtaining some dual
maximizer ytr with
𝒜*(ytr) ≼ Id
and ⟨1, ytr⟩ = OPTPSD-Tr.
We then modify ytr
along directions in ker (𝒜*)
(which is nontrivial in the underdetermined regime n < dim 𝕊d)
to eliminate any negative spectral part of 𝒜*(ytr) while
preserving the dual objective. The feasibility margin provided by α ≤ 1/4 − ε and the uniform
operator control for adjoint sums (Lemma~2) ensure that this ``spectral
cleaning’’ can be performed without violating the upper constraint 𝒜*(y) ≼ Id.
This yields a certificate y⋆ satisfying
$$
0\ \preceq\ \mathcal{A}^*(y_\star)\ \preceq\ I_d,
\qquad
\langle \mathbf{1}, y_\star\rangle\ =\ \mathrm{OPT}_{\text{PSD-Tr}}.
$$
At this stage, y⋆
is feasible for the nuclear-norm dual (NN-D) and attains the PSD-trace dual
optimum value. Therefore,
OPTNN ≥ ⟨1, y⋆⟩ = OPTPSD-Tr ≥ OPTNN,
so equality holds throughout and y⋆ is also dual-optimal
for (NN). In particular, any nuclear-norm minimizer must satisfy the KKT
conditions with a dual matrix 𝒜*(y⋆) ≽ 0,
which forces PSD optimality by the subgradient mechanism discussed
above. This is the content formalized by Lemma~4.
We conclude that, on the same high-probability event (depending on
ε and K) on which the certificate construction can be carried out, every
minimizer Ŝ of (NN) satisfies
Ŝ ≽ 0 and hence ∥Ŝ∥* = Tr (Ŝ).
Moreover, since Ŝ is optimal
for (NN), we have
Tr (Ŝ) = OPTPSD-Tr ≤ Tr (S0).
To pass from trace control to operator-norm control, we combine exact
feasibility with a bounded-spectrum comparison. The polishing theorem
provides a feasible S0 ≽ 0 with ∥S0∥op ≤ M*(α, ε, K);
the dual certificate construction above ensures that Ŝ cannot concentrate mass into a few
eigen-directions without increasing Tr (Ŝ) relative to S0 (equivalently: the
feasible slice intersects the PSD cone in a region whose extremal widths
are uniformly bounded below the threshold). As a result, one obtains the
claimed bound
∥Ŝ∥op ≤ M*(α, ε, K),
possibly after enlarging M* by a factor depending
only on (α, ε, K); this is
the quantitative boundedness statement in Theorem~C.
Finally, while uniqueness is not needed for our main conclusions, it is useful to record the standard sufficient condition. If, in addition, the dual optimizer y⋆ can be chosen so that 𝒜*(y⋆) is contractive off the range of Ŝ (i.e. 0 ≼ 𝒜*(y⋆) ≼ Id with ∥𝒜*(y⋆)∥op = 1 and a strict spectral gap < 1 on the orthogonal complement of range (Ŝ)), then complementary slackness implies that any other feasible PSD matrix must have strictly larger trace, hence Ŝ is the unique minimizer of both (NN) and (PSD-Tr). In the random design setting, such strict complementarity is typical away from the boundary α ↑ 1/4, and can be verified by combining the certificate construction with a nondegeneracy condition for 𝒜 restricted to the tangent space at Ŝ.
Assume throughout that α ≤ 1/4 − ε and we are on
the high-probability event on which the conclusions of Theorems~A and~B
hold (in particular, 𝒢 is
well-conditioned on its range and the polishing bounds are valid
uniformly over bounded-spectrum inputs). Then we obtain an explicit
polynomial-time procedure which, given the measurements (xi)i ≤ n,
outputs a feasible matrix S̃ ≽ 0 satisfying 𝒜(S̃) = 1. The
procedure is modular:
(i) compute any bounded-spectrum approximately feasible S (i.e. a witness for EFPδ, M for
sufficiently small δ and some
M), and
(ii) apply the polishing map Polish(S), implemented via iterative
linear algebra.
We emphasize that the existence part (Step (i)) is probabilistic (Theorem~A), while Step (ii) is an explicit correction with deterministic correctness once the linear solve is performed; randomness enters Step (ii) only through the conditioning and norm-control needed to preserve PSD without excessive damping.
There are several standard convex formulations that can be used to
obtain an S with 0 ≼ S ≼ MId
and $\|\mathcal{A}(S)-\mathbf{1}\|_2\le
\delta\sqrt{n/d}$. One convenient choice is the
feasibility-regularized SDP
$$
\text{find}\ S\in\mathbb{S}^d
\quad\text{such that}\quad
0\preceq S\preceq M I_d,
\qquad
\|\mathcal{A}(S)-\mathbf{1}\|_2\le \delta\sqrt{n/d}.
$$
Below threshold, Theorem~A guarantees feasibility for suitable
parameters (with M possibly
enlarged to a data-dependent M′ controlled in
probability), and any feasible output of this program is admissible input to
polishing. Alternatively, one may replace the hard residual constraint
by a penalty and run projected first-order methods:
$$
\min_{0\preceq S\preceq M I_d}\ \tfrac12\,\|\mathcal{A}(S)-\mathbf{1}\|_2^2,
$$
which admits projected gradient or accelerated variants. Projection onto
{0 ≼ S ≼ MId}
is effected by an eigenvalue decomposition and clipping eigenvalues to
[0, M], which is
polynomial-time in d and
numerically stable. The particular choice of algorithm in Step (i) is
inessential for our analysis: we only require a bounded-spectrum
approximate witness at residual level $\delta\sqrt{n/d}$, and below threshold such
witnesses exist with high probability.
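To fix ideas, a minimal NumPy sketch of Step (i) along these lines is given below; it assumes the sampling convention xi ∼ N(0, Id/d) with X = [x1 ⋯ xn] stored columnwise, and the step size and iteration count are illustrative choices rather than the tuned parameters of any theorem.

```python
import numpy as np

def A_op(S, X):
    """Measurement map: A(S)_i = x_i^T S x_i, where X = [x_1 ... x_n] has shape (d, n)."""
    return np.einsum("ij,jk,ki->i", X.T, S, X)

def A_adj(v, X):
    """Adjoint map: A^*(v) = sum_j v_j x_j x_j^T = X diag(v) X^T."""
    return (X * v) @ X.T

def project_box_psd(S, M):
    """Projection onto {0 <= S <= M I_d}: symmetrize, then clip eigenvalues to [0, M]."""
    w, U = np.linalg.eigh((S + S.T) / 2.0)
    return (U * np.clip(w, 0.0, M)) @ U.T

def approximate_witness(X, M, n_iters=500, step=None):
    """Projected gradient on 0.5 * ||A(S) - 1||_2^2 over {0 <= S <= M I_d} (Step (i) sketch)."""
    d, n = X.shape
    ones = np.ones(n)
    S = np.eye(d)                  # natural start: A(I_d)_i = ||x_i||^2 is close to 1
    if step is None:
        step = 1.0 / n             # conservative, illustrative step size
    for _ in range(n_iters):
        r = A_op(S, X) - ones      # residual r(S) = A(S) - 1
        S = project_box_psd(S - step * A_adj(r, X), M)   # gradient step, then projection
    return S
```

The projection is exactly the eigenvalue-clipping step described above; any other solver producing a bounded-spectrum witness with residual at the $\delta\sqrt{n/d}$ scale can be substituted.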
Given an input S from Step
(i), define the residual r = r(S) = 𝒜(S) − 1 ∈ ℝn.
The polishing map is
$$
\widetilde{S}\ =\ \mathrm{Polish}(S)\ :=\ S\ -\ t\,\Delta,
\qquad
\Delta\ :=\ \mathcal{A}^*(v),
\qquad
v\ :=\ \mathcal{G}^\dagger\, r,
$$
with t ∈ (0, 1] chosen to
ensure S̃ ≽ 0 (e.g. t = min {1, λmin(S)/(2∥Δ∥op)}
as in Theorem~B). The defining identity 𝒜(Δ) = 𝒜𝒜*(v) = 𝒢v
implies that if v solves 𝒢v = r exactly and t = 1, then 𝒜(S̃) = 1 exactly.
In practice we either (a) solve 𝒢v = r to sufficient
accuracy and, if needed, apply polishing iteratively, or (b) solve a
slightly regularized system (𝒢 + ρI)v = r
and correct the resulting small bias by a final refinement; both
variants remain polynomial-time and stable under the conditioning bounds
from Lemma~1.
The Gram matrix 𝒢 ∈ ℝn × n is
dense and prohibitive to form explicitly when n = αd2
is large. However, iterative solvers (conjugate gradient for symmetric
PSD systems, MINRES, or LSQR) require only matrix–vector products v ↦ 𝒢v, which can be
computed implicitly using the factorization 𝒢 = 𝒜𝒜*. Concretely, for a given
v ∈ ℝn we
compute
$$
W\ :=\ \mathcal{A}^*(v)\ =\ \sum_{j=1}^n v_j x_j
x_j^\top\in\mathbb{S}^d,
\qquad
(\mathcal{G}v)_i\ =\ (\mathcal{A}(W))_i\ =\ x_i^\top W x_i,\ \ i\le n.
$$
Thus one Gram product reduces to one adjoint accumulation (a weighted
sum of rank-one matrices) followed by evaluation of n quadratic forms. If we represent
X = [x1 ⋯ xn] ∈ ℝd × n,
then W = Xdiag (v)X⊤,
and 𝒢v = diag (X⊤WX).
This can be implemented via dense linear algebra in time polynomial in
(d, n), without
storing 𝒢.
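A sketch of this implicit product, wired into SciPy's conjugate gradient through a LinearOperator, is shown below (helper names are ours; X is the d × n matrix of sample columns as above).

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def gram_matvec(v, X):
    """Implicit Gram product: (G v)_i = x_i^T (sum_j v_j x_j x_j^T) x_i, without forming G."""
    W = (X * v) @ X.T                                # W = A^*(v) = X diag(v) X^T,  cost O(n d^2)
    return np.einsum("ij,jk,ki->i", X.T, W, X)       # (A(W))_i = x_i^T W x_i,      cost O(n d^2)

def solve_gram(r, X, maxiter=None):
    """Solve G v = r by conjugate gradient, using only implicit G-vector products.

    Started from v = 0 with r (numerically) in range(G), CG returns the minimum-norm solution."""
    n = X.shape[1]
    G = LinearOperator((n, n), matvec=lambda v: gram_matvec(v, X), dtype=float)
    v, info = cg(G, r, maxiter=maxiter)
    if info != 0:
        raise RuntimeError(f"conjugate gradient did not converge (info={info})")
    return v
```

Each matrix–vector product costs O(nd2) = O(d4) as in the complexity accounting below, and no n × n matrix is ever stored.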
We record a crude but explicit complexity upper bound sufficient for the theoretical claim. Since n = Θ(d2), forming $M=\sum_{j=1}^n v_j x_j x_j^\top$ naively costs O(nd2) = O(d4) arithmetic operations, and computing all quadratic forms xi⊤Mxi also costs O(nd2) = O(d4). Hence a single implicit 𝒢–vector product costs O(d4) in the dense model. Conjugate gradient requires $O(\sqrt{\kappa}\log(1/\eta))$ such products to reach relative residual η, where κ is the condition number of 𝒢 on its range. By Lemma~1, below threshold we have κ = Oε, K(1) with high probability after scaling, so the iteration count is Oε, K(log (1/η)). Consequently, the total polishing cost is Oε, K(d4log (1/η)), polynomial in the input size. Step (i) is also polynomial-time, whether via interior-point methods for SDPs (with worst-case polynomial complexity) or via first-order methods whose per-iteration cost is dominated by repeated applications of 𝒜 and 𝒜* (each of which can be implemented in O(nd2) = O(d4) time in the dense model).
This accounting is intentionally conservative; it is meant to certify polynomial-time solvability rather than to optimize constants. Any structural speedups (e.g. exploiting sparsity in xi, block structure, or hardware-accelerated matrix multiplication) improve the practical runtime but are not needed for the theoretical corollary.
Two numerical issues are central: conditioning of the linear solve in~, and preservation of PSD in the update S̃ = S − tΔ. The first is governed by Lemma~1: when α ≤ 1/4 − ε, the nonzero spectrum of 𝒢 lies in [c1d, c2d] with high probability, so the effective conditioning is bounded by c2/c1, independent of d. In particular, iterative solvers do not suffer dimension-driven slowdown. Moreover, since 𝒢 is PSD, conjugate gradient is backward stable in the usual sense: the computed iterate v(k) solves a nearby system (𝒢 + E)v = r with ∥E∥ controlled by roundoff, provided the matrix–vector products are computed stably.
For the second issue, we either compute t from spectral estimates of λmin(S) and ∥Δ∥op, or we adopt a conservative line-search. Since S is PSD and bounded by construction, λmin(S) can be approximated by a few iterations of Lanczos (polynomial-time), and ∥Δ∥op can be estimated similarly by power iteration applied to Δ. In many regimes, Theorem~B yields that ∥Δ∥op is sufficiently small that t = 1 is admissible, and the damping is only a safeguard near the boundary of the PSD cone. Exact feasibility is obtained precisely when t = 1; if damping forces t < 1, exactness can be restored by either (a) applying the full correction in multiple damped steps while updating the residual, or (b) performing one final undamped correction after shifting S away from the PSD boundary (e.g. replacing S by (1 − γ)S + γτId for small γ, τ > 0), both of which keep the spectrum controlled under the operator-norm bounds from Theorem~B.
Because 𝒢 may have a nontrivial
kernel (indeed rank (𝒢) ≤ d(d + 1)/2), it
is natural to interpret v = 𝒢†r as the
minimum-ℓ2 solution
of minv∥𝒢v − r∥2.
Conjugate gradient applied to the normal equations or to 𝒢v = r with initialization
v(0) = 0 produces
the minimum-norm solution whenever r ∈ range (𝒢). In floating-point
arithmetic or when r is only
approximately in the range (as can occur if S is obtained from inexact
optimization in Step (i)), it is often preferable to solve the
regularized system
(𝒢 + ρI)v = r
for a small ridge ρ > 0,
and then set Δ = 𝒜*(v). This
yields a correction direction that is well-defined and smooth in r, and the resulting residual
becomes (I − 𝒢(𝒢 + ρI)−1)r,
which can be made arbitrarily small by taking ρ small and iterating. The
operator-norm control from Lemma~2 continues to apply to Δ = 𝒜*(v), with
constants that degrade continuously with ρ.
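Combining these ingredients, a sketch of one (optionally ridge-regularized and damped) polishing step is given below; it assumes gram_matvec and solve_gram from the previous sketch are in scope, and it estimates λmin(S) and ∥Δ∥op with dense eigensolvers purely for clarity (Lanczos and power iteration would be used at scale).

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def polish(S, X, ridge=0.0, damped=True):
    """One polishing step: S_tilde = S - t * A^*(v), where v solves (G + ridge * I) v = r(S).

    Assumes gram_matvec / solve_gram from the previous sketch are available."""
    n = X.shape[1]
    r = np.einsum("ij,jk,ki->i", X.T, S, X) - np.ones(n)     # residual r(S) = A(S) - 1
    if ridge > 0.0:
        G = LinearOperator((n, n), matvec=lambda v: gram_matvec(v, X) + ridge * v, dtype=float)
        v, _ = cg(G, r)                                      # ridge-regularized solve
    else:
        v = solve_gram(r, X)                                 # minimum-norm solution G^+ r
    Delta = (X * v) @ X.T                                    # Delta = A^*(v)
    t = 1.0
    if damped:
        lam_min = float(np.linalg.eigvalsh(S)[0])            # lambda_min(S)
        delta_op = float(np.linalg.norm(Delta, 2))           # ||Delta||_op (spectral norm)
        if delta_op > 0.0:
            t = min(1.0, lam_min / (2.0 * delta_op))         # damping rule as in Theorem B
    return S - t * Delta, t
```

With ridge = 0 and t = 1 the output satisfies 𝒜(S̃) = 1 up to the accuracy of the linear solve; with ridge > 0 the residual (I − 𝒢(𝒢 + ρI)−1)r remains and can be removed by iterating the step.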
Below the threshold α ≤ 1/4 − ε, the combination of (a) existence of bounded-spectrum approximate fits, (b) uniform conditioning of the Gram operator, and (c) the explicit correction formula Δ = 𝒜*(𝒢†r) yields an end-to-end polynomial-time route to exact feasibility in the PSD cone. The solver is explicit, relies only on applications of 𝒜 and 𝒜* (hence can be implemented without forming 𝒢), and admits standard iterative linear-algebra primitives whose stability is supported by the same probabilistic estimates used in the analysis of Theorem~B.
We briefly record several instantiations of the general framework, and indicate directions in which the arguments admit reasonably direct modifications. The emphasis here is on clarifying what is already contained in the preceding theorems (often implicitly), rather than on introducing new technical statements.
In the Gaussian case xi ∼ N(0, Id/d),
the ensemble is rotationally invariant and the Gram entries
satisfy
$$
\mathcal{G}_{ij}=(x_i^\top x_j)^2,\qquad
\mathbb{E}[\mathcal{G}_{ii}]=\mathbb{E}\|x_i\|_2^4=1+O(1/d),\qquad
\mathbb{E}[\mathcal{G}_{ij}]=\mathbb{E}(x_i^\top x_j)^2=\frac{1}{d}\ \
(i\neq j).
$$
Theorem~A specializes to the bounded-spectrum approximate feasibility
transition at α = 1/4
established in prior work on the Gaussian model, but with two changes in formulation that are useful for
downstream algorithmics: (a) we measure approximate feasibility in an
ℓ2 residual $\|r(S)\|_2\le \delta\sqrt{n/d}$ rather than
coordinatewise tolerances, and (b) we keep track of a bounded-spectrum
witness 0 ≼ S ≼ MId
as the basic regularity regime. Once such an approximate witness exists,
Theorem~B yields an explicit correction Δ = 𝒜*(𝒢†r(S))
that enforces 𝒜(S − Δ) = 1
at the level of linear algebra, and then controls ∥Δ∥op in terms of ∥r(S)∥2 so that
PSD is preserved (possibly after damping). Thus, in the Gaussian
setting, the known approximate-existence result combines with our polishing
mechanism to give a route from bounded approximate feasibility to
feasibility, conditional only on the availability of an approximate
witness with sufficiently small residual.
The polynomial-time correction map is essentially the
minimum-Frobenius-norm update within range (𝒜*) that removes the
residual. Concretely, for any input S and residual r = r(S), the
choice v = 𝒢†r
satisfies 𝒢v = Πrange (𝒢)r,
hence
𝒜(S − 𝒜*(v)) = 1 + r − 𝒢v = 1 + (I − Πrange (𝒢))r.
Therefore exact feasibility is immediate whenever r ∈ range (𝒢), which holds in
particular if the residual is computed from an exactly representable
vector in range (𝒜) (or if Step (i) is
solved to sufficient accuracy so that the residual is numerically
in-range). Below threshold, Lemma~1 yields uniform conditioning on range (𝒢) and Lemma~2 yields ∥𝒜*(v)∥op
control in terms of ∥v∥2, so the correction
is stable. In the Gaussian case one can often argue that t = 1 is admissible with high
probability for a broad class of approximate witnesses, because typical
bounded-spectrum S have λmin(S) bounded
away from 0 in many constructions and
∥Δ∥op concentrates
at the scale $O(\|r\|_2/\sqrt{d})$; we
do not need such refinements for the general theorem, but they help
interpret polishing as a one-shot exactification rather than as an
iterative safeguard.
Many formulations of ellipsoid fitting and related SDP relaxations
impose an additional linear normalization, such as Tr (S) = 1 (or Tr (S) = d depending on
scaling), or constrain a subset of diagonal entries. Such constraints
can be incorporated by augmenting the measurement operator. For example,
define
$$
\widetilde{\mathcal{A}}:\mathbb{S}^d\to\mathbb{R}^{n+1},\qquad
\widetilde{\mathcal{A}}(S):=\bigl(\mathcal{A}(S),\,
\operatorname{Tr}(S)\bigr),
$$
with target right-hand side (1, τ) for a
prescribed τ > 0. Then
$\widetilde{\mathcal{A}}^*(v,s)=\mathcal{A}^*(v)+s
I_d$ and the corresponding Gram operator becomes a block
matrix
$$
\widetilde{\mathcal{G}}=\widetilde{\mathcal{A}}\widetilde{\mathcal{A}}^*
=
\begin{pmatrix}
\mathcal{G} & \mathcal{A}(I_d)\\
\mathcal{A}(I_d)^\top & \|I_d\|_F^2
\end{pmatrix}
=
\begin{pmatrix}
\mathcal{G} & (\|x_i\|_2^2)_{i\le n}\\
(\|x_i\|_2^2)_{i\le n}^\top & d
\end{pmatrix}.
$$
Below threshold, the same strategy applies provided we have a uniform
conditioning bound for $\widetilde{\mathcal{G}}$ on its range, and
an operator-norm control for $\widetilde{\mathcal{A}}^*(v,s)$ analogous to
Lemma~2. Since Id is
deterministic and ∥xi∥22
concentrates around 1, these
modifications are typically benign: one can view the extra constraint as
adding one more well-behaved direction to the linear system, and the
polishing map becomes $S\mapsto
S-\widetilde{\mathcal{A}}^*(\widetilde{\mathcal{G}}^\dagger
\widetilde{r})$ with $\widetilde{r}=\widetilde{\mathcal{A}}(S)-(\mathbf{1},\tau)$.
Similar remarks apply to multiple linear side constraints (e.g. several
trace-like moments), at the cost of increasing the Gram dimension by a
constant factor.
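To illustrate the augmentation, here is a sketch (names ours) of the implicit matrix–vector product for the block Gram operator $\widetilde{\mathcal{G}}$ above, again using only X and never forming the operator explicitly.

```python
import numpy as np

def aug_adjoint(v, s, X):
    """Adjoint of the augmented map: A~^*(v, s) = A^*(v) + s * I_d."""
    d = X.shape[0]
    return (X * v) @ X.T + s * np.eye(d)

def aug_gram_matvec(vs, X):
    """Implicit product of the block Gram operator G~ with (v, s) in R^{n+1}.

    Top block:    (G v)_i + s * ||x_i||^2
    Bottom entry: sum_j v_j ||x_j||^2 + s * d
    both obtained as A~(A~^*(v, s))."""
    d, n = X.shape
    v, s = vs[:n], vs[n]
    W = aug_adjoint(v, s, X)                           # W = A~^*(v, s)
    top = np.einsum("ij,jk,ki->i", X.T, W, X)          # (A(W))_i = x_i^T W x_i
    bottom = np.trace(W)                               # <I_d, W>_F = Tr(W)
    return np.concatenate([top, np.array([bottom])])
```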
The core probabilistic inputs in Theorems~A and~B are (i) a universality statement controlling the approximate-feasibility volume/width transition, and (ii) a uniform invertibility statement for 𝒢 on its range together with an 𝒜* operator-norm bound. These inputs are not intrinsically tied to the full PSD cone 𝕊+d; rather, they depend on how the affine slice intersects a convex cone in a space of dimension Θ(d2). If one replaces 𝕊+d by a cone 𝒦 ⊆ 𝕊d that is invariant under a group action S ↦ USU⊤ for U in a subgroup G ≤ O(d) (e.g. block-diagonal PSD cones, G-equivariant matrix algebras, or cones defined by commuting constraints), then one expects the threshold to depend on the number of degrees of freedom (statistical dimension or Gaussian width of 𝒦 ∩ {∥S∥F = 1}), rather than on d(d + 1)/2. At a formal level, if the feasible set is restricted to a G-invariant linear subspace ℒ ⊆ 𝕊d of dimension m ≪ d2, then 𝒜 is replaced by 𝒜 ∘ Πℒ, and the natural scaling becomes n = αm; correspondingly, one anticipates a threshold at α = 1/4 with d2 replaced by m, provided the design remains isotropic on ℒ in the sense that 𝔼⟨X, H⟩F2 is proportional to ∥H∥F2 for H ∈ ℒ. Establishing such variants would require re-running the universality and conditioning arguments with m as the ambient dimension, but the polishing mechanism itself is unchanged once the relevant Gram operator is well-conditioned on its range.
The bounded-spectrum hypothesis 0 ≼ S ≼ MId
is a convenient way to exclude pathologies near the PSD boundary and to
obtain uniformity in Lemma~2. In applications, the same effect can be
achieved by convex regularization rather than explicit spectral
truncation. For instance, one can compute an approximate witness
by
$$
\min_{S\succeq 0}\
\frac12\|\mathcal{A}(S)-\mathbf{1}\|_2^2+\lambda\,\operatorname{Tr}(S)+\mu\,\|S\|_F^2,
$$
with μ > 0 enforcing strict
convexity and spectral control, and then polish the output. The point is
not that this particular objective is canonical, but that many
regularizers imply bounds of the form ∥S∥op ≤ M
(possibly with M depending on
λ, μ and the data)
and yield small residuals. Once such a bounded approximate witness is
available, polishing provides an explicit final certificate S̃ ≽ 0 with exact constraints. This
separation of concerns—first enforce ``well-behavedness’’ via
regularization, then certify exact feasibility by linear correction—is
robust across algorithmic choices.
The measurements 𝒜(S)i = xi⊤Sxi
can be interpreted as evaluating a quadratic model fS(x) = x⊤Sx
on the training inputs x1, …, xn.
Exact feasibility 𝒜(S) = 1 is then
interpolation of the constant label 1
by a PSD quadratic form. The matrix 𝒢
is precisely the (degree-2 homogeneous)
polynomial kernel Gram matrix on the training set: 𝒢ij = k(xi, xj)
with k(u, v) = (u⊤v)2.
Thus the linear system 𝒢v = r appearing in
polishing is the normal-equation system for least-squares regression in
the quadratic feature space, and the conditioning statement in Lemma~1
may be read as a statement about the stability of kernel regression
below the n ≍ d2 scaling,
at least on the range determined by the feature map.
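This identification is easy to check numerically; the following sanity check (under the xi ∼ N(0, Id/d) convention, with illustrative dimensions) verifies that the implicit Gram product agrees with the degree-2 polynomial kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 30, 200
X = rng.standard_normal((d, n)) / np.sqrt(d)       # columns x_i ~ N(0, I_d / d)

K = (X.T @ X) ** 2                                 # quadratic kernel Gram: K_ij = (x_i^T x_j)^2

v = rng.standard_normal(n)
W = (X * v) @ X.T                                  # A^*(v)
Gv = np.einsum("ij,jk,ki->i", X.T, W, X)           # (A A^*(v))_i = x_i^T W x_i

assert np.allclose(Gv, K @ v)                      # G = A A^* coincides with the kernel Gram matrix
```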
Moreover, the nuclear norm ∥S∥* = Tr (S)
for PSD S corresponds to the
natural RKHS norm for quadratic forms under our scaling, so Theorem~C
can be interpreted as saying that, below threshold, the minimum-norm
interpolant in the quadratic feature class is automatically PSD and has
controlled operator norm with high probability. This is consistent with
the heuristic that, in an underconstrained regime, minimum-norm
interpolation selects a ``benign’’ solution. While our results do not
directly yield generalization bounds (which would require a separate
distributional model for test points and labels), they provide a
structural guarantee: below α = 1/4 − ε, interpolation
of the constant target can be realized by a bounded PSD quadratic model,
and the canonical convex regularizer does not leave the PSD cone.
From a kernel perspective, existence of S ≽ 0 with 𝒜(S) = 1 is a representability statement: the constant vector 1 lies in the range of the feature map induced by S ↦ (xi⊤Sxi)i ≤ n restricted to the PSD cone. Polishing adds a certification statement: given any approximate representer with bounded spectrum, one can algorithmically convert it into an exact representer by solving a Gram linear system and applying a controlled correction in parameter space. In particular, one does not need to solve the conic feasibility problem to high accuracy ab initio; it suffices to enter the basin of small residual and bounded spectrum. This perspective is compatible with numerical practice in random-feature models, where one frequently observes that moderate-accuracy optimization combined with an explicit linear-algebraic refinement yields exact constraint satisfaction, provided the relevant Gram operator is not ill-conditioned.
We conclude by isolating several problems that appear to be the natural next steps if one aims to turn the present picture into a sharp and broadly applicable theory. The common theme is that our arguments are presently separated into (i) existence of an approximate witness (bounded spectrum and small residual) and (ii) a deterministic correction that certifies exact feasibility once such a witness is available. It is plausible that both components can be strengthened: the first toward the sharp boundary α = 1/4 and toward weaker regularity hypotheses, and the second toward greater robustness under model misspecification and dependence. We state these directions in a way that makes clear which parts are geometric (thresholds, widths, statistical dimension) and which are analytic (conditioning of Gram operators, stability of inverse maps, control of operator norms).
Our polishing theorem is stated with a fixed gap α ≤ 1/4 − ε, and this gap is used in two places: to obtain a uniform lower bound on the nonzero singular values of 𝒢 on its range, and to ensure that the 𝒜* operator-norm bound can be applied uniformly over the relevant vectors v = 𝒢†r. It is natural to ask whether one can let ε = εd ↓ 0 and still certify exact feasibility for residuals of size δ = δd that may also shrink with d. A minimal quantitative goal would be a statement of the form: if α ≤ 1/4 − cd−β for some β > 0 then, with high probability, σmin(𝒢|range) ≳ d ⋅ εd (or another explicit scale) and polishing remains stable for residuals ∥r∥2 of order at most d1/2εd (up to constants). Achieving such ``near-critical’’ control appears to require a refined spectral analysis of 𝒢 beyond the coarse conditioning bounds used here, perhaps in the spirit of local laws for kernel Gram matrices. Even in the Gaussian case, the matrix 𝒢 is neither a Wigner matrix nor a standard sample covariance; its entries are quadratic functions of inner products, and the near-critical regime likely exhibits a nontrivial edge behavior. A second, closely related problem is to understand whether the damping step can be made essentially unnecessary at the boundary, i.e. whether typical approximate witnesses in the feasible regime have λmin(S) bounded below at the scale on which ∥Δ∥op fluctuates. This would turn polishing into a genuinely one-shot exactification with t = 1 for broad classes of initializations.
The current stability mechanism bounds ∥Δ∥op via ∥𝒜*(v)∥op
together with ∥v∥2 ≤ ∥𝒢†∥op∥r∥2.
In near-critical regimes this can be pessimistic because ∥𝒢†∥op is governed by
the smallest nonzero singular value, while the residual may concentrate
in more stable directions. A plausible improvement is an estimate of the
form
∥𝒜*(𝒢†r)∥op ≤ Φ(r)
where Φ is controlled by the
spectral measure of 𝒢 and the alignment
of r with near-null
directions. Such a bound would amount to a refined norm control on the
map 𝒜*𝒢†
restricted to residuals that arise from bounded-spectrum inputs.
Establishing it may require coupling the geometry of the feasible slice
(which constrains what residuals are attainable by bounded-spectrum
matrices) with spectral information about 𝒢. This is a concrete place where one might
hope to exploit the specific structure of residuals r(S) = 𝒜(S) − 1
rather than bounding over all r ∈ ℝn.
A more conceptual open question is whether the bounded-spectrum hypothesis is genuinely needed for exact feasibility below threshold, or merely a technical device that simplifies uniformity. One extreme conjecture would be: for α < 1/4, with high probability there exists an exact feasible solution S⋆ ≽ 0 with ∥S⋆∥op = O(1) (or at least Tr (S⋆) = O(d)), without imposing any a priori spectral cap. On the algorithmic side, one would like a statement of the form: if one finds approximate feasible S ≽ 0 with sufficiently small residual (but possibly large ∥S∥op), then there is a controlled way to ``regularize and then polish’’ without first projecting onto {S ≼ MId}, which can be numerically delicate when S is ill-conditioned. One route is to prove that approximate feasibility itself enforces spectral control: for instance, show that if 𝒜(S) ≈ 1 and S ≽ 0 then necessarily Tr (S) ≈ d and ∥S∥op cannot be too large, with high probability over the design. Such an implication would be a form of restricted injectivity of the map S ↦ 𝒜(S) on rays of the PSD cone, and would reduce bounded-spectrum feasibility to plain feasibility plus residual control. Another route is to incorporate an interior-point style barrier (e.g. −log det (S + ηI)) to ensure strict positivity and then to quantify how the barrier parameter η can be sent to 0 while maintaining a stable polishing step.
Our universality and conditioning inputs are formulated for i.i.d. isotropic subgaussian xi. In many applications the xi are neither independent nor identically distributed: they may be drawn without replacement from a finite set, they may form a Markov chain, or they may be generated by a random feature pipeline with shared randomness across samples. A first extension is to allow heterogeneous covariances 𝔼[xixi⊤] = Σi/d with Σi close to Id, and ask whether the same 1/4 boundary persists after an appropriate prewhitening. Here one expects the threshold to depend on the effective condition number of the average covariance and on the uniformity of tails, and the key technical point becomes a stability analysis of 𝒢 under inhomogeneous kernel entries (xi⊤xj)2. A second extension is dependence across i. Even mild dependence can break the simple concentration tools used to control σmin(𝒢|range), so one would need either a martingale/mixing replacement for the Gram conditioning lemma, or a leave-one-out/self-consistent analysis that controls the resolvent of 𝒢. We emphasize that polishing itself is deterministic; what is needed is a probabilistic guarantee that the linear system defining v is solvable stably in the regime of interest. In structured settings, it may be more natural to certify stability by bounding the spectrum of a Gram operator plus a perturbation term that is controlled in operator norm, rather than appealing to i.i.d. concentration.
Another direction is robustness when the right-hand side is not exactly 1 or when the constraints are noisy. In such settings exact feasibility may be impossible, but one can still ask for a certified near-feasible PSD solution with an explicit residual bound and controlled spectrum. The polishing map can be adapted by solving a Tikhonov-regularized system (𝒢 + γI)v = r and setting S̃ = S − 𝒜*(v), which is the minimum-norm correction in a smoothed metric. An open problem is to develop a theorem stating that, below α = 1/4, this regularized polishing is optimally stable (in a minimax sense) over a natural noise model, with γ chosen at the scale of the smallest nonzero singular value of 𝒢. Such a result would connect the feasibility phase transition to the stability phase transition for kernel ridge regression with the quadratic kernel, and would clarify what aspects of our proofs are genuinely about feasibility versus about conditioning.
The pattern ``find a regularized approximate solution, then certify or exactify via an explicit linear-algebraic correction’’ appears in several problems that are adjacent to our setting. Two examples that seem particularly close in spirit are synchronization-type SDPs (where one optimizes over a PSD matrix with structured linear constraints encoding group consistency) and relaxations based on Kikuchi or cluster-variational matrices (where PSD constraints are imposed on collections of local marginals). In these problems one often observes empirically that first-order methods find points that nearly satisfy the linear constraints but are not exactly feasible, and that enforcing exact constraints via projection can destroy PSD structure. Our polishing mechanism suggests an alternative: if the constraint operator has a well-conditioned Gram map on its range, then one can correct in the range of the adjoint while quantifying how much the PSD cone is perturbed. Formalizing this analogy requires identifying, in each problem, the analogue of 𝒢 = 𝒜𝒜* and proving a below-threshold (or below-density) conditioning statement. For synchronization landscapes, one expects the relevant effective dimension to be the number of degrees of freedom in the lifted variable (often Θ(d2) per node, with additional structure), and the analogue of α = 1/4 might manifest as a sampling density threshold for stable constraint enforcement. For Kikuchi-type matrices, the local-consistency operator typically has a sparse incidence structure; here the main challenge is that 𝒢 is no longer a dense kernel Gram matrix, and conditioning must be analyzed via graph expansion or spectral properties of the constraint graph rather than via i.i.d. concentration.
Even in the present i.i.d. model, the practical implementation of polishing depends on solving 𝒢v = r efficiently. While conjugate gradients with matrix-vector products by 𝒢 is polynomial-time and often effective, the near-critical regime will be sensitive to preconditioning. An open direction is to design preconditioners exploiting the structure 𝒢ij = (xi⊤xj)2, for example by decomposing 𝒢 into a low-rank mean component plus a fluctuation term, or by using randomized sketches that preserve the range. On the theoretical side, it would be useful to prove that a small number of iterative polishing steps (possibly with re-damping) converges to an exact feasible PSD matrix under weaker conditions than a single step, mirroring Newton refinement in nonlinear programming. Such a result would shift the burden from requiring strong one-shot bounds on ∥Δ∥op to requiring only that successive corrections remain controlled on average.
We view the following as the most concrete 2026-level targets suggested by the present work: (i) a near-critical conditioning theorem for 𝒢 that yields explicit rates as α ↑ 1/4; (ii) a self-regularization result showing that approximate feasibility forces bounded trace and moderate operator norm for PSD solutions, thereby eliminating explicit spectral caps; (iii) an extension of the universality and polishing guarantees to non-i.i.d. or weakly dependent designs via resolvent/leave-one-out techniques; and (iv) a transfer principle that packages ``regularize then certify’’ as a reusable lemma for other SDP landscapes where one can prove an appropriate Gram conditioning statement. Each of these targets is, in our view, technically plausible and would substantially strengthen the conceptual message: below a universal sampling boundary, feasibility is not only an existential geometric property but also an algorithmically certifiable one.