The prevailing methodology for understanding capability growth in large language models has been to index models along a single ``scale'' axis and to study how task performance changes as a function of that axis. In early work, the axis was frequently parameter count or training compute; in more recent work, it is more naturally a calibrated pretraining progress scalar, such as a normalized loss on a public calibration corpus. This one-dimensional view is increasingly misaligned with how frontier systems are produced and deployed in 2026. Modern systems are routinely altered by at least two additional and operationally significant controls: post-training strength (e.g., RLHF/RLAIF or other KL-constrained policy improvement) and inference-time compute (e.g., best-of-$n$, verifier-guided search, or repeated sampling with selection). When three such knobs can be adjusted, the notion of a task having a single threshold ``model size'' is at best an incomplete projection of a higher-dimensional reality.
We therefore take as primitive a task distribution 𝒯 equipped with a semantic verifier, and we
focus on the success probability p𝒯(L*, R, b)
of a model queried at coordinates (L*, R, b).
The central conceptual move is to treat emergence not as a
one-dimensional transition, but as a boundary in a space of controllable
resources. Given a baseline success probability p0(𝒯) and a margin ε > 0 defining what it means to
be ``meaningfully above baseline,’’ the emergent region is the set
{(L*, R, b) : p𝒯(L*, R, b) ≥ p0(𝒯) + ε}.
Its boundary is a surface (in transformed coordinates (L*, R, log b)),
and the empirical question becomes: how does this surface tilt, how
stable is it across tasks, and what tradeoffs between pretraining,
post-training, and inference-time compute does it induce?
This reframing is motivated by a set of recurring empirical phenomena. First, the rise of long-run and deliberative inference procedures makes b a first-class quantity: a fixed base model can exhibit large changes in verified success when allowed to sample and select, to use self-consistency, or to invoke external verifiers/search. Second, the post-training pipeline has become more heterogeneous: RL for instruction following, preference optimization, safety fine-tuning, and RL for reasoning can change behavior in ways not well captured by L* alone, and can do so with a strength that is quantifiable in terms of validated reward improvement or constrained update magnitude. Third, ``emergence’’ itself has been contested: discontinuities in performance curves can be produced by metric choice, coarse evaluation, or the interaction of difficulty distributions with a smoothly improving underlying model. A geometric perspective clarifies what is being claimed. If performance is smooth in each controllable coordinate, then apparent thresholds along a single axis can arise as intersections of a smooth surface with a one-dimensional path through resource space; conversely, if a task truly exhibits a sharp transition, that claim should be robust to reasonable reparameterizations and to the presence of auxiliary axes.
Our technical stance is that, for many verifiable tasks, a
generalized linear margin model in (L*, R, log b)
provides a useful idealization: the logit of success is approximately
affine in these coordinates, with positive coefficients reflecting
monotonicity. Under this assumption, the emergent region is
(approximately) a half-space and the emergence boundary is
(approximately) a hyperplane. This is not merely a convenient
visualization; it supplies testable predictions. In particular,
iso-capability curves satisfy linear tradeoffs
α𝒯 dL* + β𝒯 dR + γ𝒯 dlog b = 0,
so that increases in pretraining progress can be exchanged for
multiplicative reductions in inference-time compute, or for reductions
in post-training strength, at a rate determined by the task-specific
coefficients. Such relations are directly falsifiable by controlled
experiments that vary one axis while compensating with another to
maintain a fixed success level.
A second motivation is methodological: if emergence is a surface, then evaluating it by exhaustive gridding is prohibitively expensive. We therefore prioritize adaptive evaluation protocols that treat the model–verifier system as a black box and that allocate samples near the boundary where information is maximized. The goal is to output an estimate $\widehat{\mathcal{S}}_{\mathcal{T}}$ of the boundary with a geometric accuracy target Δ (distance in the normal direction) and confidence 1 − δ, while minimizing verifier-labeled queries. This places the problem within the scope of active learning and sequential testing, but with structure: monotonicity in each coordinate and a low-dimensional parametric form for the logit. The resulting guarantees are of independent interest for practitioners who must decide which combination of training, post-training, and inference-time compute reaches a target reliability at minimal cost.
Our contributions are threefold. First, we formalize emergence as a set membership problem in (L*, R, log b) and show that, under the GLM margin model, the emergent region is exactly a half-space with an explicit boundary equation. Second, we provide an adaptive algorithm that estimates this boundary with sample complexity Õ(Δ−2log (1/δ)) using only black-box verifier outcomes (and optionally margin scores), and we extract interpretable tradeoff rates from the fitted normal vector. Third, we give a matching information-theoretic lower bound showing that no algorithm can uniformly beat the Ω(Δ−2log (1/δ)) dependence, thereby explaining why high-resolution evaluation near transition regions is unavoidable. Empirically, we outline how to instantiate R and b via reproducible post-training and inference controls, and how to validate the manifold picture by demonstrating predicted iso-capability tradeoffs on canonical verifiable emergence tasks.
Early empirical discussions of ``emergent abilities'' emphasized apparently discontinuous performance transitions as a function of a single scale proxy, often model size or training compute. The BIG-Bench line of work and its successors documented many tasks whose accuracy curves appear flat over a range of scales and then rise sharply \cite{wei2022emergent,bigbench}. Subsequent studies broadened the set of benchmarks and highlighted that emergence claims depend delicately on the choice of metric, the discretization of model sizes, and the evaluation noise. Our framework is compatible with the descriptive content of these results, but it changes the mathematical object of interest: we treat a task's ``threshold'' as a boundary in a controllable resource space rather than as a single scalar breakpoint.
A prominent counterpoint is the ``mirage'' or ``metric artifact'' perspective, which argues that qualitative ``phase transitions'' can be induced by nonlinear metrics (e.g., exact match), finite sample effects, and heterogeneous difficulty mixtures even when the underlying competence increases smoothly. Related analyses show
that coarse gridding of model sizes and overplotting of noisy estimates
can visually exaggerate discontinuities. The geometric viewpoint we
adopt can be read as a structural refinement of this critique: if
success probability is smooth and monotone in multiple coordinates, then
a sharp transition observed along one path (e.g., scaling L* alone) can arise from
the intersection of that path with a tilted surface. Conversely, if one
asserts a genuine discontinuity, it should be robust under changes of
path and under inclusion of additional operational degrees of freedom
such as post-training and test-time compute.
Another related thread seeks more invariant progress measures than
parameter count. Several works relate downstream capability to
pretraining loss and propose that certain skills become available when
loss crosses task-dependent levels. Our use of a calibrated progress
scalar L* is in
this spirit: we assume L* is a loss-like
quantity intended to be comparable across training setups, and we treat
it as one coordinate among others. In particular, we separate the
question ``how far along pretraining are we?'' from ``how much post-training modification and inference-time computation are we willing to apply?'' This disentangling is motivated by the fact that modern
systems are not deployed at raw checkpoints, but rather after
substantial policy shaping and with nontrivial inference procedures.
Inference-time scaling laws and deliberative decoding methods
motivate an explicit inference compute axis. A large empirical
literature studies best-of-n,
self-consistency, rejection sampling, verifier-guided search, tool-use
policies, and other procedures where performance improves with
additional sampling or search. Some works model the gain as a function
of the number of samples, selection temperature, or search depth,
sometimes yielding approximate power laws in compute or log-compute.
These methods effectively modify the mapping from a base conditional
distribution to an induced decision rule, and the resulting success
probability is naturally nondecreasing in an inference budget parameter.
Our formalism incorporates such effects through the coordinate b (typically via log b), allowing comparisons between ``train more'' and ``search more'' in a single representation.
Post-training procedures such as RLHF and RLAIF, as well as more recent RL-for-reasoning and long-reasoning-model (LRM) pipelines, motivate a second non-pretraining axis. While preference optimization is often discussed qualitatively, several papers operationalize post-training strength through the number of RL steps, a KL constraint, or validated reward improvement. These controls can induce substantial changes in helpfulness, harmlessness, and instruction-following, and in some regimes can trade off against pretraining progress in achieving fixed benchmark performance. Our coordinate R is intended to capture this degree of freedom abstractly, emphasizing that the evaluation target is not a single trained model but a family of policies reachable from checkpoints under controlled post-training.
Methodologically, our emphasis on verifiable tasks and high-resolution evaluation aligns with work arguing that emergence claims require dense sampling near the transition region. For code and math tasks, pass@k-style metrics, repeated sampling with verification, and related ``until-success'' protocols expose how rapidly success probability can change as a function of sampling budget and how sensitive conclusions are to evaluation resolution. Difficulty slicing plays a similar role: by stratifying instances by estimated difficulty (e.g., length, required steps, or empirical success rates), one often observes smoother progress within slices even when the aggregate curve appears abrupt. Our setting assumes a task distribution equipped with a semantic verifier, precisely to make such repeated-sampling and stratified analyses principled and to separate semantic correctness from proxy metrics.
Finally, the adaptive evaluation problem we consider is related to active learning, sequential testing, and threshold estimation under monotonicity constraints. Classical work studies efficient identification of decision boundaries for generalized linear models and the design of queries that concentrate near regions of high information. Our contribution is to place these ideas in a capability-measurement context with explicit resource coordinates and to focus on estimating an emergence surface rather than a one-dimensional threshold. This differs from exhaustive gridding approaches common in scaling studies: by allocating samples preferentially near the boundary and enforcing coordinate-wise monotonicity, one can obtain statistically certified boundary estimates with substantially fewer verifier-labeled evaluations than uniform exploration, while still retaining interpretable tradeoff quantities between pretraining, post-training, and inference-time compute.
We fix a model family equipped with three operational degrees of
freedom: a pretraining checkpoint indexed by a scalar L*, a post-training
operator of adjustable strength R ≥ 0, and an inference-time
procedure parameterized by a compute budget b ∈ ℕ. The intended reading is that
L* measures ``how far along pretraining we are'' in a way that is comparable across training runs, R measures ``how much policy shaping we apply after pretraining'' (e.g., via RLHF/RLAIF or a KL-constrained policy improvement step), and b measures ``how much computation we
expend at test time’’ (e.g., number of samples in best-of-n, number of verifier-guided
expansions, or the number of calls to a tool or critic). We defer the
question of how to construct a useful L* to the following section; for the present
section, it suffices that L* indexes a finite set
of available checkpoints and that it admits an ordering consistent with
pretraining progress.
A task is specified as a distribution 𝒯
over instances x together with
a semantic verifier. Concretely, we assume access to a measurable
mapping
V𝒯 : (x, y) ↦ {0, 1},
where V𝒯(x, y) = 1
indicates semantic success (e.g., exact mathematical correctness,
passing unit tests, or satisfying an externally specified constraint).
When available, we also allow a real-valued margin score s𝒯(x, y) ∈ ℝ
whose sign agrees with the verifier decision, so that s𝒯(x, y) > 0 ⇒ V𝒯(x, y) = 1.
The margin is not logically necessary for our definitions, but it is
useful for variance reduction and for designing adaptive sampling rules
in later sections.
Given coordinates (L*, R, b),
we obtain an induced randomized decision rule as follows. The checkpoint
at L* is
transformed by the post-training operator at strength R into a policy (or conditional
distribution) that we denote abstractly by πL*, R(⋅ ∣ x).
The inference procedure with budget b induces a distribution over final
outputs, which we write as y ∼ Infer(πL*, R, x; b).
This abstraction is intended to cover both ``single-shot’’ decoding
(where b might correspond to a
fixed decoding strategy) and deliberative procedures (where b controls, e.g., the number of
candidates proposed and then filtered, or the depth of a search guided
by a learned or external verifier). We then define the task success
probability
p𝒯(L*, R, b) := Prx ∼ 𝒯, y ∼ Infer(πL*, R, x; b) [ V𝒯(x, y) = 1 ].
The central structural assumption used throughout is coordinate-wise
monotonicity: after transforming b to log b, the map (L*, R, log b) ↦ p𝒯(L*, R, b)
is nondecreasing in each coordinate. This assumption is empirical rather
than tautological, but it is operationally natural: additional
pretraining progress should not systematically harm semantic accuracy on
an in-distribution verifiable task; additional post-training strength
(as measured by a validated scalar R) should not reduce task success in
expectation; and additional inference budget should not decrease success
when the procedure is allowed to retain the best candidate according to
the verifier or an aligned proxy.
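As a concrete illustration of this definition, the following minimal sketch estimates p𝒯(L*, R, b) by Monte Carlo. The callables `sample_instance`, `infer`, and `verify` are hypothetical stand-ins for the task distribution 𝒯, the budgeted inference procedure Infer(πL*, R, ·; b), and the verifier V𝒯; nothing here is fixed by the formalism beyond the Bernoulli structure.

```python
import random

def estimate_success_prob(sample_instance, infer, verify,
                          L_star, R, b, n_trials=1000, seed=0):
    """Monte Carlo estimate of p_T(L*, R, b) from verifier outcomes.

    `sample_instance`, `infer`, and `verify` are hypothetical hooks for
    the task distribution T, the budgeted inference procedure
    Infer(pi_{L*,R}, x; b), and the semantic verifier V_T.
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_trials):
        x = sample_instance(rng)           # x ~ T
        y = infer(x, L_star, R, b, rng)    # y ~ Infer(pi_{L*,R}, x; b)
        successes += int(verify(x, y))     # V_T(x, y) in {0, 1}
    return successes / n_trials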
To define what it means for a capability to emerge, we must specify a
baseline. For each task 𝒯 we assume a
baseline success probability p0(𝒯), typically the
chance rate of a randomized strategy with the same output constraints
(e.g., 1/k for k-way multiple choice), or an
empirical baseline obtained by a simple non-language heuristic. We fix a
margin parameter ε > 0 and
declare that (L*, R, b) is emergent if
p𝒯(L*, R, b) ≥ p0(𝒯) + ε.
This definition encodes a minimal notion of above-baseline competence
while remaining agnostic to any particular benchmark metric: it depends
only on a semantic verifier and a probabilistic model of repeated
trials. The corresponding emergent region is the subset
ℰ𝒯(ε) := {(L*, R, b) : p𝒯(L*, R, b) ≥ p0(𝒯) + ε},
and the associated emergence surface is its boundary in the transformed coordinates z = (L*, R, log b),
i.e.,
𝒮𝒯(ε) := ∂{z : p𝒯(z) ≥ p0(𝒯) + ε}.
The geometric viewpoint is that an ``emergent ability’’ is not a single
breakpoint along one scaling axis, but rather a codimension-one boundary
in a space of controllable resources; one-dimensional emergence curves
are recovered by intersecting 𝒮𝒯(ε) with a path such as
(L*, 0, 1) or (L*, R̄, b̄).
Finally, we record desiderata for a useful capability manifold representation, since these constrain both modeling and measurement. First, the coordinates should be controllable, in the sense that we can implement controlled interventions in R and b and can select among discrete checkpoints in L* without changing unrelated aspects of the system. Second, the representation should be comparable across training setups, motivating a calibrated L* and a task-independent interpretation of b as actual inference compute rather than an idiosyncratic knob. Third, measurement should be statistically grounded: the definition of p𝒯 permits repeated sampling and uses V𝒯 to separate semantic correctness from proxy metrics, allowing confidence intervals and sequential tests. Fourth, the manifold should support tradeoff analysis: for fixed target success p̄ ∈ (p0(𝒯), 1) we seek interpretable rates such as ∂L*/∂log b and ∂R/∂log b along iso-success sets. Fifth, the representation should admit cost-aware optimization: given a cost model C(L*, R, b) combining training, post-training, and inference expenditures, we wish to identify compute-efficient points in ℰ𝒯(ε).
With these objects fixed, the remaining task is statistical: we must estimate 𝒮𝒯(ε) from black-box samples of V𝒯(x, y) at adaptively chosen (L*, R, b). The next section addresses the first coordinate, L*, since the interpretability of the entire manifold depends on a progress scalar whose meaning is stable under incidental changes to the training setup.
The coordinate L* is intended to capture ``how much pretraining we have’’ in a way that is (i) ordered, so that larger L* corresponds to later checkpoints, and (ii) comparable across training setups, so that a unit change in L* has a stable interpretation even when tokenization, data mixtures, or optimization details vary. Since our algorithms only require access to a finite set of checkpoints, we could take L* to be the checkpoint index. However, doing so would destroy comparability and would make tradeoff rates such as ∂L*/∂log b meaningless. We therefore define L* by evaluation on a fixed public calibration distribution.
Fix a publicly specified calibration corpus Dcal (a multiset of
sequences) together with a deterministic evaluation protocol
(tokenization, truncation length, masking conventions). For a pretrained
checkpoint θ we define the
empirical next-token cross-entropy
$$
\widehat{\ell}_{\mathrm{cal}}(\theta)
\;:=\;
\frac{1}{N}\sum_{(u_1,\dots,u_T)\in D_{\mathrm{cal}}}
\sum_{t=1}^{T}
-\log p_{\theta}(u_t \mid u_{<t}),
\qquad
N:=\sum_{(u_1,\dots,u_T)\in D_{\mathrm{cal}}} T.
$$
Since progress corresponds to decreasing loss, we choose a monotone
reparameterization that is increasing with capability, e.g.
L*(θ) := ℓref − ℓ̂cal(θ),
where ℓref is a
fixed constant (for instance, ℓ̂cal of a small reference
checkpoint) chosen only to keep L* nonnegative. Any
strictly increasing affine transform of L* is equivalent for our
geometric statements; it only rescales the coefficient α𝒯 and shifts η𝒯.
Direct cross-entropy depends on the tokenization and can change by an
additive constant when the vocabulary changes, even when the induced
distribution over strings is essentially the same. To reduce this
artifact, we may normalize by a fixed unigram baseline computed on the
same evaluation representation. Let quni be the empirical
unigram distribution over the evaluation alphabet (tokens, bytes, or
characters) estimated from Dcal, and define the
unigram cross-entropy
$$
\widehat{\ell}_{\mathrm{uni}}
\;:=\;
\frac{1}{N}\sum_{(u_1,\dots,u_T)\in D_{\mathrm{cal}}}
\sum_{t=1}^{T}
-\log q_{\mathrm{uni}}(u_t).
$$
We then define the excess loss
ℓ̂ex(θ) := ℓ̂cal(θ) − ℓ̂uni,
and set L*(θ) := −ℓ̂ex(θ).
Equivalently, this is a log improvement over a unigram model;
exponentiating yields a normalized perplexity exp (ℓ̂ex(θ)). In
practice we recommend computing the above on bytes (bits-per-byte) when
tokenizers differ across runs; this makes L* comparable across
vocabularies at the cost of a slightly different evaluation
pipeline.
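The following sketch implements the recommended byte-level pipeline. The one assumption not fixed by the text is a hypothetical hook `neg_log_prob_bytes(doc)` returning the model's total negative log-probability (in nats) of a byte string under the fixed evaluation protocol.

```python
import math
from collections import Counter

def calibration_progress(docs, neg_log_prob_bytes):
    """Compute (l_cal, l_uni, L*) on a byte-level calibration corpus D_cal.

    `docs` is a list of byte strings; `neg_log_prob_bytes(doc)` is a
    hypothetical hook returning the model's total -log p (nats) for doc.
    """
    total_nll, n_bytes = 0.0, 0
    counts = Counter()
    for doc in docs:
        total_nll += neg_log_prob_bytes(doc)
        n_bytes += len(doc)
        counts.update(doc)                 # unigram counts over bytes
    l_cal = total_nll / n_bytes            # model cross-entropy per byte
    # Unigram baseline fit on the same corpus and representation.
    l_uni = -sum(c * math.log(c / n_bytes) for c in counts.values()) / n_bytes
    l_excess = l_cal - l_uni               # excess loss over unigram
    return l_cal, l_uni, -l_excess         # L* := -l_excess
```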
We require L* to
be invariant to irrelevant implementation choices that do not alter the
induced distribution in a meaningful way. Formally, for two checkpoints
θ and θ′ that differ only by a
reparameterization or by a numerically irrelevant change (e.g. weight
format, compilation), we require L*(θ) = L*(θ′)
up to negligible evaluation noise. More importantly, across training
runs we require that L* depends primarily on
the learned predictive distribution, not on the training schedule. This
is the sense in which using a public evaluation loss is preferable to
using ``number of tokens trained on'' or ``optimizer steps,''
both of which can vary substantially in their marginal effect.
We fix Dcal once, publish it (or its hash plus sampling procedure), and compute L* for every checkpoint with the same evaluation settings (context length, truncation, and masking). We also recommend reporting standard errors for ℓ̂cal computed by batching documents, since small differences in L* can otherwise be within measurement noise. Finally, since we only access a discrete set of checkpoints, we store the mapping checkpoint ↦ L* and treat interpolation as a modeling convenience rather than an operational primitive.
Our later generalized linear model assumes that, holding (R, b) fixed, p𝒯(L*, R, b) is nondecreasing in L* and that logit (p𝒯) is approximately affine in L* over the range under consideration. Monotonicity is an empirical regularity we enforce at the modeling layer; approximate affinity is a local approximation justified by smoothness of success probabilities in effective pretraining progress. Concretely, if there exists a latent scalar progress variable λ for which the logit is affine and L* is an affine proxy for λ on the sampled checkpoints, then all half-space and tradeoff claims remain valid after reparameterization, up to rescaling of the coefficients.
We now posit a low-dimensional structural model for how task success
varies as we move along the three coordinates (L*, R, b).
Fix a task distribution 𝒯 with semantic
verifier V𝒯(x, y) ∈ {0, 1}
and baseline success probability p0(𝒯) (e.g. random
guessing). For each triple (L*, R, b)
we define the induced success probability
p𝒯(L*, R, b) := Prx ∼ 𝒯, y ∼ πL*, R, b(⋅ ∣ x) [V𝒯(x, y) = 1],
where πL*, R, b
denotes the model after selecting a pretraining checkpoint with progress
L*, applying
post-training of strength R,
and running the inference procedure with budget b. We assume coordinate-wise
monotonicity: after transforming b to log b, the map (L*, R, log b) ↦ p𝒯(L*, R, b)
is nondecreasing in each coordinate. This assumption is not logically
necessary for the definitions, but it is the regularity that makes
surface estimation statistically and operationally well-posed; in
particular it rules out pathological ``capability regressions’’ under
more compute in the regime of interest.
Our central approximation is a generalized linear margin model on the
logit scale. Writing σ(t) := (1 + e−t)−1
and z := (L*, R, log b) ∈ ℝ3,
we assume that for each 𝒯 there exist
parameters (α𝒯, β𝒯, γ𝒯, η𝒯, ξ𝒯)
with α𝒯, β𝒯, γ𝒯 > 0
such that
$$
p_{\mathcal{T}}(L^*,R,b)
\;=\;
\sigma\bigl(\alpha_{\mathcal{T}}(L^{*}-\eta_{\mathcal{T}})+\beta_{\mathcal{T}}R+\gamma_{\mathcal{T}}\log b+\xi_{\mathcal{T}}\bigr).
$$
The use of log b is a
parsimonious way to encode diminishing returns in common inference-time
procedures (best-of-n, tree
search expansions, verifier calls), while retaining an affine geometry
in z. We treat this specification as a local
model: smoothness of p𝒯 in each coordinate
implies that the logit is well-approximated by its first-order Taylor
expansion on bounded domains, and the fitted coefficients should be
interpreted as effective slopes over the evaluated range rather than
universal constants.
Given an above-baseline margin ε > 0, we define the emergent region for task 𝒯 as
ℰ𝒯(ε) := {(L*, R, b) : p𝒯(L*, R, b) ≥ p0(𝒯) + ε}.
Under this model, the boundary ∂ℰ𝒯(ε) is a hyperplane in
z-coordinates. Indeed, letting
κ𝒯(ε) := logit (p0(𝒯) + ε) − ξ𝒯 + α𝒯η𝒯,
we have
(L*, R, b) ∈ ∂ℰ𝒯(ε) ⇔ α𝒯L* + β𝒯R + γ𝒯log b = κ𝒯(ε).
We refer to this boundary (in the original (L*, R, b) coordinates) as the emergence surface. The normal vector w𝒯 := (α𝒯, β𝒯, γ𝒯)
describes how sensitive the task is to improvements along each axis,
while the offset κ𝒯(ε) determines
how far along the manifold we must move to clear the specified
above-baseline threshold.
The same geometry yields iso-capability tradeoffs: for any target success probability p̄ ∈ (p0(𝒯), 1),
the level set {z: p𝒯(z) = p̄}
satisfies a linear equation on the logit scale,
α𝒯L* + β𝒯R + γ𝒯log b = logit (p̄) − ξ𝒯 + α𝒯η𝒯.
Differentiating this identity yields tradeoff rates among the three
resources:
α𝒯 dL* + β𝒯 dR + γ𝒯 dlog b = 0.
Hence, holding R fixed, an
increase ΔL* permits a
multiplicative reduction in inference budget by
$$
b \mapsto b\cdot
\exp\!\Bigl(-\frac{\alpha_{\mathcal{T}}}{\gamma_{\mathcal{T}}}\Delta
L^*\Bigr),
$$
and holding b fixed, the
marginal post-training required to compensate for a drop in pretraining
progress is dR/dL* = −α𝒯/β𝒯.
These ratios, rather than the raw coefficients, are the invariant
objects: any affine reparameterization of L* rescales α𝒯 but leaves the implied
tradeoffs unchanged after appropriate units conversion.
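To make the iso-capability geometry concrete, here is a small sketch of the GLM success model and the implied budget compensation. The coefficient values are purely illustrative, not fitted; the final assertion checks that the compensated budget leaves the modeled success probability unchanged.

```python
import math

def glm_success(L_star, R, b, alpha, beta, gamma, eta, xi):
    """p_T under the GLM margin model:
    sigma(alpha*(L* - eta) + beta*R + gamma*log(b) + xi)."""
    logit = alpha * (L_star - eta) + beta * R + gamma * math.log(b) + xi
    return 1.0 / (1.0 + math.exp(-logit))

def compensated_budget(b, delta_L, alpha, gamma):
    """Inference budget preserving success after a pretraining gain
    delta_L in L*, holding R fixed: b -> b * exp(-(alpha/gamma) delta_L)."""
    return b * math.exp(-(alpha / gamma) * delta_L)

# Illustrative coefficients (not fitted values).
alpha, beta, gamma, eta, xi = 4.0, 1.5, 0.8, 1.0, -1.0
p_before = glm_success(1.2, 0.5, 16, alpha, beta, gamma, eta, xi)
b_new = compensated_budget(16, 0.1, alpha, gamma)   # same success at L* + 0.1
p_after = glm_success(1.3, 0.5, b_new, alpha, beta, gamma, eta, xi)
assert abs(p_before - p_after) < 1e-9
```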
Finally, we clarify parameter interpretations that will matter for estimation. The threshold parameter η𝒯 is the (notional) L* location at which, with R = 0 and b = 1, the centered term α𝒯(L* − η𝒯) vanishes and the logit reduces to ξ𝒯; it is therefore a convenient centering but not directly identifiable without fixing conventions for R and b. The intercept ξ𝒯 absorbs systematic advantages of the task or prompt format not captured by L*, R, b and should be regarded as task-specific. When a real-valued margin score s𝒯(x, y) is available, one may view the model as describing Pr [s𝒯(x, y) ≥ 0] and interpret (α𝒯, β𝒯, γ𝒯) as slopes of a latent margin distribution; however, our subsequent procedures only require Bernoulli verifier outcomes. The next section formalizes the estimation problem implied by this model: given only black-box access to (x, y, V𝒯(x, y)) under chosen (L*, R, b), we must recover (up to tolerance) the emergence surface and the associated tradeoff geometry.
We now state the statistical problem that mediates between the
geometric model of the preceding section and an implementable evaluation procedure.
Fix a task distribution 𝒯 with verifier
V𝒯(x, y) ∈ {0, 1},
baseline p0(𝒯), and
margin ε > 0. Write z := (L*, R, log b)
and restrict attention to a known feasible box
𝒵 := [Lmin, Lmax] × [0, Rmax] × [log bmin, log bmax] ⊂ ℝ3,
which encodes the checkpoints, post-training strengths, and inference
budgets available for evaluation. The (unknown) success probability
is
p𝒯(z) = Prx ∼ 𝒯, y ∼ πz(⋅ ∣ x) [V𝒯(x, y) = 1],
and the corresponding emergence set and emergence surface are
ℰ𝒯(ε) := {z ∈ 𝒵 : p𝒯(z) ≥ p0(𝒯) + ε}, 𝒮𝒯(ε) := ∂ℰ𝒯(ε) ∩ 𝒵.
When the GLM margin model holds, 𝒮𝒯(ε)
is (approximately) a hyperplane in z; however, the problem definition
below is stated in terms of black-box access to p𝒯 and does not require
us to assume the parametric form at the level of the oracle.
An evaluation consists of choosing a point z = (L*, R, log b) ∈ 𝒵
and drawing i.i.d. labeled samples
Ui(z) := V𝒯(xi, yi) ∈ {0, 1}, xi ∼ 𝒯, yi ∼ πz(⋅ ∣ xi),
so that 𝔼[Ui(z)] = p𝒯(z).
We model this as black-box access to an oracle Eval𝒯(z) that returns an
independent Bernoulli draw with mean p𝒯(z); calling
the oracle m times at z produces an empirical mean $\hat p_m(z):=m^{-1}\sum_{i=1}^m U_i(z)$.
Optionally, if the verifier provides a real-valued margin s𝒯(x, y),
the oracle may also return Si(z) := s𝒯(xi, yi),
which can be used to reduce variance; nonetheless, our guarantees will
be expressed in terms of the Bernoulli labels alone.
We allow an adaptive/sequential access pattern: at time t an algorithm selects a query point zt (and possibly a batch size mt) as a function of past observations {(zτ, U1 : mτ(zτ))}τ < t. The total evaluation budget is n := ∑tmt, optionally with a refined accounting that weights each label by an inference cost proportional to b.
The central output of emergence surface estimation (ESE) is an estimate $\widehat{\mathcal{S}}_{\mathcal{T}}(\varepsilon)$
of the surface 𝒮𝒯(ε). Since 𝒮𝒯(ε) is a
codimension-one set, we measure error geometrically rather than by
pointwise probability estimation. A general metric is the (restricted)
Hausdorff distance on 𝒵:
$$
d_{\mathrm{H}}(\widehat{\mathcal{S}},\mathcal{S})
:=
\max\Bigl\{
\sup_{u\in \widehat{\mathcal{S}}}\inf_{v\in \mathcal{S}}\|u-v\|_2,\
\sup_{v\in \mathcal{S}}\inf_{u\in \widehat{\mathcal{S}}}\|u-v\|_2
\Bigr\}.
$$
We say that $\widehat{\mathcal{S}}$ is
Δ-accurate (in geometry) if
$d_{\mathrm{H}}(\widehat{\mathcal{S}},\mathcal{S})\le
\Delta$. Under the hyperplane model of Theorem~1, it is often
convenient to express the same goal as a normal-direction error bound. Writing the true
surface as {z ∈ 𝒵 : w⊤z + c = 0}
with w ≠ 0, define the signed
distance of any point z to the
surface by
$$
\mathrm{dist}_{\perp}(z,\mathcal{S})
:=
\frac{w^{\top}z+c}{\|w\|_2}.
$$
An estimated hyperplane {ŵ⊤z + ĉ = 0}
is then Δ-accurate on 𝒵 if
$$
\sup_{z\in\mathcal{Z}}
\Bigl|\frac{\hat w^{\top}z+\hat c}{\|\hat
w\|_2}-\frac{w^{\top}z+c}{\|w\|_2}\Bigr|
\le \Delta,
$$
which implies a comparable Hausdorff bound on bounded domains. We will
use Δ as the target geometric
tolerance throughout.
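Because the discrepancy between two signed-distance functions is affine in z, its supremum over the box 𝒵 is attained at a vertex; the following minimal sketch exploits this to check Δ-accuracy of a candidate hyperplane against a reference one.

```python
import itertools
import numpy as np

def hyperplane_geo_error(w_hat, c_hat, w, c, box):
    """sup_z |dist(z, S_hat) - dist(z, S)| over a box, via its vertices.

    The difference of two signed-distance functions is affine in z, so
    its absolute supremum over the box is attained at one of the 2^3
    corners. box: list of (lo, hi) pairs per coordinate of (L*, R, log b).
    """
    w_hat, w = np.asarray(w_hat, float), np.asarray(w, float)
    errs = []
    for corner in itertools.product(*box):
        z = np.asarray(corner, float)
        d_hat = (w_hat @ z + c_hat) / np.linalg.norm(w_hat)
        d_true = (w @ z + c) / np.linalg.norm(w)
        errs.append(abs(d_hat - d_true))
    return max(errs)   # Delta-accurate iff this is <= Delta
```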
Because p𝒯(z) is
observed only through Bernoulli noise, a surface estimate must be
accompanied by high-probability guarantees. Formally, an ESE procedure
takes inputs (𝒯, ε, Δ, δ) and
outputs a random set $\widehat{\mathcal{S}}_{\mathcal{T}}(\varepsilon)$
(and, when appropriate, parameter estimates ŵ, ĉ) such that
$$
\Pr\!\bigl(d_{\mathrm{H}}(\widehat{\mathcal{S}}_{\mathcal{T}}(\varepsilon),\mathcal{S}_{\mathcal{T}}(\varepsilon))\le
\Delta\bigr)\ \ge\ 1-\delta.
$$
Operationally, our algorithms will maintain confidence intervals for
each queried point. If nt(z)
denotes the number of oracle calls made at z up to time t, we compute an interval CIt(z) = [ℓt(z), ut(z)]
satisfying a simultaneous coverage guarantee of the form
Pr (∀t ≥ 1, ∀z ∈ {z1, …, zt}: p𝒯(z) ∈ CIt(z)) ≥ 1 − δ,
using either a union bound over all queried points/times with
Hoeffding-type radii, or an anytime-valid confidence sequence
(e.g. empirical Bernstein). These calibrated intervals serve two roles:
(i) they certify whether a queried point is above or below the threshold
p0(𝒯) + ε
once ut(z) < p0 + ε
or ℓt(z) > p0 + ε,
and (ii) they control geometric error by ensuring that points used to
fit the surface lie within a known probability band around the
threshold. The algorithmic question, deferred to the next section, is
how to choose the query locations zt and sample
sizes mt
so as to minimize evaluation cost while achieving the stated (Δ, δ) guarantee.
We describe a concrete procedure for estimating 𝒮𝒯(ε) from black-box oracle access, using two ingredients: (i) sequential testing at selected query points to certify whether p𝒯(z) lies above or below the emergence threshold θ := p0(𝒯) + ε, and (ii) a monotone parametric surface fit in the transformed coordinates z = (L*, R, log b). The algorithm is designed to concentrate samples near the boundary, since points deep inside ℰ𝒯(ε) or far outside it can be classified with few samples and contribute little to geometric accuracy.
At each queried location z, we maintain an anytime-valid confidence sequence CIt(z) = [ℓt(z), ut(z)] for p𝒯(z) based on the Bernoulli samples observed so far at z. Concretely, we may use a Hoeffding-style radius with a union bound over the (random) set of queried points and times, or an empirical-Bernstein confidence sequence; we only require the simultaneous coverage property stated in the preceding section. A point z is certified non-emergent once ut(z) < θ, and certified emergent once ℓt(z) > θ. Points with θ ∈ CIt(z) are treated as ambiguous and drive further sampling.
We fit a monotone logistic model in z,
p̂(z) = σ(ŵ⊤z + ĉ), ŵ = (α̂, β̂, γ̂) ∈ ℝ3,
with the coordinate-wise monotonicity constraint α̂, β̂, γ̂ ≥ 0.
Fitting may be done by constrained maximum likelihood on the aggregated
Bernoulli data, optionally with an ℓ2 penalty for stability.
When we wish to further guard against misspecification, we may
post-process the fitted scores ĝ(z) := ŵ⊤z + ĉ
by an isotonic calibration step: we fit a monotone link σ̃(⋅) so that z ≼ z′ implies
σ̃(ĝ(z)) ≤ σ̃(ĝ(z′))
(with respect to the product order on 𝒵). In practice, the linear-logit fit already
captures the dominant geometry, while the monotonicity constraints
prevent spurious folds that would lead to wasted exploration.
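One way to implement the constrained fit, sketched with SciPy's box-constrained L-BFGS; the ℓ2 term and the nonnegativity bounds mirror the penalty and monotonicity constraints described above, and the aggregated-count likelihood matches the Bernoulli oracle model.

```python
import numpy as np
from scipy.optimize import minimize

def fit_monotone_logistic(Z, k, m, l2=1e-3):
    """Constrained MLE for p(z) = sigma(w^T z + c) with w >= 0.

    Z: (n, 3) array of points (L*, R, log b); k: successes per point;
    m: trials per point. Returns (w_hat, c_hat).
    """
    Z = np.asarray(Z, float)
    k, m = np.asarray(k, float), np.asarray(m, float)

    def nll(params):
        w, c = params[:3], params[3]
        logits = Z @ w + c
        log_p = -np.logaddexp(0.0, -logits)    # log sigma(logits)
        log_q = -np.logaddexp(0.0, logits)     # log(1 - sigma(logits))
        return -(k @ log_p + (m - k) @ log_q) + l2 * (w @ w)

    x0 = np.array([0.1, 0.1, 0.1, 0.0])
    bounds = [(0.0, None)] * 3 + [(None, None)]   # alpha, beta, gamma >= 0
    res = minimize(nll, x0, method="L-BFGS-B", bounds=bounds)
    return res.x[:3], res.x[3]
```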
At iteration t, given the
current fit (ŵt, ĉt),
we define the predicted surface $\widehat{\mathcal{S}}_t:=\{z\in\mathcal{Z}:\hat
p_t(z)=\theta\}$ and restrict attention to a candidate set 𝒞t consisting of points
in a thin tube around $\widehat{\mathcal{S}}_t$ (e.g. a discretized
ρ-neighborhood intersected
with 𝒵). We select the next query
location by maximizing an acquisition score that favors both high
uncertainty and proximity to the boundary,
zt ∈ arg maxz ∈ 𝒞tAt(z) := width (CIt(z)) ⋅ (|p̂t(z) − θ|+λ)−1,
where λ > 0 prevents
numerical blow-up. After choosing zt, we draw
additional oracle samples at zt until either
(i) zt
becomes certified emergent/non-emergent, or (ii) the confidence interval
radius falls below a prescribed local tolerance rt (smaller when
zt lies
closer to $\widehat{\mathcal{S}}_t$).
We then refit the monotone logistic model using all collected data and
repeat.
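The acquisition step, in outline. Here `candidates` is assumed to be a hypothetical discretization of the tube 𝒞t around the current surface estimate, and `stats` holds the current per-point confidence intervals (unqueried points default to the trivial interval, which gives them high priority).

```python
import numpy as np

def select_query(candidates, stats, w_hat, c_hat, theta, lam=1e-2):
    """Pick the next query: high CI width, close to the predicted boundary.

    candidates: (M, 3) array of z = (L*, R, log b) in the current tube;
    stats: dict index -> (lo, hi) confidence interval for p_T(z).
    """
    candidates = np.asarray(candidates, float)
    logits = candidates @ w_hat + c_hat
    p_hat = 1.0 / (1.0 + np.exp(-logits))
    widths = np.array([stats.get(i, (0.0, 1.0))[1] - stats.get(i, (0.0, 1.0))[0]
                       for i in range(len(candidates))])
    scores = widths / (np.abs(p_hat - theta) + lam)   # acquisition A_t(z)
    return int(np.argmax(scores))
```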
We stop once the boundary is locally resolved to the target geometric
tolerance Δ. Operationally, we
maintain a finite Δ-net 𝒩 of points in 𝒵 (or, more efficiently, a net on the current
tube around $\widehat{\mathcal{S}}_t$).
For each z ∈ 𝒩, we ask whether
its classification (emergent vs. non-emergent) is determined by the
confidence sequence, or else whether it lies at least Δ away (in the estimated normal
direction) from the current boundary so that its label is irrelevant to
Δ-accurate surface placement.
Concretely, writing the estimated signed normal distance as
$$
\widehat{\mathrm{dist}}_{\perp,t}(z)
:=
\frac{\hat w_t^{\top}z+\hat c_t}{\|\hat w_t\|_2},
$$
we continue sampling until all net points with $|\widehat{\mathrm{dist}}_{\perp,t}(z)|\le
\Delta$ are certified (i.e. their CI lies entirely above or below θ). This enforces that the remaining
ambiguity set is contained in a normal-direction tube of width 2Δ around the final estimate.
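A sketch of this stopping check on a Δ-net, assuming the per-point labels come from the certification logic above.

```python
import numpy as np

def boundary_resolved(net, w_hat, c_hat, labels, delta_geo):
    """Stopping rule: every net point within delta_geo of the estimated
    hyperplane (in signed normal distance) must be certified.

    net: (M, 3) array of z-points; labels: list of strings per point in
    {'emergent', 'non-emergent', 'ambiguous'}.
    """
    net = np.asarray(net, float)
    dist = (net @ w_hat + c_hat) / np.linalg.norm(w_hat)
    near = np.abs(dist) <= delta_geo
    return all(labels[i] != "ambiguous" for i in np.flatnonzero(near))
```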
From the final normal vector ŵ = (α̂, β̂, γ̂)
we report iso-capability rates by implicit differentiation of α̂L* + β̂R + γ̂log b = const.
In particular, holding R fixed
we estimate
$$
\frac{\partial L^*}{\partial \log b}\Big|_{\hat p=\theta}
=
-\frac{\hat\gamma}{\hat\alpha},
\qquad
\frac{\partial R}{\partial \log b}\Big|_{\hat p=\theta}
=
-\frac{\hat\gamma}{\hat\beta},
$$
and similarly ∂L*/∂R = −β̂/α̂.
If a deployment cost model C(L*, R, b)
is supplied, we additionally solve minz ∈ 𝒵C(z)
subject to p̂(z) ≥ θ (with a
safety margin obtained by replacing θ by θ + slack), yielding a
compute-efficient operating point together with a certificate derived
from the maintained confidence intervals.
We quantify the evaluation cost required to estimate the emergence
surface to geometric tolerance Δ under the generalized linear
margin model. Throughout we work on a bounded feasible set 𝒵 ⊂ ℝ3 in coordinates z = (L*, R, log b)
with diameter diam(𝒵) ≤ D, and
we write the emergence threshold as θ := p0(𝒯) + ε ∈ (0, 1).
Under the logistic specification, p(z) = σ(w⊤z + c)
with w ≥ 0 and ∥w∥2 ≤ W, the
true surface is
𝒮 := {z ∈ 𝒵 : p(z) = θ} = {z ∈ 𝒵 : w⊤z + c = logit (θ)}.
We measure geometric error in the normal direction by the signed
distance to the hyperplane: for a candidate estimate (ŵ, ĉ) we consider the
induced surface $\widehat{\mathcal{S}}=\{z:\hat w^\top z+\hat
c=\operatorname{logit}(\theta)\}$ and seek $\sup_{z\in\mathcal{Z}}|\mathrm{dist}_\perp(z,\widehat{\mathcal{S}})-\mathrm{dist}_\perp(z,\mathcal{S})|\le
\Delta$ with probability at least 1 − δ.
Near the boundary, perturbations of signed distance translate to
perturbations of success probability at a rate controlled by the slope
of the link function in the normal direction. Let ν := w/∥w∥2 denote the unit normal (we reserve n for sample counts). Along the normal line z(t) = z0 + tν passing through some z0 ∈ 𝒮 we have
$$
\frac{d}{dt}p(z(t))
=
\sigma'(w^\top z(t)+c)\,\|w\|_2.
$$
On the surface, w⊤z0 + c = logit (θ)
and hence σ′(logit (θ)) = θ(1 − θ).
Consequently, for small normal displacement t we obtain the local
linearization
p(z0 + tν) = θ + θ(1 − θ)∥w∥2 t + O(t2).
This identity makes explicit the dependence on (i) the threshold
location θ (tasks with θ very close to 0 or 1 have
weaker slope), and (ii) the aggregate slope ∥w∥2 (surfaces with small
normal coefficient are harder to localize geometrically). In particular,
ensuring probability error at most τ at points within an O(Δ) tube around 𝒮 suffices to ensure geometric error O(τ/(θ(1 − θ)∥w∥2)).
The adaptive procedure described above concentrates evaluations in a shrinking tube around the current estimate; formally, one may view it as repeatedly reducing uncertainty about the signed offset of the hyperplane in the direction of its normal, while using monotonicity to avoid exploring regions whose label is already certified. The following argument summarizes the standard rate.
We select a finite Δ-net of candidate locations in a tube around the current surface estimate; the net size contributes only logarithmic factors since 𝒵 is three-dimensional and bounded. At a fixed location z, certifying whether p(z) ≷ θ up to probability error τ requires O(τ−2log (1/δ)) samples by standard concentration (Hoeffding or empirical Bernstein). By the linearization above, a normal displacement of Δ induces a probability change of order θ(1 − θ)∥w∥2Δ near the boundary; thus taking τ ≍ θ(1 − θ)∥w∥2Δ suffices to rule out surfaces shifted by more than Δ. The adaptive strategy ensures that only points whose (estimated) signed distance is O(Δ) are sampled to this precision; points farther away are certified with fewer samples. Summing the resulting contributions yields N = Õ(Δ−2log (1/δ)) with constants depending on W and θ(1 − θ).
The Bernoulli model already captures worst-case label noise since Var(V) ≤ 1/4. If, additionally, the verifier provides a real-valued margin score s𝒯(x, y) with sub-Gaussian tails and 𝔼[s ∣ z] strictly increasing in the latent logit w⊤z + c, then one may replace Bernoulli concentration by sub-Gaussian concentration, improving constants (and, in favorable regimes, the number of model calls if one can reuse a single model output to obtain multiple noisy margin observations). The exponent Δ−2 is not removed in general because geometric localization still reduces to estimating a mean shift of order Δ.
Suppose we have tasks 𝒯1, …, 𝒯K each
obeying a logistic surface model, and we wish to estimate all surfaces
to error Δ with overall
failure probability at most δ.
If the tasks are unrelated, a union bound yields an immediate
allocation: run the single-task procedure with confidence δ/K per task, giving total
labeled samples $\sum_{k=1}^K
\tilde{O}(\Delta^{-2}\log(K/\delta))$ and allowing adaptive
reallocation to tasks whose boundaries lie in harder-to-resolve regions
(e.g. smaller ∥wk∥2).
A more informative setting arises when tasks share a common normal
direction, as in a hierarchical model
pk(z) = σ(w⊤z + ck), k ∈ [K],
with a single unknown w and
task-specific offsets ck. Then we may
pool data across tasks to estimate w at rate governed by the total
sample size, and subsequently estimate each ck with
additional task-local samples. In this shared-normal model, the dominant
cost of learning orientation is paid once, and the per-task incremental
cost is essentially one-dimensional (estimating the intercept along the
learned normal), yielding a strict reduction compared to learning wk separately
whenever K is moderate and the
surfaces are not nearly parallel to coordinate axes.
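A sketch of the pooled estimator under the shared-normal model, extending the single-task fit with per-task intercepts c_k; variable names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def fit_shared_normal(Z, k, m, task, K, l2=1e-3):
    """Hierarchical fit p_k(z) = sigma(w^T z + c_k) with one shared w >= 0.

    Z: (n, 3) points; k, m: successes/trials per point; task: (n,) integer
    task indices in [0, K). Returns (w_hat, c_hat) with c_hat shape (K,).
    """
    Z = np.asarray(Z, float)
    k, m = np.asarray(k, float), np.asarray(m, float)
    task = np.asarray(task, int)

    def nll(params):
        w, c = params[:3], params[3:]
        logits = Z @ w + c[task]               # task-specific offsets
        log_p = -np.logaddexp(0.0, -logits)
        log_q = -np.logaddexp(0.0, logits)
        return -(k @ log_p + (m - k) @ log_q) + l2 * (w @ w)

    x0 = np.concatenate([np.full(3, 0.1), np.zeros(K)])
    bounds = [(0.0, None)] * 3 + [(None, None)] * K
    res = minimize(nll, x0, method="L-BFGS-B", bounds=bounds)
    return res.x[:3], res.x[3:]
```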
We complement the upper bounds by showing that the Δ−2 dependence is not an artifact of the particular adaptive strategy, but is information-theoretically unavoidable under the Bernoulli-verifier observation model. Moreover, we formalize the intuitive necessity of sampling in a neighborhood of the emergence surface: evaluations far from the transition region cannot, in the worst case, certify the boundary location at geometric resolution Δ.
Fix a bounded feasible set 𝒵 ⊂ ℝ3 in coordinates z = (L*, R, log b),
and a success threshold θ ∈ (0, 1). Consider the logistic
family
pc(z) = σ(w⊤z + c), ∥w∥2 ≤ W,
with unknown offset c
(equivalently, an unknown hyperplane intercept). The emergence surface
at level θ is 𝒮c = {z : w⊤z + c = logit (θ)}.
We reduce boundary localization to distinguishing two close
hypotheses. Fix any admissible w and choose offsets c0, c1
such that the induced surfaces 𝒮c0 and 𝒮c1 are
separated by signed normal distance Δ on 𝒵. Since the signed distance to the
hyperplane w⊤z + c = logit (θ)
scales as |c − c̃|/∥w∥2,
it suffices to take
c1 − c0 = ∥w∥2 Δ.
Now consider any queried point z. Under c0 and c1, the Bernoulli means
differ by
$$
|p_{c_1}(z)-p_{c_0}(z)|
\;=\;
\big|\sigma(w^\top z+c_1)-\sigma(w^\top z+c_0)\big|
\;\le\;
\sup_u \sigma'(u)\,|c_1-c_0|
\;=\;
\frac{1}{4}\|w\|_2\,\Delta,
$$
using supuσ′(u) = 1/4.
Consequently, even at the most informative margin location, the gap in
Bernoulli means is at most a constant multiple of Δ. Distinguishing two Bernoulli
distributions whose means differ by Θ(Δ) with error probability
at most δ requires Ω(Δ−2log (1/δ))
samples by standard hypothesis-testing bounds (equivalently, KL
divergence or Pinsker inequalities). Since any surface estimator of
geometric error ≤ Δ must
distinguish c0 from
c1 with probability
at least 1 − δ, the stated
lower bound follows.
The preceding argument uses only the global bound σ′(u) ≤ 1/4; a
sharper statement explains why ``testing far away’’ can be dramatically
less informative. If all queried points satisfy a margin condition |w⊤z + c0 − logit(θ)| ≥ m, then for the same two offsets as above,
$$
|p_{c_1}(z)-p_{c_0}(z)|
\;\le\;
\Bigl(\sup_{|u-\operatorname{logit}(\theta)|\ge m}\sigma'(u)\Bigr)\,\|w\|_2\,\Delta
\;\le\;
\theta_m(1-\theta_m)\,\|w\|_2\,\Delta,
$$
where θm := σ(logit(θ) + m) and θm(1 − θm) decays exponentially in m. Thus, if an algorithm avoids an m-tube around the surface, the per-sample information about the offset shrinks with m, and the number of samples required to achieve the same Δ grows proportionally to $\bigl(\sup_{|u-\operatorname{logit}(\theta)|\ge m}\sigma'(u)\bigr)^{-2}$.
In minimax terms, any guarantee that is uniform over instances forces
the algorithm to eventually collect data from points whose predicted
success probability is close to θ (equivalently, whose logit margin
is O(1)), matching the
operational prescription of near-boundary evaluation.
The preceding lower bound rules out claims of cheap certification of emergence from sparse, off-boundary measurements: extrapolation from confidently-failing regions (or confidently-succeeding regions) cannot, in general, pin down where the transition occurs to geometric accuracy Δ. This is not merely a limitation of our modeling; it is an intrinsic limitation of black-box Bernoulli feedback.
A related necessity phenomenon appears in compute-optimal deployment selection. If we discretize the feasible box into candidate configurations (``arms’’) z1, …, zM with unknown means p(zi) and wish to identify the minimum-cost arm with p(zi) ≥ θ, we recover a thresholded best-arm identification problem. Standard lower bounds imply that arms with gap gapi := |θ − p(zi)| require Ω(gapi−2log (1/δ)) samples to classify reliably, and the arms that are near-optimal in cost tend also to be those with small gaps (hence dominating the sample complexity). Therefore, even when the end goal is not an explicit surface estimate but a compute-minimizing emergent configuration, the evaluation budget must concentrate on candidates near the boundary, and a Δ−2-type scaling reappears once we translate a desired tolerance in configuration-space to a gap in success probability.
Having estimated an emergence surface for a fixed task 𝒯, we often wish not merely to certify that some configuration is emergent, but to attain emergence at minimum total compute. Formally, for a target success threshold θ := p0(𝒯) + ε
and a cost model C(L*, R, b)
combining (possibly exogenous) pretraining, post-training, and
inference-time costs, COCA asks us to solve
min(L*, R, b) ∈ 𝒟 C(L*, R, b) s.t. p𝒯(L*, R, b) ≥ θ,
where 𝒟 is a feasible box encoding
available checkpoints, post-training limits, and inference budgets.
Since p𝒯 is
coordinate-wise nondecreasing in (L*, R, log b),
the constraint is monotone: increasing R or b (or improving L*) preserves
feasibility, while decreasing them risks violating the target.
Under the GLM margin model, COCA reduces to a tractable constrained
optimization. Writing z = (L*, R, u)
with u := log b, the
condition p𝒯(L*, R, b) ≥ θ
is equivalent to a single linear inequality in z:
α𝒯(L* − η𝒯) + β𝒯R + γ𝒯u + ξ𝒯 ≥ logit (θ),
i.e.,
α𝒯L* + β𝒯R + γ𝒯log b ≥ κ𝒯(ε) for
an appropriate constant κ𝒯(ε).
Thus the feasible emergent set is a half-space in (L*, R, log b)
intersected with 𝒟. In particular, if
C is convex in (L*, R, log b)
(e.g. separable with convex components), COCA is a convex program with a
unique optimum under mild strict convexity assumptions.
A useful special case, consistent with our later empirical protocol,
is a separable cost model
C(L*, R, b) = cL(L*) + cR(R) + cb b,
where cL
encodes the amortized cost of selecting a checkpoint (often decreasing
in L* if smaller
L* corresponds to
more pretraining compute), cR encodes the
post-training budget (nondecreasing in R), and cbb
encodes linear inference-time cost. Passing to u = log b yields objective
cL(L*) + cR(R) + cbeu
and a single linear constraint in (L*, R, u).
At any optimum that is not pinned to a boundary of 𝒟, the emergence constraint is tight
(otherwise we can increase L* or decrease R or b to reduce cost while remaining
feasible). The Karush–Kuhn–Tucker conditions then give a clear
marginal-rate interpretation: letting λ ≥ 0 be the multiplier on the
constraint, stationarity reads
cL′(L*) = λα𝒯, cR′(R) = λβ𝒯, cbeu = λγ𝒯,
with complementary slackness enforcing equality on the constraint.
Interpreting signs according to the monotonicity of each cost component,
these equations equate the normalized marginal return contributed by each axis: we should spend
compute on whichever of pretraining, post-training, or inference yields
the largest increase in α𝒯L* + β𝒯R + γ𝒯u
per marginal unit of cost, until these normalized marginal returns
balance (or until a box constraint binds).
When cL
and cR are
approximately linear over the feasible range (a reasonable approximation
when available checkpoints and RL strengths are coarse), the solution
often becomes ``nearly corner-like’’: we either (i) push b up while keeping (L*, R) near
their cheapest feasible values, (ii) increase R while keeping (L*, b) small,
or (iii) move to a better checkpoint while keeping (R, b) minimal, depending
on which axis most cheaply supplies the required margin. Concretely, if
we solve the constraint for b,
$$
b \;=\;
\exp\!\left(\frac{\kappa_{\mathcal{T}}(\varepsilon)-\alpha_{\mathcal{T}}L^*-\beta_{\mathcal{T}}R}{\gamma_{\mathcal{T}}}\right),
$$
then COCA reduces to minimizing
$$
(L^*,R)\;\mapsto\;
c_L(L^*)+c_R(R)+c_b\exp\!\left(\frac{\kappa_{\mathcal{T}}(\varepsilon)-\alpha_{\mathcal{T}}L^*-\beta_{\mathcal{T}}R}{\gamma_{\mathcal{T}}}\right),
$$
which is convex when cL, cR
are convex. The exponential term makes explicit the iso-capability
tradeoff: improving L* or increasing R linearly in the margin reduces the
required b multiplicatively, with rates
governed by α𝒯/γ𝒯
and β𝒯/γ𝒯.
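A sketch of this reduced problem under the separable cost model, eliminating b through the tight constraint α𝒯L* + β𝒯R + γ𝒯 log b = κ𝒯(ε). All coefficients and cost functions in the usage line are hypothetical; for convex cL, cR the objective is convex in (L*, R).

```python
import numpy as np
from scipy.optimize import minimize

def solve_coca(alpha, beta, gamma, kappa, c_L, c_R, c_b, box):
    """Minimize c_L(L*) + c_R(R) + c_b * b with b pinned by the tight
    emergence constraint alpha*L* + beta*R + gamma*log(b) = kappa.

    box = ((L_min, L_max), (R_min, R_max)).
    """
    def cost(v):
        L_star, R = v
        b = np.exp((kappa - alpha * L_star - beta * R) / gamma)
        return c_L(L_star) + c_R(R) + c_b * b

    x0 = np.array([np.mean(box[0]), np.mean(box[1])])
    res = minimize(cost, x0, method="L-BFGS-B", bounds=box)
    L_star, R = res.x
    b = float(np.exp((kappa - alpha * L_star - beta * R) / gamma))
    return L_star, R, b, res.fun

# Illustrative instantiation (all numbers hypothetical).
sol = solve_coca(alpha=4.0, beta=1.5, gamma=0.8, kappa=5.0,
                 c_L=lambda L: 10.0 * L, c_R=lambda R: 2.0 * R, c_b=0.05,
                 box=((0.5, 2.0), (0.0, 3.0)))
```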
Finally, COCA must be robust to misspecification and estimation error, since in practice we only have an estimated success function p̂𝒯 (or an estimated hyperplane in (L*, R, log b)) from finite verifier samples. A conservative approach is to optimize against a lower confidence bound: if we maintain a uniform high-probability guarantee |p̂(z) − p(z)| ≤ ρ on the feasible domain, then enforcing p̂(z) ≥ θ + ρ implies p(z) ≥ θ with probability at least 1 − δ (after union bounding over candidate solutions). Equivalently, if we estimate the surface in normal-direction error Δ, then near the boundary the induced probability error scales as O(Δ) through the derivative of the logistic link, and we can inflate the required margin by an amount calibrated to Δ to obtain feasibility. Under mild regularity (e.g. Lipschitz cost in (L*, R, log b)), this conservatism yields an explicit suboptimality bound: the additional cost incurred by robustification is at most proportional to the product of a cost Lipschitz constant and the geometric surface estimation error. In this sense, the same near-boundary evaluation that is necessary for surface localization also enables principled, compute-optimal deployment decisions with certified success guarantees.
We outline an experimental design whose goal is to make the coordinates (L*, R, b) operational, to render the monotone surface hypothesis falsifiable, and to support cross-laboratory comparison. We assume access to a family of publicly available checkpoints, a reproducible post-training pipeline, and a verifier-equipped task suite.
First, we fix a calibration corpus (public, contamination-audited, and disjoint from all evaluation tasks) and define L* for each checkpoint by a standardized loss-like scalar, e.g. normalized token-level cross-entropy on the corpus under a fixed tokenizer and evaluation protocol. We then index checkpoints by L* rather than by parameter count or training step, and we publish the mapping checkpoint ↦ L*, including variance from evaluation subsampling. This step makes the pretraining axis comparable across model families and mitigates confounding from architectural changes. When only discrete checkpoints are available, we treat L* as a discrete coordinate and restrict surface fitting accordingly (piecewise-linear in L* or logistic with checkpoint indicators), but we still report slopes in the continuous surrogate for interpretability.
Second, we instantiate the post-training axis R as a controlled sweep of a single pipeline with fixed data, reward model/verifier, and optimizer, varying only a scalar strength parameter. Concretely, we recommend parameterizing R by a monotone quantity that is stable across runs, such as total KL-constrained improvement (integrated KL penalty weight times steps), total number of policy-gradient steps times effective step size, or a validated reward improvement measured on a held-out preference set. For each checkpoint (or for a subset spanning the feasible L* range), we run multiple independent seeds for each R value and store the full training traces, enabling post hoc checks that R ↦ capability is not spuriously nonmonotone due to optimization instability. We recommend at least one ablation in which R is varied while holding the reward model fixed versus jointly updating it, since the latter changes the semantics of the improvement axis.
Third, we standardize the inference-time compute knob b by adopting a reference inference procedure whose compute can be externally audited. Examples include: (i) best-of-n sampling (with b = n) under fixed temperature and prompt; (ii) verifier-guided selection among n candidates, counting verifier calls into b explicitly; (iii) bounded tree search with a fixed expansion policy and b equal to the number of node expansions; (iv) iterative refinement with b equal to the number of refinement rounds. We report both the semantic definition of b and a hardware-normalized proxy (e.g. total forward-pass tokens plus verifier tokens) to permit translation across implementations. Because our model uses log b, we include b values on a geometric grid (e.g. 1, 2, 4, 8, …) to stabilize slope estimation.
Fourth, we assemble a verifier-equipped task suite defining several 𝒯 with clear baselines p0(𝒯). We recommend mixing: (a) algorithmic tasks with exact verifiers (modular arithmetic, parity, bracket matching); (b) structured reasoning tasks with deterministic checkers (GSM-like math with canonicalization and numeric equivalence); (c) knowledge tasks with constrained formats and adjudication (multiple-choice slices such as MMLU/C-Eval subsets); and (d) transcription/transliteration tasks with string-distance based verifiers. For each 𝒯 we publish the instance generator or dataset split, the verifier code, and any normalization rules, and we test for verifier brittleness by adversarial formatting perturbations. We additionally recommend reporting a margin score s𝒯(x, y) when available (e.g. log-odds from a learned verifier) to reduce variance in near-boundary estimation, while still defining success by V𝒯 ∈ {0, 1}.
Fifth, we implement boundary estimation by combining (i) a coarse initial design over (L*, R, log b) and (ii) adaptive refinement near the predicted boundary. In practice, we choose a modest initial grid that spans the feasible box and then allocate additional evaluations according to an uncertainty-weighted boundary acquisition rule, maintaining sequential confidence intervals for p̂𝒯(L*, R, b). To reduce variance and improve comparability across z = (L*, R, log b), we either reuse a fixed instance set across queried points (paired evaluation) or use stratified sampling with shared random seeds; we then explicitly report which regime we used, since pairing changes the effective noise model. Stopping criteria are expressed in terms of (estimated) normal-direction boundary error Δ or, operationally, the maximum probability ambiguity in a narrow band around θ.
Sixth, we specify reporting conventions. For each task we report: the fitted normal vector w = (α𝒯, β𝒯, γ𝒯) (up to scale) with confidence intervals; the estimated surface $\widehat{\mathcal{S}}_{\mathcal{T}}$ as level sets in 2D slices; iso-capability tradeoff rates such as ∂L*/∂log b at fixed R; and the realized success rates at the recommended compute-optimal point (if COCA is performed), including a held-out confirmation evaluation not used in fitting. We also publish all queried (L*, R, b) points, sample counts, random seeds, and the exact prompt templates.
Finally, we recommend ablations that test necessity of each modeling choice: replacing adaptive sampling with uniform grids at matched cost; removing monotonicity constraints in fitting; fitting without the log b transform; using only V𝒯 versus using (V𝒯, s𝒯); and transferring a surface fit from one inference procedure (e.g. best-of-n) to another at matched token budget. These ablations help isolate whether observed emergence surfaces reflect genuine tradeoffs among (L*, R, b) rather than artifacts of evaluation design.
Our central modeling move is to treat ``emergence’’ as a boundary of a monotone success function on the coordinate triple z = (L*, R, log b) and to approximate this boundary by a hyperplane in logit space. The utility of this move is that it converts a vague qualitative transition into a geometric object with tradeoff rates. Its limitations are exactly the conditions under which either (i) the map z ↦ p𝒯(z) ceases to be coordinate-wise nondecreasing, or (ii) the logit is not well-approximated by an affine functional of z on the feasible box. We therefore interpret fitted surfaces as local summaries, and we expect failure modes in regimes where the task distribution, the verifier, or the inference procedure induces discontinuities, multi-modality, or strategic behavior.
A first point of fragility is the specification of the task distribution. All quantities are indexed by 𝒯, but in practice one often uses a dataset proxy $\widehat{\mathcal{T}}$ or a convenience slice whose support differs from the intended deployment distribution. Even when $p_{\widehat{\mathcal{T}}}(L^*,R,b)$ is well-behaved, p𝒯(L*, R, b) need not share the same normal direction, and the apparent tradeoff ∂L*/∂log b can change sign when 𝒯 places more mass on instances that are solved by different internal mechanisms (e.g. memorization versus algorithmic reasoning). The surface picture remains meaningful only to the extent that 𝒯 is stable under the operational perturbations one cares about (prompt templates, formatting, tool availability, adversarial inputs). Empirically, this suggests that surface estimation should be accompanied by stress tests that explicitly shift 𝒯 and report the induced surface displacement as an additional uncertainty, rather than reporting only statistical error at fixed 𝒯.
A second limitation concerns verifier semantics. Our definition of success depends on V𝒯(x, y), and any change in verifier semantics changes the task itself. This is benign for exact checkers (e.g. arithmetic) but problematic for learned verifiers or rubric-based graders that can drift over time, be sensitive to formatting, or be gamed by the model. In such cases p𝒯 is not merely a property of the policy; it is a property of the policy–verifier interaction, and increasing R may increase ``success'' by exploiting verifier loopholes rather than improving the underlying capability. Formally, the monotonicity of p𝒯 in R can fail even if the underlying capability is monotone, because the success event is ill-posed. A practical mitigation is to treat the verifier as part of the experimental object: version it, audit it for brittleness, and, when possible, report both V𝒯 and a more conservative cross-verification (e.g. two independent checkers, or a human spot-check) on a held-out subset to bound the extent of reward hacking.
Third, non-monotonicity can arise from the knobs themselves. Increasing b can reduce success when the inference procedure introduces additional opportunities for error (e.g. longer chains that amplify compounding mistakes, or search heuristics that overfit to spurious verifier signals). Increasing R can degrade performance due to overoptimization, loss of generality, or mode collapse, especially when the post-training objective is misaligned with 𝒯. Even L*, as an empirical scalar derived from a calibration corpus, can be non-monotone with respect to a downstream task if the calibration corpus correlates poorly with the features relevant for 𝒯. Our algorithmic framework can detect such violations via systematic isotonicity residuals and held-out checks, but it cannot ``repair'' them without changing the model class (e.g. switching from a single hyperplane to a piecewise-monotone or nonparametric surface). When substantial non-monotonicity is present, the appropriate output is not a surface estimate but an explicit certificate that the monotone-manifold hypothesis is falsified on the tested domain.
We also emphasize the relation to difficulty heterogeneity. Many datasets are mixtures over latent difficulty levels; as a consequence, p𝒯(z) can be a mixture of sigmoids with different thresholds and slopes, producing curvature or apparent ``kinks'' in the fitted boundary. One can partially linearize this by conditioning on an estimated difficulty score $d(x)$ and studying surfaces for $\mathcal{T}\mid d(x)\in I$ (difficulty strata), or by fitting a hierarchical model in which $\eta_{\mathcal{T}}$ is instance-dependent. This connects our geometric view to item-response theory: the emergence surface for a stratum behaves like a separatrix for items of comparable discrimination, while the aggregate surface conflates strata and may exaggerate sharp transitions. Thus, when interpreting an ``emergent'' region, we recommend reporting how much of the effect persists under difficulty-controlled evaluation.
Governance implications are immediate. A three-axis surface replaces a single ``model capability’’ number with a map of how capabilities can be purchased with training, post-training, and inference-time compute. For safety-relevant tasks, this implies that restricting one axis (e.g. limiting deployment-time b) may not control risk if the same surface can be crossed by increasing R or selecting a more advanced checkpoint (smaller loss, larger L*). Conversely, transparency about tradeoffs can support proportional regulation: one can define risk thresholds in terms of p𝒯(L*, R, b) and require reporting of the minimal-cost point achieving them under an auditable cost model C. However, publishing surfaces for harmful tasks may itself lower the barrier to attainment by revealing the cheapest route to high success. We therefore distinguish between (i) publishing the methodology and benign-task validation, and (ii) controlled disclosure of harmful-capability surfaces, potentially via secure evaluation and limited-access reporting.
Finally, we list directions where the manifold picture must be generalized. Agentic settings introduce state, memory, and tool use; success then depends on trajectories, and b may control search depth, tool calls, or time, yielding nontrivial feedback between inference compute and environment responses. Harmful capability surfaces (e.g. biosecurity, cyber exploitation) require adversarial task distributions, robust verifiers, and careful accounting for human-in-the-loop and tool availability as additional coordinates beyond (L*, R, b). Methodologically, we anticipate extending from a single hyperplane to (a) locally linear surfaces with curvature penalties, (b) monotone Gaussian-process or spline models over z, and (c) multi-task coupling that shares information across related 𝒯 while allowing task-specific thresholds η𝒯. The core question remains geometric: which directions in the space of training and deployment choices move us most efficiently across a verifiable success boundary, and under what conditions does that boundary remain stable enough to be governed.