
Capability Manifolds for Emergence: Surfaces in Calibrated Pretraining Progress, Post-Training RL, and Test-Time Compute


Table of Contents

  1. Introduction and Motivation: why 1D scaling is insufficient in 2026 (LRMs, RL post-training, inference-time compute); connection to loss-threshold emergence and metric debates; contributions and overview.
  2. Related Work: emergent abilities (Wei et al., BIG-Bench), mirage/metric arguments (Schaeffer et al.), loss thresholds (Du et al.), inference scaling laws, RLHF/RLAIF and LRMs, PASSUNTIL/high-resolution evaluation, task difficulty slicing.
  3. Formal Setup: tasks with verifiers; definitions of baseline success; axis definitions (L*, R, b); what counts as an emergent capability region; desiderata for a capability manifold.
  4. Calibrated Pretraining Progress L*: definition options (public calibration corpus loss, unigram-normalized perplexity); invariances and failure modes; practical guidance; assumptions needed for theory.
  5. The Capability Manifold Model: generalized linear margin/logit model; monotonicity and smoothness assumptions; emergence surfaces and iso-capability lines; interpretation of parameters as tradeoff rates.
  6. Problem 1 — Emergence Surface Estimation (ESE): formal statement; evaluation oracle; distance-to-surface objective; confidence calibration.
  7. Algorithm: Adaptive Surface-Finding via Sequential Testing: pseudocode; sampling policy; fitting method (monotone logistic / isotonic GLM); stopping rule; producing tradeoff summaries.
  8. Theory: Upper Bounds: sample complexity for estimating the emergence surface; dependence on slopes, noise, and desired accuracy; extension to multiple tasks with shared evaluations.
  9. Theory: Lower Bounds and Necessity of Near-Boundary Testing: information-theoretic reductions to Bernoulli mean testing / best-arm identification; matching Ω(Δ−2) lower bound; implications for predictability claims.
  10. Problem 2 — Compute-Optimal Capability Attainment (COCA): optimization over (L*, R, b) under a cost model; closed-form solutions under the GLM; robustness to model misspecification.
  11. Empirical Protocol (Optional but Strengthening): recommended experimental design with open checkpoints; controlled RL sweeps; standardized inference-time compute knobs; verifiable task suite; reporting conventions and ablations.
  12. Discussion and Limitations: when the manifold breaks (distribution shift, changing verifiers, non-monotonicities); relation to task difficulty slicing; governance implications; future work (agents, harmful capability surfaces).

Content

1. Introduction and Motivation: why 1D scaling is insufficient in 2026 (LRMs, RL post-training, inference-time compute); connection to loss-threshold emergence and metric debates; contributions and overview.

The prevailing methodology for understanding capability growth in large language models has been to index models along a single ``scale'' axis and to study how task performance changes as a function of that axis. In early work, the axis was frequently parameter count or training compute; in more recent work, it is more naturally a calibrated pretraining progress scalar, such as a normalized loss on a public calibration corpus. This one-dimensional view is increasingly misaligned with how frontier systems are produced and deployed in 2026. Modern systems are routinely altered by at least two additional and operationally significant controls: post-training strength (e.g., RLHF/RLAIF or other KL-constrained policy improvement) and inference-time compute (e.g., best-of-n, verifier-guided search, or repeated sampling with selection). When three such knobs can be adjusted, the notion of a task having a single ``threshold model size'' is at best an incomplete projection of a higher-dimensional reality.

We therefore take as primitive a task distribution 𝒯 equipped with a semantic verifier, and we focus on the success probability p𝒯(L*, R, b) of a model queried at coordinates (L*, R, b). The central conceptual move is to treat emergence not as a one-dimensional transition, but as a boundary in a space of controllable resources. Given a baseline success probability p0(𝒯) and a margin ε > 0 defining what it means to be ``meaningfully above baseline,’’ the emergent region is the set
{(L*, R, b) : p𝒯(L*, R, b) ≥ p0(𝒯) + ε}.
Its boundary is a surface (in transformed coordinates (L*, R, log b)), and the empirical question becomes: how does this surface tilt, how stable is it across tasks, and what tradeoffs between pretraining, post-training, and inference-time compute does it induce?

This reframing is motivated by a set of recurring empirical phenomena. First, the rise of long-run and deliberative inference procedures makes b a first-class quantity: a fixed base model can exhibit large changes in verified success when allowed to sample and select, to use self-consistency, or to invoke external verifiers/search. Second, the post-training pipeline has become more heterogeneous: RL for instruction following, preference optimization, safety fine-tuning, and RL for reasoning can change behavior in ways not well captured by L* alone, and can do so with a strength that is quantifiable in terms of validated reward improvement or constrained update magnitude. Third, ``emergence’’ itself has been contested: discontinuities in performance curves can be produced by metric choice, coarse evaluation, or the interaction of difficulty distributions with a smoothly improving underlying model. A geometric perspective clarifies what is being claimed. If performance is smooth in each controllable coordinate, then apparent thresholds along a single axis can arise as intersections of a smooth surface with a one-dimensional path through resource space; conversely, if a task truly exhibits a sharp transition, that claim should be robust to reasonable reparameterizations and to the presence of auxiliary axes.

Our technical stance is that, for many verifiable tasks, a generalized linear margin model in (L*, R, log b) provides a useful idealization: the logit of success is approximately affine in these coordinates, with positive coefficients reflecting monotonicity. Under this assumption, the emergent region is (approximately) a half-space and the emergence boundary is (approximately) a hyperplane. This is not merely a convenient visualization; it supplies testable predictions. In particular, iso-capability curves satisfy linear tradeoffs
α𝒯dL* + β𝒯dR + γ𝒯dlog b = 0,
so that increases in pretraining progress can be exchanged for multiplicative reductions in inference-time compute, or for reductions in post-training strength, at a rate determined by the task-specific coefficients. Such relations are directly falsifiable by controlled experiments that vary one axis while compensating with another to maintain a fixed success level.

A second motivation is methodological: if emergence is a surface, then evaluating it by exhaustive gridding is prohibitively expensive. We therefore prioritize adaptive evaluation protocols that treat the model–verifier system as a black box and that allocate samples near the boundary where information is maximized. The goal is to output an estimate $\widehat{\mathcal{S}}_{\mathcal{T}}$ of the boundary with a geometric accuracy target Δ (distance in the normal direction) and confidence 1 − δ, while minimizing verifier-labeled queries. This places the problem within the scope of active learning and sequential testing, but with structure: monotonicity in each coordinate and a low-dimensional parametric form for the logit. The resulting guarantees are of independent interest for practitioners who must decide which combination of training, post-training, and inference-time compute reaches a target reliability at minimal cost.

Our contributions are threefold. First, we formalize emergence as a set membership problem in (L*, R, log b) and show that, under the GLM margin model, the emergent region is exactly a half-space with an explicit boundary equation. Second, we provide an adaptive algorithm that estimates this boundary with sample complexity Õ(Δ−2log (1/δ)) using only black-box verifier outcomes (and optionally margin scores), and we extract interpretable tradeoff rates from the fitted normal vector. Third, we give a matching information-theoretic lower bound showing that no algorithm can uniformly beat the Ω(Δ−2log (1/δ)) dependence, thereby explaining why high-resolution evaluation near transition regions is unavoidable. Empirically, we outline how to instantiate R and b via reproducible post-training and inference controls, and how to validate the manifold picture by demonstrating predicted iso-capability tradeoffs on canonical verifiable emergence tasks.


2. Related Work: emergent abilities (Wei et al., BIG-Bench), mirage/metric arguments (Schaeffer et al.), loss thresholds (Du et al.), inference scaling laws, RLHF/RLAIF and LRMs, PASSUNTIL/high-resolution evaluation, task difficulty slicing.

Early empirical discussions of ``emergent abilities'' emphasized apparently discontinuous performance transitions as a function of a single scale proxy, often model size or training compute. The BIG-Bench line of work and its successors documented many tasks whose accuracy curves appear flat over a range of scales and then rise sharply \cite{wei2022emergent,bigbench}. Subsequent studies broadened the set of benchmarks and highlighted that emergence claims depend delicately on the choice of metric, the discretization of model sizes, and the evaluation noise. Our framework is compatible with the descriptive content of these results, but it changes the mathematical object of interest: we treat a task's ``threshold'' as a boundary in a controllable resource space rather than as a single scalar breakpoint.

A prominent counterpoint is the ``mirage'' or ``metric artifact'' perspective, which argues that qualitative ``phase transitions'' can be induced by nonlinear metrics (e.g., exact match), finite sample effects, and heterogeneous difficulty mixtures even when the underlying competence increases smoothly. Related analyses show that coarse gridding of model sizes and overplotting of noisy estimates can visually exaggerate discontinuities. The geometric viewpoint we adopt can be read as a structural refinement of this critique: if success probability is smooth and monotone in multiple coordinates, then a sharp transition observed along one path (e.g., scaling L* alone) can arise from the intersection of that path with a tilted surface. Conversely, if one asserts a genuine discontinuity, it should be robust under changes of path and under inclusion of additional operational degrees of freedom such as post-training and test-time compute.

Another related thread seeks more invariant progress measures than parameter count. Several works relate downstream capability to pretraining loss and propose that certain skills become available when loss crosses task-dependent levels. Our use of a calibrated progress scalar L* is in this spirit: we assume L* is a loss-like quantity intended to be comparable across training setups, and we treat it as one coordinate among others. In particular, we separate the question ``how far along pretraining are we?'' from ``how much post-training modification and inference-time computation are we willing to apply?'' This disentangling is motivated by the fact that modern systems are not deployed at raw checkpoints, but rather after substantial policy shaping and with nontrivial inference procedures.

Inference-time scaling laws and deliberative decoding methods motivate an explicit inference compute axis. A large empirical literature studies best-of-n, self-consistency, rejection sampling, verifier-guided search, tool-use policies, and other procedures where performance improves with additional sampling or search. Some works model the gain as a function of the number of samples, selection temperature, or search depth, sometimes yielding approximate power laws in compute or log-compute. These methods effectively modify the mapping from a base conditional distribution to an induced decision rule, and the resulting success probability is naturally nondecreasing in an inference budget parameter. Our formalism incorporates such effects through the coordinate b (typically via log b), allowing comparisons between ``train more'' and ``search more'' in a single representation.

Post-training procedures such as RLHF and RLAIF, as well as more recent RL-for-reasoning and long-reasoning-model (LRM) pipelines, motivate a second non-pretraining axis. While preference optimization is often discussed qualitatively, several papers operationalize post-training strength through the number of RL steps, a KL constraint, or validated reward improvement. These controls can induce substantial changes in helpfulness, harmlessness, and instruction-following, and in some regimes can trade off against pretraining progress in achieving fixed benchmark performance. Our coordinate R is intended to capture this degree of freedom abstractly, emphasizing that the evaluation target is not a single trained model but a family of policies reachable from checkpoints under controlled post-training.

Methodologically, our emphasis on verifiable tasks and high-resolution evaluation aligns with work arguing that emergence claims require dense sampling near the transition region. For code and math tasks, pass@k-style metrics, repeated sampling with verification, and related ``until-success'' protocols expose how rapidly success probability can change as a function of sampling budget and how sensitive conclusions are to evaluation resolution. Difficulty slicing plays a similar role: by stratifying instances by estimated difficulty (e.g., length, required steps, or empirical success rates), one often observes smoother progress within slices even when the aggregate curve appears abrupt. Our setting assumes a task distribution equipped with a semantic verifier, precisely to make such repeated-sampling and stratified analyses principled and to separate semantic correctness from proxy metrics.

Finally, the adaptive evaluation problem we consider is related to active learning, sequential testing, and threshold estimation under monotonicity constraints. Classical work studies efficient identification of decision boundaries for generalized linear models and the design of queries that concentrate near regions of high information. Our contribution is to place these ideas in a capability-measurement context with explicit resource coordinates and to focus on estimating an emergence surface rather than a one-dimensional threshold. This differs from exhaustive gridding approaches common in scaling studies: by allocating samples preferentially near the boundary and enforcing coordinate-wise monotonicity, one can obtain statistically certified boundary estimates with substantially fewer verifier-labeled evaluations than uniform exploration, while still retaining interpretable tradeoff quantities between pretraining, post-training, and inference-time compute.


3. Formal Setup: tasks with verifiers; definitions of baseline success; axis definitions (L*, R, b); what counts as an emergent capability region; desiderata for a capability manifold.

We fix a model family equipped with three operational degrees of freedom: a pretraining checkpoint indexed by a scalar L*, a post-training operator of adjustable strength R ≥ 0, and an inference-time procedure parameterized by a compute budget b ∈ ℕ. The intended reading is that L* measures ``how far along pretraining we are'' in a way that is comparable across training runs, R measures ``how much policy shaping we apply after pretraining'' (e.g., via RLHF/RLAIF or a KL-constrained policy improvement step), and b measures ``how much computation we expend at test time'' (e.g., number of samples in best-of-n, number of verifier-guided expansions, or the number of calls to a tool or critic). We defer the question of how to construct a useful L* to Section 4; for the present section, it suffices that L* indexes a finite set of available checkpoints and that it admits an ordering consistent with pretraining progress.

A task is specified as a distribution 𝒯 over instances x together with a semantic verifier. Concretely, we assume access to a measurable mapping
V𝒯 : (x, y) ↦ {0, 1},
where V𝒯(x, y) = 1 indicates semantic success (e.g., exact mathematical correctness, passing unit tests, or satisfying an externally specified constraint). When available, we also allow a real-valued margin score s𝒯(x, y) ∈ ℝ whose sign agrees with the verifier decision, so that s𝒯(x, y) > 0 ⇒ V𝒯(x, y) = 1. The margin is not logically necessary for our definitions, but it is useful for variance reduction and for designing adaptive sampling rules in later sections.

Given coordinates (L*, R, b), we obtain an induced randomized decision rule as follows. The checkpoint at L* is transformed by the post-training operator at strength R into a policy (or conditional distribution) that we denote abstractly by πL*, R(⋅ ∣ x). The inference procedure with budget b induces a distribution over final outputs, which we write as y ∼ Infer(πL*, R, x; b). This abstraction is intended to cover both ``single-shot’’ decoding (where b might correspond to a fixed decoding strategy) and deliberative procedures (where b controls, e.g., the number of candidates proposed and then filtered, or the depth of a search guided by a learned or external verifier). We then define the task success probability
p𝒯(L*, R, b) := Prx ∼ 𝒯, y ∼ Infer(πL*, R, x; b) [ V𝒯(x, y) = 1 ].
The central structural assumption used throughout is coordinate-wise monotonicity: after transforming b to log b, the map (L*, R, log b) ↦ p𝒯(L*, R, b) is nondecreasing in each coordinate. This assumption is empirical rather than tautological, but it is operationally natural: additional pretraining progress should not systematically harm semantic accuracy on an in-distribution verifiable task; additional post-training strength (as measured by a validated scalar R) should not reduce task success in expectation; and additional inference budget should not decrease success when the procedure is allowed to retain the best candidate according to the verifier or an aligned proxy.

To define what it means for a capability to emerge, we must specify a baseline. For each task 𝒯 we assume a baseline success probability p0(𝒯), typically the chance rate of a randomized strategy with the same output constraints (e.g., 1/k for k-way multiple choice), or an empirical baseline obtained by a simple non-language heuristic. We fix a margin parameter ε > 0 and declare that (L*, R, b) is emergent if
p𝒯(L*, R, b) ≥ p0(𝒯) + ε.
This definition encodes a minimal notion of above-baseline competence while remaining agnostic to any particular benchmark metric: it depends only on a semantic verifier and a probabilistic model of repeated trials. The corresponding emergent region is the subset
ℰ𝒯(ε) := {(L*, R, b) : p𝒯(L*, R, b) ≥ p0(𝒯) + ε},
and the associated emergence surface is its boundary in the transformed coordinates z = (L*, R, log b), i.e.,
𝒮𝒯(ε) := ∂{z : p𝒯(z) ≥ p0(𝒯) + ε}.
The geometric viewpoint is that an ``emergent ability'' is not a single breakpoint along one scaling axis, but rather a codimension-one boundary in a space of controllable resources; one-dimensional emergence curves are recovered by intersecting 𝒮𝒯(ε) with a path such as (L*, 0, 1) or, more generally, (L*, R̄, b̄) for fixed values R̄ and b̄.

Finally, we record desiderata for a useful capability manifold representation, since these constrain both modeling and measurement. First, the coordinates should be controllable in the sense that we can implement controlled interventions in R and b and can select among discrete checkpoints in L* without changing unrelated aspects of the system. Second, the representation should be comparable across training setups, motivating a calibrated L* and a task-independent interpretation of b as actual inference compute rather than an idiosyncratic knob. Third, measurement should be statistically sound: the definition of p𝒯 permits repeated sampling and uses V𝒯 to separate semantic correctness from proxy metrics, allowing confidence intervals and sequential tests. Fourth, the manifold should support tradeoff analysis: for fixed target success p̄ ∈ (p0(𝒯), 1) we seek interpretable rates such as ∂L*/∂log b and ∂R/∂log b along iso-success sets. Fifth, the representation should admit cost-aware optimization: given a cost model C(L*, R, b) combining training, post-training, and inference expenditures, we wish to identify compute-efficient points in ℰ𝒯(ε).

With these objects fixed, the remaining task is statistical: we must estimate 𝒮𝒯(ε) from black-box samples of V𝒯(x, y) at adaptively chosen (L*, R, b). The next section addresses the first coordinate, L*, since the interpretability of the entire manifold depends on a progress scalar whose meaning is stable under incidental changes to the training setup.


4. Calibrated Pretraining Progress L*: definition options (public calibration corpus loss, unigram-normalized perplexity); invariances and failure modes; practical guidance; assumptions needed for theory.

The coordinate L* is intended to capture ``how much pretraining we have'' in a way that is (i) ordered, so that larger L* corresponds to later checkpoints, and (ii) comparable across training setups, so that a unit change in L* has a stable interpretation even when tokenization, data mixtures, or optimization details vary. Since our algorithms only require access to a finite set of checkpoints, we could take L* to be the checkpoint index. However, doing so would destroy comparability and would make tradeoff rates such as ∂L*/∂log b meaningless. We therefore define L* by evaluation on a fixed public calibration distribution.

Fix a publicly specified calibration corpus Dcal (a multiset of sequences) together with a deterministic evaluation protocol (tokenization, truncation length, masking conventions). For a pretrained checkpoint θ we define the empirical next-token cross-entropy
$$ \widehat{\ell}_{\mathrm{cal}}(\theta) \;:=\; \frac{1}{N}\sum_{(u_1,\dots,u_T)\in D_{\mathrm{cal}}} \sum_{t=1}^{T} -\log p_{\theta}(u_t \mid u_{<t}), \qquad N:=\sum_{(u_1,\dots,u_T)\in D_{\mathrm{cal}}} T. $$
Since progress corresponds to decreasing loss, we choose a monotone reparameterization that is increasing with capability, e.g.
L*(θ) := ℓref − ℓ̂cal(θ),
where ℓref is a fixed constant (for instance, ℓ̂cal of a small reference checkpoint) chosen only to keep L* nonnegative. Any strictly increasing affine transform of L* is equivalent for our geometric statements; it only rescales the coefficient α𝒯 and shifts η𝒯.

Direct cross-entropy depends on the tokenization and can change by an additive constant when the vocabulary changes, even when the induced distribution over strings is essentially the same. To reduce this artifact, we may normalize by a fixed unigram baseline computed on the same evaluation representation. Let quni be the empirical unigram distribution over the evaluation alphabet (tokens, bytes, or characters) estimated from Dcal, and define the unigram cross-entropy
$$ \widehat{\ell}_{\mathrm{uni}} \;:=\; \frac{1}{N}\sum_{(u_1,\dots,u_T)\in D_{\mathrm{cal}}} \sum_{t=1}^{T} -\log q_{\mathrm{uni}}(u_t). $$
We then define the excess cross-entropy
ℓ̂ex(θ) := ℓ̂cal(θ) − ℓ̂uni,
and set L*(θ) := −ℓ̂ex(θ). Equivalently, this is a log improvement over a unigram model; exponentiating yields a normalized perplexity exp (ℓ̂ex(θ)). In practice we recommend computing the above on bytes (bits-per-byte) when tokenizers differ across runs; this makes L* comparable across vocabularies at the cost of a slightly different evaluation pipeline.
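As a concrete illustration, the following minimal sketch computes the calibrated progress scalar from per-token log-probabilities; the helper model_logprobs is a hypothetical stand-in for scoring under the fixed evaluation protocol (tokenization, truncation, masking), and nothing here is tied to a particular framework.

```python
import math
from collections import Counter

def calibrated_progress(docs, model_logprobs):
    """Sketch of L*(theta) = -(l_cal - l_uni) on a calibration corpus.

    docs: list of token sequences (the public calibration corpus D_cal).
    model_logprobs: hypothetical helper returning, for a sequence, the list
        of log p_theta(u_t | u_<t), one entry per token.
    """
    # Unigram distribution q_uni estimated from D_cal itself.
    counts = Counter(tok for seq in docs for tok in seq)
    total = sum(counts.values())

    n_tokens, nll_model, nll_uni = 0, 0.0, 0.0
    for seq in docs:
        nll_model -= sum(model_logprobs(seq))            # sum of -log p_theta
        nll_uni -= sum(math.log(counts[t] / total) for t in seq)
        n_tokens += len(seq)

    l_cal = nll_model / n_tokens                         # empirical cross-entropy
    l_uni = nll_uni / n_tokens                           # unigram cross-entropy
    return -(l_cal - l_uni)                              # L* = -excess cross-entropy
```

When tokenizers differ across runs, the same computation can be carried out at the byte level, matching the bits-per-byte recommendation above.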

We require L* to be invariant to irrelevant implementation choices that do not alter the induced distribution in a meaningful way. Formally, for two checkpoints θ and θ′ that differ only by a reparameterization or by a numerically irrelevant change (e.g. weight format, compilation), we require L*(θ) = L*(θ′) up to negligible evaluation noise. More importantly, across training runs we require that L* depends primarily on the learned predictive distribution, not on the training schedule. This is the sense in which using a public evaluation loss is preferable to using ``number of tokens trained on'' or ``optimizer steps,'' both of which can vary substantially in their marginal effect.

The calibration construction can fail in ways that matter for the subsequent manifold model; the practical guidance below is intended to limit these failure modes.

We fix Dcal once, publish it (or its hash plus sampling procedure), and compute L* for every checkpoint with the same evaluation settings (context length, truncation, and masking). We also recommend reporting standard errors for ℓ̂cal computed by batching documents, since small differences in L* can otherwise be within measurement noise. Finally, since we only access a discrete set of checkpoints, we store the mapping checkpoint  ↦ L* and treat interpolation as a modeling convenience rather than an operational primitive.

Our later generalized linear model assumes that, holding (R, b) fixed, p𝒯(L*, R, b) is nondecreasing in L* and that logit (p𝒯) is approximately affine in L* over the range under consideration. Monotonicity is an empirical regularity we enforce at the modeling layer; approximate affinity is a local approximation justified by smoothness of success probabilities in effective pretraining progress. Concretely, if there exists a latent scalar progress variable λ for which the logit is affine and L* is an affine proxy for λ on the sampled checkpoints, then all half-space and tradeoff claims remain valid after reparameterization, up to rescaling of the coefficients.


5. The Capability Manifold Model: generalized linear margin/logit model; monotonicity and smoothness assumptions; emergence surfaces and iso-capability lines; interpretation of parameters as tradeoff rates.

We now posit a low-dimensional structural model for how task success varies as we move along the three coordinates (L*, R, b). Fix a task distribution 𝒯 with semantic verifier V𝒯(x, y) ∈ {0, 1} and baseline success probability p0(𝒯) (e.g. random guessing). For each triple (L*, R, b) we define the induced success probability
p𝒯(L*, R, b) := Prx ∼ 𝒯, y ∼ πL*, R, b(⋅ ∣ x) [V𝒯(x, y) = 1],
where πL*, R, b denotes the model after selecting a pretraining checkpoint with progress L*, applying post-training of strength R, and running the inference procedure with budget b. We assume coordinate-wise monotonicity: after transforming b to log b, the map (L*, R, log b) ↦ p𝒯(L*, R, b) is nondecreasing in each coordinate. This assumption is not logically necessary for the definitions, but it is the regularity that makes surface estimation statistically and operationally well-posed; in particular it rules out pathological ``capability regressions’’ under more compute in the regime of interest.

Our central approximation is a generalized linear margin model on the logit scale. Writing σ(t) := (1 + e−t)−1 and z := (L*, R, log b) ∈ ℝ3, we assume that for each 𝒯 there exist parameters (α𝒯, β𝒯, γ𝒯, η𝒯, ξ𝒯) with α𝒯, β𝒯, γ𝒯 > 0 such that
p𝒯(L*, R, b) = σ(α𝒯(L* − η𝒯) + β𝒯R + γ𝒯log b + ξ𝒯).
The use of log b is a parsimonious way to encode diminishing returns in common inference-time procedures (best-of-n, tree search expansions, verifier calls), while retaining an affine geometry in z. We treat this as a local model: smoothness of p𝒯 in each coordinate implies that the logit is well-approximated by its first-order Taylor expansion on bounded domains, and the fitted coefficients should be interpreted as effective slopes over the evaluated range rather than universal constants.

Given an above-baseline margin ε > 0, we define the emergent region for task 𝒯 as
ℰ𝒯(ε) := {(L*, R, b) : p𝒯(L*, R, b) ≥ p0(𝒯) + ε}.
Under this model, the boundary ∂ℰ𝒯(ε) is a hyperplane in z-coordinates. Indeed, letting κ𝒯(ε) := logit (p0(𝒯) + ε) − ξ𝒯 + α𝒯η𝒯, we have
(L*, R, b) ∈ ∂ℰ𝒯(ε)  ⇔  α𝒯L* + β𝒯R + γ𝒯log b = κ𝒯(ε).
We refer to this boundary (in the original (L*, R, b) coordinates) as the emergence surface. The normal vector w𝒯 := (α𝒯, β𝒯, γ𝒯) describes how sensitive the task is to improvements along each axis, while the offset κ𝒯(ε) determines how far along the manifold we must move to clear the specified above-baseline threshold.

The same geometry yields iso-capability lines: for any target success probability p̄ ∈ (p0(𝒯), 1), the level set {z : p𝒯(z) = p̄} satisfies a linear equation on the logit scale,
α𝒯L* + β𝒯R + γ𝒯log b = logit (p̄) − ξ𝒯 + α𝒯η𝒯.
Differentiating this identity yields tradeoff rates among the three resources:
α𝒯dL* + β𝒯dR + γ𝒯dlog b = 0.
Hence, holding R fixed, an increase ΔL* permits a multiplicative reduction in inference budget by
$$ b \mapsto b\cdot \exp\!\Bigl(-\frac{\alpha_{\mathcal{T}}}{\gamma_{\mathcal{T}}}\Delta L^*\Bigr), $$
and holding b fixed, the marginal post-training required to compensate for a drop in pretraining progress is dR/dL* = −α𝒯/β𝒯. These ratios, rather than the raw coefficients, are the invariant objects: any affine reparameterization of L* rescales α𝒯 but leaves the implied tradeoffs unchanged after appropriate units conversion.
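To make these exchange rates concrete, here is a small worked example; the coefficients are purely illustrative, not fitted values from any experiment.

```python
import math

alpha, beta, gamma = 2.0, 0.5, 0.8        # hypothetical slopes for L*, R, log b

# Iso-capability relation: alpha dL* + beta dR + gamma dlog b = 0.
dL = 0.1                                  # gain 0.1 units of calibrated progress
b_factor = math.exp(-(alpha / gamma) * dL)
print(f"inference budget shrinks to {b_factor:.2f}x")   # ~0.78x

print(f"dR/dL* at fixed b: {-alpha / beta:.1f}")        # -4.0
```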

Finally, we clarify parameter interpretations that will matter for estimation. The threshold parameter η𝒯 is the (notional) L* location at which, with R = 0 and b = 1, the logit crosses ξ𝒯; it is therefore a convenient centering but not directly identifiable without fixing conventions for R and b. The intercept ξ𝒯 absorbs systematic advantages of the task or prompt format not captured by L*, R, b and should be regarded as task-specific. When a real-valued margin score s𝒯(x, y) is available, one may view the model as describing Pr [s𝒯(x, y) ≥ 0] and interpret (α𝒯, β𝒯, γ𝒯) as slopes of a latent margin distribution; however, our subsequent procedures only require Bernoulli verifier outcomes. The next section formalizes the estimation problem implied by this model: given only black-box access to (x, y, V𝒯(x, y)) under chosen (L*, R, b), we must recover (up to tolerance) the emergence surface and the associated tradeoff geometry.


6. Problem 1 — Emergence Surface Estimation (ESE): formal statement; evaluation oracle; distance-to-surface objective; confidence calibration.

We now state the statistical problem that mediates between the geometric model of Section 5 and an implementable evaluation procedure. Fix a task distribution 𝒯 with verifier V𝒯(x, y) ∈ {0, 1}, baseline p0(𝒯), and margin ε > 0. Write z := (L*, R, log b) and restrict attention to a known feasible box
𝒵 := [Lmin, Lmax] × [0, Rmax] × [log bmin, log bmax] ⊂ ℝ3,
which encodes the checkpoints, post-training strengths, and inference budgets available for evaluation. The (unknown) success probability is
p𝒯(z) = Prx ∼ 𝒯, y ∼ πz(⋅ ∣ x) [V𝒯(x, y) = 1],
and the corresponding emergence set and emergence surface are
ℰ𝒯(ε) := {z ∈ 𝒵 : p𝒯(z) ≥ p0(𝒯) + ε},   𝒮𝒯(ε) := ∂ℰ𝒯(ε) ∩ 𝒵.
When the GLM margin model holds, 𝒮𝒯(ε) is (approximately) a hyperplane in z; however, the problem definition below is stated in terms of black-box access to p𝒯 and does not require us to assume the parametric form at the level of the oracle.

An evaluation consists of choosing a point z = (L*, R, log b) ∈ 𝒵 and drawing i.i.d. labeled samples
Ui(z) := V𝒯(xi, yi) ∈ {0, 1},   xi ∼ 𝒯,  yi ∼ πz(⋅ ∣ xi),
so that 𝔼[Ui(z)] = p𝒯(z). We model this as black-box access to an oracle Eval𝒯(z) that returns an independent Bernoulli draw with mean p𝒯(z); calling the oracle m times at z produces an empirical mean $\hat p_m(z):=m^{-1}\sum_{i=1}^m U_i(z)$. Optionally, if the verifier provides a real-valued margin s𝒯(x, y), the oracle may also return Si(z) := s𝒯(xi, yi), which can be used to reduce variance; nonetheless, our guarantees will be expressed in terms of the Bernoulli labels alone.

We allow an adaptive/sequential access pattern: at time t an algorithm selects a query point zt (and possibly a batch size mt) as a function of past observations {(zτ, U1 : mτ(zτ))}τ < t. The total evaluation budget is n := ∑tmt, optionally with a refined accounting that weights each label by an inference cost proportional to b.

The central output of ESE is an estimate $\widehat{\mathcal{S}}_{\mathcal{T}}(\varepsilon)$ of the surface 𝒮𝒯(ε). Since 𝒮𝒯(ε) is a codimension-one set, we measure error geometrically rather than by pointwise probability estimation. A general metric is the (restricted) Hausdorff distance on 𝒵:
$$ d_{\mathrm{H}}(\widehat{\mathcal{S}},\mathcal{S}) := \max\Bigl\{ \sup_{u\in \widehat{\mathcal{S}}}\inf_{v\in \mathcal{S}}\|u-v\|_2,\ \sup_{v\in \mathcal{S}}\inf_{u\in \widehat{\mathcal{S}}}\|u-v\|_2 \Bigr\}. $$
We say that $\widehat{\mathcal{S}}$ is Δ-accurate (in geometry) if $d_{\mathrm{H}}(\widehat{\mathcal{S}},\mathcal{S})\le \Delta$. Under the hyperplane model of Theorem~1, it is often convenient to express the same goal as a normal-direction error bound. Writing the true surface as {z ∈ 𝒵 : w⊤z + c = 0} with w ≠ 0, define the signed distance of any point z to the surface by
$$ \mathrm{dist}_{\perp}(z,\mathcal{S}) := \frac{w^{\top}z+c}{\|w\|_2}. $$
An estimated hyperplane {z : ŵ⊤z + ĉ = 0} is then Δ-accurate on 𝒵 if
$$ \sup_{z\in\mathcal{Z}} \Bigl|\frac{\hat w^{\top}z+\hat c}{\|\hat w\|_2}-\frac{w^{\top}z+c}{\|w\|_2}\Bigr| \le \Delta, $$
which implies a comparable Hausdorff bound on bounded domains. We will use Δ as the target geometric tolerance throughout.
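Because both signed-distance functions are affine in z, their difference is affine as well, so the supremum over the box 𝒵 is attained at one of its vertices. A minimal sketch of this check (hyperplane parameters and the box are placeholders):

```python
import itertools
import numpy as np

def normal_direction_error(w_hat, c_hat, w, c, box):
    """Sup over the box Z of the discrepancy between the signed distances
    to two hyperplanes; the discrepancy is affine in z, hence it suffices
    to enumerate the vertices of the box. box: list of (lo, hi) per coordinate."""
    dist = lambda z, wv, cv: (wv @ z + cv) / np.linalg.norm(wv)
    return max(abs(dist(np.array(v), w_hat, c_hat) - dist(np.array(v), w, c))
               for v in itertools.product(*box))
```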

Because p𝒯(z) is observed only through Bernoulli noise, a surface estimate must be accompanied by high-probability guarantees. Formally, an ESE procedure takes inputs (𝒯, ε, Δ, δ) and outputs a random set $\widehat{\mathcal{S}}_{\mathcal{T}}(\varepsilon)$ (and, when appropriate, parameter estimates ŵ, ĉ) such that
$$ \Pr\!\bigl(d_{\mathrm{H}}(\widehat{\mathcal{S}}_{\mathcal{T}}(\varepsilon),\mathcal{S}_{\mathcal{T}}(\varepsilon))\le \Delta\bigr)\ \ge\ 1-\delta. $$
Operationally, our algorithms will maintain confidence intervals for each queried point. If nt(z) denotes the number of oracle calls made at z up to time t, we compute an interval CIt(z) = [ℓt(z), ut(z)] satisfying a simultaneous coverage guarantee of the form
Pr (∀t ≥ 1, ∀z ∈ {z1, …, zt}: p𝒯(z) ∈ CIt(z)) ≥ 1 − δ,
using either a union bound over all queried points/times with Hoeffding-type radii, or an anytime-valid confidence sequence (e.g. empirical Bernstein). These calibrated intervals serve two roles: (i) they certify whether a queried point is above or below the threshold p0(𝒯) + ε once ut(z) < p0 + ε or ℓt(z) > p0 + ε, and (ii) they control geometric error by ensuring that points used to fit the surface lie within a known probability band around the threshold. The algorithmic question, deferred to the next section, is how to choose the query locations zt and sample sizes mt so as to minimize evaluation cost while achieving the stated (Δ, δ) guarantee.
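A minimal sketch of the per-point interval bookkeeping, using a Hoeffding-style radius with a crude union bound over a preset maximum number of query points and a time-uniform log factor; empirical-Bernstein confidence sequences would give tighter radii behind the same interface.

```python
import math

class PointCI:
    """Running confidence interval for p_T(z) at one query point z."""
    def __init__(self):
        self.n = 0
        self.successes = 0

    def update(self, outcome: int):
        self.n += 1
        self.successes += outcome

    def interval(self, delta: float, max_points: int):
        if self.n == 0:
            return 0.0, 1.0
        p_hat = self.successes / self.n
        # Hoeffding radius, union-bounded over max_points points and all times.
        r = math.sqrt(math.log(2 * max_points * (self.n + 1) ** 2 / delta)
                      / (2 * self.n))
        return max(0.0, p_hat - r), min(1.0, p_hat + r)
```

A point is certified below threshold once its upper endpoint falls under p0 + ε, and certified above once its lower endpoint exceeds it.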


7. Algorithm: Adaptive Surface-Finding via Sequential Testing: pseudocode; sampling policy; fitting method (monotone logistic / isotonic GLM); stopping rule; producing tradeoff summaries.

We describe a concrete procedure for estimating 𝒮𝒯(ε) from black-box oracle access, using two ingredients: (i) sequential testing at selected query points to certify whether p𝒯(z) lies above or below the emergence threshold θ := p0(𝒯) + ε, and (ii) a monotone parametric surface fit in the transformed coordinates z = (L*, R, log b). The algorithm is designed to concentrate samples near the boundary, since points deep inside ℰ𝒯(ε) or far outside it can be classified with few samples and contribute little to geometric accuracy.

At each queried location z, we maintain an anytime-valid confidence sequence CIt(z) = [ℓt(z), ut(z)] for p𝒯(z) based on the Bernoulli samples observed so far at z. Concretely, we may use a Hoeffding-style radius with a union bound over the (random) set of queried points and times, or an empirical-Bernstein confidence sequence; we only require the simultaneous coverage property stated in Section 6. A point z is certified non-emergent once ut(z) < θ, and certified emergent once ℓt(z) > θ. Points with θ ∈ CIt(z) are treated as ambiguous and drive further sampling.

We fit a monotone logistic model in z,
p̂(z) = σ(ŵ⊤z + ĉ),   ŵ = (α̂, β̂, γ̂) ∈ ℝ3,  ĉ ∈ ℝ,
with the coordinate-wise monotonicity constraint α̂, β̂, γ̂ ≥ 0. Fitting may be done by constrained maximum likelihood on the aggregated Bernoulli data, optionally with an ℓ2 penalty for stability. When we wish to further guard against misspecification, we may post-process the fitted scores ŝ(z) := ŵ⊤z + ĉ by an isotonic calibration step: we fit a monotone link σ̃(⋅) so that z ≼ z′ implies σ̃(ŝ(z)) ≤ σ̃(ŝ(z′)) (with respect to the product order on 𝒵). In practice, the linear-logit fit already captures the dominant geometry, while the monotonicity constraints prevent spurious folds that would lead to wasted exploration.

At iteration t, given the current fit (ŵt, ĉt), we define the predicted surface $\widehat{\mathcal{S}}_t:=\{z\in\mathcal{Z}:\hat p_t(z)=\theta\}$ and restrict attention to a candidate set 𝒞t consisting of points in a thin tube around $\widehat{\mathcal{S}}_t$ (e.g. a discretized ρ-neighborhood intersected with 𝒵). We select the next query location by maximizing an acquisition score that favors both high uncertainty and proximity to the boundary,
zt ∈ arg maxz ∈ 𝒞t At(z) := width (CIt(z)) ⋅ (|p̂t(z) − θ| + λ)−1,
where λ > 0 prevents numerical blow-up. After choosing zt, we draw additional oracle samples at zt until either (i) zt becomes certified emergent/non-emergent, or (ii) the confidence interval radius falls below a prescribed local tolerance rt (smaller when zt lies closer to $\widehat{\mathcal{S}}_t$). We then refit the monotone logistic model using all collected data and repeat.
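The pieces can be assembled as in the following sketch; eval_oracle is the hypothetical Bernoulli oracle of Section 6, the candidate set is a fixed grid rather than a refined tube, and the crude radius stands in for the calibrated confidence sequences (a minimal sketch, not an optimized implementation).

```python
import numpy as np
from scipy.optimize import minimize

def fit_monotone_logistic(Z, k, n, l2=1e-3):
    """Constrained MLE of p(z) = sigma(w'z + c) with w >= 0.
    Z: (m,3) points; k, n: per-point successes and trials."""
    def nll(theta):
        w, c = theta[:3], theta[3]
        s = Z @ w + c
        # negative binomial log-likelihood plus a small ridge penalty
        return -(k * s - n * np.logaddexp(0, s)).sum() + l2 * (w @ w)
    res = minimize(nll, x0=np.array([0.1, 0.1, 0.1, 0.0]),
                   bounds=[(0, None)] * 3 + [(None, None)], method="L-BFGS-B")
    return res.x[:3], res.x[3]

def surface_search(grid, eval_oracle, theta, rounds=200, batch=50, lam=0.05):
    """Adaptive surface finding on a fixed grid of z = (L*, R, log b) points."""
    m = len(grid)
    k, n = np.zeros(m), np.zeros(m)
    for i in range(m):                       # small warm start everywhere
        for _ in range(5):
            k[i] += eval_oracle(grid[i]); n[i] += 1
    for _ in range(rounds):
        w, c = fit_monotone_logistic(grid, k, n)
        p_hat = 1.0 / (1.0 + np.exp(-(grid @ w + c)))
        r = np.sqrt(np.log(10 * m) / (2 * n))          # crude CI radius
        ambiguous = np.abs(p_hat - theta) < r          # unresolved near-boundary
        if not ambiguous.any():
            break
        score = 2 * r / (np.abs(p_hat - theta) + lam)  # acquisition rule
        i = int(np.argmax(np.where(ambiguous, score, -np.inf)))
        for _ in range(batch):
            k[i] += eval_oracle(grid[i]); n[i] += 1
    return w, c
```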

We stop once the boundary is locally resolved to the target geometric tolerance Δ. Operationally, we maintain a finite Δ-net 𝒩 of points in 𝒵 (or, more efficiently, a net on the current tube around $\widehat{\mathcal{S}}_t$). For each z ∈ 𝒩, we ask whether its classification (emergent vs. non-emergent) is determined by the confidence sequence, or else whether it lies at least Δ away (in the estimated normal direction) from the current boundary so that its label is irrelevant to Δ-accurate surface placement. Concretely, writing the estimated signed normal distance as
$$ \widehat{\mathrm{dist}}_{\perp,t}(z) := \frac{\hat w_t^{\top}z+\hat c_t}{\|\hat w_t\|_2}, $$
we continue sampling until all net points with $|\widehat{\mathrm{dist}}_{\perp,t}(z)|\le \Delta$ are certified (i.e. their CI lies entirely above or below θ). This enforces that the remaining ambiguity set is contained in a normal-direction tube of width 2Δ around the final estimate.

From the final normal vector ŵ = (α̂, β̂, γ̂) we report iso-capability rates by implicit differentiation of α̂L* + β̂R + γ̂log b = const. In particular, holding R fixed we estimate
$$ \frac{\partial L^*}{\partial \log b}\Big|_{\hat p=\theta} = -\frac{\hat\gamma}{\hat\alpha}, \qquad \frac{\partial R}{\partial \log b}\Big|_{\hat p=\theta} = -\frac{\hat\gamma}{\hat\beta}, $$
and similarly ∂L*/∂R = −β̂/α̂. If a deployment cost model C(L*, R, b) is supplied, we additionally solve minz ∈ 𝒵 C(z) subject to p̂(z) ≥ θ (with a safety margin obtained by replacing θ by θ + slack), yielding a compute-efficient operating point together with a certificate derived from the maintained confidence intervals.


8. Theory: Upper Bounds: sample complexity for estimating the emergence surface; dependence on slopes, noise, and desired accuracy; extension to multiple tasks with shared evaluations.

We quantify the evaluation cost required to estimate the emergence surface to geometric tolerance Δ under the generalized linear margin model. Throughout we work on a bounded feasible set 𝒵 ⊂ ℝ3 in coordinates z = (L*, R, log b) with diameter diam(𝒵) ≤ D, and we write the emergence threshold as θ := p0(𝒯) + ε ∈ (0, 1). Under the logistic specification, p(z) = σ(w⊤z + c) with w ≥ 0 and ∥w∥2 ≤ W, the true surface is
𝒮 := {z ∈ 𝒵 : p(z) = θ} = {z ∈ 𝒵 : w⊤z + c = logit (θ)}.
We measure geometric error in the normal direction by the signed distance to the hyperplane: for a candidate estimate (ŵ, ĉ) we consider the induced surface $\widehat{\mathcal{S}}=\{z:\hat w^\top z+\hat c=\operatorname{logit}(\theta)\}$ and seek $\sup_{z\in\mathcal{Z}}|\mathrm{dist}_\perp(z,\widehat{\mathcal{S}})-\mathrm{dist}_\perp(z,\mathcal{S})|\le \Delta$ with probability at least 1 − δ.

Near the boundary, perturbations of signed distance translate to perturbations of success probability at a rate controlled by the slope of the link function in the normal direction. Let n := w/∥w∥2 denote the unit normal. Along the normal line z(t) = z0 + tn passing through some z0 ∈ 𝒮 we have
$$ \frac{d}{dt}p(z(t)) = \sigma'(w^\top z(t)+c)\,\|w\|_2. $$
On the surface, w⊤z0 + c = logit (θ) and hence σ′(logit (θ)) = θ(1 − θ). Consequently, for small normal displacement t we obtain the local linearization
p(z0 + tn) = θ + θ(1 − θ)∥w∥2 t + O(t2).
This identity makes explicit the dependence on (i) the threshold location θ (tasks with θ very close to 0 or 1 have weaker slope), and (ii) the aggregate slope ∥w∥2 (surfaces with small normal coefficient are harder to localize geometrically). In particular, ensuring probability error at most τ at points within an O(Δ) tube around 𝒮 suffices to ensure geometric error O(τ/(θ(1 − θ)∥w∥2)).
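As a numerical illustration with made-up values: take θ = 0.35, ∥w∥2 = 2, and target tolerance Δ = 0.05. The linearization gives a required probability resolution of
$$ \tau \;\asymp\; \theta(1-\theta)\,\|w\|_2\,\Delta \;=\; 0.35\cdot 0.65\cdot 2\cdot 0.05 \;\approx\; 0.023, $$
so certifying a single near-boundary point at confidence δ = 0.05 costs on the order of τ−2log (1/δ) ≈ 5.8 × 103 verifier labels before constants; this is the per-point price that the adaptive strategy pays only inside the boundary tube.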

The adaptive procedure in Section 7 concentrates evaluations in a shrinking tube around the current estimate; formally, one may view it as repeatedly reducing uncertainty about the signed offset of the hyperplane in the direction of its normal, while using monotonicity to avoid exploring regions whose label is already certified. The following statement summarizes the standard rate.

Proposition (informal upper bound). Under the logistic specification with ∥w∥2 ≤ W, the adaptive procedure of Section 7 outputs a Δ-accurate surface estimate with probability at least 1 − δ after N = Õ(Δ−2 log(1/δ)) verifier-labeled evaluations, with constants depending on W and θ(1 − θ).

Proof sketch. We select a finite Δ-net of candidate locations in a tube around the current surface estimate; the net size contributes only logarithmic factors since 𝒵 is three-dimensional and bounded. At a fixed location z, certifying whether p(z) ≷ θ up to probability error τ requires O(τ−2log (1/δ)) samples by standard concentration (Hoeffding or empirical Bernstein). By the linearization above, a normal displacement of Δ induces a probability change of order θ(1 − θ)∥w∥2 Δ near the boundary; thus taking τ ≍ θ(1 − θ)∥w∥2 Δ suffices to rule out surfaces shifted by more than Δ. The adaptive strategy ensures that only points whose (estimated) signed distance is O(Δ) are sampled to this precision; points farther away are certified with fewer samples. Summing the resulting contributions yields N = Õ(Δ−2log (1/δ)) with constants depending on W and θ(1 − θ).

The Bernoulli model already captures worst-case label noise since Var(V) ≤ 1/4. If, additionally, the verifier provides a real-valued margin score s𝒯(x, y) with sub-Gaussian tails and 𝔼[s ∣ z] strictly increasing in the latent logit wz + c, then one may replace Bernoulli concentration by sub-Gaussian concentration, improving constants (and, in favorable regimes, the number of model calls if one can reuse a single model output to obtain multiple noisy margin observations). The exponent Δ−2 is not removed in general because geometric localization still reduces to estimating a mean shift of order Δ.

Suppose we have tasks 𝒯1, …, 𝒯K each obeying a logistic surface model, and we wish to estimate all surfaces to error Δ with overall failure probability at most δ. If the tasks are unrelated, a union bound yields an immediate allocation: run the single-task procedure with confidence δ/K per task, giving total labeled samples $\sum_{k=1}^K \tilde{O}(\Delta^{-2}\log(K/\delta))$ and allowing adaptive reallocation to tasks whose boundaries lie in harder-to-resolve regions (e.g. smaller ∥wk∥2). A more informative setting arises when tasks share a common normal direction, as in a hierarchical model
pk(z) = σ(w⊤z + ck),   k ∈ [K],
with a single unknown w and task-specific offsets ck. Then we may pool data across tasks to estimate w at rate governed by the total sample size, and subsequently estimate each ck with additional task-local samples. In this shared-normal model, the dominant cost of learning orientation is paid once, and the per-task incremental cost is essentially one-dimensional (estimating the intercept along the learned normal), yielding a strict reduction compared to learning wk separately whenever K is moderate and the surfaces are not nearly parallel to coordinate axes.
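A sketch of the pooled fit under the shared-normal model, extending the constrained logistic fit of the algorithm section with per-task intercepts (same hypothetical interface and caveats):

```python
import numpy as np
from scipy.optimize import minimize

def fit_shared_normal(Z, k, n, task_ids, K, l2=1e-3):
    """MLE of p_k(z) = sigma(w'z + c_k): one shared w >= 0, K offsets c_k.
    Z: (m,3) points; k, n: successes/trials; task_ids: (m,) ints in [0, K)."""
    def nll(theta):
        w, c = theta[:3], theta[3:]
        s = Z @ w + c[task_ids]              # task-specific intercepts
        return -(k * s - n * np.logaddexp(0, s)).sum() + l2 * (w @ w)
    res = minimize(nll, x0=np.full(3 + K, 0.1),
                   bounds=[(0, None)] * 3 + [(None, None)] * K,
                   method="L-BFGS-B")
    return res.x[:3], res.x[3:]
```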


9. Theory: Lower Bounds and Necessity of Near-Boundary Testing: information-theoretic reductions to Bernoulli mean testing / best-arm identification; matching Ω(Δ−2) lower bound; implications for predictability claims.

We complement the upper bounds by showing that the Δ−2 dependence is not an artifact of the particular adaptive strategy, but is information-theoretically unavoidable under the Bernoulli-verifier observation model. Moreover, we formalize the intuitive necessity of sampling in a neighborhood of the emergence surface: evaluations far from the transition region cannot, in the worst case, certify the boundary location at geometric resolution Δ.

Fix a bounded feasible set 𝒵 ⊂ ℝ3 in coordinates z = (L*, R, log b), and a success threshold θ ∈ (0, 1). Consider the logistic family
pc(z) = σ(w⊤z + c),   ∥w∥2 ≤ W,
with unknown offset c (equivalently, an unknown hyperplane intercept). The emergence surface at level θ is 𝒮c = {z : w⊤z + c = logit (θ)}.

We reduce boundary localization to distinguishing two close hypotheses. Fix any admissible w and choose offsets c0, c1 such that the induced surfaces 𝒮c0 and 𝒮c1 are separated by signed normal distance Δ on 𝒵. Since the signed distance between the hyperplanes w⊤z + c0 = logit (θ) and w⊤z + c1 = logit (θ) scales as |c1 − c0|/∥w∥2, it suffices to take
c1 − c0 = ∥w∥2 Δ.
Now consider any queried point z. Under c0 and c1, the Bernoulli means differ by
$$ |p_{c_1}(z)-p_{c_0}(z)| \;=\; \big|\sigma(w^\top z+c_1)-\sigma(w^\top z+c_0)\big| \;\le\; \sup_u \sigma'(u)\,|c_1-c_0| \;=\; \frac{1}{4}\|w\|_2\,\Delta, $$
using supu σ′(u) = 1/4. Consequently, even at the most informative margin location, the gap in Bernoulli means is at most a constant multiple of Δ. Distinguishing two Bernoulli distributions whose means differ by Θ(Δ) with error probability at most δ requires Ω(Δ−2log (1/δ)) samples by standard hypothesis-testing bounds (equivalently, KL divergence or Pinsker inequalities). Since any surface estimator with geometric error at most Δ must distinguish c0 from c1 with probability at least 1 − δ, the stated lower bound follows.
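For completeness, the Bernoulli testing step can be made quantitative by a standard change-of-measure argument (a sketch with unoptimized constants). With per-query mean gap κ := ∥w∥2Δ/4 and means bounded away from {0, 1},
$$ \mathrm{KL}\bigl(\mathrm{Bern}(p)\,\|\,\mathrm{Bern}(p+\kappa)\bigr)\;\le\;\frac{\kappa^{2}}{(p+\kappa)(1-p-\kappa)}\;=\;O(\kappa^{2}), $$
so any test with error probability at most δ must collect n samples satisfying n ⋅ O(κ2) ≥ log (1/(2.4δ)), i.e. n = Ω(Δ−2log (1/δ)).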

The preceding argument uses only the global bound σ′(u) ≤ 1/4; a sharper statement explains why ``testing far away'' can be dramatically less informative. If all queried points satisfy a margin condition |w⊤z + c0 − logit (θ)| ≥ m, then for the same two offsets as above,
|pc1(z) − pc0(z)| ≤ (sup|u − logit (θ)| ≥ m σ′(u)) ∥w∥2 Δ ≤ θm(1 − θm) ∥w∥2 Δ,
where θm := σ(logit (θ) + m) and θm(1 − θm) decays exponentially in m. Thus, if an algorithm avoids an m-tube around the surface, the per-sample information about the offset shrinks with m, and the number of samples required to achieve the same Δ grows proportionally to (sup|u − logit (θ)| ≥ m σ′(u))−2. In minimax terms, any guarantee that is uniform over instances forces the algorithm to eventually collect data from points whose predicted success probability is close to θ (equivalently, whose logit margin is O(1)), matching the operational prescription of near-boundary evaluation.

This lower bound rules out claims of cheap certification of emergence from sparse, off-boundary measurements: extrapolation from confidently-failing regions (or confidently-succeeding regions) cannot, in general, pin down where the transition occurs to geometric accuracy Δ. This is not merely a limitation of our modeling; it is an intrinsic limitation of black-box Bernoulli feedback.

A related necessity phenomenon appears in compute-optimal deployment selection. If we discretize the feasible box into candidate configurations (``arms’’) z1, …, zM with unknown means p(zi) and wish to identify the minimum-cost arm with p(zi) ≥ θ, we recover a thresholded best-arm identification problem. Standard lower bounds imply that arms with gap gapi := |θ − p(zi)| require Ω(gapi−2log (1/δ)) samples to classify reliably, and the arms that are near-optimal in cost tend also to be those with small gaps (hence dominating the sample complexity). Therefore, even when the end goal is not an explicit surface estimate but a compute-minimizing emergent configuration, the evaluation budget must concentrate on candidates near the boundary, and a Δ−2-type scaling reappears once we translate a desired tolerance in configuration-space to a gap in success probability.


10. Problem 2 — Compute-Optimal Capability Attainment (COCA): optimization over (L*, R, b) under a cost model; closed-form solutions under the GLM; robustness to model misspecification.

Having estimated an emergence surface for a fixed task 𝒯, we often wish not merely to certify that some configuration is emergent, but to attain emergence at minimum total compute. Formally, for a target success threshold θ := p0(𝒯) + ε and a cost model C(L*, R, b) combining (possibly exogenous) pretraining, post-training, and inference-time costs, COCA asks us to solve
min(L*, R, b) ∈ 𝒟C(L*, R, b)   s.t.   p𝒯(L*, R, b) ≥ θ,
where 𝒟 is a feasible box encoding available checkpoints, post-training limits, and inference budgets. Since p𝒯 is coordinate-wise nondecreasing in (L*, R, log b), the constraint is monotone: increasing R or b (or improving L*) preserves feasibility, while decreasing them risks violating the target.

Under the GLM margin model, COCA reduces to a tractable constrained optimization. Writing z = (L*, R, u) with u := log b, the condition p𝒯(L*, R, b) ≥ θ is equivalent to a single linear inequality in z:
α𝒯(L* − η𝒯) + β𝒯R + γ𝒯u + ξ𝒯 ≥ logit (θ),
i.e.,
α𝒯L* + β𝒯R + γ𝒯log b ≥ κ𝒯(ε)  for an appropriate constant κ𝒯(ε).
Thus the feasible emergent set is a half-space in (L*, R, log b) intersected with 𝒟. In particular, if C is convex in (L*, R, log b) (e.g. separable with convex components), COCA is a convex program with a unique optimum under mild strict convexity assumptions.

A useful special case, consistent with our later empirical protocol, is a separable cost model
C(L*, R, b) = cL(L*) + cR(R) + cbb,
where cL encodes the amortized cost of selecting a checkpoint (typically nondecreasing in L*, since larger L* corresponds to more pretraining compute), cR encodes the post-training budget (nondecreasing in R), and cbb encodes linear inference-time cost. Passing to u = log b yields objective cL(L*) + cR(R) + cbeu and a single linear constraint in (L*, R, u). At any optimum that is not pinned to a boundary of 𝒟, the emergence constraint is tight (otherwise we could decrease L*, R, or b to reduce cost while remaining feasible). The Karush–Kuhn–Tucker conditions then give a clear marginal-rate interpretation: letting λ ≥ 0 be the multiplier on the constraint, stationarity reads
cL′(L*) = λα𝒯,   cR′(R) = λβ𝒯,   cbeu = λγ𝒯,
with complementary slackness enforcing equality on the constraint. These equations equate the normalized marginal return contributed by each axis: we should spend compute on whichever of pretraining, post-training, or inference yields the largest increase in α𝒯L* + β𝒯R + γ𝒯u per marginal unit of cost, until these normalized marginal returns balance (or until a box constraint binds).

When cL and cR are approximately linear over the feasible range (a reasonable approximation when available checkpoints and RL strengths are coarse), the solution often becomes ``nearly corner-like’’: we either (i) push b up while keeping (L*, R) near their cheapest feasible values, (ii) increase R while keeping (L*, b) small, or (iii) move to a better checkpoint while keeping (R, b) minimal, depending on which axis most cheaply supplies the required margin. Concretely, if we solve the constraint for b,
$$ b \;=\; \exp\!\left(\frac{\kappa_{\mathcal{T}}(\varepsilon)-\alpha_{\mathcal{T}}L^*-\beta_{\mathcal{T}}R}{\gamma_{\mathcal{T}}}\right), $$
then COCA reduces to minimizing
$$ (L^*,R)\;\mapsto\; c_L(L^*)+c_R(R)+c_b\exp\!\left(\frac{\kappa_{\mathcal{T}}(\varepsilon)-\alpha_{\mathcal{T}}L^*-\beta_{\mathcal{T}}R}{\gamma_{\mathcal{T}}}\right), $$
which is convex when cL, cR are convex. The exponential term makes explicit the iso-capability tradeoff: improving L* or increasing R linearly in the margin reduces the required b multiplicatively, with rates governed by α𝒯/γ𝒯 and β𝒯/γ𝒯.
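A minimal sketch of the reduced two-variable program under linear component costs; all coefficients and box bounds are hypothetical placeholders. Since the exponential of an affine function is convex, the objective is convex on the box and a local solver suffices.

```python
import numpy as np
from scipy.optimize import minimize

alpha, beta, gamma, kappa = 2.0, 0.5, 0.8, 3.0   # hypothetical GLM parameters
c_L, c_R, c_b = 10.0, 1.0, 0.01                  # hypothetical unit costs

def total_cost(x):
    L, R = x
    # inference budget forced by the tight emergence constraint
    b = np.exp((kappa - alpha * L - beta * R) / gamma)
    return c_L * L + c_R * R + c_b * b

res = minimize(total_cost, x0=np.array([1.0, 1.0]),
               bounds=[(0.0, 2.0), (0.0, 5.0)])   # feasible box D
L_opt, R_opt = res.x
b_opt = np.exp((kappa - alpha * L_opt - beta * R_opt) / gamma)
print(L_opt, R_opt, b_opt, res.fun)
```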

Finally, COCA must be robust to misspecification and estimation error, since in practice we only have an estimated success function p̂𝒯 (or an estimated hyperplane in (L*, R, log b)) from finite verifier samples. A conservative approach is to optimize against a lower confidence bound: if we maintain a uniform high-probability guarantee |p̂(z) − p(z)| ≤ ρ on the feasible domain, then enforcing p̂(z) ≥ θ + ρ implies p(z) ≥ θ with probability at least 1 − δ (after union bounding over candidate solutions). Equivalently, if we estimate the surface in normal-direction error Δ, then near the boundary the induced probability error scales as O(Δ) through the derivative of the logistic link, and we can inflate the required margin by an amount calibrated to Δ to obtain feasibility. Under mild regularity (e.g. Lipschitz cost in (L*, R, log b)), this conservatism yields an explicit suboptimality bound: the additional cost incurred by robustification is at most proportional to the product of a cost Lipschitz constant and the geometric surface estimation error. In this sense, the same near-boundary evaluation that is necessary for surface localization also enables principled, compute-optimal deployment decisions with certified success guarantees.


11. Empirical Protocol (Optional but Strengthening): recommended experimental design with open checkpoints; controlled RL sweeps; standardized inference-time compute knobs; verifiable task suite; reporting conventions and ablations.

We outline an experimental design whose goal is to make the coordinates (L*, R, b) operational, to render the monotone surface hypothesis falsifiable, and to support cross-laboratory comparison. We assume access to a family of publicly available checkpoints, a reproducible post-training pipeline, and a verifier-equipped task suite.

First, we fix a calibration corpus (public, contamination-audited, and disjoint from all evaluation tasks) and define L* for each checkpoint by a standardized loss-like scalar, e.g. normalized token-level cross-entropy on the corpus under a fixed tokenizer and evaluation protocol. We then index checkpoints by L* rather than by parameter count or training step, and we publish the mapping checkpoint  ↦ L*, including variance from evaluation subsampling. This step makes the pretraining axis comparable across model families and mitigates confounding from architectural changes. When only discrete checkpoints are available, we treat L* as a discrete coordinate and restrict surface fitting accordingly (piecewise-linear in L* or logistic with checkpoint indicators), but we still report slopes in the continuous surrogate for interpretability.

Second, we instantiate the post-training axis R as a controlled sweep of a single pipeline with fixed data, reward model/verifier, and optimizer, varying only a scalar strength parameter. Concretely, we recommend parameterizing R by a monotone quantity that is stable across runs, such as total KL-constrained improvement (integrated KL penalty weight times steps), total number of policy-gradient steps times effective step size, or a validated reward improvement measured on a held-out preference set. For each checkpoint (or for a subset spanning the feasible L* range), we run multiple independent seeds for each R value and store the full training traces, enabling post hoc checks that capability is not spuriously nonmonotone in R due to optimization instability. We recommend at least one ablation in which R is varied while holding the reward model fixed versus jointly updating it, since the latter changes the semantics of the improvement axis.

Third, we standardize the inference-time compute knob b by adopting a reference inference procedure whose compute can be externally audited. Examples include: (i) best-of-n sampling (with b = n) under fixed temperature and prompt; (ii) verifier-guided selection among n candidates, counting verifier calls into b explicitly; (iii) bounded tree search with a fixed expansion policy and b equal to the number of node expansions; (iv) iterative refinement with b equal to the number of refinement rounds. We report both the semantic definition of b and a hardware-normalized proxy (e.g. total forward-pass tokens plus verifier tokens) to permit translation across implementations. Because our model uses log b, we include b values on a geometric grid (e.g. 1, 2, 4, 8, …) to stabilize slope estimation.

Fourth, we assemble a verifiable task suite defining several 𝒯 with clear p0(𝒯). We recommend mixing: (a) algorithmic tasks with exact verifiers (modular arithmetic, parity, bracket matching); (b) structured reasoning tasks with deterministic checkers (GSM-like math with canonicalization and numeric equivalence); (c) knowledge tasks with constrained formats and adjudication (multiple-choice slices such as MMLU/C-Eval subsets); and (d) transcription/transliteration tasks with string-distance based verifiers. For each 𝒯 we publish the instance generator or dataset split, the verifier code, and any normalization rules, and we test for verifier brittleness by adversarial formatting perturbations. We additionally recommend reporting a margin score s𝒯(x, y) when available (e.g. log-odds from a learned verifier) to reduce variance in near-boundary estimation, while still defining success by V𝒯 ∈ {0, 1}.

Fifth, we implement boundary estimation by combining (i) a coarse initial design over (L*, R, log b) and (ii) adaptive refinement near the predicted boundary. In practice, we choose a modest initial grid that spans the feasible box and then allocate additional evaluations according to an uncertainty-weighted boundary acquisition rule, maintaining sequential confidence intervals for p𝒯(L*, R, b). To reduce variance and improve comparability across z = (L*, R, log b), we either reuse a fixed instance set across queried points (paired evaluation) or use stratified sampling with shared random seeds; we then explicitly report which regime we used, since pairing changes the effective noise model. Stopping criteria are expressed in terms of (estimated) normal-direction boundary error Δ or, operationally, the maximum probability ambiguity in a narrow band around the fitted boundary.
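A minimal sketch of one uncertainty-weighted acquisition rule; the exact policy prescribed in Section 7 may differ, and the scoring heuristic below is one reasonable choice rather than the canonical one:

```python
import numpy as np

def acquire_next_point(cands: np.ndarray, w: np.ndarray, eta: float,
                       counts: np.ndarray) -> int:
    """Pick the next z = (L*, R, log b) to evaluate.

    cands : (m, 3) candidate points from the refinement grid
    w, eta: current fitted parameters of the logit model, logit p(z) = w.z - eta
    counts: evaluations already spent at each candidate

    Scores favor points whose predicted success is near 1/2 (close to the
    boundary) and that have been sampled little; returns the argmax index.
    """
    logits = cands @ w - eta
    p = 1.0 / (1.0 + np.exp(-logits))
    boundary_proximity = p * (1.0 - p)          # maximized at p = 1/2
    exploration = 1.0 / np.sqrt(1.0 + counts)   # shrinks with samples spent
    return int(np.argmax(boundary_proximity * exploration))
```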

Sixth, we specify reporting conventions. For each task we report: the fitted normal vector w = (α𝒯, β𝒯, γ𝒯) (up to scale) with confidence intervals; the estimated surface $\widehat{\mathcal{S}}_{\mathcal{T}}$ as level sets in 2D slices; iso-capability tradeoff rates such as ∂L*/∂log b at fixed R; and the realized success rates at the recommended compute-optimal point (if COCA is performed), including a held-out confirmation evaluation not used in fitting. We also publish all queried (L*, R, b) points, sample counts, random seeds, and the exact prompt templates.
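Under the affine-logit model, these tradeoff rates follow directly from the fitted coefficients; for example, along an iso-capability line at fixed R, ∂L*/∂log b = −γ𝒯/α𝒯. A small helper making the three pairwise rates explicit:

```python
def iso_capability_tradeoffs(alpha: float, beta: float, gamma: float) -> dict:
    """Tradeoff rates implied by a fitted hyperplane
    alpha*L* + beta*R + gamma*log(b) = eta (all coefficients positive under
    monotonicity). E.g. one doubling of b substitutes for (gamma/alpha)*ln 2
    units of calibrated pretraining progress on the boundary.
    """
    return {
        "dLstar_dlogb_at_fixed_R": -gamma / alpha,
        "dLstar_dR_at_fixed_b": -beta / alpha,
        "dR_dlogb_at_fixed_Lstar": -gamma / beta,
    }
```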

Finally, we recommend ablations that test the necessity of each modeling choice: replacing adaptive sampling with uniform grids at matched cost; removing monotonicity constraints in fitting; fitting without the log b transform; using only V𝒯 versus using (V𝒯, s𝒯); and transferring a surface fit from one inference procedure (e.g. best-of-n) to another at a matched token budget. These ablations help isolate whether observed emergence surfaces reflect genuine tradeoffs among (L*, R, b) rather than artifacts of evaluation design.


12. Discussion and Limitations: when the manifold breaks (distribution shift, changing verifiers, non-monotonicities); relation to task difficulty slicing; governance implications; future work (agents, harmful capability surfaces).

Our central modeling move is to treat "emergence" as a boundary of a monotone success function on the coordinate triple z = (L*, R, log b) and to approximate this boundary by a hyperplane in logit space. The utility of this move is that it converts a vague qualitative transition into a geometric object with tradeoff rates. Its limitations are exactly the conditions under which either (i) the map z ↦ p𝒯(z) ceases to be coordinate-wise nondecreasing, or (ii) the logit is not well-approximated by an affine functional of z on the feasible box. We therefore interpret fitted surfaces as local summaries, and we expect failure modes in regimes where the task distribution, the verifier, or the inference procedure induces discontinuities, multi-modality, or strategic behavior.
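For reference, the working model in the notation used throughout (normal vector w = (α𝒯, β𝒯, γ𝒯) up to scale, task threshold η𝒯) is

$$\operatorname{logit}\, p_{\mathcal{T}}(z) \;\approx\; \alpha_{\mathcal{T}} L^{*} + \beta_{\mathcal{T}} R + \gamma_{\mathcal{T}} \log b - \eta_{\mathcal{T}}, \qquad \mathcal{S}_{\mathcal{T}} = \{\, z = (L^{*}, R, \log b) : \langle w, z \rangle = \eta_{\mathcal{T}} \,\},$$

so failure mode (i) is a violation of coordinate-wise monotonicity of p𝒯, and failure mode (ii) is curvature of the logit in z.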

A first point of fragility is distribution shift in the task itself. All quantities are indexed by 𝒯, but in practice one often uses a dataset proxy $\widehat{\mathcal{T}}$ or a convenience slice whose support differs from the intended deployment distribution. Even when $p_{\widehat{\mathcal{T}}}(L^*,R,b)$ is well-behaved, p𝒯(L*, R, b) need not share the same normal direction, and the apparent tradeoff ∂L*/∂log b can change sign when 𝒯 places more mass on instances that are solved by different internal mechanisms (e.g. memorization versus algorithmic reasoning). The surface picture remains meaningful only to the extent that 𝒯 is stable under the operational perturbations one cares about (prompt templates, formatting, tool availability, adversarial inputs). Empirically, this suggests that surface estimation should be accompanied by stress tests that explicitly shift 𝒯 and report the induced surface displacement as an additional uncertainty, rather than reporting only statistical error at fixed 𝒯.
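One way to summarize such displacement, as a sketch; the two fits are assumed to come from the original and the perturbed task distribution, and the two summary statistics are diagnostics rather than formal guarantees:

```python
import numpy as np

def surface_displacement(w_ref: np.ndarray, eta_ref: float,
                         w_shift: np.ndarray, eta_shift: float) -> dict:
    """Compare two fitted hyperplanes in z = (L*, R, log b): the angle between
    unit normals (does the tradeoff direction move?) and the change in signed
    offset from the origin along each plane's own unit normal (does the
    boundary translate?).
    """
    u = w_ref / np.linalg.norm(w_ref)
    v = w_shift / np.linalg.norm(w_shift)
    angle = float(np.degrees(np.arccos(np.clip(u @ v, -1.0, 1.0))))
    offset_change = eta_shift / np.linalg.norm(w_shift) - eta_ref / np.linalg.norm(w_ref)
    return {"normal_angle_deg": angle, "offset_change": float(offset_change)}
```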

A second limitation concerns verifier drift. Our definition of success depends on V𝒯(x, y), and any change in verifier semantics changes the task itself. This is benign for exact checkers (e.g. arithmetic) but problematic for learned verifiers or rubric-based graders that can drift over time, be sensitive to formatting, or be gamed by the model. In such cases p𝒯 is not merely a property of the policy; it is a property of the policy–verifier interaction, and increasing R may increase "success" by exploiting verifier loopholes rather than improving the underlying capability. Formally, the monotonicity of p𝒯 in R can fail even if the underlying capability is monotone, because the success event is ill-posed. A practical mitigation is to treat the verifier as part of the experimental object: version it, audit it for brittleness, and, when possible, report both V𝒯 and a more conservative cross-verification (e.g. two independent checkers, or a human spot-check) on a held-out subset to bound the extent of reward hacking.
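A minimal sketch of the cross-verification mitigation; `verifiers` is an assumed list of independent checkers with the primary verifier first:

```python
def cross_verified_success(verifiers, prompt, response) -> dict:
    """Report the primary verdict alongside the conjunction of independent
    checkers on a held-out subset. The gap between the primary rate and the
    conservative rate upper-bounds how much apparent success could come from
    exploiting loopholes in any single verifier.
    """
    verdicts = [v(prompt, response) for v in verifiers]
    return {
        "primary": verdicts[0],
        "conservative": all(verdicts),
        "disagreement": len(set(verdicts)) > 1,
    }
```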

Third, non-monotonicities can arise from the knobs themselves. Increasing b can reduce success when the inference procedure introduces additional opportunities for error (e.g. longer chains that amplify compounding mistakes, or search heuristics that overfit to spurious verifier signals). Increasing R can degrade performance due to overoptimization, loss of generality, or mode collapse, especially when the post-training objective is misaligned with 𝒯. Even L*, as an empirical scalar derived from a calibration corpus, can be non-monotone with respect to a downstream task if the calibration corpus correlates poorly with the features relevant for 𝒯. Our algorithmic framework can detect such violations via systematic isotonicity residuals and held-out checks, but it cannot "repair" them without changing the model class (e.g. switching from a single hyperplane to a piecewise-monotone or nonparametric surface). When substantial non-monotonicity is present, the appropriate output is not a surface estimate but an explicit certificate that the monotone-manifold hypothesis is falsified on the tested domain.
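A sketch of the isotonicity-residual check along one knob, holding the other two fixed; the threshold for "substantial" would be set against the sampling noise of the success estimates:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def isotonicity_violation(axis_vals: np.ndarray, success_rates: np.ndarray) -> float:
    """Fit the best coordinate-wise nondecreasing curve to empirical success
    rates along a single slice (e.g. varying log b at fixed L*, R) and return
    the mean absolute residual. Values large relative to sampling noise are
    evidence that the monotone-manifold hypothesis fails on this slice.
    """
    iso = IsotonicRegression(increasing=True)
    fitted = iso.fit_transform(axis_vals, success_rates)
    return float(np.mean(np.abs(success_rates - fitted)))
```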

We also emphasize the relation to task difficulty slicing. Many datasets are mixtures over latent difficulty levels; as a consequence, p𝒯(z) can be a mixture of sigmoids with different thresholds and slopes, producing curvature or apparent "kinks" in the fitted boundary. One can partially linearize this by conditioning on an estimated difficulty score $d(x)$ and studying surfaces for $\mathcal{T}\mid d(x)\in I$ (difficulty strata), or by fitting a hierarchical model in which $\eta_{\mathcal{T}}$ is instance-dependent. This connects our geometric view to item-response theory: the emergence surface for a stratum behaves like a separatrix for items of comparable discrimination, while the aggregate surface conflates strata and may exaggerate sharp transitions. Thus, when interpreting an "emergent" region, we recommend reporting how much of the effect persists under difficulty-controlled evaluation.
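A synthetic one-dimensional illustration of this mixture effect; the thresholds, slope, and 50/50 mixing weight are arbitrary choices for the demonstration:

```python
import numpy as np

def stratum_and_aggregate_curves(z: np.ndarray):
    """Two difficulty strata, each a clean sigmoid in a 1-D progress
    coordinate (thresholds 1.0 and 4.0, equal slopes), mixed 50/50. Each
    stratum looks like a sharp separatrix, but the aggregate curve has a
    shoulder that a single-hyperplane fit would misread as boundary curvature
    or an exaggerated transition.
    """
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    easy = sigmoid(4.0 * (z - 1.0))
    hard = sigmoid(4.0 * (z - 4.0))
    return easy, hard, 0.5 * easy + 0.5 * hard
```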

Governance implications are immediate. A three-axis surface replaces a single "model capability" number with a map of how capabilities can be purchased with training, post-training, and inference-time compute. For safety-relevant tasks, this implies that restricting one axis (e.g. limiting deployment-time b) may not control risk if the same surface can be crossed by increasing R or selecting a more advanced checkpoint (smaller loss, larger L*). Conversely, transparency about tradeoffs can support proportional regulation: one can define risk thresholds in terms of p𝒯(L*, R, b) and require reporting of the minimal-cost point achieving them under an auditable cost model C. However, publishing surfaces for harmful tasks may itself lower the barrier to attainment by revealing the cheapest route to high success. We therefore distinguish between (i) publishing the methodology and benign-task validation, and (ii) controlled disclosure of harmful-capability surfaces, potentially via secure evaluation and limited-access reporting.

Finally, we list directions where the manifold picture must be generalized. Agentic settings introduce state, memory, and tool use; success then depends on trajectories, and b may control search depth, tool calls, or time, yielding nontrivial feedback between inference compute and environment responses. Harmful capability surfaces (e.g. biosecurity, cyber exploitation) require adversarial task distributions, robust verifiers, and careful accounting for human-in-the-loop and tool availability as additional coordinates beyond (L*, R, b). Methodologically, we anticipate extending from a single hyperplane to (a) locally linear surfaces with curvature penalties, (b) monotone Gaussian-process or spline models over z, and (c) multi-task coupling that shares information across related 𝒯 while allowing task-specific thresholds η𝒯. The core question remains geometric: which directions in the space of training and deployment choices move us most efficiently across a verifiable success boundary, and under what conditions does that boundary remain stable enough to be governed?