โ† Back

Loss That Transfers: Calibrated Progress Metrics and Tight Bounds for Cross-Tokenizer Emergence Forecasting

Table of Contents

  1. Introduction: emergence, loss-threshold perspective (Du et al.), and the comparability crisis across tokenizers/corpora; statement of contributions and what is provably possible/impossible.
  2. Related Work: emergent abilities (Wei et al.), metric artifacts debate, loss-based thresholds (Du et al.), normalized perplexity/bpb ideas, and why cross-family forecasting remains unresolved.
  3. Problem Setup and Definitions: calibration corpus 𝒞, model families, tokenizers, downstream tasks, random baselines, and the target forecasting task (threshold crossing).
  4. Calibrated Progress Metric L*: construction, variants (additive vs multiplicative normalization; domain-sliced calibration), and design principles (public, stable, hard-to-game).
  5. Algorithms: computing CE𝒞T(θ), building unigram baseline UT, fitting monotone performance curves and estimating thresholds with uncertainty.
  6. Theory I — Latent Affine Distortion Model: formal assumptions, estimator for latent progress using L*, and finite-sample error bounds for threshold prediction; matching lower bounds for raw-loss/compute predictors.
  7. Theory II — Tokenization Intractability: #P-hardness of exact segmentation-invariant likelihood for autoregressive models; implications and why L* is the right surrogate under resource constraints.
  8. Experimental Protocol (for strengthening claims): datasets, model families, checkpoints, tasks (multiple-choice + verifier-based), and evaluation metrics (continuous + discontinuous); transfer tests (leave-one-family-out).
  9. Stress Tests and Failure Modes: dataset shift, calibration corpus mismatch, tokenization perturbations, adversarial/gaming analysis, and when L* should not be expected to work.
  10. Discussion and Governance Use-Cases: portable early-warning reporting; how L* interacts with RL post-training and inference-time compute (boundaries of applicability); limitations and future work.

Content

1. Introduction: emergence, loss-threshold perspective (Du et al.), and the comparability crisis across tokenizers/corpora; statement of contributions and what is provably possible/impossible.

A recurring empirical observation in modern large language models is that certain downstream competencies appear to emerge abruptly as training progresses: for a range of checkpoints the measured performance on a task remains near a random or trivial baseline, and then, within a comparatively small increment of training, performance becomes reliably nontrivial. This qualitative pattern has been reported across many tasks and model families and has motivated a threshold-based view of capability onset. In particular, loss-threshold perspectives such as that of Du et al. suggest that, within a fixed family and evaluation protocol, the onset of a capability can often be forecast from a scalar progress proxy (typically training loss or perplexity on a reference distribution), by positing that a task becomes solvable once the proxy crosses a task-dependent threshold.

The difficulty addressed in the present work is that forecasting is not well-posed under naรฏve proxies. When models differ in tokenizer, pretraining corpus, or even byte-level preprocessing conventions, the numerical scale of cross-entropy on a fixed string corpus is not directly comparable: the same underlying string distribution may induce substantially different token-length normalizations and different baseline entropies. Consequently, two checkpoints with essentially identical latent modeling ability may exhibit losses that differ by an additive constant (and sometimes a multiplicative distortion) that is determined primarily by tokenization and corpus mismatch rather than by learned structure. This comparability crisis is not merely cosmetic; it undermines any attempt to provide portable early-warning reports such as ``task t will exceed baseline once calibration loss reaches โ„“โ€™โ€™ unless โ„“ is defined in a tokenizer-robust manner.

We formalize this issue by fixing a public calibration corpus ๐’ž with canonical byte-level representation and considering a collection of checkpoints ฮธ drawn from heterogeneous families. Each checkpoint induces an autoregressive distribution over a tokenizer-specific vocabulary, hence yields an average per-token cross-entropy on ๐’ž under its own tokenizer. The central observation is that per-token cross-entropy decomposes into (i) a term reflecting model-specific predictive structure and (ii) a tokenizer-dependent offset reflecting the entropy of the token stream itself. Without correcting for (ii), a raw loss threshold transports poorly across tokenizers: the same underlying character-level uncertainty may correspond to different token-level entropies, and the induced offset does not vanish even as model size grows.

Our point of departure is that, while tokenization-invariant string likelihood would be a natural remedy, it is computationally intractable in the relevant setting. Given an autoregressive model over tokens and a deterministic decode map, the segmentation-invariant likelihood of a string is the sum of probabilities of all token sequences decoding to that string. Exact computation of this marginal requires summing over a tokenization lattice whose size may be polynomial but whose weighted-path sum, because path weights depend on the entire token prefix, is in general #P-hard. Thus, if one insists on exact tokenization invariance at the level of likelihood, no efficient black-box evaluation procedure is available for large-scale calibration.

We therefore seek an efficiently computable surrogate that removes the dominant tokenizer-induced distortion while preserving monotone information about modeling progress. The proposed surrogate is a calibrated cross-entropy obtained by subtracting a tokenizer-specific unigram baseline computed on the same calibration corpus. Concretely, for each tokenizer T we estimate the maximum-likelihood unigram model UT from token counts on T(𝒞), and define the calibrated score L*(θ) as the difference between the model cross-entropy on 𝒞 and the unigram cross-entropy on 𝒞 under the same tokenizer. The subtraction is designed to cancel the leading-order offset attributable to token inventory and segmentation frequency, yielding a quantity that is comparably scaled across tokenizers up to an affine distortion and noise.

The technical claims of the paper are organized around two questions: (Q1) what can be recovered about cross-family progress from black-box next-token probabilities on a fixed public corpus, and (Q2) under what conditions can we forecast the onset of downstream performance above baseline from such a progress measure. For (Q1), we posit a latent variable z(θ) capturing tokenizer-invariant modeling progress and assume that calibration cross-entropy is an affine function of z(θ) with tokenizer-dependent parameters and sub-Gaussian noise. Under a mild additional condition that the unigram baseline approximates the tokenizer-induced offset up to a bounded tokenizer-dependent remainder, the calibrated metric L*(θ) becomes an affine proxy for z(θ) with substantially reduced cross-tokenizer variance. For (Q2), we assume each task performance Pt(θ) is a monotone Lipschitz function of z(θ) plus measurement noise; then, given a modest number of labeled checkpoints, we fit a monotone curve in L*-space and define an estimated emergence threshold as the largest L* at which predicted performance exceeds a baseline margin (performance improves as L* decreases).

Our contributions are the following. First, we define the calibration procedure and metric L*, together with an explicit evaluation protocol requiring only teacher-forced loss computation on a public corpus and token counting for the unigram baseline. Second, we provide finite-sample guarantees for threshold estimation when performance is modeled as a monotone function of latent progress, yielding an error bound that scales with the noise level and the number of labeled checkpoints rather than with tokenizer offsets. Third, we complement the upper bound with a lower bound showing that any method that relies solely on raw loss (or on compute/parameter count) cannot be uniformly cross-tokenizer consistent when tokenizer-dependent offsets are unknown; unavoidable error persists proportional to offset mismatch. Fourth, we clarify the computational boundary by formalizing the #P-hardness of exact segmentation-marginal likelihood, which motivates the use of calibration-based surrogates rather than exact tokenization invariants.

The net effect is a separation between what is feasible and what is not: while exact invariance to tokenization at the level of string likelihood is computationally out of reach, simple calibration against a tokenizer-specific unigram model suffices, under transparent assumptions, to support cross-family forecasting of emergence thresholds with controlled error. The remainder of the paper elaborates these points, and the subsequent section situates the approach within prior work on emergent abilities, loss-based thresholds, and normalization schemes.


2. Related Work: emergent abilities (Wei et al.), metric artifacts debate, loss-based thresholds (Du et al.), normalized perplexity/bpb ideas, and why cross-family forecasting remains unresolved.

The empirical literature on emergent abilities in large language models is now substantial. A prominent line of work, exemplified by Wei et al., reports qualitative phase-transition-like behavior: as scale or training progresses, performance on certain benchmarks remains near a baseline and then rises sharply over a comparatively narrow range of checkpoints. Subsequent studies broaden this catalog across reasoning, instruction following, and tool use, and propose different operational definitions of ``emergence'' (e.g., threshold crossing versus changes in curvature of a scaling curve). For our purposes, the key commonality is methodological: emergence is typically diagnosed by plotting an evaluation metric against a scalar proxy for progress (often compute, parameter count, or a loss-like quantity) and identifying a rapid transition region.

A parallel thread debates to what extent apparent emergence is an artifact of measurement, metric choice, or plotting conventions rather than a property of the underlying competence. Several authors argue that discontinuities can be induced by saturating metrics, floor effects, or transformations (e.g., log scales), and that many ``emergentโ€™โ€™ curves become smooth when examined in a more appropriate coordinate system or with uncertainty properly accounted for; conversely, others emphasize that thresholded downstream success criteria (such as exact match) naturally yield sharp transitions even if an underlying latent ability improves smoothly. While we do not attempt to adjudicate this debate in general, it motivates a conservative stance: any cross-model forecasting method should be explicit about (i) the proxy used for progress, (ii) the baselines defining nontriviality, and (iii) the distortions introduced by representation choices such as tokenization. In particular, even if task performance as a function of latent capability is smooth, mis-calibrated progress proxies can create or destroy the appearance of threshold behavior across heterogeneous model families.

Loss-based thresholding has been proposed as a principled alternative to compute-based heuristics. A representative example is Du et al., which argues that, within a controlled setting, a fixed loss threshold on a reference distribution can predict when a capability becomes available, and that loss is a more direct measure of learned structure than parameter count or training FLOPs. This perspective aligns with classical information-theoretic intuitions: cross-entropy is an operational measure of predictive uncertainty, and improvements in loss can be compared across checkpoints even when training schedules differ. However, existing demonstrations are largely within-family (or within a small set of closely matched tokenizers and preprocessing pipelines), where the mapping from training to evaluation is relatively stable. The unresolved difficulty is that loss values are only directly comparable when the underlying token stream and normalization are comparable, an assumption that fails for cross-family comparisons involving different tokenizers or byte-level conventions.

Normalization schemes for language modeling metrics have a longer history, particularly in the evaluation of compression and character-level modeling. Bits-per-character (bpc) and bits-per-byte (bpb) provide tokenization-agnostic units by measuring negative log-likelihood per character/byte under a canonical encoding; such measures are meaningful when the model directly defines a distribution over characters/bytes or when one can efficiently evaluate the probability of the canonical sequence. Related ideas appear in byte-level tokenizers and in comparisons between word-based and subword-based models, where perplexity is known to be sensitive to segmentation length. In practice, however, most contemporary LLMs expose only next-token probabilities over a tokenizer-specific vocabulary, and are evaluated in nats/bits per token under that tokenizer. Converting these token-level probabilities to a true byte-level likelihood generally requires marginalizing over all tokenizations consistent with a byte string, which is computationally prohibitive in the black-box autoregressive setting. Thus, while bpb is an attractive conceptual target, it is not directly available for arbitrary token-based checkpoints without additional structural access.

Several works propose heuristic adjustments intended to improve cross-model comparability of loss or perplexity, including length normalization choices, mixture-of-tokenizers evaluation, and calibrations against simple baselines (e.g., uniform or unigram models) to factor out vocabulary and segmentation effects. The underlying intuition is that raw per-token cross-entropy conflates two components: (i) entropy intrinsic to the token stream induced by a tokenizer on the evaluation corpus, and (ii) the model-dependent reduction in uncertainty due to learned higher-order structure. This is closely related to classical decompositions in language modeling between unigram entropy and higher-order gains, and to the observation that tokenization can shift entropy by changing the granularity at which uncertainty is measured. Nonetheless, prior normalization proposals are often presented as empirical recipes rather than as components of an explicit identification problem: namely, how to recover a tokenizer-invariant notion of progress from token-level losses when the evaluation interface precludes exact tokenization marginalization.

The gap we identify, and which motivates our formulation, is therefore not the absence of progress proxies, but the absence of a proxy with a clear invariance target and an accompanying forecasting protocol. Existing emergence and threshold papers either (a) fix a single family/tokenizer and thereby avoid comparability issues, or (b) compare across models but implicitly assume that raw loss/perplexity is commensurate across tokenizers, an assumption contradicted by well-known segmentation effects. Conversely, tokenization-invariant evaluations (character/byte likelihood) are conceptually clean but operationally unavailable for many checkpoints. What remains unresolved is how to obtain, from black-box next-token probabilities alone, a progress measure that reduces tokenizer-dependent offsets sufficiently to support portable threshold forecasts across heterogeneous families, without requiring retraining, access to internal representations, or intractable marginalization.


3. Problem Setup and Definitions: calibration corpus ๐’ž, model families, tokenizers, downstream tasks, random baselines, and the target forecasting task (threshold crossing).

We formalize a setting in which we wish to forecast the onset of above-baseline downstream task performance across heterogeneous model families, using only black-box access to next-token log-probabilities on a fixed public text corpus. The central complication is that the natural loss observable for a checkpoint is measured in nats per token under that checkpoint's tokenizer, and tokenization varies substantially across families; thus, raw loss values are not directly comparable without additional structure.

Let โ„ฑ denote a set of model families. A family fโ€„โˆˆโ€„โ„ฑ is characterized by (i) a tokenizer Tf with vocabulary VTf, (ii) a pretraining procedure (including corpus and optimization schedule), and (iii) a sequence of publicly available checkpoints indexed by training progress.
A checkpoint is denoted by θ, and we write T(θ) for its tokenizer when the dependence matters.
Each checkpoint defines an autoregressive distribution over token sequences,
$$ p_\theta(t_{1:k}) \;=\; \prod_{j=1}^k p_\theta(t_j \mid t_{<j}), \qquad t_j\in V_{T(\theta)}. $$
We assume only the standard inference interface: given a prefix (in tokens) we can query the conditional log-probabilities logโ€†pฮธ(โ‹…โ€…โˆฃโ€…prefix) over VT(ฮธ). No gradient access, retraining, or internal-state inspection is assumed.

We fix a public calibration corpus ๐’ž consisting of canonical text strings with a fixed preprocessing pipeline. Concretely, ๐’ž is a multiset (or distribution) over UTF-8 byte sequences, and all normalization decisions (newlines, whitespace, Unicode normalization, document boundaries) are specified once and held constant. This canonicalization is part of the definition of ๐’ž and is shared across all evaluations.
We emphasize that 𝒞 serves as a fixed, shared reference for measuring progress; it is not tuned per family and does not depend on any particular downstream task.

Given a tokenizer T, each string xโ€„โˆˆโ€„๐’ž deterministically tokenizes to a token sequence T(x)โ€„=โ€„(t1,โ€†โ€ฆ,โ€†t|T(x)|).
Using teacher forcing, we define the per-token cross-entropy of checkpoint ฮธ on ๐’ž under tokenizer T as
$$ \mathrm{CE}_{\mathcal{C}}^{T}(\theta) \;:=\; \frac{\mathbb{E}_{x\sim \mathcal{C}}\big[-\log p_\theta\big(T(x)\big)\big]}{\mathbb{E}_{x\sim \mathcal{C}}\big[|T(x)|\big]} \;=\; \frac{\mathbb{E}_{x\sim \mathcal{C}}\big[-\sum_{j=1}^{|T(x)|}\log p_\theta(t_j\mid t_{<j})\big]}{\mathbb{E}_{x\sim \mathcal{C}}\big[|T(x)|\big]}. $$
In practice we estimate CE๐’žT(ฮธ) from n samples of ๐’ž (or from a fixed finite list), by streaming through each T(x) and aggregating token-level losses. We take the normalization by expected token length to emphasize that the observable is inherently tokenizer-dependent: even for the same underlying byte sequence, the induced token stream and its entropy vary with T.

Let ๐’ฏ be a fixed suite of downstream tasks. Each task tโ€„โˆˆโ€„๐’ฏ specifies (i) a data distribution Dt over evaluation instances, (ii) an evaluation procedure (prompting format, decoding rule, and scoring), and (iii) a scalar metric mt mapping model outputs to [0,โ€†1] (e.g., accuracy, exact match, or verifier-based success probability).
We denote by Pt(ฮธ)โ€„โˆˆโ€„[0,โ€†1] the performance of checkpoint ฮธ under this procedure, i.e.,
Pt(ฮธ)โ€…:=โ€…๐”ผ(x,โ€†y)โ€„โˆผโ€„Dt[mt(ฮธ;โ€†x,โ€†y)],
with the understanding that the expectation may include randomness from decoding (if any) and from the metric (e.g., stochastic verifier). Each task also comes with a random baseline bt, defined by the expected score of a task-specific randomized strategy (e.g., uniform guessing among choices, sampling an answer from a prescribed distribution, or returning an empty string when appropriate). The baseline bt fixes what we mean by ``nontrivial'' performance on task t.

Fix a tolerance parameter ฯตโ€„>โ€„0. Our primary forecasting question is: for a given checkpoint ฮธ and task t, can we predict whether ฮธ exceeds baseline by margin ฯต, i.e.,
๐Ÿ™{Pt(ฮธ)โ€„>โ€„btโ€…+โ€…ฯต}?
Equivalently, we may wish to estimate the emergence threshold for each task: the value of a progress coordinate at which task performance reliably rises above baseline (up to noise). The key requirement is portability: the predictor should remain accurate when applied to families with different tokenizers and pretraining corpora, using only the shared calibration corpus 𝒞 and the standardized task evaluations.

We assume access to a set of checkpoints {ฮธi} across families, and we can compute (i) calibration losses CE๐’žT(ฮธi)(ฮธi) via teacher forcing on ๐’ž, and (ii) task performances Pt(ฮธi) for some subset of tasks/checkpoints (possibly limited by evaluation cost).
A typical use case is that task evaluations are available for a sparse set of checkpoints, while calibration losses on ๐’ž are cheap enough to compute broadly.
We evaluate forecasting methods by leave-one-family-out procedures: fit any task-specific mapping from the available labeled pairs (progress proxy(θi), Pt(θi)) using all but one family, then test threshold-crossing prediction on the held-out family. This protocol directly measures robustness to tokenizer and corpus shifts at the family boundary.

The above definitions make explicit what is and is not observable: the interface yields token-level likelihoods under each modelโ€™s tokenizer, but does not provide a tokenizer-invariant likelihood of the underlying byte string.
Consequently, the same ๐’ž induces different token streams T(๐’ž) across families, and the mapping ฮธโ€„โ†ฆโ€„CE๐’žT(ฮธ)(ฮธ) mixes (a) model progress and (b) representation-dependent offsets.
The remainder of the paper proposes a calibrated progress proxy that factors out the dominant tokenizer-dependent component on ๐’ž and yields a common coordinate in which task thresholds can be estimated and transferred across families.


4. Calibrated Progress Metric L*: construction, variants (additive vs multiplicative normalization; domain-sliced calibration), and design principles (public, stable, hard-to-game).

The observable CE๐’žT(ฮธ) conflates two effects: (i) the extent to which ฮธ captures multi-token structure of strings drawn from ๐’ž, and (ii) tokenizer-induced changes in the marginal entropy of the induced token stream. We therefore introduce a calibrated proxy that removes a large, model-independent portion of the tokenizer dependence by subtracting a unigram baseline computed on the tokenization of the canonical corpus.

Fix a tokenizer T and consider the multiset of tokens obtained by applying T to ๐’ž. Let pฬ‚T denote the empirical unigram distribution on VT induced by token counts on T(๐’ž), i.e.,
$$ \widehat{p}_T(v) \;=\; \frac{c_T(v)}{\sum_{u\in V_T} c_T(u)} \qquad (v\in V_T), $$
and define UT to be the maximum-likelihood unigram model UT(v)โ€„=โ€„pฬ‚T(v). The corresponding cross-entropy on ๐’ž is
$$ \mathrm{CE}_{\mathcal{C}}^{T}(U_T) \;=\; \frac{\mathbb{E}_{x\sim\mathcal{C}}\big[-\sum_{j=1}^{|T(x)|}\log U_T(t_j)\big]}{\mathbb{E}_{x\sim\mathcal{C}}[|T(x)|]}. $$
Since UT ignores context, CE๐’žT(UT) is dominated by the marginal token entropy induced by T on ๐’ž, and thus captures the leading tokenizer-dependent offset that makes raw losses incomparable.

For a checkpoint ฮธ with tokenizer Tโ€„=โ€„T(ฮธ) we define
L*(ฮธ)โ€…:=โ€…CE๐’žT(ฮธ)โ€…โˆ’โ€…CE๐’žT(UT).
Equivalently, by expanding both terms and regrouping, we may write L*(ฮธ) as an average log-likelihood ratio against the unigram baseline:
$$ L^*(\theta) = \frac{\mathbb{E}_{x\sim\mathcal{C}}\Big[\sum_{j=1}^{|T(x)|}\log \frac{U_T(t_j)}{p_\theta(t_j\mid t_{<j})}\Big]}{\mathbb{E}_{x\sim\mathcal{C}}[|T(x)|]}. $$
Thus L*(θ) ≤ 0 for any θ that improves upon the unigram predictor on average, and ``more negative'' means ``more structure captured beyond bag-of-tokens statistics'' on the calibration distribution. The crucial point is that the subtraction is performed within a tokenizer: UT changes with T, but is shared across all checkpoints that use T, so L* removes the dominant translation term that depends on how T segments 𝒞.

The additive normalization above is the simplest and most stable: it preserves units (nats per token) and directly cancels an offset term in the affine distortion model. One may also consider multiplicative variants such as
$$ L^{\mathrm{rel}}(\theta) \;:=\; \frac{\mathrm{CE}_{\mathcal{C}}^{T}(\theta)}{\mathrm{CE}_{\mathcal{C}}^{T}(U_T)}, \qquad\text{or}\qquad L^{\mathrm{gain}}(\theta) \;:=\; \frac{\mathrm{CE}_{\mathcal{C}}^{T}(U_T)-\mathrm{CE}_{\mathcal{C}}^{T}(\theta)}{\mathrm{CE}_{\mathcal{C}}^{T}(U_T)}. $$
These ratios partially adjust for scale differences across tokenizers (e.g., token vocabularies that induce systematically higher marginal entropies), at the cost of amplifying estimation noise when CE𝒞T(UT) is small and, more importantly, of no longer exactly matching an ``unknown additive offset'' model. In our setting, where the leading non-comparability arises from tokenization-dependent offsets rather than slopes, we treat the additive L* as the default; multiplicative normalizations may be used as sensitivity checks when families exhibit evidence of heterogeneous slopes aT in the affine model.

A second normalization axis concerns the denominator. Our definition measures per-token improvement. If desired, one can convert to a per-byte (or per-character) scale by multiplying by the expected tokens-per-byte factor ๐”ผ[|T(x)|]/๐”ผ[bytes(x)] on ๐’ž. We do not rely on such conversions for the theorems, but note that they can improve interpretability when comparing tokenizers with very different average token lengths.

A single corpus ๐’ž may contain heterogeneous regimes (natural language, code, tables, mathematics), and tasks often depend disproportionately on a particular regime. To accommodate this, we may partition ๐’ž into slices {๐’žs}sโ€„โˆˆโ€„S using a public, deterministic classifier (e.g., file-type metadata or regex-based heuristics on the canonical byte strings). For each slice we define
Ls*(ฮธ)โ€…:=โ€…CE๐’žsT(ฮธ)โ€…โˆ’โ€…CE๐’žsT(UT,โ€†s),
where UT,โ€†s is the unigram model estimated from T(๐’žs). This yields a vector L*(ฮธ)โ€„=โ€„(Ls*(ฮธ))sโ€„โˆˆโ€„S that can be mapped to task performance via a monotone model in a low-dimensional space (e.g., an isotonic fit on a weighted sum โˆ‘swsLs* with wsโ€„โ‰ฅโ€„0). Slice-wise calibration has two roles: it reduces confounding when progress is uneven across domains, and it provides diagnostics for why a task threshold transfers (or fails to transfer) across families.
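Below is a minimal sketch of the slice-wise fit (the two-stage procedure, the nonnegative-least-squares step, and all names are illustrative choices, not part of the protocol):

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.isotonic import IsotonicRegression

def fit_sliced_progress(L_slices, perf):
    """Map slice-wise calibrated losses to task performance.

    L_slices: (m, |S|) array of L*_s(theta_i) per checkpoint and slice.
    perf:     (m,) array of task performances in [0, 1].
    Two-stage illustration: nonnegative least squares for the slice
    weights w_s >= 0, then isotonic regression on the combined scalar.
    """
    L_slices, perf = np.asarray(L_slices, float), np.asarray(perf, float)
    # L* <= 0 and more negative means more progress, so regress
    # performance on -L_slices to keep the fitted weights nonnegative.
    w, _ = nnls(-L_slices, perf)
    score = -L_slices @ w                      # scalar progress coordinate
    iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
    iso.fit(score, perf)
    return w, iso
```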

Because ๐’ž is public, we cannot appeal to secrecy to prevent overfitting; instead we require that the metric be stable and difficult to game in any targeted way that would generalize poorly. We therefore adopt the following principles. (i) ๐’ž and its canonicalization are versioned and fully specified, so that independent parties compute identical L*. (ii) subtracting CE๐’žT(UT) makes L* insensitive to large changes in marginal token entropy induced by segmentation choices, thereby discouraging ``tokenizer engineeringโ€™โ€™ as a way to fabricate apparent progress. (iii) ๐’ž should be sufficiently diverse and large that memorization of particular strings does not substantially change per-token cross-entropy, and improvements on ๐’ž correlate with general improvements in modeling structure. (iv) if domain slicing is used, the slice definition is public and deterministic, preventing post hoc selection of slices to favor a given family.

Having fixed L* and its variants, we next describe how to compute CE๐’žT(ฮธ) and UT efficiently for many checkpoints, and how to fit monotone task curves and estimate emergence thresholds with uncertainty.


5. Algorithms: computing CE๐’žT(ฮธ), building unigram baseline UT, fitting monotone performance curves and estimating thresholds with uncertainty.

All quantities in our framework are computable from black-box next-token log-probabilities together with a public, canonically preprocessed calibration corpus. We therefore separate the procedure into (i) tokenizer-level preprocessing that is shared across all checkpoints using a given tokenizer, and (ii) checkpoint-level evaluation that requires only teacher-forced forward passes.

We assume each example xโ€„โˆˆโ€„๐’ž is represented as a fixed UTF-8 byte sequence under a versioned canonicalization (normalization, newline conventions, and any filtering are part of the corpus specification). For each tokenizer T appearing among our checkpoints, we run a single streaming pass over ๐’ž to obtain tokenized sequences T(x)โ€„=โ€„(t1,โ€†โ€ฆ,โ€†t|T(x)|) and accumulate token counts
$$ c_T(v)\;:=\;\sum_{x\in\mathcal{C}}\sum_{j=1}^{|T(x)|}\mathbf{1}\{t_j=v\}, \qquad v\in V_T. $$
In practice we do not store all tokenized sequences; we store (a) the count table cT(โ‹…), (b) the total token mass NTโ€„:=โ€„โˆ‘vcT(v), and (c) optional per-example token lengths |T(x)| if we wish to report auxiliary diagnostics such as tokens-per-byte. This tokenizer pass is amortized across all checkpoints that use T.

Given counts, the maximum-likelihood unigram model is UT(v)โ€„=โ€„cT(v)/NT. We then compute the unigram cross-entropy on ๐’ž without any further access to model probabilities, using only the counts:
$$ \mathrm{CE}_{\mathcal{C}}^{T}(U_T) \;=\; -\frac{1}{N_T}\sum_{v\in V_T} c_T(v)\log\frac{c_T(v)}{N_T}. $$
We treat tokens with cT(v)โ€„=โ€„0 as absent from the empirical support and omit them from the sum. No smoothing is needed because the baseline is evaluated on the same sample from which it is estimated; however, if one wishes to reserve a small held-out subset of ๐’ž for auditing, a smoothed unigram (e.g., add-ฮฑ) can be substituted with no change to the downstream fitting procedure.
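The counting pass and the baseline constant follow directly from the formulas above; a minimal sketch (the `tokenize` callback is a stand-in for any concrete tokenizer):

```python
import math
from collections import Counter

def unigram_baseline(corpus, tokenize):
    """Single streaming pass for tokenizer T: accumulate token counts
    c_T(v) over T(C) and return (counts, N_T, CE_C^T(U_T)).
    `corpus` is an iterable of canonical byte strings; `tokenize`
    maps one string to its token sequence."""
    counts = Counter()
    for x in corpus:
        counts.update(tokenize(x))
    N = sum(counts.values())
    # The CE of the ML unigram evaluated on the same sample is exactly
    # the empirical token entropy: -(1/N) * sum_v c(v) log(c(v)/N).
    ce_unigram = -sum((c / N) * math.log(c / N) for c in counts.values())
    return counts, N, ce_unigram
```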

For a checkpoint ฮธ with tokenizer T, we estimate CE๐’žT(ฮธ) via a streaming teacher-forced loss over tokenized examples. Writing T(x)โ€„=โ€„(t1,โ€†โ€ฆ,โ€†tn), we query the model for logโ€†pฮธ(tjโ€…โˆฃโ€…tโ€„<โ€„j) along the prefix and accumulate
$$ S_\theta\;:=\;\sum_{x\in\mathcal{C}}\sum_{j=1}^{|T(x)|} -\log p_\theta(t_j\mid t_{<j}), \qquad N_T\;:=\;\sum_{x\in\mathcal{C}} |T(x)|. $$
The estimator is then $\widehat{\mathrm{CE}}_{\mathcal{C}}^{T}(\theta)=S_\theta/N_T$. This computation is linear in the number of calibration tokens and uses only forward passes. For numerical stability we recommend accumulating in double precision and, where available, requesting log-probabilities directly rather than probabilities. If the model imposes a context window, we split long T(x) into overlapping segments and include each tokenโ€™s loss exactly once (standard sliding-window evaluation); since ๐’ž is fixed, this policy is itself part of the public specification.
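A sketch of the streaming teacher-forced estimator, assuming a HuggingFace-style causal LM whose forward pass returns logits (that interface is an assumption; any API exposing next-token log-probabilities works the same way):

```python
import torch

def teacher_forced_ce(model, token_ids_per_doc, ctx=2048, stride=1024):
    """Estimate CE_C^T(theta): total negative log-likelihood over all
    calibration tokens divided by the total token count. Long documents
    are split into overlapping windows and each token's loss is counted
    exactly once (sliding-window evaluation)."""
    total_nll, total_tokens = 0.0, 0
    for ids in token_ids_per_doc:
        start = 0
        while True:
            window = ids[start:start + ctx]
            with torch.no_grad():
                logits = model(torch.tensor([window])).logits  # (1, len, |V|)
            logp = torch.log_softmax(logits.double(), dim=-1)
            # Score only tokens not already scored by the previous window.
            first = 1 if start == 0 else ctx - stride
            for j in range(first, len(window)):
                total_nll -= logp[0, j - 1, window[j]].item()
                total_tokens += 1
            if start + ctx >= len(ids):
                break
            start += stride
    return total_nll / total_tokens

# L_hat_star = teacher_forced_ce(...) - ce_unigram  (tokenizer constant)
```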

Once $\widehat{\mathrm{CE}}_{\mathcal{C}}^{T}(\theta)$ is available, the calibrated metric is computed by subtraction using the precomputed tokenizer constant:
$$ \widehat{L}^*(\theta) \;=\; \widehat{\mathrm{CE}}_{\mathcal{C}}^{T}(\theta)-\mathrm{CE}_{\mathcal{C}}^{T}(U_T). $$
Because CE๐’žT(UT) depends only on T and ๐’ž, it is computed once per tokenizer and reused for all checkpoints. This is the primary mechanism by which we obtain cross-family comparability without modifying any model.

For each task t ∈ 𝒯, suppose we have evaluated a set of checkpoints {θi}i=1m to obtain empirical performances P̂t(θi) (e.g., accuracy or pass@1) together with the corresponding L̂*(θi). We fit a monotone mapping ĝt from L* to performance. Two concrete instantiations are convenient: (i) isotonic regression, which imposes monotonicity and nothing else and yields a piecewise-constant fit; and (ii) a parametric logistic curve with a fitted change-point, which is smoother and can extrapolate but adds shape assumptions.

We emphasize that monotonicity is the only structural constraint needed for the subsequent theory; the isotonic choice is thus the default.

Fix a margin ϵ > 0. Because performance is nonincreasing in L* (more negative values indicate more progress), we define the empirical emergence threshold as the largest ℓ at which the fitted curve exceeds baseline by ϵ:
η̂t := sup {ℓ : ĝt(ℓ) ≥ bt + ϵ}.
For isotonic regression, ĝt is defined on intervals; we take η̂t to be the right endpoint of the last fitted block (scanning from most negative L*) whose value is at least bt + ϵ. When checkpoints are sparse near the crossing, we may optionally report an interval bracketed by the largest L̂* with ĝt ≥ bt + ϵ and the smallest with ĝt < bt + ϵ.
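A minimal isotonic fit and threshold extraction using scikit-learn; note the sign convention (performance is nonincreasing in L*, so the threshold is the largest L* whose fitted value clears the margin):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_and_threshold(L_star, perf, baseline, eps, weights=None):
    """Fit a nonincreasing curve g_t(L*) and return eta_hat, the largest
    observed L* whose fitted value reaches baseline + eps (None if no
    checkpoint crosses)."""
    L_star = np.asarray(L_star, float)
    iso = IsotonicRegression(increasing=False, out_of_bounds="clip")
    iso.fit(L_star, np.asarray(perf, float), sample_weight=weights)
    grid = np.sort(L_star)
    above = grid[iso.predict(grid) >= baseline + eps]
    return float(above.max()) if above.size else None
```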

Two sources of noise enter: calibration-loss estimation (finite |๐’ž| and model stochasticity, if any) and task-evaluation noise (finite test set, sampling in generative evaluation). We therefore attach uncertainty to ฮทฬ‚t using either (i) a nonparametric bootstrap over checkpoints, resampling the pairs (Lฬ‚*(ฮธi),โ€†Pฬ‚t(ฮธi)) and refitting gฬ‚t to obtain a bootstrap distribution of ฮทฬ‚t, or (ii) an analytic delta-style approximation when using a parametric curve. If per-checkpoint task evaluations provide standard errors (e.g., binomial confidence intervals), we incorporate them as weights in isotonic regression, improving stability when some checkpoints are evaluated with fewer trials.
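A sketch of option (i), the checkpoint-level bootstrap, reusing fit_and_threshold from the previous sketch:

```python
import numpy as np

def bootstrap_threshold_ci(L_star, perf, baseline, eps,
                           n_boot=2000, alpha=0.1, seed=0):
    """Percentile bootstrap interval for eta_hat: resample the
    (L*, performance) pairs with replacement and refit each time."""
    rng = np.random.default_rng(seed)
    L_star, perf = np.asarray(L_star), np.asarray(perf)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(L_star), size=len(L_star))
        eta = fit_and_threshold(L_star[idx], perf[idx], baseline, eps)
        if eta is not None:           # some resamples may show no crossing
            draws.append(eta)
    return tuple(np.quantile(draws, [alpha / 2, 1 - alpha / 2]))
```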

To assess portability, we recommend leave-one-family-out evaluation: fit gฬ‚t (and thus ฮทฬ‚t) using all but one family, then predict which held-out checkpoints satisfy Pฬ‚t(ฮธ)โ€„>โ€„btโ€…+โ€…ฯต by thresholding gฬ‚t(Lฬ‚*(ฮธ)). This protocol directly targets the cross-tokenizer generalization problem and motivates the theoretical analysis that follows.

In the next section we formalize the latent affine distortion model underpinning these procedures and derive finite-sample error bounds (and matching lower bounds) for threshold prediction across heterogeneous tokenizers.


6. Theory I โ€” Latent Affine Distortion Model: formal assumptions, estimator for latent progress using L*, and finite-sample error bounds for threshold prediction; matching lower bounds for raw-loss/compute predictors.

We now formalize the modeling assumptions that render the algorithmic quantities from the previous section predictive across heterogeneous families, and we isolate the precise obstruction faced by any method that does not calibrate out tokenizer-dependent offsets.

Fix the public calibration corpus 𝒞 with canonical preprocessing. For a given tokenizer T, the calibration loss CE𝒞T(θ) depends both on the underlying predictive structure learned by the checkpoint θ and on tokenizer-induced artifacts (e.g., average token length, frequency of rare tokens, and boundary conventions) that shift per-token cross-entropy. We encode this separation via a tokenizer-invariant latent progress variable z(θ) ∈ ℝ (oriented like a loss: smaller z means more progress) and tokenizer-specific affine distortions:

$$ \mathrm{CE}_{\mathcal{C}}^{T}(\theta) \;=\; a_T\, z(\theta) \;+\; b_T \;+\; \xi, \qquad \text{(H1)} $$

where aT > 0 and bT ∈ ℝ depend only on T (and 𝒞), and ξ is mean-zero noise capturing both finite-sample estimation and residual model mismatch. In particular, even if two checkpoints have identical z(θ), their raw cross-entropies can differ by bT if they use different tokenizers, and thus raw loss is not directly comparable across families.

The key observation is that the dominant tokenizer dependence in (H1) is an additive offset. We therefore seek a quantity computable from 𝒞 and T alone that tracks bT. The empirical unigram baseline UT is defined by token counts on T(𝒞), and its cross-entropy CE𝒞T(UT) can be computed without querying any checkpoint. Our calibration hypothesis is that this unigram cross-entropy captures the tokenizer-induced offset up to a tokenizer-dependent but model-independent discrepancy:

$$ \mathrm{CE}_{\mathcal{C}}^{T}(U_T) \;=\; b_T \;+\; \delta_T, \qquad \text{(H1')} $$

where δT is bounded (or at least does not vary adversarially across tokenizers used in practice). Hypothesis (H1') is not tautological; rather, it asserts that the aspects of tokenization that shift per-token loss for all models (e.g., the empirical entropy of token types under T) are largely captured by the unigram entropy.

In practice we observe $\widehat{\mathrm{CE}}_{\mathcal{C}}^{T}(\theta)$ computed by teacher forcing on a finite sample of calibration tokens (or examples) of size n. Under standard concentration assumptions (e.g., per-token log-loss is sub-exponential under ๐’ž and the model), we may treat the estimation error as ฯƒ-sub-Gaussian at the corpus level:
$$ \widehat{L}^*(\theta)=L^*(\theta)+\varepsilon_{\mathrm{cal}},\qquad \varepsilon_{\mathrm{cal}}\ \text{is mean-zero and}\ O(\sigma/\sqrt{n}). $$
The essential point is that calibration does not increase estimation variance: CE๐’žT(UT) is a deterministic function of the token counts (and can be computed exactly on the same sample), so uncertainty arises primarily from model loss evaluation and any held-out auditing choice.

Fix a downstream task t with baseline bt and performance Pt(θ) ∈ [0, 1]. We posit a monotone link (H2) between latent progress and task performance:
Pt(θ) = φt(z(θ)) + ϵt,
where φt is nonincreasing (performance improves as the loss-like latent z decreases) and L-Lipschitz, and ϵt is mean-zero σt-sub-Gaussian measurement noise arising from finite evaluation budgets. Define the (latent) emergence threshold ηt as the largest z such that φt(z) ≥ bt + ϵ for a user-chosen margin ϵ > 0. Given m labeled checkpoints, we estimate a monotone curve ĝt in L*-space (e.g., isotonic regression), and set η̂t = sup {ℓ : ĝt(ℓ) ≥ bt + ϵ} as in the algorithmic section.

Theorem 2 (informal; threshold transfer). Assume (H1), (H1'), and (H2) with sub-Gaussian calibration and evaluation noise, bounded δT, and a nondegenerate local slope of φt at the crossing. Then, with high probability over m labeled checkpoints, threshold-crossing predictions based on η̂t err only on checkpoints whose latent progress lies within O((σ + σt) m^{-1/2}) of the crossing. The proof combines concentration of L̂*(θ) around aTz(θ), standard risk bounds for isotonic regression under sub-Gaussian noise, and a local inverse-Lipschitz transfer through the monotone map at the crossing; importantly, the bound depends on noise and sample size but not on unknown tokenizer offsets.
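A small synthetic illustration of the phenomenon behind Theorems 2 and 3 (all numbers invented): three hypothetical families share the same latent progress scale but carry different tokenizer offsets, so a single raw-loss threshold transfers poorly while the offset-free coordinate supports a common threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

offsets = {"A": 0.9, "B": 0.2, "C": -0.5}     # family-specific b_T (delta_T ~ 0)
z = rng.uniform(0.5, 3.0, size=90)            # loss-like latent: lower = better
fam = np.repeat(list(offsets), 30)
b_T = np.array([offsets[f] for f in fam])

raw_ce = z + b_T + rng.normal(0, 0.02, 90)    # (H1) with a_T = 1
L_star = z + rng.normal(0, 0.02, 90)          # unigram subtraction cancels b_T

perf = 1 / (1 + np.exp(8 * (z - 1.5)))        # nonincreasing link phi_t
crossed = perf > 0.25 + 0.05                   # baseline 0.25, margin 0.05

for coord, name in [(raw_ce, "raw CE"), (L_star, "L*")]:
    eta = coord[crossed].max()                 # one cross-family threshold
    err = np.mean((coord <= eta) != crossed)
    print(f"{name}: threshold-crossing error = {err:.2f}")
# Typical output: raw CE misclassifies a sizable fraction; L* is near zero.
```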

We finally state the obstruction: any estimator that uses only raw loss CE๐’žT(ฮธ) (or compute/parameter count proxies that correlate with loss within a family) cannot, in general, remove bT without additional information tied to T. This yields an irreducible cross-tokenizer error.

Theorem 3 (informal; lower bound for uncalibrated predictors). For any predictor that depends on a checkpoint only through raw CE𝒞T(θ) (or through compute or parameter count), there exist tokenizer/checkpoint pairs whose observable statistics agree up to noise while their latent progress differs by an amount governed by the offset mismatch |bT − bT′|; such a predictor therefore incurs threshold-classification error bounded below in terms of that mismatch.

The proof is a two-point indistinguishability argument: by choosing instances where the observable statistics coincide up to noise but the latent mapping differs via bT, any uncalibrated rule must err by an amount proportional to the offset mismatch. Thus, Theorem 2 is essentially tight in its dependence on calibration.

We next address a complementary question: if one wished to avoid tokenizer dependence entirely by marginalizing over all segmentations of a string, can this be done efficiently? The answer is negative in general, and it motivates L* as the appropriate surrogate under realistic resource constraints.


7. Theory II โ€” Tokenization Intractability: #P-hardness of exact segmentation-invariant likelihood for autoregressive models; implications and why L* is the right surrogate under resource constraints.

A natural response to the tokenizer-dependence exhibited by per-token cross-entropy is to attempt to eliminate tokenization from the definition of likelihood altogether. Concretely, fix a canonical byte string x (e.g., an element of the calibration corpus 𝒞). For a tokenizer T with vocabulary VT, let decodeT be the deterministic map from token sequences s = (t1, …, tk) ∈ VT* to byte strings, and define the set of segmentations of x by
๐’ฎT(x)โ€„:=โ€„{sโ€„โˆˆโ€„VT*:ย decodeT(s)โ€„=โ€„x}.
Given an autoregressive model pฮธ over VT we may then define the segmentation-marginal (tokenization-invariant) likelihood
$$ p_\theta^{\mathrm{marg}}(x)\ :=\ \sum_{s\in \mathcal{S}_T(x)} p_\theta(s) \qquad\text{where}\qquad p_\theta(s)=\prod_{j=1}^{|s|} p_\theta(t_j \mid t_{<j}). $$
One could correspondingly define a string-level calibration loss โˆ’logโ€†pฮธmarg(x) and average it over xโ€„โˆผโ€„๐’ž. If computable, such a quantity would remove dependence on a particular segmentation T(x) and might appear to dominate our calibrated surrogate L*.

The obstruction is that the summand pฮธ(s) depends on the entire token prefix tโ€„<โ€„j through the autoregressive conditionals. Even when ๐’ฎT(x) admits a compact representation as a tokenization lattice (a DAG whose paths correspond to valid segmentations), the value of a path cannot be written as a product of edge weights that depend only on local lattice state, because the conditional at position j is a function of the full prefix sequence, not merely of the already-decoded substring. Thus the standard forward algorithm for HMMs (or weighted finite-state acceptors) is unavailable absent strong additional structure.

Theorem 4 (informal; intractability of segmentation marginals). There is a family of tokenizers and autoregressive models, each with polynomial-time computable conditionals and decode maps, for which exact computation of pθmarg(x) is #P-hard.

We reduce from counting simple directed s → t paths in a directed graph, a #P-complete problem; note that counting s → t paths in a DAG is polynomial-time by dynamic programming, so the hardness must come from the simplicity constraint, which the model can enforce through its dependence on the full prefix. Let G = (V, E) be a directed graph with distinguished vertices s, t, and let #PATHS(G) denote the number of simple directed s → t paths. We construct a tokenizer T and a fixed string x such that each simple s → t path corresponds bijectively to a token sequence in 𝒮T(x), and we define pθ so that each such sequence has identical probability mass (or, more generally, probability equal to a prescribed path weight), thereby forcing pθmarg(x) to encode #PATHS(G).

One convenient construction is to choose x to be a constant byte string (e.g., x = aL for a suitable length L), and to define the vocabulary VT so that each edge e ∈ E corresponds to a token whose decoded byte string is aℓ(e), with lengths ℓ(e) chosen so that the concatenation constraint forces any token sequence decoding to x to spell out an edge sequence whose lengths sum to exactly L. The tokenization lattice over x then contains a path for each valid sequence of edge-tokens whose lengths sum to L. We then define an autoregressive model pθ whose next-token probabilities implement the adjacency and simplicity constraints of G: conditioned on a prefix encoding a partial path ending at vertex v, the model assigns positive probability only to edge-tokens outgoing from v whose head vertex has not yet been visited (the full token prefix encodes the visited set, so this is a function of the prefix), uniformly over the allowed edges. Under this definition, every complete simple s → t path has probability equal to the product of the reciprocals of the allowed-edge counts along the path, and pθmarg(x) becomes the sum over all such path probabilities. By modifying these conditionals one may instead encode unit weights and obtain pθmarg(x) = #PATHS(G)/Z for an easily computable normalizing constant Z. Since exact computation of #PATHS(G) is #P-complete, exact computation of pθmarg(x) is #P-hard.

The essential point is that the autoregressive dependence on the entire prefix allows the model to ``remember'' the set of visited vertices of the partial path within the hidden state (i.e., within the prefix itself), while the decoding map forces all complete paths to correspond to the same output string x. This completes the reduction at the level of a proof sketch; a full proof fixes the length-coding to ensure polynomial lattice size (and equal total length across simple paths, e.g., via a terminal padding token) and verifies that the constructed conditionals define a valid probability distribution.

Theorem 4 formalizes that tokenization-marginal likelihood is not merely expensive but computationally intractable in the worst case for general autoregressive models. Consequently, any proposal to compare checkpoints across tokenizers by evaluating −log pθmarg(x) on a nontrivial calibration corpus 𝒞 either (i) restricts to special model classes with additional locality properties, (ii) incurs exponential-time dependence on string length in the worst case, or (iii) relies on uncontrolled approximations whose error can be arbitrarily large when probability mass is spread across many segmentations.

One may approximate pฮธmarg(x) by summing over only a subset of segmentations, for instance by enumerating the top-B segmentations under a proposal distribution and computing a truncated sum. This yields a lower bound
pฮธ,โ€†Bmarg(x)โ€„:=โ€„โˆ‘sโ€„โˆˆโ€„๐’ฎT,โ€†B(x)pฮธ(s)ย โ‰คย pฮธmarg(x),
and hence an upper bound on the corresponding negative log-likelihood. However, the gap logโ€†pฮธmarg(x)โ€…โˆ’โ€…logโ€†pฮธ,โ€†Bmarg(x) is controlled by the leftover mass outside the beam, which may be large precisely in regimes where tokenization ambiguity is high. Since our objective is stable cross-family forecasting, a metric whose approximation error can vary unpredictably across tokenizers is not an acceptable foundation.
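For concreteness, a toy beam-truncated marginal over the tokenization lattice of a short string (the `logp_next` callback stands in for an arbitrary autoregressive model; for real vocabularies the untruncated sum is exactly the intractable quantity above):

```python
import math
from heapq import nlargest

def truncated_marginal(x, vocab, logp_next, B=16):
    """Beam lower bound on log p^marg(x): log-sum-exp of model log-
    probabilities over the surviving token sequences decoding to x.
    Keeps the B best partial segmentations per consumed prefix length."""
    beams = {0: [(0.0, ())]}        # consumed chars -> [(logp, tokens)]
    complete = []
    for pos in range(len(x) + 1):
        for lp, toks in nlargest(B, beams.pop(pos, [])):
            if pos == len(x):
                complete.append(lp)
                continue
            for tok in vocab:
                if x.startswith(tok, pos):
                    beams.setdefault(pos + len(tok), []).append(
                        (lp + logp_next(toks, tok), toks + (tok,)))
    if not complete:
        return float("-inf")
    mx = max(complete)
    return mx + math.log(sum(math.exp(lp - mx) for lp in complete))

# Toy usage: uniform dummy conditionals over a 3-token vocabulary.
print(truncated_marginal("aab", ["a", "aa", "b"],
                         lambda prefix, tok: math.log(1 / 3), B=4))
```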

In contrast, L*(θ) = CE𝒞T(θ) − CE𝒞T(UT) is computable in time linear in |T(𝒞)| using only teacher forcing and token counts, and it directly targets the dominant nuisance dependence (the tokenizer-induced offset) in the latent affine distortion model. The preceding section established that, under mild statistical assumptions, this calibration suffices for cross-family threshold prediction with error controlled by noise and sample size rather than by arbitrary tokenizer offsets. Theory II clarifies that there is no generally efficient alternative that achieves true segmentation invariance by exact marginalization. We therefore treat L* not as a heuristic, but as a principled response to a computational barrier: it removes the leading tokenizer effect that is both practically consequential and efficiently estimable, while avoiding an intractable marginal likelihood that would otherwise be the canonical ``tokenizer-free'' quantity.


8. Experimental Protocol (for strengthening claims): datasets, model families, checkpoints, tasks (multiple-choice + verifier-based), and evaluation metrics (continuous + discontinuous); transfer tests (leave-one-family-out).

We evaluate whether the calibrated progress metric L*(ฮธ) supports (i) cross-family prediction of downstream performance curves and (ii) cross-family localization of emergence thresholds relative to task baselines. Our protocol therefore separates (a) computation of L*(ฮธ) on a fixed public calibration corpus from (b) task evaluation on a suite ๐’ฏ, and measures generalization via leave-one-family-out (LOFO) transfer.

We fix a public corpus 𝒞 specified as a canonical list of UTF-8 byte strings together with an explicit normalization pipeline (e.g., newline normalization and exact byte preservation; no additional detokenization). We publish hashes of each element and the concatenation order to make 𝒞 immutable. To control variance in CE𝒞T(⋅) estimates, we choose 𝒞 sufficiently large that n := |T(𝒞)| is on the order of 10^7 tokens for typical tokenizers, and we additionally report results for smaller prefixes (e.g., n ∈ {10^5, 10^6, 10^7}) to empirically validate concentration behavior. We ensure that 𝒞 spans multiple registers (web text, encyclopedic text, code, and formal mathematics) so that unigram offsets induced by tokenization are not artifactually tied to a single domain.

We include multiple open-weight families โ„ฑ that differ in tokenizer, scale, and pretraining mixture. For each family, we sample checkpoints across the training trajectory or across parameter scales so that L*(ฮธ) spans a wide range (including models near random-task baselines). When intermediate training checkpoints are unavailable, we approximate a trajectory by selecting multiple scales within the same family (e.g., {1B,โ€†3B,โ€†7B,โ€†13B} parameters) and treat scale as an imperfect proxy for progress; this is acceptable because our predictors are trained on L* rather than on scale. For each checkpoint we record: tokenizer identifier, vocabulary size |VT|, maximum context length used in evaluation, and decoding defaults. We avoid any post hoc filtering of checkpoints based on task outcomes to prevent selection bias in threshold estimates.

For each tokenizer T appearing in the experiment, we tokenize ๐’ž once and compute unigram counts cT(v) and the maximum-likelihood baseline UT. We then compute CE๐’žT(UT) exactly from counts. For each checkpoint ฮธ using tokenizer T, we compute CE๐’žT(ฮธ) by teacher forcing on T(๐’ž), streaming through tokens to avoid storing per-token losses. We report (i) the mean cross-entropy and (ii) an empirical standard error estimated by block bootstrap over documents in ๐’ž, which captures both finite-sample effects and mild nonstationarity. Finally, we set L*(ฮธ)โ€„=โ€„CE๐’žT(ฮธ)โ€…โˆ’โ€…CE๐’žT(UT) and propagate uncertainty by subtracting the deterministic baseline and retaining the model-loss standard error.

We construct ๐’ฏ to include both discontinuous metrics (accuracy/EM) and continuous metrics (probabilistic scores), because emergence phenomena are often more interpretable under discrete success but more statistically efficient under continuous scoring. We include: (i) multiple-choice knowledge and reasoning tasks, (ii) short-form generation with exact-match or normalized-match, and (iii) verifier-based tasks where success is determined by an external checker (unit tests for code, symbolic verification for algebra, or a deterministic rubric implemented as a program). For each task t we specify a random baseline bt (e.g., uniform choice for multiple-choice, random-string or null output for EM, or a verifier-determined success rate under a randomized policy), and we fix the prompt format, few-shot count, and choice ordering policy to avoid leakage through formatting.

For multiple-choice tasks, we compute both (a) accuracy (argmax choice) and (b) a continuous score given by the normalized probability assigned to the correct option. Concretely, for choices c ∈ {1, …, k} with tokenizations T(c), we compute length-normalized log-likelihoods ℓc := (1/|T(c)|) ∑j log pθ(tc,j ∣ prompt, tc,<j) and define a softmax over {ℓc}; this yields a well-defined continuous measure even when accuracy is saturated or near baseline. For open-generation tasks we report EM (or a canonical normalization thereof) and, when appropriate, a verifier-derived success probability. For all tasks we fix decoding to greedy (temperature 0) unless the task definition requires sampling; in the latter case we use a fixed seed schedule and report Monte Carlo standard errors.
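The continuous multiple-choice score is a softmax over length-normalized option log-likelihoods; a minimal sketch (the input layout is our assumption):

```python
import numpy as np

def choice_scores(option_token_logps):
    """option_token_logps[c]: per-token log-probabilities of option c,
    teacher-forced and conditioned on the prompt. Returns the softmax
    over length-normalized log-likelihoods; the entry for the correct
    option is the continuous metric, argmax gives the accuracy metric."""
    ell = np.array([np.mean(lps) for lps in option_token_logps])
    ell -= ell.max()                 # numerical stability
    return np.exp(ell) / np.exp(ell).sum()
```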

Given evaluated pairs (L*(θi), Pt(θi)), we fit a monotone predictor ĝt using isotonic regression (nonincreasing in L*); we also fit a parametric alternative (a logistic curve with a fitted change-point) as a robustness check. For a fixed margin ϵ > 0, we estimate the emergence threshold in L*-space by
η̂t := sup {ℓ : ĝt(ℓ) ≥ bt + ϵ},
and we report uncertainty intervals by resampling checkpoints (nonparametric bootstrap over the labeled set used for fitting).

To test portability across tokenizers and training mixtures, we perform LOFO evaluation: for each held-out family fโ€„โˆˆโ€„โ„ฑ, we fit gฬ‚t and ฮทฬ‚t using checkpoints from โ„ฑโ€…\โ€…{f} and then predict Pt(ฮธ) and threshold crossing on checkpoints in f. We measure (i) squared error in Pt predictions, (ii) calibration of predicted probabilities for verifier-based success (reliability curves and Brier score), and (iii) classification error for threshold crossing ๐Ÿ™{Pt(ฮธ)โ€„>โ€„btโ€…+โ€…ฯต}. We compare against ablations that replace L* by raw CE๐’žT(ฮธ), parameter count, and FLOPs, using identical fitting procedures, thereby isolating the effect of unigram calibration rather than the curve-fitting method.
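A sketch of the LOFO threshold-crossing evaluation (the record layout is ours; fit_and_threshold is the Section 5 sketch):

```python
import numpy as np

def lofo_threshold_error(records, baseline, eps):
    """records: dicts with keys 'family', 'L_star', 'perf'.
    For each held-out family, fit the threshold on the rest and score
    the binary prediction 1{P_t > b_t + eps} on the held-out family."""
    errors = {}
    for held_out in sorted({r["family"] for r in records}):
        train = [r for r in records if r["family"] != held_out]
        test = [r for r in records if r["family"] == held_out]
        eta = fit_and_threshold([r["L_star"] for r in train],
                                [r["perf"] for r in train], baseline, eps)
        if eta is None:
            continue                  # no crossing observed in training data
        mistakes = [(r["L_star"] <= eta) != (r["perf"] > baseline + eps)
                    for r in test]    # more negative L* => predicted crossed
        errors[held_out] = float(np.mean(mistakes))
    return errors
```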

We publish the exact lists of checkpoints, tokenizer identifiers, ๐’ž hashes, task prompts, evaluation code, and all intermediate aggregates (counts for UT, per-checkpoint CE, and per-task outputs) sufficient to recompute L* and the fitted thresholds. This ensures that subsequent sections on stress tests can vary one axis (e.g., corpus choice or tokenization perturbations) while holding the remainder of the pipeline fixed.


9. Stress Tests and Failure Modes: dataset shift, calibration corpus mismatch, tokenization perturbations, adversarial/gaming analysis, and when L* should not be expected to work.

We now delineate stress tests intended to falsify (or at least delimit) the empirical claim that L*(θ) provides a portable progress coordinate across heterogeneous families. The point of these tests is not to ``improve'' L* by ad hoc fixes, but to expose when the assumptions implicit in Theorems 1–3 are violated strongly enough that threshold forecasts become unreliable.

Hypothesis (H2) requires that task performance be (approximately) a monotone function of a tokenizer-invariant z(ฮธ). This can fail under distribution shift in the task distribution Dt relative to what is captured by ๐’ž and by pretraining. We therefore evaluate each task t under controlled perturbations of Dt that preserve the task definition but alter surface statistics, e.g.,
(i) paraphrasing prompts and options,
(ii) swapping topic mixtures (e.g.ย science vs.ย sports) while keeping difficulty constant,
(iii) introducing code-style changes (renaming variables, reformatting) for verifier-based coding tasks.
We report, for each perturbation, the change in LOFO threshold-crossing error and the stability of the fitted curve gฬ‚t (measured by the maximum vertical deviation between fits). Large instability indicates that the ``single-coordinateโ€™โ€™ ordering is insufficient for that task family, even if unperturbed performance appeared well predicted.

Theorem 1 is local to the choice of 𝒞; if 𝒞 is poorly matched to the entropy structure induced by a tokenizer, the term δT may be large or may correlate with model family, defeating calibration. We stress-test corpus dependence by defining a small set of alternative public calibration corpora {𝒞(r)}r=1R spanning distinct registers (e.g. formal math, code, encyclopedic prose) under the same canonical byte-level preprocessing. For each checkpoint we compute L*(r)(θ) using 𝒞(r) and examine:
(i) the stability of the checkpoint ordering across corpora (e.g. rank correlation between L*(r) and L*(r′)), and
(ii) family-specific residuals after the best affine alignment between the L*(r) coordinates.
If L* is genuinely capturing a tokenizer-invariant progress coordinate, then changing ๐’ž should move all checkpoints in a largely order-preserving manner and should not induce large family-specific residuals. When it does, we treat forecasts as corpus-conditional and report the dependence explicitly.

Although L* removes a dominant unigram offset, it does not marginalize over segmentations (cf. Theorem 4). We therefore perform controlled perturbations of the tokenizer while keeping the underlying byte strings fixed. Concretely, for BPE/Unigram tokenizers we construct perturbed tokenizers by (i) altering normalization rules (e.g. whitespace collapsing vs. preservation), (ii) injecting stochastic BPE-dropout at tokenization time and re-estimating UT on the induced distribution, and (iii) restricting/expanding the vocabulary by pruning rare merges. For each perturbation T̃ we recompute CE𝒞T̃(UT̃) and LT̃*(θ) for models that can be evaluated under T̃ (or for proxy models trained with that tokenizer when direct evaluation is impossible). The diagnostic is whether L* remains approximately affine across perturbations: we fit LT̃*(θ) ≈ αT̃ LT*(θ) + βT̃ and examine the residual variance. Large, structured residuals indicate that higher-order tokenization effects (beyond unigram entropy) are material for 𝒞 and hence for portability.
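The affine-stability diagnostic is an ordinary least-squares fit per perturbed tokenizer; a minimal sketch:

```python
import numpy as np

def affine_residuals(L_base, L_pert):
    """Fit L*_pert ~ alpha * L*_base + beta across checkpoints and
    return (alpha, beta, residual std). Large or structured residuals
    flag tokenization effects beyond the unigram offset."""
    L_base, L_pert = np.asarray(L_base), np.asarray(L_pert)
    A = np.vstack([L_base, np.ones_like(L_base)]).T
    (alpha, beta), *_ = np.linalg.lstsq(A, L_pert, rcond=None)
    return alpha, beta, (L_pert - alpha * L_base - beta).std()
```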

A distinct failure mode is that downstream evaluation itself introduces tokenizer-coupled artifacts, e.g.ย multiple-choice scoring that depends on option length under T. While our protocol uses length-normalized likelihoods, normalization is not a panacea when token boundaries align differently with semantic units. We stress-test by re-encoding each task instance under multiple semantically equivalent surface forms with differing token lengths (e.g.ย adding neutral padding, changing punctuation conventions) and measuring whether the relative ordering of checkpoints at fixed L* changes. When it does, we attribute the effect to task measurement noise rather than to L* per se, and we report a task-specific ``format instabilityโ€™โ€™ score that upper-bounds attainable prediction accuracy under any scalar progress metric.

Because ๐’ž is public, a capable actor could fine-tune a model to reduce CE๐’žT(ฮธ) (and hence improve L*) without commensurate gains on tasks, invalidating threshold forecasts. We operationalize this as an explicit attack: starting from a base checkpoint, we perform lightweight continued training on ๐’ž (or a close paraphrase set) while holding total compute fixed, and we measure the slope ฮ”Pt/ฮ”L* across tasks. A healthy regime exhibits broadly positive slopes; a successful gaming attack yields large negative or near-zero slopes. As mitigations, we consider (i) using multiple disjoint calibration corpora and aggregating L* (e.g.ย median across corpora), and (ii) maintaining an additional calibration set for governance use, while retaining the public protocol for open scientific comparison. We emphasize that these mitigations address strategic behavior, not purely statistical noise.

We list conditions under which the theoretical and empirical premises plausibly fail:
(i) heavy non-likelihood post-training (preference optimization, RL) that decouples task behavior from next-token loss on 𝒞,
(ii) genuinely multi-dimensional progress (e.g. code versus dialogue specialization) that no scalar coordinate can order,
(iii) calibration corpora that miss the register a task depends on, so that δT is large or correlated with family,
(iv) tokenization effects on 𝒞 beyond unigram entropy that remain material after calibration, and
(v) strategic optimization against the public corpus 𝒞.
In such regimes, we treat L* as a descriptive statistic of text modeling skill on ๐’ž rather than as a portable progress coordinate, and we recommend reporting these stress-test outcomes alongside any claimed threshold forecasts.


10. Discussion and Governance Use-Cases: portable early-warning reporting; how L* interacts with RL post-training and inference-time compute (boundaries of applicability); limitations and future work.

The principal claim of this work is not that L*(θ) is an intrinsic measure of ``intelligence,'' but that it is a portable coordinate for forecasting thresholded capability on fixed text tasks under heterogeneous tokenizers. In settings where one wishes to anticipate when a task t will become reliably solvable by a publicly released checkpoint, portability is the binding constraint: raw losses, parameter counts, and compute are not commensurate across families because they carry family-specific offsets. The subtraction in L* is deliberately minimal; it removes only what is required to align cross-entropies sufficiently to support threshold localization at the granularity implied by Theorem 2.

From a governance perspective, the immediate use-case is portable early-warning reporting for capability onset. Concretely, given a stream of newly released checkpoints {θ} (across any family for which we can compute CE𝒞T(θ)), one may maintain a public table of L*(θ) values on a fixed 𝒞 and, for each monitored task t, report P̂t(θ) = ĝt(L*(θ)) together with a conservative threshold-crossing statement. A simple operationalization is: fix ϵ > 0 and a tolerable false-alarm rate δ, and declare ``at-risk of emergence'' if an upper confidence bound on P̂t(θ) exceeds bt + ϵ. By Theorem 2, the uncertainty in the implied threshold η̂t decays as m^{-1/2} in the number of labeled checkpoints used to fit ĝt; hence the protocol benefits from a small, curated set of evaluations on representative checkpoints, after which large-scale monitoring reduces to cheap calibration loss computation.
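A minimal operationalization of the early-warning rule (the Gaussian upper confidence bound is a stand-in for whatever interval the fitting procedure supplies, e.g. the bootstrap of Section 5):

```python
def at_risk(pred_perf, stderr, baseline, eps, delta=0.05):
    """Declare 'at risk of emergence' when a one-sided (1 - delta)
    upper confidence bound on predicted performance clears b_t + eps."""
    z = {0.05: 1.645, 0.01: 2.326}[delta]   # supported one-sided quantiles
    return pred_perf + z * stderr > baseline + eps

# Example: predicted 0.29 +/- 0.03 against baseline 0.25, margin 0.05.
print(at_risk(0.29, 0.03, 0.25, 0.05))      # True: UCB 0.34 clears 0.30
```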

We emphasize that such reporting is most defensible when (i) the task evaluation is standardized and (ii) the model is deployed in a mode close to that used for calibration (teacher-forced likelihood reflecting the same base distribution). The calibrated coordinate is designed to support statements of the form: ``among checkpoints that share broadly similar training objectives, those with more negative L* should not systematically underperform on t unless the task violates the single-coordinate hypothesis.'' This motivates a governance framing in which L* is treated as a screening instrument: it identifies checkpoints for which costly task evaluations are most informative (near η̂t), and it supplies a portable axis on which to stratify models in reports, without asserting that the axis is causal or complete.

A boundary of applicability concerns post-training procedures, especially preference optimization and reinforcement learning from feedback. These procedures can change the induced policy over completions in ways that are weakly coupled to CE𝒞T(θ). In the extreme, one can modify response style, refusal behavior, or tool-use proclivities while leaving next-token likelihood on generic text nearly unchanged; then L* remains a measure of pretraining-like modeling skill, whereas Pt may shift due to post-training. We therefore recommend a two-coordinate accounting when RL-style post-training is material: report (L*(θbase), Δt(θ)), where Δt(θ) is the task-specific gain (or loss) attributable to post-training relative to a matched base checkpoint. When matched bases are unavailable, one may still use L* to predict the likelihood-aligned component of performance and treat deviations as post-training residuals. Formally, one may posit Pt(θ) = φt(z(θbase)) + ρt(θ) + ϵt, where ρt captures non-likelihood-aligned modifications; the governance-relevant point is that ρt cannot be inferred from L* alone and must be measured on tasks or proxied by additional diagnostics.

A second boundary concerns inference-time compute. Many deployments use sampling schemes (self-consistency, best-of-k, verifier-guided search) that can amplify success probabilities relative to a single greedy or temperature-controlled decode. Our analysis is silent about these amplification mechanisms: L* is computed under teacher forcing and thus reflects the base distribution, not the algorithm used to extract answers. A practical remedy is to parameterize performance by both calibrated progress and an inference budget variable c (e.g. number of samples, verifier calls), fitting P̂t(θ; c) ≈ ĝt(L*(θ), c) with monotonicity in both arguments. Even when c is held fixed in evaluation, governance users should interpret a threshold η̂t as relative to an inference budget: a task may ``emerge'' earlier under aggressive search than under a restricted budget, and the relevant budget depends on the deployment environment.

Several limitations follow directly. First, L* remains corpus-conditional: it is calibrated on ๐’ž, and portability is only promised to the extent that ๐’ž captures the entropy structure that induces tokenizer offsets. Second, L* is scalar and therefore cannot represent genuinely multi-dimensional progress (e.g.ย models specializing in code versus dialogue); when the single-coordinate assumption fails, the appropriate response is not to contort L* but to represent progress by a small vector L*(r) over a controlled family of calibration corpora and to fit task-specific combinations. Third, L* is a likelihood-based statistic; as models become increasingly shaped by non-likelihood objectives, we should expect the correlation between L* and downstream capability to weaken, and monitoring should accordingly weight direct task evaluation more heavily.

Future work therefore divides naturally into three directions. (i) replace the affine distortion model with a hierarchical model in which tokenization effects include bounded higher-order terms, and characterize when unigram subtraction remains minimax-optimal among feasible calibrators. (ii) construct a standardized set of calibration corpora with canonical preprocessing and publish uncertainty intervals for L*(ฮธ) under finite n, enabling principled comparisons across laboratories. (iii) generalize the framework to multimodal inputs and to settings with tool calls, where teacher-forced text likelihood is an incomplete proxy and where thresholding should be defined over an augmented action space. Our position is that L* is a useful baseline: it provides a minimal, reproducible alignment across tokenizers, and it clarifies which additional axes must be introduced when that baseline ceases to predict emergence.