โ† Back

Loss That Transfers: Calibrated Progress Metrics and Tight Bounds for Cross-Tokenizer Emergence Forecasting

Table of Contents

  1. Introduction: emergence, loss-threshold perspective (Du et al.), and the comparability crisis across tokenizers/corpora; statement of contributions and what is provably possible/impossible.
  2. Related Work: emergent abilities (Wei et al.), metric artifacts debate, loss-based thresholds (Du et al.), normalized perplexity/bpb ideas, and why cross-family forecasting remains unresolved.
  3. Problem Setup and Definitions: calibration corpus 𝒞, model families, tokenizers, downstream tasks, random baselines, and the target forecasting task (threshold crossing).
  4. Calibrated Progress Metric L*: construction, variants (additive vs multiplicative normalization; domain-sliced calibration), and design principles (public, stable, hard-to-game).
  5. Algorithms: computing CE𝒞T(θ), building unigram baseline UT, fitting monotone performance curves and estimating thresholds with uncertainty.
  6. Theory I — Latent Affine Distortion Model: formal assumptions, estimator for latent progress using L*, and finite-sample error bounds for threshold prediction; matching lower bounds for raw-loss/compute predictors.
  7. Theory II — Tokenization Intractability: #P-hardness of exact segmentation-invariant likelihood for autoregressive models; implications and why L* is the right surrogate under resource constraints.
  8. Experimental Protocol (for strengthening claims): datasets, model families, checkpoints, tasks (multiple-choice + verifier-based), and evaluation metrics (continuous + discontinuous); transfer tests (leave-one-family-out).
  9. Stress Tests and Failure Modes: dataset shift, calibration corpus mismatch, tokenization perturbations, adversarial/gaming analysis, and when L* should not be expected to work.
  10. Discussion and Governance Use-Cases: portable early-warning reporting; how L* interacts with RL post-training and inference-time compute (boundaries of applicability); limitations and future work.

Content

1. Introduction: emergence, loss-threshold perspective (Du et al.), and the comparability crisis across tokenizers/corpora; statement of contributions and what is provably possible/impossible.

A recurring empirical observation in modern large language models is that certain downstream competencies appear to emerge abruptly as training progresses: for a range of checkpoints the measured performance on a task remains near a random or trivial baseline, and then, within a comparatively small increment of training, performance becomes reliably nontrivial. This qualitative pattern has been reported across many tasks and model families and has motivated a threshold-based view of capability onset. In particular, loss-threshold perspectives such as that of Du et al. suggest that, within a fixed family and evaluation protocol, the onset of a capability can often be forecast from a scalar progress proxy (typically training loss or perplexity on a reference distribution), by positing that a task becomes solvable once the proxy crosses a task-dependent threshold.

The difficulty addressed in the present work is that forecasting is not well-posed under naรฏve proxies. When models differ in tokenizer, pretraining corpus, or even byte-level preprocessing conventions, the numerical scale of cross-entropy on a fixed string corpus is not directly comparable: the same underlying string distribution may induce substantially different token-length normalizations and different baseline entropies. Consequently, two checkpoints with essentially identical latent modeling ability may exhibit losses that differ by an additive constant (and sometimes a multiplicative distortion) that is determined primarily by tokenization and corpus mismatch rather than by learned structure. This comparability crisis is not merely cosmetic; it undermines any attempt to provide portable early-warning reports such as ``task t will exceed baseline once calibration loss reaches โ„“โ€™โ€™ unless โ„“ is defined in a tokenizer-robust manner.

We formalize this issue by fixing a public calibration corpus ๐’ž with canonical byte-level representation and considering a collection of checkpoints ฮธ drawn from heterogeneous families. Each checkpoint induces an autoregressive distribution over a tokenizer-specific vocabulary, hence yields an average per-token cross-entropy on ๐’ž under its own tokenizer. The central observation is that per-token cross-entropy decomposes into (i) a term reflecting model-specific predictive structure and (ii) a tokenizer-dependent offset reflecting the entropy of the token stream itself. Without correcting for (ii), a raw loss threshold transports poorly across tokenizers: the same underlying character-level uncertainty may correspond to different token-level entropies, and the induced offset does not vanish even as model size grows.

Our point of departure is that, while tokenization-invariant string likelihood would be a natural remedy, it is computationally intractable in the relevant setting. Given an autoregressive model over tokens and a deterministic decode map, the segmentation-invariant likelihood of a string is the sum of probabilities of all token sequences decoding to that string. Exact computation of this marginal requires summing over a tokenization lattice whose size may be polynomial but whose weighted-path sum, because path weights depend on the entire token prefix, is in general #P-hard. Thus, if one insists on exact tokenization invariance at the level of likelihood, no efficient black-box evaluation procedure is available for large-scale calibration.

We therefore seek an efficiently computable surrogate that removes the dominant tokenizer-induced distortion while preserving monotone information about modeling progress. The proposed surrogate is a calibrated cross-entropy obtained by subtracting a tokenizer-specific unigram baseline computed on the same calibration corpus. Concretely, for each tokenizer T we estimate the maximum-likelihood unigram model UT from token counts on T(𝒞), and define the calibrated score L*(θ) as the difference between the model cross-entropy on 𝒞 and the unigram cross-entropy on 𝒞 under the same tokenizer. The subtraction is designed to cancel the leading-order offset attributable to token inventory and segmentation frequency, yielding a quantity that is comparably scaled across tokenizers up to an affine distortion and noise.

The technical claims of the paper are organized around two questions: (Q1) what can be recovered about cross-family progress from black-box next-token probabilities on a fixed public corpus, and (Q2) under what conditions can we forecast the onset of downstream performance above baseline from such a progress measure. For (Q1), we posit a latent variable z(θ) capturing tokenizer-invariant modeling progress and assume that calibration cross-entropy is an affine function of z(θ) with tokenizer-dependent parameters and sub-Gaussian noise. Under a mild additional condition that the unigram baseline approximates the tokenizer-induced offset up to a bounded tokenizer-dependent remainder, the calibrated metric L*(θ) becomes an affine proxy for z(θ) with substantially reduced cross-tokenizer variance. For (Q2), we assume each task performance Pt(θ) is a monotone Lipschitz function of z(θ) plus measurement noise; then, given a modest number of labeled checkpoints, we fit a monotone curve in L*-space and define an estimated emergence threshold as the largest L* at which predicted performance exceeds a baseline margin (performance improves as L* decreases).

Our contributions are the following. First, we define the calibration procedure and metric L*, together with an explicit evaluation protocol requiring only teacher-forced loss computation on a public corpus and token counting for the unigram baseline. Second, we provide finite-sample guarantees for threshold estimation when performance is modeled as a monotone function of latent progress, yielding an error bound that scales with the noise level and the number of labeled checkpoints rather than with tokenizer offsets. Third, we complement the upper bound with a lower bound showing that any method that relies solely on raw loss (or on compute/parameter count) cannot be uniformly cross-tokenizer consistent when tokenizer-dependent offsets are unknown; unavoidable error persists proportional to offset mismatch. Fourth, we clarify the computational boundary by formalizing the #P-hardness of exact segmentation-marginal likelihood, which motivates the use of calibration-based surrogates rather than exact tokenization invariants.

The net effect is a separation between what is feasible and what is not: while exact invariance to tokenization at the level of string likelihood is computationally out of reach, simple calibration against a tokenizer-specific unigram model suffices, under transparent assumptions, to support cross-family forecasting of emergence thresholds with controlled error. The remainder of the paper elaborates these points, and the subsequent section situates the approach within prior work on emergent abilities, loss-based thresholds, and normalization schemes.


2. Related Work: emergent abilities (Wei et al.), metric artifacts debate, loss-based thresholds (Du et al.), normalized perplexity/bpb ideas, and why cross-family forecasting remains unresolved.

The empirical literature on emergent abilities in large language models is now substantial. A prominent line of work, exemplified by Wei et al., reports qualitative phase-transition-like behavior: as scale or training progresses, performance on certain benchmarks remains near a baseline and then rises sharply over a comparatively narrow range of checkpoints. Subsequent studies broaden this catalog across reasoning, instruction following, and tool use, and propose different operational definitions of ``emergence'' (e.g., threshold crossing versus changes in curvature of a scaling curve). For our purposes, the key commonality is methodological: emergence is typically diagnosed by plotting an evaluation metric against a scalar proxy for progress (often compute, parameter count, or a loss-like quantity) and identifying a rapid transition region.

A parallel thread debates to what extent apparent emergence is an artifact of measurement, metric choice, or plotting conventions rather than a property of the underlying competence. Several authors argue that discontinuities can be induced by saturating metrics, floor effects, or transformations (e.g., log scales), and that many ``emergentโ€™โ€™ curves become smooth when examined in a more appropriate coordinate system or with uncertainty properly accounted for; conversely, others emphasize that thresholded downstream success criteria (such as exact match) naturally yield sharp transitions even if an underlying latent ability improves smoothly. While we do not attempt to adjudicate this debate in general, it motivates a conservative stance: any cross-model forecasting method should be explicit about (i) the proxy used for progress, (ii) the baselines defining nontriviality, and (iii) the distortions introduced by representation choices such as tokenization. In particular, even if task performance as a function of latent capability is smooth, mis-calibrated progress proxies can create or destroy the appearance of threshold behavior across heterogeneous model families.

Loss-based thresholding has been proposed as a principled alternative to compute-based heuristics. A representative example is Du et al., which argues that, within a controlled setting, a fixed loss threshold on a reference distribution can predict when a capability becomes available, and that loss is a more direct measure of learned structure than parameter count or training FLOPs. This perspective aligns with classical information-theoretic intuitions: cross-entropy is an operational measure of predictive uncertainty, and improvements in loss can be compared across checkpoints even when training schedules differ. However, existing demonstrations are largely within-family (or within a small set of closely matched tokenizers and preprocessing pipelines), where the mapping from training to evaluation is relatively stable. The unresolved difficulty is that loss values are only directly comparable when the underlying token stream and normalization are comparable, an assumption that fails for cross-family comparisons involving different tokenizers or byte-level conventions.

Normalization schemes for language modeling metrics have a longer history, particularly in the evaluation of compression and character-level modeling. Bits-per-character (bpc) and bits-per-byte (bpb) provide tokenization-agnostic units by measuring negative log-likelihood per character/byte under a canonical encoding; such measures are meaningful when the model directly defines a distribution over characters/bytes or when one can efficiently evaluate the probability of the canonical sequence. Related ideas appear in byte-level tokenizers and in comparisons between word-based and subword-based models, where perplexity is known to be sensitive to segmentation length. In practice, however, most contemporary LLMs expose only next-token probabilities over a tokenizer-specific vocabulary, and are evaluated in nats/bits per token under that tokenizer. Converting these token-level probabilities to a true byte-level likelihood generally requires marginalizing over all tokenizations consistent with a byte string, which is computationally prohibitive in the black-box autoregressive setting. Thus, while bpb is an attractive conceptual target, it is not directly available for arbitrary token-based checkpoints without additional structural access.

Several works propose heuristic adjustments intended to improve cross-model comparability of loss or perplexity, including length normalization choices, mixture-of-tokenizers evaluation, and calibrations against simple baselines (e.g., uniform or unigram models) to factor out vocabulary and segmentation effects. The underlying intuition is that raw per-token cross-entropy conflates two components: (i) entropy intrinsic to the token stream induced by a tokenizer on the evaluation corpus, and (ii) the model-dependent reduction in uncertainty due to learned higher-order structure. This is closely related to classical decompositions in language modeling between unigram entropy and higher-order gains, and to the observation that tokenization can shift entropy by changing the granularity at which uncertainty is measured. Nonetheless, prior normalization proposals are often presented as empirical recipes rather than as components of an explicit identification problem: namely, how to recover a tokenizer-invariant notion of progress from token-level losses when the evaluation interface precludes exact tokenization marginalization.

The gap we identify, and which motivates our formulation, is therefore not the absence of progress proxies, but the absence of a proxy with a clear invariance target and an accompanying forecasting protocol. Existing emergence and threshold papers either (a) fix a single family/tokenizer and thereby avoid comparability issues, or (b) compare across models but implicitly assume that raw loss/perplexity is commensurate across tokenizers, an assumption contradicted by well-known segmentation effects. Conversely, tokenization-invariant evaluations (character/byte likelihood) are conceptually clean but operationally unavailable for many checkpoints. What remains unresolved is how to obtain, from black-box next-token probabilities alone, a progress measure that reduces tokenizer-dependent offsets sufficiently to support portable threshold forecasts across heterogeneous families, without requiring retraining, access to internal representations, or intractable marginalization.


3. Problem Setup and Definitions: calibration corpus ๐’ž, model families, tokenizers, downstream tasks, random baselines, and the target forecasting task (threshold crossing).

We formalize a setting in which we wish to forecast the onset of above-baseline downstream task performance across heterogeneous model families, using only black-box access to next-token log-probabilities on a fixed public text corpus. The central complication is that the natural loss observable for a checkpoint is measured in nats per token under that checkpoint's tokenizer, and tokenization varies substantially across families; thus, raw loss values are not directly comparable without additional structure.

Let โ„ฑ denote a set of model families. A family fโ€„โˆˆโ€„โ„ฑ is characterized by (i) a tokenizer Tf with vocabulary VTf, (ii) a pretraining procedure (including corpus and optimization schedule), and (iii) a sequence of publicly available checkpoints indexed by training progress.
A checkpoint is denoted by θ, and we write T(θ) for its tokenizer when the dependence matters.
Each checkpoint defines an autoregressive distribution over token sequences,
$$ p_\theta(t_{1:k}) \;=\; \prod_{j=1}^k p_\theta(t_j \mid t_{<j}), \qquad t_j\in V_{T(\theta)}. $$
We assume only the standard inference interface: given a prefix (in tokens) we can query the conditional log-probabilities logโ€†pฮธ(โ‹…โ€…โˆฃโ€…prefix) over VT(ฮธ). No gradient access, retraining, or internal-state inspection is assumed.

We fix a public calibration corpus ๐’ž consisting of canonical text strings with a fixed preprocessing pipeline. Concretely, ๐’ž is a multiset (or distribution) over UTF-8 byte sequences, and all normalization decisions (newlines, whitespace, Unicode normalization, document boundaries) are specified once and held constant. This canonicalization is part of the definition of ๐’ž and is shared across all evaluations.
We emphasize that 𝒞 serves as a fixed, shared reference for measuring progress; it is not tuned per family and does not depend on any particular downstream task.

Given a tokenizer T, each string xโ€„โˆˆโ€„๐’ž deterministically tokenizes to a token sequence T(x)โ€„=โ€„(t1,โ€†โ€ฆ,โ€†t|T(x)|).
Using teacher forcing, we define the per-token cross-entropy of checkpoint ฮธ on ๐’ž under tokenizer T as
$$ \mathrm{CE}_{\mathcal{C}}^{T}(\theta) \;:=\; \frac{\mathbb{E}_{x\sim \mathcal{C}}\big[-\log p_\theta\big(T(x)\big)\big]}{\mathbb{E}_{x\sim \mathcal{C}}\big[|T(x)|\big]} \;=\; \frac{\mathbb{E}_{x\sim \mathcal{C}}\big[-\sum_{j=1}^{|T(x)|}\log p_\theta(t_j\mid t_{<j})\big]}{\mathbb{E}_{x\sim \mathcal{C}}\big[|T(x)|\big]}. $$
In practice we estimate CE๐’žT(ฮธ) from n samples of ๐’ž (or from a fixed finite list), by streaming through each T(x) and aggregating token-level losses. We take the normalization by expected token length to emphasize that the observable is inherently tokenizer-dependent: even for the same underlying byte sequence, the induced token stream and its entropy vary with T.

Let ๐’ฏ be a fixed suite of downstream tasks. Each task tโ€„โˆˆโ€„๐’ฏ specifies (i) a data distribution Dt over evaluation instances, (ii) an evaluation procedure (prompting format, decoding rule, and scoring), and (iii) a scalar metric mt mapping model outputs to [0,โ€†1] (e.g., accuracy, exact match, or verifier-based success probability).
We denote by Pt(ฮธ)โ€„โˆˆโ€„[0,โ€†1] the performance of checkpoint ฮธ under this procedure, i.e.,
Pt(ฮธ)โ€…:=โ€…๐”ผ(x,โ€†y)โ€„โˆผโ€„Dt[mt(ฮธ;โ€†x,โ€†y)],
with the understanding that the expectation may include randomness from decoding (if any) and from the metric (e.g., stochastic verifier). Each task also comes with a random baseline bt, defined by the expected score of a task-specific randomized strategy (e.g., uniform guessing among choices, sampling an answer from a prescribed distribution, or returning an empty string when appropriate). The baseline bt fixes what we mean by ``nontrivial'' performance on task t.

Fix a tolerance parameter ฯตโ€„>โ€„0. Our primary forecasting question is: for a given checkpoint ฮธ and task t, can we predict whether ฮธ exceeds baseline by margin ฯต, i.e.,
๐Ÿ™{Pt(ฮธ)โ€„>โ€„btโ€…+โ€…ฯต}?
Equivalently, we may wish to estimate the emergence threshold for each task: the value of a progress coordinate at which task performance reliably rises above baseline (up to noise). The key requirement is portability: the predictor should remain accurate when applied to families with different tokenizers and pretraining corpora, using only the shared calibration corpus 𝒞 and the standardized task evaluations.

We assume access to a set of checkpoints {ฮธi} across families, and we can compute (i) calibration losses CE๐’žT(ฮธi)(ฮธi) via teacher forcing on ๐’ž, and (ii) task performances Pt(ฮธi) for some subset of tasks/checkpoints (possibly limited by evaluation cost).
A typical use case is that task evaluations are available for a sparse set of checkpoints, while calibration losses on ๐’ž are cheap enough to compute broadly.
We evaluate forecasting methods by leave-one-family-out procedures: fit any task-specific mapping from the available labeled pairs (progress proxy(θi), Pt(θi)) using all but one family, then test threshold-crossing prediction on the held-out family. This protocol directly measures robustness to tokenizer and corpus shifts at the family boundary.

The above definitions make explicit what is and is not observable: the interface yields token-level likelihoods under each modelโ€™s tokenizer, but does not provide a tokenizer-invariant likelihood of the underlying byte string.
Consequently, the same ๐’ž induces different token streams T(๐’ž) across families, and the mapping ฮธโ€„โ†ฆโ€„CE๐’žT(ฮธ)(ฮธ) mixes (a) model progress and (b) representation-dependent offsets.
The remainder of the paper proposes a calibrated progress proxy that factors out the dominant tokenizer-dependent component on ๐’ž and yields a common coordinate in which task thresholds can be estimated and transferred across families.


4. Calibrated Progress Metric L*: construction, variants (additive vs multiplicative normalization; domain-sliced calibration), and design principles (public, stable, hard-to-game).

The observable CE๐’žT(ฮธ) conflates two effects: (i) the extent to which ฮธ captures multi-token structure of strings drawn from ๐’ž, and (ii) tokenizer-induced changes in the marginal entropy of the induced token stream. We therefore introduce a calibrated proxy that removes a large, model-independent portion of the tokenizer dependence by subtracting a unigram baseline computed on the tokenization of the canonical corpus.

Fix a tokenizer T and consider the multiset of tokens obtained by applying T to ๐’ž. Let pฬ‚T denote the empirical unigram distribution on VT induced by token counts on T(๐’ž), i.e.,
$$ \widehat{p}_T(v) \;=\; \frac{c_T(v)}{\sum_{u\in V_T} c_T(u)} \qquad (v\in V_T), $$
and define UT to be the maximum-likelihood unigram model UT(v)โ€„=โ€„pฬ‚T(v). The corresponding cross-entropy on ๐’ž is
$$ \mathrm{CE}_{\mathcal{C}}^{T}(U_T) \;=\; \frac{\mathbb{E}_{x\sim\mathcal{C}}\big[-\sum_{j=1}^{|T(x)|}\log U_T(t_j)\big]}{\mathbb{E}_{x\sim\mathcal{C}}[|T(x)|]}. $$
Since UT ignores context, CE๐’žT(UT) is dominated by the marginal token entropy induced by T on ๐’ž, and thus captures the leading tokenizer-dependent offset that makes raw losses incomparable.

For a checkpoint ฮธ with tokenizer Tโ€„=โ€„T(ฮธ) we define
L*(ฮธ)โ€…:=โ€…CE๐’žT(ฮธ)โ€…โˆ’โ€…CE๐’žT(UT).
Equivalently, by expanding both terms and regrouping, we may write L*(ฮธ) as an average log-likelihood ratio against the unigram baseline:
$$ L^*(\theta) = \frac{\mathbb{E}_{x\sim\mathcal{C}}\Big[\sum_{j=1}^{|T(x)|}\log \frac{U_T(t_j)}{p_\theta(t_j\mid t_{<j})}\Big]}{\mathbb{E}_{x\sim\mathcal{C}}[|T(x)|]}. $$
Thus L*(θ) ≤ 0 for any θ that improves upon the unigram predictor on average, and ``more negative'' means ``more structure captured beyond bag-of-tokens statistics'' on the calibration distribution. The crucial point is that the subtraction is performed within a tokenizer: UT changes with T, but is shared across all checkpoints that use T, so L* removes the dominant translation term that depends on how T segments 𝒞.

The additive normalization above is the simplest and most stable: it preserves units (nats per token) and directly cancels an offset term in the affine distortion model. One may also consider multiplicative variants such as
$$ L^{\mathrm{rel}}(\theta) \;:=\; \frac{\mathrm{CE}_{\mathcal{C}}^{T}(\theta)}{\mathrm{CE}_{\mathcal{C}}^{T}(U_T)}, \qquad\text{or}\qquad L^{\mathrm{gain}}(\theta) \;:=\; \frac{\mathrm{CE}_{\mathcal{C}}^{T}(U_T)-\mathrm{CE}_{\mathcal{C}}^{T}(\theta)}{\mathrm{CE}_{\mathcal{C}}^{T}(U_T)}. $$
These ratios partially adjust for scale differences across tokenizers (e.g., token vocabularies that induce systematically higher marginal entropies), at the cost of amplifying estimation noise when CE𝒞T(UT) is small and, more importantly, of no longer exactly matching an ``unknown additive offset'' model. In our setting, where the leading non-comparability arises from tokenization-dependent offsets rather than slopes, we treat the additive L* as the default; multiplicative normalizations may be used as sensitivity checks when families exhibit evidence of heterogeneous slopes aT in the affine model.

A second normalization axis concerns the denominator. Our definition measures per-token improvement. If desired, one can convert to a per-byte (or per-character) scale by multiplying by the expected tokens-per-byte factor ๐”ผ[|T(x)|]/๐”ผ[bytes(x)] on ๐’ž. We do not rely on such conversions for the theorems, but note that they can improve interpretability when comparing tokenizers with very different average token lengths.

A single corpus ๐’ž may contain heterogeneous regimes (natural language, code, tables, mathematics), and tasks often depend disproportionately on a particular regime. To accommodate this, we may partition ๐’ž into slices {๐’žs}sโ€„โˆˆโ€„S using a public, deterministic classifier (e.g., file-type metadata or regex-based heuristics on the canonical byte strings). For each slice we define
Ls*(ฮธ)โ€…:=โ€…CE๐’žsT(ฮธ)โ€…โˆ’โ€…CE๐’žsT(UT,โ€†s),
where UT,โ€†s is the unigram model estimated from T(๐’žs). This yields a vector L*(ฮธ)โ€„=โ€„(Ls*(ฮธ))sโ€„โˆˆโ€„S that can be mapped to task performance via a monotone model in a low-dimensional space (e.g., an isotonic fit on a weighted sum โˆ‘swsLs* with wsโ€„โ‰ฅโ€„0). Slice-wise calibration has two roles: it reduces confounding when progress is uneven across domains, and it provides diagnostics for why a task threshold transfers (or fails to transfer) across families.
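Below is a minimal sketch of the slice-wise fit (the two-stage procedure, the nonnegative-least-squares step, and all names are illustrative choices, not part of the protocol):

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.isotonic import IsotonicRegression

def fit_sliced_progress(L_slices, perf):
    """Map slice-wise calibrated losses to task performance.

    L_slices: (m, |S|) array of L*_s(theta_i) per checkpoint and slice.
    perf:     (m,) array of task performances in [0, 1].
    Two-stage illustration: nonnegative least squares for the slice
    weights w_s >= 0, then isotonic regression on the combined scalar.
    """
    L_slices, perf = np.asarray(L_slices, float), np.asarray(perf, float)
    # L* <= 0 and more negative means more progress, so regress
    # performance on -L_slices to keep the fitted weights nonnegative.
    w, _ = nnls(-L_slices, perf)
    score = -L_slices @ w                      # scalar progress coordinate
    iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
    iso.fit(score, perf)
    return w, iso
```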

Because ๐’ž is public, we cannot appeal to secrecy to prevent overfitting; instead we require that the metric be stable and difficult to game in any targeted way that would generalize poorly. We therefore adopt the following principles. (i) ๐’ž and its canonicalization are versioned and fully specified, so that independent parties compute identical L*. (ii) subtracting CE๐’žT(UT) makes L* insensitive to large changes in marginal token entropy induced by segmentation choices, thereby discouraging ``tokenizer engineeringโ€™โ€™ as a way to fabricate apparent progress. (iii) ๐’ž should be sufficiently diverse and large that memorization of particular strings does not substantially change per-token cross-entropy, and improvements on ๐’ž correlate with general improvements in modeling structure. (iv) if domain slicing is used, the slice definition is public and deterministic, preventing post hoc selection of slices to favor a given family.

Having fixed L* and its variants, we next describe how to compute CE๐’žT(ฮธ) and UT efficiently for many checkpoints, and how to fit monotone task curves and estimate emergence thresholds with uncertainty.


5. Algorithms: computing CE๐’žT(ฮธ), building unigram baseline UT, fitting monotone performance curves and estimating thresholds with uncertainty.

All quantities in our framework are computable from black-box next-token log-probabilities together with a public, canonically preprocessed calibration corpus. We therefore separate the procedure into (i) tokenizer-level preprocessing that is shared across all checkpoints using a given tokenizer, and (ii) checkpoint-level evaluation that requires only teacher-forced forward passes.

We assume each example xโ€„โˆˆโ€„๐’ž is represented as a fixed UTF-8 byte sequence under a versioned canonicalization (normalization, newline conventions, and any filtering are part of the corpus specification). For each tokenizer T appearing among our checkpoints, we run a single streaming pass over ๐’ž to obtain tokenized sequences T(x)โ€„=โ€„(t1,โ€†โ€ฆ,โ€†t|T(x)|) and accumulate token counts
$$ c_T(v)\;:=\;\sum_{x\in\mathcal{C}}\sum_{j=1}^{|T(x)|}\mathbf{1}\{t_j=v\}, \qquad v\in V_T. $$
In practice we do not store all tokenized sequences; we store (a) the count table cT(โ‹…), (b) the total token mass NTโ€„:=โ€„โˆ‘vcT(v), and (c) optional per-example token lengths |T(x)| if we wish to report auxiliary diagnostics such as tokens-per-byte. This tokenizer pass is amortized across all checkpoints that use T.

Given counts, the maximum-likelihood unigram model is UT(v)โ€„=โ€„cT(v)/NT. We then compute the unigram cross-entropy on ๐’ž without any further access to model probabilities, using only the counts:
$$ \mathrm{CE}_{\mathcal{C}}^{T}(U_T) \;=\; -\frac{1}{N_T}\sum_{v\in V_T} c_T(v)\log\frac{c_T(v)}{N_T}. $$
We treat tokens with cT(v)โ€„=โ€„0 as absent from the empirical support and omit them from the sum. No smoothing is needed because the baseline is evaluated on the same sample from which it is estimated; however, if one wishes to reserve a small held-out subset of ๐’ž for auditing, a smoothed unigram (e.g., add-ฮฑ) can be substituted with no change to the downstream fitting procedure.
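The counting pass and the baseline constant follow directly from the formulas above; a minimal sketch (the `tokenize` callback is a stand-in for any concrete tokenizer):

```python
import math
from collections import Counter

def unigram_baseline(corpus, tokenize):
    """Single streaming pass for tokenizer T: accumulate token counts
    c_T(v) over T(C) and return (counts, N_T, CE_C^T(U_T)).
    `corpus` is an iterable of canonical byte strings; `tokenize`
    maps one string to its token sequence."""
    counts = Counter()
    for x in corpus:
        counts.update(tokenize(x))
    N = sum(counts.values())
    # The CE of the ML unigram evaluated on the same sample is exactly
    # the empirical token entropy: -(1/N) * sum_v c(v) log(c(v)/N).
    ce_unigram = -sum((c / N) * math.log(c / N) for c in counts.values())
    return counts, N, ce_unigram
```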

For a checkpoint ฮธ with tokenizer T, we estimate CE๐’žT(ฮธ) via a streaming teacher-forced loss over tokenized examples. Writing T(x)โ€„=โ€„(t1,โ€†โ€ฆ,โ€†tn), we query the model for logโ€†pฮธ(tjโ€…โˆฃโ€…tโ€„<โ€„j) along the prefix and accumulate
$$ S_\theta\;:=\;\sum_{x\in\mathcal{C}}\sum_{j=1}^{|T(x)|} -\log p_\theta(t_j\mid t_{<j}), \qquad N_T\;:=\;\sum_{x\in\mathcal{C}} |T(x)|. $$
The estimator is then $\widehat{\mathrm{CE}}_{\mathcal{C}}^{T}(\theta)=S_\theta/N_T$. This computation is linear in the number of calibration tokens and uses only forward passes. For numerical stability we recommend accumulating in double precision and, where available, requesting log-probabilities directly rather than probabilities. If the model imposes a context window, we split long T(x) into overlapping segments and include each tokenโ€™s loss exactly once (standard sliding-window evaluation); since ๐’ž is fixed, this policy is itself part of the public specification.
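A sketch of the streaming teacher-forced estimator, assuming a HuggingFace-style causal LM whose forward pass returns logits (that interface is an assumption; any API exposing next-token log-probabilities works the same way):

```python
import torch

def teacher_forced_ce(model, token_ids_per_doc, ctx=2048, stride=1024):
    """Estimate CE_C^T(theta): total negative log-likelihood over all
    calibration tokens divided by the total token count. Long documents
    are split into overlapping windows and each token's loss is counted
    exactly once (sliding-window evaluation)."""
    total_nll, total_tokens = 0.0, 0
    for ids in token_ids_per_doc:
        start = 0
        while True:
            window = ids[start:start + ctx]
            with torch.no_grad():
                logits = model(torch.tensor([window])).logits  # (1, len, |V|)
            logp = torch.log_softmax(logits.double(), dim=-1)
            # Score only tokens not already scored by the previous window.
            first = 1 if start == 0 else ctx - stride
            for j in range(first, len(window)):
                total_nll -= logp[0, j - 1, window[j]].item()
                total_tokens += 1
            if start + ctx >= len(ids):
                break
            start += stride
    return total_nll / total_tokens

# L_hat_star = teacher_forced_ce(...) - ce_unigram  (tokenizer constant)
```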

Once $\widehat{\mathrm{CE}}_{\mathcal{C}}^{T}(\theta)$ is available, the calibrated metric is computed by subtraction using the precomputed tokenizer constant:
$$ \widehat{L}^*(\theta) \;=\; \widehat{\mathrm{CE}}_{\mathcal{C}}^{T}(\theta)-\mathrm{CE}_{\mathcal{C}}^{T}(U_T). $$
Because CE๐’žT(UT) depends only on T and ๐’ž, it is computed once per tokenizer and reused for all checkpoints. This is the primary mechanism by which we obtain cross-family comparability without modifying any model.

For each task t ∈ 𝒯, suppose we have evaluated a set of checkpoints {θi}i=1m to obtain empirical performances P̂t(θi) (e.g., accuracy or pass@1) together with the corresponding L̂*(θi). We fit a monotone mapping ĝt from L* to performance. Two concrete instantiations are convenient: (i) isotonic regression, which imposes monotonicity and nothing else and yields a piecewise-constant fit; and (ii) a parametric logistic curve with a fitted change-point, which is smoother and can extrapolate but adds shape assumptions.

We emphasize that monotonicity is the only structural constraint needed for the subsequent theory; the isotonic choice is thus the default.

Fix a margin ϵ > 0. Because performance is nonincreasing in L* (more negative values indicate more progress), we define the empirical emergence threshold as the largest ℓ at which the fitted curve exceeds baseline by ϵ:
η̂t := sup {ℓ : ĝt(ℓ) ≥ bt + ϵ}.
For isotonic regression, ĝt is defined on intervals; we take η̂t to be the right endpoint of the last fitted block (scanning from most negative L*) whose value is at least bt + ϵ. When checkpoints are sparse near the crossing, we may optionally report an interval bracketed by the largest L̂* with ĝt ≥ bt + ϵ and the smallest with ĝt < bt + ϵ.
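A minimal isotonic fit and threshold extraction using scikit-learn; note the sign convention (performance is nonincreasing in L*, so the threshold is the largest L* whose fitted value clears the margin):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_and_threshold(L_star, perf, baseline, eps, weights=None):
    """Fit a nonincreasing curve g_t(L*) and return eta_hat, the largest
    observed L* whose fitted value reaches baseline + eps (None if no
    checkpoint crosses)."""
    L_star = np.asarray(L_star, float)
    iso = IsotonicRegression(increasing=False, out_of_bounds="clip")
    iso.fit(L_star, np.asarray(perf, float), sample_weight=weights)
    grid = np.sort(L_star)
    above = grid[iso.predict(grid) >= baseline + eps]
    return float(above.max()) if above.size else None
```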

Two sources of noise enter: calibration-loss estimation (finite |๐’ž| and model stochasticity, if any) and task-evaluation noise (finite test set, sampling in generative evaluation). We therefore attach uncertainty to ฮทฬ‚t using either (i) a nonparametric bootstrap over checkpoints, resampling the pairs (Lฬ‚*(ฮธi),โ€†Pฬ‚t(ฮธi)) and refitting gฬ‚t to obtain a bootstrap distribution of ฮทฬ‚t, or (ii) an analytic delta-style approximation when using a parametric curve. If per-checkpoint task evaluations provide standard errors (e.g., binomial confidence intervals), we incorporate them as weights in isotonic regression, improving stability when some checkpoints are evaluated with fewer trials.
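A sketch of option (i), the checkpoint-level bootstrap, reusing fit_and_threshold from the previous sketch:

```python
import numpy as np

def bootstrap_threshold_ci(L_star, perf, baseline, eps,
                           n_boot=2000, alpha=0.1, seed=0):
    """Percentile bootstrap interval for eta_hat: resample the
    (L*, performance) pairs with replacement and refit each time."""
    rng = np.random.default_rng(seed)
    L_star, perf = np.asarray(L_star), np.asarray(perf)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(L_star), size=len(L_star))
        eta = fit_and_threshold(L_star[idx], perf[idx], baseline, eps)
        if eta is not None:           # some resamples may show no crossing
            draws.append(eta)
    return tuple(np.quantile(draws, [alpha / 2, 1 - alpha / 2]))
```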

To assess portability, we recommend leave-one-family-out evaluation: fit gฬ‚t (and thus ฮทฬ‚t) using all but one family, then predict which held-out checkpoints satisfy Pฬ‚t(ฮธ)โ€„>โ€„btโ€…+โ€…ฯต by thresholding gฬ‚t(Lฬ‚*(ฮธ)). This protocol directly targets the cross-tokenizer generalization problem and motivates the theoretical analysis that follows.

In the next section we formalize the latent affine distortion model underpinning these procedures and derive finite-sample error bounds (and matching lower bounds) for threshold prediction across heterogeneous tokenizers.


6. Theory I โ€” Latent Affine Distortion Model: formal assumptions, estimator for latent progress using L*, and finite-sample error bounds for threshold prediction; matching lower bounds for raw-loss/compute predictors.

We now formalize the modeling assumptions that render the algorithmic quantities from the previous section predictive across heterogeneous families, and we isolate the precise obstruction faced by any method that does not calibrate out tokenizer-dependent offsets.

Fix the public calibration corpus 𝒞 with canonical preprocessing. For a given tokenizer T, the calibration loss CE𝒞T(θ) depends both on the underlying predictive structure learned by the checkpoint θ and on tokenizer-induced artifacts (e.g., average token length, frequency of rare tokens, and boundary conventions) that shift per-token cross-entropy. We encode this separation via a tokenizer-invariant latent progress variable z(θ) ∈ ℝ (oriented like a loss: smaller z means more progress) and tokenizer-specific affine distortions:

$$ \mathrm{CE}_{\mathcal{C}}^{T}(\theta) \;=\; a_T\, z(\theta) \;+\; b_T \;+\; \xi, \qquad \text{(H1)} $$

where aT > 0 and bT ∈ ℝ depend only on T (and 𝒞), and ξ is mean-zero noise capturing both finite-sample estimation and residual model mismatch. In particular, even if two checkpoints have identical z(θ), their raw cross-entropies can differ by bT if they use different tokenizers, and thus raw loss is not directly comparable across families.

The key observation is that the dominant tokenizer dependence in (H1) is an additive offset. We therefore seek a quantity computable from 𝒞 and T alone that tracks bT. The empirical unigram baseline UT is defined by token counts on T(𝒞), and its cross-entropy CE𝒞T(UT) can be computed without querying any checkpoint. Our calibration hypothesis is that this unigram cross-entropy captures the tokenizer-induced offset up to a tokenizer-dependent but model-independent discrepancy:

$$ \mathrm{CE}_{\mathcal{C}}^{T}(U_T) \;=\; b_T \;+\; \delta_T, \qquad \text{(H1')} $$

where δT is bounded (or at least does not vary adversarially across tokenizers used in practice). Hypothesis (H1') is not tautological; rather, it asserts that the aspects of tokenization that shift per-token loss for all models (e.g., the empirical entropy of token types under T) are largely captured by the unigram entropy.

In practice we observe $\widehat{\mathrm{CE}}_{\mathcal{C}}^{T}(\theta)$ computed by teacher forcing on a finite sample of calibration tokens (or examples) of size n. Under standard concentration assumptions (e.g., per-token log-loss is sub-exponential under ๐’ž and the model), we may treat the estimation error as ฯƒ-sub-Gaussian at the corpus level:
$$ \widehat{L}^*(\theta)=L^*(\theta)+\varepsilon_{\mathrm{cal}},\qquad \varepsilon_{\mathrm{cal}}\ \text{is mean-zero and}\ O(\sigma/\sqrt{n}). $$
The essential point is that calibration does not increase estimation variance: CE๐’žT(UT) is a deterministic function of the token counts (and can be computed exactly on the same sample), so uncertainty arises primarily from model loss evaluation and any held-out auditing choice.

Fix a downstream task t with baseline bt and performance Pt(θ) ∈ [0, 1]. We posit a monotone link (H2) between latent progress and task performance:
Pt(θ) = φt(z(θ)) + ϵt,
where φt is nonincreasing (performance improves as the loss-like latent z decreases) and L-Lipschitz, and ϵt is mean-zero σt-sub-Gaussian measurement noise arising from finite evaluation budgets. Define the (latent) emergence threshold ηt as the largest z such that φt(z) ≥ bt + ϵ for a user-chosen margin ϵ > 0. Given m labeled checkpoints, we estimate a monotone curve ĝt in L*-space (e.g., isotonic regression), and set η̂t = sup {ℓ : ĝt(ℓ) ≥ bt + ϵ} as in the algorithmic section.

Theorem 2 (informal; threshold transfer). Assume (H1), (H1'), and (H2) with sub-Gaussian calibration and evaluation noise, bounded δT, and a nondegenerate local slope of φt at the crossing. Then, with high probability over m labeled checkpoints, threshold-crossing predictions based on η̂t err only on checkpoints whose latent progress lies within O((σ + σt) m^{-1/2}) of the crossing. The proof combines concentration of L̂*(θ) around aTz(θ), standard risk bounds for isotonic regression under sub-Gaussian noise, and a local inverse-Lipschitz transfer through the monotone map at the crossing; importantly, the bound depends on noise and sample size but not on unknown tokenizer offsets.
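A small synthetic illustration of the phenomenon behind Theorems 2 and 3 (all numbers invented): three hypothetical families share the same latent progress scale but carry different tokenizer offsets, so a single raw-loss threshold transfers poorly while the offset-free coordinate supports a common threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

offsets = {"A": 0.9, "B": 0.2, "C": -0.5}     # family-specific b_T (delta_T ~ 0)
z = rng.uniform(0.5, 3.0, size=90)            # loss-like latent: lower = better
fam = np.repeat(list(offsets), 30)
b_T = np.array([offsets[f] for f in fam])

raw_ce = z + b_T + rng.normal(0, 0.02, 90)    # (H1) with a_T = 1
L_star = z + rng.normal(0, 0.02, 90)          # unigram subtraction cancels b_T

perf = 1 / (1 + np.exp(8 * (z - 1.5)))        # nonincreasing link phi_t
crossed = perf > 0.25 + 0.05                   # baseline 0.25, margin 0.05

for coord, name in [(raw_ce, "raw CE"), (L_star, "L*")]:
    eta = coord[crossed].max()                 # one cross-family threshold
    err = np.mean((coord <= eta) != crossed)
    print(f"{name}: threshold-crossing error = {err:.2f}")
# Typical output: raw CE misclassifies a sizable fraction; L* is near zero.
```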

We finally state the obstruction: any estimator that uses only raw loss CE๐’žT(ฮธ) (or compute/parameter count proxies that correlate with loss within a family) cannot, in general, remove bT without additional information tied to T. This yields an irreducible cross-tokenizer error.

Theorem 3 (informal; lower bound for uncalibrated predictors). For any predictor that depends on a checkpoint only through raw CE𝒞T(θ) (or through compute or parameter count), there exist tokenizer/checkpoint pairs whose observable statistics agree up to noise while their latent progress differs by an amount governed by the offset mismatch |bT − bT′|; such a predictor therefore incurs threshold-classification error bounded below in terms of that mismatch.

The proof is a two-point indistinguishability argument: by choosing instances where the observable statistics coincide up to noise but the latent mapping differs via bT, any uncalibrated rule must err by an amount proportional to the offset mismatch. Thus, Theorem 2 is essentially tight in its dependence on calibration.

We next address a complementary question: if one wished to avoid tokenizer dependence entirely by marginalizing over all segmentations of a string, can this be done efficiently? The answer is negative in general, and it motivates L* as the appropriate surrogate under realistic resource constraints.


7. Theory II โ€” Tokenization Intractability: #P-hardness of exact segmentation-invariant likelihood for autoregressive models; implications and why L* is the right surrogate under resource constraints.

A natural response to the tokenizer-dependence exhibited by per-token cross-entropy is to attempt to eliminate tokenization from the definition of likelihood altogether. Concretely, fix a canonical byte string x (e.g., an element of the calibration corpus 𝒞). For a tokenizer T with vocabulary VT, let decodeT be the deterministic map from token sequences s = (t1, …, tk) ∈ VT* to byte strings, and define the set of segmentations of x by
๐’ฎT(x)โ€„:=โ€„{sโ€„โˆˆโ€„VT*:ย decodeT(s)โ€„=โ€„x}.
Given an autoregressive model pฮธ over VT we may then define the segmentation-marginal (tokenization-invariant) likelihood
$$ p_\theta^{\mathrm{marg}}(x)\ :=\ \sum_{s\in \mathcal{S}_T(x)} p_\theta(s) \qquad\text{where}\qquad p_\theta(s)=\prod_{j=1}^{|s|} p_\theta(t_j \mid t_{<j}). $$
One could correspondingly define a string-level calibration loss โˆ’logโ€†pฮธmarg(x) and average it over xโ€„โˆผโ€„๐’ž. If computable, such a quantity would remove dependence on a particular segmentation T(x) and might appear to dominate our calibrated surrogate L*.

The obstruction is that the summand pฮธ(s) depends on the entire token prefix tโ€„<โ€„j through the autoregressive conditionals. Even when ๐’ฎT(x) admits a compact representation as a tokenization lattice (a DAG whose paths correspond to valid segmentations), the value of a path cannot be written as a product of edge weights that depend only on local lattice state, because the conditional at position j is a function of the full prefix sequence, not merely of the already-decoded substring. Thus the standard forward algorithm for HMMs (or weighted finite-state acceptors) is unavailable absent strong additional structure.

Theorem 4 (informal; intractability of segmentation marginals). There is a family of tokenizers and autoregressive models, each with polynomial-time computable conditionals and decode maps, for which exact computation of pθmarg(x) is #P-hard.

We reduce from counting simple directed s → t paths in a directed graph, a #P-complete problem; note that counting s → t paths in a DAG is polynomial-time by dynamic programming, so the hardness must come from the simplicity constraint, which the model can enforce through its dependence on the full prefix. Let G = (V, E) be a directed graph with distinguished vertices s, t, and let #PATHS(G) denote the number of simple directed s → t paths. We construct a tokenizer T and a fixed string x such that each simple s → t path corresponds bijectively to a token sequence in 𝒮T(x), and we define pθ so that each such sequence has identical probability mass (or, more generally, probability equal to a prescribed path weight), thereby forcing pθmarg(x) to encode #PATHS(G).

One convenient construction is to choose x to be a constant byte string (e.g., x = aL for a suitable length L), and to define the vocabulary VT so that each edge e ∈ E corresponds to a token whose decoded byte string is aℓ(e), with lengths ℓ(e) chosen so that the concatenation constraint forces any token sequence decoding to x to spell out an edge sequence whose lengths sum to exactly L. The tokenization lattice over x then contains a path for each valid sequence of edge-tokens whose lengths sum to L. We then define an autoregressive model pθ whose next-token probabilities implement the adjacency and simplicity constraints of G: conditioned on a prefix encoding a partial path ending at vertex v, the model assigns positive probability only to edge-tokens outgoing from v whose head vertex has not yet been visited (the full token prefix encodes the visited set, so this is a function of the prefix), uniformly over the allowed edges. Under this definition, every complete simple s → t path has probability equal to the product of the reciprocals of the allowed-edge counts along the path, and pθmarg(x) becomes the sum over all such path probabilities. By modifying these conditionals one may instead encode unit weights and obtain pθmarg(x) = #PATHS(G)/Z for an easily computable normalizing constant Z. Since exact computation of #PATHS(G) is #P-complete, exact computation of pθmarg(x) is #P-hard.

The essential point is that the autoregressive dependence on the entire prefix allows the model to ``remember'' the set of visited vertices of the partial path within the hidden state (i.e., within the prefix itself), while the decoding map forces all complete paths to correspond to the same output string x. This completes the reduction at the level of a proof sketch; a full proof fixes the length-coding to ensure polynomial lattice size (and equal total length across simple paths, e.g., via a terminal padding token) and verifies that the constructed conditionals define a valid probability distribution.

Theorem 4 formalizes that tokenization-marginal likelihood is not merely expensive but computationally intractable in the worst case for general autoregressive models. Consequently, any proposal to compare checkpoints across tokenizers by evaluating −log pθmarg(x) on a nontrivial calibration corpus 𝒞 either (i) restricts to special model classes with additional locality properties, (ii) incurs exponential-time dependence on string length in the worst case, or (iii) relies on uncontrolled approximations whose error can be arbitrarily large when probability mass is spread across many segmentations.

One may approximate pฮธmarg(x) by summing over only a subset of segmentations, for instance by enumerating the top-B segmentations under a proposal distribution and computing a truncated sum. This yields a lower bound
pฮธ,โ€†Bmarg(x)โ€„:=โ€„โˆ‘sโ€„โˆˆโ€„๐’ฎT,โ€†B(x)pฮธ(s)ย โ‰คย pฮธmarg(x),
and hence an upper bound on the corresponding negative log-likelihood. However, the gap logโ€†pฮธmarg(x)โ€…โˆ’โ€…logโ€†pฮธ,โ€†Bmarg(x) is controlled by the leftover mass outside the beam, which may be large precisely in regimes where tokenization ambiguity is high. Since our objective is stable cross-family forecasting, a metric whose approximation error can vary unpredictably across tokenizers is not an acceptable foundation.
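For concreteness, a toy beam-truncated marginal over the tokenization lattice of a short string (the `logp_next` callback stands in for an arbitrary autoregressive model; for real vocabularies the untruncated sum is exactly the intractable quantity above):

```python
import math
from heapq import nlargest

def truncated_marginal(x, vocab, logp_next, B=16):
    """Beam lower bound on log p^marg(x): log-sum-exp of model log-
    probabilities over the surviving token sequences decoding to x.
    Keeps the B best partial segmentations per consumed prefix length."""
    beams = {0: [(0.0, ())]}        # consumed chars -> [(logp, tokens)]
    complete = []
    for pos in range(len(x) + 1):
        for lp, toks in nlargest(B, beams.pop(pos, [])):
            if pos == len(x):
                complete.append(lp)
                continue
            for tok in vocab:
                if x.startswith(tok, pos):
                    beams.setdefault(pos + len(tok), []).append(
                        (lp + logp_next(toks, tok), toks + (tok,)))
    if not complete:
        return float("-inf")
    mx = max(complete)
    return mx + math.log(sum(math.exp(lp - mx) for lp in complete))

# Toy usage: uniform dummy conditionals over a 3-token vocabulary.
print(truncated_marginal("aab", ["a", "aa", "b"],
                         lambda prefix, tok: math.log(1 / 3), B=4))
```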

In contrast, L*(θ) = CE𝒞T(θ) − CE𝒞T(UT) is computable in time linear in |T(𝒞)| using only teacher forcing and token counts, and it directly targets the dominant nuisance dependence (the tokenizer-induced offset) in the latent affine distortion model. The preceding section established that, under mild statistical assumptions, this calibration suffices for cross-family threshold prediction with error controlled by noise and sample size rather than by arbitrary tokenizer offsets. Theory II clarifies that there is no generally efficient alternative that achieves true segmentation invariance by exact marginalization. We therefore treat L* not as a heuristic, but as a principled response to a computational barrier: it removes the leading tokenizer effect that is both practically consequential and efficiently estimable, while avoiding an intractable marginal likelihood that would otherwise be the canonical ``tokenizer-free'' quantity.


8. Experimental Protocol (for strengthening claims): datasets, model families, checkpoints, tasks (multiple-choice + verifier-based), and evaluation metrics (continuous + discontinuous); transfer tests (leave-one-family-out).

We evaluate whether the calibrated progress metric L*(ฮธ) supports (i) cross-family prediction of downstream performance curves and (ii) cross-family localization of emergence thresholds relative to task baselines. Our protocol therefore separates (a) computation of L*(ฮธ) on a fixed public calibration corpus from (b) task evaluation on a suite ๐’ฏ, and measures generalization via leave-one-family-out (LOFO) transfer.

We fix a public corpus 𝒞 specified as a canonical list of UTF-8 byte strings together with an explicit normalization pipeline (e.g., newline normalization and exact byte preservation; no additional detokenization). We publish hashes of each element and the concatenation order to make 𝒞 immutable. To control variance in CE𝒞T(⋅) estimates, we choose 𝒞 sufficiently large that n := |T(𝒞)| is on the order of 10^7 tokens for typical tokenizers, and we additionally report results for smaller prefixes (e.g., n ∈ {10^5, 10^6, 10^7}) to empirically validate concentration behavior. We ensure that 𝒞 spans multiple registers (web text, encyclopedic text, code, and formal mathematics) so that unigram offsets induced by tokenization are not artifactually tied to a single domain.

We include multiple open-weight families โ„ฑ that differ in tokenizer, scale, and pretraining mixture. For each family, we sample checkpoints across the training trajectory or across parameter scales so that L*(ฮธ) spans a wide range (including models near random-task baselines). When intermediate training checkpoints are unavailable, we approximate a trajectory by selecting multiple scales within the same family (e.g., {1B,โ€†3B,โ€†7B,โ€†13B} parameters) and treat scale as an imperfect proxy for progress; this is acceptable because our predictors are trained on L* rather than on scale. For each checkpoint we record: tokenizer identifier, vocabulary size |VT|, maximum context length used in evaluation, and decoding defaults. We avoid any post hoc filtering of checkpoints based on task outcomes to prevent selection bias in threshold estimates.

For each tokenizer T appearing in the experiment, we tokenize ๐’ž once and compute unigram counts cT(v) and the maximum-likelihood baseline UT. We then compute CE๐’žT(UT) exactly from counts. For each checkpoint ฮธ using tokenizer T, we compute CE๐’žT(ฮธ) by teacher forcing on T(๐’ž), streaming through tokens to avoid storing per-token losses. We report (i) the mean cross-entropy and (ii) an empirical standard error estimated by block bootstrap over documents in ๐’ž, which captures both finite-sample effects and mild nonstationarity. Finally, we set L*(ฮธ)โ€„=โ€„CE๐’žT(ฮธ)โ€…โˆ’โ€…CE๐’žT(UT) and propagate uncertainty by subtracting the deterministic baseline and retaining the model-loss standard error.

We construct ๐’ฏ to include both discontinuous metrics (accuracy/EM) and continuous metrics (probabilistic scores), because emergence phenomena are often more interpretable under discrete success but more statistically efficient under continuous scoring. We include: (i) multiple-choice knowledge and reasoning tasks, (ii) short-form generation with exact-match or normalized-match, and (iii) verifier-based tasks where success is determined by an external checker (unit tests for code, symbolic verification for algebra, or a deterministic rubric implemented as a program). For each task t we specify a random baseline bt (e.g., uniform choice for multiple-choice, random-string or null output for EM, or a verifier-determined success rate under a randomized policy), and we fix the prompt format, few-shot count, and choice ordering policy to avoid leakage through formatting.

For multiple-choice tasks, we compute both (a) accuracy (argmax choice) and (b) a continuous score given by the normalized probability assigned to the correct option. Concretely, for choices c ∈ {1, …, k} with tokenizations T(c), we compute length-normalized log-likelihoods ℓc := (1/|T(c)|) ∑j log pθ(tc,j ∣ prompt, tc,<j) and define a softmax over {ℓc}; this yields a well-defined continuous measure even when accuracy is saturated or near baseline. For open-generation tasks we report EM (or a canonical normalization thereof) and, when appropriate, a verifier-derived success probability. For all tasks we fix decoding to greedy (temperature 0) unless the task definition requires sampling; in the latter case we use a fixed seed schedule and report Monte Carlo standard errors.
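The continuous multiple-choice score is a softmax over length-normalized option log-likelihoods; a minimal sketch (the input layout is our assumption):

```python
import numpy as np

def choice_scores(option_token_logps):
    """option_token_logps[c]: per-token log-probabilities of option c,
    teacher-forced and conditioned on the prompt. Returns the softmax
    over length-normalized log-likelihoods; the entry for the correct
    option is the continuous metric, argmax gives the accuracy metric."""
    ell = np.array([np.mean(lps) for lps in option_token_logps])
    ell -= ell.max()                 # numerical stability
    return np.exp(ell) / np.exp(ell).sum()
```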

Given evaluated pairs (L*(θi), Pt(θi)), we fit a monotone predictor ĝt using isotonic regression (nonincreasing in L*); we also fit a parametric alternative (a logistic curve with a fitted change-point) as a robustness check. For a fixed margin ϵ > 0, we estimate the emergence threshold in L*-space by
η̂t := sup {ℓ : ĝt(ℓ) ≥ bt + ϵ},
and we report uncertainty intervals by resampling checkpoints (nonparametric bootstrap over the labeled set used for fitting).

To test portability across tokenizers and training mixtures, we perform LOFO evaluation: for each held-out family fโ€„โˆˆโ€„โ„ฑ, we fit gฬ‚t and ฮทฬ‚t using checkpoints from โ„ฑโ€…\โ€…{f} and then predict Pt(ฮธ) and threshold crossing on checkpoints in f. We measure (i) squared error in Pt predictions, (ii) calibration of predicted probabilities for verifier-based success (reliability curves and Brier score), and (iii) classification error for threshold crossing ๐Ÿ™{Pt(ฮธ)โ€„>โ€„btโ€…+โ€…ฯต}. We compare against ablations that replace L* by raw CE๐’žT(ฮธ), parameter count, and FLOPs, using identical fitting procedures, thereby isolating the effect of unigram calibration rather than the curve-fitting method.
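A sketch of the LOFO threshold-crossing evaluation (the record layout is ours; fit_and_threshold is the Section 5 sketch):

```python
import numpy as np

def lofo_threshold_error(records, baseline, eps):
    """records: dicts with keys 'family', 'L_star', 'perf'.
    For each held-out family, fit the threshold on the rest and score
    the binary prediction 1{P_t > b_t + eps} on the held-out family."""
    errors = {}
    for held_out in sorted({r["family"] for r in records}):
        train = [r for r in records if r["family"] != held_out]
        test = [r for r in records if r["family"] == held_out]
        eta = fit_and_threshold([r["L_star"] for r in train],
                                [r["perf"] for r in train], baseline, eps)
        if eta is None:
            continue                  # no crossing observed in training data
        mistakes = [(r["L_star"] <= eta) != (r["perf"] > baseline + eps)
                    for r in test]    # more negative L* => predicted crossed
        errors[held_out] = float(np.mean(mistakes))
    return errors
```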

We publish the exact lists of checkpoints, tokenizer identifiers, ๐’ž hashes, task prompts, evaluation code, and all intermediate aggregates (counts for UT, per-checkpoint CE, and per-task outputs) sufficient to recompute L* and the fitted thresholds. This ensures that subsequent sections on stress tests can vary one axis (e.g., corpus choice or tokenization perturbations) while holding the remainder of the pipeline fixed.


9. Stress Tests and Failure Modes: dataset shift, calibration corpus mismatch, tokenization perturbations, adversarial/gaming analysis, and when L* should not be expected to work.

We now delineate stress tests intended to falsify (or at least delimit) the empirical claim that L*(θ) provides a portable progress coordinate across heterogeneous families. The point of these tests is not to ``improve'' L* by ad hoc fixes, but to expose when the assumptions implicit in Theorems 1–3 are violated strongly enough that threshold forecasts become unreliable.

Hypothesis (H2) requires that task performance be (approximately) a monotone function of a tokenizer-invariant z(ฮธ). This can fail under distribution shift in the task distribution Dt relative to what is captured by ๐’ž and by pretraining. We therefore evaluate each task t under controlled perturbations of Dt that preserve the task definition but alter surface statistics, e.g.,
(i) paraphrasing prompts and options,
(ii) swapping topic mixtures (e.g.ย science vs.ย sports) while keeping difficulty constant,
(iii) introducing code-style changes (renaming variables, reformatting) for verifier-based coding tasks.
We report, for each perturbation, the change in LOFO threshold-crossing error and the stability of the fitted curve gฬ‚t (measured by the maximum vertical deviation between fits). Large instability indicates that the ``single-coordinateโ€™โ€™ ordering is insufficient for that task family, even if unperturbed performance appeared well predicted.

Theorem 1 is local to the choice of 𝒞; if 𝒞 is poorly matched to the entropy structure induced by a tokenizer, the term δT may be large or may correlate with model family, defeating calibration. We stress-test corpus dependence by defining a small set of alternative public calibration corpora {𝒞(r)}r=1R spanning distinct registers (e.g. formal math, code, encyclopedic prose) under the same canonical byte-level preprocessing. For each checkpoint we compute L*(r)(θ) using 𝒞(r) and examine:
(i) the stability of the checkpoint ordering across corpora (e.g. rank correlation between L*(r) and L*(r′)), and
(ii) family-specific residuals after the best affine alignment between the L*(r) coordinates.
If L* is genuinely capturing a tokenizer-invariant progress coordinate, then changing ๐’ž should move all checkpoints in a largely order-preserving manner and should not induce large family-specific residuals. When it does, we treat forecasts as corpus-conditional and report the dependence explicitly.

Although L* removes a dominant unigram offset, it does not marginalize over segmentations (cf. Theorem 4). We therefore perform controlled perturbations of the tokenizer while keeping the underlying byte strings fixed. Concretely, for BPE/Unigram tokenizers we construct perturbed tokenizers by (i) altering normalization rules (e.g. whitespace collapsing vs. preservation), (ii) injecting stochastic BPE-dropout at tokenization time and re-estimating UT on the induced distribution, and (iii) restricting/expanding the vocabulary by pruning rare merges. For each perturbation T̃ we recompute CE𝒞T̃(UT̃) and LT̃*(θ) for models that can be evaluated under T̃ (or for proxy models trained with that tokenizer when direct evaluation is impossible). The diagnostic is whether L* remains approximately affine across perturbations: we fit LT̃*(θ) ≈ αT̃ LT*(θ) + βT̃ and examine the residual variance. Large, structured residuals indicate that higher-order tokenization effects (beyond unigram entropy) are material for 𝒞 and hence for portability.
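The affine-stability diagnostic is an ordinary least-squares fit per perturbed tokenizer; a minimal sketch:

```python
import numpy as np

def affine_residuals(L_base, L_pert):
    """Fit L*_pert ~ alpha * L*_base + beta across checkpoints and
    return (alpha, beta, residual std). Large or structured residuals
    flag tokenization effects beyond the unigram offset."""
    L_base, L_pert = np.asarray(L_base), np.asarray(L_pert)
    A = np.vstack([L_base, np.ones_like(L_base)]).T
    (alpha, beta), *_ = np.linalg.lstsq(A, L_pert, rcond=None)
    return alpha, beta, (L_pert - alpha * L_base - beta).std()
```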

A distinct failure mode is that downstream evaluation itself introduces tokenizer-coupled artifacts, e.g.ย multiple-choice scoring that depends on option length under T. While our protocol uses length-normalized likelihoods, normalization is not a panacea when token boundaries align differently with semantic units. We stress-test by re-encoding each task instance under multiple semantically equivalent surface forms with differing token lengths (e.g.ย adding neutral padding, changing punctuation conventions) and measuring whether the relative ordering of checkpoints at fixed L* changes. When it does, we attribute the effect to task measurement noise rather than to L* per se, and we report a task-specific ``format instabilityโ€™โ€™ score that upper-bounds attainable prediction accuracy under any scalar progress metric.

Because ๐’ž is public, a capable actor could fine-tune a model to reduce CE๐’žT(ฮธ) (and hence improve L*) without commensurate gains on tasks, invalidating threshold forecasts. We operationalize this as an explicit attack: starting from a base checkpoint, we perform lightweight continued training on ๐’ž (or a close paraphrase set) while holding total compute fixed, and we measure the slope ฮ”Pt/ฮ”L* across tasks. A healthy regime exhibits broadly positive slopes; a successful gaming attack yields large negative or near-zero slopes. As mitigations, we consider (i) using multiple disjoint calibration corpora and aggregating L* (e.g.ย median across corpora), and (ii) maintaining an additional calibration set for governance use, while retaining the public protocol for open scientific comparison. We emphasize that these mitigations address strategic behavior, not purely statistical noise.

We list conditions under which the theoretical and empirical premises plausibly fail:
(i) heavy non-likelihood post-training (preference optimization, RL) that decouples task behavior from next-token loss on 𝒞,
(ii) genuinely multi-dimensional progress (e.g. code versus dialogue specialization) that no scalar coordinate can order,
(iii) calibration corpora that miss the register a task depends on, so that δT is large or correlated with family,
(iv) tokenization effects on 𝒞 beyond unigram entropy that remain material after calibration, and
(v) strategic optimization against the public corpus 𝒞.
In such regimes, we treat L* as a descriptive statistic of text modeling skill on ๐’ž rather than as a portable progress coordinate, and we recommend reporting these stress-test outcomes alongside any claimed threshold forecasts.


10. Discussion and Governance Use-Cases: portable early-warning reporting; how L* interacts with RL post-training and inference-time compute (boundaries of applicability); limitations and future work.

The principal claim of this work is not that L*(θ) is an intrinsic measure of ``intelligence,'' but that it is a portable coordinate for forecasting thresholded capability on fixed text tasks under heterogeneous tokenizers. In settings where one wishes to anticipate when a task t will become reliably solvable by a publicly released checkpoint, portability is the binding constraint: raw losses, parameter counts, and compute are not commensurate across families because they carry family-specific offsets. The subtraction in L* is deliberately minimal; it removes only what is required to align cross-entropies sufficiently to support threshold localization at the granularity implied by Theorem 2.

From a governance perspective, the immediate use-case is portable early-warning reporting for capability onset. Concretely, given a stream of newly released checkpoints {θ} (across any family for which we can compute CE𝒞T(θ)), one may maintain a public table of L*(θ) values on a fixed 𝒞 and, for each monitored task t, report P̂t(θ) = ĝt(L*(θ)) together with a conservative threshold-crossing statement. A simple operationalization is: fix ϵ > 0 and a tolerable false-alarm rate δ, and declare ``at-risk of emergence'' if an upper confidence bound on P̂t(θ) exceeds bt + ϵ. By Theorem 2, the uncertainty in the implied threshold η̂t decays as m^{-1/2} in the number of labeled checkpoints used to fit ĝt; hence the protocol benefits from a small, curated set of evaluations on representative checkpoints, after which large-scale monitoring reduces to cheap calibration loss computation.
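A minimal operationalization of the early-warning rule (the Gaussian upper confidence bound is a stand-in for whatever interval the fitting procedure supplies, e.g. the bootstrap of Section 5):

```python
def at_risk(pred_perf, stderr, baseline, eps, delta=0.05):
    """Declare 'at risk of emergence' when a one-sided (1 - delta)
    upper confidence bound on predicted performance clears b_t + eps."""
    z = {0.05: 1.645, 0.01: 2.326}[delta]   # supported one-sided quantiles
    return pred_perf + z * stderr > baseline + eps

# Example: predicted 0.29 +/- 0.03 against baseline 0.25, margin 0.05.
print(at_risk(0.29, 0.03, 0.25, 0.05))      # True: UCB 0.34 clears 0.30
```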

We emphasize that such reporting is most defensible when (i) the task evaluation is standardized and (ii) the model is deployed in a mode close to that used for calibration (teacher-forced likelihood reflecting the same base distribution). The calibrated coordinate is designed to support statements of the form: ``among checkpoints that share broadly similar training objectives, those with more negative L* should not systematically underperform on t unless the task violates the single-coordinate hypothesis.'' This motivates a governance framing in which L* is treated as a screening instrument: it identifies checkpoints for which costly task evaluations are most informative (near η̂t), and it supplies a portable axis on which to stratify models in reports, without asserting that the axis is causal or complete.

A boundary of applicability concerns post-training procedures, especially preference optimization and reinforcement learning from feedback. These procedures can change the induced policy over completions in ways that are weakly coupled to CE𝒞T(θ). In the extreme, one can modify response style, refusal behavior, or tool-use proclivities while leaving next-token likelihood on generic text nearly unchanged; then L* remains a measure of pretraining-like modeling skill, whereas Pt may shift due to post-training. We therefore recommend a two-coordinate accounting when RL-style post-training is material: report (L*(θbase), Δt(θ)), where Δt(θ) is the task-specific gain (or loss) attributable to post-training relative to a matched base checkpoint. When matched bases are unavailable, one may still use L* to predict the likelihood-aligned component of performance and treat deviations as post-training residuals. Formally, one may posit Pt(θ) = φt(z(θbase)) + ρt(θ) + ϵt, where ρt captures non-likelihood-aligned modifications; the governance-relevant point is that ρt cannot be inferred from L* alone and must be measured on tasks or proxied by additional diagnostics.

A second boundary concerns inference-time compute. Many deployments use sampling schemes (self-consistency, best-of-k, verifier-guided search) that can amplify success probabilities relative to a single greedy or temperature-controlled decode. Our analysis is silent about these amplification mechanisms: L* is computed under teacher forcing and thus reflects the base distribution, not the algorithm used to extract answers. A practical remedy is to parameterize performance by both calibrated progress and an inference budget variable c (e.g. number of samples, verifier calls), fitting P̂t(θ; c) ≈ ĝt(L*(θ), c) with monotonicity in both arguments. Even when c is held fixed in evaluation, governance users should interpret a threshold η̂t as relative to an inference budget: a task may ``emerge'' earlier under aggressive search than under a restricted budget, and the relevant budget depends on the deployment environment.

Several limitations follow directly. First, L* remains corpus-conditional: it is calibrated on ๐’ž, and portability is only promised to the extent that ๐’ž captures the entropy structure that induces tokenizer offsets. Second, L* is scalar and therefore cannot represent genuinely multi-dimensional progress (e.g.ย models specializing in code versus dialogue); when the single-coordinate assumption fails, the appropriate response is not to contort L* but to represent progress by a small vector L*(r) over a controlled family of calibration corpora and to fit task-specific combinations. Third, L* is a likelihood-based statistic; as models become increasingly shaped by non-likelihood objectives, we should expect the correlation between L* and downstream capability to weaken, and monitoring should accordingly weight direct task evaluation more heavily.

Future work therefore divides naturally into three directions. (i) replace the affine distortion model with a hierarchical model in which tokenization effects include bounded higher-order terms, and characterize when unigram subtraction remains minimax-optimal among feasible calibrators. (ii) construct a standardized set of calibration corpora with canonical preprocessing and publish uncertainty intervals for L*(ฮธ) under finite n, enabling principled comparisons across laboratories. (iii) generalize the framework to multimodal inputs and to settings with tool calls, where teacher-forced text likelihood is an incomplete proxy and where thresholding should be defined over an augmented action space. Our position is that L* is a useful baseline: it provides a minimal, reproducible alignment across tokenizers, and it clarifies which additional axes must be introduced when that baseline ceases to predict emergence.