
Active-FLP: Correlation-Aware Multi-Fidelity Scaling Laws with Calibrated Prediction Intervals


Table of Contents

  1. Introduction: downstream forecasting under emergence; why correlation and active design matter; relation to FLP/FLP-M.
  2. Preliminaries and Setup: FLP factorization; multi-fidelity costs; dependence in checkpoints; goals (accuracy + coverage + minimal compute).
  3. Formal Problem Definition: design space (𝒳), tasks (𝒯), oracles, costs, and what constitutes a valid forecast.
  4. Statistical Model and Assumptions: Stage-1 FLOPs-to-loss law; Stage-2 loss-to-performance law; emergence as censoring/link; β-mixing dependence; exchangeability across runs.
  5. Correlation-Aware Inference: effective sample size Neff; why checkpoint-level i.i.d. assumptions fail; block construction over runs/checkpoints.
  6. Block-Conformal Prediction Intervals: run-level calibration, block residuals, finite-sample coverage theorem, and practical implementation details.
  7. Active-FLP Algorithm: acquisition functions for selecting the next (configuration, checkpoint, benchmark); multi-fidelity scheduling (converged vs partial); invariants and termination criteria.
  8. Theory: (i) effective-sample-size and estimation error bounds; (ii) active sample complexity in an emergence-like special case; (iii) lower bounds / impossibility when all data are pre-emergence.
  9. Experimental Plan (flagged as needed for strength): datasets/logs, evaluation suite, ablations (active vs passive, block sizes, dependence), calibration diagnostics, and cost accounting.
  10. Discussion: extensions to many-source mixtures, recipe shift detection, and limitations; relation back to FLP/FLP-M findings.

Content

1. Introduction: downstream forecasting under emergence; why correlation and active design matter; relation to FLP/FLP-M.

We study the problem of forecasting downstream capabilities of a language model at a training configuration x* before we have paid the cost to train at x* to convergence and to evaluate the full suite of downstream benchmarks. The practical motivation is straightforward: the dominant costs in contemporary scaling experiments arise not from fitting a small number of surrogate parameters, but from (i) running many expensive training jobs and (ii) repeatedly evaluating expensive tasks during training. A forecasting method is useful only insofar as it simultaneously (a) respects these cost asymmetries, (b) remains statistically calibrated despite the strong dependence induced by checkpoints within a run, and (c) adapts its data collection so that the expensive queries are concentrated where they are informative.

A central difficulty is emergence. Many tasks display a broad regime in which performance is indistinguishable from a floor (often random guessing), followed by a comparatively rapid improvement as training compute increases or loss decreases. In such settings, naive extrapolation in compute space is brittle: fitting a smooth function C ↦ Pτ(C) from sparse evaluations typically fails because most early evaluations provide almost no information about post-threshold behavior. In particular, when all observed evaluations lie below the floor, the data may be consistent with widely different post-emergence trajectories; without additional structure, the forecasting problem is not merely hard but can be information-theoretically underdetermined. Any successful methodology must therefore (i) incorporate a representation in which emergence is more nearly regular and (ii) decide where to spend evaluations so that at least some data are collected in the informative regime.

Our approach is to forecast downstream performance through the intermediate object of the validation loss vector. The guiding principle is the factorization
C(x) → L(x) → Pτ(x),
which separates the engineering choice of training configuration (and its compute) from the task-specific mapping between a model state and its capabilities. This factorization is already implicit in the scaling-law literature, but we treat it as an algorithmic design pattern rather than a post hoc descriptive fit. At a high level, we use a small number of fully converged training runs to learn a map from compute to loss, and then we use many cheaper, intermediate checkpoints (where available) to learn how task performance depends on loss. The first stage provides a common coordinate system in which different configurations can be compared and interpolated, while the second stage exploits the observation that, once expressed as a function of loss, many capabilities vary more smoothly and monotonically, even when they appear discontinuous in compute space.

A second difficulty is that the natural data stream in training is dependent. Checkpoints within a single run are not independent draws; they are a strongly correlated time series. Treating these checkpoints as i.i.d. and applying standard uncertainty quantification can yield spuriously narrow intervals, because the apparent sample size is inflated by dependence. This issue is not a minor technicality: since intermediate checkpoints are precisely the cheap signal we wish to exploit, any method that cannot correctly account for within-run correlation will fail at the level of uncertainty quantification. We therefore require that the forecasting procedure calibrate at the level of independent units (runs) or at the level of blocks that are sufficiently separated so that the dependence is controlled. In this sense the problem is inherently one of learning with a nonstandard statistical constraint: we must use low-cost, correlated observations without violating coverage guarantees.

The cost structure further forces adaptivity. The most informative evaluations for an emergent task are those taken near the emergence boundary, but this boundary is unknown a priori and is task dependent. If one evaluates tasks on a passive grid of configurations or at uniformly spaced checkpoints, then in the pre-emergence regime one accumulates many redundant floor-level observations, while the region in which the task begins to improve is undersampled. The resulting waste is not merely computational; it is statistical, since the passive strategy spends its effective sample budget far from the decision region that dominates forecasting error. By contrast, an active strategy can use its current uncertainty estimates to select additional evaluations that preferentially reduce the final interval width at x*, thereby approaching the information-theoretic rate with substantially fewer expensive queries.

Our work is positioned relative to the FLOPs–Loss–Performance (FLP) viewpoint and its mixture extensions (often denoted FLP-M). FLP asserts that, within a fixed model family and training recipe, converged loss is predictable from total compute (or from a vector of per-source computes in the multi-source setting) by a simple parametric scaling law. We treat this as a first-stage structural assumption that enables extrapolation in log C with few converged runs. However, our goal is not to stop at forecasting loss. Rather, we use loss as an intermediate representation for downstream prediction. In doing so, we inherit the strengths of FLP (cheaply forecasting loss at x* from a small design) while addressing a separate, task-dependent problem: mapping loss to performance under emergence. In particular, the same loss forecast can be combined with different task models, and the same collection of checkpoint losses can support forecasting across a large task set, provided evaluations are chosen strategically.

A methodological contribution of this perspective is that it clarifies where multi-fidelity observations enter and how they should be aggregated. Fully converged runs anchor the compute-to-loss map; intermediate checkpoints contribute primarily to the loss-to-task map, and they do so in a way that is cost-effective only if we acknowledge dependence and calibrate appropriately. Moreover, the emergence phenomenon suggests that expensive task evaluations should be delayed until the model is likely to have moved beyond the floor, which can be inferred from the loss trajectory. Thus, rather than viewing benchmark evaluation as an afterthought performed at the end of training, we view it as an active data-collection problem coupled to training dynamics.

Finally, we emphasize that the deliverable is not merely a point estimate P̂τ(x*), but a prediction interval with explicit miscoverage level. In the scientific use case—deciding whether to train a target model, or whether to adjust a data mixture—an overconfident forecast is worse than an uninformative one. The dependence structure of checkpoints, together with emergence-induced censoring, makes calibrated uncertainty quantification nontrivial; nonetheless, by constructing residuals at the run level (or via non-overlapping blocks) we can bring conformal calibration to bear in a way that is compatible with the sequential, multi-fidelity data collection loop.

The remainder of the paper formalizes this program. We first specify the setup in which compute, losses, checkpoints, and task evaluations are observed at different costs, and we define the goal as achieving both accuracy and coverage under a constrained budget. We then present an active algorithm that interleaves training and evaluation actions, fits the two-stage surrogate, and returns conformal intervals that remain valid under checkpoint dependence. Subsequent results quantify the effect of dependence through an effective sample size, compare active and passive strategies in an emergent special case, and delineate impossibility regimes when censoring prevents any post-emergence information from being observed.


2. Preliminaries and Setup: FLP factorization; multi-fidelity costs; dependence in checkpoints; goals (accuracy + coverage + minimal compute).

We fix throughout a model family and training recipe class, denoted by ℛ (optimizer, learning-rate schedule template, tokenizer, context length, and other "non-scaling" ingredients). A configuration is an element x ∈ 𝒳, where 𝒳 is a finite (or effectively compact) design space encoding choices such as model size, token budget, and, when relevant, a compute allocation across data sources. Our aim is to forecast properties of a configuration x* ∈ 𝒳 before training x* to convergence.

Each configuration x determines a total pretraining compute C(x). In the simplest setting C(x) ∈ ℝ+ is a scalar proportional to total training FLOPs. In a multi-source setting we allow K ≥ 2 data sources and write C(x) ∈ ℝ+K for the vector of per-source computes, with Ck(x) the compute spent on source k. We will treat C(x) as the primary "scale" coordinate and, when using parametric scaling laws, work in log C (componentwise in the vector case). The design space 𝒳 is assumed to contain only configurations compatible with the fixed recipe ℛ, so that comparisons across x do not confound changes in optimization dynamics with scale.

We evaluate models on a fixed set of D validation or probe domains. For a converged training run at configuration x, we define the converged loss vector
L(x) ∈ ℝD,   L(x) = (L1(x), …, LD(x)),
where Ld(x) is the loss on domain d at convergence (or at a fixed, agreed-upon training termination criterion under ℛ). During training we may also observe intermediate checkpoint losses. For run index r and checkpoint index t, we write
Lr, t(x) ∈ ℝD
for the observed loss vector at checkpoint t in run r at configuration x. These checkpoint losses are typically cheap to compute relative to downstream benchmark evaluations and, crucially, they provide a common coordinate system across configurations and across tasks. When fitting task predictors we may use a feature map ϕ : ℝD → ℝp to incorporate transformations (e.g., interaction terms, domain reweightings, or augmentations), but the loss vector L remains the fundamental state descriptor.

Let 𝒯 be a finite set of downstream tasks or benchmarks, indexed by τ. For each τ and configuration x, let Pτ(x) denote the downstream performance at convergence (e.g., accuracy, pass@k, or another scalar metric). When we evaluate task τ at checkpoint t of run r, we observe a noisy estimate
Yr, t, τ(x),
which we regard as a random variable centered around the corresponding latent checkpoint performance. We emphasize that Yr, t, τ(x) can be expensive to obtain; moreover, for some τ the cost is large enough that evaluating τ at every checkpoint is impractical. The emergence phenomenon motivates allowing a task-specific monotone link gτ mapping a latent score (often approximately linear in ϕ(L)) to the reported metric. This link may encode saturation, floors (e.g., random guessing), and rapid post-threshold changes.

Our basic modeling primitive is the factorization
x ↦ C(x) ↦ L(x) ↦ Pτ(x),
which we operationalize as a coupled pair of statistical surrogates. Stage 1 models the relationship between compute and converged loss. Concretely, for each domain d we posit a log-power law of the form
log Ld(x) = ad + bd log C(x) + ηd(x),
(or an appropriate multi-source extension when C(x) ∈ ℝ+K), with ηd(x) capturing run-to-run variability and treated as sub-Gaussian noise at the run level. Stage 2 models the relationship between loss and downstream performance. For each task τ we posit
Pτ(x) = gτ(θτ⊤ϕ(L(x))) + ξτ(x),
where gτ is monotone (often taken 1-Lipschitz for stability) and ξτ(x) is sub-Gaussian noise. This decomposition makes explicit that converged runs are primarily needed to anchor C ↦ L, while many cheaper checkpoint observations can be exploited to learn L ↦ Pτ across a large task set.
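To make the two-stage surrogate concrete, the following is a minimal sketch (the helper names, the crude ϕ(L) score, and the logistic-with-floor link are illustrative assumptions of this sketch, not prescribed components of the framework).

```python
# Sketch of the two-stage surrogate: Stage 1 fits log L_d = a_d + b_d log C per
# domain; Stage 2 fits a logistic-with-floor link on a linear score of the losses.
import numpy as np
from scipy.optimize import curve_fit


def fit_stage1(C, L):
    """C: (n,) computes of converged runs; L: (n, D) converged losses.
    Returns (a, b), each of shape (D,), so that log L_d ~ a_d + b_d log C."""
    X = np.column_stack([np.ones_like(C), np.log(C)])     # (n, 2) design matrix
    coef, *_ = np.linalg.lstsq(X, np.log(L), rcond=None)  # coef has shape (2, D)
    return coef[0], coef[1]


def predict_stage1(a, b, C_star):
    """Predicted converged loss vector at compute C_star."""
    return np.exp(a + b * np.log(C_star))


def logistic_with_floor(s, p_rand, scale, shift):
    """Monotone link with a chance-level floor p_rand and saturation at 1."""
    return p_rand + (1.0 - p_rand) / (1.0 + np.exp(-(s - shift) / scale))


def fit_stage2(L_ckpt, Y, p_rand):
    """L_ckpt: (m, D) checkpoint losses; Y: (m,) task metrics.
    Uses an illustrative score s = -mean(log L) and fits the link parameters."""
    s = -np.log(L_ckpt).mean(axis=1)
    popt, _ = curve_fit(
        lambda s, scale, shift: logistic_with_floor(s, p_rand, scale, shift),
        s, Y, p0=[1.0, np.median(s)], maxfev=10000)
    return popt


def predict_stage2(L, p_rand, popt):
    """Point forecast of the task metric at loss vector(s) L."""
    s = -np.log(np.atleast_2d(L)).mean(axis=1)
    return logistic_with_floor(s, p_rand, *popt)
```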

The forecasting problem is intrinsically multi-fidelity. A converged query is a fully converged training run at configuration x, costing costtrain(x), and yielding L(x) (as well as the full training trajectory, if logged). A partial query is a partial training run (or an extension of an existing run) to reach a checkpoint t, at which point we may record Lr, t(x) and selectively evaluate tasks τ to obtain Yr, t, τ(x). Each task evaluation incurs its own cost costeval(τ), which may vary widely across τ. The algorithmic consequence is that the learner must decide (i) which configurations to train, (ii) to what training fraction (or which checkpoints to log), and (iii) which tasks to evaluate at which checkpoints, with the objective of reducing forecast uncertainty at x* per unit cost.

Checkpoint observations within a run are not i.i.d.; they form a strongly dependent time series driven by the optimization dynamics of the fixed recipe ℛ. We model this dependence by assuming that, within any fixed run r, the sequence of residuals (after conditioning on the fitted surrogate) is β-mixing with coefficients β(m) at lag m. Across distinct runs (training jobs), we assume independence (or, at minimum, exchangeability at the run level). This distinction is central: intermediate checkpoints can dramatically increase the number of observed (L, Y) pairs, but they do not automatically increase the sample size. Any calibrated procedure must therefore aggregate checkpoint information into units that respect dependence, e.g., by forming one residual per run, or by using non-overlapping blocks of length B (possibly separated by gaps) and calibrating at the block level. This is precisely the regime in which block-based conformal methods become appropriate: they preserve finite-sample coverage guarantees while permitting us to exploit low-cost, correlated observations.

Given a target configuration x*, we seek for each τ ∈ 𝒯 a point prediction P̂τ(x*) and an accompanying prediction interval Îτ(x*) with marginal coverage at least 1 − δ. We measure accuracy by the interval half-width and adopt a target tolerance ε > 0. The operational goal is thus to achieve both (i) coverage ℙ(Pτ(x*) ∈ Îτ(x*)) ≥ 1 − δ and (ii) width width(Îτ(x*)) ≤ 2ε, while minimizing the total cost incurred by training and evaluation queries. The remainder of the paper treats this as an interactive design problem: data are collected sequentially, models and calibration sets are updated online, and actions are chosen to reduce the final uncertainty at x* in a cost-aware manner.

The next section formalizes the design space, oracle access model, and what it means for an algorithm to return a forecast under the foregoing dependence and cost constraints.


3. Formal Problem Definition: design space (𝒳), tasks (𝒯), oracles, costs, and what constitutes a valid forecast.

We now formalize the interactive forecasting problem as an online decision process over training and evaluation queries, with explicit costs and a precise notion of forecast validity. Throughout, the learner is given a finite (or discretized compact) design space 𝒳, a finite task set 𝒯, and a fixed target configuration x* ∈ 𝒳 whose converged downstream performance is to be forecast.

Each configuration x ∈ 𝒳 induces a compute allocation C(x). In the single-source setting C(x) ∈ ℝ+; in the multi-source setting we allow C(x) ∈ ℝ+K, with coordinates Ck(x) denoting the compute allocated to source k ∈ {1, …, K}. We treat C(x) as the sole scaling coordinate exposed to the algorithm; other recipe ingredients are fixed by ℛ. For each x, convergence produces a validation loss vector L(x) ∈ ℝD on D probe domains. During training we may also observe checkpoint losses Lr, t(x) at checkpoint indices t, for run index r.

For each task τ ∈ 𝒯 and configuration x, let Pτ(x) denote the (unknown) converged task performance under the fixed recipe ℛ. The object of interest is the vector {Pτ(x*)}τ ∈ 𝒯. Task evaluations are noisy: when the learner evaluates τ at checkpoint t within run r of configuration x, it observes a random variable Yr, t, τ(x), interpreted as a noisy estimate of the checkpoint performance (and, in particular, not assumed independent across t within the same run). We do not assume that Yr, t, τ(x) is available for all τ at all checkpoints; rather, τ-selection is part of the learner’s control.

We specify the learner’s interaction with the world via two oracles: a training oracle, which, given a configuration x and a fidelity (a checkpoint index, or "to convergence"), runs training and returns the logged checkpoint losses (and, for converged runs, L(x)); and an evaluation oracle, which, given a task τ and a checkpoint (r, t), returns the noisy measurement Yr, t, τ(x).

We allow the learner to interleave oracle calls arbitrarily, subject to budget constraints. In particular, the learner may (i) start new runs at different configurations x, (ii) extend an existing partial run to later checkpoints, and (iii) decide at which checkpoints to evaluate which tasks.

Each training call incurs a cost costtrain(x) when executed to convergence, or a corresponding partial-training cost when executed only up to some training fraction (we may regard this as costtrain(x, t), nondecreasing in t). Each evaluation call for task τ incurs cost costeval(τ), independent of x and t for simplicity (the formalism extends to costeval(τ, x, t)). We write the total cost accumulated by an algorithm 𝒜 up to termination as
$$ \mathrm{Cost}(\mathcal{A}) \;=\; \sum_{\text{training calls}} \mathrm{cost}_{\mathrm{train}}(\cdot) \;+\; \sum_{\text{evaluation calls}} \mathrm{cost}_{\mathrm{eval}}(\cdot). $$
Optionally, the learner is given a hard budget Budget and must ensure Cost(𝒜) ≤ Budget; otherwise we treat cost minimization as the objective under statistical constraints.

At any time, the learner’s state consists of a finite dataset of logged checkpoint losses and selected task evaluations. It is convenient to denote the set of completed runs by ℛobs, and for each run r ∈ ℛobs to maintain: the configuration xr, the set of observed checkpoints ℐr, the loss vectors {Lr, t(xr)}t ∈ ℐr, and the set of evaluated task outputs {Yr, t, τ(xr) : (t, τ) ∈ ℰr}, where ℰr ⊆ ℐr × 𝒯 is chosen adaptively. For runs trained to convergence we also include L(xr). We emphasize that within-run checkpoint observations are treated as dependent; consequently, any statistical procedure that uses them must specify how dependence is handled (e.g., by aggregation to run-level quantities or by block constructions).

Fix a miscoverage level δ ∈ (0, 1) and a target half-width ε > 0. An algorithm terminates by returning, for each τ ∈ 𝒯, a point predictor P̂τ(x*) and an interval Îτ(x*) ⊂ ℝ; we call the output valid if ℙ(Pτ(x*) ∈ Îτ(x*)) ≥ 1 − δ for every τ ∈ 𝒯. We regard the interval as the primary object.

The probability in this definition is taken over all randomness in training dynamics, evaluation noise, and any internal randomization of 𝒜. Since the algorithm is adaptive, validity is understood in the usual sense of predictive inference: the returned interval must cover the fixed target Pτ(x*) with probability at least 1 − δ when Îτ(x*) is constructed from adaptively collected data. In particular, dependence across checkpoints implies that naive i.i.d. calibration is not admissible; our algorithms will therefore impose run- or block-level calibration units to recover exchangeability at the calibration granularity.

We can now state the optimization problem implicit in the interactive protocol. The learner seeks to minimize total cost while achieving validity and target width for the target configuration x*. Formally, one may consider
min𝒜 𝔼[Cost(𝒜)]  s.t.  ∀τ ∈ 𝒯: ℙ(Pτ(x*) ∈ Îτ(x*)) ≥ 1 − δ,  width(Îτ(x*)) ≤ 2ε,
or, under a hard budget Budget, to maximize achieved accuracy subject to Cost(𝒜) ≤ Budget. Since the tasks may have heterogeneous evaluation costs, the feasible set of query policies is strongly shaped by {costeval(τ)}τ ∈ 𝒯; this motivates policies that allocate evaluations non-uniformly across τ, and, within each τ, focus evaluations on checkpoints and configurations expected to be informative for x*.

While the preceding definitions do not impose a particular statistical model, they force a methodological constraint: the calibration procedure used to construct Îτ(x*) must treat dependent checkpoint observations appropriately. In our setting, the natural units that can plausibly be treated as exchangeable are independent training runs (or non-overlapping blocks within runs, when justified). Consequently, the learner’s design choices affect not only the mean-squared error of point estimates but also the availability of calibration residuals needed to certify finite-sample coverage. This tension between cheap, dependent checkpoint data and expensive, independent run-level data is the central resource trade-off addressed by the algorithmic framework developed subsequently.


4. Statistical Model and Assumptions: Stage-1 FLOPs-to-loss law; Stage-2 loss-to-performance law; emergence as censoring/link; β-mixing dependence; exchangeability across runs.

We now impose a statistical structure on the interactive data stream generated by the training and evaluation oracles. Throughout, we fix the recipe class ℛ (optimizer family, schedule template, tokenizer, context length, data pipeline conventions), and we regard all stochasticity as arising from the random seed, data order, and any internal nondeterminism of the training/evaluation stack. Thus, for each x ∈ 𝒳, repeated runs at the same configuration correspond to repeated draws from a well-defined run distribution determined by (x, ℛ).

Our modeling is explicitly two-stage. Stage 1 describes how compute allocation determines the converged validation loss vector on probe domains. Stage 2 describes how (possibly intermediate) loss vectors determine downstream task performance, with "emergence" represented as a monotone link with a floor. Dependence across checkpoints is handled separately from variation across independent runs; the former will later force correlation-aware calibration.

For each probe domain d ∈ {1, …, D} we posit a log-linear scaling relationship between converged loss and compute. In the single-source case C(x) ∈ ℝ+ we assume that for a converged run at configuration x,

$$ \log L_d(x) \;=\; a_d + b_d \log C(x) + \eta_d(x), $$

where ad ∈ ℝ, bd ∈ ℝ are unknown parameters and ηd(x) is run-to-run noise; we refer to this Stage-1 assumption as (H1). In the multi-source case C(x) ∈ ℝ+K we use the natural extension

$$ \log L_d(x) \;=\; a_d + \sum_{k=1}^{K} b_{d,k} \log C_k(x) + \eta_d(x), $$

interpreting {bd, k} as a domain-specific sensitivity to per-source compute. We emphasize that these relations are not intended as mechanistic laws of optimization; they are an operational identifiability condition ensuring that a small number of fully-converged runs suffices to predict loss at untrained configurations, including the target x*.

We assume that ηd(x) is centered and sub-Gaussian uniformly over x and d: there exists σ1 > 0 such that for all λ ∈ ℝ,
$$ \mathbb{E}\!\left[\exp\!\bigl(\lambda \eta_d(x)\bigr)\right]\;\le\;\exp\!\left(\frac{\lambda^2\sigma_1^2}{2}\right). $$
We do not require that ηd(x) be independent across domains d; correlation across domains is allowed and is in fact expected. What we require is that different runs are independent draws (formalized below), so that run-level aggregation can restore exchangeability for calibration.

For each downstream task τ ∈ 𝒯 we assume that performance is a monotone function of a low-dimensional summary of the loss vector. Concretely, let ϕ : ℝD → ℝp be a fixed feature map (possibly the identity, possibly augmented with domainwise differences, logs, or interaction terms). We posit a task-specific parameter θτ ∈ ℝp and a monotone link gτ : ℝ → ℝ such that the converged task performance satisfies

$$ P_\tau(x) \;=\; g_\tau\!\bigl(\theta_\tau^{\top}\phi\bigl(L(x)\bigr)\bigr) + \xi_\tau(x), $$

where ξτ(x) is centered sub-Gaussian noise with proxy σ2, τ, uniformly over x ∈ 𝒳; we refer to this Stage-2 assumption as (H2). We additionally assume that each gτ is 1-Lipschitz. This Lipschitz condition is a conservative smoothness requirement ensuring that uncertainty in the loss space does not amplify when mapped into the performance space; it is compatible with piecewise-linear and logistic-type links used to model saturating metrics.

To represent emergence, we allow gτ to have a floor region corresponding to chance performance. A canonical example, aligned with later lower-bound constructions, is a "hinge with floor" of the form

$$ g_\tau(s) \;=\; P_{\mathrm{rand},\tau} \;+\; \lambda_\tau\,\max\bigl(0,\; s - s_\tau\bigr), \qquad \lambda_\tau \in (0,1], $$

with sτ a task-dependent threshold, or, more smoothly, a logistic with an additive floor. In either case, for a range of latent scores s below a task-dependent threshold, changes in s do not translate into observable improvements over Prand, τ. This has two consequences we will exploit: first, early evaluations can be effectively censored (nearly constant), and second, informative evaluations concentrate near the transition region where gτ begins to increase.
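For concreteness, a minimal sketch of such a hinge-with-floor link; the slope, threshold, and saturation cap are illustrative values, not estimated quantities.

```python
# Hinge-with-floor link: flat at chance level below the threshold, linear above,
# and clipped at a saturation cap.
import numpy as np

def hinge_with_floor(s, p_rand, lam, s_thresh, p_max=1.0):
    return np.clip(p_rand + lam * np.maximum(0.0, s - s_thresh), p_rand, p_max)

# Below s_thresh the metric is indistinguishable from chance performance.
s = np.linspace(-2.0, 2.0, 5)
print(hinge_with_floor(s, p_rand=0.25, lam=0.3, s_thresh=0.5))
```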

The evaluation oracle returns Yr, t, τ(x) at checkpoint t of run r. We connect checkpoint evaluations to loss vectors by assuming the same structural mapping as in (H2), but applied to checkpoint losses:

$$ Y_{r,t,\tau}(x) \;=\; g_\tau\!\bigl(\theta_\tau^{\top}\phi\bigl(L_{r,t}(x)\bigr)\bigr) + \zeta_{r,t,\tau}(x), $$

where ζr, t, τ(x) is centered sub-Gaussian (proxy σ3, τ). We permit ζr, t, τ(x) to be correlated across τ at a fixed (r, t) (e.g., shared decoding stochasticity), but we will treat each τ marginally when constructing prediction intervals. This relation asserts that checkpoint-level task metrics are conditionally stable functions of checkpoint losses, and that the primary benefit of intermediate checkpoints is their reduced cost rather than their independence.

A single training run produces a time series {Lr, t(x)}t (and, for evaluated tasks, {Yr, t, τ(x)}t). These observations are not independent: successive checkpoints share parameters, data exposure, and optimization state. We capture this by assuming β-mixing within each run; we refer to this dependence assumption as (H3). Formally, for fixed r and x, let ℱr, 1t denote the σ-field generated by all logged quantities up to checkpoint t, and let ℱr, t + m denote the tail σ-field after checkpoint t + m. We assume there exist coefficients β(m) → 0 such that

$$ \sup_{t}\; \mathbb{E}\!\left[\,\sup_{A \in \mathcal{F}_{r,t+m}} \bigl|\mathbb{P}\bigl(A \mid \mathcal{F}_{r,1}^{t}\bigr) - \mathbb{P}(A)\bigr|\,\right] \;\le\; \beta(m) \qquad \text{for all } m \ge 1. $$

We do not assume a specific parametric dependence model (e.g. AR(1)); (H3) is only a quantitative statement that sufficiently separated checkpoint blocks behave approximately as if they were independent. This assumption is precisely what later permits block-based calibration procedures while still leveraging many checkpoints for parameter fitting.

In contrast to within-run dependence, we assume that different runs are independent draws from the recipe-induced randomness; we refer to this run-level assumption as (H4). In particular, if r ≠ r′, then the full trajectories {Lr, t(xr)}t and {Lr′, t(xr′)}t are independent conditional on the chosen configurations (xr, xr′). Moreover, for any fixed configuration x, repeated runs at x are exchangeable: the joint law of the run-level noise variables (e.g. {ηd(x)}d and {ξτ(x)}τ) is invariant under permutations of the run index. This exchangeability is the minimal symmetry condition required by conformal calibration at the run (or block) granularity.

Finally, we stress a modeling convention that will matter for validity under adaptive data collection: while the algorithm may choose configurations xr sequentially based on past observations, the randomness within a newly launched run is fresh and independent of the past. Thus, conditional on the algorithm’s chosen action, each new run produces an independent sample from the appropriate run distribution, and the ensuing dependence structure is confined to checkpoints within that run.

These assumptions, taken together, specify the statistical regime in which Active-FLP operates: Stage 1 provides an identifiable map from compute to loss at convergence; Stage 2 provides a monotone, possibly thresholded map from loss to task performance; and checkpoint data are abundant but dependent. The next section develops inference procedures that respect this dependence rather than suppressing it by incorrect i.i.d. approximations.


5. Correlation-Aware Inference: effective sample size Neff; why checkpoint-level i.i.d. assumptions fail; block construction over runs/checkpoints.

Our data stream contains a tension that is central to both statistical efficiency and validity. On the one hand, intermediate checkpoints are cheap and therefore plentiful; on the other hand, checkpoints within a run are strongly dependent by construction. Consequently, the naive identification of "number of checkpoints" with "number of samples" is incorrect, and any inference procedure that calibrates uncertainty as if checkpoint evaluations were i.i.d. will typically undercover. In this section we formalize the notion of effective sample size induced by within-run dependence, explain why i.i.d. approximations fail even when noise is sub-Gaussian, and describe a block-based construction that will later be used for correlation-aware calibration.

Fix a task τ and consider the Stage-2 regression model with design vectors zr, t = ϕ(Lr, t(xr)) and responses Yr, t, τ(xr). A standard i.i.d. least-squares analysis would treat {(zr, t, Yr, t, τ)} across all (r, t) as independent draws, yielding uncertainty shrinking at rate (NT)−1/2 when each run contributes T checkpoints. This conclusion is generally false because within a fixed run r, the sequence {zr, t}t evolves smoothly with training, and {ζr, t, τ}t may inherit temporal correlation from shared decoding randomness and state. In the extreme case where zr, t = zr and ζr, t, τ = ζr, 1, τ for all t, adding checkpoints does not add information: the T checkpoint observations collapse to a single observation. Even in less degenerate cases, high autocorrelation inflates the variance of estimators and invalidates calibration procedures that rely on exchangeability at the checkpoint level.

A simple diagnostic is to examine the variance of the empirical mean of residuals within a run. Let ϵr, t be a centered stationary sequence with Var(ϵr, t) = σ2 and autocorrelation ρ(m) = Corr(ϵr, t, ϵr, t + m). Then

$$ \mathrm{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T}\epsilon_{r,t}\right) \;=\; \sigma^2\left(\frac{1}{T} \;+\; \frac{2}{T}\sum_{m=1}^{T-1}\Bigl(1-\frac{m}{T}\Bigr)\rho(m)\right). $$

When ρ(m) ≥ 0 decays slowly, the parenthetical factor can be Θ(1) rather than Θ(T−1), implying that the mean concentrates as if only Θ(1) samples were observed. This identity is not itself an assumption in our framework, but it captures the mechanism by which dependence reduces information. Our β-mixing condition is precisely the hypothesis under which one can control such variance inflation uniformly, and thereby obtain valid concentration bounds for sums of dependent checkpoint residuals.

We require a quantitative substitute for "number of samples" that accounts for temporal dependence. For concreteness, consider an estimator θ̂τ fit from all checkpoints across N independent runs, each providing T logged checkpoints (not necessarily consecutive). Under standard regularity conditions for mixing sequences (bounded moments, suitable design non-degeneracy) one obtains bounds of the form

$$ \mathbb{E}\bigl\|\widehat{\theta}_\tau-\theta_\tau\bigr\|_2^2 \;\lesssim\; \frac{1}{N_{\mathrm{eff}}}, \qquad N_{\mathrm{eff}} \;\equiv\; N \cdot \frac{T}{1 + 2\sum_{m\ge 1}\rho(m)}, $$

where ρ(m) is an appropriate dependence measure (e.g. maximal correlation) controlled by β(m). The interpretation is that each run contributes fewer than T "independent-equivalent" samples, and the reduction factor depends on the integrated autocorrelation time. When checkpoints are nearly independent beyond a short lag, the sum ∑m ρ(m) is bounded and Neff ≈ NT. When dependence persists, Neff can be closer to N than to NT. This scaling is not merely an artifact of upper bounds: for classical models such as AR(1) checkpoint noise, minimax lower bounds match up to constants, so the dependence penalty cannot be removed by a more clever estimator. This perspective will be used in two places: to reason about how many checkpoints are useful for Stage-2 fitting, and to ensure that coverage claims are calibrated at the correct granularity.

The definition of Neff implies that collecting many adjacent checkpoints is statistically wasteful when the mixing time is long: the marginal cost may be small, but the marginal information can be negligible. It is therefore natural to log checkpoints on a sparse schedule (e.g. geometric in training steps) and to prefer additional independent runs when the goal is calibration rather than point estimation. In our interactive setting, we can make this tradeoff explicit: we treat runs as the fundamental independent units and checkpoints as low-cost, correlated refinements used primarily to improve the Stage-2 fit within runs.
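As an illustration, the following sketch estimates Neff from within-run residuals via an empirical integrated autocorrelation time; the truncation heuristic is an assumption of this illustration, not part of the formal bounds.

```python
# Estimate N_eff = sum_r T_r / (1 + 2 * sum_m rho_r(m)) from per-run residuals.
import numpy as np

def integrated_autocorr(resid, max_lag=None):
    """resid: (T,) residuals from one run, in checkpoint order.
    Returns an estimate of 1 + 2 * sum_m rho(m)."""
    x = resid - resid.mean()
    T = len(x)
    max_lag = max_lag or T // 4
    denom = np.dot(x, x)
    rho = np.array([np.dot(x[:-m], x[m:]) / denom for m in range(1, max_lag + 1)])
    # Truncate at the first non-positive autocorrelation (a common heuristic).
    cut = np.argmax(rho <= 0) if np.any(rho <= 0) else len(rho)
    return 1.0 + 2.0 * rho[:cut].sum()

def effective_sample_size(residuals_per_run):
    """residuals_per_run: list of (T_r,) arrays, one per independent run."""
    return sum(len(r) / integrated_autocorr(r) for r in residuals_per_run)
```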

Equally important, dependence affects uncertainty quantification. Even if θ̂τ is consistent under mixing, plug-in standard errors computed under i.i.d. assumptions will generally be too small, by approximately the square root of the variance inflation factor above. Since our objective is distribution-free coverage statements (Section 6), we do not rely on asymptotic normality; rather, we enforce correlation awareness at the level of calibration residuals.

We now describe a blocking scheme that converts dependent checkpoint residuals into approximately exchangeable aggregates. Fix a run r and suppose we have a sequence of checkpoint indices t1 < t2 < ⋯ < tT at which we observe Lr, tj(xr) and (for some tasks) Yr, tj, τ(xr). Let P̂τ(L) denote the fitted Stage-2 predictor evaluated at a loss vector L. Define checkpoint residuals
er, tj, τ = Yr, tj, τ(xr) − P̂τ(Lr, tj(xr)).
Because {er, tj, τ}j is dependent, we do not calibrate directly on its empirical quantiles. Instead, we partition {1, …, T} into non-overlapping blocks of length B (discarding a remainder if necessary). For block index b, let Ir, b = {(b − 1)B + 1, …, bB}. We then form a block residual statistic, for example

$$ E_{r,b,\tau} \;=\; \max_{j \in I_{r,b}} \bigl|e_{r,t_j,\tau}\bigr| \qquad\text{or}\qquad E_{r,b,\tau} \;=\; \frac{1}{B}\sum_{j \in I_{r,b}} \bigl|e_{r,t_j,\tau}\bigr|. $$

The choice between maxima and averages is a design decision: maxima are conservative (robust to occasional heavy-tailed fluctuations), while averages can be tighter when residuals are well-behaved. Under β-mixing, blocks separated by at least one block length behave approximately like independent draws, with an approximation error controlled by ∑m ≥ B β(m). Thus, if B is chosen larger than the mixing time (or if we insert a separation gap between blocks), the collection {Er, b, τ}b can be treated as approximately exchangeable within a run.

Blocking alone does not yield distribution-free guarantees, because approximate independence is not enough for finite-sample conformal validity. Our key structural advantage is that runs are independent by assumption, so we can elevate the calibration unit from checkpoints (dependent) to runs (independent). Concretely, we may further aggregate block residuals into a single run-level residual,

$$ R_{r,\tau} \;=\; \mathrm{Agg}\bigl(E_{r,1,\tau}, \ldots, E_{r,B_r,\tau}\bigr), $$

where Agg is any measurable functional chosen before seeing the data (e.g. a quantile or trimmed mean across blocks). The resulting {Rr, τ}r = 1N are exchangeable across runs (within the calibration split), and therefore suitable for conformal calibration without violating the dependence constraints inside each run. The conceptual shift is that checkpoints improve the fidelity of Rr, τ as an estimate of a run’s typical residual magnitude, but do not multiply the number of independent calibration samples.
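A minimal sketch of this construction, assuming the absolute checkpoint residuals of one run are supplied in time order; the aggregator choices mirror the maxima/average options above.

```python
# Checkpoint residuals -> block residuals -> one run-level residual R_{r,tau}.
import numpy as np

def block_residuals(errs, B, use_max=True):
    """errs: (T,) absolute checkpoint residuals |e_{r,t,tau}| in time order.
    Forms non-overlapping blocks of length B; an incomplete tail is dropped."""
    T = (len(errs) // B) * B
    blocks = np.abs(np.asarray(errs[:T])).reshape(-1, B)
    return blocks.max(axis=1) if use_max else blocks.mean(axis=1)

def run_level_residual(errs, B, agg=np.median, use_max=True):
    """Aggregate block residuals into a single run-level statistic."""
    return agg(block_residuals(errs, B, use_max))
```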

In practice we do not know β(m) and therefore do not know the optimal block length. We adopt a conservative approach: we (i) log checkpoints on a schedule with increasing spacing (so that later blocks cover a wider range of training states with fewer highly redundant points), and (ii) choose B large enough that empirical autocorrelations of residuals |er, t, τ| decay substantially within a block. This selection can be done using a small pilot set of runs, and then held fixed for subsequent calibration so that coverage arguments are not compromised by adaptive tuning of B.

The blocking and run-level aggregation described here is the mechanism by which we reconcile the abundance of checkpoints with the need for valid uncertainty quantification. In the next section we instantiate this mechanism in a block-conformal procedure, yielding finite-sample marginal coverage guarantees while retaining the cost advantages of multi-fidelity data collection.


6. Block-Conformal Prediction Intervals: run-level calibration, block residuals, finite-sample coverage theorem, and practical implementation details.

We now instantiate the correlation-aware blocking mechanism of Section 5 into a conformal procedure that yields finite-sample marginal coverage for each downstream task. The central difficulty is that conformal prediction is distribution-free only under an exchangeability condition, whereas checkpoint-level residuals within a run are not exchangeable in any reasonable sense. Our resolution is to calibrate at the run level: each independent run contributes exactly one calibration residual per task (or, equivalently, one residual per non-overlapping run-block aggregate), and checkpoints serve only to define that residual in a stable way.

Fix a task τ. We partition the set of completed runs into a proper-training subset ℛtr and a calibration subset ℛcal, with |ℛcal| = ncal. Using ℛtr, we fit the Stage-2 predictor
P̂τ(L) = ĝτ(θ̂τ⊤ϕ(L)),
where ĝτ may be fixed (e.g. logistic-with-floor) or estimated subject to monotonicity, and ϕ is the feature map over validation losses. For a target configuration x*, Stage-1 yields a predicted converged loss vector L̂(x*), and the point forecast is
P̂τ(x*) = ĝτ(θ̂τ⊤ϕ(L̂(x*))).
Our goal is then to calibrate an interval around P̂τ(x*) whose width reflects both the statistical error of Stage-2 and the modeling error induced by mapping x ↦ L.

For each calibration run r ∈ ℛcal, we observe a set of checkpoint indices {t1 < ⋯ < tT} at which the task was evaluated (not necessarily at all checkpoints). We define checkpoint residuals
er, tj, τ = Yr, tj, τ(xr) − P̂τ(Lr, tj(xr)),
and convert the dependent sequence {er, tj, τ}j = 1T into a single run-level statistic. Concretely, we choose a block length B and partition {1, …, T} into B-sized consecutive blocks Ir, 1, …, Ir, Br (dropping an incomplete tail if desired). For each block b, we compute a block residual Er, b, τ (e.g. maximum or average absolute residual within the block), and then aggregate across blocks:
Rr, τ = Agg(Er, 1, τ, …, Er, Br, τ).
The only requirement for conformal validity is that Rr, τ be a measurable function of run r’s data defined without peeking at held-out outcomes. The choice of Agg determines the robustness–tightness tradeoff: maxima control rare spikes (at the cost of conservatism), while averages track typical error (at the risk of underestimating extremes when residuals are heavy-tailed).

Let R(1), τ ≤ ⋯ ≤ R(ncal), τ denote the sorted run-level residuals from ℛcal. Define the conformal quantile
Q̂τ = R(⌈(1 − δ)(ncal + 1)⌉), τ.
We then return the symmetric prediction interval
Îτ(x*) = [P̂τ(x*) − Q̂τ,  P̂τ(x*) + Q̂τ].
This interval is intentionally agnostic to how P̂τ was fit, and in particular does not require asymptotic normality or an i.i.d. approximation over checkpoints. The only exchangeability we use is across runs: {Rr, τ}r ∈ ℛcal and the (unobserved) target residual are treated as exchangeable draws under our run-level assumptions.
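A minimal sketch of this run-level split-conformal construction, assuming one run-level residual per calibration run and a bounded metric (clipping follows the practical note below).

```python
# Run-level split-conformal interval around the point forecast p_hat.
import math
import numpy as np

def conformal_interval(p_hat, run_residuals, delta, clip=(0.0, 1.0)):
    """run_residuals: one R_{r,tau} per calibration run; delta: miscoverage."""
    R = np.sort(np.asarray(run_residuals))
    n = len(R)
    k = math.ceil((1.0 - delta) * (n + 1))     # conformal rank
    if k > n:                                  # too few calibration runs: vacuous interval
        return clip
    q = R[k - 1]                               # (1 - delta) conformal quantile
    return max(p_hat - q, clip[0]), min(p_hat + q, clip[1])

# Example: with 19 calibration runs and delta = 0.1, k = ceil(0.9 * 20) = 18.
```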

We now state the coverage guarantee (Theorem 1). Under (H4), calibration runs are exchangeable for fixed x (and independent across runs), hence the usual split-conformal argument applies to {Rr, τ}.


Theorem 1 (run-level block-conformal coverage). Assume that calibration runs are exchangeable at the run level, and that within each run, checkpoint residuals are β-mixing with coefficients β(m). Construct Rr, τ via a fixed blocking rule and define Îτ(x*) by the construction above. Then for any fixed target configuration x* and task τ,
ℙ(Pτ(x*) ∈ Îτ(x*)) ≥ 1 − δ.
Moreover, if blocks are separated by at least s checkpoints (e.g. by inserting gaps), the coverage deviation attributable to within-run dependence is bounded by O(∑m ≥ s β(m)).

The proof is standard once the calibration unit is a run: conditional on the training split, the multiset {Rr, τ}r ∈ ℛcal is exchangeable, and the rank of the target residual among these values is uniform. The mixing condition is used only to justify that the block statistic Rr, τ is a stable summary of a dependent checkpoint sequence and, when gaps are inserted, to bound any additional approximation error due to residual dependence across nearby checkpoints.

  1. When only a few task evaluations exist per run, we set B = 1 and use Rr, τ = |er, t, τ| at the most recent evaluated checkpoint. When many evaluations exist, we prefer block maxima in early deployments (conservative) and transition to block averages once residual distributions appear well-behaved.

  2. We construct Îτ(x*) separately for each τ. If we require a simultaneous statement over τ ∈ 𝒯, a simple (though conservative) approach is Bonferroni: use δτ = δ/|𝒯| in the quantile construction. In cost-sensitive regimes we more commonly accept marginal coverage per task, consistent with the problem definition.

  3. Active-FLP updates models sequentially. To preserve validity, the conformal split must be respected: the predictor P̂τ used to compute calibration residuals should be trained without using ℛcal. When data are scarce, we replace a single split by K-fold cross-conformal calibration, producing K predictors and aggregating their residuals; this improves efficiency while maintaining near-identical validity guarantees under run-level exchangeability.

  4. Many benchmarks have bounded scores (e.g. [0, 1] accuracy). After computing Îτ(x*), we clip its endpoints to the feasible range. This cannot reduce coverage for bounded Pτ, but it can reduce width in saturated regimes (notably post-emergence plateaus).

  5. Stage-1 fitting error enters through the map x* ↦ L̂(x*). Because residuals are computed against P̂τ(Lr, t) at observed losses, they automatically incorporate misfit of P̂τ as a function of L, but not necessarily additional uncertainty in L̂(x*). In practice we address this by augmenting ϕ(L) with Stage-1 predictive features (e.g. predicted loss quantiles or compute C itself), and by ensuring that calibration runs span the neighborhood of L̂(x*) in loss space; the active acquisition rule in the next section is designed precisely to enforce this coverage-relevant overlap.


7. Active-FLP Algorithm: acquisition functions for selecting next (configuration, checkpoint, benchmark); multi-fidelity scheduling (converged vs partial); invariants and termination criteria.

Having defined run-level conformal intervals Îτ(x), we now specify how Active-FLP selects the next training and evaluation actions so as to reduce the terminal widths width(Îτ(x*)) at minimal cost. The algorithm operates in an interactive multi-fidelity regime: at each round we may either (i) launch an additional training run (to convergence or to a prescribed fraction), or (ii) spend evaluation budget on one or more tasks τ at selected checkpoints within already-running jobs. The resulting dataset consists of independent runs indexed by r, each run providing a dependent checkpoint sequence {(Lr, t, Yr, t, τ)}t for those τ that we choose to evaluate.

We formalize each decision as an action
a = (x, α, 𝒮, 𝒰),
where x ∈ 𝒳 is the configuration, α ∈ (0, 1] is a training fidelity (interpreted as the fraction of the full compute budget C(x) to be expended in the next run, with α = 1 meaning converged), 𝒮 ⊆ 𝒯 is the set of tasks to evaluate, and 𝒰 ⊆ {t1 < t2 < ⋯} is the subset of checkpoint indices at which we will query those tasks. We take the marginal cost of such an action to be
cost(a) ≡ costtrain(x; α) + ∑τ ∈ 𝒮 ∑t ∈ 𝒰 costeval(τ),
where costtrain(x; α) is approximately proportional to αC(x) up to system-dependent constants, and costeval(τ) may vary by orders of magnitude across tasks.

The algorithm maintains at each round a current predictor P̂τ and a calibrated interval Îτ(x*) for each task τ. Our operational goal is to reach
maxτ ∈ 𝒯 width(Îτ(x*)) ≤ 2ε
while respecting an optional global budget constraint. Accordingly, we choose actions by a one-step lookahead proxy: for each candidate a ∈ 𝒜 (a finite set of feasible actions), we estimate the expected interval-width reduction per unit cost,

$$ \mathrm{Score}(a) \;\equiv\; \frac{\mathbb{E}\bigl[\Delta(a)\mid\mathcal{D}\bigr]}{\mathsf{cost}(a)}, \qquad \Delta(a) \;\equiv\; \sum_{\tau\in\mathcal{T}} w_\tau\Bigl(\mathrm{width}\bigl(\widehat{I}_\tau(x^*)\bigr)-\mathrm{width}\bigl(\widehat{I}_\tau^{\,\mathrm{new}}(x^*)\bigr)\Bigr), $$

where wτ ≥ 0 are user-chosen weights (default wτ = 1), 𝒟 is the current dataset, and Îτnew denotes the interval after incorporating the new observation(s). In practice we do not compute 𝔼[Δ(a) ∣ 𝒟] exactly; we approximate it using a surrogate for predictive uncertainty, typically a variance proxy for P̂τ(L̂(x*)) together with an estimate of how a new labeled point (L, Yτ) would shrink that proxy under the Stage-2 regression model. A convenient heuristic is an upper-confidence bound (UCB) variant,
$$ \mathrm{Score}_{\mathrm{UCB}}(a) \;\equiv\; \frac{\widehat{\Delta}(a) + \kappa \cdot \widehat{\sigma}_\Delta(a)}{\mathsf{cost}(a)}, $$
with κ > 0 tuned to favor exploration when uncertainty is high.
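A minimal sketch of this cost-normalized UCB scoring over a finite candidate set; the `Action` fields and the width-reduction estimates are placeholders for whichever surrogate is used, not a prescribed interface.

```python
# Cost-aware UCB acquisition: pick the action with the largest
# (estimated width reduction + kappa * its uncertainty) per unit cost.
from dataclasses import dataclass

@dataclass
class Action:
    cost: float          # cost(a): partial-training cost + evaluation cost
    delta_hat: float     # estimated expected width reduction, Delta_hat(a)
    delta_std: float     # uncertainty of that estimate, sigma_hat_Delta(a)

def ucb_score(a: Action, kappa: float = 1.0) -> float:
    return (a.delta_hat + kappa * a.delta_std) / a.cost

def select_action(candidates, kappa: float = 1.0) -> Action:
    return max(candidates, key=lambda a: ucb_score(a, kappa))

# A cheap checkpoint evaluation can beat an expensive converged run whenever
# its expected width reduction per unit cost is higher.
```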

Because Stage-2 uses L-space as the predictor, the most informative new data are those whose loss vectors lie near L̂(x*) (or, more generally, in regions where the fitted map L ↦ Pτ has high curvature or high residual dispersion). Thus, a first-order acquisition principle is to select configurations x that, at the fidelity we intend to run, will produce checkpoints Lr, t(x) near the current target neighborhood. Concretely, we estimate a loss trajectory t ↦ L̂t(x) from existing runs (or use a monotone interpolation of observed Lr, t sequences) and define a proximity score
π(x) ≡ mint ∈ 𝒰 ∥L̂t(x) − L̂(x*)∥2,
favoring x with small π(x) when Stage-2 uncertainty dominates. Conversely, when Stage-1 uncertainty dominates (e.g. too few converged runs), we prioritize converged actions at configurations that reduce the uncertainty of the Stage-1 fit near C(x*), which we implement by space-filling in log C (or along each source dimension when C ∈ ℝ+K) combined with leverage-based selection for the log-linear regression of log Ld on log C.

Active-FLP distinguishes two regimes.

Stage-1-limited regime. If the Stage-1 fit L̂(x) has large estimated error at x* (e.g. wide predictive bands in each domain d), then partial runs do not resolve the principal bottleneck: we cannot reliably map x* to the L-space region where Stage-2 should be calibrated. In this regime we schedule α = 1 runs and treat their converged L(x) values as high-priority observations.

Stage-2-limited regime. Once Stage-1 predictive uncertainty at x* is below a tolerance (operationally, once augmenting Stage-1 with one more converged run yields negligible decrease in width(Îτ(x*)) for most τ), we allocate the majority of budget to partial runs and checkpoint evaluations. Here the purpose is to densify labeled pairs (Lr, t, Yr, t, τ) in the relevant L-space region, increasing Neff while keeping costtrain small.

A practical schedule is to propose α values from a discrete set (e.g. {0.1, 0.2, 0.4, 0.7, 1}) and include them in 𝒜. The acquisition then trades off the fact that increasing α yields later checkpoints with smaller losses (potentially closer to emergence) at higher cost.

Within a run, we restrict evaluations to a sparse checkpoint schedule to limit dependence and to avoid redundant cost. We typically use a geometric schedule tj ∝ γj (e.g. γ ∈ [1.2, 1.5]), optionally with enforced separation to support blocking (Section 5). At each selected checkpoint we choose tasks by a two-tier rule (a brief sketch of the schedule and gate follows below):

  1. Cheap tasks are evaluated at every scheduled checkpoint, since their cost is negligible relative to training and they densify the Stage-2 fit.
  2. Expensive tasks are evaluated only when the current loss trajectory indicates that the run is near or past the task's conjectured emergence region, or when the acquisition score justifies the cost.

This implements the informal principle that pre-emergence evaluations are often uninformative yet costly, whereas post-emergence evaluations near the transition substantially reduce uncertainty about Pτ(x*).
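As referenced above, a minimal sketch of the geometric checkpoint schedule and the expensive-task gate; the threshold and margin values are illustrative assumptions of this sketch.

```python
# Geometric checkpoint schedule plus a loss-based gate for expensive tasks.
def geometric_schedule(t_max, gamma=1.3, t0=1):
    """Checkpoint indices roughly t0 * gamma^j, deduplicated and capped at t_max."""
    ts, t = [], float(t0)
    while round(t) <= t_max:
        if not ts or int(round(t)) > ts[-1]:
            ts.append(int(round(t)))
        t *= gamma
    return ts

def evaluate_expensive_task(pred_loss, loss_threshold, margin=0.05):
    """Gate: evaluate only once the predicted loss is within `margin` of (or
    below) the task's conjectured emergence threshold in loss space."""
    return pred_loss <= loss_threshold * (1.0 + margin)
```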

Throughout the online process we maintain the following invariants, which are necessary for the validity claims of Section 6 and for the theoretical results that follow.

  1. The conformal split is respected: runs assigned to the calibration set are never used to fit P̂τ or to tune ĝτ, ϕ, or the blocking rule.
  2. The block length B and the aggregator Agg are fixed before calibration residuals are computed (e.g. from a pilot set of runs) and are not tuned adaptively thereafter.
  3. Each newly launched run uses fresh randomness, so that run-level independence, and hence exchangeability of calibration units, is preserved under adaptive selection of configurations.

Active-FLP terminates when either (i) the target width criterion is met,
maxτ ∈ 𝒯 width(Îτ(x*)) ≤ 2ε,
or (ii) the remaining budget is insufficient to execute any action with nontrivial expected reduction. If termination occurs by budget exhaustion, we return the current P̂τ(x*) and Îτ(x*), together with a diagnostic that identifies tasks whose widths remain large and whether the bottleneck is Stage-1 (insufficient converged runs near C(x*)) or Stage-2 (insufficient effective labeled samples near L̂(x*)). This diagnostic is used to justify, in the subsequent theory, why emergence-like censoring can force non-shrinking intervals absent additional assumptions.


8. Theory: (i) effective-sample-size and estimation error bounds; (ii) active sample complexity in an emergence-like special case; (iii) lower bounds / impossibility when all data are pre-emergence.

We now record the theoretical statements underlying Active-FLP. The results isolate three phenomena: (i) within-run checkpoint dependence reduces the statistical value of many intermediate evaluations, but admits a clean characterization; (ii) in a stylized emergence model, active evaluation concentrates effort near the transition and improves over passive designs; (iii) if all observed task measurements remain pre-emergence (censored at a floor), then forecasting post-emergence performance is information-theoretically impossible without additional structural assumptions.

In Stage-2 fitting it is tempting to treat each checkpoint measurement as a separate sample. Under (H3), this is incorrect: for fixed run r, the residual sequence across checkpoints is dependent, so the nominal checkpoint count T must be discounted. The appropriate control parameter is the mixing profile β(m), which bounds how quickly dependence decays with separation m.

To state the consequence at the level we need, fix a task τ and consider the regression model
Yr, t, τ(x) = gτ(θτ⊤ϕ(Lr, t(x))) + ξr, t, τ,
with gτ monotone and 1-Lipschitz and ξr, t, τ sub-Gaussian. Linearizing through the Lipschitz property (or working directly with an M-estimator for the generalized model), we obtain estimation error rates controlled by an effective sample size
$$ N_{\mathrm{eff}} \;\equiv\; N \cdot \frac{T}{1 + 2\sum_{m\ge 1}\rho(m)}, $$
where ρ(m) is a dependence measure dominated by β(m) (e.g. a suitable covariance or functional dependence coefficient). In particular, when ∑m ρ(m) < ∞, the penalty factor 1 + 2∑m ρ(m) is constant, so Neff = Θ(NT); whereas when dependence is long-range, adding closely spaced checkpoints yields diminishing returns.

Under standard regularity conditions for least-squares (or a strongly convex loss for the M-estimator), we may summarize the rate as
$$ \mathbb{E}\bigl\|\widehat{\theta}_\tau-\theta_\tau\bigr\|_2^2 \;\le\; O\!\left(\frac{1}{N_{\mathrm{eff}}}\right), $$
which matches the i.i.d. scaling after replacing NT by Neff. Conversely, even in the simplest AR(1) model within runs, any unbiased estimator suffers variance Ω(1/Neff), so the reduction is not an artifact of the proof technique but an intrinsic limitation of correlated checkpoints.

Two operational corollaries guide the algorithmic design. First, checkpoint spacing matters: if we only evaluate every s checkpoints, then (informally) ρ(m) is replaced by ρ(ms), improving Neff at fixed evaluation cost. Second, if we aggregate checkpoints into non-overlapping blocks of length B and treat each block as one calibration unit, then exchangeability is restored at the block level up to an error controlled by the tail sum ∑m ≥ B β(m). This is precisely the rationale for run-level or block-level conformal calibration: we may fit Stage-2 using many dependent points, but we calibrate using residuals aggregated at a granularity for which exchangeability across units is a credible assumption.

We next formalize why emergence-aware acquisition can reduce expensive evaluations. Consider a one-dimensional compute axis C and assume the converged loss obeys the Stage-1 relation
log L(C) = a + b log C,
with sub-Gaussian run-to-run noise, so that a small number of converged runs yields a tight predictor L̂(C) on a log scale. For a fixed task, we assume a floor-linked post-emergence model

$$ P(C) \;=\; P_{\mathrm{rand}} \;+\; \lambda\,\max\bigl(0,\; L_c - L(C)\bigr) \;+\; \xi, $$

where Lc is an unknown loss threshold for emergence, λ > 0, and ξ is sub-Gaussian. The key feature is the censored regime L(C) ≥ Lc, where all measurements concentrate near Prand and hence provide weak information about post-emergence behavior.

The forecasting goal is to construct, for a target C*, an interval Î(C*) with marginal coverage at least 1 − δ and half-width at most ε. Active-FLP gains by concentrating expensive evaluations near the transition L(C) ≈ Lc, which is the region where the link above is most sensitive. Using L̂(C) as a surrogate coordinate, we can implement a binary-search-like allocation: we propose candidate evaluations whose predicted loss straddles the current plausible range for Lc, thereby shrinking uncertainty about whether C* lies in the censored regime or the linear regime. After locating the relevant regime, we allocate the remaining budget to reducing the conditional noise at C*, which necessarily incurs an ε−2 dependence from mean estimation.

In this special case one can show the existence of an active policy achieving
$$ O\!\left(\log\frac{1}{\varepsilon} + \frac{1}{\varepsilon^2}\log\frac{1}{\delta}\right) $$
expensive evaluations, where the logarithmic term reflects the transition-localization phase (a discrete analogue of zooming in on Lc) and the ε−2 log(1/δ) term is the unavoidable cost of estimating a noisy value to accuracy ε. In contrast, any passive uniform-grid strategy over a compute range must pay an additional Ω(1/ε) term to ensure that at least one evaluation lands sufficiently close to the emergence boundary to resolve whether C* is pre- or post-emergence. This gap formalizes the intuition that emergence induces a small informative region in loss space, and active methods win by finding it quickly.
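A minimal sketch of the transition-localization phase as bisection in log-compute; the `evaluate` oracle and the noise slack are assumptions of this illustration, and in practice repeated evaluations would be used to handle noise robustly.

```python
# Bisection over log-compute to bracket the emergence point with O(log(1/eps))
# expensive evaluations, matching the localization term in the bound above.
import numpy as np

def localize_transition(evaluate, c_lo, c_hi, p_rand, tol_log=0.05, slack=0.02):
    """Shrink [c_lo, c_hi] until it brackets the emergence point to within
    `tol_log` in log-compute. `evaluate(C)` returns one noisy task metric."""
    while np.log(c_hi) - np.log(c_lo) > tol_log:
        c_mid = np.exp(0.5 * (np.log(c_lo) + np.log(c_hi)))
        y = evaluate(c_mid)                 # one expensive evaluation
        if y > p_rand + slack:              # above the floor: emergence is earlier
            c_hi = c_mid
        else:                               # at the floor: emergence is later
            c_lo = c_mid
    return c_lo, c_hi
```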

Finally, we make precise the failure mode suggested by emergence: if all observed task values remain at floor, then distribution-free forecasting of post-emergence performance cannot succeed. Fix a task τ and suppose that all collected evaluations satisfy
Yr, t, τ ≤ Prand   almost surely.
Then no matter how adaptively we choose configurations, checkpoints, and evaluation times, the data are consistent with multiple post-emergence worlds. Concretely, we may construct two data-generating processes that (i) agree on the distribution of losses and on the conditional distribution of Yr, t, τ whenever Yr, t, τ ≤ Prand, yet (ii) differ by a constant gap in the implied value of Pτ(x*). Since the algorithm never observes uncensored outcomes, it cannot distinguish the two hypotheses. By Le Cam’s two-point method, any estimator τ(x*) must incur worst-case absolute error bounded below by a constant, and hence no nontrivial (ε, δ)-guarantee is possible for small ε.

This impossibility result should be interpreted as a boundary of assumption-free forecasting. It does not contradict the validity of conformal intervals: conformal prediction guarantees coverage relative to the empirical residual distribution under exchangeability of calibration units, but if the underlying problem is unidentifiable (because the data never enter the informative regime), then the resulting intervals cannot be forced to shrink. Practically, the theorem justifies the emergence-aware escalation rule: for tasks that remain fully censored under the current evaluation plan, the only remedies are to (a) increase fidelity to reach smaller losses, (b) redirect evaluation toward configurations whose loss trajectories overlap the conjectured threshold region, or (c) introduce auxiliary proxies or structural priors beyond (H1)–(H4).


9. Experimental Plan (flagged as needed for strength): datasets/logs, evaluation suite, ablations (active vs passive, block sizes, dependence), calibration diagnostics, and cost accounting.

We now specify an experimental protocol whose purpose is to (i) validate the finite-sample coverage guarantees empirically under realistic dependence, (ii) quantify the cost–accuracy tradeoffs of Active-FLP relative to passive designs, and (iii) stress-test the failure modes suggested by the lower bounds. Throughout, we fix a recipe class ℛ (optimizer and schedule template, tokenizer, context length, and data preprocessing), and we treat each independent training job (random seed, data order, and initialization) as one run r.

We define a finite design space 𝒳 by discretizing the primary axes that affect compute and loss: model size, token budget, and (when applicable) a K-source allocation C(x) ∈ ℝ+K with kCk(x) fixed at a chosen grid of totals. We emphasize that 𝒳 must include configurations that plausibly bracket the target x* in both compute and loss, since Stage-1 extrapolation error is otherwise conflated with the Stage-2 mapping error.

For every run r at configuration x, we log (a) a dense sequence of training checkpoints $t \in \{1, \dots, T_{\max}\}$ with their wall-clock and FLOP counters; (b) the validation loss vector $L_{r,t}(x) \in \mathbb{R}^D$ on D held-out probe domains; and (c) metadata sufficient to support dependence diagnostics (e.g. global step, effective batch size, and data shard identifiers). We also record the converged loss L(x) for runs trained to the stated convergence criterion (or to a fixed training fraction when the run is intentionally truncated). To keep storage and evaluation costs controlled, we additionally define a sparse checkpoint schedule $\mathcal{S} \subset \{1, \dots, T_{\max}\}$ (e.g. geometric spacing) from which we draw candidate checkpoints for downstream evaluations.
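
A minimal logging schema and a geometric sparse schedule might look like the following sketch; the names (CheckpointRecord, geometric_checkpoint_schedule) and field choices are illustrative, not prescribed by the protocol.

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class CheckpointRecord:
    """One logged training checkpoint (field names are illustrative)."""
    run_id: str
    config_id: str
    step: int                  # global optimizer step t
    flops: float               # cumulative training FLOPs at this checkpoint
    wallclock_s: float
    probe_losses: np.ndarray   # L_{r,t}(x) in R^D over D held-out probe domains
    batch_size: int
    data_shards: tuple = field(default_factory=tuple)

def geometric_checkpoint_schedule(t_max, base=1.3, t_min=1):
    """Sparse schedule S, a geometrically spaced subset of {1, ..., T_max}."""
    steps, t = [], float(t_min)
    while t <= t_max:
        steps.append(int(round(t)))
        t *= base
    return sorted(set(steps))

print(geometric_checkpoint_schedule(10_000)[:10])
```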

We partition the downstream task set 𝒯 into (i) tasks $\mathcal{T}_{\mathrm{cheap}}$, which can be evaluated frequently at modest cost $\mathrm{cost}_{\mathrm{eval}}(\tau)$, and (ii) tasks $\mathcal{T}_{\mathrm{exp}}$, which we evaluate sparingly. The suite should include a range of behaviors: tasks expected to be smooth in loss space, tasks with sharp emergence, and tasks with substantial evaluation noise (e.g. small test sets). For each τ, we predefine the evaluation protocol (prompting template, decoding parameters, number of shots, and scoring), and we keep these fixed across runs to preserve exchangeability of residuals.

To support out-of-sample forecasting, we designate a subset of configurations $\mathcal{X}_{\mathrm{test}} \subset \mathcal{X}$ as targets x* that are never used as acquisition candidates during training of the policy. We then evaluate the returned intervals $\widehat{I}_\tau(x^*)$ on these held-out targets using independent runs, yielding a direct empirical estimate of marginal coverage and interval width.

We compare Active-FLP to passive and partially active alternatives under matched total budgets. The baseline set includes, at minimum, a passive uniform-grid design over compute with a fixed checkpoint and evaluation schedule, together with partially active variants that adapt only one of the two acquisition axes (training fidelity or evaluation placement). We then run ablations over the modeling and dependence-handling choices, including active versus passive acquisition, the block size B, and block-conformal versus checkpoint-level i.i.d. calibration.

For each τ and each held-out target x* ∈ 𝒳test, we compute empirical coverage
$$ \widehat{\mathrm{Cov}}_\tau \;=\; \frac{1}{|\mathcal{X}_{\mathrm{test}}|}\sum_{x^*\in\mathcal{X}_{\mathrm{test}}}\mathbf{1}\!\left\{P_\tau(x^*)\in \widehat{I}_\tau(x^*)\right\}, $$
where $P_\tau(x^*)$ is approximated by averaging over multiple independent evaluation replicates at convergence (or by multiple runs when task noise is dominated by training randomness). We report both marginal coverage aggregated over targets and stratified coverage as a function of predicted loss $\widehat{L}(x^*)$, since emergence suggests nonuniform difficulty across loss regimes. We also report the median and 90th percentile interval half-widths, emphasizing the maximum over tasks when the objective is to satisfy $\max_\tau \mathrm{width}(\widehat{I}_\tau(x^*)) \le 2\varepsilon$.
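
For concreteness, the coverage and width statistics reduce to a few lines; the dictionary-based interface below is an assumption of this sketch, not a fixed API.

```python
import numpy as np

def coverage_and_widths(intervals, truth):
    """Empirical marginal coverage and interval half-widths for one task.

    intervals: dict mapping target config id -> (lo, hi) from I_hat_tau(x*)
    truth:     dict mapping target config id -> reference value of P_tau(x*),
               approximated by averaging independent evaluation replicates.
    """
    covered, half_widths = [], []
    for x_star, (lo, hi) in intervals.items():
        p = truth[x_star]
        covered.append(lo <= p <= hi)
        half_widths.append(0.5 * (hi - lo))
    half_widths = np.array(half_widths)
    return {
        "coverage": float(np.mean(covered)),
        "median_half_width": float(np.median(half_widths)),
        "p90_half_width": float(np.quantile(half_widths, 0.9)),
    }

# Toy usage with made-up numbers.
stats = coverage_and_widths(
    intervals={"x1": (0.42, 0.58), "x2": (0.30, 0.70)},
    truth={"x1": 0.55, "x2": 0.75},
)
print(stats)   # coverage is 0.5 in this toy example
```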

To diagnose dependence and justify the chosen B, we compute within-run autocorrelation and simple mixing proxies of the residual sequence (e.g. autocorrelation of $Y_{r,t,\tau} - \widehat{P}_\tau(L_{r,t})$ at lags measured in checkpoints of 𝒮). We then check whether the empirical coverage deviation decreases as B increases, consistent with the qualitative bound controlled by $\sum_{m \ge B} \beta(m)$. In addition, we perform a negative control: recalibrate using randomly permuted checkpoints within each run (which destroys temporal dependence while preserving the marginal distribution) and verify that the i.i.d. conformal method becomes well-calibrated in this permuted setting.
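
The diagnostics amount to residual autocorrelations plus a within-run permutation; the sketch below uses a synthetic AR(1)-style residual sequence as a stand-in for real logs.

```python
import numpy as np

rng = np.random.default_rng(0)

def autocorrelation(residuals, max_lag=10):
    """Within-run autocorrelation of the checkpoint residual sequence."""
    r = np.asarray(residuals, dtype=float)
    r = r - r.mean()
    denom = float(np.dot(r, r))
    return [float(np.dot(r[:-k], r[k:]) / denom) for k in range(1, max_lag + 1)]

def permute_within_run(residuals):
    """Negative control: shuffle checkpoints inside a run, preserving the
    marginal distribution while destroying temporal dependence."""
    return rng.permutation(np.asarray(residuals))

# Synthetic AR(1)-style residuals standing in for Y_{r,t,tau} - P_hat_tau(L_{r,t}).
eps = rng.normal(size=200)
res = np.empty_like(eps)
res[0] = eps[0]
for t in range(1, len(eps)):
    res[t] = 0.8 * res[t - 1] + eps[t]    # strong within-run correlation

print("lag 1-3 autocorr:", autocorrelation(res, max_lag=3))
print("after permutation:", autocorrelation(permute_within_run(res), max_lag=3))
```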

We account for cost at the same granularity used by the algorithm. The realized total cost of a run of the policy is
$$ \mathrm{Cost}_{\mathrm{total}} \;=\; \sum_{\text{training actions}} \mathrm{cost}_{\mathrm{train}}(x, \text{fidelity}) \;+\; \sum_{\text{task evaluations}} \mathrm{cost}_{\mathrm{eval}}(\tau), $$
where the training term is measured in FLOPs (or normalized GPU-hours) and the evaluation term is measured in forward-pass FLOPs (or wall-clock), using consistent accounting across methods. We report (i) the cost to reach the target width ε at fixed δ, (ii) the achieved width at fixed budget, and (iii) the Pareto frontier over $(\mathrm{Cost}_{\mathrm{total}}, \max_\tau \mathrm{width}(\widehat{I}_\tau(x^*)))$. To isolate the benefit of active acquisition from the benefit of multi-fidelity training, we also provide a decomposition into converged-run cost versus checkpoint-evaluation cost, since the intended advantage of Active-FLP is to reduce expensive benchmark evaluations while keeping the number of fully converged sampling runs small.
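
Cost aggregation and the (cost, width) Pareto frontier are then straightforward to compute; the field names in this sketch are illustrative.

```python
def total_cost(train_actions, eval_actions):
    """Realized cost under the same accounting the algorithm uses.

    train_actions: iterable of dicts with a 'flops' field (training FLOPs
                   at the chosen fidelity for configuration x).
    eval_actions:  iterable of dicts with a 'cost_eval' field (forward-pass
                   FLOPs or wall-clock for one evaluation of task tau).
    """
    train = sum(a["flops"] for a in train_actions)
    evals = sum(a["cost_eval"] for a in eval_actions)
    return {"train": train, "eval": evals, "total": train + evals}

def pareto_frontier(points):
    """Non-dominated (cost, max-width) pairs; lower is better in both."""
    frontier, best_width = [], float("inf")
    for cost, width in sorted(points):
        if width < best_width:
            frontier.append((cost, width))
            best_width = width
    return frontier

print(pareto_frontier([(1.0, 0.30), (2.0, 0.12), (1.5, 0.40), (3.0, 0.10)]))
```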

Finally, we include a targeted stress test of the censoring impossibility: we construct a regime in which, for a selected τ, all evaluated checkpoints are empirically at floor (no emerged observations), and we verify that all methods either return wide intervals or fail coverage when forced to return narrow ones. This experiment connects the theory to practice by demonstrating that, absent uncensored data or additional assumptions, interval shrinkage is not merely difficult but structurally unsupported by the observations.


10. Discussion: extensions to many-source mixtures, recipe shift detection, and limitations; relation back to FLP/FLP-M findings.

The formalism stated in the global context already accommodates a vector-valued compute allocation $C(x) \in \mathbb{R}_+^K$ across K pretraining sources. The principal technical change relative to FLP-M (K = 2) is that Stage-1 becomes a multivariate scaling problem: we no longer fit a one-dimensional map C ↦ L, but rather a surface C ↦ L over a constrained region (typically a simplex $\{C\ge 0:\sum_{k=1}^K C_k=C_{\mathrm{tot}}\}$ for each total compute level). A direct generalization of the log-power law is

$$ \log L_d(x) \;=\; a_d \;-\; \sum_{k=1}^K b_{d,k}\,\log C_k(x) \;+\; \eta_d, \qquad d = 1, \dots, D, $$

with sub-Gaussian run-to-run noise $\eta_d$. This form is attractive because it preserves convexity in log C and admits closed-form (or regularized) regression; however, it also highlights an identifiability constraint: if the experimental design keeps $\sum_k C_k$ fixed and explores only a low-dimensional face of the simplex, then $\{b_{d,k}\}$ cannot be separated without additional variation. In practice, a sound design for Stage-1 must therefore vary both the total compute and the mixture proportions: we require a set of converged runs whose $\{\log C_k\}$ vectors span $\mathbb{R}^K$ up to the affine constraints induced by the budget grid.
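
Under a linear-in-log-C form of this kind, Stage-1 fitting per probe domain reduces to (optionally ridge-regularized) least squares; a minimal sketch on synthetic converged runs, assuming uncensored log-losses, follows.

```python
import numpy as np

def fit_log_power_law(C, L, ridge=1e-3):
    """Least-squares fit of log L_d = a_d - sum_k b_{d,k} log C_k (one domain d).

    C: (n_runs, K) per-source compute allocations of converged runs.
    L: (n_runs,)  converged losses in probe domain d.
    ridge: small L2 penalty on b to stabilize the fit when {log C_k} is
           nearly collinear (the identifiability issue discussed above).
    """
    X = np.column_stack([np.ones(len(L)), -np.log(C)])   # [1, -log C_1, ..., -log C_K]
    y = np.log(L)
    penalty = ridge * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                                   # do not penalize the intercept a_d
    coef = np.linalg.solve(X.T @ X + penalty, X.T @ y)
    return coef[0], coef[1:]                              # a_d, b_d

# Toy check: recover known coefficients from synthetic converged runs.
rng = np.random.default_rng(1)
C = np.exp(rng.uniform(18, 22, size=(40, 3)))             # log C spread so it spans R^3
true_b = np.array([0.05, 0.12, 0.02])
L = np.exp(1.5 - np.log(C) @ true_b + 0.01 * rng.normal(size=40))
print(fit_log_power_law(C, L))
```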

Active-FLP extends naturally to this setting, but the acquisition problem becomes combinatorial if 𝒳 is defined by discretizing mixture ratios. A pragmatic approach is to maintain a finite candidate set of mixture points (e.g. a simplex lattice) and to score each candidate action a = (x, fidelity, tasks) by an estimate of expected interval shrinkage per unit cost. The key point is that this shrinkage can be driven by uncertainty in the induced loss vector $\widehat{L}(x)$ as well as by uncertainty in the Stage-2 map L ↦ Pτ: when K is large, the most informative next run may be one that moves $\widehat{L}(x)$ in a direction that disambiguates $\theta_\tau^\top \phi(L)$ for tasks whose emergence boundary is poorly localized.
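
A minimal sketch of the shrinkage-per-cost scoring follows; the shrinkage estimator is left abstract (here a placeholder callable), since in practice it is derived from the fitted Stage-1/Stage-2 models.

```python
def score_actions(candidates, estimate_shrinkage):
    """Rank candidate actions a = (x, fidelity, tasks) by expected interval
    shrinkage per unit cost (both quantities are model-based estimates).

    candidates:         iterable of dicts with a 'cost' field plus whatever the
                        shrinkage estimator needs.
    estimate_shrinkage: callable mapping a candidate to the predicted reduction
                        in max_tau width(I_hat_tau(x*)) if the action is taken.
    """
    scored = []
    for a in candidates:
        gain = estimate_shrinkage(a)
        scored.append((gain / max(a["cost"], 1e-12), a))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# Toy usage with a placeholder shrinkage estimate.
candidates = [
    {"name": "partial run near conjectured threshold", "cost": 1.0, "prior_gain": 0.08},
    {"name": "extra eval of expensive task", "cost": 0.2, "prior_gain": 0.01},
    {"name": "converged run at new mixture point", "cost": 5.0, "prior_gain": 0.20},
]
best = score_actions(candidates, estimate_shrinkage=lambda a: a["prior_gain"])
print(best[0][1]["name"])
```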

A second, more structural, extension is to incorporate explicit diminishing-returns constraints on mixing. Empirically, mixture scaling laws often behave closer to a "soft minimum" over sources than to a fully separable sum. One can encode this by replacing the linear form in $\log C_k$ with a log-sum-exp or CES-like aggregator, e.g.

$$ \log L_d(x) \;=\; a_d \;+\; \frac{1}{\rho_d}\,\log\!\left(\sum_{k=1}^K \exp\!\big(-\rho_d\, b_{d,k}\,\log C_k(x)\big)\right) \;+\; \eta_d, $$

which interpolates between max-like and sum-like behavior depending on $\rho_d$. While this increases modeling risk, it can reduce extrapolation error when the true dependence is not separable. Importantly, the conformal calibration step remains distribution-free at the run level; misspecification in Stage-1 primarily manifests as wider (but still valid) intervals, provided the residuals used for calibration remain exchangeable across runs.
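
A numerically stable sketch of one such aggregator (the specific parameterization, mirroring the display above, is illustrative rather than canonical):

```python
import numpy as np

def lse_aggregate_log_loss(a_d, b_d, log_C, rho_d):
    """Log-sum-exp ('soft minimum') variant of the Stage-1 surface for domain d.

    For large rho_d the aggregator approaches the max over per-source terms
    (loss dominated by the least-served source); for rho_d = 1 it reduces to a
    separable sum of power-law terms in the original loss scale.
    """
    z = -b_d * log_C                       # per-source contributions -b_{d,k} log C_k
    m = np.max(rho_d * z)                  # standard max-shift for numerical stability
    lse = (m + np.log(np.sum(np.exp(rho_d * z - m)))) / rho_d
    return a_d + lse

log_C = np.log(np.array([1e20, 1e21, 3e20]))
b_d = np.array([0.05, 0.12, 0.02])
for rho in (1.0, 4.0, 32.0):
    print(rho, lse_aggregate_log_loss(a_d=1.5, b_d=b_d, log_C=log_C, rho_d=rho))
```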

All guarantees are conditional on a fixed recipe class. When the recipe changes (optimizer, schedule, tokenizer, context length, or data pipeline), the map C ↦ L and the downstream map L ↦ Pτ may shift. The practical failure mode is not merely increased variance, but a loss of exchangeability between calibration residuals and the target x*, which can invalidate conformal coverage.

We therefore require explicit recipe-shift detection. Since Stage-1 provides a low-dimensional summary ($\widehat{L}(x)$ and residuals $L(x) - \widehat{L}(x)$), we can monitor recipe stability by reserving a small set of "sentinel" configurations $x_{\mathrm{sent}}$ that are periodically rerun. Let $\Delta_d$ denote the difference between the new run loss and the historical Stage-1 prediction in domain d; if $|\Delta_d|$ is systematically large relative to the calibration distribution, then the recipe has drifted. Formally, one may implement a run-level two-sample test on residual vectors, or a conformal p-value for the null that new residuals are drawn from the same distribution as the calibration runs. If a shift is detected, the correct response is to recalibrate using only post-shift runs (or to treat each recipe as a separate group and apply group-conditional conformal methods), rather than to pool residuals across incompatible regimes.
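
The conformal p-value variant of the sentinel check admits a very small implementation; the scalar residual score used below is an assumption of this sketch.

```python
import numpy as np

def conformal_shift_pvalue(calibration_residuals, new_residual):
    """Conformal p-value for the null that a new sentinel-run residual comes
    from the same distribution as the historical calibration residuals.

    Residuals here are scalar scores, e.g. the norm of the Stage-1 residual
    vector L(x_sent) - L_hat(x_sent) for a rerun sentinel configuration.
    """
    cal = np.abs(np.asarray(calibration_residuals, dtype=float))
    s_new = abs(float(new_residual))
    # Rank of the new score among calibration scores (plus itself).
    return (1.0 + np.sum(cal >= s_new)) / (len(cal) + 1.0)

rng = np.random.default_rng(2)
cal = rng.normal(scale=0.02, size=50)        # historical sentinel residuals
print(conformal_shift_pvalue(cal, 0.01))     # typical score: large p-value, no drift
print(conformal_shift_pvalue(cal, 0.15))     # atypical score: small p-value, flag drift
```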

A subtler case is a partial shift affecting only the task evaluation protocol (prompt format, decoding temperature, or scoring). Here Stage-1 can remain valid while Stage-2 shifts. This motivates logging task-level metadata and maintaining per-task change-point checks on $Y_{r,t,\tau} - \widehat{P}_\tau(L_{r,t})$. When drift is localized to a subset of tasks, we can preserve efficiency by re-evaluating only those tasks for recalibration, leaving the remainder unchanged.

We record several limitations implied by the assumptions and by the lower bounds.

First, our coverage guarantee is over the randomness of runs and evaluations, not conditional on a fixed realized trajectory. Without stronger assumptions, distribution-free conditional coverage is unattainable in general. In applications where conditional guarantees are required (e.g. per-deployment safety constraints), one should expect to pay additional cost, either by stratifying calibration by covariates (loss regime, mixture region) or by imposing stronger parametric structure.

Second, dependence handling relies on (i) multiple independent runs and (ii) a mixing condition within runs sufficient for block aggregation. If one has only a single run, then any procedure that treats correlated checkpoints as independent will be overconfident, and block methods cannot restore exchangeability across runs. This is not an artifact of analysis but reflects the dependence lower bound stated in the global context.

Third, the Stage-2 model class $P_\tau = g_\tau(\theta_\tau^\top \phi(L)) + \xi$ is intentionally constrained: monotonicity and Lipschitzness encode emergence without allowing arbitrary oscillation. When tasks are genuinely non-monotone in loss space (e.g. due to prompt sensitivity or evaluation artifacts), the model will be misspecified; conformal intervals may remain valid, but can become wide, and active acquisition may target the wrong regions. A practical mitigation is to enrich ϕ(L) with interaction terms or to use isotonic regression on a learned one-dimensional score, while retaining monotonicity as a shape constraint.
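
As an illustration of the isotonic mitigation (a sketch using scikit-learn and synthetic emergence-like data, not a prescribed pipeline):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Learn a one-dimensional score s (here a stand-in, s = -loss) and fit a
# monotone map from score to task performance; the shape constraint absorbs
# mild non-monotonicity from evaluation noise.
rng = np.random.default_rng(3)
loss = np.sort(rng.uniform(2.0, 4.0, size=120))[::-1]          # decreasing loss over training
score = -loss                                                   # stand-in for a learned 1-d score
perf = 1.0 / (1.0 + np.exp(-(score + 2.8) * 8)) * 0.7 + 0.25    # emergence-like curve
perf += 0.03 * rng.normal(size=perf.size)                       # evaluation noise

iso = IsotonicRegression(increasing=True, out_of_bounds="clip")
perf_hat = iso.fit_transform(score, perf)
print(float(np.mean((perf_hat - perf) ** 2)))                   # in-sample fit error
```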

Fourth, emergence creates an unavoidable data sparsity barrier. The censoring impossibility shows that if all observed Yr, t, τ remain at floor, no algorithm can infer post-emergence behavior without auxiliary signals. In practice, this means that for certain tasks one must either (a) spend enough compute to obtain at least one uncensored observation, (b) introduce proxies correlated with emergence (e.g. related tasks or intermediate metrics), or (c) accept vacuous intervals. Active-FLP is designed to avoid wasting evaluations far from the boundary, but it cannot remove this identifiability constraint.

Our framework can be viewed as a synthesis of two empirical regularities emphasized by FLP/FLP-M—simple scaling of loss with compute and predictable effects of mixing—with a third component needed for decision-making: calibrated uncertainty under dependent, multi-fidelity observations. Stage-1 inherits the central FLP premise that (within a fixed recipe) validation loss is well-approximated by a power law in compute, and the mixture generalization aligns with FLP-M by treating compute allocations as first-class variables. Stage-2 operationalizes the widely observed phenomenon that many benchmarks behave as approximately monotone functions of loss even across emergence, while allowing a task-specific link $g_\tau$ to encode thresholding.

The principal methodological departure is the explicit treatment of (i) correlated checkpoints and (ii) adaptive evaluation. FLP/FLP-M primarily characterize mean trends; our goal is to produce intervals $\widehat{I}_\tau(x^*)$ with finite-sample coverage and to minimize the cost of expensive evaluations. The block-conformal construction is the mechanism by which we convert the empirical scaling regularities into actionable forecasts with uncertainty quantification, and the active policy is the mechanism by which we exploit those regularities to concentrate evaluations near the loss regimes that dominate interval width. Under the stated assumptions, this yields a concrete sense in which forecasting downstream performance at scale is not merely descriptive (fitting curves) but algorithmic: it is a sequential resource allocation problem with provable calibration properties.