
RoboChinchilla: Compute-Optimal Scaling with Simulation and Real-World Data Costs for Robot Foundation Models


Table of Contents

  1. Introduction: why robotics needs compute-optimal scaling (contrast with language Chinchilla; highlight scarcity of compute scaling studies and the role of simulation/data engines).
  2. Related Work: scaling laws (Kaplan, Chinchilla), robotics scaling meta-analyses, compute reporting, simulation-to-real and data engines, inference-vs-train compute notes (scope boundaries).
  3. Problem Setup and Compute Accounting: define resources (N, Ds, Dr), success/error metrics, and total cost including training FLOPs and data-generation costs; justify cost units and conversion assumptions.
  4. Joint Scaling Model: postulate/fit Err(N, Ds, Dr) with effective data Deff; discuss identifiability, offsets, and when power laws fail (broken power laws, saturation).
  5. Compute-Optimal Allocation Theory: derive KKT conditions; give closed forms in special cases; prove convexity after log-transform in the relaxed problem; give rounding/feasibility procedures for discrete budgets.
  6. Robustness to Estimation Error: quantify how errors in α̂, β̂, ρ̂ propagate into allocation and final error; provide stability bounds and confidence intervals.
  7. Sample Complexity and Experiment Design: show how many scaling points/tasks are needed to estimate exponents to tolerance; propose a sequential design strategy for efficient sweeps.
  8. Hardness Beyond Power Laws: NP-hardness for generic monotone error functions; implications for what assumptions are necessary for actionable prescriptions.
  9. Experimental Protocol (Flagged as Implementation-Strengthening): standardized procedural suite, grid over (N, Ds, Dr) at constant budget; hierarchical fits; compare against heuristics; report compute with full accounting.
  10. Discussion: implications for 2026 robot data engines, sim-to-real strategy, benchmark standardization; limitations and extensions (multi-modal mixtures, test-time compute, safety constraints).

Content

1. Introduction: why robotics needs compute-optimal scaling (contrast with language Chinchilla; highlight scarcity of compute scaling studies and the role of simulation/data engines).

We consider the practical problem of allocating a finite budget across three coupled resources: model capacity, simulated experience, and real-world experience. In large language modeling, this allocation problem is now routinely addressed by empirical scaling laws together with compute-optimal prescriptions (notably the data–parameter tradeoffs highlighted by Chinchilla-style analyses). In robotics, the analogous question is both more constrained and less settled. It is more constrained because real interaction data carries nontrivial marginal cost, operational risk, and latency; it is less settled because the field lacks standardized reporting of training compute, data volumes, and evaluation protocols across sufficiently broad sweeps. Our objective in this work is to place robotics training runs under an explicit budget model and to derive an allocation rule that is compute-optimal under a minimal scaling-law hypothesis, while remaining actionable when only noisy measurements are available.

The robotics setting differs from the language setting in several ways that make naive transposition of language scaling heuristics unreliable. First, the unit of data is not a token but an episode, trajectory segment, or transition, and the effective informational content of such units varies substantially across collection modalities. An episode acquired via on-robot teleoperation may include high-fidelity contact dynamics, sensor idiosyncrasies, and safety-limited exploration; a nominally similar episode in simulation may be cheaper and more diverse, but may omit salient disturbances and lead to systematic sim-to-real gaps. Consequently, a data point cannot be priced solely by storage or bandwidth: it must be priced by end-to-end marginal cost (robot time, human time, wear, and opportunity cost) and valued by its contribution to downstream success. Second, robotics policies are trained under a range of objectives and algorithms (behavior cloning, offline reinforcement learning, online reinforcement learning, hybrid methods), each with different sensitivities to model size and data quality. Third, evaluation is intrinsically multi-task and nonstationary: minor changes in embodiment, lighting, or object geometry can shift performance, and long-horizon tasks can fail through compounding error. Any compute-optimal prescription must therefore be phrased in terms of expected downstream error over a benchmark family rather than a single curated test set.

Despite these differences, we claim that the underlying resource-allocation problem is structurally similar: diminishing returns in both data and model capacity suggest that a small number of power-law exponents can govern the tradeoff between investing in larger models and investing in additional experience, once the experience is expressed in an appropriate effective unit. The central methodological question is then not whether scaling exists in some idealized limit, but whether a sufficiently accurate parametric approximation can be fit over the feasible budget range to guide decisions. We adopt the position that an explicit parametric model is an algorithmic prior: without it, the budget allocation problem becomes combinatorial and, in general, intractable to optimize adaptively. With it, we can reduce the allocation problem to a low-dimensional program whose solution can be interpreted and stress-tested.

A further robotics-specific issue is that the data pipeline itself is a controllable system. Modern robotics training relies increasingly on data engines: procedural simulation, domain randomization, auto-curricula, fleet collection with shared autonomy, and teleoperation interfaces with varying levels of assistance. These engines do not merely produce more data; they change the distribution of data, the cost per episode, and the effective coverage of the task family. Accordingly, the question "how much data should we collect?" is inseparable from "in what modality, at what marginal cost, and with what quality multiplier relative to simulation?" We therefore model simulation and real data as distinct resources coupled through an effective-data mapping. This mapping is intentionally simple: we aim for a parameterization that can be estimated from limited sweeps and is stable enough to support optimization under budget uncertainty.

We emphasize that the allocation problem must include training compute. In robotics, it is common to report data volume and success rates while leaving the training compute implicit (or incomparable) due to differences in architectures, batch sizes, and wall-clock constraints. Yet compute is often the dominant cost in large-scale training regimes, and it interacts with data choices: increasing data without increasing compute can undertrain, while increasing compute without data can overfit or saturate. We therefore measure compute cost in a common budget unit and include it explicitly in the total cost. This yields a single constrained optimization problem that can be solved end-to-end, rather than a sequence of ad hoc decisions (e.g., fixing model size by convenience, then collecting as much data as possible, then tuning for wall-clock).

From an engineering perspective, the decision-maker typically faces the following trade: simulation episodes can be generated cheaply and at scale, but with imperfect fidelity; real episodes are expensive, slow, and subject to safety constraints, but can collapse model error that simulation fails to expose. The effective value of real data is thus not only higher on average but also potentially nonlinear in the regime where sim-to-real mismatch dominates. Nonetheless, for purposes of allocation, we require a tractable summary. We adopt a regime in which the marginal return of additional real data can be modeled through a constant quality multiplier over simulation, interpreted as an average exchange rate for coverage of the benchmark distribution. This is not a statement that real and simulated trajectories are interchangeable pointwise; it is a statement that, as an aggregate resource for reducing benchmark error, one unit of real experience can be worth multiple units of simulated experience.

The introduction of an effective-data exchange rate allows us to disentangle two decisions that are otherwise coupled: (i) how much data to acquire in total, and (ii) how to split that effective data between simulation and real-world collection given their costs. This separation is critical in robotics because the split is often the highest-leverage decision under fixed budget: a small amount of well-chosen real data can dominate large increments of simulation if the real data targets failure modes induced by model mismatch. Conversely, if the quality multiplier is modest or real collection is prohibitively expensive, it is rational to allocate almost entirely to simulation and accept the residual gap. Our contribution is to formalize this trade and to show that, under the assumed structure, the split reduces to a threshold comparison of effective cost per unit of coverage.

We also require that the prescription be usable in the common situation where scaling parameters are unknown. Robotics groups rarely have the resources to run exhaustive sweeps over model size and both data modalities. Instead, we advocate a pilot design that spans the feasible log-space with a modest budget, fits a shared-exponent model across tasks with random effects, and then allocates the remaining budget according to the implied compute-optimal solution. The intent is not to treat the scaling law as ground truth, but to treat it as a compact hypothesis that can be falsified and updated: when predictions fail, the remedy is additional pilot points targeted to the region of disagreement.

Finally, we delimit scope. Our focus is on training-time allocation for a fixed embodiment and sensor stack, and on downstream task error measured on a predefined benchmark family. We do not attempt to optimize inference-time latency, memory constraints, or on-device deployment costs, except insofar as they induce feasibility constraints on model size. We also do not claim universality of the exponents across all robot morphologies or learning paradigms; rather, we claim that within a coherent benchmark family and training recipe, the scaling model can be estimated and exploited to yield a principled allocation that improves over naive heuristics. The next section situates this viewpoint within prior work on scaling laws, compute-optimal training, and sim-to-real data pipelines.


2. Related Work: scaling laws (Kaplan, Chinchilla), robotics scaling meta-analyses, compute reporting, simulation-to-real and data engines, inference-vs-train compute notes (scope boundaries).

Empirical power-law relations between downstream error and training resources have been documented most prominently in language modeling, beginning with studies that fit error as a separable function of model size and data size over multi-order-of-magnitude sweeps. A key methodological lesson from this literature is that, once a sufficiently stable parametric form is available, one may treat training as a constrained optimization problem in which compute, parameters, and data are traded off to minimize error under a fixed budget. Subsequent analyses refined the compute-optimal prescription by emphasizing that, under fixed training compute, under-training on too little data can dominate returns from scaling parameters, leading to an optimal data–parameter frontier (the "Chinchilla" regime) that differs from earlier heuristic allocations. While the details of the exponents and constants depend on architecture, optimizer, and data curation, the allocation principle is robust: diminishing returns imply that marginal gains per unit cost should be balanced across the controllable resources.

The scaling-law viewpoint has also been pursued in other domains, including vision, multimodal learning, and reinforcement learning, with varying degrees of regularity and with additional confounders such as distribution shift and evaluation protocol changes. Two observations are relevant to our setting. First, scaling fits are most predictive when evaluated on a fixed benchmark distribution and a fixed training recipe; changing either can alter the effective exponents and, in extreme cases, the functional form. Second, even when a parametric form is not exact, it can serve as an actionable prior for experiment design: low-dimensional models enable adaptive sampling of resource configurations and provide a principled alternative to ad hoc ablations. Our work adopts this latter stance, treating the scaling law as a hypothesis to be fit, stress-tested, and used for allocation within the feasible budget regime.

In robotics, large-scale policies trained from heterogeneous data (e.g., multi-task behavior cloning from demonstrations, or hybrid imitation–reinforcement learning systems) have shown substantial gains from increasing dataset size and model capacity, and several systems papers have reported monotone improvements when scaling data engines and policy networks. However, the robotics evidence base differs from language in two structural ways. First, robotics evaluations are typically multi-task with significant heterogeneity across tasks; aggregating success rates can obscure task-dependent saturation and can induce apparent scaling even when only a subset of tasks improve. Second, robotics training pipelines vary widely in observation modalities, action representations, and learning objectives, making cross-paper comparisons of "data size" and "model size" less meaningful without a common accounting of compute and collection cost. Consequently, while there are compelling demonstrations that "more data and larger models help," there are fewer studies that execute systematic sweeps over (N, D) while holding the rest of the pipeline fixed, and fewer still that incorporate the cost of acquiring real-world interaction data as a first-class constraint.

A number of community efforts have emphasized the need for standardized robotics benchmarks and for clearer reporting of experimental details, including dataset composition, number of environment steps, and evaluation protocol. Yet even when environment steps or episodes are reported, the mapping from those counts to training cost is often opaque: the same number of transitions may correspond to vastly different compute depending on architecture, optimizer settings, sequence length, replay strategy, augmentation, and number of gradient updates per transition. Moreover, real-world collection cost is rarely comparable across labs, as it depends on fleet scale, autonomy tooling, human supervision, safety procedures, and wear. In the absence of consistent compute and cost reporting, it is difficult to infer whether observed gains are primarily due to larger models, more optimization steps, better data quality, or simply greater total investment. This motivates our explicit cost model, which converts training and data generation into a common budget unit and thereby makes allocation questions well-posed.

The broader machine learning community has argued for reporting training compute (e.g., FLOPs, accelerator-hours, or energy) to enable reproducibility and to quantify efficiency improvements. Robotics adds an additional axis: interaction data can be generated rather than merely curated, and the marginal cost of generation differs sharply between simulation and the real world. Simulation entails simulator compute, rendering, and engineering overhead, while real collection entails robot time, human teleoperation or oversight, maintenance, and opportunity cost. These costs can dominate even when training compute is moderate, or conversely be negligible relative to training compute when models and optimization are very large. Our accounting therefore treats the cost of training as proportional to N times the total number of episodes used in optimization (captured by κ), and adds linear per-episode costs cs and cr for simulation and real collection. This is not intended as a complete micro-economic model; rather, it is a minimal abstraction that renders the allocation problem explicit and allows sensitivity analysis with respect to the coefficients.

A central theme in robotics is the sim-to-real gap: policies trained heavily in simulation can fail in the real world due to unmodeled dynamics, contact phenomena, sensing artifacts, or distributional mismatch. Classical approaches include system identification, dynamics randomization, domain randomization, privileged learning, and residual adaptation; modern approaches increasingly combine simulation and real datasets, sometimes with fine-tuning or online adaptation. From the perspective of allocation, the key quantity is not whether simulation can eventually match reality, but the exchange rate between modalities as measured by downstream error reduction per unit cost. In many regimes, a small amount of real data can disproportionately reduce error by revealing failure modes absent in simulation, while in other regimes the real data may be too narrow or too expensive to justify large investment. Our model encodes this phenomenon via a real-data quality multiplier ρ, which summarizes the average effective contribution of one real episode relative to one simulated episode for a fixed benchmark distribution and training recipe. We emphasize that ρ is an aggregate parameter; it does not assert pointwise interchangeability, but it enables a tractable optimization that can be revisited as ρ is re-estimated under improved simulators or collection tooling.

Recent robotics systems increasingly treat data collection as an adaptive process driven by "data engines": procedural task generation, automatic curriculum construction, domain randomization schedules, shared autonomy interfaces, and active selection of scenarios for teleoperation or intervention. These mechanisms alter not only the quantity of data but its distribution and marginal cost, often in ways that are difficult to capture with a single scalar. Nonetheless, from a budget-allocation standpoint, one must ultimately decide how much to spend on each engine and when to switch modalities. Our formulation isolates the simplest decision boundary: given an intended effective data volume Deff and a fixed model size N, the cost-minimizing split between simulation and real collection reduces to a threshold rule determined by (cs, cr, κ, ρ). This provides a baseline that can be embedded into richer data-engine controllers, where ρ may itself be a function of collection policy and task mix.

Finally, we distinguish compute-optimal allocation from inference-time deployment constraints. In many robotics applications, inference latency, memory footprint, and on-device power impose an upper bound Nmax on feasible model size, and techniques such as distillation or quantization are used to reconcile large training models with small deployment models. These considerations are orthogonal to our main question, which is how to allocate a training budget across model size and data modalities to minimize benchmark error for a fixed recipe. We therefore treat inference constraints only as feasibility restrictions on N, and we do not attempt to optimize the training–inference trade directly. Within this scope, the prior work above motivates our approach: adopt a low-dimensional scaling model, fit it with limited sweeps, and use it to compute an allocation that is explicitly optimal under a stated cost model.


3. Problem Setup and Compute Accounting: define resources (N, Ds, Dr), success/error metrics, and total cost including training FLOPs and data-generation costs; justify cost units and conversion assumptions.

We study the following budgeted allocation problem: given a fixed training recipe (architecture family, optimizer, augmentation, rollout processing, and evaluation protocol), we choose three primary resources—model size and two data modalities—in order to minimize downstream error on a fixed benchmark distribution. Concretely, an allocation is a triple
(N, Ds, Dr),
where N denotes the number of trainable parameters of the policy (or more generally a scalar proxy for model capacity, treated as discrete in practice), Ds is the number of simulated training episodes or demonstrations available to the learner, and Dr is the number of real-world training episodes or demonstrations. We take Ds, Dr ∈ ℤ ≥ 0, with the understanding that the same formalism covers transition counts, trajectory segments, or other episode-like units provided the training and evaluation pipelines use a consistent notion of ``one unit of data.’’

Let T denote the benchmark task family used to evaluate generalization. In the simplest case, T is a single task and evaluation returns a success rate SR ∈ [0, 1] computed over a fixed number of trials. In the multi-task case, we index tasks by t and consider SRt per task; an aggregate score may be an unweighted average $\frac{1}{|T|}\sum_t \mathrm{SR}_t$ or a weighted average reflecting deployment priorities. Our optimization objective is expressed in terms of an error functional Err that is monotone decreasing in success rate. For definiteness, we may take
Err := 1 − SR,
or, if the community reports percentages, Err := (100 − SR)/100. All subsequent analysis is invariant to such affine rescalings, and we only assume Err ∈ [0, 1] after normalization.

Because robotics evaluations can exhibit substantial stochasticity (sensor noise, stochastic resets, randomized task parameters, and policy stochasticity), we distinguish the latent expected error from observed estimates. If an evaluation uses m trials per task, then conditional on a fixed trained policy the empirical success rate $\widehat{\mathrm{SR}}_t$ is approximately binomial, hence $\widehat{\mathrm{Err}}_t = 1-\widehat{\mathrm{SR}}_t$ has standard error on the order of $\sqrt{\mathrm{SR}_t(1-\mathrm{SR}_t)/m}$. In addition, training randomness (initialization, data shuffling, stochastic optimization) induces run-to-run variability; in our experimental design we therefore treat each training run as producing a noisy observation $\widehat{\mathrm{Err}}(N,D_s,D_r)$ of an underlying mean error Err(N, Ds, Dr) for the fixed recipe and task distribution.

The pair (Ds, Dr) describes the number of episodes collected in each modality. In many pipelines, the learner performs multiple epochs over a dataset, uses replay buffers, or performs off-policy updates that reuse transitions. We absorb such reuse into the training-compute coefficient introduced below; equivalently, one may interpret Ds and Dr as the number of episodes while the total number of gradient updates is controlled by fixed recipe hyperparameters. If, instead, the practitioner explicitly chooses the number of optimization steps per episode, then our accounting can be extended by allowing κ to depend on the step schedule; we keep the minimal abstraction to isolate the primary budget tradeoff.

We emphasize that the allocation (N, Ds, Dr) is not intended to capture all determinants of performance. It is a controlled decision space in which other factors (architecture shape, observation encoding, loss, regularization, environment randomization settings) are held fixed during the scaling sweeps. The point of the accounting is not to model all sources of variation but to enable a meaningful constrained optimization once a stable experimental protocol is fixed.

We convert heterogeneous expenditures (accelerator compute, simulator throughput, fleet time, and human labor) into a single scalar budget B measured in an arbitrary but fixed cost unit (e.g., dollars, GPU-hours multiplied by a monetary rate, or any internal accounting unit). The only requirement is that all cost coefficients be expressed in the same unit.

Our total cost decomposes into training compute and data-generation costs:
Cost(N, Ds, Dr) := κN(Ds + Dr) + csDs + crDr,
and the allocation must satisfy Cost(N, Ds, Dr) ≤ B. Here κ is a training-compute coefficient converting the product N(Ds + Dr) into cost, while cs and cr are per-episode data-generation costs in simulation and in the real world, respectively.

The term κN(Ds + Dr) should be read as a first-order model of training FLOPs. Under a fixed recipe, the work per episode scales approximately linearly with model size (forward/backward passes) and linearly with the number of episodes consumed by optimization. If sequence length varies, one may interpret Ds + Dr as the total number of fixed-length chunks; if the episode length distribution is stable across modalities, the distinction is inessential. More detailed accounting (e.g., attention quadratic costs in context length, or modality-specific encoders) can be incorporated by replacing N with a measured per-step FLOP estimate; we retain N as a simple proxy because it is the primary controllable axis in typical scaling sweeps.

The coefficients cs and cr capture marginal costs of obtaining data. For simulation, cs includes simulator compute, rendering, logging, storage, and any amortized engineering overhead attributable to generating an additional episode. For real-world collection, cr includes robot depreciation and maintenance attributable to use, operator or teleoperation time, safety supervision, lab overhead, and opportunity cost of tying up hardware. When data are harvested opportunistically (e.g., from an existing deployment), cr may be small; when collection requires dedicated teleoperation, cr can dominate all other terms. Our formulation is designed to make such regime changes explicit through the coefficients rather than through informal narrative.

The coefficients (κ, cs, cr) are inputs to the allocation problem and can be set either by direct accounting or by measurement. For example, κ can be estimated from a pilot training run by recording wall-clock time or accelerator-hours and dividing by N(Ds + Dr) for the fixed recipe; the resulting κ implicitly incorporates optimizer overhead, communication cost, and the chosen number of passes through the data. Similarly, cs can be estimated by measuring simulator throughput and cost per hour, and cr by measuring the marginal labor and robot time required to collect and validate one additional episode under the lab’s procedures.
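As a concrete illustration of this measurement-based accounting, the following sketch shows how κ might be backed out of a single pilot run and then used to check budget feasibility. It is a minimal example, not part of the protocol above; all numeric values, function names, and the per-hour rate are hypothetical.

```python
# Minimal sketch of the cost accounting above (all numbers are hypothetical).
# kappa, c_s, c_r must be expressed in the same budget unit (here: dollars).

def estimate_kappa(accel_hours: float, cost_per_hour: float,
                   n_params: float, episodes_used: float) -> float:
    """kappa ~= (measured training cost) / (N * (Ds + Dr)) from one pilot run."""
    return accel_hours * cost_per_hour / (n_params * episodes_used)

def total_cost(n_params: float, d_sim: float, d_real: float,
               kappa: float, c_s: float, c_r: float) -> float:
    """Cost(N, Ds, Dr) = kappa*N*(Ds + Dr) + c_s*Ds + c_r*Dr."""
    return kappa * n_params * (d_sim + d_real) + c_s * d_sim + c_r * d_real

kappa = estimate_kappa(accel_hours=120.0, cost_per_hour=2.0,
                       n_params=50e6, episodes_used=20_000)
c_s, c_r = 0.02, 5.00                      # per-episode sim / real costs (hypothetical)
budget = 50_000.0
feasible = total_cost(100e6, 200_000, 2_000, kappa, c_s, c_r) <= budget
```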

We do not claim that this cost model is a complete micro-economic model. It is a minimal abstraction that (i) is linear in the controllable quantities, (ii) makes tradeoffs between training compute and data collection explicit, and (iii) supports sensitivity analysis: one may vary (κ, cs, cr) to understand which axis is limiting in a given operational setting.

In practice, N is restricted to a finite grid 𝒩 determined by architectural choices and engineering constraints, and the data volumes are integers. We therefore regard the allocation problem as a mixed discrete optimization. For analysis and for deriving prescriptions, it is convenient to consider a continuous relaxation in which N > 0 and Ds, Dr ≥ 0 are real-valued; one may then round to feasible values and verify that the rounded allocation respects the budget constraint. Additional feasibility constraints may be imposed without changing the accounting structure, e.g., an upper bound N ≤ Nmax due to inference-time memory or latency, or a wall-clock bound that effectively limits κ or the maximal N(Ds + Dr) achievable within a deadline. The remainder of the paper treats the budget constraint Cost(N, Ds, Dr) ≤ B as the binding constraint and analyzes how to allocate B across (N, Ds, Dr) once a predictive model of Err(N, Ds, Dr) is specified.


4. Joint Scaling Model: postulate/fit Err(N, Ds, Dr) with effective data Deff; discuss identifiability, offsets, and when power laws fail (broken power laws, saturation).

We now posit a parametric model for the latent mean error Err(N, Ds, Dr) and describe how it is fit from noisy training-and-evaluation runs. The role of this section is not to argue that power laws are universally valid, but to define a tractable hypothesis class that (i) is expressive enough to capture the dominant empirical trends in controlled scaling sweeps, and (ii) renders the downstream allocation problem algorithmically solvable. The subsequent allocation theory in Section 5 should be read as conditional on the adequacy of the present model.

We model the dependence on data modality through a single effective data volume
Deff := Ds + ρDr,
where ρ ≥ 1 is a real-data quality multiplier. The intended meaning is that, at fixed recipe and task distribution, one additional real episode yields the same reduction in generalization error as ρ additional simulated episodes, after averaging over the benchmark distribution. We then postulate the joint scaling form
$$ \mathrm{Err}(N, D_s, D_r) \;=\; a\,D_{\mathrm{eff}}^{-\alpha} + b\,N^{-\beta} + E, $$
with constants a, b > 0, offset E ∈ [0, 1), and exponents α, β ∈ (0, 1). The additive structure in this form is a deliberate simplification: it asserts approximate separability between the effect of more data (after modality aggregation) and the effect of larger models. Empirically, such separability is often accurate over a moderate range of resources and is sufficient to derive actionable prescriptions; we treat deviations as model mismatch to be diagnosed, rather than as a priori disproof.

The offset E plays two roles. First, it captures irreducible error due to partial observability, actuator limits, benchmark stochasticity, or recipe misspecification. Second, it prevents the model from spuriously forcing Err → 0 as N, Deff → ∞ within a range where the benchmark saturates. For identifiability and numerical stability we constrain E away from 1 and treat Err ∈ [0, 1] as in Section 3.

Each scaling point (Ni, Ds,i, Dr,i) yields an observed error estimate $\widehat{\mathrm{Err}}_i$ obtained by evaluating a trained policy on a finite number of trials (and typically over multiple random seeds). Because success is binary at the trial level, it is natural to model $\widehat{\mathrm{SR}}_{t,i}$ as binomial (or beta-binomial to account for overdispersion), hence $\widehat{\mathrm{Err}}_{t,i}=1-\widehat{\mathrm{SR}}_{t,i}$ is noisy even when training randomness is absent. We therefore fit under a heteroscedastic noise model, weighting points by their estimated standard errors when using approximate likelihoods.

In the multi-task setting we allow task-specific constants while sharing exponents and ρ. Concretely, for tasks t ∈ T we write
$$ \mathrm{Err}_t(N, D_s, D_r) \;=\; a_t\,D_{\mathrm{eff}}^{-\alpha} + b_t\,N^{-\beta} + E_t, $$
with hierarchical priors on (at, bt, Et) and shared (α, β, ρ). This pooling is not cosmetic: it reduces variance in estimating α, β, ρ by leveraging that slopes in log-space are often approximately invariant across tasks within a benchmark family, whereas vertical shifts vary substantially due to intrinsic difficulty.

The parametrization (a, α, ρ) is only partially identifiable without targeted variation in (Ds, Dr). Indeed, if all training runs satisfy a fixed ratio Dr/Ds = λ, then Deff = (1 + ρλ)Ds and the data term becomes $a(1+\rho\lambda)^{-\alpha} D_s^{-\alpha}$; only the product $a(1+\rho\lambda)^{-\alpha}$ is identified, not ρ itself. Thus, to estimate ρ we must include runs that vary Dr at (approximately) fixed Ds, or vary Ds at fixed Dr, so that the model observes differential returns to the two modalities.

Similarly, E is weakly identified unless the sweep includes points near saturation. If all observed errors are far from the floor, then E trades off against a (and against b if N is small), producing broad posterior uncertainty. For this reason, in pilot sweeps we prefer to include at least one high-resource point (large N and large Deff) to anchor E, even if that point is not itself cost-effective. Conversely, if evaluations are so noisy that several points appear to outperform the plausible floor, unconstrained fits may drive E < 0; we therefore impose E ≥ 0 and treat residual optimism as noise.

The separation between N and Deff also requires that sweeps vary both axes. If, for example, one only scales N while holding (Ds, Dr) fixed, then the data term is a constant and α is unidentifiable; likewise, only scaling data leaves β unidentifiable. We therefore interpret the scaling form as a model whose parameters are meaningful only when estimated from a factorial (or otherwise sufficiently rich) design.

When E is negligible over the observed range, the model implies approximately linear relationships on log-log plots:
log Err ≈ log a − α log Deff   (at large N),   log Err ≈ log b − β log N   (at large Deff).
However, once E is non-negligible, naive log transforms are biased because log (Err − E) is not observed. We therefore fit in the original error domain (or via a likelihood on success counts), while using log-log plots only diagnostically. In practice, we regularize E toward small values unless saturation is clearly supported, since otherwise E may absorb variance and flatten estimated slopes.
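The fitting strategy described above can be sketched as a pooled (single-task) weighted nonlinear least-squares problem in the error domain. The snippet below assumes SciPy is available and uses simple reparameterizations to keep E in a small range and ρ ≥ 1; the hierarchical multi-task version would replace the pooled constants with task-level random effects. Function names, bounds, and initial values are illustrative, not prescribed by the text.

```python
# Pooled weighted fit of Err = a*Deff^(-alpha) + b*N^(-beta) + E in the error domain.
import numpy as np
from scipy.optimize import least_squares

def fit_scaling(N, Ds, Dr, err_hat, err_se):
    """N, Ds, Dr, err_hat, err_se: 1-D arrays, one entry per scaling point."""
    def residuals(params):
        log_a, log_b, raw_E, alpha, beta, log_rm1 = params
        a, b = np.exp(log_a), np.exp(log_b)
        E = 0.5 / (1.0 + np.exp(-raw_E))          # keeps E in (0, 0.5)
        rho = 1.0 + np.exp(log_rm1)               # keeps rho > 1
        deff = Ds + rho * Dr
        pred = a * deff ** (-alpha) + b * N ** (-beta) + E
        return (pred - err_hat) / err_se          # heteroscedastic weighting
    x0 = np.array([0.0, 0.0, -3.0, 0.3, 0.3, 0.0])
    sol = least_squares(residuals, x0,
                        bounds=([-10, -10, -10, 0.01, 0.01, -5],
                                [ 10,  10,  10, 0.99, 0.99,  5]))
    return sol.x
```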

We emphasize that the scaling form is expected to hold only over a regime where the training recipe is stable and the benchmark distribution does not induce qualitative phase changes. Two common violations are as follows.

First, broken power laws: the effective exponent may change once the data distribution shifts (e.g., the simulator domain randomization becomes sufficiently broad) or once the model crosses a capacity threshold that enables qualitatively new behaviors (e.g., long-horizon credit assignment begins to succeed). A parsimonious extension is a piecewise model
$$ a_1\,D_{\mathrm{eff}}^{-\alpha_1}\,\mathbf{1}\{D_{\mathrm{eff}} \le D_0\} \;+\; a_2\,D_{\mathrm{eff}}^{-\alpha_2}\,\mathbf{1}\{D_{\mathrm{eff}} > D_0\}, $$
with continuity at D0, and similarly for N. We do not adopt this as the default because it complicates identifiability and can overfit sparse sweeps; rather, we use posterior predictive checks to detect systematic curvature in residuals versus log Deff or log N.

Second, saturation: success rates may saturate due to deterministic benchmark structure or evaluation artifacts, in which case the apparent α and β shrink toward 0 at high resources. The offset E partially models this, but if saturation occurs sharply the additive floor is insufficient. In such cases we treat the model as locally valid below the ceiling and restrict allocation recommendations to budgets that remain in that regime.

The end product of this section is an estimated parameter vector θ̂ = (â, b̂, Ê, α̂, β̂, ρ̂), or more usefully a posterior over θ under the hierarchical model. In Section 5 we will treat these parameters as defining a predictive objective Err(N, Ds, Dr) and derive compute-optimal allocations under the budget constraint, with explicit sensitivity to uncertainty in (α, β, ρ).


5. Compute-Optimal Allocation Theory: derive KKT conditions; give closed forms in special cases; prove convexity after log-transform in the relaxed problem; give rounding/feasibility procedures for discrete budgets.

We now study the budgeted allocation problem induced by the scaling model of Section 4 under the cost model of Section 3 (H2). Our goal is to characterize, and in suitable regimes explicitly compute, an allocation (N, Ds, Dr) with Cost(N, Ds, Dr) ≤ B that minimizes Err(N, Ds, Dr). Throughout this section we treat the scaling parameters as fixed; uncertainty and its propagation are deferred to the subsequent robustness analysis.

The discrete problem is
$$ \min_{N \in \mathcal{N},\; D_s, D_r \in \mathbb{Z}_{\ge 0}} \; a\,(D_s + \rho D_r)^{-\alpha} + b\,N^{-\beta} + E \quad \text{s.t.} \quad \kappa N (D_s + D_r) + c_s D_s + c_r D_r \;\le\; B. $$
Since the objective depends on data only through Deff := Ds + ρDr, it is natural to separate (i) choosing the pair (N, Deff) from (ii) choosing the cheapest (Ds, Dr) that achieves Deff at the selected N. Fix N and Deff. Writing Ds = Deff − ρDr with Dr ∈ [0, Deff/ρ], the cost becomes an affine function of Dr:
Cost = κN(Deff − (ρ − 1)Dr) + cs(Deff − ρDr) + crDr = (κN + cs)Deff + (cr − ρcs − (ρ − 1)κN)Dr.
Hence the minimum-cost split occurs at an endpoint. Concretely, the cheapest way to purchase one unit of effective data is
$$ \tilde c(N)\;:=\;\min\Bigl\{\kappa N+c_s,\;\frac{\kappa N+c_r}{\rho}\Bigr\}, $$
corresponding respectively to all-sim (Dr = 0) or all-real (Ds = 0). We therefore reduce the continuous relaxation to the two-variable program
$$ \min_{N > 0,\; D_{\mathrm{eff}} \ge 0} \; a\,D_{\mathrm{eff}}^{-\alpha} + b\,N^{-\beta} + E \quad \text{s.t.} \quad \tilde c(N)\,D_{\mathrm{eff}} \;\le\; B, $$
and then recover (Ds, Dr) by the threshold rule implied by the minimizing branch of $\tilde c(N)$. The boundary case (κN + cr)/ρ = κN + cs admits any convex combination achieving Deff at equal cost; in discrete settings we may then choose the split that best matches operational constraints (e.g. minimum required real coverage).
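The endpoint argument reduces the modality decision to a one-line comparison. A minimal sketch (function name illustrative):

```python
def effective_unit_cost(N, kappa, c_s, c_r, rho):
    """c~(N): cheapest cost of one unit of effective data at model size N,
    plus the implied all-sim / all-real endpoint."""
    cost_sim = kappa * N + c_s                # one sim episode buys 1 unit of Deff
    cost_real = (kappa * N + c_r) / rho       # one real episode buys rho units of Deff
    return (cost_sim, "sim") if cost_sim <= cost_real else (cost_real, "real")
```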

Consider the interior optimum of the relaxed program where the budget constraint is tight. Let λ > 0 denote the KKT multiplier for the constraint $\tilde c(N)\,D_{\mathrm{eff}} \le B$. The Lagrangian is
$$ \mathcal{L}(N, D_{\mathrm{eff}}, \lambda) \;=\; a\,D_{\mathrm{eff}}^{-\alpha} + b\,N^{-\beta} + E + \lambda\bigl(\tilde c(N)\,D_{\mathrm{eff}} - B\bigr). $$
Stationarity yields
$$ -\alpha\,a\,D_{\mathrm{eff}}^{-\alpha-1} + \lambda\,\tilde c(N) \;=\; 0, \qquad -\beta\,b\,N^{-\beta-1} + \lambda\,\tilde c'(N)\,D_{\mathrm{eff}} \;=\; 0, $$
together with feasibility and complementary slackness $\lambda(\tilde c(N)\,D_{\mathrm{eff}} - B) = 0$. In regimes where $\tilde c$ is differentiable (or piecewise differentiable with the optimum away from the kink), these conditions characterize the unique continuous optimum.

The simplest closed form occurs when data-generation costs are negligible, cs = cr = 0, and we ignore the distinction between D and Deff at the level of constants (e.g. when ρ = 1 or when we directly control effective coverage). Then $\tilde c(N) = \kappa N$ and the constraint reads κNDeff ≤ B. Solving the stationarity conditions with Deff = B/(κN) yields the familiar "Chinchilla" balance:
$$ D_{\mathrm{eff}}^* \;\propto\; (N^*)^{\beta/\alpha}, \qquad N^* \;\propto\; B^{\alpha/(\alpha+\beta)}, \qquad D_{\mathrm{eff}}^* \;\propto\; B^{\beta/(\alpha+\beta)}, $$
and the excess error above the floor scales as $\mathrm{Err}^*(B) - E = \Theta\bigl(B^{-\alpha\beta/(\alpha+\beta)}\bigr)$. The key interpretation is that at the compute-optimal point, marginal returns per unit budget from scaling data and scaling model size are equalized; the power-law exponents determine the allocation ratio.
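For concreteness, the closed form can be evaluated numerically; the constants below (a, b, α, β, κ, B) are hypothetical and serve only to show the mechanics of the balance condition.

```python
# Worked numeric example of the compute-only closed form (all constants hypothetical).
alpha, beta = 0.30, 0.40
a, b = 2.0, 5.0
kappa, B = 1e-9, 1e6                       # budget units; c_s = c_r = 0 in this regime
# Balancing marginal returns under kappa*N*Deff = B gives
# N* = [ (beta*b)/(alpha*a) * (B/kappa)^alpha ]^(1/(alpha+beta)).
N_star = ((beta * b) / (alpha * a) * (B / kappa) ** alpha) ** (1.0 / (alpha + beta))
D_star = B / (kappa * N_star)              # Deff* = B / (kappa * N*)
excess_err = a * D_star ** (-alpha) + b * N_star ** (-beta)
```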

Two further limiting regimes are immediate. If $b\,N^{-\beta}$ is already negligible over feasible N ∈ 𝒩, then the optimum places essentially all budget into effective data: $D_{\mathrm{eff}}^* \approx B/\tilde c(N_{\max})$ at the largest deployable N. Conversely, if data is plentiful and the data term is small, the objective is dominated by $b\,N^{-\beta}$ and the solution pushes N upward subject to the compute term in $\tilde c(N)$; in that case the optimal Deff is the minimal amount required by feasibility constraints (if any) or by stability of training.

Although the original variables have a non-linear constraint, the relaxed program is well behaved after a standard change of coordinates. Let x = log N and y = log Deff. The objective becomes
$$ f(x, y) \;=\; a\,e^{-\alpha y} + b\,e^{-\beta x} + E, $$
which is convex in (x, y) since it is a nonnegative weighted sum of convex exponentials plus a constant. The constraint becomes
$$ y + \log \tilde c(e^x) \;\le\; \log B. $$
When $\tilde c(N)$ is of the form $\min\{u_1 N + v_1,\, u_2 N + v_2\}$ with $u_i, v_i > 0$, the function $\log \tilde c(e^x)$ is the pointwise minimum of two convex functions $\log(u_i e^x + v_i)$; while a pointwise minimum of convex functions need not be convex globally, the feasible set remains a union of two convex sets corresponding to the two modalities (all-sim or all-real). Consequently we may solve two convex programs,
$$ \min f(x,y)\ \ \text{s.t.}\ \ y+\log(\kappa e^x+c_s)\le\log B \quad\text{and}\quad \min f(x,y)\ \ \text{s.t.}\ \ y+\log\Bigl(\tfrac{\kappa e^x+c_r}{\rho}\Bigr)\le\log B, $$
and then take the better solution. This recovers the same threshold rule as the endpoint argument above, while ensuring polynomial-time solvability of each branch by standard interior-point methods.
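A direct implementation of this two-branch strategy is sketched below using a generic constrained solver. It assumes SciPy's SLSQP; the starting point and the way the all-real branch is passed in (by rescaling κ and cr by 1/ρ) are implementation choices for illustration, not requirements of the theory.

```python
# Solve the relaxed allocation on one modality branch in log-coordinates
# (x = log N, y = log Deff), then keep the better of the two branches.
import numpy as np
from scipy.optimize import minimize

def solve_branch(a, b, E, alpha, beta, kappa_eff, c_eff, B, x0=(15.0, 12.0)):
    """min a*exp(-alpha*y) + b*exp(-beta*x) + E
       s.t. y + log(kappa_eff*e^x + c_eff) <= log B.
    All-sim branch:  kappa_eff = kappa,      c_eff = c_s.
    All-real branch: kappa_eff = kappa/rho,  c_eff = c_r/rho."""
    objective = lambda z: a * np.exp(-alpha * z[1]) + b * np.exp(-beta * z[0]) + E
    budget_gap = lambda z: np.log(B) - z[1] - np.log(kappa_eff * np.exp(z[0]) + c_eff)
    res = minimize(objective, x0=np.array(x0), method="SLSQP",
                   constraints=[{"type": "ineq", "fun": budget_gap}])
    return res.fun, np.exp(res.x[0]), np.exp(res.x[1])     # (error, N, Deff)

def solve_allocation(a, b, E, alpha, beta, kappa, c_s, c_r, rho, B):
    sim = solve_branch(a, b, E, alpha, beta, kappa, c_s, B)
    real = solve_branch(a, b, E, alpha, beta, kappa / rho, c_r / rho, B)
    return min(sim, real)          # lower predicted error wins
```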

We finally return to discrete feasibility: N ∈ 𝒩 and Ds, Dr ∈ ℤ ≥ 0. We implement rounding as a projection that preserves the budget constraint.

First, we compute a continuous candidate (N, Deff) by solving the appropriate convex branch(es). Second, we choose the discrete model size by rounding the continuous N to nearby grid points in 𝒩 (typically the two nearest values), and for each candidate N ∈ 𝒩 we set the largest feasible effective data
$$ D_{\mathrm{eff}}(N)\;=\;\frac{B}{\tilde c(N)}, $$
or its floored integer counterpart after converting to (Ds, Dr) as below. Third, given (N, Deff) we choose the cost-minimizing modality endpoint: if (κN + cr)/ρ ≤ κN + cs we set Ds = 0 and Dr = ⌊Deff/ρ⌋; otherwise we set Dr = 0 and Ds = ⌊Deff⌋. Fourth, because flooring can create slack budget, we optionally spend remaining budget by incrementing Ds or Dr greedily according to the currently cheaper effective cost per unit, while maintaining Cost ≤ B. This procedure maintains invariants: feasibility is preserved at every step, and objective degradation relative to the continuous optimum vanishes as the grids in 𝒩 and episode counts become fine compared to the scale of the optimum.
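A compact sketch of the rounding-and-projection step, omitting the optional greedy top-up of slack budget; variable names are illustrative.

```python
import math

def round_allocation(N_cont, grid_N, a, b, E, alpha, beta, kappa, c_s, c_r, rho, B):
    """Project a continuous candidate N onto the grid and pick the cheaper modality
    endpoint, never exceeding the budget B.  (Greedy top-up of slack is omitted.)"""
    best = None
    nearby = sorted(grid_N, key=lambda n: abs(math.log(n) - math.log(N_cont)))[:2]
    for N in nearby:
        if (kappa * N + c_r) / rho <= kappa * N + c_s:     # all-real endpoint
            Dr, Ds = math.floor(B / (kappa * N + c_r)), 0
        else:                                              # all-sim endpoint
            Ds, Dr = math.floor(B / (kappa * N + c_s)), 0
        deff = max(Ds + rho * Dr, 1e-9)                    # guard against empty data
        err = a * deff ** (-alpha) + b * N ** (-beta) + E
        if best is None or err < best[0]:
            best = (err, N, Ds, Dr)
    return best                                            # (predicted error, N, Ds, Dr)
```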

In summary, the scaling-law structure reduces allocation to a small convex optimization plus a one-dimensional modality decision, and the discrete implementation amounts to rounding followed by a budget-respecting projection.


6. Robustness to Estimation Error: quantify how errors in α̂, β̂, ρ̂ propagate into allocation and final error; provide stability bounds and confidence intervals.

We now quantify how estimation errors in θ̂ = (â, b̂, Ê, α̂, β̂, ρ̂) propagate through the allocation map θ ↦ (N*, Ds*, Dr*) and into the achieved downstream error. The guiding point is that our optimizer is a smooth (indeed, piecewise-smooth) transformation of θ once we work in log-coordinates and stay away from degeneracies (active-set changes and boundary optima). This permits standard sensitivity analysis, yielding stability bounds of the form stated in Theorem 4.

Consider first the relaxed two-variable program in (N, Deff) on a fixed modality branch (all-sim or all-real), where $\tilde c(N)$ is replaced by an explicit affine function, say $\tilde c_s(N) = \kappa N + c_s$ or $\tilde c_r(N) = (\kappa N + c_r)/\rho$. With x = log N, y = log Deff, we may write the constrained problem as
$$ \min_{x,\,y}\; a\,e^{-\alpha y} + b\,e^{-\beta x} + E \quad \text{s.t.} \quad y + \log \tilde c_{\bullet}(e^x) \;\le\; \log B, \qquad \bullet \in \{s, r\}. $$
On each branch, the feasible set is convex and the objective is convex in (x, y), so the optimizer (x*(θ), y*(θ)) is well defined whenever the optimum is unique. Moreover, if the optimum is interior to the branch (i.e. the constraint is active and we are not at a kink where the preferred modality changes), then the KKT system is differentiable in θ. Denoting by F(x, y, λ; θ) = 0 the stationarity equations plus complementary slackness with active constraint, the implicit function theorem yields a local Lipschitz dependence of (x*, y*, λ*) on θ, with Lipschitz constant controlled by the inverse Jacobian ∂F/∂(x, y, λ). In particular, away from boundary regimes where either the data term or the model term vanishes, the Hessian of the Lagrangian is well conditioned in log-coordinates, and we obtain
$$ \bigl\|\bigl(x^*(\hat\theta),\,y^*(\hat\theta)\bigr) - \bigl(x^*(\theta),\,y^*(\theta)\bigr)\bigr\| \;\le\; L(\theta, B)\,\bigl\|\hat\theta - \theta\bigr\|, $$
for some L(θ, B) that grows at most polylogarithmically in B in the regimes of interest (precisely because x*, y* themselves scale like Θ(log B)).

To make the dependence transparent, consider the compute-dominated special case cs = cr = 0 and treat Deff as directly purchasable at cost κNDeff ≤ B. The closed-form relations are
$$ \log N^* \;=\; \frac{\alpha}{\alpha+\beta}\log B + O(1),\qquad \log D_{\mathrm{eff}}^* \;=\; \frac{\beta}{\alpha+\beta}\log B + O(1). $$
Differentiating with respect to α, β shows that a perturbation |α̂ − α| ≤ ε, |β̂ − β| ≤ ε induces an O(ε log B) perturbation in log N* and log Deff*, hence a multiplicative factor $\exp(O(\varepsilon \log B)) = B^{O(\varepsilon)}$ in N* and Deff*. Substituting into the power law yields the corresponding error inflation:
$$ \mathrm{Err}(\hat N^*, \hat D_{\mathrm{eff}}^*) - E \;\le\; B^{O(\varepsilon)}\,\bigl(\mathrm{Err}(N^*, D_{\mathrm{eff}}^*) - E\bigr) \;=\; \bigl(1 + O(\varepsilon \log B)\bigr)\,\bigl(\mathrm{Err}(N^*, D_{\mathrm{eff}}^*) - E\bigr), $$
up to lower-order additive effects from errors in â, b̂, Ê. This recovers the qualitative content of Theorem 4: while the optimizer depends on exponents through log B, the induced degradation in achieved error is only linear in ε log B for small ε.
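A small numeric check of this propagation effect, using the compute-only closed form and hypothetical constants, makes the $B^{O(\varepsilon)}$ statement tangible.

```python
# Numeric illustration of exponent-error propagation (all constants hypothetical).
def optimum(a, b, alpha, beta, kappa, B):
    N = ((beta * b) / (alpha * a) * (B / kappa) ** alpha) ** (1.0 / (alpha + beta))
    return N, B / (kappa * N)

def excess(a, b, alpha, beta, N, D):
    return a * D ** (-alpha) + b * N ** (-beta)

a, b, alpha, beta, kappa, B, eps = 2.0, 5.0, 0.30, 0.40, 1e-9, 1e8, 0.05
N_true, D_true = optimum(a, b, alpha, beta, kappa, B)
N_hat, D_hat = optimum(a, b, alpha + eps, beta, kappa, B)   # allocate with a wrong alpha
inflation = excess(a, b, alpha, beta, N_hat, D_hat) / excess(a, b, alpha, beta, N_true, D_true)
# inflation stays close to 1 (roughly 1 + O(eps*log B)) even though N_hat/N_true ~ B^{O(eps)}.
```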

The dependence on ρ is structurally different: ρ affects only the conversion between Dr and Deff, and it affects the effective cost of real data through (κN + cr)/ρ. The endpoint rule implies that the modality decision changes only when the sign of
$$ g(N;\rho)\;:=\; \frac{\kappa N+c_r}{\rho}-(\kappa N+c_s) $$
changes. Thus, if at the true optimum (N*, Deff*) we have a margin |g(N*; ρ)| ≥ m > 0, then any ρ̂ satisfying |ρ̂ − ρ| ≤ η with η ≲ mρ²/(κN* + cr) preserves the modality choice, and the only effect of ρ̂ is a smooth rescaling of the realized (Ds, Dr) after the split. Conversely, when g(N*; ρ) ≈ 0, the two modalities are nearly cost-equivalent; in that boundary case, even if a small estimation error flips the decision, the cost penalty is second-order (because the two branches coincide to first order). This is precisely the regime in which we may safely incorporate operational constraints (e.g. minimum real-world coverage) without materially affecting optimality.

In practice we do not only have deterministic bounds on θ̂ − θ, but a posterior (or approximate sampling distribution) from the regression stage. We therefore report uncertainty in two layers: (i) uncertainty in predicted error at a fixed allocation, and (ii) uncertainty induced by optimizing under uncertain parameters.

For (i), conditional on an allocation (N, Ds, Dr), the mapping θ ↦ Err(N, Ds, Dr) is smooth, and a delta-method approximation gives
Var[Err(N, Ds, Dr) ∣ data] ≈ ∇θErr⊤ Cov(θ) ∇θErr,
where θErr is evaluated at a posterior mean (or MAP). This yields a simple approximate (1 − δ)-interval
$$ \widehat{\mathrm{Err}}\pm z_{1-\delta/2}\sqrt{\widehat{\mathrm{Var}}(\mathrm{Err})}, $$
with the understanding that heavy-tailed posteriors for exponents are better handled by posterior sampling.

For (ii), we generate posterior draws θ(m), compute the corresponding optimizer (N*(m), Ds*(m), Dr*(m)), and evaluate either the plug-in error Err(N*(m), Ds*(m), Dr*(m); θ(m)) or, more conservatively, Err(N*(m), Ds*(m), Dr*(m); θtrue) approximated by held-out evaluations. The resulting empirical quantiles provide credible intervals for achievable performance under allocation uncertainty. When one desires a one-shot robust decision, we may instead choose the allocation minimizing a risk-averse criterion such as the posterior (1 − δ)-quantile of Err, which directly trades mean performance against robustness to exponent misestimation.
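A minimal sketch of this posterior-propagation recipe, using the compute-only closed form as the allocation map and synthetic draws in place of a real posterior; all numeric values are hypothetical.

```python
# Propagate posterior uncertainty into allocation uncertainty via posterior draws.
import numpy as np

def allocate(theta, kappa, B):
    a, b, E, alpha, beta, rho = theta            # compute-only closed form; rho unused here
    N = ((beta * b) / (alpha * a) * (B / kappa) ** alpha) ** (1.0 / (alpha + beta))
    return N, B / (kappa * N)

def plug_in_err(theta, N, D):
    a, b, E, alpha, beta, rho = theta
    return a * D ** (-alpha) + b * N ** (-beta) + E

rng = np.random.default_rng(0)
post_mean = np.array([2.0, 5.0, 0.05, 0.30, 0.40, 3.0])    # (a, b, E, alpha, beta, rho)
draws = post_mean + 0.01 * rng.standard_normal((200, 6))   # stand-in for real posterior samples
kappa, B = 1e-9, 1e8
errs = [plug_in_err(th, *allocate(th, kappa, B)) for th in draws]
lo, hi = np.quantile(errs, [0.05, 0.95])                    # credible band for achievable error
```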

Combining the local Lipschitz property with the smoothness of Err in (x, y) yields the following practical consequence. Suppose |α̂ − α|, |β̂ − β|, and the relative errors in â, b̂, ρ̂ are all at most ε, and assume we are away from degeneracies (unique interior optimum on one branch, or near-equality of branches). Then, for sufficiently small ε, the allocation computed under θ̂ satisfies
Err(N̂, D̂s, D̂r) ≤ (1 + O(ε log B)) Err(N*, Ds*, Dr*) + O(ε),
where the O(ε) term accounts for additive floors and constant-factor misspecification. This justifies allocating only a modest fraction of budget to exponent estimation: once ε is driven to the point where εlog B ≪ 1, further improvements in exponent accuracy have diminishing returns compared to simply spending on N and Deff. The remaining question is how large a pilot sweep is required to reach this regime, which we address next via sample complexity and experiment design.


7. Sample Complexity and Experiment Design: show how many scaling points/tasks are needed to estimate exponents to tolerance; propose a sequential design strategy for efficient sweeps.

We turn to the question left implicit so far: given budget B, how much of it must be spent on pilot sweeps in order to estimate the shared exponents (α, β) and the real-data multiplier ρ to a tolerance that makes the downstream allocation reliable. The central object is the statistical efficiency of our scaling experiment design, since θ̂ is obtained from finitely many training-and-evaluation runs on a benchmark family.

A single scaling point consists of training at (N, Ds, Dr) and evaluating on tasks t ∈ {1, …, T} using M evaluation episodes per task (or per task-seed pair). Writing pt(N, Ds, Dr) = 1 − Errt(N, Ds, Dr) for the success probability, a standard model is
$$ S_{t}\;\sim\;\mathrm{Binomial}(M,p_t), \qquad \widehat{\mathrm{Err}}_t\;=\;1-\frac{S_t}{M}, $$
possibly augmented with an additional task/seed dispersion term. In any case, the conditional variance satisfies $\mathrm{Var}(\widehat{\mathrm{Err}}_t\mid p_t)\le 1/(4M)$. When we aggregate across T tasks via a hierarchical model with task random effects (at, bt, Et) and shared (α, β, ρ), the effective noise for estimating the shared exponents decreases roughly like 1/(TM) (up to a multiplicative factor reflecting task heterogeneity). This immediately yields a design principle: if training runs are expensive, it is often cheaper to increase T and M (evaluation) until evaluation cost is negligible, thereby reducing posterior uncertainty in (α, β, ρ) without additional training cost.

Although our error model is additive in two power laws plus a floor,
$$ \mathrm{Err} \;=\; a\,D_{\mathrm{eff}}^{-\alpha} + b\,N^{-\beta} + E, $$
local sample complexity can be read off from a first-order (Fisher-information) approximation. To make this explicit, suppose for the moment that we operate in a regime where E is known or negligible relative to the non-floor error, and we consider a branch where Deff is directly parameterized (all-sim or all-real). If we hold N fixed and sweep Deff over a log-range RD := log (Dmax/Dmin), then locally ∂Err/∂α scales in magnitude like $a\,D_{\mathrm{eff}}^{-\alpha}\log D_{\mathrm{eff}}$, hence the information about α grows with the dispersion of log Deff. A crude but useful proxy is the standard linear-regression formula: for KD distinct data-scale points with homoscedastic error variance σ²,
$$ \mathrm{Var}(\hat\alpha) \;\approx\; \frac{\sigma^2}{\sum_{i=1}^{K_D}\bigl(\log D_{\mathrm{eff},i} - \overline{\log D_{\mathrm{eff}}}\bigr)^2} \;\approx\; \frac{12\,\sigma^2}{K_D\,R_D^2}, $$
where the second approximation assumes points roughly uniformly spaced in log Deff. An analogous estimate holds for β with RN := log (Nmax/Nmin) and KN distinct model-scale points. Thus, for a target sd(α̂) ≤ ε we require
$$ K_D\;\gtrsim\;\frac{12\sigma^2}{\varepsilon^2 R_D^2}, \qquad K_N\;\gtrsim\;\frac{12\sigma^2}{\varepsilon^2 R_N^2}. $$
While this proxy ignores the additive two-term structure, it correctly captures the three levers we control: (i) increase the number of scaling points K, (ii) increase the log-range of the sweep, and (iii) reduce σ² by more tasks and more evaluation episodes.
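As a worked example of the pilot-size rule, with a hypothetical noise level and sweep range:

```python
# Worked example of the pilot-size rule (hypothetical noise level and sweep range).
import math

sigma = 0.05                         # per-point error s.d. after averaging tasks and trials
eps = 0.02                           # target standard deviation for alpha-hat
R_D = math.log(1e6 / 1e4)            # a two-decade sweep of effective data, R_D ~ 4.6
K_D = math.ceil(12 * sigma ** 2 / (eps ** 2 * R_D ** 2))    # ~ 4 data-scale points
```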

The additive structure implies an additional, practically important constraint: α and β are poorly identified if one term dominates the other across the sweep. Concretely, if $a\,D_{\mathrm{eff}}^{-\alpha} \gg b\,N^{-\beta}$ for all chosen points, then the likelihood is nearly invariant to β, and any estimate of β will be driven by prior assumptions rather than data. Therefore, we should ensure that the sweep includes a neighborhood of the "balance curve"
$$ a\,D_{\mathrm{eff}}^{-\alpha} \;=\; b\,N^{-\beta}, $$
since there the gradients with respect to both α and β have comparable magnitude, yielding high joint information. Operationally, we do not know α, β a priori, but even a coarse initial sweep allows us to locate the approximate intersection region, after which subsequent points can be concentrated near the predicted optimum and near the balance curve.

The multiplier ρ enters only through Deff = Ds + ρDr and through the real-data cost comparison. If we never train on real data (Dr = 0 always), then ρ is not identifiable; if we always train on real data only (Ds = 0 always), then ρ is confounded with a. Hence we must include mixed-modality points. A simple and efficient pattern is a paired design at fixed (N, D) (total episodes) in which we train one run with (Ds = D, Dr = 0) and another with (Ds = 0, Dr = D). Under the model, the difference in non-floor error is approximately
$$ \Delta \mathrm{Err} \;\approx\; a\bigl(D^{-\alpha} - (\rho D)^{-\alpha}\bigr) \;=\; a\,D^{-\alpha}\bigl(1 - \rho^{-\alpha}\bigr), $$
which is informative about ρ when $a\,D^{-\alpha}$ is not too small (i.e. away from the floor) and when ρ is not extremely close to 1. In view of the threshold rule for the modality split, we additionally want to know whether we are near the indifference boundary g(N; ρ) = 0; accordingly, a practical goal is not sd(ρ̂) ≤ ε in absolute terms, but rather sd(g(N; ρ̂)) small compared to the margin |g(N; ρ)| at the candidate optimum. This aligns the experiment design with the decision it must support.
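A point estimate of ρ from a single paired run follows directly by inverting the displayed relation; the sketch below assumes a and α have already been fit and uses hypothetical measurements.

```python
# Point estimate of rho from one paired sim-vs-real run at fixed (N, D).
# Assumes a and alpha were already fit; all measurements below are hypothetical.
a_hat, alpha_hat = 2.0, 0.30
D = 50_000
err_sim, err_real = 0.42, 0.39            # errors for (Ds=D, Dr=0) and (Ds=0, Dr=D)
delta = err_sim - err_real                # ~ a * D^(-alpha) * (1 - rho^(-alpha))
base = a_hat * D ** (-alpha_hat)
rho_hat = (1.0 - delta / base) ** (-1.0 / alpha_hat)        # ~ 5 for these numbers
```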

These considerations lead to a sequential sweep that is more budget-efficient than a static grid.


Initial pilot design. We choose K0 points that span (log N, log D) over the feasible engineering range (e.g. a 3 × 3 or 4 × 4 factorial design), and we include at least two paired sim-vs-real points to seed identification of ρ. We allocate evaluation budget so that binomial noise is small compared to between-point differences, e.g. choose TM so that $1/\sqrt{TM}\ll$ the anticipated error drop across adjacent points.


Model fitting and adaptive refinement. We fit the hierarchical model and compute a posterior over (α, β, ρ). For selecting the next scaling point, we evaluate an approximate expected information gain criterion, such as maximizing the determinant of the Fisher information for (α, β, ρ) under the current posterior (a Bayesian D-optimal rule), subject to the remaining budget and feasibility constraints. A simpler surrogate is to sample near (i) the predicted balance curve, and (ii) the predicted optimizer under the posterior mean, since these points are simultaneously informative and decision-relevant.


Stopping rule. Rather than targeting exponent accuracy in isolation, we stop when uncertainty in the optimal allocation is small, e.g. when posterior draws of (α, β, ρ) imply that log N* and log Deff* have standard deviation at most τ (for a user-chosen τ), or equivalently when ε log B is empirically small in the sense of the robustness bound. This makes explicit the diminishing-returns phenomenon: once the posterior uncertainty is such that alternative allocations predicted by the posterior are near-indifferent in achieved error, further sweeps are dominated by spending the remaining budget on training at the chosen scale.

In aggregate, the pilot sample complexity is controlled by three quantities: the number of distinct training runs K (expensive), the total evaluation mass TM (comparatively cheap), and the log-ranges RD, RN (engineering-limited but crucial). Sequentially concentrating points near the balance region and near the decision boundary for the sim-vs-real split yields exponent estimates accurate enough for allocation with a small pilot fraction of B. The next section clarifies why such parametric structure is not merely convenient: without it, even formulating an efficient allocation rule is computationally intractable in the worst case.


8. Hardness Beyond Power Laws: NP-hardness for generic monotone error functions; implications for what assumptions are necessary for actionable prescriptions.

The preceding allocation rule relies on the parametric scaling assumption (H1), which converts budget allocation into a low-dimensional, well-behaved optimization problem. We now make precise why some such structure is not merely aesthetically convenient: if we remove (H1) and only assume that error improves monotonically with additional resources, then the allocation problem becomes computationally intractable in the worst case. The correct interpretation is not that practitioners ``should not’’ do allocation without power laws, but rather that any actionable prescription must implicitly exploit additional regularity (parametric form, convexity, smoothness, submodularity, etc.); otherwise no efficient algorithm can be guaranteed.

Fix a finite set of feasible model sizes 𝒩 and consider the decision variables $(N, D_s, D_r) \in \mathcal{N} \times \mathbb{Z}_{\ge 0}^2$ under the linear budget constraint κN(Ds + Dr) + csDs + crDr ≤ B. Suppose that the downstream error Err(N, Ds, Dr) is assumed to be coordinate-wise non-increasing in each argument (more model capacity and more data cannot worsen performance). Further assume that Err is presented by a value oracle: given (N, Ds, Dr), we can train/evaluate and obtain Err(N, Ds, Dr) (or a sufficiently accurate estimate thereof). This formalizes the strongest "black-box" abstraction one might hope to use when no parametric scaling law is trusted.

Consider the decision problem: given (B, κ, cs, cr) and a threshold η, determine whether there exists a feasible allocation with Err(N, Ds, Dr) ≤ η. Even when N is fixed (so only (Ds, Dr) remain), and even when we allow Err to take only finitely many values, this decision problem is NP-hard. The core reason is that monotonicity alone permits Err to encode arbitrary combinatorial ``step improvements’’ that behave like selecting items under a knapsack constraint.

We outline a reduction that captures the essence of Theorem 5 in the global context. Let a 0–1 knapsack instance be given by item weights wi ∈ ℤ > 0, item values vi ∈ ℤ > 0, a capacity W, and a target value V. The question is whether there exists a subset S ⊆ {1, …, n} such that $\sum_{i\in S} w_i \le W$ and $\sum_{i\in S} v_i \ge V$.

We construct an allocation instance with fixed N and a single data modality for simplicity; take Dr ≡ 0, κ = 0, cs = 1, and set the budget B := W. Thus feasibility is simply Ds ≤ W. The only remaining task is to define a monotone non-increasing error function of Ds that encodes the knapsack objective. To do so, we introduce data increments that can be purchased only in certain bundles: for each subset S define a special dataset size
DS := ∑i ∈ Swi.
Define Err(Ds) to be a step function that attains a low value if and only if Ds equals (or exceeds) some DS whose corresponding value is large. Concretely, set
$$ \mathrm{Err}(D_s) := 1 - \max\Bigl\{ \frac{1}{C}\sum_{i\in S} v_i \,:\, D_S \le D_s \Bigr\}, $$
where $C:=\sum_{i=1}^n v_i$ is a normalizing constant ensuring Err ∈ [0, 1], and the maximum over an empty set is defined as 0. This function is monotone non-increasing in Ds by construction. Moreover, there exists Ds ≤ W with Err(Ds) ≤ 1 − V/C if and only if there exists a subset S with DS ≤ W and $\sum_{i\in S} v_i \ge V$, which is exactly the knapsack decision problem. Hence deciding feasibility under an error threshold is NP-hard.
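To make the reduction concrete, the following sketch constructs Err(Ds) for a tiny knapsack instance by brute force over subsets (the brute force is only for illustration; the hardness statement concerns the general problem).

```python
# Monotone step-function Err(Ds) encoding a tiny 0-1 knapsack instance.
from itertools import combinations

weights = [3, 4, 5]
values = [4, 5, 6]
C = sum(values)

def err_of(Ds: int) -> float:
    """Err(Ds) = 1 - (1/C) * max{ sum of values of S : total weight of S <= Ds }."""
    best = 0
    for r in range(len(weights) + 1):
        for S in combinations(range(len(weights)), r):
            if sum(weights[i] for i in S) <= Ds:
                best = max(best, sum(values[i] for i in S))
    return 1.0 - best / C

# Some Ds <= W achieves Err(Ds) <= 1 - V/C  iff  the knapsack instance (W, V) is feasible.
W, V = 7, 9
feasible = err_of(W) <= 1.0 - V / C
```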

The same construction can be embedded in the original (Ds, Dr) formulation with linear costs by assigning separate costs to the two modalities and forcing all effective purchases to occur in, say, the real-data coordinate; fixing N removes any dependence on model size. The conclusion is that hardness is not an artifact of model-selection: it already appears in the data-allocation subproblem.

The above reduction uses an error function with discontinuous steps, which might seem unrealistic. However, the point is not that real learning curves are adversarial, but that without additional assumptions one cannot preclude adversarial instances. In particular, any algorithm that claims to output a near-optimal allocation for monotone error functions would imply P = NP, even if it is allowed to adaptively query the oracle. Moreover, simple approximation guarantees are also blocked in the worst case: by making the steps sufficiently sharp (or by introducing plateaus separated by narrow transition regions), one can force any polynomial number of oracle queries to be uninformative about where the next improvement occurs. Thus, absent structure, the sample complexity of exploration and the computational complexity of optimization are coupled in an unfavorable way.

To escape this impossibility, we must restrict the function class. The scaling-law hypothesis (H1) is one such restriction, and it is particularly convenient because it yields (i) identifiability from a modest number of scaling points (as quantified in the previous section), (ii) a convex (after change-of-variables) allocation problem with global optima (Theorem~3), and (iii) robustness of the optimizer to small parameter errors (Theorem~4). More broadly, any allocation theory with guarantees must assume some combination of: a parametric form identifiable from few measurements, an induced optimization problem that is tractable (for example, convex after reparametrization), and stability of the optimizer under estimation error.

Power laws are not uniquely privileged, but they instantiate all three properties in a form that is empirically plausible for many families of representation-learning systems and, crucially, admits transparent budget trade-offs.

We therefore treat (H1) as an enabling assumption: it is the minimal structural hypothesis under which we can both (i) estimate the relevant quantities from pilot sweeps and (ii) compute a recommended allocation with a correctness story. This viewpoint also sharpens what it means to validate the approach experimentally. It is not enough to report that larger N or larger Deff helps; rather, we must verify that within the operating range the measured errors are consistent with a model whose induced optimizer is stable. If the observed learning curves substantially violate the assumed smooth trade-off (e.g. exhibit abrupt regime changes not captured by a single pair (α, β)), then the hardness discussion predicts that allocation will be intrinsically fragile, and any claimed optimality should be regarded as heuristic.

In the next section we therefore specify an experimental protocol whose purpose is implementation-strengthening: it standardizes the benchmark suite and the accounting of Cost, fits the hierarchical scaling model on controlled grids, and compares the resulting allocations against simple baselines under equal budget, thereby testing whether the structural assumption required to avoid worst-case hardness is empirically justified.


9. Experimental Protocol (Flagged as Implementation-Strengthening): standardized procedural suite, grid over (N, Ds, Dr) at constant budget; hierarchical fits; compare against heuristics; report compute with full accounting.

We now specify an experimental protocol whose purpose is not to discover a new learning algorithm, but to make the allocation theory operational and falsifiable under controlled accounting. The protocol is designed to (i) produce a dataset of noisy observations $\widehat{\mathrm{Err}}(N,D_s,D_r)$ over a budget-feasible grid, (ii) fit the hierarchical scaling model in a manner that separates shared exponents from task idiosyncrasies, and (iii) evaluate whether the resulting optimizer yields improvements over simple, widely used heuristics at matched total cost.

We fix a benchmark family T consisting of tasks indexed by t ∈ {1, …, |T|}. Each task t is given by a procedural generator Gt(ω) producing initial states, goal specifications, and nuisance variation (textures, lighting, object instances, layouts) from a seed ω. We require that both simulation and real-world episodes admit a common episodic interface: an episode is a finite-horizon trajectory with a standardized observation and action space, and success is a Boolean event measurable at termination. We report error as Err := 1 − SR, where SR is success rate averaged over a fixed number of evaluation seeds per task. The only role of this standardization is to make the unit "episode" comparable across data sources so that (Ds, Dr) and the accounting in Cost are meaningful.

Prior to any sweeps, we publish a budget B and coefficients (κ, cs, cr) in a common cost unit. We treat κ as the conversion factor from training compute to cost, where training compute is proportional to N(Ds + Dr) (with the proportionality fixed by optimizer settings and sequence length/horizon). We measure cs as the amortized simulator cost per episode (including physics stepping, rendering if applicable, storage, and any domain randomization overhead). We measure cr as the amortized real episode cost (fleet time, operator time, resets, maintenance, and expected wear). For transparency, we additionally log a decomposed bill of materials for each run:
$$ \mathrm{Cost}=\underbrace{\kappa N(D_s+D_r)}_{\text{train}}+\underbrace{c_s D_s}_{\text{sim}}+\underbrace{c_r D_r}_{\text{real}}, $$
and we report each term separately. This decomposition is not used by the optimizer beyond the linear model, but it is essential for reproducing conclusions under alternative accounting choices.

We choose a discrete model-size set 𝒩 (e.g. a logarithmic grid spanning a plausible deployment range) and a finite set of budget slices. For each N ∈ 𝒩 and for each target effective-data level Deff on a logarithmic grid, we construct one or more allocations (Ds, Dr) that (a) achieve the desired Deff = Ds + ρ0Dr using a provisional ρ0 (e.g. ρ0 = 1 for design), and (b) satisfy Cost(N, Ds, Dr) ≤ Bslice for a designated slice budget Bslice ≤ B. In practice, because (Ds, Dr) are integers and because cs, cr may not permit exact equality, we target a narrow band Cost ∈ [(1 − ξ)Bslice, Bslice] for a small ξ (e.g. ξ = 0.02) to minimize wasted budget while maintaining feasibility. This construction ensures that comparisons across points reflect a controlled trade-off between model capacity and data, rather than confounding from different total spend.
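A minimal sketch of this grid construction is given below, assuming the published accounting coefficients are available as plain numbers; all names and example values are illustrative, and an empty result for a given Deff simply signals that no sim/real mix hits the cost band at that N.

```python
# Hypothetical helper for building one budget slice of the (N, D_s, D_r) grid.
# kappa, c_s, c_r, B_slice follow the published accounting; rho0 is the
# provisional multiplier used only for grid design.

def cost(N, d_s, d_r, kappa, c_s, c_r):
    return kappa * N * (d_s + d_r) + c_s * d_s + c_r * d_r

def slice_points(N, d_eff_levels, kappa, c_s, c_r, B_slice,
                 rho0=1.0, xi=0.02, n_mixes=11):
    """Return (N, D_s, D_r, cost) points hitting each target D_eff = D_s + rho0*D_r
    with cost inside the band [(1 - xi) * B_slice, B_slice]."""
    points = []
    for d_eff in d_eff_levels:
        for k in range(n_mixes):
            frac_real = k / (n_mixes - 1)                 # fraction of D_eff sourced from real data
            d_r = int(round(frac_real * d_eff / rho0))
            d_s = max(int(round(d_eff - rho0 * d_r)), 0)
            c = cost(N, d_s, d_r, kappa, c_s, c_r)
            if (1 - xi) * B_slice <= c <= B_slice:
                points.append((N, d_s, d_r, c))
    # An empty list here means this (D_eff, B_slice) pair is infeasible at this N.
    return points

# Example usage with made-up coefficients and a log grid of effective-data levels.
grid = slice_points(N=5e6, d_eff_levels=[1e4, 3e4, 1e5],
                    kappa=1e-9, c_s=0.01, c_r=2.0, B_slice=2e3)
```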

At each grid point (N, Ds, Dr) we run multiple independent training seeds. Each trained policy is evaluated on a fixed evaluation set: for each task t, we sample m procedural seeds and compute $\widehat{\mathrm{SR}}_{t}$, hence $\widehat{\mathrm{Err}}_{t}=1-\widehat{\mathrm{SR}}_{t}$. We retain the binomial standard error proxy $\widehat{s}_{t}=\sqrt{\widehat{\mathrm{SR}}_{t}(1-\widehat{\mathrm{SR}}_{t})/m}$ (or a beta-binomial posterior interval if preferred). We also report aggregate error $\widehat{\mathrm{Err}}=\frac{1}{|T|}\sum_t \widehat{\mathrm{Err}}_{t}$ with uncertainty obtained by task-level bootstrap. The repetition count is chosen so that the posterior over exponents concentrates sufficiently to make allocation decisions stable; we operationalize this by requiring that the induced optimizer (Step~4 of the algorithm) changes by less than a preset tolerance when refitting with a held-out subset of seeds.
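The per-task statistics are straightforward to compute; the following sketch (hypothetical helper names, NumPy only) implements the binomial standard-error proxy and the task-level bootstrap described above.

```python
# Minimal sketch of the per-task evaluation statistics; successes[t] is the
# list of Boolean outcomes over the m evaluation seeds for task t.
import numpy as np

def task_stats(successes):
    sr = np.array([np.mean(s) for s in successes])           # per-task success rate
    m = np.array([len(s) for s in successes], dtype=float)
    err = 1.0 - sr
    se = np.sqrt(sr * (1.0 - sr) / m)                          # binomial standard-error proxy
    return err, se

def aggregate_error(err, n_boot=2000, seed=0):
    """Task-level bootstrap for the uncertainty of the mean error over tasks."""
    rng = np.random.default_rng(seed)
    boots = [np.mean(rng.choice(err, size=err.size, replace=True)) for _ in range(n_boot)]
    return float(np.mean(err)), tuple(np.percentile(boots, [2.5, 97.5]))
```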

Using all collected points, we fit a hierarchical model of the form
$$ \mathrm{Err}_t(N, D_s, D_r) \;=\; a_t\,(D_s + \rho D_r)^{-\alpha} + b_t\,N^{-\beta} + E_t, $$
with shared (α, β, ρ) and task-specific (at, bt, Et). We implement inference with either Hamiltonian Monte Carlo or variational approximations, but we require posterior predictive checks: we draw replicated errors from the fitted model and verify that observed error trends across each axis (N, Deff) are within credible bands. We also perform held-out validation by withholding an entire budget slice (or an entire model size) and testing predictive accuracy; failure of this test is treated as evidence against a single-regime power law over the swept range, in which case we either restrict the operating region or fit a structured mixture (e.g. piecewise exponents) and rerun the allocation.
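For readers who prefer a non-Bayesian stand-in, the sketch below fits the same functional form by nonlinear least squares with shared exponents and per-task coefficients; it is a point-estimate analogue of the hierarchical fit, not the HMC/VI procedure itself, and the parameter bounds and initializations are illustrative.

```python
# Point-estimate analogue of the hierarchical scaling fit: shared (alpha, beta,
# rho), per-task (a_t, b_t, E_t), fitted by nonlinear least squares.
import numpy as np
from scipy.optimize import least_squares

def fit_scaling(N, Ds, Dr, err, task):
    """N, Ds, Dr, err are 1-D arrays over all grid points; task holds integer task ids."""
    T = int(task.max()) + 1

    def unpack(theta):
        alpha, beta, rho = theta[0], theta[1], np.exp(theta[2])
        a = np.exp(theta[3:3 + T])           # per-task data coefficient
        b = np.exp(theta[3 + T:3 + 2 * T])   # per-task capacity coefficient
        E = theta[3 + 2 * T:]                # per-task irreducible error
        return alpha, beta, rho, a, b, E

    def resid(theta):
        alpha, beta, rho, a, b, E = unpack(theta)
        d_eff = Ds + rho * Dr
        pred = a[task] * d_eff ** (-alpha) + b[task] * N ** (-beta) + E[task]
        return pred - err

    theta0 = np.concatenate([[0.3, 0.3, 0.0], np.zeros(T), np.zeros(T), 0.05 * np.ones(T)])
    lb = np.concatenate([[0.01, 0.01, -5.0], -10 * np.ones(2 * T), np.zeros(T)])
    ub = np.concatenate([[2.0, 2.0, 5.0], 10 * np.ones(2 * T), np.ones(T)])
    sol = least_squares(resid, theta0, bounds=(lb, ub))
    return unpack(sol.x)
```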

Given the posterior over parameters, we compute an allocation $(\widehat{N}, \widehat{D}_s, \widehat{D}_r)$ by solving the relaxed convex program and rounding to feasible integers and a feasible N ∈ 𝒩. We compare this allocation against matched-budget baselines that reflect common practice: (i) model-priority scaling, where Ds, Dr are held in fixed proportions and N is maximized; (ii) data-priority scaling, where N is fixed and data is maximized; (iii) sim-only (Dr = 0) and real-only (Ds = 0) strategies; and (iv) a "Chinchilla-style" heuristic that enforces a fixed power-law ratio between N and total data Ds + Dr, ignoring ρ. Each baseline is instantiated by explicitly solving the corresponding constrained maximization over its restricted family subject to Cost ≤ B, so that all comparisons are exact in cost, not approximate in wall-clock.
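A compact illustration of the allocator and one baseline follows. It collapses the task family into a single set of coefficients (a, b, E) and uses the fact, derived in the split analysis, that at fixed N the relaxed program spends the entire budget on whichever data modality has the lower effective price per unit Deff; all names are illustrative.

```python
# Illustrative allocator: for each candidate model size, spend the remaining
# budget on the cheaper data modality (per unit of D_eff), then pick the model
# size minimizing predicted error under the fitted single-regime model.

def allocate(model_sizes, B, kappa, c_s, c_r, a, b, E, alpha, beta, rho):
    best = None
    for N in model_sizes:
        price_sim = kappa * N + c_s              # cost of one unit of D_eff via simulation
        price_real = (kappa * N + c_r) / rho     # cost of one unit of D_eff via real data
        d_eff = B / min(price_sim, price_real)
        pred = a * d_eff ** (-alpha) + b * N ** (-beta) + E
        if price_sim <= price_real:
            d_s, d_r = int(d_eff), 0
        else:
            d_s, d_r = 0, int(d_eff / rho)
        if best is None or pred < best[0]:
            best = (pred, N, d_s, d_r)
    return best  # (predicted error, N, D_s, D_r)

# One matched-budget baseline: fix N and maximize data at a fixed sim/real proportion.
def data_priority_baseline(N, B, kappa, c_s, c_r, frac_real=0.1):
    unit = (1 - frac_real) * (kappa * N + c_s) + frac_real * (kappa * N + c_r)
    d_total = B / unit
    return N, int((1 - frac_real) * d_total), int(frac_real * d_total)
```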

For each method we report: the chosen (N, Ds, Dr); the achieved Cost and its decomposition; the measured $\widehat{\mathrm{Err}}_t$ for all t ∈ T; and aggregate $\widehat{\mathrm{Err}}$ with uncertainty. We additionally report the fitted posterior for (α, β, ρ), including credible intervals and posterior correlations, since these quantities determine the qualitative regime (model-limited versus data-limited, and sim versus real preference). Finally, we publish the exact grid specification, procedural seeds for evaluation, and the cost accounting script. Under this protocol, the theory makes a concrete, refutable prediction: if (H1)–(H2) are a good local model of the learning curves, then the allocation computed from the fit should outperform the baselines at equal total cost, and the improvement should be stable under small perturbations of the fitted parameters.


10. Discussion: implications for 2026 robot data engines, sim-to-real strategy, benchmark standardization; limitations and extensions (multi-modal mixtures, test-time compute, safety constraints).

Our primary claim is that, once one accepts a locally valid scaling model of the form
$$ \mathrm{Err}(N, D_s, D_r) \;=\; a\,(D_s + \rho D_r)^{-\alpha} + b\,N^{-\beta} + E $$
together with an explicit accounting
$$ \mathrm{Cost}(N, D_s, D_r) \;=\; \kappa N (D_s + D_r) + c_s D_s + c_r D_r, $$
then allocation ceases to be a matter of taste and becomes an optimization problem with falsifiable predictions. We regard this as directly relevant to the design of "robot data engines" in 2026, where the dominant engineering question is no longer whether to collect data, but how much simulated data, how much real data, and how much model capacity to buy under a shared budget across simulation infrastructure, fleet operations, and training compute.

A practical data engine must internalize that data are not free even in simulation: increasing Ds increases both generation cost csDs and training cost κNDs. Likewise, increasing Dr increases generation cost crDr and training cost κNDr, and in many realistic systems κN(Ds + Dr) becomes dominant once N is large. Our formulation suggests that a mature data engine should surface a single ledger of marginal cost per unit of effective data across all levers, rather than reporting separate "dataset size" and "model size" milestones. In particular, the quantity min{κN + cs, (κN + cr)/ρ} plays the role of an effective price per unit Deff, and it is this price, not raw episode counts, that governs the optimal spend in the relaxed program. Thus, the relevant KPI for an organization is not merely "episodes collected per day" but "effective episodes per unit cost at the current N," which changes as architectures and training stacks evolve.

The threshold rule implicit in the split optimization clarifies how we should reason about sim-to-real strategy. For fixed (N, Deff), the minimum-cost way to realize Deff is either all-sim or all-real except at a measure-zero boundary; convex mixtures are cost-optimal only when the effective per-unit prices match:
$$ \kappa N + c_s \;=\; \frac{\kappa N + c_r}{\rho}. $$
This observation is often obscured in informal discussions that treat "some real data" as intrinsically necessary. In our model, real data are necessary only insofar as they lower the effective price of Deff or change the functional form of Err (i.e., violate the single-ρ assumption). Operationally, the rule implies that efforts to improve simulation fidelity, domain randomization, or synthetic-to-real alignment should be evaluated through their impact on the effective multiplier ρ (and possibly on the validity of the shared exponent α), because a modest increase in ρ can flip the inequality and thus change the optimal split discontinuously. Conversely, reducing cr through better resets, autonomy in data collection, or lower-latency teleoperation can matter as much as increasing ρ; both act through the same threshold comparison once κN is accounted for.
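For concreteness, the equality above can be rearranged into a critical multiplier (the symbol ρ*(N) is introduced here only for illustration): real data is the cheaper route to Deff at fixed N exactly when ρ exceeds
$$ \rho^{\star}(N) \;=\; \frac{\kappa N + c_r}{\kappa N + c_s}. $$
For example, with κN = 2, cs = 1, and cr = 30 in a common cost unit (illustrative numbers, not measurements), the optimal split flips from all-sim to all-real once the fitted ρ crosses 32/3 ≈ 10.7.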

A second implication is methodological: if we wish to speak meaningfully about allocations, we must make an episode'' anda unit of cost’’ comparable across labs and platforms. The protocol’s insistence on a common episodic interface, published (κ, cs, cr), and explicit decomposition of Cost is not bureaucratic; it is the minimum structure required for an allocation claim to be transportable. In the absence of such standardization, apparent wins can be artifacts of unreported simulator amortization, different reset labor, or inconsistent evaluation seeds. For 2026-era benchmarks, we therefore expect that the benchmark specification will include not only task generators Gt and success criteria, but also a recommended accounting rubric (what is counted in cr, whether rendering is included in cs, how κ is computed), so that scaling sweeps across organizations can be compared on a common axis.

The most immediate limitation is the assumption that a single pair of exponents (α, β) and a single ρ govern the entire swept region and all tasks. In robotics, the learning curve can exhibit regime changes: exploration-limited behavior at small D, representation-limited behavior at small N, and saturation effects as E dominates. Moreover, "real data" is itself heterogeneous: teleoperation demonstrations, autonomous rollouts, failure cases, and safety interventions may have different effective multipliers. A single ρ may therefore average over qualitatively distinct contributions, and the threshold rule may become misleading if Dr is not a scalar commodity. One principled extension is to replace Deff = Ds + ρDr by a multi-component effective volume $D_{\mathrm{eff}} = D_s + \sum_j \rho_j D_{r,j}$ with type-specific costs $c_{r,j}$, yielding a linear program for the split at fixed (N, Deff); another is to adopt a latent-mixture scaling model in which different tasks or data types belong to different regimes with different (α, β, ρ). The latter sacrifices some convexity but remains tractable with structured priors and posterior risk minimization.
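As a sketch of that multi-type split at fixed (N, Deff), the continuous relaxation is a one-constraint linear program; the type-specific ρj and cr,j are the hypothesized extension above, and the generic LP solver below stands in for whatever solver one prefers.

```python
# Sketch of the multi-type split at fixed (N, D_eff): minimize total cost over
# (D_s, D_{r,1}, ..., D_{r,J}) subject to D_s + sum_j rho_j * D_{r,j} >= D_eff.
import numpy as np
from scipy.optimize import linprog

def cheapest_split(N, d_eff, kappa, c_s, c_r, rho):
    """c_r and rho are length-J arrays (one entry per real-data type)."""
    c_r, rho = np.asarray(c_r, float), np.asarray(rho, float)
    # decision vector x = [D_s, D_r1, ..., D_rJ]; objective = training + generation cost
    cost_coef = np.concatenate([[kappa * N + c_s], kappa * N + c_r])
    # constraint -(D_s + sum_j rho_j D_rj) <= -d_eff, i.e. reach the target D_eff
    A_ub = -np.concatenate([[1.0], rho]).reshape(1, -1)
    res = linprog(cost_coef, A_ub=A_ub, b_ub=[-d_eff],
                  bounds=[(0, None)] * (1 + len(rho)))
    return res.x  # continuous relaxation; round to integers for actual runs
```

Because the objective and constraint are both linear, the optimum sits at a vertex: all effective volume is purchased from the single cheapest type, which is the multi-type analogue of the all-sim/all-real threshold rule.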

Our analysis treats N as the sole model-side lever. In practice, deployed performance can also improve with test-time compute: longer horizons, model-predictive rollouts, diffusion sampling steps, or retrieval over large episodic memories. These introduce an additional resource Ctest with an associated cost and a potentially different scaling exponent. A minimal extension is to augment the error model by an additive term $d\,C_{\mathrm{test}}^{-\gamma}$ and add a deployment budget constraint, or to treat Ctest as bounded and incorporate it into E as a deployment-induced offset. Either way, the allocation problem becomes multi-dimensional but retains the same logic: we equate marginal error reductions per unit cost across training compute, data, and inference compute, subject to feasibility constraints such as latency bounds and on-device memory.
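As a sketch of that logic under the hypothesized extension (the term $d\,C_{\mathrm{test}}^{-\gamma}$ and the deployment cost derivative are assumptions of this extension, not quantities fitted by the protocol), an interior optimum of the augmented program equates marginal error reduction per marginal unit of cost across the three levers:
$$ \frac{a\,\alpha\,D_{\mathrm{eff}}^{-\alpha-1}}{\partial \mathrm{Cost}/\partial D_{\mathrm{eff}}} \;=\; \frac{b\,\beta\,N^{-\beta-1}}{\partial \mathrm{Cost}/\partial N} \;=\; \frac{d\,\gamma\,C_{\mathrm{test}}^{-\gamma-1}}{\partial \mathrm{Cost}/\partial C_{\mathrm{test}}}, $$
with corner solutions (e.g. Ctest pinned at a latency bound) handled by the usual complementary-slackness adjustments.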

Finally, robotics forces us to confront safety as a first-class constraint. The objective Err = 1 − SR collapses multiple failure modes into a scalar and does not distinguish benign failures from hazardous ones. A safety-aware allocation should instead optimize a composite objective, for example
Errλ := (1 − SR) + λ Risk,
or impose chance constraints Pr (unsafe event) ≤ η on the learned policy. Real-world data collection itself can be constrained by safety, creating a coupling between Dr and admissible policies during data gathering. These considerations suggest replacing the single constrained program by a constrained risk minimization problem (e.g. with CVaR penalties) and treating safety events as separate observables in the scaling fit. The main conceptual point persists: once risk metrics are logged and priced, the allocator can trade off safer (but more expensive) real data, improved simulation safety filters, and additional training compute in a common currency, and the resulting policy choice becomes auditable.

In summary, we view the scaling-law allocator not as a final theory of robot learning, but as a disciplined interface between empirical learning curves and organizational resource decisions. Its value is greatest when it forces explicit declarations: what is counted as cost, what is treated as data, and what constitutes success. The appropriate next step is not to make the model more complicated by default, but to expand it only where posterior predictive checks fail, thereby maintaining the central benefit of the approach: allocation decisions that are both computationally solvable and empirically refutable.