We consider the practical problem of allocating a finite budget across three coupled resources: model capacity, simulated experience, and real-world experience. In large language modeling, this allocation problem is now routinely addressed by empirical scaling laws together with compute-optimal prescriptions (notably the data–parameter tradeoffs highlighted by Chinchilla-style analyses). In robotics, the analogous question is both more constrained and less settled. It is more constrained because real interaction data carries nontrivial marginal cost, operational risk, and latency; it is less settled because the field lacks standardized reporting of training compute, data volumes, and evaluation protocols across sufficiently broad sweeps. Our objective in this work is to place robotics training runs under an explicit budget model and to derive an allocation rule that is compute-optimal under a minimal scaling-law hypothesis, while remaining actionable when only noisy measurements are available.
The robotics setting differs from the language setting in several ways that make naive transposition of language scaling heuristics unreliable. First, the unit of data is not a token but an episode, trajectory segment, or transition, and the effective informational content of such units varies substantially across collection modalities. An episode acquired via on-robot teleoperation may include high-fidelity contact dynamics, sensor idiosyncrasies, and safety-limited exploration; a nominally similar episode in simulation may be cheaper and more diverse, but may omit salient disturbances and lead to systematic sim-to-real gaps. Consequently, a data point cannot be priced solely by storage or bandwidth: it must be priced by end-to-end marginal cost (robot time, human time, wear, and opportunity cost) and valued by its contribution to downstream success. Second, robotics policies are trained under a range of objectives and algorithms (behavior cloning, offline reinforcement learning, online reinforcement learning, hybrid methods), each with different sensitivities to model size and data quality. Third, evaluation is intrinsically multi-task and nonstationary: minor changes in embodiment, lighting, or object geometry can shift performance, and long-horizon tasks can fail through compounding error. Any compute-optimal prescription must therefore be phrased in terms of expected downstream error over a benchmark family rather than a single curated test set.
Despite these differences, we claim that the underlying resource-allocation problem is structurally similar: diminishing returns in both data and model capacity suggest that a small number of power-law exponents can govern the tradeoff between investing in larger models and investing in additional experience, once the experience is expressed in an appropriate effective unit. The central methodological question is then not whether scaling exists in some idealized limit, but whether a sufficiently accurate parametric approximation can be fit over the feasible budget range to guide decisions. We adopt the position that an explicit parametric model is an algorithmic prior: without it, the budget allocation problem becomes combinatorial and, in general, intractable to optimize adaptively. With it, we can reduce the allocation problem to a low-dimensional program whose solution can be interpreted and stress-tested.
A further robotics-specific issue is that the data pipeline itself is
a controllable system. Modern robotics training relies increasingly on data engines:
procedural simulation, domain randomization, auto-curricula, fleet
collection with shared autonomy, and teleoperation interfaces with
varying levels of assistance. These engines do not merely produce more
data; they change the distribution of data, the cost per episode, and
the effective coverage of the task family. Accordingly, the question
``how much data should we collect?'' is inseparable from ``in
what modality, at what marginal cost, and with what quality multiplier
relative to simulation?'' We therefore model simulation and real data as
distinct resources coupled through an effective-data mapping. This
mapping is intentionally simple: we aim for a parameterization that can
be estimated from limited sweeps and is stable enough to support
optimization under budget uncertainty.
We emphasize that the allocation problem must include training compute. In robotics, it is common to report data volume and success rates while leaving the training compute implicit (or incomparable) due to differences in architectures, batch sizes, and wall-clock constraints. Yet compute is often the dominant cost in large-scale training regimes, and it interacts with data choices: increasing data without increasing compute can undertrain, while increasing compute without data can overfit or saturate. We therefore measure compute cost in a common budget unit and include it explicitly in the total cost. This yields a single constrained optimization problem that can be solved end-to-end, rather than a sequence of ad hoc decisions (e.g., fixing model size by convenience, then collecting as much data as possible, then tuning for wall-clock).
From an engineering perspective, the decision-maker typically faces the following trade: simulation episodes can be generated cheaply and at scale, but with imperfect fidelity; real episodes are expensive, slow, and subject to safety constraints, but can sharply reduce model error that simulation fails to expose. The effective value of real data is thus not only higher on average but also potentially nonlinear in the regime where sim-to-real mismatch dominates. Nonetheless, for purposes of allocation, we require a tractable summary. We adopt a regime in which the marginal return of additional real data can be modeled through a constant quality multiplier over simulation, interpreted as an average exchange rate for coverage of the benchmark distribution. This is not a statement that real and simulated trajectories are interchangeable pointwise; it is a statement that, as an aggregate resource for reducing benchmark error, one unit of real experience can be worth multiple units of simulated experience.
The introduction of an effective-data exchange rate allows us to disentangle two decisions that are otherwise coupled: (i) how much data to acquire in total, and (ii) how to split that effective data between simulation and real-world collection given their costs. This separation is critical in robotics because the split is often the highest-leverage decision under fixed budget: a small amount of well-chosen real data can dominate large increments of simulation if the real data targets failure modes induced by model mismatch. Conversely, if the quality multiplier is modest or real collection is prohibitively expensive, it is rational to allocate almost entirely to simulation and accept the residual gap. Our contribution is to formalize this trade and to show that, under the assumed structure, the split reduces to a threshold comparison of effective cost per unit of coverage.
We also require that the prescription be usable in the common situation where scaling parameters are unknown. Robotics groups rarely have the resources to run exhaustive sweeps over model size and both data modalities. Instead, we advocate a pilot design that spans the feasible log-space with a modest budget, fits a shared-exponent model across tasks with random effects, and then allocates the remaining budget according to the implied compute-optimal solution. The intent is not to treat the scaling law as ground truth, but to treat it as a compact hypothesis that can be falsified and updated: when predictions fail, the remedy is additional pilot points targeted to the region of disagreement.
Finally, we delimit scope. Our focus is on training-time allocation for a fixed embodiment and sensor stack, and on downstream task error measured on a predefined benchmark family. We do not attempt to optimize inference-time latency, memory constraints, or on-device deployment costs, except insofar as they induce feasibility constraints on model size. We also do not claim universality of the exponents across all robot morphologies or learning paradigms; rather, we claim that within a coherent benchmark family and training recipe, the scaling model can be estimated and exploited to yield a principled allocation that improves over naive heuristics. The next section situates this viewpoint within prior work on scaling laws, compute-optimal training, and sim-to-real data pipelines.
Empirical power-law relations between downstream error and training resources have been documented most prominently in language modeling, beginning with studies that fit error as a separable function of model size and data size over multi-order-of-magnitude sweeps. A key methodological lesson from this literature is that, once a sufficiently stable parametric form is available, one may treat training as a constrained optimization problem in which compute, parameters, and data are traded off to minimize error under a fixed budget. Subsequent analyses refined the compute-optimal prescription by emphasizing that, under fixed training compute, under-training on too little data can dominate returns from scaling parameters, leading to an optimal data–parameter frontier (the ``Chinchilla'' regime) that differs from earlier heuristic allocations. While the details of the exponents and constants depend on architecture, optimizer, and data curation, the allocation principle is robust: diminishing returns imply that marginal gains per unit cost should be balanced across the controllable resources.
The scaling-law viewpoint has also been pursued in other domains, including vision, multimodal learning, and reinforcement learning, with varying degrees of regularity and with additional confounders such as distribution shift and evaluation protocol changes. Two observations are relevant to our setting. First, scaling fits are most predictive when evaluated on a fixed benchmark distribution and a fixed training recipe; changing either can alter the effective exponents and, in extreme cases, the functional form. Second, even when a parametric form is not exact, it can serve as an actionable prior for experiment design: low-dimensional models enable adaptive sampling of resource configurations and provide a principled alternative to ad hoc ablations. Our work adopts this latter stance, treating the scaling law as a hypothesis to be fit, stress-tested, and used for allocation within the feasible budget regime.
In robotics, large-scale policies trained from heterogeneous data
(e.g., multi-task behavior cloning from demonstrations, or hybrid
imitation–reinforcement learning systems) have shown substantial gains
from increasing dataset size and model capacity, and several systems
papers have reported monotone improvements when scaling data engines and
policy networks. However, the robotics evidence base differs from
language in two structural ways. First, robotics evaluations are
typically multi-task with significant heterogeneity across tasks;
aggregating success rates can obscure task-dependent saturation and can
induce apparent scaling even when only a subset of tasks improve.
Second, robotics training pipelines vary widely in observation
modalities, action representations, and learning objectives, making
cross-paper comparisons of ``data size'' and ``model size'' less
meaningful without a common accounting of compute and collection cost.
Consequently, while there are compelling demonstrations that ``more data
and larger models help,'' there are fewer studies that execute
systematic sweeps over (N, D) while holding the
rest of the pipeline fixed, and fewer still that incorporate the cost of
acquiring real-world interaction data as a first-class constraint.
A number of community efforts have emphasized the need for standardized robotics benchmarks and for clearer reporting of experimental details, including dataset composition, number of environment steps, and evaluation protocol. Yet even when environment steps or episodes are reported, the mapping from those counts to training cost is often opaque: the same number of transitions may correspond to vastly different compute depending on architecture, optimizer settings, sequence length, replay strategy, augmentation, and number of gradient updates per transition. Moreover, real-world collection cost is rarely comparable across labs, as it depends on fleet scale, autonomy tooling, human supervision, safety procedures, and wear. In the absence of consistent compute and cost reporting, it is difficult to infer whether observed gains are primarily due to larger models, more optimization steps, better data quality, or simply greater total investment. This motivates our explicit cost model, which converts training and data generation into a common budget unit and thereby makes allocation questions well-posed.
The broader machine learning community has argued for reporting training compute (e.g., FLOPs, accelerator-hours, or energy) to enable reproducibility and to quantify efficiency improvements. Robotics adds an additional axis: interaction data must often be generated rather than merely curated, and the marginal cost of generation differs sharply between simulation and the real world. Simulation entails simulator compute, rendering, and engineering overhead, while real collection entails robot time, human teleoperation or oversight, maintenance, and opportunity cost. These costs can dominate even when training compute is moderate, or conversely be negligible relative to training compute when models and optimization are very large. Our accounting therefore treats the cost of training as proportional to N times the total number of episodes used in optimization (captured by κ), and adds linear per-episode costs cs and cr for simulation and real collection. This is not intended as a complete micro-economic model; rather, it is a minimal abstraction that renders the allocation problem explicit and allows sensitivity analysis with respect to the coefficients.
A central theme in robotics is the sim-to-real gap: policies trained heavily in simulation can fail in the real world due to unmodeled dynamics, contact phenomena, sensing artifacts, or distributional mismatch. Classical approaches include system identification, dynamics randomization, domain randomization, privileged learning, and residual adaptation; modern approaches increasingly combine simulation and real datasets, sometimes with fine-tuning or online adaptation. From the perspective of allocation, the key quantity is not whether simulation can eventually match reality, but the relative return between modalities as measured by downstream error reduction per unit cost. In many regimes, a small amount of real data can disproportionately reduce error by revealing failure modes absent in simulation, while in other regimes the real data may be too narrow or too expensive to justify large investment. Our model encodes this phenomenon via a real-data quality multiplier ρ, which summarizes the average effective contribution of one real episode relative to one simulated episode for a fixed benchmark distribution and training recipe. We emphasize that ρ is an aggregate parameter; it does not assert pointwise interchangeability, but it enables a tractable optimization that can be revisited as ρ is re-estimated under improved simulators or collection tooling.
Recent robotics systems increasingly treat data collection as an adaptive process driven by ``data engines'': procedural task generation, automatic curriculum construction, domain randomization schedules, shared autonomy interfaces, and active selection of scenarios for teleoperation or intervention. These mechanisms alter not only the volume of data but also its distribution and marginal cost, often in ways that are difficult to capture with a single scalar. Nonetheless, from a budget-allocation standpoint, one must ultimately decide how much to spend on each engine and when to switch modalities. Our formulation isolates the simplest decision boundary: given an intended effective data volume Deff and a fixed model size N, the cost-minimizing split between simulation and real collection reduces to a threshold rule determined by (cs, cr, κ, ρ). This provides a baseline that can be embedded into richer data-engine controllers, where ρ may itself be a function of collection policy and task mix.
Finally, we distinguish compute-optimal allocation from deployment (inference-time) constraints. In many robotics applications, inference latency, memory footprint, and on-device power impose an upper bound Nmax on feasible model size, and techniques such as distillation or quantization are used to reconcile large training models with small deployment models. These considerations are orthogonal to our main question, which is how to allocate a training budget across model size and data modalities to minimize benchmark error for a fixed recipe. We therefore treat inference constraints only as feasibility restrictions on N, and we do not attempt to optimize the training–inference trade directly. Within this scope, the prior work above motivates our approach: adopt a low-dimensional scaling model, fit it with limited sweeps, and use it to compute an allocation that is explicitly optimal under a stated cost model.
We study the following budgeted allocation problem: given a fixed
training recipe (architecture family, optimizer, augmentation, rollout
processing, and evaluation protocol), we choose three primary
resources—model size and two data modalities—in order to minimize
downstream error on a fixed benchmark distribution. Concretely, an allocation is a
triple
(N, Ds, Dr),
where N denotes the number of
trainable parameters of the policy (or more generally a scalar proxy for
model capacity, treated as discrete in practice), Ds is the number
of simulated training episodes or demonstrations available to the
learner, and Dr is the number
of real-world training episodes or demonstrations. We take Ds, Dr ∈ ℤ ≥ 0,
with the understanding that the same formalism covers transition counts,
trajectory segments, or other episode-like units provided the training
and evaluation pipelines use a consistent notion of ``one unit of
data.''
Let T denote the benchmark
task family used to evaluate generalization. In the simplest case, T is a single task and evaluation
returns a success rate SR ∈ [0, 1]
computed over a fixed number of trials. In the multi-task case, we index
tasks by t and consider SRt per task; an
aggregate score may be an unweighted average $\frac{1}{|T|}\sum_t \mathrm{SR}_t$ or a
weighted average reflecting deployment priorities. Our optimization
objective is expressed in terms of an error functional Err that is monotone decreasing in success
rate. For definiteness, we may take
Err := 1 − SR,
or, if the community reports percentages, Err := (100 − SR)/100. All subsequent
analysis is invariant to such affine rescalings, and we only assume
Err ∈ [0, 1] after normalization.
Because robotics evaluations can exhibit substantial stochasticity (sensor noise, stochastic resets, randomized task parameters, and policy stochasticity), we distinguish the latent expected error from observed estimates. If an evaluation uses m trials per task, then conditional on a fixed trained policy the empirical success rate $\widehat{\mathrm{SR}}_t$ is approximately binomial, hence $\widehat{\mathrm{Err}}_t = 1-\widehat{\mathrm{SR}}_t$ has standard error on the order of $\sqrt{\mathrm{SR}_t(1-\mathrm{SR}_t)/m}$. In addition, training randomness (initialization, data shuffling, stochastic optimization) induces run-to-run variability; in our experimental design we therefore treat each training run as producing a noisy observation $\widehat{\mathrm{Err}}(N,D_s,D_r)$ of an underlying mean error Err(N, Ds, Dr) for the fixed recipe and task distribution.
The pair (Ds, Dr) describes the number of episodes collected in each modality. In many pipelines, the learner performs multiple epochs over a dataset, uses replay buffers, or performs off-policy updates that reuse transitions. We absorb such reuse into the training-compute coefficient introduced below; equivalently, one may interpret Ds and Dr as the number of episodes while the total number of gradient updates is controlled by fixed recipe hyperparameters. If, instead, the practitioner explicitly chooses the number of optimization steps per episode, then our accounting can be extended by allowing κ to depend on the step schedule; we keep the minimal abstraction to isolate the primary budget tradeoff.
We emphasize that the allocation (N, Ds, Dr) is not intended to capture all determinants of performance. It is a controlled decision space in which other factors (architecture shape, observation encoding, loss, regularization, environment randomization settings) are held fixed during the scaling sweeps. The point of the accounting is not to model all sources of variation but to enable a meaningful constrained optimization once a stable experimental protocol is fixed.
We convert heterogeneous expenditures (accelerator compute, simulator throughput, fleet time, and human labor) into a single scalar budget B measured in an arbitrary but fixed cost unit (e.g., dollars, GPU-hours multiplied by a monetary rate, or any internal accounting unit). The only requirement is that all cost coefficients be expressed in the same unit.
Our total cost decomposes into training compute and data-generation
costs,
$$
\mathrm{Cost}(N, D_s, D_r) \;=\; \kappa N\,(D_s + D_r) \;+\; c_s D_s \;+\; c_r D_r,
$$
and the allocation must satisfy Cost(N, Ds, Dr) ≤ B.
Here κ is a training-compute
coefficient converting the product N(Ds + Dr)
into cost, while cs and cr are
per-episode data-generation costs in simulation and in the real world,
respectively.
The term κN(Ds + Dr) should be read as a first-order model of training FLOPs. Under a fixed recipe, the work per episode scales approximately linearly with model size (forward/backward passes) and linearly with the number of episodes consumed by optimization. If sequence length varies, one may interpret Ds + Dr as the total number of fixed-length chunks; if the episode length distribution is stable across modalities, the distinction is inessential. More detailed accounting (e.g., attention quadratic costs in context length, or modality-specific encoders) can be incorporated by replacing N with a measured per-step FLOP estimate; we retain N as a simple proxy because it is the primary controllable axis in typical scaling sweeps.
The coefficients cs and cr capture marginal costs of obtaining data. For simulation, cs includes simulator compute, rendering, logging, storage, and any amortized engineering overhead attributable to generating an additional episode. For real-world collection, cr includes robot depreciation and maintenance attributable to use, operator or teleoperation time, safety supervision, lab overhead, and opportunity cost of tying up hardware. When data are harvested opportunistically (e.g., from an existing deployment), cr may be small; when collection requires dedicated teleoperation, cr can dominate all other terms. Our formulation is designed to make such regime changes explicit through the coefficients rather than through informal narrative.
The coefficients (κ, cs, cr) are inputs to the allocation problem and can be set either by direct accounting or by measurement. For example, κ can be estimated from a pilot training run by recording wall-clock time or accelerator-hours and dividing by N(Ds + Dr) for the fixed recipe; the resulting κ implicitly incorporates optimizer overhead, communication cost, and the chosen number of passes through the data. Similarly, cs can be estimated by measuring simulator throughput and cost per hour, and cr by measuring the marginal labor and robot time required to collect and validate one additional episode under the lab’s procedures.
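As a concrete illustration of this accounting, the following sketch backs out κ from a measured pilot run and checks a candidate allocation against the budget. All figures are made up, and the helper names (estimate_kappa, cost) are hypothetical; this is a sketch of the bookkeeping, not a prescribed implementation.

```python
# Minimal sketch, with made-up figures: back out the training-compute
# coefficient kappa from a pilot run and check an allocation against the budget.

def estimate_kappa(accelerator_hours, hourly_rate, N, D_s, D_r):
    """kappa = observed training cost divided by N * (D_s + D_r) for the fixed recipe."""
    return (accelerator_hours * hourly_rate) / (N * (D_s + D_r))

def cost(N, D_s, D_r, kappa, c_s, c_r):
    """Total cost: training compute plus per-episode data-generation costs."""
    return kappa * N * (D_s + D_r) + c_s * D_s + c_r * D_r

# Hypothetical pilot: 120 accelerator-hours at 2.5 cost units/hour,
# a 50M-parameter model trained on 20k simulated episodes.
kappa = estimate_kappa(120.0, 2.5, N=50e6, D_s=20_000, D_r=0)
c_s, c_r = 0.02, 15.0      # assumed per-episode costs (same cost unit)
B = 50_000.0               # assumed total budget
print(cost(200e6, 100_000, 2_000, kappa, c_s, c_r) <= B)
```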
We do not claim that this cost model is a complete micro-economic model. It is a minimal abstraction that (i) is linear in the controllable quantities, (ii) makes tradeoffs between training compute and data collection explicit, and (iii) supports sensitivity analysis: one may vary (κ, cs, cr) to understand which axis is limiting in a given operational setting.
In practice, N is restricted to a finite grid 𝒩 determined by architectural choices and engineering constraints, and the data volumes are integers. We therefore regard the allocation problem as a mixed discrete optimization. For analysis and for deriving prescriptions, it is convenient to consider a continuous relaxation in which N > 0 and Ds, Dr ≥ 0 are real-valued; one may then round to feasible values and verify that the rounded allocation respects the budget constraint. Additional feasibility constraints may be imposed without changing the accounting structure, e.g., an upper bound N ≤ Nmax due to inference-time memory or latency, or a wall-clock bound that effectively limits κ or the maximal N(Ds + Dr) achievable within a deadline. The remainder of the paper treats the budget constraint Cost(N, Ds, Dr) ≤ B as the binding constraint and analyzes how to allocate B across (N, Ds, Dr) once a predictive model of Err(N, Ds, Dr) is specified.
We now posit a parametric model for the latent mean error Err(N, Ds, Dr) and describe how it is fit from noisy training-and-evaluation runs. The role of this section is not to argue that power laws are universally valid, but to define a tractable hypothesis class that (i) is expressive enough to capture the dominant empirical trends in controlled scaling sweeps, and (ii) renders the downstream allocation problem algorithmically solvable. The subsequent allocation theory should be read as conditional on the adequacy of the present model.
We model the dependence on data modality through a single effective
data volume
Deff := Ds + ρDr,
where ρ ≥ 1 is a real-data
quality multiplier. The intended meaning is that, at fixed recipe and
task distribution, one additional real episode yields the same reduction
in generalization error as ρ
additional simulated episodes, after averaging over the benchmark
distribution. We then postulate the joint scaling form
$$
\mathrm{Err}(N, D_s, D_r) \;=\; a\,D_{\mathrm{eff}}^{-\alpha} \;+\; b\,N^{-\beta} \;+\; E,
$$
with constants a, b > 0, offset E ∈ [0, 1), and exponents α, β ∈ (0, 1). The additive
structure is a deliberate simplification: it asserts approximate
separability between the effect of more data (after modality
aggregation) and the effect of larger models. Empirically, such
separability is often accurate over a moderate range of resources and is
sufficient to derive actionable prescriptions; we treat deviations as
model mismatch to be diagnosed, rather than as a priori disproof.
The offset E plays two roles. First, it captures irreducible error due to partial observability, actuator limits, benchmark stochasticity, or recipe misspecification. Second, it prevents the model from spuriously forcing Err → 0 as N, Deff → ∞ within a range where the benchmark saturates. For identifiability and numerical stability we constrain E away from 1 and treat Err ∈ [0, 1] after normalization, as above.
Each scaling point (Ni, Ds, i, Dr, i) yields an observed error estimate $\widehat{\mathrm{Err}}_i$ obtained by evaluating a trained policy on a finite number of trials (and typically over multiple random seeds). Because success is binary at the trial level, it is natural to model $\widehat{\mathrm{SR}}_{t,i}$ as binomial (or beta-binomial to account for overdispersion), hence $\widehat{\mathrm{Err}}_{t,i}=1-\widehat{\mathrm{SR}}_{t,i}$ is noisy even when training randomness is absent. We therefore fit under a heteroscedastic noise model, weighting points by their estimated standard errors when using approximate likelihoods.
In the multi-task setting we allow task-specific constants while
sharing exponents and ρ.
Concretely, for tasks t ∈ T we write
Errt(N, Ds, Dr) = atDeff−α + btN−β + Et,
with hierarchical priors on (at, bt, Et)
and shared (α, β, ρ). This
pooling is not cosmetic: it reduces variance in estimating α, β, ρ by
leveraging that slopes in log-space are often approximately invariant
across tasks within a benchmark family, whereas vertical shifts vary
substantially due to intrinsic difficulty.
The parametrization (a, α, ρ) is only partially identifiable without targeted variation in (Ds, Dr). Indeed, if all training runs satisfy a fixed ratio Dr/Ds = λ, then Deff = (1 + ρλ)Ds and the data term becomes a(1 + ρλ)−αDs−α; only the product a(1 + ρλ)−α is identified, not ρ itself. Thus, to estimate ρ we must include runs that vary Dr at (approximately) fixed Ds, or vary Ds at fixed Dr, so that the model observes differential returns to the two modalities.
Similarly, E is weakly identified unless the sweep includes points near saturation. If all observed errors are far from the floor, then E trades off against a (and against b if N is small), producing broad posterior uncertainty. For this reason, in pilot sweeps we prefer to include at least one high-resource point (large N and large Deff) to anchor E, even if that point is not itself cost-effective. Conversely, if evaluations are so noisy that several points appear to outperform the plausible floor, unconstrained fits may drive E < 0; we therefore impose E ≥ 0 and treat residual optimism as noise.
The separation between N and Deff also requires that sweeps vary both axes. If, for example, one only scales N while holding (Ds, Dr) fixed, then the data term is a constant and α is unidentifiable; likewise, only scaling data leaves β unidentifiable. We therefore interpret the scaling form as a model whose parameters are meaningful only when estimated from a factorial (or otherwise sufficiently rich) design.
When E is negligible over the
observed range, the scaling form implies approximately linear relationships on log-log
plots:
log Err ≈ log a − αlog Deff (at
large
N), log Err ≈ log b − βlog N (at
large Deff).
However, once E is
non-negligible, naive log transforms are biased because log (Err − E) is not observed. We
therefore fit in the original error domain (or via a likelihood on
success counts), while using log-log plots only diagnostically. In
practice, we regularize E
toward small values unless saturation is clearly supported, since
otherwise E may absorb
variance and flatten estimated slopes.
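A minimal sketch of such a fit is given below, assuming a single task, a weighted nonlinear least-squares in the error domain, and invented pilot data and starting values; a full treatment would replace this with the hierarchical model with task random effects described above.

```python
# Sketch: fit the scaling form in the error domain with heteroscedastic
# weights (binomial standard errors). All data arrays and initial values are
# assumptions for illustration; a hierarchical fit would share exponents
# across tasks.
import numpy as np
from scipy.optimize import least_squares

def err_model(theta, N, D_s, D_r):
    log_a, log_b, E, alpha, beta, log_rho = theta
    D_eff = D_s + np.exp(log_rho) * D_r
    return np.exp(log_a) * D_eff**(-alpha) + np.exp(log_b) * N**(-beta) + E

def residuals(theta, N, D_s, D_r, err_hat, se):
    return (err_model(theta, N, D_s, D_r) - err_hat) / se

# Hypothetical pilot sweep: (N, D_s, D_r, observed error, standard error).
N    = np.array([2e7, 2e7, 2e8, 2e8, 2e8, 2e9])
D_s  = np.array([1e4, 1e5, 1e4, 1e5, 0.0, 1e5])
D_r  = np.array([0.0, 0.0, 0.0, 0.0, 1e4, 2e3])
err  = np.array([0.62, 0.48, 0.55, 0.37, 0.41, 0.30])
se   = np.full_like(err, 0.02)

theta0 = np.array([np.log(5.0), np.log(50.0), 0.05, 0.3, 0.3, np.log(2.0)])
lo = [-10, -10, 0.0, 0.01, 0.01, 0.0]             # E >= 0, rho >= 1
hi = [ 10,  10, 0.5, 1.0,  1.0,  np.log(20.0)]
fit = least_squares(residuals, theta0, bounds=(lo, hi),
                    args=(N, D_s, D_r, err, se))
print(dict(zip(["log_a", "log_b", "E", "alpha", "beta", "log_rho"], fit.x)))
```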
We emphasize that the scaling form is expected to hold only over a regime where the training recipe is stable and the benchmark distribution does not induce qualitative phase changes. Two common violations are as follows.
First, regime change in the exponents: the effective exponent may change once the data distribution
shifts (e.g., the simulator domain randomization becomes sufficiently
broad) or once the model crosses a capacity threshold that enables
qualitatively new behaviors (e.g., long-horizon credit assignment begins
to succeed). A parsimonious extension is a piecewise model
a1Deff−α11{Deff ≤ D0} + a2Deff−α21{Deff > D0},
with continuity at D0, and similarly for
N. We do not adopt this as the
default because it complicates identifiability and can overfit sparse
sweeps; rather, we use posterior predictive checks to detect systematic
curvature in residuals versus log Deff or log N.
Second, saturation and ceiling effects: success rates may saturate due to deterministic benchmark structure or evaluation artifacts, in which case the apparent α and β shrink toward 0 at high resources. The offset E partially models this, but if saturation occurs sharply the additive floor is insufficient. In such cases we treat the scaling form as locally valid below the ceiling and restrict allocation recommendations to budgets that remain in that regime.
The end product of this section is an estimated parameter vector θ̂ = (â, b̂, Ê, α̂, β̂, ρ̂), or more usefully a posterior over θ under the hierarchical model. In the allocation analysis that follows, we will treat these parameters as defining a predictive objective Err(N, Ds, Dr) and derive compute-optimal allocations under the budget constraint, with explicit sensitivity to uncertainty in (α, β, ρ).
We now study the budgeted allocation problem induced by the scaling model (H1) under the cost model (H2). Our goal is to characterize, and in suitable regimes explicitly compute, an allocation (N, Ds, Dr) with Cost(N, Ds, Dr) ≤ B that minimizes Err(N, Ds, Dr). Throughout this section we treat the scaling parameters as fixed; uncertainty and its propagation are deferred to the subsequent robustness analysis.
The discrete problem is
minN ∈ 𝒩, Ds, Dr ∈ ℤ ≥ 0 a(Ds + ρDr)−α + bN−β + E s.t. κN(Ds + Dr) + csDs + crDr ≤ B.
Since the objective depends on data only through Deff := Ds + ρDr,
it is natural to separate (i) choosing the pair (N, Deff) from
(ii) choosing the cheapest (Ds, Dr)
that achieves Deff
at the selected N. Fix N and Deff. Writing Ds = Deff − ρDr
with Dr ∈ [0, Deff/ρ],
the cost becomes an affine function of Dr:
Cost = κN(Deff − (ρ − 1)Dr) + cs(Deff − ρDr) + crDr = (κN + cs)Deff + (cr − ρcs − (ρ − 1)κN)Dr.
Hence the minimum-cost split occurs at an endpoint. Concretely, the
cheapest way to purchase one unit of effective data is
$$
\tilde c(N)\;:=\;\min\Bigl\{\kappa N+c_s,\;\frac{\kappa
N+c_r}{\rho}\Bigr\},
$$
corresponding respectively to all-sim (Dr = 0) or
all-real (Ds = 0). We
therefore reduce the continuous relaxation to the two-variable
program
$$
\min_{N>0,\;D_{\mathrm{eff}}\ge 0}\;\; a\,D_{\mathrm{eff}}^{-\alpha} + b\,N^{-\beta} + E
\qquad\text{s.t.}\qquad D_{\mathrm{eff}}\,\tilde c(N)\;\le\;B,
$$
and then recover (Ds, Dr)
by the threshold rule implied by the minimizer of c̃(N). The boundary case
(κN + cr)/ρ = (κN + cs)
admits any convex combination achieving Deff at equal cost; in
discrete settings we may then choose the split that best matches
operational constraints (e.g. minimum required real coverage).
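The threshold rule can be stated in a few lines; the sketch below (with assumed coefficient values and hypothetical helper names) returns the cost-minimizing endpoint split for a requested effective data volume at a fixed N.

```python
# Minimal sketch of the endpoint (threshold) rule: purchase effective data from
# whichever modality is cheaper per effective unit at the chosen N.
# All coefficient values below are assumptions for illustration.

def effective_unit_costs(N, kappa, c_s, c_r, rho):
    cost_sim  = kappa * N + c_s          # one sim episode buys 1 effective unit
    cost_real = (kappa * N + c_r) / rho  # one real episode buys rho effective units
    return cost_sim, cost_real

def split(D_eff, N, kappa, c_s, c_r, rho):
    cost_sim, cost_real = effective_unit_costs(N, kappa, c_s, c_r, rho)
    if cost_real < cost_sim:             # all-real endpoint
        return 0.0, D_eff / rho
    return D_eff, 0.0                    # all-sim endpoint (ties resolved to sim)

print(split(D_eff=1e5, N=2e8, kappa=3e-10, c_s=0.02, c_r=15.0, rho=4.0))
```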
Consider the interior optimum of this program where the budget constraint is
tight. Let λ > 0 denote the
KKT multiplier for Deffc̃(N) ≤ B.
The Lagrangian is
ℒ(N, Deff, λ) = aDeff−α + bN−β + E + λ(Deffc̃(N) − B).
Stationarity yields
$$
a\alpha\,D_{\mathrm{eff}}^{-\alpha-1} \;=\; \lambda\,\tilde c(N),
\qquad\qquad
b\beta\,N^{-\beta-1} \;=\; \lambda\,D_{\mathrm{eff}}\,\tilde c'(N),
$$
together with feasibility and complementary slackness λ(Deffc̃(N) − B) = 0.
In regimes where c̃ is
differentiable (or piecewise differentiable with the optimum away from
the kink), these stationarity conditions characterize the unique continuous optimum.
The simplest closed form occurs when data-generation costs are
negligible, cs = cr = 0,
and we ignore the distinction between D and Deff at the level of
constants (e.g. when ρ = 1 or
when we directly control effective coverage). Then c̃(N) = κN
and the constraint reads κNDeff ≤ B.
Solving the stationarity conditions with Deff = B/(κN)
yields the familiar ``Chinchilla'' balance:
Deff* ∝ (N*)β/α, N* ∝ Bα/(α + β), Deff* ∝ Bβ/(α + β),
and the excess error above the floor scales as Err*(B) − E = Θ(B−αβ/(α + β)).
The key interpretation is that at the compute-optimal point, marginal
returns per unit budget from scaling data and scaling model size are
equalized; the power-law exponents determine the allocation ratio.
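For reference, this closed form can be computed directly; the sketch below derives N* from the stationarity condition with D = B/(κN), and the constants are purely illustrative assumptions.

```python
# Sketch of the negligible-data-cost ("Chinchilla"-style) closed form:
# minimize a*D**(-alpha) + b*N**(-beta) subject to kappa*N*D <= B.
def chinchilla_allocation(B, kappa, a, b, alpha, beta):
    # Stationarity with D = B/(kappa*N) gives
    # N**(alpha+beta) = (b*beta)/(a*alpha) * (B/kappa)**alpha.
    N_star = ((b * beta) / (a * alpha)) ** (1.0 / (alpha + beta)) \
             * (B / kappa) ** (alpha / (alpha + beta))
    D_star = B / (kappa * N_star)
    return N_star, D_star

N_star, D_star = chinchilla_allocation(B=1e20, kappa=6.0, a=400.0, b=400.0,
                                       alpha=0.3, beta=0.3)   # illustrative values
print(f"N* ~ {N_star:.3e}, D_eff* ~ {D_star:.3e}")
```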
Two further limiting regimes are immediate. If bN−β is already negligible over feasible N ∈ 𝒩, then the optimum places essentially all budget into effective data: Deff* ≈ B/c̃(Nmax) at the largest deployable N. Conversely, if data is plentiful and the data term is small, the objective is dominated by bN−β and the solution pushes N upward subject to the compute term in c̃(N); in that case the optimal Deff is the minimal amount required by feasibility constraints (if any) or by stability of training.
Although the constraint is non-linear in the original variables, the program is well
behaved after a standard change of coordinates. Let x = log N and y = log Deff.
The objective becomes
f(x, y) = ae−αy + be−βx + E,
which is convex in (x, y) since it is a
nonnegative weighted sum of convex exponentials plus a constant. The
constraint becomes
y + log c̃(ex) ≤ log B.
When c̃(N) is of the
form min {u1N + v1, u2N + v2}
with ui, vi > 0,
the function log c̃(ex)
is the pointwise minimum of two convex functions log (uiex + vi);
while a pointwise minimum of convex functions need not be convex
globally, the feasible set remains a union of two convex sets
corresponding to the two modalities (all-sim or all-real). Consequently
we may solve two convex programs,
$$
\min f(x,y)\ \ \text{s.t.}\ \ y+\log(\kappa e^x+c_s)\le\log B
\quad\text{and}\quad
\min f(x,y)\ \ \text{s.t.}\ \ y+\log\Bigl(\tfrac{\kappa
e^x+c_r}{\rho}\Bigr)\le\log B,
$$
and then take the better solution. This recovers the same threshold rule
as the endpoint argument above, while ensuring polynomial-time
solvability of each branch by standard interior-point methods.
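A minimal sketch of the two-branch solve is given below, using a generic smooth constrained solver (scipy's SLSQP) rather than a dedicated interior-point code; all parameter values and the helper name solve_branch are assumptions.

```python
# Sketch: solve each modality branch as a smooth constrained problem in
# log-coordinates and keep the better solution. Parameter values are
# illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

a, b, E, alpha, beta, rho = 5.0, 50.0, 0.05, 0.3, 0.35, 4.0
kappa, c_s, c_r, B = 3e-10, 0.02, 15.0, 50_000.0

def objective(z):
    x, y = z                          # x = log N, y = log D_eff
    return a * np.exp(-alpha * y) + b * np.exp(-beta * x) + E

def solve_branch(unit_cost):          # unit_cost(N) = cost of one effective unit
    cons = {"type": "ineq",
            "fun": lambda z: np.log(B) - z[1] - np.log(unit_cost(np.exp(z[0])))}
    return minimize(objective, x0=np.array([np.log(1e8), np.log(1e5)]),
                    constraints=[cons], method="SLSQP")

branches = {
    "all_sim":  solve_branch(lambda N: kappa * N + c_s),
    "all_real": solve_branch(lambda N: (kappa * N + c_r) / rho),
}
name, res = min(branches.items(), key=lambda kv: kv[1].fun)
print(name, np.exp(res.x))            # (N*, D_eff*) on the better branch
```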
We finally return to discrete feasibility: N ∈ 𝒩 and Ds, Dr ∈ ℤ ≥ 0. We implement rounding as a projection that preserves the budget constraint.
First, we compute a continuous candidate (N∘, Deff∘)
by solving the appropriate convex branch(es). Second, we choose N by rounding N∘ to nearby grid points
in 𝒩 (typically the two nearest
values), and for each candidate N we set the largest feasible
effective data
$$
D_{\mathrm{eff}}(N)\;=\;\frac{B}{\tilde c(N)},
$$
or its floored integer counterpart after converting to (Ds, Dr)
as below. Third, given (N, Deff) we
choose the cost-minimizing modality endpoint: if (κN + cr)/ρ ≤ κN + cs
we set Ds = 0 and Dr = ⌊Deff/ρ⌋;
otherwise we set Dr = 0 and Ds = ⌊Deff⌋.
Fourth, because flooring can create slack budget, we optionally spend
remaining budget by incrementing Ds or Dr greedily
according to the currently cheaper effective cost per unit, while
maintaining Cost ≤ B. This
procedure maintains invariants: feasibility is preserved at every step,
and objective degradation relative to the continuous optimum vanishes as
the grids in 𝒩 and episode counts
become fine compared to the scale of the optimum.
In summary, the scaling-law structure reduces allocation to a small convex optimization plus a one-dimensional modality decision, and the discrete implementation amounts to rounding followed by a budget-respecting projection.
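A compact sketch of this projection follows; the greedy slack-spending step is omitted, and the grid, coefficients, and predicted-error model are assumptions used only for illustration.

```python
# Sketch of the discrete projection: round N to the two nearest grid values,
# spend the budget at the cheaper modality endpoint, floor the episode count,
# and keep the candidate with the lower predicted error.
import math

def project(N_cont, grid, B, kappa, c_s, c_r, rho, predicted_err):
    """predicted_err(N, D_eff) -> model error; used to compare rounded candidates."""
    candidates = []
    for N in sorted(grid, key=lambda n: abs(math.log(n / N_cont)))[:2]:
        if (kappa * N + c_r) / rho <= kappa * N + c_s:      # all-real endpoint
            D_s, D_r = 0, math.floor(B / (kappa * N + c_r))
        else:                                               # all-sim endpoint
            D_s, D_r = math.floor(B / (kappa * N + c_s)), 0
        candidates.append((N, D_s, D_r))
    return min(candidates, key=lambda c: predicted_err(c[0], c[1] + rho * c[2]))

err = lambda N, D_eff: 5.0 * D_eff**-0.3 + 50.0 * N**-0.35 + 0.05  # assumed fit
print(project(1.3e8, grid=[5e7, 1e8, 2e8, 4e8], B=50_000.0,
              kappa=3e-10, c_s=0.02, c_r=15.0, rho=4.0, predicted_err=err))
```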
We now quantify how estimation errors in θ̂ = (â, b̂, Ê, α̂, β̂, ρ̂) propagate through the allocation map θ ↦ (N*, Ds*, Dr*) and into the achieved downstream error. The guiding point is that our optimizer is a smooth (indeed, piecewise-smooth) transformation of θ once we work in log-coordinates and stay away from degeneracies (active-set changes and boundary optima). This permits standard sensitivity analysis, yielding stability bounds of the form stated in Theorem~4.
Consider first the relaxed two-variable program in (N, Deff) on a
fixed modality branch (all-sim or all-real), where c̃(N) is replaced by an
explicit affine function, say c̃s(N) = κN + cs
or c̃r(N) = (κN + cr)/ρ.
With x = log N, y = log Deff, we
may write the constrained problem as
$$
\min_{x,y}\;\; a\,e^{-\alpha y} + b\,e^{-\beta x} + E
\qquad\text{s.t.}\qquad y + \log \tilde c_{s}(e^{x}) \le \log B
\quad\text{or}\quad y + \log \tilde c_{r}(e^{x}) \le \log B.
$$
On each branch, the feasible set is convex and the objective is convex
in (x, y), so the
optimizer (x*(θ), y*(θ))
is well defined whenever the optimum is unique. Moreover, if the optimum
is interior to the branch (i.e. the constraint is active and we are not
at a kink where the preferred modality changes), then the KKT system is
differentiable in θ. Denoting
by F(x, y, λ; θ) = 0
the stationarity equations plus complementary slackness with active
constraint, the implicit function theorem yields a local Lipschitz
dependence of (x*, y*, λ*)
on θ, with Lipschitz constant
controlled by the inverse Jacobian ∂F/∂(x, y, λ).
In particular, away from boundary regimes where either the data term or
the model term vanishes, the Hessian of the Lagrangian is well
conditioned in log-coordinates, and we obtain
$$
\bigl\|\bigl(x^*(\hat\theta),y^*(\hat\theta)\bigr)-\bigl(x^*(\theta),y^*(\theta)\bigr)\bigr\|
\;\le\; L(\theta,B)\,\bigl\|\hat\theta-\theta\bigr\|
$$
for some L(θ, B) that grows
at most polylogarithmically in B in the regimes of interest
(precisely because x*, y*
themselves scale like Θ(log B)).
To make the dependence transparent, consider the compute-dominated
special case cs = cr = 0
and treat Deff as
directly purchasable at cost κNDeff ≤ B.
The closed-form relations are
$$
\log N^* \;=\; \frac{\alpha}{\alpha+\beta}\log B + O(1),\qquad
\log D_{\mathrm{eff}}^* \;=\; \frac{\beta}{\alpha+\beta}\log B + O(1).
$$
Differentiating with respect to α, β shows that a
perturbation |α̂ − α| ≤ ε, |β̂ − β| ≤ ε
induces an O(εlog B)
perturbation in log N* and log Deff*,
hence a multiplicative factor exp (O(εlog B)) = BO(ε)
in N* and Deff*.
Substituting into the power law yields the corresponding error
inflation:
$$
\mathrm{Err}(\hat N^*,\hat D_{\mathrm{eff}}^*) - E
\;\le\; B^{O(\varepsilon)}\,\bigl(\mathrm{Err}(N^*,D_{\mathrm{eff}}^*) - E\bigr)
\;=\;\bigl(1+O(\varepsilon\log B)\bigr)\bigl(\mathrm{Err}(N^*,D_{\mathrm{eff}}^*) - E\bigr),
$$
up to lower-order additive effects from errors in â, b̂, Ê. This
recovers the qualitative content of Theorem~4: while the optimizer
depends on exponents through log B, the induced degradation in
achieved error is only linear in εlog B for small ε.
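A quick numerical check of this effect, in the compute-dominated closed form and with invented constants, plugs perturbed exponents into the allocation and evaluates the result under the true exponents.

```python
# Sketch: quantify how a small exponent misestimate degrades achieved error in
# the compute-dominated closed form. All values are illustrative assumptions.
def allocate(B, kappa, a, b, alpha, beta):
    N = ((b * beta) / (a * alpha)) ** (1 / (alpha + beta)) \
        * (B / kappa) ** (alpha / (alpha + beta))
    return N, B / (kappa * N)

def err(N, D, a, b, alpha, beta, E=0.0):
    return a * D**(-alpha) + b * N**(-beta) + E

true = dict(a=400.0, b=400.0, alpha=0.30, beta=0.30)
est  = dict(a=400.0, b=400.0, alpha=0.33, beta=0.27)   # eps ~ 0.03 on each exponent
B, kappa = 1e20, 6.0

N_opt, D_opt = allocate(B, kappa, **true)
N_hat, D_hat = allocate(B, kappa, **est)
print(err(N_hat, D_hat, **true) / err(N_opt, D_opt, **true))  # multiplicative inflation
```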
The dependence on ρ is
structurally different: ρ
affects only the conversion between Dr and Deff, and it affects the
effective cost of real data through (κN + cr)/ρ.
The endpoint rule implies that the modality decision changes only when
the sign of
$$
g(N;\rho)\;:=\; \frac{\kappa N+c_r}{\rho}-(\kappa N+c_s)
$$
changes. Thus, if at the true optimum (N*, Deff*)
we have a margin |g(N*; ρ)| ≥ m > 0,
then any ρ̂ satisfying |ρ̂ − ρ| ≤ η with
η ≲ m ρ2/(κN* + cr)
preserves the modality choice, and the only effect of ρ̂ is a smooth rescaling of the
realized (Ds, Dr)
after the split. Conversely, when g(N*; ρ) ≈ 0,
the two modalities are nearly cost-equivalent; in that boundary case,
even if a small estimation error flips the decision, the cost penalty is
second-order (because the two branches coincide to first order). This is
precisely the regime in which we may safely incorporate operational
constraints (e.g. minimum real-world coverage) without materially
affecting optimality.
In practice we have not only deterministic bounds on θ̂ − θ but also a posterior (or approximate sampling distribution) from the regression stage. We therefore report uncertainty in two layers: (i) uncertainty in predicted error at a given allocation, and (ii) uncertainty induced by optimizing under uncertain parameters.
For (i), conditional on an allocation (N, Ds, Dr),
the mapping θ ↦ Err(N, Ds, Dr)
is smooth, and a delta-method approximation gives
Var[Err(N, Ds, Dr) ∣ data] ≈ ∇θErr⊤ Cov(θ) ∇θErr,
where ∇θErr is
evaluated at a posterior mean (or MAP). This yields a simple approximate
(1 − δ)-interval
$$
\widehat{\mathrm{Err}}\pm
z_{1-\delta/2}\sqrt{\widehat{\mathrm{Var}}(\mathrm{Err})},
$$
with the understanding that heavy-tailed posteriors for exponents are
better handled by posterior sampling.
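A sketch of this delta-method interval, using a numerical gradient and an assumed (diagonal) posterior covariance, is given below; every value shown is illustrative.

```python
# Sketch of the delta-method interval for predicted error at a fixed allocation.
import numpy as np

def err(theta, N, D_s, D_r):
    a, b, E, alpha, beta, rho = theta
    D_eff = D_s + rho * D_r
    return a * D_eff**(-alpha) + b * N**(-beta) + E

def delta_interval(theta_hat, cov, N, D_s, D_r, z=1.96, h=1e-6):
    grad = np.zeros(len(theta_hat))
    for i in range(len(theta_hat)):
        e = np.zeros(len(theta_hat)); e[i] = h
        grad[i] = (err(theta_hat + e, N, D_s, D_r)
                   - err(theta_hat - e, N, D_s, D_r)) / (2 * h)
    mean = err(theta_hat, N, D_s, D_r)
    sd = np.sqrt(grad @ cov @ grad)
    return mean - z * sd, mean + z * sd

theta_hat = np.array([5.0, 50.0, 0.05, 0.30, 0.35, 4.0])   # (a, b, E, alpha, beta, rho)
cov = np.diag([0.2, 2.0, 1e-4, 1e-3, 1e-3, 0.25])          # assumed posterior covariance
print(delta_interval(theta_hat, cov, N=2e8, D_s=1e5, D_r=2e3))
```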
For (ii), we generate posterior draws θ(m), compute the corresponding optimizer (N*(m), Ds*(m), Dr*(m)), and evaluate either the plug-in error Err(N*(m), Ds*(m), Dr*(m); θ(m)) or, more conservatively, Err(N*(m), Ds*(m), Dr*(m); θtrue) approximated by held-out evaluations. The resulting empirical quantiles provide credible intervals for achievable performance under allocation uncertainty. When one desires a one-shot robust decision, we may instead choose the allocation minimizing a risk-averse criterion such as the posterior (1 − δ)-quantile of Err, which directly trades mean performance against robustness to exponent misestimation.
Combining the local Lipschitz property with the smoothness of Err in (x, y) yields the following
practical consequence. Suppose |α̂ − α|, |β̂ − β|, and the relative
errors in â, b̂, ρ̂ are all at
most ε, and assume we are away
from degeneracies (unique interior optimum on one branch, or
near-equality of branches). Then, for sufficiently small ε, the allocation computed under
θ̂ satisfies
Err(N̂, D̂s, D̂r) ≤ (1 + O(εlog B)) Err(N*, Ds*, Dr*) + O(ε),
where the O(ε) term
accounts for additive floors and constant-factor misspecification. This
justifies allocating only a modest fraction of budget to exponent
estimation: once ε is driven
to the point where εlog B ≪ 1, further
improvements in exponent accuracy have diminishing returns compared to
simply spending on N and Deff. The remaining
question is how large a pilot sweep is required to reach this regime,
which we address next via sample complexity and experiment design.
We turn to the question left implicit so far: given budget B, how much of it must be spent on pilot sweeps in order to estimate the shared exponents (α, β) and the real-data multiplier ρ to a tolerance that makes the downstream allocation reliable. The central object is the statistical efficiency of our scaling experiment design, since θ̂ is obtained from finitely many training-and-evaluation runs on a benchmark family.
A single scaling point consists of training at (N, Ds, Dr)
and evaluating on tasks t ∈ {1, …, T} using M evaluation episodes per task (or
per task-seed pair). Writing pt(N, Ds, Dr) = 1 − Errt(N, Ds, Dr)
for the success probability, a standard model is
$$
S_{t}\;\sim\;\mathrm{Binomial}(M,p_t),
\qquad
\widehat{\mathrm{Err}}_t\;=\;1-\frac{S_t}{M},
$$
possibly augmented with an additional task/seed dispersion term. In any
case, the conditional variance satisfies $\mathrm{Var}(\widehat{\mathrm{Err}}_t\mid p_t)\le
1/(4M)$. When we aggregate across T tasks via a hierarchical model
with task random effects (at, bt, Et)
and shared (α, β, ρ), the
effective noise for estimating the shared exponents decreases roughly
like 1/(TM) (up to a
multiplicative factor reflecting task heterogeneity). This immediately
yields a design principle: if training runs are expensive, it is often
cheaper to increase T and
M (evaluation) until
evaluation cost is negligible, thereby reducing posterior uncertainty in
(α, β, ρ)
without additional training cost.
Although our error model is additive in two power laws plus a
floor,
Err = aDeff−α + bN−β + E,
local sample complexity can be read off from a first-order
(Fisher-information) approximation. To make this explicit, suppose for
the moment that we operate in a regime where E is known or negligible relative to
the non-floor error, and we consider a branch where Deff is directly
parameterized (all-sim or all-real). If we hold N fixed and sweep Deff over a log-range
RD := log (Dmax/Dmin),
then locally ∂Err/∂α scales
like aDeff−αlog Deff,
hence the information about α
grows with the dispersion of log Deff. A crude but
useful proxy is the standard linear-regression formula: for KD distinct
data-scale points with homoscedastic error variance σ2,
$$
\mathrm{Var}(\hat\alpha)\;\approx\;\frac{\sigma^{2}}{\sum_{k}\bigl(\log D_{\mathrm{eff},k}-\overline{\log D_{\mathrm{eff}}}\bigr)^{2}}
\;\approx\;\frac{12\,\sigma^{2}}{K_D\,R_D^{2}},
$$
where the second approximation assumes points roughly uniformly spaced
in log Deff. An
analogous estimate holds for β
with RN := log (Nmax/Nmin)
and KN
distinct model-scale points. Thus, for a target sd(α̂) ≤ ε we require
$$
K_D\;\gtrsim\;\frac{12\sigma^2}{\varepsilon^2 R_D^2},
\qquad
K_N\;\gtrsim\;\frac{12\sigma^2}{\varepsilon^2 R_N^2}.
$$
While this proxy ignores the additive two-term structure, it correctly captures the
three levers we control: (i) increase the number of scaling points K, (ii) increase the log-range of
the sweep, and (iii) reduce σ2 by more tasks and more
evaluation episodes.
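As a worked example of this sample-size proxy (with an assumed noise level and target accuracy; the helper name required_points is hypothetical):

```python
# Sketch: plug-in sample-size estimate from the linear-regression proxy,
# K >= 12*sigma^2 / (eps^2 * R^2), for points spread uniformly in log-scale.
import math

def required_points(sigma, eps, log_range):
    return math.ceil(12 * sigma**2 / (eps**2 * log_range**2))

R_D = math.log(1e6 / 1e4)                      # two decades of data scale
print(required_points(sigma=0.05, eps=0.02, log_range=R_D))
```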
The additive structure implies an additional, practically important
constraint: α and β are poorly
identified if one term dominates the other across the sweep. Concretely,
if aDeff−α ≫ bN−β
for all chosen points, then the likelihood is nearly invariant to β, and any estimate of β will be driven by prior
assumptions rather than data. Therefore, we should ensure that the sweep
includes a neighborhood of the ``balance curve''
$$
a\,D_{\mathrm{eff}}^{-\alpha}\;\approx\;b\,N^{-\beta},
$$
since there the gradients with respect to both α and β have comparable magnitude,
yielding high joint information. Operationally, we do not know α, β a priori, but even a
coarse initial sweep allows us to locate the approximate intersection
region, after which subsequent points can be concentrated near the
predicted optimum and near the balance curve.
The multiplier ρ enters
only through Deff = Ds + ρDr
and through the real-data cost comparison. If we never train on real
data (Dr = 0 always),
then ρ is not identifiable; if
we always train on real data only (Ds = 0 always),
then ρ is confounded with
a. Hence we must include
mixed-modality points. A simple and efficient pattern is a design at
fixed (N, D) (total
episodes) in which we train one run with (Ds = D, Dr = 0)
and another with (Ds = 0, Dr = D).
Under the model, the difference in non-floor error is
approximately
ΔErr ≈ a(D−α − (ρD)−α) = aD−α(1 − ρ−α),
which is informative about ρ
when aD−α is
not too small (i.e. away from the floor) and when ρ is not extremely close to 1. In view of the threshold rule for the
modality split, we additionally want to know whether we are near the
indifference boundary g(N; ρ) = 0;
accordingly, a practical goal is not sd(ρ̂) ≤ ε in absolute
terms, but rather sd(g(N; ρ̂)) small
compared to the margin |g(N; ρ)| at the
candidate optimum. This aligns the experiment design with the decision
it must support.
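Under the model, such a paired design admits a direct plug-in estimate of ρ; the sketch below simply inverts the displayed relation, assuming the non-floor data term aD^(-α) is available from the fit and the pair is evaluated away from the floor (all numbers invented).

```python
# Sketch: back out rho from a paired sim-vs-real design at the same (N, D),
# using Delta_Err ~= a*D**(-alpha) * (1 - rho**(-alpha)). Inputs are assumed.
def rho_from_pair(err_sim, err_real, a, alpha, D):
    base = a * D**(-alpha)                       # non-floor data term; must not be ~0
    ratio = 1.0 - (err_sim - err_real) / base    # = rho**(-alpha)
    return ratio ** (-1.0 / alpha)

print(rho_from_pair(err_sim=0.48, err_real=0.45, a=5.0, alpha=0.3, D=1e5))
```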
These considerations lead to a sequential sweep that is more budget-efficient than a static grid.
We choose K0 points
that span (log N, log D) over the
feasible engineering range (e.g. a 3 × 3 or 4 × 4 factorial design), and we include at
least two paired sim-vs-real points to seed identification of ρ. We allocate evaluation budget so
that binomial noise is small compared to between-point differences,
e.g. choose TM so
that $1/\sqrt{TM}\ll$ the anticipated
error drop across adjacent points.
We fit the hierarchical model and compute a posterior over (α, β, ρ). For
selecting the next scaling point, we evaluate an approximate expected
information gain criterion, such as maximizing the determinant of the
Fisher information for (α, β, ρ) under
the current posterior (a Bayesian D-optimal rule), subject to the
remaining budget and feasibility constraints. A simpler surrogate is to
sample near (i) the predicted balance curve, and (ii) the predicted
optimizer under the posterior mean, since these points are
simultaneously informative and decision-relevant.
Rather than targeting exponent accuracy in isolation, we stop when
uncertainty in the optimal allocation is small, e.g. when posterior
draws of (α, β, ρ) imply
that log N* and
log Deff* have
standard deviation at most τ
(for a user-chosen τ), or
equivalently when εlog B is empirically small
in the sense of the robustness bound. This makes explicit the
diminishing-returns phenomenon: once the posterior uncertainty is such
that alternative allocations predicted by the posterior are
near-indifferent in achieved error, further sweeps are dominated by
spending the remaining budget on training at the chosen scale.
In aggregate, the pilot sample complexity is controlled by three quantities: the number of distinct training runs K (expensive), the total evaluation mass TM (comparatively cheap), and the log-ranges RD, RN (engineering-limited but crucial). Sequentially concentrating points near the balance region and near the decision boundary for the sim-vs-real split yields exponent estimates accurate enough for allocation with a small pilot fraction of B. The next section clarifies why such parametric structure is not merely convenient: without it, even formulating an efficient allocation rule is computationally intractable in the worst case.
The preceding allocation rule relies on the parametric scaling assumption (H1), which converts budget allocation into a low-dimensional, well-behaved optimization problem. We now make precise why some such structure is not merely aesthetically convenient: if we remove (H1) and only assume that error improves monotonically with additional resources, then the allocation problem becomes computationally intractable in the worst case. The correct interpretation is not that practitioners ``should not'' do allocation without power laws, but rather that any actionable prescription must implicitly exploit additional regularity (parametric form, convexity, smoothness, submodularity, etc.); otherwise no efficient algorithm can be guaranteed.
Fix a finite set of feasible model sizes 𝒩 and consider the decision variables (N, Ds, Dr), with N ∈ 𝒩 and Ds, Dr ∈ ℤ ≥ 0, under the linear budget constraint κN(Ds + Dr) + csDs + crDr ≤ B. Suppose that the downstream error Err(N, Ds, Dr) is assumed to be coordinate-wise non-increasing in each argument (more model capacity and more data cannot worsen performance). Further assume that Err is presented by a value oracle: given (N, Ds, Dr), we can train/evaluate and obtain Err(N, Ds, Dr) (or a sufficiently accurate estimate thereof). This formalizes the strongest ``black-box'' abstraction one might hope to use when no parametric scaling law is trusted.
Consider the decision problem: given (B, κ, cs, cr) and a threshold η, determine whether there exists a feasible allocation with Err(N, Ds, Dr) ≤ η. Even when N is fixed (so only (Ds, Dr) remain), and even when we allow Err to take only finitely many values, this decision problem is NP-hard. The core reason is that monotonicity alone permits Err to encode arbitrary combinatorial ``step improvements’’ that behave like selecting items under a knapsack constraint.
We outline a reduction that captures the essence of Theorem~5 in the global context. Let a 0–1 knapsack instance be given by item weights wi ∈ ℤ > 0, item values vi ∈ ℤ > 0, a capacity W, and a target value V. The question is whether there exists a subset S ⊆ {1, …, n} such that ∑i ∈ Swi ≤ W and ∑i ∈ Svi ≥ V.
We construct an allocation instance with fixed N and a single data modality for
simplicity; take Dr ≡ 0, κ = 0, cs = 1, and set
the budget B := W.
Thus feasibility is simply Ds ≤ W.
The only remaining task is to define a monotone non-increasing error
function of Ds that encodes
the knapsack objective. To do so, we introduce data increments that can
be purchased only in certain bundles: for each subset S define a special dataset
size
DS := ∑i ∈ Swi.
Define Err(Ds) to be a
step function that attains a low value if and only if Ds equals (or
exceeds) some DS whose
corresponding value is large. Concretely, set
$$
\mathrm{Err}(D_s) := 1 - \max\Bigl\{ \frac{1}{C}\sum_{i\in S} v_i \,:\,
D_S \le D_s \Bigr\},
$$
where $C:=\sum_{i=1}^n v_i$ is a
normalizing constant ensuring Err ∈ [0, 1], and the maximum over an empty
set is defined as 0. This function is
monotone non-increasing in Ds by
construction. Moreover, there exists Ds ≤ W
with Err(Ds) ≤ 1 − V/C
if and only if there exists a subset S with DS ≤ W
and ∑i ∈ Svi ≥ V,
which is exactly the knapsack decision problem. Hence deciding
feasibility under an error threshold is NP-hard.
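For concreteness, the monotone step function used in this reduction can be written out for a toy instance; the brute-force enumeration below is only meant to exhibit the oracle, not to solve the instance efficiently, and the example weights and values are arbitrary.

```python
# Sketch of the reduction's error function for a tiny knapsack instance:
# Err(D_s) = 1 - max{ sum(v_i for i in S)/C : sum(w_i for i in S) <= D_s }.
from itertools import combinations

def err_from_knapsack(weights, values, D_s):
    C = sum(values)
    best = 0
    items = range(len(weights))
    for k in range(len(weights) + 1):
        for S in combinations(items, k):
            if sum(weights[i] for i in S) <= D_s:
                best = max(best, sum(values[i] for i in S))
    return 1.0 - best / C

w, v = [3, 4, 5], [2, 3, 4]
print([err_from_knapsack(w, v, d) for d in range(13)])  # monotone non-increasing steps
```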
The same construction can be embedded in the original (Ds, Dr) formulation with linear costs by assigning separate costs to the two modalities and forcing all effective purchases to occur in, say, the real-data coordinate; fixing N removes any dependence on model size. The conclusion is that hardness is not an artifact of model-selection: it already appears in the data-allocation subproblem.
The above reduction uses an error function with discontinuous steps, which might seem unrealistic. However, the point is not that real learning curves are adversarial, but that without additional assumptions one cannot preclude adversarial instances. In particular, any algorithm that claims to output a near-optimal allocation for monotone error functions would imply P = NP, even if it is allowed to adaptively query the oracle. Moreover, simple approximation guarantees are also blocked in the worst case: by making the steps sufficiently sharp (or by introducing plateaus separated by narrow transition regions), one can force any polynomial number of oracle queries to be uninformative about where the next improvement occurs. Thus, absent structure, the sample complexity of exploration and the computational complexity of optimization are coupled in an unfavorable way.
To escape this impossibility, we must restrict the function class. The scaling-law hypothesis (H1) is one such restriction, and it is particularly convenient because it yields (i) identifiability from a modest number of scaling points, (ii) a convex (after change-of-variables) allocation problem with global optima (Theorem~3), and (iii) robustness of the optimizer to small parameter errors (Theorem~4). More broadly, any allocation theory with guarantees must assume some combination of such properties: identifiability from limited sweeps, tractable optimization, and robustness to estimation error. Power laws are not uniquely privileged, but they instantiate all three properties in a form that is empirically plausible for many families of representation-learning systems and, crucially, admits transparent budget trade-offs.
We therefore treat (H1) as an algorithmic prior: it is the minimal structural hypothesis under which we can both (i) estimate the relevant quantities from pilot sweeps and (ii) compute a recommended allocation with a correctness story. This viewpoint also sharpens what it means to validate the approach experimentally. It is not enough to report that larger N or larger Deff helps; rather, we must verify that within the operating range the measured errors are consistent with a model whose induced optimizer is stable. If the observed learning curves substantially violate the assumed smooth trade-off (e.g. exhibit abrupt regime changes not captured by a single pair (α, β)), then the hardness discussion predicts that allocation will be intrinsically fragile, and any claimed optimality should be regarded as heuristic.
In the next section we therefore specify an experimental protocol whose purpose is implementation-strengthening: it standardizes the benchmark suite and the accounting of Cost, fits the hierarchical scaling model on controlled grids, and compares the resulting allocations against simple baselines under equal budget, thereby testing whether the structural assumption required to avoid worst-case hardness is empirically justified.
We now specify an experimental protocol whose purpose is not to discover a new learning algorithm, but to make the allocation theory operational and falsifiable under controlled accounting. The protocol is designed to (i) produce a dataset of noisy observations $\widehat{\mathrm{Err}}(N,D_s,D_r)$ over a budget-feasible grid, (ii) fit the hierarchical scaling model in a manner that separates shared exponents from task idiosyncrasies, and (iii) evaluate whether the resulting optimizer yields improvements over simple, widely used heuristics at matched total cost.
We fix a benchmark family T consisting of tasks indexed by t ∈ {1, …, |T|}. Each task t is given by a procedural generator Gt(ω) producing initial states, goal specifications, and nuisance variation (textures, lighting, object instances, layouts) from a seed ω. We require that both simulation and real-world episodes admit a common episodic interface: an episode is a finite horizon trajectory with a standardized observation and action space, and success is a Boolean event measurable at termination. We report error as Err := 1 − SR, where SR is success rate averaged over a fixed number of evaluation seeds per task. The only role of this standardization is to make the unit ``episode’’ comparable across data sources so that (Ds, Dr) and the accounting in Cost are meaningful.
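To make the common episodic interface concrete, the following is a minimal Python sketch; the class and field names (TaskGenerator, EpisodeSpec, EpisodeOutcome) are illustrative placeholders rather than a prescribed API.

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class EpisodeSpec:
    initial_state: Any   # sampled initial state
    goal: Any            # goal specification
    nuisance: dict       # textures, lighting, object instances, layout, ...


class TaskGenerator(Protocol):
    """G_t(omega): maps a procedural seed to an episode specification."""
    def __call__(self, omega: int) -> EpisodeSpec: ...


@dataclass
class EpisodeOutcome:
    task_id: int
    seed: int
    success: bool        # Err_t is 1 minus the mean of this flag over evaluation seeds
```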
Prior to any sweeps, we publish a budget B and coefficients (κ, cs, cr)
in a common cost unit. We treat κ as the conversion factor from
training compute to cost, where training compute is proportional to
N(Ds + Dr)
(with the proportionality fixed by optimizer settings and sequence
length/horizon). We measure cs as the
amortized simulator cost per episode (including physics stepping,
rendering if applicable, storage, and any domain randomization
overhead). We measure cr as the
amortized real episode cost (fleet time, operator time, resets,
maintenance, and expected wear). For transparency, we additionally log a
decomposed bill of materials for each run:
$$
\mathrm{Cost}=\underbrace{\kappa
N(D_s+D_r)}_{\text{train}}+\underbrace{c_s
D_s}_{\text{sim}}+\underbrace{c_r D_r}_{\text{real}},
$$
and we report each term separately. This decomposition is not used by
the optimizer beyond the linear model, but it is essential for
reproducing conclusions under alternative accounting choices.
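For concreteness, the decomposed bill of materials can be computed by a routine of the following form; the numerical coefficients in the usage line are invented for illustration and carry no empirical claim.

```python
def cost_breakdown(N, D_s, D_r, kappa, c_s, c_r):
    """Decomposed bill of materials for one run, in a common cost unit.

    Cost = kappa*N*(D_s + D_r)   (training compute)
         + c_s*D_s               (simulated episodes)
         + c_r*D_r               (real episodes)
    """
    train = kappa * N * (D_s + D_r)
    sim = c_s * D_s
    real = c_r * D_r
    return {"train": train, "sim": sim, "real": real, "total": train + sim + real}


# Illustrative, made-up coefficients: kappa in cost per parameter-episode,
# c_s and c_r in cost per episode.
print(cost_breakdown(N=5e7, D_s=2e5, D_r=1e4, kappa=1e-9, c_s=0.02, c_r=3.0))
```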
We choose a discrete model-size set 𝒩 (e.g. a logarithmic grid spanning a plausible deployment range) and a finite set of slice budgets. For each N ∈ 𝒩 and for each target effective-data level Deff on a logarithmic grid, we construct one or more allocations (Ds, Dr) that (a) achieve the desired Deff = Ds + ρ0Dr using a provisional ρ0 (e.g. ρ0 = 1 for design), and (b) satisfy Cost(N, Ds, Dr) ≤ Bslice for a designated slice budget Bslice ≤ B. In practice, because (Ds, Dr) are integers and because cs, cr may not permit exact equality, we target a narrow band Cost ∈ [(1 − ξ)Bslice, Bslice] for a small ξ (e.g. ξ = 0.02) to minimize wasted budget while maintaining feasibility. This construction ensures that comparisons across points reflect a controlled trade-off between model capacity and data, rather than confounding from different total spend.
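A minimal sketch of this grid construction, assuming a simple enumeration over mixture fractions, is given below; the fraction grid, rounding, and band check are illustrative design choices, not part of the protocol specification.

```python
def candidate_allocations(N, D_eff, B_slice, kappa, c_s, c_r,
                          rho0=1.0, fractions=(0.0, 0.25, 0.5, 0.75, 1.0), xi=0.02):
    """Enumerate (D_s, D_r) splits realizing D_eff = D_s + rho0*D_r whose total
    cost lands in the band [(1 - xi)*B_slice, B_slice]."""
    feasible = []
    for f in fractions:  # f = share of D_eff supplied (effectively) by real episodes
        D_r = int(round(f * D_eff / rho0))
        D_s = max(0, int(round(D_eff - rho0 * D_r)))
        total = kappa * N * (D_s + D_r) + c_s * D_s + c_r * D_r
        if (1.0 - xi) * B_slice <= total <= B_slice:
            feasible.append((D_s, D_r, total))
    return feasible
```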
At each grid point (N, Ds, Dr) we run multiple independent training seeds. Each trained policy is evaluated on a fixed evaluation set: for each task t, we sample m procedural seeds and compute $\widehat{\mathrm{SR}}_{t}$, hence $\widehat{\mathrm{Err}}_{t}=1-\widehat{\mathrm{SR}}_{t}$. We retain the binomial standard error proxy $\widehat{s}_{t}=\sqrt{\widehat{\mathrm{SR}}_{t}(1-\widehat{\mathrm{SR}}_{t})/m}$ (or a beta-binomial posterior interval if preferred). We also report aggregate error $\widehat{\mathrm{Err}}=\frac{1}{|T|}\sum_t \widehat{\mathrm{Err}}_{t}$ with uncertainty obtained by task-level bootstrap. The repetition count is chosen so that the posterior over exponents concentrates sufficiently to make allocation decisions stable; we operationalize this by requiring that the induced optimizer (Step~4 of the algorithm) changes by less than a preset tolerance when refitting with a held-out subset of seeds.
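The per-task statistics and the task-level bootstrap can be computed as in the following sketch; the bootstrap sample count and percentile interval are illustrative defaults.

```python
import numpy as np


def task_stats(successes, m):
    """Per-task success rate, error, and binomial standard-error proxy."""
    sr = successes / m
    return {"SR": sr, "Err": 1.0 - sr, "SE": np.sqrt(sr * (1.0 - sr) / m)}


def aggregate_error(task_errs, n_boot=10_000, seed=0):
    """Benchmark-level error with uncertainty from a task-level bootstrap."""
    rng = np.random.default_rng(seed)
    task_errs = np.asarray(task_errs, dtype=float)
    boots = rng.choice(task_errs, size=(n_boot, task_errs.size), replace=True).mean(axis=1)
    return task_errs.mean(), np.percentile(boots, [2.5, 97.5])
```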
Using all collected points, we fit a hierarchical model of the form
$$
\mathrm{Err}_t(N, D_s, D_r) = a_t\,(D_s + \rho D_r)^{-\alpha} + b_t\,N^{-\beta} + E_t,
$$
with shared $(\alpha, \beta, \rho)$ and task-specific $(a_t, b_t, E_t)$.
We implement inference with either Hamiltonian Monte Carlo or
variational approximations, but we require posterior predictive checks:
we draw replicated errors from the fitted model and verify that observed
error trends across each axis (N, Deff) are
within credible bands. We also perform held-out validation by
withholding an entire budget slice (or an entire model size) and testing
predictive accuracy; failure of this test is treated as evidence against
a single-regime power law over the swept range, in which case we either
restrict the operating region or fit a structured mixture
(e.g. piecewise exponents) and rerun the allocation.
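As a rough illustration of the fitting step, the sketch below computes a point estimate of the shared exponents and task-specific coefficients by nonlinear least squares; it is not a substitute for the full Bayesian treatment (HMC or VI with posterior predictive checks) described above, and the initialization values are arbitrary.

```python
import numpy as np
from scipy.optimize import least_squares


def fit_scaling(obs, n_tasks):
    """Point estimate of Err_t = a_t*(D_s + rho*D_r)^(-alpha) + b_t*N^(-beta) + E_t.

    obs: iterable of (t, N, D_s, D_r, err) tuples pooled over grid points and seeds.
    Shared (alpha, beta, rho) and task-specific (a_t, b_t, E_t) are fit jointly,
    parameterized in log space to enforce positivity.
    """
    t_idx = np.array([o[0] for o in obs], dtype=int)
    N = np.array([o[1] for o in obs], dtype=float)
    Ds = np.array([o[2] for o in obs], dtype=float)
    Dr = np.array([o[3] for o in obs], dtype=float)
    err = np.array([o[4] for o in obs], dtype=float)

    def unpack(theta):
        alpha, beta, rho = np.exp(theta[:3])              # shared, positive
        a = np.exp(theta[3:3 + n_tasks])                  # task-specific data scales
        b = np.exp(theta[3 + n_tasks:3 + 2 * n_tasks])    # task-specific model scales
        E = np.exp(theta[3 + 2 * n_tasks:])               # task-specific irreducible error
        return alpha, beta, rho, a, b, E

    def residuals(theta):
        alpha, beta, rho, a, b, E = unpack(theta)
        pred = a[t_idx] * (Ds + rho * Dr) ** (-alpha) + b[t_idx] * N ** (-beta) + E[t_idx]
        return pred - err

    theta0 = np.concatenate([np.log([0.3, 0.3, 1.0]),     # arbitrary starting exponents
                             np.zeros(2 * n_tasks),        # a_t = b_t = 1 initially
                             np.full(n_tasks, -3.0)])      # E_t ~ 0.05 initially
    sol = least_squares(residuals, theta0, method="trf")
    return unpack(sol.x)
```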
Given the posterior over parameters, we compute an allocation (N̂, D̂s, D̂r) by solving the relaxed convex program and rounding to feasible integers and feasible N ∈ 𝒩. We compare this allocation against matched-budget baselines that reflect common practice: (i) model scaling, where Ds, Dr are fixed proportions and N is maximized; (ii) data scaling, where N is fixed and data is maximized; (iii) sim-only (Dr = 0) and real-only (Ds = 0) strategies; and (iv) a ``Chinchilla-style'' heuristic that enforces a fixed power-law ratio between N and total data Ds + Dr, ignoring ρ. Each baseline is instantiated by explicitly solving the corresponding maximization over its restricted family subject to Cost ≤ B, so that all comparisons are exact in cost rather than approximate in wall-clock.
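A minimal sketch of the relaxed allocator is shown below. It exploits the all-sim/all-real reduction at fixed N (developed in the threshold discussion later) and performs a one-dimensional search over N; the grid bounds are illustrative assumptions, and the irreducible term E is omitted because it does not affect the argmin.

```python
import numpy as np


def solve_allocation(B, kappa, c_s, c_r, alpha, beta, rho, a, b, N_grid=None):
    """Relaxed allocator: for each N, buy D_eff at the cheaper effective price
    c_tilde(N) = min(kappa*N + c_s, (kappa*N + c_r)/rho), exhaust the budget,
    and pick the N minimizing predicted error. Rounding to a feasible model-size
    set and integer episode counts is left to the caller."""
    if N_grid is None:
        N_grid = np.logspace(6, 9, 200)   # illustrative range of model sizes
    best = None
    for N in N_grid:
        price_sim = kappa * N + c_s              # cost per unit D_eff via simulation
        price_real = (kappa * N + c_r) / rho     # cost per unit D_eff via real episodes
        c_tilde = min(price_sim, price_real)
        D_eff = B / c_tilde                      # spend the whole budget on the cheaper source
        err = a * D_eff ** (-alpha) + b * N ** (-beta)
        if best is None or err < best["pred_err"]:
            use_sim = price_sim <= price_real
            best = {"N": N,
                    "D_s": D_eff if use_sim else 0.0,
                    "D_r": 0.0 if use_sim else D_eff / rho,
                    "pred_err": err}
    return best
```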
For each method we report: the chosen (N, Ds, Dr); the achieved Cost and its decomposition; the measured $\widehat{\mathrm{Err}}_t$ for all t ∈ T; and aggregate $\widehat{\mathrm{Err}}$ with uncertainty. We additionally report the fitted posterior for (α, β, ρ), including credible intervals and posterior correlations, since these quantities determine the qualitative regime (model-limited versus data-limited, and sim versus real preference). Finally, we publish the exact grid specification, procedural seeds for evaluation, and the cost accounting script. Under this protocol, the theory makes a concrete, refutable prediction: if (H1)–(H2) are a good local model of the learning curves, then the allocation computed from the fit should outperform the baselines at equal total cost, and the improvement should be stable under small perturbations of the fitted parameters.
Our primary claim is that, once one accepts a locally valid scaling model of the form
$$
\mathrm{Err}(N, D_s, D_r) = a\,(D_s + \rho D_r)^{-\alpha} + b\,N^{-\beta} + E,
$$
together with an explicit accounting
$$
\mathrm{Cost}(N, D_s, D_r) = \kappa N (D_s + D_r) + c_s D_s + c_r D_r,
$$
then allocation ceases to be a matter of taste and becomes an optimization problem with falsifiable predictions. We regard this as directly relevant to the design of ``robot data engines'' in 2026, where the dominant engineering question is no longer whether to collect data, but what kind of experience to acquire, how much of it, and through which modality, under a shared budget across simulation infrastructure, fleet operations, and training compute.
A practical data engine must internalize that data are not free even
in simulation: increasing Ds increases
both generation cost csDs
and training cost κNDs.
Likewise, increasing Dr increases
generation cost crDr
and training cost κNDr,
and in many realistic systems κN(Ds + Dr)
becomes dominant once N is
large. Our formulation suggests that a mature data engine should surface a single ledger of effective marginal costs across all levers, rather than reporting separate ``dataset size'' and ``model size'' milestones. In particular, the quantity $\tilde{c}(N) := \min\{\kappa N + c_s,\ (\kappa N + c_r)/\rho\}$ plays the role of an effective price per unit of Deff, and it is this price, not raw episode counts, that governs the optimal spend in the relaxed program. Thus, the relevant KPI for an organization is not merely ``episodes collected per day'' but ``effective episodes per unit cost at the current N,'' which changes as architectures and training stacks evolve.
The threshold rule implicit in the split optimization clarifies how
we should reason about sim-to-real strategy. For fixed (N, Deff), the
minimum-cost way to realize Deff is either all-sim or
all-real except at a measure-zero boundary; convex mixtures are
cost-optimal only when the effective per-unit prices match:
$$
\kappa N + c_s \;=\; \frac{\kappa N + c_r}{\rho}.
$$
This observation is often obscured in informal discussions that treat
``some real data’’ as intrinsically necessary. In our model, real data
are necessary only insofar as they lower the effective price of Deff or change the
functional form of Err (i.e., violate
the single-ρ assumption).
Operationally, the rule implies that efforts to improve simulation fidelity, domain randomization, or synthetic-to-real alignment should be evaluated through their impact on ρ (and possibly on the validity of the shared exponent α), because a modest increase in ρ can flip the inequality and thus change the optimal split
discontinuously. Conversely, reducing cr through
better resets, autonomy in data collection, or lower-latency
teleoperation can matter as much as increasing ρ; both act through the same
threshold comparison once κN is accounted for.
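The threshold comparison can be packaged as a one-line check; the numbers below are invented solely to illustrate how a modest change in ρ (or, equivalently, in cr) flips the preferred modality once κN is accounted for.

```python
def prefers_real(N, kappa, c_s, c_r, rho):
    """True when real data is the cheaper source of effective experience at model
    size N, i.e. when (kappa*N + c_r)/rho < kappa*N + c_s."""
    return (kappa * N + c_r) / rho < kappa * N + c_s


# Illustrative, made-up numbers: kappa*N = 0.1, sim price 0.12 per effective unit.
N, kappa, c_s, c_r = 1e8, 1e-9, 0.02, 0.30
print(prefers_real(N, kappa, c_s, c_r, rho=2.0))  # False: real price 0.20 > 0.12
print(prefers_real(N, kappa, c_s, c_r, rho=4.0))  # True: real price 0.10 < 0.12
```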
A second implication is methodological: if we wish to speak
meaningfully about allocations, we must make
``an episode'' and ``a unit of cost'' comparable across labs and
platforms. The protocol’s insistence on a common episodic interface,
published (κ, cs, cr),
and explicit decomposition of Cost is
not bureaucratic; it is the minimum structure required for an allocation
claim to be transportable. In the absence of such standardization,
apparent wins can be artifacts of unreported simulator amortization,
different reset labor, or inconsistent evaluation seeds. For 2026-era
benchmarks, we therefore expect that the benchmark specification will
include not only task generators Gt and success
criteria, but also a recommended accounting rubric (what is counted in
cr,
whether rendering is included in cs, how κ is computed), so that scaling
sweeps across organizations can be compared on a common axis.
The most immediate limitation is the assumption that a single pair of exponents (α, β) and a single ρ govern the entire swept region and all tasks. In robotics, the learning curve can exhibit regime changes: exploration-limited behavior at small D, representation-limited behavior at small N, and saturation effects as E dominates. Moreover, ``real data'' is itself heterogeneous: teleoperation demonstrations, autonomous rollouts, failure cases, and safety interventions may have different effective multipliers. A single ρ may therefore average over qualitatively distinct contributions, and the threshold rule may become misleading if Dr is not a scalar commodity. One principled extension is to replace Deff = Ds + ρDr by a multi-component effective volume $D_{\mathrm{eff}} = D_s + \sum_j \rho_j D_{r,j}$ with type-specific costs $c_{r,j}$, yielding a linear program for the split at fixed (N, Deff); another is to adopt a latent-mixture scaling model in which different tasks or data types belong to different regimes with different (α, β, ρ). The latter sacrifices some convexity but remains tractable with structured priors and posterior risk minimization.
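A sketch of the multi-component split as a linear program follows, using scipy.optimize.linprog; the optional per-type caps (e.g. limited teleoperation throughput) are an illustrative addition, not part of the formulation above.

```python
import numpy as np
from scipy.optimize import linprog


def cheapest_split(N, D_eff_target, kappa, c_s, c_r_types, rho_types, caps=None):
    """Linear program for the split at fixed (N, D_eff): minimize the total cost of
    realizing D_eff = D_s + sum_j rho_j * D_{r,j}, with optional per-type caps."""
    c_r_types = np.asarray(c_r_types, dtype=float)
    rho_types = np.asarray(rho_types, dtype=float)
    # Decision vector x = [D_s, D_{r,1}, ..., D_{r,J}]; objective includes training cost.
    cost = np.concatenate([[kappa * N + c_s], kappa * N + c_r_types])
    # Constraint D_s + sum_j rho_j D_{r,j} >= D_eff, written as -(1, rho) @ x <= -D_eff.
    A_ub = -np.concatenate([[1.0], rho_types])[None, :]
    b_ub = [-D_eff_target]
    bounds = [(0, None)] + [(0, cap) for cap in (caps or [None] * len(rho_types))]
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x if res.success else None
```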
Our analysis treats N as the sole model-side lever. In practice, deployed performance can also improve with test-time compute: longer horizons, model-predictive rollouts, diffusion sampling steps, or retrieval over large episodic memories. These introduce an additional resource Ctest with an associated cost and a potentially different scaling exponent. A minimal extension is to augment the error model by an additive term $d\,C_{\mathrm{test}}^{-\gamma}$ and add a deployment budget constraint, or to treat Ctest as bounded and incorporate it into E as a deployment-induced offset. Either way, the allocation problem becomes multi-dimensional but retains the same logic: we equate marginal error reductions per unit cost across training compute, data, and inference compute, subject to feasibility constraints such as latency bounds and on-device memory.
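The following sketch extends the allocator with a test-time compute axis by brute-force grid search; the deployment price kappa_test, the cap C_test_max, and the exponent gamma are hypothetical quantities standing in for the deployment budget, latency/memory constraints, and test-time scaling term mentioned above.

```python
import itertools


def allocate_with_test_compute(B, c_tilde, alpha, beta, gamma, a, b, d,
                               kappa_test, C_test_max, N_grid, C_test_grid):
    """Grid sketch of the augmented allocation with error
    a*D_eff^-alpha + b*N^-beta + d*C_test^-gamma: the deployment spend
    kappa_test*C_test is deducted from B, and the remainder is spent on
    effective data at the effective price c_tilde(N)."""
    best = None
    for N, C_test in itertools.product(N_grid, C_test_grid):
        if C_test > C_test_max:          # latency / on-device memory cap
            continue
        remaining = B - kappa_test * C_test
        if remaining <= 0:
            continue
        D_eff = remaining / c_tilde(N)   # spend the rest on effective experience
        err = a * D_eff ** (-alpha) + b * N ** (-beta) + d * C_test ** (-gamma)
        if best is None or err < best[0]:
            best = (err, N, D_eff, C_test)
    return best
```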
Finally, robotics forces us to confront safety as a first-class
constraint. The objective Err = 1 − SR
collapses multiple failure modes into a scalar and does not distinguish
benign failures from hazardous ones. A safety-aware allocation should
instead optimize a composite objective, for example
$$
\mathrm{Err}_\lambda := (1 - \mathrm{SR}) + \lambda\,\mathrm{Risk},
$$
or impose chance constraints $\Pr(\text{unsafe event}) \le \eta$ on the learned policy. Real-world data
collection itself can be constrained by safety, creating a coupling
between Dr
and admissible policies during data gathering. These considerations
suggest replacing the single constrained program by a constrained risk
minimization problem (e.g. with CVaR penalties) and treating safety
events as separate observables in the scaling fit. The main conceptual
point persists: once risk metrics are logged and priced, the allocator
can trade off safer (but more expensive) real data, improved simulation
safety filters, and additional training compute in a common currency,
and the resulting policy choice becomes auditable.
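As one possible instantiation, the composite objective with an empirical CVaR penalty can be computed as follows; the per-episode risk_scores are a hypothetical logging choice, and λ, η are design parameters rather than prescribed values.

```python
import numpy as np


def cvar(losses, eta=0.05):
    """Empirical CVaR_eta: mean of the worst eta-fraction of per-episode losses."""
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]   # descending
    k = max(1, int(np.ceil(eta * losses.size)))
    return losses[:k].mean()


def composite_objective(success_flags, risk_scores, lam=1.0, eta=0.05):
    """Err_lambda = (1 - SR) + lambda * CVaR_eta(risk), with risk_scores logged
    alongside success as separate safety observables."""
    err = 1.0 - float(np.mean(success_flags))
    return err + lam * cvar(risk_scores, eta)
```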
In summary, we view the scaling-law allocator not as a final theory of robot learning, but as a disciplined interface between empirical learning curves and organizational resource decisions. Its value is greatest when it forces explicit declarations: what is counted as cost, what is treated as data, and what constitutes success. The appropriate next step is not to make the model more complicated by default, but to expand it only where posterior predictive checks fail, thereby maintaining the central benefit of the approach: allocation decisions that are both computationally solvable and empirically refutable.