Foundation models have shifted the practical burden of adaptation
from training a task model from scratch to choosing an adapter configuration around a fixed
pretrained backbone. In 2026-scale deployments, we typically freeze a
large base model M0
and introduce parameter-efficient modules (e.g., adapters, low-rank
updates, sparse experts, routing policies, or mixtures thereof) whose
placement, width/rank, sharing pattern, and routing logic define a
discrete design a ∈ 𝒜. The
resulting search space is finite but combinatorially large, and the
full-fidelity evaluation F(a)—the outcome of a
standardized fine-tuning recipe measured on a held-out validation
metric—is expensive. Consequently, adapterized neural architecture
search (NAS) is now less a matter of ``can we optimize the
weights?'' and more a matter of ``can we
locate a near-optimal a while
paying for only a small number of expensive fine-tunes?''
We formalize this setting as a two-fidelity optimization problem. A high-fidelity oracle HiFi(a) returns a noisy observation of F(a), and each high-fidelity query corresponds to an end-to-end fine-tuning run at the intended training budget. In contrast, a low-fidelity oracle LoFi(a, b) returns partial-training information at a small budget b, such as an initial learning-curve prefix, early-layer statistics, or optimization traces. From this we compute a feature vector z(a, b) ∈ ℝd that combines architecture descriptors (e.g., adapter placement masks, ranks, parameter counts, routing sparsity) with curve-based statistics (e.g., early loss slope, curvature, gradient norms, or stability indicators). The operative hope is that z(a, b0) at a fixed cheap budget b0 is predictive enough to focus high-fidelity evaluations on a small subset of promising designs.
The central obstacle is proxy mismatch. Early learning curves, partial optimization traces, and architectural descriptors are not, in general, sufficient to recover the ordering induced by F. Mismatch arises from phenomena that are invisible at small b0: late-phase generalization, optimizer schedule interactions, task mixture effects, and representation drift. In adapterized regimes the mismatch is particularly acute, because (i) different placements can yield similar early loss behavior but different asymptotic generalization, (ii) routing or sparsity mechanisms often exhibit delayed specialization, and (iii) small changes in parameterization can alter stability and regularization in ways that manifest only late in training. For these reasons, a purely proxy-driven NAS procedure is unreliable in the worst case: without additional structure linking proxy observations to F, there is no principled way to guarantee near-optimal selection.
Our approach is to separate what can be learned from what must be assumed. We assume only a bounded-distortion linkage between the true objective and an unknown predictor of low-fidelity features: there exists g* ∈ 𝒢 such that |F(a) − g*(z(a, b0))| ≤ η for all a ∈ 𝒜. The parameter η quantifies the irreducible proxy mismatch at budget b0 within the chosen hypothesis class 𝒢. This assumption does not posit that the proxy is correct, nor that it preserves rank exactly; it merely asserts that after appropriate calibration, the proxy-induced representation admits a uniformly accurate predictor up to additive distortion η. This is the weakest form of structural linkage that still permits nontrivial worst-case guarantees, and it is precisely the quantity one expects to improve when the proxy features are strengthened (e.g., via richer curve summaries or better architecture encodings) or when the budget b0 is increased.
Given this linkage, we view high-fidelity evaluations as a scarce supervision resource used to calibrate a predictor on top of low-fidelity features. The algorithmic template is simple: we perform a low-fidelity sweep to compute z(a, b0) for many (often all) a ∈ 𝒜, then spend a limited budget of m high-fidelity queries to train and calibrate a predictor ĝ mapping features to objective values. We finally output the architecture minimizing ĝ (or, in an elimination variant, repeatedly refine a candidate set using calibrated confidence intervals). The methodological point is that the low-fidelity phase is not used to rank architectures directly; it is used to build a representation in which a small number of high-fidelity labels suffices to generalize uniformly across 𝒜.
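To make the template concrete, the following minimal Python sketch implements the three phases; `lofi`, `hifi`, `featurize`, and `fit` are hypothetical stand-ins for the LoFi/HiFi oracles, the featurization map z, and the ERM step developed later, not a prescribed interface.

```python
import random

def one_shot_select(arch_set, lofi, hifi, featurize, fit, m, b0):
    # Phase 1: cheap, parallelizable low-fidelity sweep -> proxy table.
    Z = {a: featurize(a, lofi(a, b0)) for a in arch_set}
    # Phase 2: scarce high-fidelity labels calibrate a predictor g_hat.
    labeled = random.sample(list(arch_set), m)
    g_hat = fit([Z[a] for a in labeled], [hifi(a) for a in labeled])
    # Phase 3: argmin of the calibrated predictor over the whole table.
    return min(arch_set, key=lambda a: g_hat(Z[a]))
```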
Our contributions are organized around guarantees that make this intuition precise.
We fix a finite architecture set 𝒜 around a frozen M0, define F(a) as the full-fidelity objective, and specify two oracles LoFi and HiFi with an explicit cost separation. This abstraction captures modern adapter search pipelines in which partial fine-tunes can be parallelized widely, while full fine-tunes dominate wall-clock and energy cost. The key modeling choice is to treat LoFi(a, b0) as producing features rather than a noisy scalar objective. This reflects the empirical reality that early trajectories contain heterogeneous signals (stability, curvature, speed, implicit regularization) that are more informative jointly than any single proxy metric.
Under bounded distortion and standard capacity control for 𝒢, we show that m high-fidelity samples suffice to
obtain a uniform prediction error bound over all architectures. The
uniformity is essential: because the final decision is an arg min over 𝒜, pointwise accuracy is inadequate. The
resulting sample complexity scales as
$$
m=\tilde O\!\left(\sigma^2\frac{d+\log|\mathcal{A}|}{\epsilon^2}\right),
$$
matching the familiar 1/ϵ2 dependence of
stochastic best-arm identification and exhibiting the expected
logarithmic dependence on the size of the search space.
Uniform error immediately yields rank consistency statements: predicted orderings differ from true orderings only on pairs whose true objective gap is small relative to the uniform error. This leads to explicit Kendall-τ disagreement bounds in terms of the empirical gap distribution, clarifying when a calibrated predictor should be expected to produce a high-quality shortlist even if exact identification of the optimum is infeasible.
Beyond one-shot selection, we analyze a promotion/elimination variant in which we maintain a surviving set S ⊆ 𝒜 and discard architectures whose lower confidence bound is provably worse than the best upper confidence bound by a margin. With calibrated confidence intervals, the optimal architecture is never eliminated, and the allocation of high-fidelity queries adapts to gaps Δa = F(a) − F(a*), yielding instance-dependent complexity of the form Õ(σ2/Δa2) per suboptimal architecture.
We also formalize two negative results that delimit what is possible. First, without any constraint linking low-fidelity observations to F, no proxy-only procedure can guarantee nontrivial approximation uniformly over all objectives: indistinguishability constructions show that the same proxies can correspond to objectives with disjoint minimizers. Second, even when proxies are perfectly informative up to calibration (η = 0), any algorithm that returns an ϵ-optimal architecture with probability 1 − δ must perform Ω(σ2log (1/δ)/ϵ2) high-fidelity queries in the worst case. Thus the dependence on σ2, ϵ, and δ in our guarantees is essentially unavoidable.
In sum, we treat adapterized NAS as a calibrated, multi-fidelity selection problem: low-fidelity runs produce a feature space, high-fidelity runs provide scarce labels, and the correctness criterion is uniform enough to support an arg min decision with a prescribed (ϵ, δ) guarantee. The resulting perspective explains both why naive proxy ranking can fail (large η or insufficient calibration) and why a modest number of full fine-tunes can nevertheless suffice when the proxy representation is well-chosen (small η and controlled complexity d).
Neural architecture search (NAS) originated in settings where the architecture controls the entire model and training from scratch is the dominant cost. In our setting the backbone M0 is fixed and frozen, and the search object is an adapter configuration (placement, rank/width, sharing, and routing of parameter-efficient modules). This shift changes both the empirical bottlenecks and the appropriate abstraction: full-fidelity evaluation is still expensive, but low-fidelity information is naturally available in the form of partial fine-tuning trajectories and optimization traces that are specific to the adapterized regime.
Early NAS methods used reinforcement learning (RL) controllers to generate architectures, receiving validation performance as a reward signal (e.g., ). These approaches are conceptually compatible with discrete design spaces, but in practice they require large numbers of expensive evaluations and rely on weight sharing or other amortization strategies to be computationally feasible. In adapter search, an RL controller can indeed propose module placements and ranks; however, without a careful treatment of multi-fidelity information, the evaluation cost remains dominated by full fine-tunes, and the controller must still cope with noisy rewards and limited samples. From a theoretical standpoint, RL-based NAS analyses typically inherit regret guarantees for policy optimization under assumptions that do not reflect proxy-based calibration: the reward is assumed to be observed (perhaps noisily) for each sampled architecture, rather than inferred from a feature-rich low-fidelity trajectory prefix.
Evolutionary algorithms (EA) and related population-based strategies are another classical pillar of NAS (e.g., ). Their strength is robustness to nonconvexity and discrete decisions, and their weakness is again evaluation cost. Various works introduce cheap mutations, surrogate models, or partial training to reduce cost. In adapterized regimes, evolutionary search is natural because placements and routing rules admit structured mutations, and partial fine-tuning can serve as a screening signal. Nevertheless, standard EA treatments view low-fidelity information primarily as a proxy (early stopping, truncated training) rather than as an object from which one learns a calibrated predictor with explicit uniform guarantees. Moreover, most EA analyses are asymptotic or rely on simplifying assumptions (e.g., unimodal landscapes or drift conditions) that do not directly yield finite-sample (ϵ, δ) statements for the arg min decision over a large finite set.
Differentiable architecture search (DARTS) and its successors relax discrete choices into continuous parameters and optimize them jointly with weights via gradient descent (e.g., ). Numerous variants address instability, bias toward parameter-free operations, and fairness of comparison (e.g., DARTS+ and FairDARTS-like adjustments; see surveys such as for categorization). These methods are most effective when the search space is expressed as a cell or a supernet with shared parameters. In adapter NAS, one can also define a supernet over possible adapter placements/ranks and relax selection variables. However, two issues arise. First, weight sharing can be less reliable when the base model is frozen and the adaptation modules are comparatively small: the supernet coupling may distort relative performance, and the mismatch between relaxed training dynamics and final discrete instantiation can be large. Second, the most common theoretical analyses of continuous-relaxation NAS focus on optimization properties of the relaxed problem (e.g., convergence to stationary points) rather than on statistical guarantees that connect low-fidelity signals (partial fine-tunes, early trajectories) to full-fidelity performance across a finite architecture set. Our formulation explicitly isolates the statistical calibration step needed to turn low-fidelity features into a predictor whose error can be controlled uniformly over 𝒜.
One-shot NAS trains a single over-parameterized supernet and evaluates sub-architectures by inheriting weights (e.g., ). This strategy is a form of multi-fidelity evaluation: the inherited-weight score is a cheap estimator of the fully trained score. The primary difficulty is estimator bias and rank inconsistency, which is known to be severe in some regimes. The adapterized setting exacerbates this: because the backbone is fixed, the adaptation modules must negotiate with a representation that they did not co-train, and supernet coupling effects can change the effective regularization on the adapters. Existing theory for one-shot NAS, where present, typically addresses properties of weight sharing under restrictive assumptions (e.g., linearization, neural tangent approximations, or simplified losses) and does not provide an (ϵ, δ) guarantee for selecting a discrete adapter configuration using a small number of full fine-tunes. Our approach avoids assuming that the low-fidelity estimator is unbiased or even scalar; instead we treat low-fidelity outputs as features and allow an explicit distortion term capturing irreducible mismatch.
A substantial body of work studies predictors for architecture performance and learning-curve extrapolation (e.g., ). These methods often combine architecture encodings with partial training signals to predict final accuracy, and many are explicitly multi-fidelity, using early stopping or bandit-style budget allocation (e.g., Hyperband and its variants ). Our setting is closest in spirit to these predictor-based and multi-fidelity hyperparameter optimization (HPO) lines, with two important distinctions. First, the proxy z(a, b0) in adapter NAS is typically richer than the scalar curves used in classical curve extrapolation: it naturally mixes architectural descriptors (placement masks, ranks, routing sparsity) with trajectory-derived statistics (loss slope, stability, gradient norms), which are not well-modeled as noisy samples of a single monotone learning curve. Second, our target is not merely good average prediction, but a control sufficient to justify an arg min decision over a finite, potentially large 𝒜. Many predictor works evaluate rank correlation empirically, but do not provide the kind of worst-case selection guarantee that is needed when a small error on a small subset of architectures can change the identity of the minimizer.
Random search remains a strong baseline in NAS and HPO, particularly in high-dimensional discrete spaces where sophisticated methods can be brittle (cf. in HPO and subsequent NAS observations). In adapterized NAS, random search coupled with a small number of full fine-tunes is often competitive because the cost of each evaluation dominates and because proxy signals can be misleading. Our guarantees clarify when random sampling is sufficient and when it is not: if no structural linkage exists between low-fidelity observations and full-fidelity performance, then no proxy-based method can improve worst-case guarantees over naive selection, and one must fall back to allocating high-fidelity evaluations in a manner consistent with best-arm identification lower bounds. Conversely, when linkage holds in a quantifiable way, a calibrated predictor can reduce the number of required full evaluations.
Hardware-aware NAS incorporates latency, memory, energy, or throughput constraints, often via multi-objective optimization or constrained search (e.g., ). Adapter architectures are frequently deployed under tight serving constraints (e.g., parameter budgets for on-device adaptation or routing-induced latency). While our primary formulation is single-objective (minimize the full-fidelity metric), the same multi-fidelity calibration perspective can be extended to constrained selection by augmenting features with cost proxies and applying calibration to both quality and constraint predictors. Existing theory for hardware-aware NAS typically treats cost models as known or cheaply measured, whereas in adapter routing settings latency can be data-dependent and thus itself noisy; this reinforces the need for explicit confidence control when eliminating architectures.
Theoretical treatments of NAS and multi-fidelity optimization tend to fall into one of three categories: (i) bandit-style analyses where each architecture/arm yields a noisy scalar reward upon evaluation; (ii) Bayesian optimization and surrogate modeling under smoothness or kernel assumptions on a continuous domain; or (iii) optimization analyses of continuous relaxations (e.g., DARTS). Our setting departs from (i) because the low-fidelity oracle does not yield a scalar reward comparable across architectures, but rather a feature-rich trajectory prefix from which one must learn a predictor. It departs from (ii) because the domain is a large finite set with structured, heterogeneous encodings, and the meaningful assumption is not global smoothness in a metric space but a bounded-distortion relationship between full-fidelity performance and a hypothesis class over proxy features. It departs from (iii) because we do not assume that the relaxed optimization problem faithfully captures the discrete decision, nor that weight sharing yields unbiased estimates. Consequently, existing theory does not directly yield the (ϵ, δ) selection guarantee we seek for adapter NAS under a small number of full fine-tunes.
These gaps motivate the problem setup in the next section: we formalize a two-fidelity model in which low-fidelity queries produce proxy features from partial training trajectories, and high-fidelity queries provide scarce noisy labels used for calibration and selection over a finite architecture set.
We formalize adapter architecture search for a fixed pretrained foundation model M0 as a budgeted optimization problem over a finite design set. Let 𝒜 denote a finite collection of adapter architectures, each specifying (for example) the placement of parameter-efficient modules (e.g., adapters, LoRA blocks), their ranks or widths, possible parameter sharing patterns, and (when applicable) routing or sparsity rules. Throughout, M0 is held fixed and frozen; the only trainable parameters under architecture a ∈ 𝒜 are those introduced by the adaptation modules prescribed by a.
For each a ∈ 𝒜, we define F(a) ∈ ℝ as the performance metric of interest. Concretely, F(a) may be the mean validation loss (lower is better) over a fixed task suite after applying a standardized fine-tuning recipe to the adapter parameters under architecture a. The standardization is part of the problem definition: it includes the training data, preprocessing, optimizer, schedule, batch size, stopping rule, and evaluation protocol. This is necessary so that F(a) is well-defined as a property of a, rather than an artifact of an adaptive training procedure.
We assume that direct access to F(a) is expensive. Instead,
we have a high-fidelity oracle HiFi
which returns a noisy observation
HiFi(a) = F(a) + ξ,
where ξ is conditionally
mean-zero and σ2-sub-Gaussian. This
noise model abstracts randomness from optimization (minibatching,
initialization of adapter weights), data order, and evaluation variance
across tasks. The σ2-sub-Gaussian
assumption is used to express finite-sample concentration for estimators
built from a limited number of full evaluations.
In addition to expensive high-fidelity evaluation, we assume access
to a cheap low-fidelity oracle LoFi(a, b) parameterized by
a budget b, which returns partial
information from a truncated fine-tuning run of architecture a up to budget b. The budget b can represent steps, epochs,
tokens, or wall-clock time. We emphasize that LoFi(a, b) is not assumed
to be a scalar proxy comparable across architectures; rather, it returns
a structured trajectory prefix (and possibly additional traces). A canonical example is a
sequence
LoFi(a, b) = (ℓa, 1, …, ℓa, b)
of training or validation losses, but in practice it may also include
gradient norms, learning-rate values, moments of activation statistics,
per-task losses, routing entropy, sparsity patterns, or adapter-specific
diagnostics. The only requirement at this stage is that the low-fidelity
output can be deterministically transformed into a feature vector.
We fix a reference low-fidelity budget b0 intended to be small
relative to a full fine-tune, and for each a ∈ 𝒜 we compute a proxy feature
vector
z(a, b0) ∈ ℝd
by applying a featurization map z(⋅, ⋅) to (i) an encoding of the
architecture a itself and (ii)
the partial trajectory information returned by LoFi(a, b0). We
treat d as the resulting
feature dimension (or, more generally, as a complexity parameter
controlling the capacity of the predictor class used downstream).
The role of z(a, b0)
is to capture information that is plausibly predictive of F(a) while being cheap to
obtain. Since adapter architectures naturally factor into discrete
choices (placement masks, rank allocations, routing graphs) and early
optimization signals (loss slope, stability, curvature proxies), we
allow z(a, b0)
to concatenate heterogeneous components:
$$
z(a,b_0) \;=\;
\mathrm{concat}\!\Bigl(\underbrace{\phi_{\mathrm{arch}}(a)}_{\text{static
descriptors}},\
\underbrace{\phi_{\mathrm{curve}}(\mathsf{LoFi}(a,b_0))}_{\text{trajectory
statistics}}\Bigr).
$$
Here ϕarch(a) can
include counts of trainable parameters, per-layer adapter placements,
rank vectors, routing sparsity levels, and similar structural
descriptors; ϕcurve
can include the initial loss, the average slope over windows, measures
of oscillation or stability, gap between training and validation losses
at early time, or cross-task dispersion in multi-task fine-tuning. We do
not assume monotonicity, smoothness, or unbiasedness of these features;
they are merely inputs for subsequent calibration from a small number of
high-fidelity labels.
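As an illustration, a possible featurization in Python; the architecture record fields and the LoFi output keys below are assumptions made for the sketch, not a fixed schema.

```python
import numpy as np

def phi_arch(arch):
    # Static structural descriptors; `arch` is a hypothetical record type.
    return np.array([
        arch["num_trainable_params"],
        arch["routing_sparsity"],
        *arch["placement_mask"],   # 0/1 adapter placement per layer
        *arch["ranks"],            # low-rank update rank per adapted layer
    ], dtype=float)

def phi_curve(train_loss, val_loss):
    # Early-trajectory statistics computed from the LoFi prefix.
    t, v = np.asarray(train_loss, float), np.asarray(val_loss, float)
    return np.array([
        v[0],                             # initial level
        (v[-1] - v[0]) / len(v),          # average early slope
        float(np.abs(np.diff(v)).std()),  # oscillation / stability proxy
        float((v - t).mean()),            # early train/validation gap
    ])

def z(arch, lofi_out):
    # Concatenated heterogeneous feature vector z(a, b0).
    return np.concatenate([
        phi_arch(arch),
        phi_curve(lofi_out["train_loss"], lofi_out["val_loss"]),
    ])
```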
We explicitly model the compute disparity between fidelities. A
low-fidelity query at budget b
incurs cost cL(b),
while a high-fidelity query incurs cost cH, with cH ≫ cL(b0).
The algorithm is permitted to issue many low-fidelity
queries—potentially for all a ∈ 𝒜 at budget b0—since these can often
be parallelized across machines and run with small adapter parameter
counts. By contrast, the number of high-fidelity queries is sharply
limited. We therefore express the primary resource constraint either
as
#{high-fidelity
queries} ≤ m,
or, in a unified form, as a total compute budget B with
$$
\sum_{(a,b)\ \mathrm{queried}} c_L(b) \;+\; \sum_{a\ \mathrm{queried}} c_H \;\le\; B.
$$
In the remainder of the development we focus on the regime where
low-fidelity information is abundant at a fixed budget b0, and the dominant
constraint is the number m of
full fine-tunes.
The core objective is to return an architecture â ∈ 𝒜 whose full-fidelity
performance is near-optimal. Given target accuracy ϵ > 0 and failure probability
δ ∈ (0, 1), we seek an
algorithm which, using at most m high-fidelity evaluations and
low-fidelity evaluations at cost dominated by n := |𝒜| queries at budget b0, satisfies
F(â) ≤ mina ∈ 𝒜F(a) + ϵ with
probability at least 1 − δ.
We also allow the algorithm to output a ranked shortlist S ⊆ 𝒜 (e.g., the top k candidates under a predictor),
optionally equipped with confidence intervals derived from the
high-fidelity calibration set.
In many deployments, adapter architectures must meet serving
constraints, such as parameter count, latency, memory footprint, or
routing-induced compute. To capture this, we may associate to each a ∈ 𝒜 a constraint metric C(a) (or a vector of
metrics) and impose a threshold κ. The constrained optimization
problem is then
mina ∈ 𝒜F(a) subject
to C(a) ≤ κ.
The constraint itself may be (i) known deterministically from a (e.g., parameter count), (ii)
measurable cheaply (e.g., static FLOPs), or (iii) noisy and
data-dependent (e.g., latency under dynamic routing). In the latter
cases one can define an additional oracle and proxy features for
constraint prediction analogous to the quality objective. For the
problem setup, it suffices to note that our algorithmic interface can
accommodate either hard filtering (remove architectures violating
deterministic constraints) or probabilistic feasibility (maintain
confidence bounds for C(a) in addition to those
for F(a)).
We emphasize the intended access pattern. The low-fidelity stage may query LoFi(a, b0) for many or all a ∈ 𝒜, producing a table {z(a, b0)}a ∈ 𝒜. The high-fidelity stage then adaptively selects a small subset of architectures for HiFi evaluation in order to (i) calibrate a mapping from proxy features to full-fidelity performance, and (ii) identify an approximately optimal architecture with controlled failure probability. This separation reflects the empirical reality that early partial-training traces are cheap and parallelizable, whereas full evaluations are scarce and must be allocated strategically.
The remainder of the paper will specify how we turn the proxy table {z(a, b0)} and a limited number of noisy labels {HiFi(ai)}i = 1m into a calibrated predictor and associated uncertainty estimates suitable for selection and safe elimination over the finite set 𝒜.
This section specifies the statistical object that mediates between low-fidelity observations and the full-fidelity objective: a calibrated predictor of F(a) from proxy features z(a, b0), together with uncertainty estimates suitable for selection and (in the next section) safe elimination. We emphasize that the predictor is not assumed to be ``correct'' in a realizable sense; rather, we isolate an explicit approximation error η capturing intrinsic mismatch between proxy information and the true objective.
The proxy vector z(a, b0) ∈ ℝd may be viewed as a summary of (i) a discrete architecture description and (ii) an early learning signal. While the development does not require a particular featurization, the bounded-distortion assumption (A1) is only plausible when z exposes coarse invariants that persist from early training to completion.
A generic and practically stable approach is to include parametric summaries of the
partial learning curve that are motivated by empirical scaling laws.
Suppose LoFi(a, b0)
contains a scalar validation loss sequence (ℓa, t)t = 1b0.
We may fit a small parametric family on the prefix, e.g.,
ℓa, t ≈ αa + βat−γa or ℓa, t ≈ αa + βaexp (−γat),
via least squares or robust regression on t ∈ {tmin, …, b0}.
The fitted parameters (αa, βa, γa)
can be used as components of z(a, b0),
along with goodness-of-fit diagnostics (e.g., residual variance on the
prefix) that act as a proxy for optimization stability. When LoFi includes multiple traces (training loss,
validation loss, gradient norms, or per-task losses), we fit such
summaries per trace and concatenate. In addition, we include
nonparametric local statistics that tend to be robust under short
prefixes:
$$
\text{level: }\ell_{a,1},\ \ell_{a,b_0};\qquad
\text{slope:
}\frac{1}{w}\sum_{t=b_0-w+1}^{b_0}(\ell_{a,t}-\ell_{a,t-1});\qquad
\text{curvature proxy: }\ell_{a,b_0}-2\ell_{a,b_0-w}+\ell_{a,b_0-2w}.
$$
These are intended to capture whether a run has ``entered'' a productive
regime early, without asserting monotonicity or linearity across
architectures.
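A sketch of these summaries in Python, using scipy's curve_fit for the power-law prefix model; the window size w, the fit range, and the non-convergence fallback are illustrative choices rather than prescribed ones.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(t, alpha, beta, gamma):
    # Scaling-law-style prefix model: loss(t) ~ alpha + beta * t**(-gamma).
    return alpha + beta * np.power(t, -gamma)

def curve_summaries(losses, t_min=5, w=5):
    # `losses` is the LoFi validation-loss prefix (l_1, ..., l_{b0}).
    ell = np.asarray(losses, dtype=float)
    b0 = len(ell)
    t = np.arange(t_min, b0 + 1, dtype=float)
    try:
        (alpha, beta, gamma), _ = curve_fit(
            power_law, t, ell[t_min - 1:], p0=(ell[-1], 1.0, 0.5), maxfev=5000)
        resid = ell[t_min - 1:] - power_law(t, alpha, beta, gamma)
        resid_var = float(np.var(resid))   # goodness-of-fit / stability proxy
    except RuntimeError:                   # non-convergence is itself a signal
        alpha, beta, gamma, resid_var = np.nan, np.nan, np.nan, np.inf
    slope = (ell[-1] - ell[-w - 1]) / w    # telescoped window slope
    curvature = ell[-1] - 2.0 * ell[-w - 1] + ell[-2 * w - 1]
    return np.array([alpha, beta, gamma, resid_var,
                     ell[0], ell[-1], slope, curvature])
```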
The architecture descriptor component ϕarch(a) should include at least those aspects known to strongly influence full-fidelity outcomes: number of trainable parameters, per-layer adapter placements, ranks/widths, any sharing constraints, and (for routing) expected active fraction or entropy measures under random or early-trained gates. When the goal includes device or serving constraints, it is useful (though not required for the quality predictor) to include static compute and memory features such as FLOPs and activation footprint. We treat all such quantities as deterministic functions of a, hence valid low-fidelity inputs.
Given proxy vectors z(a, b0), we posit a hypothesis class 𝒢 ⊆ {g : ℝd → ℝ} from which we fit a predictor ĝ. The analysis downstream depends on a capacity parameter, taken here as pdim(𝒢) ≤ d, and on the existence of some g* ∈ 𝒢 achieving bounded distortion. Concretely, 𝒢 may be instantiated as a linear class over z(a, b0), kernel ridge regression, gradient-boosted trees, or a small neural regressor, each with controlled effective capacity.
We fit ĝ by empirical risk
minimization (ERM) or regularized ERM on the observed high-fidelity
labels. Writing yi = HiFi(ai)
and zi = z(ai, b0),
we solve
$$
\hat g \in \arg\min_{g\in\mathcal{G}} \frac{1}{m}\sum_{i=1}^m
(g(z_i)-y_i)^2 + \lambda \Omega(g),
$$
where Ω is a complexity
penalty (e.g., ∥w∥22 in the
linear case) and λ is tuned on
a hold-out subset of the m
high-fidelity samples or via nested cross-validation within D. The role of regularization is not
only predictive accuracy but also stability of uncertainty estimates
used by the selection logic.
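A minimal instantiation with a linear class, using scikit-learn's RidgeCV so that the regularization strength is selected by internal cross-validation on the m labeled pairs; the class choice and grid are illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_g_hat(Z, y, alphas=(1e-2, 1e-1, 1.0, 10.0)):
    # Regularized ERM over a linear class; `alphas` plays the role of the
    # lambda grid, tuned by RidgeCV's internal cross-validation.
    model = RidgeCV(alphas=alphas).fit(np.asarray(Z), np.asarray(y))
    return lambda z: float(model.predict(np.asarray(z).reshape(1, -1))[0])
```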
The core structural assumption is that proxy features admit a
predictor within 𝒢 that is uniformly
close to the truth:
∃g* ∈ 𝒢 s.t. supa ∈ 𝒜|F(a) − g*(z(a, b0))| ≤ η.
The parameter η measures the irreducible proxy mismatch:
even with infinite high-fidelity labels, the best proxy-based predictor
can deviate from F by at most
η. This explicitly separates
what can be improved by more HiFi
evaluations (estimation error) from what cannot (misspecification and
proxy insufficiency). The regime of interest for ϵ-optimal selection is η ≪ ϵ, and in particular
the later guarantee will require η ≤ ϵ/4.
Under sub-Gaussian noise and capacity control, ERM yields a uniform
estimation bound over the finite set 𝒜.
Informally, if a1, …, am
are drawn uniformly from 𝒜, then with
probability at least 1 − δ,
$$
\sup_{a\in\mathcal{A}} |\hat g(z(a,b_0))-F(a)|
\ \le\
\eta + C\,\sigma\sqrt{\frac{d+\log(n/\delta)}{m}},
$$
for a universal constant C
depending on the choice of loss and boundedness conditions. The key
point is that the error decomposes additively into η and an estimation term scaling as
$1/\sqrt{m}$. This is the quantitative
reason that a small number of high-fidelity labels can ``calibrate'' a
proxy table computed for all a ∈ 𝒜.
For the adaptive algorithm, we require not only a point predictor but
also calibrated uncertainty. We therefore construct a confidence
interval CI(a) around ĝ(z(a, b0))
using conformal prediction on a held-out calibration subset. Concretely,
split the m labeled pairs into
a proper training set Dtrain and a calibration
set Dcal. Fit ĝ on Dtrain, compute
residuals
rj = |yj − ĝ(zj)|, (zj, yj) ∈ Dcal,
and let q̂ be the (1 − α)-quantile of {rj} with the
standard finite-sample correction. Then define
CI(a) = [ĝ(z(a, b0)) − q̂, ĝ(z(a, b0)) + q̂].
By exchangeability of the labeled examples, this yields marginal
coverage
Pr (F(a) ∈ CI(a)) ≥ 1 − α
for a new architecture a drawn
from the same sampling scheme as the calibration points. In our setting
𝒜 is finite and the algorithm evaluates
only architectures in 𝒜, so we use two
refinements when needed: (i) Mondrian (stratified) calibration to maintain
coverage across salient groups (e.g., parameter-count bins), and (ii) a
union-bound adjustment to target simultaneous coverage over a candidate
set S ⊆ 𝒜 maintained by the
algorithm, by setting α = δ/|S| when
S is fixed or by using a
conservative schedule over rounds.
We stress that conformal intervals do not require correctness of 𝒢; they adapt to misspecification by inflating q̂. The cost is that wide intervals may preclude aggressive elimination, which is appropriate when proxies are weak.
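A split-conformal sketch in Python matching the construction above; `fit` is any trainer returning a predictor, and the clamped index is an illustrative way to handle a too-small calibration set.

```python
import numpy as np

def split_conformal(Z, y, fit, alpha=0.1, train_frac=0.5, seed=0):
    # Split the m labeled pairs, fit g_hat on one half, and take the
    # finite-sample-corrected (1 - alpha) residual quantile on the other.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_tr = int(train_frac * len(y))
    tr, cal = idx[:n_tr], idx[n_tr:]
    g_hat = fit([Z[i] for i in tr], [y[i] for i in tr])
    resid = np.sort([abs(y[j] - g_hat(Z[j])) for j in cal])
    k = int(np.ceil((len(cal) + 1) * (1 - alpha))) - 1  # 0-indexed rank
    q_hat = resid[min(k, len(resid) - 1)]               # clamp if k == n_cal
    return g_hat, q_hat

# CI(a) = [g_hat(z_a) - q_hat, g_hat(z_a) + q_hat] then has marginal
# coverage at least 1 - alpha under exchangeability of the labeled pairs.
```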
Assumption (A1) is structural and cannot be verified from low-fidelity data alone, but it can be falsified (or at least stress-tested) with a modest number of high-fidelity labels. In practice we recommend diagnostics computed on an initial random batch of m0 high-fidelity evaluations: for example, comparing the held-out residuals of ĝ against the noise scale σ to lower-bound η, checking rank correlation (e.g., Kendall-τ) between predicted and observed values, and auditing the empirical coverage of the conformal intervals. These checks do not prove (A1), but they provide actionable feedback: either the proxy-to-true linkage is sufficiently tight to justify the calibrated predictor approach, or the procedure should be modified (richer low-fidelity signals, larger b0, alternative 𝒢, or a fallback to more direct high-fidelity exploration).
The next section uses the calibrated predictor and its confidence intervals to define an adaptive allocation rule over HiFi queries, combining exploitation of low predicted loss with elimination safeguards derived from calibrated uncertainty.
We now specify the multi-fidelity procedure that combines (i) an exhaustive low-fidelity sweep at budget b0 to construct proxy features {z(a, b0)}a ∈ 𝒜, with (ii) an adaptive allocation of a limited number m of high-fidelity evaluations guided by a calibrated predictor and its confidence intervals. The algorithm is designed to support two modes of operation: a simple one-shot variant (sufficient for the non-adaptive upper bound), and an adaptive promotion/elimination variant that safely eliminates clearly suboptimal architectures and concentrates high-fidelity evaluations on a shrinking candidate set.
The algorithm takes as input 𝒜, the low-fidelity budget b0, a high-fidelity query budget m, and target parameters (ϵ, δ). It outputs a final choice â ∈ 𝒜, and optionally a ranked shortlist S ⊆ 𝒜 with calibrated confidence intervals {CI(a)}a ∈ S. We treat ϵ as the target additive optimality gap and δ as an overall failure probability.
We first query LoFi(a, b0) for
every a ∈ 𝒜 (or for as many
architectures as is feasible under the low-fidelity compute budget), and
compute z(a, b0).
This phase is embarrassingly parallel and produces a fixed proxy table
that is reused throughout.
Importantly, the subsequent high-fidelity allocation depends on each architecture only through z(a, b0),
so we can choose which architectures to evaluate using HiFi without further low-fidelity cost.
Adaptive selection based on a predictor requires that the predictor
be calibrated on at least a small number of high-fidelity labels. We
therefore select an initial batch of m0 architectures a1, …, am0
uniformly at random from 𝒜 (or from the
current candidate set S if the
procedure is restarted), query yi = HiFi(ai),
and form a labeled dataset
Dm0 = {(z(ai, b0), yi)}i = 1m0.
The role of this step is purely statistical: it ensures that early
decisions are not driven by an uncalibrated model, and it provides data
to estimate uncertainty (e.g., via conformal residuals).
For t > m0, we
proceed iteratively. At iteration t, we fit a predictor ĝt ∈ 𝒢 on the
current labeled set Dt − 1 (using
ERM/regularized ERM as above), and we construct a confidence interval
CIt(a) = [LCBt(a), UCBt(a)]
for each a in the current
candidate set St − 1. The
particular calibration mechanism is not fixed by the analysis; we
require only that, with probability at least 1 − δ, the intervals provide valid
coverage for all architectures on which we base elimination decisions.
Conformal prediction provides a concrete instantiation: we fit ĝt on a proper
training subset and compute a quantile of residuals on a calibration
subset to obtain a data-dependent half-width q̂t,
yielding
LCBt(a) = ĝt(z(a, b0)) − q̂t, UCBt(a) = ĝt(z(a, b0)) + q̂t.
We then choose the next high-fidelity query as a lower-confidence-bound
minimizer,
at ∈ arg mina ∈ St − 1LCBt(a),
query yt = HiFi(at),
and set Dt = Dt − 1 ∪ {(z(at, b0), yt)}.
This LCB rule is the natural choice for minimization: it prioritizes
architectures that are either predicted to be good or have high
uncertainty, and it is compatible with elimination rules that rely on
simultaneous upper/lower bounds.
To concentrate the m
high-fidelity queries, we optionally interleave querying with
elimination. At certain iterations (or at the end of each ``round'' in a
staged schedule), we update ĝt and CIt(⋅)
and shrink the candidate set according to a safety criterion. The basic
elimination rule is to remove an architecture a from the current candidate set as soon as
$$
\mathrm{LCB}_t(a) \;>\; \min_{a' \in S_{t-1}} \mathrm{UCB}_t(a') \;+\; \frac{\epsilon}{2}.
$$
Under valid calibration, any architecture removed by this rule cannot be ϵ/2-optimal: its best plausible value
exceeds the best plausible value among survivors by more than ϵ/2. The margin ϵ/2 is chosen so that, once the
confidence intervals are sufficiently tight, elimination and final
selection together yield an ϵ-optimal output.
To obtain a successive-halving-style behavior, we couple the elimination rule with an explicit round schedule. For
example, define rounds r = 0, 1, …, R with
candidate sets S(r) and
cumulative high-fidelity counts m(r), where
m(0) = m0
and m(r + 1) − m(r)
is the number of new HiFi labels
acquired in round r. At the
end of each round, we apply the elimination rule and, if it does not reduce the set
sufficiently (e.g., because the intervals are still wide), we perform a hard halving step
by retaining only the ⌈|S(r)|/2⌉
architectures with smallest LCB, i.e.,
S(r + 1) = Top⌈|S(r)|/2⌉(S(r); a ↦ LCB(a)),
which enforces computational focus while remaining conservative when
uncertainty dominates. In regimes where proxies are informative, the
safety rule typically drives most of the reduction, and the explicit
halving is rarely activated.
The same mechanism can be viewed as a promotion algorithm: each architecture in S starts with a proxy score ĝ(z(a, b0)), and only architectures with sufficiently low confidence bounds are ``promoted'' to receive additional high-fidelity evaluations. Because HiFi(a) is noisy, the algorithm may query the same a multiple times; in practice this can be implemented either as repeated fine-tuning with different seeds or as repeated measurements of a stochastic evaluation pipeline. The confidence interval construction should reflect the effective sample size for repeatedly evaluated architectures (e.g., by incorporating empirical variance or by treating repeats as additional exchangeable calibration points).
At termination (after m
high-fidelity queries), we output either (i) the empirical best among
queried points, â ∈ arg mina ∈ {a1, …, am}y(a),
or (ii) the proxy-calibrated minimizer over the surviving set,
â ∈ arg mina ∈ Smĝm(z(a, b0)),
together with {CIm(a)}a ∈ Sm.
The latter is preferable when |Sm| is still
large but the predictor is well calibrated; the former is robust when
misspecification is severe and we prefer to trust only direct
measurements.
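Putting the pieces together, a compact sketch of the promotion/elimination loop; it reuses `split_conformal` from the earlier sketch, omits the explicit halving fallback for brevity, and returns the empirical-best output rule. All names and defaults are illustrative assumptions.

```python
import numpy as np

def cash_nas(arch_set, Z, hifi, fit, m, m0, epsilon, alpha=0.05, seed=0):
    # `Z` is the precomputed proxy table {a: z(a, b0)}; `fit` trains a
    # predictor on labeled (feature, value) pairs.
    rng = np.random.default_rng(seed)
    S = list(arch_set)
    init = [S[i] for i in rng.choice(len(S), size=m0, replace=False)]
    labeled = [(a, hifi(a)) for a in init]        # warm-start calibration
    for _ in range(m - m0):
        Zl = [Z[a] for a, _ in labeled]
        yl = [y for _, y in labeled]
        g_hat, q = split_conformal(Zl, yl, fit, alpha=alpha)
        lcb = {a: g_hat(Z[a]) - q for a in S}
        ucb = {a: g_hat(Z[a]) + q for a in S}
        best_ucb = min(ucb.values())
        # Safe elimination: drop a once LCB(a) > min_a' UCB(a') + eps/2.
        S = [a for a in S if lcb[a] <= best_ucb + epsilon / 2]
        # Promote the LCB minimizer among survivors to a HiFi evaluation.
        a_next = min(S, key=lambda a: lcb[a])
        labeled.append((a_next, hifi(a_next)))
    # Robust output rule: empirical best among directly queried points.
    return min(labeled, key=lambda pair: pair[1])[0]
```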
When the objective aggregates performance across tasks t ∈ [T], we may
define
$$
F(a)=\frac{1}{T}\sum_{t=1}^T F_t(a),
\qquad
\mathsf{HiFi}(a)=F(a)+\xi,
$$
where Ft(a)
is the full-fidelity metric on task t. Operationally, we can either (i)
compute HiFi(a) by running all
tasks and averaging, or (ii) query tasks sequentially and maintain an
online estimate of F(a) with task-level
uncertainty. In case (ii), we can incorporate task-level partial
averages into the confidence bounds, e.g., defining ya, s
as the mean over the first s
tasks and using a union bound over s to preserve correctness of
elimination. For proxy features, we may either featurize each
task-specific prefix and concatenate, or include task-invariant
summaries (e.g., mean/variance across tasks) when T is large.
If we impose a hard device constraint (latency, memory, parameter count), let C(a) be a deterministic cost computed from the architecture descriptor, and restrict to the feasible set 𝒜feas = {a ∈ 𝒜 : C(a) ≤ Cmax} before running CASH-NAS. If the device metric is itself noisy (e.g., measured latency), we treat it as an additional high-fidelity output and maintain a calibrated interval CIC(a) in parallel, eliminating architectures that are infeasible with high confidence. For soft tradeoffs, we may optimize a scalarized objective Fλ(a) = F(a) + λC(a) for a chosen λ, or maintain a small Pareto set by running CASH-NAS for several λ values using the same proxy table. These extensions preserve the central structure: low-fidelity features provide a global covariate map over 𝒜, while a small number of calibrated high-fidelity measurements controls both selection and safe elimination.
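A sketch of the constrained interface under these conventions; `cost`, `C_max`, and `lam` are deployment-chosen stand-ins rather than prescribed quantities.

```python
def feasible(arch_set, cost, C_max):
    # Hard filtering under a deterministic, descriptor-derived cost C(a).
    return [a for a in arch_set if cost(a) <= C_max]

def scalarized_objective(g_hat, Z, cost, lam):
    # Soft tradeoff F_lambda(a) = g_hat(z_a) + lam * C(a); sweeping `lam`
    # over a grid traces an approximate Pareto set from one proxy table.
    return lambda a: g_hat(Z[a]) + lam * cost(a)
```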
The subsequent section formalizes the statistical guarantees that connect calibrated prediction error to ranking consistency and ϵ-optimal selection, and quantifies the required number of high-fidelity evaluations in terms of (d, σ2, ϵ, δ) and |𝒜|.
We now formalize the sense in which a small number of calibrated high-fidelity labels suffices to (i) recover a near-correct ranking over 𝒜 from proxy features, and (ii) select an ϵ-optimal architecture. We also record an anytime (promotion-style) guarantee for the confidence-bound variant that repeatedly queries and eliminates.
Throughout, we work under the standing assumptions of the enclosing scope: there exists g* ∈ 𝒢 such that supa ∈ 𝒜|F(a) − g*(z(a, b0))| ≤ η, the oracle noise in HiFi(a) is conditionally σ2-sub-Gaussian, and pdim(𝒢) ≤ d. We write n := |𝒜|.
Let a1, …, am
be drawn i.i.d. uniformly from 𝒜, and
let yi = HiFi(ai).
Consider an empirical risk minimizer (or regularized ERM) ĝ ∈ 𝒢 fit on {(z(ai, b0), yi)}i = 1m.
Standard uniform convergence for real-valued hypothesis classes (via
pseudo-dimension control) yields a uniform deviation bound of the
form
$$
\sup_{a\in\mathcal{A}} \bigl|\hat g(z(a,b_0))-F(a)\bigr|
\;\le\;
\eta + c\,\sigma\sqrt{\frac{d+\log(n/\delta)}{m}}
\;=:\; \gamma_m,
$$
with probability at least 1 − δ, for a universal constant
c > 0. The decomposition
γm = η + (estimation)
is conceptually important: the low-fidelity budget b0 and feature map z(⋅, b0) control
η (a misspecification term),
while m controls only the
estimation term. In particular, the high-fidelity sample complexity can
be small only insofar as η is
commensurately small.
We next show that a uniform additive error guarantee immediately
implies that the predicted ranking is correct except possibly on pairs
whose true gaps are small. Fix any ĝ satisfying supa|ĝ(z(a, b0)) − F(a)| ≤ γ
for some γ ≥ 0. Consider two
architectures a, a′ ∈ 𝒜. If
the ordering induced by ĝ
disagrees with that induced by F, then necessarily
$$
|F(a)-F(a')| \;\le\; 2\gamma.
$$
Indeed, the disagreement implies
F(a) ≥ ĝ(z(a, b0)) − γ > ĝ(z(a′, b0)) − γ ≥ F(a′) − 2γ,
which rearranges to F(a) − F(a′) > −2γ.
Symmetrizing over which item is truly better yields |F(a) − F(a′)| ≤ 2γ.
Thus, large true gaps are preserved under uniform prediction error.
A convenient aggregate summary is obtained by translating to a
Kendall-τ disagreement bound.
Let τ(ĝ, F)
denote the fraction of pairs (a, a′) whose
ordering differs under ĝ and
F (equivalently, normalized
Kendall-τ distance). Define
the gap-tail function
ρ(γ) := Pr(a, a′)[ |F(a) − F(a′)| ≤ 2γ ],
where (a, a′) is
uniform over unordered pairs from 𝒜.
Then a uniform error bound supa ∈ 𝒜|ĝ(z(a, b0)) − F(a)| ≤ γ implies the deterministic inequality
$$
\tau(\hat g, F)\;\le\;\rho(\gamma).
$$
Combining with the uniform deviation bound (taking γ = γm)
yields a quantitative statement: as m increases, the estimation term
shrinks as m−1/2,
and the fraction of misranked pairs is controlled by the prevalence of
near-ties in the true objective landscape at scale 2γm. In
particular, if the objective has few near-ties, the ranking induced by
the calibrated predictor is accurate even when γm is not yet
small enough for ϵ-optimal
selection.
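The pairwise statement is easy to check numerically; the following self-contained snippet verifies τ(ĝ, F) ≤ ρ(γ) under a uniform perturbation of size γ. The values are synthetic, for illustration only.

```python
import numpy as np
from itertools import combinations

def kendall_disagreement(F, G):
    # Fraction of unordered pairs ordered differently by F and G.
    pairs = list(combinations(range(len(F)), 2))
    bad = sum((F[i] - F[j]) * (G[i] - G[j]) < 0 for i, j in pairs)
    return bad / len(pairs)

def gap_tail(F, gamma):
    # rho(gamma): fraction of pairs with |F(a) - F(a')| <= 2 * gamma.
    pairs = list(combinations(range(len(F)), 2))
    return sum(abs(F[i] - F[j]) <= 2 * gamma for i, j in pairs) / len(pairs)

rng = np.random.default_rng(0)
F = rng.normal(size=200)                      # synthetic true objective
gamma = 0.05
G = F + rng.uniform(-gamma, gamma, size=200)  # uniform-error predictor
assert kendall_disagreement(F, G) <= gap_tail(F, gamma)
```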
We now connect uniform prediction error to optimization. Let a* ∈ arg mina ∈ 𝒜F(a)
be any optimal architecture. If â ∈ arg mina ∈ 𝒜ĝ(z(a, b0)),
then whenever supa|ĝ(z(a, b0)) − F(a)| ≤ γ,
$$
F(\hat a)\;\le\;\hat g(z(\hat a,b_0))+\gamma\;\le\;\hat g(z(a^{*},b_0))+\gamma\;\le\;F(a^{*})+2\gamma.
$$
Thus, any procedure that produces a predictor with a uniform error bound
γ yields a 2γ-optimal architecture by simply
minimizing ĝ over the proxy
table.
To obtain an ϵ-optimality
guarantee, it therefore suffices to enforce γm ≤ ϵ/2.
Under the distortion condition η ≤ ϵ/4, we may choose
m large enough so that the
estimation term in the uniform bound is at most ϵ/4, i.e.,
$$
c\,\sigma\sqrt{\frac{d+\log(n/\delta)}{m}}
\;\le\;\frac{\epsilon}{4}.
$$
Equivalently, it suffices to take
$$
m \;=\; \tilde O\!\left(\sigma^2\,\frac{d+\log(n/\delta)}{\epsilon^2}\right),
$$
up to universal constants (and up to polylogarithmic factors if one uses
a more refined capacity measure or a more elaborate calibration
mechanism). Plugging γ = ϵ/2 into the selection bound yields F(â) ≤ F(a*) + ϵ
with probability at least 1 − δ. Notably, the dependence on
n enters only through log n, reflecting that the
high-fidelity labels are used primarily to learn a predictor over a
finite covariate set rather than to exhaustively evaluate all
architectures.
We finally record a bound for the confidence-bound
(promotion/elimination) variant that adaptively concentrates
high-fidelity queries on promising architectures. Let CIt(a) = [LCBt(a), UCBt(a)]
be calibrated intervals such that, with probability at least 1 − δ, we have simultaneous
coverage F(a) ∈ CIt(a)
for all a that are used in
elimination or selection decisions up to time t. Suppose we eliminate an
architecture a once
$$
\mathrm{LCB}_t(a) \;>\; \min_{a'} \mathrm{UCB}_t(a') \;+\; \frac{\epsilon}{2}.
$$
Under the stated coverage event, the optimal a* is never eliminated:
if a* were removed,
then the elimination condition would imply
F(a*) ≥ LCBt(a*) > mina′UCBt(a′) + ϵ/2 ≥ F(a*) + ϵ/2,
a contradiction. Hence the candidate set always retains an ϵ-optimal element once intervals are
sufficiently tight.
To quantify the number of queries devoted to suboptimal
architectures, let Δa := F(a) − F(a*) ≥ 0.
In typical constructions (including sub-Gaussian concentration with
union bounds, or conformal intervals with an appropriate correction),
the half-width of CIt(a)
scales as
$\mathrm{width}_t(a)\asymp
\sigma\sqrt{\log(n/\delta)/N_t(a)}$
where Nt(a)
is the number of high-fidelity evaluations of a up to time t. Under such a width law, once
Nt(a)
is large enough that widtht(a) ≪ Δa,
the elimination condition will trigger and remove a. Consequently, each suboptimal
architecture is queried only on the order of
$$
N(a)\;=\;\tilde O\!\left(\frac{\sigma^2}{\Delta_a^2}\right)
$$
times before elimination, where Õ(⋅) hides logarithmic factors in
n and 1/δ. This gap-dependent behavior
implies an anytime simple-regret guarantee: after any budget t, the best surviving architecture
(or the LCB-minimizer) is near-optimal whenever all architectures with
Δa ≫ widtht(a)
have been eliminated. In particular, after m queries, the output is ϵ-optimal with probability at least
1 − δ, and the total effort
concentrates on the near-tied region around a*, with far-from-optimal
architectures incurring only O(1/Δa2)
high-fidelity cost.
Taken together, these results show that calibrated multi-fidelity NAS can attain ϵ-optimality using Õ((d + log n)σ2/ϵ2) high-fidelity evaluations, while the ranking and gap-dependent elimination analyses explain why, in practice, accurate ranking and aggressive pruning may occur substantially earlier than the worst-case ϵ-optimality threshold.
We complement the upper bounds of the preceding section with two lower-bound statements. The first isolates a structural necessity: without an explicit linkage between the low-fidelity proxy information and the full-fidelity objective, no algorithm that relies only on low-fidelity data can offer nontrivial worst-case guarantees. The second shows that even under the strongest possible linkage (no distortion, i.e. η = 0), the number of high-fidelity evaluations must scale as Ω(1/ϵ2) in the target accuracy due to the irreducible noise in HiFi. Together, these results justify both components of a calibrated multi-fidelity approach: (i) an assumption such as bounded distortion to connect proxy features to the target metric, and (ii) a non-negligible number of high-fidelity evaluations to overcome stochastic uncertainty.
Consider any algorithm ℳ which, after observing the full low-fidelity transcript (e.g. {LoFi(a, b0)}a ∈ 𝒜 and hence the entire proxy table {z(a, b0)}a ∈ 𝒜), outputs an architecture â ∈ 𝒜. If we impose no relationship between the mapping a ↦ z(a, b0) and the full-fidelity values F(a), then the proxy transcript alone does not identify the minimizer of F even in principle. Formally, one may construct two full-fidelity objectives F1 and F2 that are both consistent with the same low-fidelity observations yet have disjoint minimizers, forcing ℳ to fail on at least one instance.
A particularly direct indistinguishability construction suffices. Fix
any proxy table {z(a, b0)}a ∈ 𝒜
(or even let all proxies be identical, z(a, b0) ≡ z0),
and pick two distinct architectures a(1) ≠ a(2).
Define two objectives F1, F2 : 𝒜 → ℝ
by
F1 (a(1)) = 0, F1(a) = 1 ∀a ≠ a(1), F2 (a(2)) = 0, F2(a) = 1 ∀a ≠ a(2).
By construction, the low-fidelity transcript (hence the observed
proxies) is identical under F1 and F2, since we have not
assumed that LoFi encodes any
information about F. Therefore
ℳ must output the same (possibly
randomized) distribution over â under both objectives. In
particular, letting ℙj denote probability
under objective Fj and the
algorithmic randomness of ℳ, we
have
ℙ1 [â = a(1)] = ℙ2 [â = a(1)], ℙ1 [â = a(2)] = ℙ2 [â = a(2)].
Hence at least one of ℙ1[â ≠ a(1)]
or ℙ2[â ≠ a(2)]
is at least 1/2. On that objective, the
returned architecture suffers suboptimality at least 1. This yields a worst-case constant lower
bound:
$$
\max\Bigl\{\mathbb{E}_{F_1}\bigl[F_1(\hat a)-\min_a F_1(a)\bigr],\;
\mathbb{E}_{F_2}\bigl[F_2(\hat a)-\min_a
F_2(a)\bigr]\Bigr\}\;\ge\;\frac{1}{2}.
$$
The same argument extends to any desired constant gap c > 0 by scaling the non-optimal
values. Crucially, this is not a limitation of any particular choice of
proxy features; it is a limitation of the information pattern. If the
proxy transcript is not assumed to constrain F (e.g. via bounded distortion or a
related structural condition), then the minimizer of F is not identifiable from proxy
data alone, and no nontrivial approximation guarantee is possible
uniformly over all F.
This impossibility result should be read as a necessity statement for assumptions like (A1). In our setting, (A1) can be viewed as a minimal identifiability condition: it excludes adversarial instances in which low-fidelity observations are statistically independent of F. Without such an exclusion, the problem degenerates to a no-free-lunch regime.
We next show that high-fidelity evaluation is information-theoretically necessary at the scale 1/ϵ2, even if proxies are perfectly informative in the sense that η = 0 and 𝒢 contains the true mapping F(a) = g*(z(a, b0)). The point is that, in the presence of σ2-sub-Gaussian noise in HiFi, identifying an ϵ-optimal architecture requires enough repetitions to statistically resolve gaps of size ϵ. This remains true regardless of how informative z(⋅, b0) is, because the algorithm still must determine where the true optimum lies among near-tied candidates, and the only access to F is via noisy labels.
We formalize this by reduction to binary hypothesis testing
(equivalently, the two-arm best-arm identification lower bound).
Consider the special case |𝒜| = 2, with
architectures a1, a2.
Fix any proxy features z(a1, b0), z(a2, b0);
we may even allow them to be distinct and known to the learner. Define
two full-fidelity instances F(+) and F(−) such that the
identity of the better architecture flips:
F(+)(a1) = 0, F(+)(a2) = ϵ, F(−)(a1) = ϵ, F(−)(a2) = 0.
Assume the high-fidelity oracle returns y = HiFi(a) = F(±)(a) + ξ,
where ξ is centered σ2-sub-Gaussian. Any
algorithm that is ϵ-optimal
with probability at least 1 − δ must, on instance F(+), output a1 with probability at
least 1 − δ, and on instance
F(−), output a2 with probability at
least 1 − δ; otherwise the
error exceeds ϵ.
Let the algorithm make a (possibly adaptive) total of m high-fidelity queries, producing a
transcript T consisting of
queried indices and observed noisy values. Denote by ℙ+ and ℙ− the distributions of T under F(+) and F(−), respectively.
Standard change-of-measure arguments (e.g. Le Cam, Bretagnolle–Huber
inequalities, or bandit lower bounds of Mannor–Tsitsiklis type) imply
that if a decision rule distinguishes two hypotheses with error
probability at most δ, then
the KL divergence satisfies
$$
\mathrm{KL}\bigl(\mathbb{P}_{+}\,\|\,\mathbb{P}_{-}\bigr)\;\ge\;\log\frac{1}{4\delta}.
$$
On the other hand, for sub-Gaussian (in particular Gaussian) noise
models, each query to a
contributes at most on the order of (F(+)(a) − F(−)(a))2/σ2
to the KL divergence between the two induced observation laws. In our
two-instance construction, the mean shift is ϵ for whichever arm is queried, and
thus the per-sample KL is O(ϵ2/σ2).
Summing over m queries yields
the upper bound
$$
\mathrm{KL}\bigl(\mathbb{P}_{+}\,\|\,\mathbb{P}_{-}\bigr)\;\le\;C\,\frac{m\,\epsilon^{2}}{\sigma^{2}},
$$
for a universal constant C > 0 (with C = 1/2 in the Gaussian case).
Combining the two displays gives
$$
m \;\ge\;
c'\,\frac{\sigma^2}{\epsilon^2}\,\log\!\left(\frac{1}{\delta}\right)
$$
for a universal constant c′ > 0. This
establishes the claimed lower bound Ω(σ2log (1/δ)/ϵ2)
on the number of high-fidelity evaluations required to guarantee ϵ-optimality with confidence 1 − δ, even in the most favorable
proxy regime.
The foregoing lower bounds align with the upper bounds of the preceding section in two complementary ways. First, they show that the bounded-distortion linkage (or an equivalent structural assumption) is not merely convenient but logically necessary to avoid proxy-only no-free-lunch behavior. Second, they show that the ϵ−2 dependence in the high-fidelity sample complexity is not an artifact of our analysis: it is forced by the stochastic nature of HiFi and persists even when proxies are fully informative up to zero distortion. Thus, the upper bound m = Õ(σ2(d + log n)/ϵ2) can be interpreted as near-tight: improvements can only come from reducing the effective complexity term (e.g. via a smaller hypothesis class or additional structure) and from exploiting favorable gap structure (as in promotion/elimination analyses), but not from beating the fundamental 1/ϵ2 scaling in the worst case.
We outline an experimental program whose purpose is twofold: (i) to instantiate the multi-fidelity protocol in a way that respects realistic adapter-training practice, and (ii) to stress-test the theoretical levers (distortion, calibration, and high-fidelity sample complexity) under controlled ablations. Throughout, we fix a frozen base model M0 and a discrete architecture set 𝒜 of adapter/routing configurations, and we treat full-fidelity evaluation as a standardized fine-tuning recipe whose output metric is F(a) for each a ∈ 𝒜.
We construct 𝒜 as a Cartesian product of (a) , (b) , and (c) options. Placements include (i) attention-only adapters, (ii) MLP-only adapters, and (iii) adapters at both sublayers; we further allow layer subsets (e.g. every layer, every other layer, last k layers) to expose depth/compute tradeoffs. Parameterizations include LoRA-style low-rank updates with rank r ∈ {2, 4, 8, 16, 32}, bottleneck adapters with bottleneck width h ∈ {8, 16, 32, 64}, and optionally IA3-style multiplicative vectors. Routing choices include (i) a fixed shared adapter across tasks, (ii) task-conditional selection among K experts, and (iii) soft mixture-of-experts gates constrained to a small set of templates (to keep 𝒜 finite). We record for each a a descriptor vector (layer mask, rank/bottleneck, parameter count, and routing template) which becomes part of z(a, b0). The resulting n = |𝒜| should be large enough to make exhaustive high-fidelity search unrealistic while still enabling a full low-fidelity sweep (e.g. n ∈ [103, 105], depending on device availability).
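A hypothetical enumeration of such a product space is sketched below; the option lists mirror the text, while the field names and counts are illustrative (a real pipeline would also expand the layer-subset and routing templates into concrete masks).

```python
from itertools import product

PLACEMENTS = ("attn_only", "mlp_only", "both")
LAYER_SUBSETS = ("all", "every_other", "last_k")
PARAMS = tuple(("lora", r) for r in (2, 4, 8, 16, 32)) \
       + tuple(("bottleneck", h) for h in (8, 16, 32, 64)) \
       + (("ia3", None),)
ROUTING = ("shared", "task_select", "soft_moe_template")

ARCH_SET = [
    {"placement": p, "layers": s, "param": q, "routing": r}
    for p, s, q, r in product(PLACEMENTS, LAYER_SUBSETS, PARAMS, ROUTING)
]
print(len(ARCH_SET))   # 3 * 3 * 10 * 3 = 270 base combinations
```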
We evaluate on a heterogeneous task suite to ensure that proxy behavior is not an artifact of a single benchmark. Concretely, we include: (i) sentence-level classification (e.g. GLUE-style), (ii) token-level tasks (e.g. named entity recognition), (iii) generation (summarization and question answering), and (iv) instruction-following or preference-aligned generation if applicable to M0. For each task we fix preprocessing, maximum sequence length, and a metric definition; F(a) is the validation loss (or negative metric) over tasks under a common adapter training recipe, to match the theory’s single scalar objective. We standardize training hyperparameters across a (learning rate schedule, batch size, optimizer, number of steps) except for any hyperparameters that are intrinsically coupled to the architecture (e.g. parameter count-dependent weight decay), which we either fix or explicitly include as a controlled factor in 𝒜. To prevent leakage of low-fidelity information into the final selection criterion, we compute F(a) only from full-fidelity runs and reserve a held-out evaluation split (or held-out tasks) for final reporting.
Because the algorithmic motivation is compute efficiency, we measure cost on multiple devices: a high-end datacenter GPU (e.g. A100/H100), a mid-range GPU, and at least one alternative accelerator (e.g. TPU or an inference-optimized GPU) if available. For each device we report (i) wall-clock time per low-fidelity query cL(b0), (ii) wall-clock time per high-fidelity query cH, and (iii) optional energy consumption (via vendor power telemetry). When hardware-aware considerations are included, we also compute per-architecture latency/memory proxies (parameter count, activation checkpointing overhead) and, for a subset, measured end-to-end inference latency. This enables subsequent extension to Pareto selection, while keeping the primary objective F fixed for the present section.
We set a fixed low-fidelity budget b0 (e.g. a small number of steps/epochs) used for all a. From LoFi(a, b0) we compute learning-curve statistics: initial loss, slope over early steps, smoothed minimum, curvature, and (when applicable) gradient-norm summaries. We concatenate these with architecture descriptors to form z(a, b0) ∈ ℝd. We emphasize that the proxy is not simply an ``early stopping score''; rather, it is a feature representation intended to support learning a predictor ĝ ∈ 𝒢 that is calibrated against HiFi. We predefine several feature sets of increasing richness to support ablations: descriptors-only, curve-only, and combined.
We implement several choices of 𝒢, ranging from linear models and kernel ridge regression to gradient-boosted trees and small neural regressors, each with controlled capacity (effective d). For calibration, we recommend a split-conformal or cross-conformal procedure on the growing high-fidelity dataset D, producing prediction intervals CIt(a) and thus LCBt(a), UCBt(a). We evaluate calibration by (i) empirical coverage versus nominal coverage across queried points and (ii) coverage conditional on predicted score quantiles (reliability-style curves). We also report interval width as a function of m and examine whether the elimination rule would have removed any architecture later found to be near-optimal (a practical check of elimination safety). Since the theory is phrased in terms of uniform error, we additionally compute the maximum absolute error over a held-out set of architectures with full-fidelity labels (when budget permits) and relate it to observed Kendall-τ rank correlation.
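For the coverage audit, a minimal helper is sketched below, assuming a hypothetical table `F_true` of full-fidelity labels on a held-out architecture subset.

```python
import numpy as np

def empirical_coverage(F_true, g_hat, Z, q_hat, archs):
    # Fraction of architectures whose true value lands in CI(a); compare
    # against the nominal 1 - alpha level to audit calibration quality.
    hits = [abs(F_true[a] - g_hat(Z[a])) <= q_hat for a in archs]
    return float(np.mean(hits))
```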
We include ablations that each remove one ingredient of the multi-fidelity method. We compare against baselines that cover common NAS practice under adapter tuning: (i) random search with m full-fidelity trials; (ii) proxy-only selection using early validation loss at b0 (no calibration); (iii) single-fidelity Bayesian optimization (Gaussian process or tree-structured Parzen estimators) using only high-fidelity evaluations; (iv) multi-fidelity bandit methods such as Successive Halving/ASHA/Hyperband using partial budgets, with a fairness constraint matching total compute (sketched below); and (v) regression-based surrogates trained on the same m high-fidelity points but without conformal calibration or without low-fidelity features. We report both optimization performance (best F achieved) and resource usage (time/energy), emphasizing matched-budget comparisons.
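For concreteness, a minimal version of the successive-halving baseline (iv) with an explicit compute tally is sketched below; `train_eval` is a hypothetical oracle mapping an architecture and a partial budget to a validation loss, and the tally is what we check against the matched total-compute constraint.

```python
# Sketch of successive halving with a compute tally for matched-budget comparison.
def successive_halving(candidates, train_eval, min_budget, eta=3):
    """Keep the best 1/eta fraction at each rung; multiply the budget by eta."""
    rung, budget, total_cost = list(candidates), min_budget, 0
    while len(rung) > 1:
        scores = {a: train_eval(a, budget) for a in rung}  # lower is better
        total_cost += budget * len(rung)
        rung = sorted(rung, key=scores.get)[: max(1, len(rung) // eta)]
        budget *= eta
    return rung[0], total_cost  # compare total_cost against m * c_H for fairness
```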
We will release: (i) the explicit definition of 𝒜 (enumeration code and counts), (ii) the exact low- and high-fidelity training recipes (optimizer, schedule, batch size, gradient clipping, mixed precision, checkpointing), (iii) the featurization code for z(a, b0) and the predictor training configuration, (iv) random seeds for architecture sampling and training, (v) dataset versions and preprocessing scripts, and (vi) device specifications and software environment (CUDA/cuDNN, framework versions). All results will be accompanied by confidence intervals over multiple runs (at least over the algorithm’s randomness and, when feasible, over training seeds), and we will log all queried architectures and their observed HiFi values to permit exact replay of the selection trajectory.
We conclude by clarifying which aspects of the analysis are specific to the adapter regime, what failure modes we expect when those conditions are violated, and how the same multi-fidelity viewpoint can be extended to hardware-aware and safety-aware architecture selection.
The central practical advantage of restricting to adapter/routing architectures on a frozen M0 is that the search space 𝒜 can be made large while keeping both fidelities well-defined and reasonably comparable across a ∈ 𝒜. In particular, the low-fidelity oracle LoFi(a, b0) is meaningful only insofar as the early learning dynamics share a stable relationship with the full-fidelity outcome F(a). Freezing M0 and varying only lightweight modules tends to reduce catastrophic non-stationarities in optimization (e.g. divergent runs, sharp regime changes in loss curves), and it keeps the per-step compute cost closer across architectures. Both effects directly support the bounded-distortion premise (A1), namely that there exists a single g* ∈ 𝒢 such that F(a) ≈ g*(z(a, b0)) uniformly over a. When we move outside this regime, the primary risk is not simply that prediction becomes harder, but that the mismatch parameter η ceases to be small in any hypothesis class with controllable complexity.
Our guarantees are additive and explicitly pay the price η, so the statement “what breaks” is concrete: if η is on the order of the target ϵ (or larger), no amount of additional high-fidelity sampling can compensate, because even perfect estimation within 𝒢 cannot identify an ϵ-optimal architecture. There are several mechanisms by which η can increase outside adapter-style search: (i) if we unfreeze substantial parts of M0, early training curves can be dominated by transient effects (e.g. rapid representation drift) that are weakly predictive of the final fine-tuned state; (ii) if architectures change optimization conditioning (e.g. adding deep non-residual modules, changing normalization, or altering effective depth), then the same budget b0 may correspond to qualitatively different training phases across a, breaking the comparability of LoFi(a, b0); (iii) if we allow optimizer or schedule choices to vary across a in ways not represented in z(a, b0), then two architectures may share similar prefixes for incidental reasons while diverging later. In all cases, a pragmatic symptom is that the empirical Kendall-τ saturates well below what would be implied by the observed proxy noise alone, indicating structural ranking errors consistent with a large η; a simple diagnostic is sketched below.
The two-fidelity abstraction implicitly assumes cH ≫ cL(b0) and that we can afford O(|𝒜|) low-fidelity runs. This can fail in regimes where (a) the model is so large that even a short prefix has high wall-clock and memory cost, (b) the search includes architectures whose per-step cost varies widely (e.g. heavy routing or large experts), or (c) the system is I/O-bound, so that launching many partial runs is itself expensive. When these deviations occur, the appropriate modification is to treat low-fidelity queries as budgeted and selective rather than exhaustive. Conceptually, we then have a third optimization layer: we must decide which architectures to probe (i.e., acquire z(a, b0)) before deciding which to query at high fidelity. This suggests a triage protocol in which cheap descriptor-only features are used to choose a subset for partial training, and only then do we proceed to calibrated high-fidelity querying. The theoretical counterpart would replace uniform low-fidelity access with an adaptive design that controls the induced selection bias in ĝ, which is nontrivial because both the training set and the evaluation set become adaptively filtered.
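The following minimal sketch makes the three layers explicit; `descriptor_score`, `lofi_features`, and `hifi_select` are hypothetical placeholders for descriptor-only ranking, partial training, and calibrated high-fidelity selection, respectively.

```python
# Three-layer triage: cheap descriptors -> budgeted LoFi -> calibrated HiFi.
def triage(candidates, descriptor_score, lofi_features, hifi_select,
           k_lofi, k_hifi):
    shortlist = sorted(candidates, key=descriptor_score)[:k_lofi]  # layer 1
    feats = {a: lofi_features(a) for a in shortlist}  # layer 2: budgeted, not exhaustive
    return hifi_select(feats, k_hifi)                 # layer 3: e.g. LCB-based selection
```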
Although we have treated F as a scalar objective, adapter architectures are often deployed under explicit constraints on latency, memory, or energy. A natural extension is to define a vector-valued objective J(a) = (F(a), L(a), M(a), E(a)), where L, M, E denote device-specific inference latency, peak memory, and energy (or proxy measurements), and to seek a Pareto-optimal set rather than a single minimizer. The multi-fidelity structure remains useful, but the selection logic must change: instead of eliminating architectures whose lower confidence bound is worse than the current best upper confidence bound on a scalar, we eliminate architectures that are Pareto-dominated with high confidence. Concretely, if we can construct calibrated sets CItF(a), CItL(a), etc., then we may declare a safely dominated by a′ whenever sup CItF(a′) ≤ inf CItF(a), sup CItL(a′) ≤ inf CItL(a), and so on for each objective, with at least one strict inequality, and remove a from the candidate set (see the sketch after this paragraph). We emphasize that this is more delicate than scalarization: weighted sums can miss non-convex Pareto fronts and can behave poorly when objectives have different noise scales. From an experimental standpoint, this motivates measuring not only F(a) but also device-dependent metrics for a small subset and learning predictors for each component using shared features z(a, b0) augmented with hardware descriptors.
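A minimal implementation of the interval-based domination rule might look as follows; every objective is assumed to be minimized, and each design carries one calibrated interval (lo, hi) per objective.

```python
# Eliminate designs that are Pareto-dominated with high confidence.
def safely_dominated(ci_a, ci_b):
    """True if b dominates a: sup CI^j(b) <= inf CI^j(a) on every
    objective j, with strict inequality on at least one."""
    leq = all(hi_b <= lo_a for (lo_a, _), (_, hi_b) in zip(ci_a, ci_b))
    strict = any(hi_b < lo_a for (lo_a, _), (_, hi_b) in zip(ci_a, ci_b))
    return leq and strict

def prune(ci):  # ci: design -> [(lo, hi) per objective]
    return {a for a in ci
            if not any(b != a and safely_dominated(ci[a], ci[b]) for b in ci)}
```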
A further complication is that both accuracy and cost can be device- and batch-dependent. For example, routing decisions may be cheap on one accelerator and expensive on another due to kernel support, and quantization-aware or compilation-aware effects can reorder architectures by latency even when parameter counts are similar. If the deployment environment is uncertain (or heterogeneous), the appropriate objective is closer to a distributional criterion, such as minimizing expected F(a) subject to high-probability latency constraints across a device distribution. Methodologically, this suggests conditioning the predictor on a device descriptor u and learning g(z, u) ≈ F (and similarly for L, M, E), or learning a family of predictors indexed by device. The bounded-distortion requirement then becomes uniform over both a and u, which is plausible only if u captures the relevant hardware variation. Practically, this motivates reporting transfer: fit ĝ on one device, then evaluate calibration and ranking error on another to quantify how far we are from such uniformity.
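One plausible instantiation is sketched below: a single predictor conditioned on a device descriptor u, together with the suggested transfer report (fit on device A, evaluate calibration and ranking on device B). The ridge model and the concatenated feature layout are assumptions for illustration.

```python
# Device-conditioned predictor g(z, u) and a cross-device transfer report.
import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import Ridge

def fit_device_conditioned(Z, U, y):
    """Learn g(z, u) on proxy features Z concatenated with device descriptors U."""
    return Ridge(alpha=1.0).fit(np.hstack([Z, U]), y)

def transfer_report(model, Z_b, U_b, y_b):
    """Evaluate on a device unseen at fit time to probe uniformity over u."""
    pred = model.predict(np.hstack([Z_b, U_b]))
    return {"max_abs_err": float(np.max(np.abs(pred - y_b))),
            "kendall_tau": float(kendalltau(pred, y_b)[0])}
```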
Many realistic adapter-tuning scenarios are federated: low-fidelity training (or even high-fidelity training) occurs on private data silos, and only limited statistics can be shared. In this setting, LoFi(a, b0) may be available at many clients, but HiFi(a) may be rare or centralized. The algorithmic question becomes: can we aggregate proxy features z(a, b0) and a small number of high-fidelity labels without leaking private information? One direction is to design z to be privacy-preserving by construction (e.g. sharing only coarse learning-curve summaries, clipped and noised) and then apply the same calibration-and-selection logic. Another is to treat each client as a separate task distribution and extend the objective to a robust criterion (e.g. worst-client performance), in which case we would require a proxy-to-true linkage that holds uniformly across clients. This regime makes explicit the need for uncertainty quantification: even if mean performance is predictable, tail behavior across clients may not be.
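A sketch of the first direction (sharing only clipped, noised curve summaries) follows; the clip bound and noise scale are placeholders, and a real deployment would set them via a formal differential-privacy accountant rather than the ad hoc values shown here.

```python
# Client-side privatization of low-fidelity curve summaries before sharing.
import numpy as np

def privatize_summary(stats, clip, sigma, rng):
    """Clip the summary vector to L2 norm `clip`, then add Gaussian noise
    scaled to the clip bound (Gaussian-mechanism style)."""
    norm = np.linalg.norm(stats)
    clipped = stats * min(1.0, clip / max(norm, 1e-12))
    return clipped + rng.normal(0.0, sigma * clip, size=stats.shape)

# Example: shared = privatize_summary(curve_stats, clip=1.0, sigma=0.5,
#                                     rng=np.random.default_rng(0))
```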
Finally, we note that practical model selection increasingly includes safety-related criteria (toxicity, bias metrics, jailbreak robustness, hallucination rate) that may be expensive to evaluate and may not correlate with early training curves. In our framework, these can be incorporated either as additional components of a vector-valued objective (leading again to Pareto selection) or as constraints (e.g. minimize F(a) subject to safety score ≤ τ). The main conceptual warning is that bounded distortion may hold for F but fail for safety metrics, even within adapters: two architectures with indistinguishable learning-curve prefixes can differ materially in generation behavior. Thus, a realistic protocol may require separate high-fidelity evaluation channels for safety, with their own noise levels and sample-complexity demands, and may need to decouple the predictor classes for performance and safety rather than forcing a single z-based model to explain both.
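Once separate calibrated intervals exist for performance and safety, the constrained variant admits a simple selection rule, as in the sketch below; both interval maps are hypothetical inputs from the two evaluation channels, and the pessimistic argmin over the performance UCB is one defensible choice among several.

```python
# Constrained selection: minimize F subject to a certified safety bound.
def constrained_select(ci_F, ci_safety, tau):
    """Keep designs whose safety UCB meets the constraint with confidence,
    then return the one with the smallest performance UCB (pessimistic)."""
    feasible = [a for a in ci_F if ci_safety[a][1] <= tau]  # UCB_safety <= τ
    if not feasible:
        return None  # nothing certified safe at this confidence level
    return min(feasible, key=lambda a: ci_F[a][1])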
On the theory side, the most pressing extension is to relax the static bounded-distortion condition into something that can be estimated and validated from data. In practice we can estimate a data-dependent distortion proxy (e.g. cross-validated residuals of ĝ on held-out high-fidelity points) and use it to decide whether the multi-fidelity approach is justified, or whether one should revert to more conservative high-fidelity exploration; a sketch of such a proxy appears below. A second direction is to analyze adaptively sampled calibration sets (our pseudocode uses LCB-based selection after warm-start), where uniform-convergence arguments must account for the dependence between {at} and the observed noise. A third direction is to develop guarantees for structured search spaces (combinatorial 𝒜) where one cannot afford even O(|𝒜|) low-fidelity sweeps; here, one would want to combine proxy learning with combinatorial bandits or structured Bayesian optimization while retaining explicit distortion parameters. These extensions would bring the formal model closer to the full range of NAS practice without sacrificing the clarity of the multi-fidelity separation: proxies can guide search only to the extent that their structural linkage to F is stable and calibratable.
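As promised above, an empirical distortion proxy η̂ can be read off out-of-fold residuals, with a ridge model again standing in for ĝ; comparing η̂ to the target ϵ then gates the multi-fidelity pipeline.

```python
# Data-dependent distortion proxy η̂ from cross-validated residuals of ĝ.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def distortion_proxy(Z, y, n_splits=5, quantile=0.95, seed=0):
    """High quantile of out-of-fold |ĝ(z) - F| as an estimate of η."""
    resid = np.empty(len(y), dtype=float)
    for tr, te in KFold(n_splits, shuffle=True, random_state=seed).split(Z):
        g_hat = Ridge(alpha=1.0).fit(Z[tr], y[tr])
        resid[te] = np.abs(y[te] - g_hat.predict(Z[te]))
    return float(np.quantile(resid, quantile))

# Gate: proceed with proxy-guided search only if η̂ is comfortably below the
# target ϵ; otherwise revert to conservative high-fidelity exploration.
```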