Foundation models have shifted the practical burden of adaptation
from training a task model from scratch to choosing an adapter configuration around a fixed
pretrained backbone. In 2026-scale deployments, we typically freeze a
large base model M0
and introduce parameter-efficient modules (e.g., adapters, low-rank
updates, sparse experts, routing policies, or mixtures thereof) whose
placement, width/rank, sharing pattern, and routing logic define a
discrete design a ∈ 𝒜. The
resulting search space is finite but combinatorially large, and the
full-fidelity evaluation F(a)—the outcome of a
standardized fine-tuning recipe measured on a held-out validation
metric—is expensive. Consequently, adapterized neural architecture
search (NAS) is now less a matter of ``can we optimize the
weights?'' and more a matter of ``can we
locate a near-optimal a while
paying for only a small number of expensive fine-tunes?''
We formalize this setting as a two-fidelity optimization problem. A high-fidelity oracle HiFi(a) returns a noisy observation of F(a), and each high-fidelity query corresponds to an end-to-end fine-tuning run at the intended training budget. In contrast, a low-fidelity oracle LoFi(a, b) returns partial-training information at a small budget b, such as an initial learning-curve prefix, early-layer statistics, or optimization traces. From this we compute a feature vector z(a, b) ∈ ℝd that combines architecture descriptors (e.g., adapter placement masks, ranks, parameter counts, routing sparsity) with curve-based statistics (e.g., early loss slope, curvature, gradient norms, or stability indicators). The operative hope is that z(a, b0) at a fixed cheap budget b0 is predictive enough to focus high-fidelity evaluations on a small subset of promising designs.
The central obstacle is proxy mismatch. Early learning curves, partial optimization traces, and architectural descriptors are not, in general, sufficient to recover the ordering induced by F. Mismatch arises from phenomena that are invisible at small b0: late-phase generalization, optimizer schedule interactions, task mixture effects, and representation drift. In adapterized regimes the mismatch is particularly acute, because (i) different placements can yield similar early loss behavior but different asymptotic generalization, (ii) routing or sparsity mechanisms often exhibit delayed specialization, and (iii) small changes in parameterization can alter stability and regularization in ways that manifest only late in training. For these reasons, a purely proxy-driven NAS procedure is unreliable in the worst case: without additional structure linking proxy observations to F, there is no principled way to guarantee near-optimal selection.
Our approach is to separate what can be learned from what must be assumed. We assume only a bounded-distortion linkage between the true objective and an unknown predictor of low-fidelity features: there exists g* ∈ 𝒢 such that |F(a) − g*(z(a, b0))| ≤ η for all a ∈ 𝒜. The parameter η quantifies the irreducible proxy mismatch at budget b0 within the chosen hypothesis class 𝒢. This assumption does not posit that the proxy is correct, nor that it preserves rank exactly; it merely asserts that after appropriate calibration, the proxy-induced representation admits a uniformly accurate predictor up to additive distortion η. This is the weakest form of structural linkage that still permits nontrivial worst-case guarantees, and it is precisely the quantity one expects to improve when the proxy features are strengthened (e.g., via richer curve summaries or better architecture encodings) or when the budget b0 is increased.
Given this linkage, we view high-fidelity evaluations as a scarce supervision resource used to calibrate a predictor on top of low-fidelity features. The algorithmic template is simple: we perform a low-fidelity sweep to compute z(a, b0) for many (often all) a ∈ 𝒜, then spend a limited budget of m high-fidelity queries to train and calibrate a predictor ĝ mapping features to objective values. We finally output the architecture minimizing ĝ (or, in an elimination variant, repeatedly refine a candidate set using calibrated confidence intervals). The methodological point is that the low-fidelity phase is not used to rank architectures directly; it is used to build a representation in which a small number of high-fidelity labels suffices to generalize uniformly across 𝒜.
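To make the template concrete, the following minimal Python sketch implements the three phases; `lofi`, `hifi`, `featurize`, and `fit` are hypothetical stand-ins for the LoFi/HiFi oracles, the featurization map z, and the ERM step developed later, not a prescribed interface.

```python
import random

def one_shot_select(arch_set, lofi, hifi, featurize, fit, m, b0):
    # Phase 1: cheap, parallelizable low-fidelity sweep -> proxy table.
    Z = {a: featurize(a, lofi(a, b0)) for a in arch_set}
    # Phase 2: scarce high-fidelity labels calibrate a predictor g_hat.
    labeled = random.sample(list(arch_set), m)
    g_hat = fit([Z[a] for a in labeled], [hifi(a) for a in labeled])
    # Phase 3: argmin of the calibrated predictor over the whole table.
    return min(arch_set, key=lambda a: g_hat(Z[a]))
```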
Our contributions are organized around guarantees that make this intuition precise.
We fix a finite architecture set 𝒜 around a frozen M0, define F(a) as the full-fidelity objective, and specify two oracles LoFi and HiFi with an explicit cost separation. This abstraction captures modern adapter search pipelines in which partial fine-tunes can be parallelized widely, while full fine-tunes dominate wall-clock and energy cost. The key modeling choice is to treat LoFi(a, b0) as producing features rather than a noisy scalar objective. This reflects the empirical reality that early trajectories contain heterogeneous signals (stability, curvature, speed, implicit regularization) that are more informative jointly than any single proxy metric.
Under bounded distortion and standard capacity control for 𝒢, we show that m high-fidelity samples suffice to
obtain a uniform prediction error bound over all architectures. The
uniformity is essential: because the final decision is an arg min over 𝒜, pointwise accuracy is inadequate. The
resulting sample complexity scales as
$$
m=\tilde O\!\left(\sigma^2\frac{d+\log|\mathcal{A}|}{\epsilon^2}\right),
$$
matching the familiar 1/ϵ2 dependence of
stochastic best-arm identification and exhibiting the expected
logarithmic dependence on the size of the search space.
Uniform error immediately yields rank consistency statements: predicted orderings differ from true orderings only on pairs whose true objective gap is small relative to the uniform error. This leads to explicit Kendall-τ disagreement bounds in terms of the empirical gap distribution, clarifying when a calibrated predictor should be expected to produce a high-quality shortlist even if exact identification of the optimum is infeasible.
Beyond one-shot selection, we analyze a promotion/elimination variant in which we maintain a surviving set S ⊆ 𝒜 and discard architectures whose lower confidence bound is provably worse than the best upper confidence bound by a margin. With calibrated confidence intervals, the optimal architecture is never eliminated, and the allocation of high-fidelity queries adapts to gaps Δa = F(a) − F(a*), yielding instance-dependent complexity of the form Õ(σ2/Δa2) per suboptimal architecture.
We also formalize two negative results that delimit what is possible. First, without any constraint linking low-fidelity observations to F, no proxy-only procedure can guarantee nontrivial approximation uniformly over all objectives: indistinguishability constructions show that the same proxies can correspond to objectives with disjoint minimizers. Second, even when proxies are perfectly informative up to calibration (η = 0), any algorithm that returns an ϵ-optimal architecture with probability 1 − δ must perform Ω(σ2log (1/δ)/ϵ2) high-fidelity queries in the worst case. Thus the dependence on σ2, ϵ, and δ in our guarantees is essentially unavoidable.
In sum, we treat adapterized NAS as a calibrated, multi-fidelity selection problem: low-fidelity runs produce a feature space, high-fidelity runs provide scarce labels, and the correctness criterion is uniform enough to support an arg min decision with a prescribed (ϵ, δ) guarantee. The resulting perspective explains both why naive proxy ranking can fail (large η or insufficient calibration) and why a modest number of full fine-tunes can nevertheless suffice when the proxy representation is well-chosen (small η and controlled complexity d).
Neural architecture search (NAS) originated in settings where the architecture controls the entire model and training from scratch is the dominant cost. In our setting the backbone M0 is fixed and frozen, and the search object is an adapter configuration (placement, rank/width, sharing, and routing of parameter-efficient modules). This shift changes both the empirical bottlenecks and the appropriate abstraction: full-fidelity evaluation is still expensive, but low-fidelity information is naturally available in the form of partial fine-tuning trajectories and optimization traces that are specific to the adapterized regime.
Early NAS methods used reinforcement learning (RL) controllers to generate architectures, receiving validation performance as a reward signal (e.g., ). These approaches are conceptually compatible with discrete design spaces, but in practice they require large numbers of expensive evaluations and rely on weight sharing or other amortization strategies to be computationally feasible. In adapter search, an RL controller can indeed propose module placements and ranks; however, without a careful treatment of multi-fidelity information, the evaluation cost remains dominated by full fine-tunes, and the controller must still cope with noisy rewards and limited samples. From a theoretical standpoint, RL-based NAS analyses typically inherit regret guarantees for policy optimization under assumptions that do not reflect proxy-based calibration: the reward is assumed to be observed (perhaps noisily) for each sampled architecture, rather than inferred from a feature-rich low-fidelity trajectory prefix.
Evolutionary algorithms (EA) and related population-based strategies are another classical pillar of NAS (e.g., ). Their strength is robustness to nonconvexity and discrete decisions, and their weakness is again evaluation cost. Various works introduce cheap mutations, surrogate models, or partial training to reduce cost. In adapterized regimes, evolutionary search is natural because placements and routing rules admit structured mutations, and partial fine-tuning can serve as a screening signal. Nevertheless, standard EA treatments view low-fidelity information primarily as a proxy (early stopping, truncated training) rather than as an object from which one learns a calibrated predictor with explicit uniform guarantees. Moreover, most EA analyses are asymptotic or rely on simplifying assumptions (e.g., unimodal landscapes or drift conditions) that do not directly yield finite-sample (ϵ, δ) statements for the arg min decision over a large finite set.
Differentiable architecture search (DARTS) and its successors relax discrete choices into continuous parameters and optimize them jointly with weights via gradient descent (e.g., ). Numerous variants address instability, bias toward parameter-free operations, and fairness of comparison (e.g., DARTS+ and FairDARTS-like adjustments; see surveys such as for categorization). These methods are most effective when the search space is expressed as a cell or a supernet with shared parameters. In adapter NAS, one can also define a supernet over possible adapter placements/ranks and relax selection variables. However, two issues arise. First, weight sharing can be less reliable when the base model is frozen and the adaptation modules are comparatively small: the supernet coupling may distort relative performance, and the mismatch between relaxed training dynamics and final discrete instantiation can be large. Second, the most common theoretical analyses of continuous-relaxation NAS focus on optimization properties of the relaxed problem (e.g., convergence to stationary points) rather than on statistical guarantees that connect low-fidelity signals (partial fine-tunes, early trajectories) to full-fidelity performance across a finite architecture set. Our formulation explicitly isolates the statistical calibration step needed to turn low-fidelity features into a predictor whose error can be controlled uniformly over 𝒜.
One-shot NAS trains a single over-parameterized supernet and evaluates sub-architectures by inheriting weights (e.g., ). This strategy is a form of multi-fidelity evaluation: the inherited-weight score is a cheap estimator of the fully trained score. The primary difficulty is estimator bias and rank inconsistency, which is known to be severe in some regimes. The adapterized setting exacerbates this: because the backbone is fixed, the adaptation modules must negotiate with a representation that they did not co-train, and supernet coupling effects can change the effective regularization on the adapters. Existing theory for one-shot NAS, where present, typically addresses properties of weight sharing under restrictive assumptions (e.g., linearization, neural tangent approximations, or simplified losses) and does not provide an (ϵ, δ) guarantee for selecting a discrete adapter configuration using a small number of full fine-tunes. Our approach avoids assuming that the low-fidelity estimator is unbiased or even scalar; instead we treat low-fidelity outputs as features and allow an explicit distortion term capturing irreducible mismatch.
A substantial body of work studies predictors for architecture performance and learning-curve extrapolation (e.g., ). These methods often combine architecture encodings with partial training signals to predict final accuracy, and many are explicitly multi-fidelity, using early stopping or bandit-style budget allocation (e.g., Hyperband and its variants ). Our setting is closest in spirit to these predictor-based and multi-fidelity hyperparameter optimization (HPO) lines, with two important distinctions. First, the proxy z(a, b0) in adapter NAS is typically richer than the scalar curves used in classical curve extrapolation: it naturally mixes architectural descriptors (placement masks, ranks, routing sparsity) with trajectory-derived statistics (loss slope, stability, gradient norms), which are not well-modeled as noisy samples of a single monotone learning curve. Second, our target is not merely good average prediction, but a control sufficient to justify an arg min decision over a finite, potentially large 𝒜. Many predictor works evaluate rank correlation empirically, but do not provide the kind of worst-case selection guarantee that is needed when a small error on a small subset of architectures can change the identity of the minimizer.
Random search remains a strong baseline in NAS and HPO, particularly in high-dimensional discrete spaces where sophisticated methods can be brittle (cf. in HPO and subsequent NAS observations). In adapterized NAS, random search coupled with a small number of full fine-tunes is often competitive because the cost of each evaluation dominates and because proxy signals can be misleading. Our guarantees clarify when random sampling is sufficient and when it is not: if no structural linkage exists between low-fidelity observations and full-fidelity performance, then no proxy-based method can improve worst-case guarantees over naive selection, and one must fall back to allocating high-fidelity evaluations in a manner consistent with best-arm identification lower bounds. Conversely, when linkage holds in a quantifiable way, a calibrated predictor can reduce the number of required full evaluations.
Hardware-aware NAS incorporates latency, memory, energy, or throughput constraints, often via multi-objective optimization or constrained search (e.g., ). Adapter architectures are frequently deployed under tight serving constraints (e.g., parameter budgets for on-device adaptation or routing-induced latency). While our primary formulation is single-objective (minimize the full-fidelity metric), the same multi-fidelity calibration perspective can be extended to constrained selection by augmenting features with cost proxies and applying calibration to both quality and constraint predictors. Existing theory for hardware-aware NAS typically treats cost models as known or cheaply measured, whereas in adapter routing settings latency can be data-dependent and thus itself noisy; this reinforces the need for explicit confidence control when eliminating architectures.
Theoretical treatments of NAS and multi-fidelity optimization tend to fall into one of three categories: (i) bandit-style analyses where each architecture/arm yields a noisy scalar reward upon evaluation; (ii) Bayesian optimization and surrogate modeling under smoothness or kernel assumptions on a continuous domain; or (iii) optimization analyses of continuous relaxations (e.g., DARTS). Our setting departs from (i) because the low-fidelity oracle does not yield a scalar reward comparable across architectures, but rather a feature-rich trajectory prefix from which one must learn a predictor. It departs from (ii) because the domain is a large finite set with structured, heterogeneous encodings, and the meaningful assumption is not global smoothness in a metric space but a bounded-distortion relationship between full-fidelity performance and a hypothesis class over proxy features. It departs from (iii) because we do not assume that the relaxed optimization problem faithfully captures the discrete decision, nor that weight sharing yields unbiased estimates. Consequently, existing theory does not directly yield the (ϵ, δ) selection guarantee we seek for adapter NAS under a small number of full fine-tunes.
These gaps motivate the problem setup in the next section: we formalize a two-fidelity model in which low-fidelity queries produce proxy features from partial training trajectories, and high-fidelity queries provide scarce noisy labels used for calibration and selection over a finite architecture set.
We formalize adapter architecture search for a fixed pretrained foundation model M0 as a budgeted optimization problem over a finite design set. Let 𝒜 denote a finite collection of adapter architectures, each specifying (for example) the placement of parameter-efficient modules (e.g., adapters, LoRA blocks), their ranks or widths, possible parameter sharing patterns, and (when applicable) routing or sparsity rules. Throughout, M0 is held fixed and frozen; the only trainable parameters under architecture a ∈ 𝒜 are those introduced by the adaptation modules prescribed by a.
For each a ∈ 𝒜, we define F(a) ∈ ℝ as the performance metric of interest. Concretely, F(a) may be the mean validation loss (lower is better) over a fixed task suite after applying a standardized fine-tuning recipe to the adapter parameters under architecture a. The standardization is part of the problem definition: it includes the training data, preprocessing, optimizer, schedule, batch size, stopping rule, and evaluation protocol. This is necessary so that F(a) is well-defined as a property of a, rather than an artifact of an adaptive training procedure.
We assume that direct access to F(a) is expensive. Instead,
we have a high-fidelity oracle HiFi
which returns a noisy observation
HiFi(a) = F(a) + ξ,
where ξ is conditionally
mean-zero and σ2-sub-Gaussian. This
noise model abstracts randomness from optimization (minibatching,
initialization of adapter weights), data order, and evaluation variance
across tasks. The σ2-sub-Gaussian
assumption is used to express finite-sample concentration for estimators
built from a limited number of full evaluations.
In addition to expensive high-fidelity evaluation, we assume access
to a cheap low-fidelity oracle LoFi(a, b) parameterized by
a budget b, which returns partial
information from a truncated fine-tuning run of architecture a up to budget b. The budget b can represent steps, epochs,
tokens, or wall-clock time. We emphasize that LoFi(a, b) is not assumed
to be a scalar proxy comparable across architectures; rather, it returns
a structured trajectory prefix (and possibly additional traces). A canonical example is a
sequence
LoFi(a, b) = (ℓa, 1, …, ℓa, b)
of training or validation losses, but in practice it may also include
gradient norms, learning-rate values, moments of activation statistics,
per-task losses, routing entropy, sparsity patterns, or adapter-specific
diagnostics. The only requirement at this stage is that the low-fidelity
output can be deterministically transformed into a feature vector.
We fix a reference low-fidelity budget b0 intended to be small
relative to a full fine-tune, and for each a ∈ 𝒜 we compute a proxy feature
vector
z(a, b0) ∈ ℝd
by applying a featurization map z(⋅, ⋅) to (i) an encoding of the
architecture a itself and (ii)
the partial trajectory information returned by LoFi(a, b0). We
treat d as the resulting
feature dimension (or, more generally, as a complexity parameter
controlling the capacity of the predictor class used downstream).
The role of z(a, b0)
is to capture information that is plausibly predictive of F(a) while being cheap to
obtain. Since adapter architectures naturally factor into discrete
choices (placement masks, rank allocations, routing graphs) and early
optimization signals (loss slope, stability, curvature proxies), we
allow z(a, b0)
to concatenate heterogeneous components:
$$
z(a,b_0) \;=\;
\mathrm{concat}\!\Bigl(\underbrace{\phi_{\mathrm{arch}}(a)}_{\text{static
descriptors}},\
\underbrace{\phi_{\mathrm{curve}}(\mathsf{LoFi}(a,b_0))}_{\text{trajectory
statistics}}\Bigr).
$$
Here ϕarch(a) can
include counts of trainable parameters, per-layer adapter placements,
rank vectors, routing sparsity levels, and similar structural
descriptors; ϕcurve
can include the initial loss, the average slope over windows, measures
of oscillation or stability, gap between training and validation losses
at early time, or cross-task dispersion in multi-task fine-tuning. We do
not assume monotonicity, smoothness, or unbiasedness of these features;
they are merely inputs for subsequent calibration from a small number of
high-fidelity labels.
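As an illustration, a possible featurization in Python; the architecture record fields and the LoFi output keys below are assumptions made for the sketch, not a fixed schema.

```python
import numpy as np

def phi_arch(arch):
    # Static structural descriptors; `arch` is a hypothetical record type.
    return np.array([
        arch["num_trainable_params"],
        arch["routing_sparsity"],
        *arch["placement_mask"],   # 0/1 adapter placement per layer
        *arch["ranks"],            # low-rank update rank per adapted layer
    ], dtype=float)

def phi_curve(train_loss, val_loss):
    # Early-trajectory statistics computed from the LoFi prefix.
    t, v = np.asarray(train_loss, float), np.asarray(val_loss, float)
    return np.array([
        v[0],                             # initial level
        (v[-1] - v[0]) / len(v),          # average early slope
        float(np.abs(np.diff(v)).std()),  # oscillation / stability proxy
        float((v - t).mean()),            # early train/validation gap
    ])

def z(arch, lofi_out):
    # Concatenated heterogeneous feature vector z(a, b0).
    return np.concatenate([
        phi_arch(arch),
        phi_curve(lofi_out["train_loss"], lofi_out["val_loss"]),
    ])
```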
We explicitly model the compute disparity between fidelities. A
low-fidelity query at budget b
incurs cost cL(b),
while a high-fidelity query incurs cost cH, with cH ≫ cL(b0).
The algorithm is permitted to issue many low-fidelity
queries—potentially for all a ∈ 𝒜 at budget b0—since these can often
be parallelized across machines and run with small adapter parameter
counts. By contrast, the number of high-fidelity queries is sharply
limited. We therefore express the primary resource constraint either
as
#{high-fidelity
queries} ≤ m,
or, in a unified form, as a total compute budget B with
$$
\sum_{(a,b)\ \mathrm{queried}} c_L(b) \;+\; \sum_{a\ \mathrm{queried}} c_H \;\le\; B.
$$
In the remainder of the development we focus on the regime where
low-fidelity information is abundant at a fixed budget b0, and the dominant
constraint is the number m of
full fine-tunes.
The core objective is to return an architecture â ∈ 𝒜 whose full-fidelity
performance is near-optimal. Given target accuracy ϵ > 0 and failure probability
δ ∈ (0, 1), we seek an
algorithm which, using at most m high-fidelity evaluations and
low-fidelity evaluations at cost dominated by n := |𝒜| queries at budget b0, satisfies
F(â) ≤ mina ∈ 𝒜F(a) + ϵ with
probability at least 1 − δ.
We also allow the algorithm to output a ranked shortlist S ⊆ 𝒜 (e.g., the top k candidates under a predictor),
optionally equipped with confidence intervals derived from the
high-fidelity calibration set.
In many deployments, adapter architectures must meet serving
constraints, such as parameter count, latency, memory footprint, or
routing-induced compute. To capture this, we may associate to each a ∈ 𝒜 a constraint metric C(a) (or a vector of
metrics) and impose a threshold κ. The constrained optimization
problem is then
mina ∈ 𝒜F(a) subject
to C(a) ≤ κ.
The constraint itself may be (i) known deterministically from a (e.g., parameter count), (ii)
measurable cheaply (e.g., static FLOPs), or (iii) noisy and
data-dependent (e.g., latency under dynamic routing). In the latter
cases one can define an additional oracle and proxy features for
constraint prediction analogous to the quality objective. For the
problem setup, it suffices to note that our algorithmic interface can
accommodate either hard filtering (remove architectures violating
deterministic constraints) or probabilistic feasibility (maintain
confidence bounds for C(a) in addition to those
for F(a)).
We emphasize the intended access pattern. The low-fidelity stage may query LoFi(a, b0) for many or all a ∈ 𝒜, producing a table {z(a, b0)}a ∈ 𝒜. The high-fidelity stage then adaptively selects a small subset of architectures for HiFi evaluation in order to (i) calibrate a mapping from proxy features to full-fidelity performance, and (ii) identify an approximately optimal architecture with controlled failure probability. This separation reflects the empirical reality that early partial-training traces are cheap and parallelizable, whereas full evaluations are scarce and must be allocated strategically.
The remainder of the paper will specify how we turn the proxy table {z(a, b0)} and a limited number of noisy labels {HiFi(ai)}i = 1m into a calibrated predictor and associated uncertainty estimates suitable for selection and safe elimination over the finite set 𝒜.
This section specifies the statistical object that mediates between low-fidelity observations and the full-fidelity objective: a calibrated predictor of F(a) from proxy features z(a, b0), together with uncertainty estimates suitable for selection and (in the next section) safe elimination. We emphasize that the predictor is not assumed to be ``correct'' in a realizable sense; rather, we isolate an explicit approximation error η capturing intrinsic mismatch between proxy information and the true objective.
The proxy vector z(a, b0) ∈ ℝd may be viewed as a summary of (i) a discrete architecture description and (ii) an early learning signal. While the development does not require a particular featurization, the bounded-distortion assumption (A1) is only plausible when z exposes coarse invariants that persist from early training to completion.
A generic and practically stable approach is to include parametric summaries of the
partial learning curve that are motivated by empirical scaling laws.
Suppose LoFi(a, b0)
contains a scalar validation loss sequence (ℓa, t)t = 1b0.
We may fit a small parametric family on the prefix, e.g.,
ℓa, t ≈ αa + βat−γa or ℓa, t ≈ αa + βaexp (−γat),
via least squares or robust regression on t ∈ {tmin, …, b0}.
The fitted parameters (αa, βa, γa)
can be used as components of z(a, b0),
along with goodness-of-fit diagnostics (e.g., residual variance on the
prefix) that act as a proxy for optimization stability. When LoFi includes multiple traces (training loss,
validation loss, gradient norms, or per-task losses), we fit such
summaries per trace and concatenate. In addition, we include
nonparametric local statistics that tend to be robust under short
prefixes:
$$
\text{level: }\ell_{a,1},\ \ell_{a,b_0};\qquad
\text{slope:
}\frac{1}{w}\sum_{t=b_0-w+1}^{b_0}(\ell_{a,t}-\ell_{a,t-1});\qquad
\text{curvature proxy: }\ell_{a,b_0}-2\ell_{a,b_0-w}+\ell_{a,b_0-2w}.
$$
These are intended to capture whether a run has ``entered'' a productive
regime early, without asserting monotonicity or linearity across
architectures.
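A sketch of these summaries in Python, using scipy's curve_fit for the power-law prefix model; the window size w, the fit range, and the non-convergence fallback are illustrative choices rather than prescribed ones.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(t, alpha, beta, gamma):
    # Scaling-law-style prefix model: loss(t) ~ alpha + beta * t**(-gamma).
    return alpha + beta * np.power(t, -gamma)

def curve_summaries(losses, t_min=5, w=5):
    # `losses` is the LoFi validation-loss prefix (l_1, ..., l_{b0}).
    ell = np.asarray(losses, dtype=float)
    b0 = len(ell)
    t = np.arange(t_min, b0 + 1, dtype=float)
    try:
        (alpha, beta, gamma), _ = curve_fit(
            power_law, t, ell[t_min - 1:], p0=(ell[-1], 1.0, 0.5), maxfev=5000)
        resid = ell[t_min - 1:] - power_law(t, alpha, beta, gamma)
        resid_var = float(np.var(resid))   # goodness-of-fit / stability proxy
    except RuntimeError:                   # non-convergence is itself a signal
        alpha, beta, gamma, resid_var = np.nan, np.nan, np.nan, np.inf
    slope = (ell[-1] - ell[-w - 1]) / w    # telescoped window slope
    curvature = ell[-1] - 2.0 * ell[-w - 1] + ell[-2 * w - 1]
    return np.array([alpha, beta, gamma, resid_var,
                     ell[0], ell[-1], slope, curvature])
```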
The architecture descriptor component ϕarch(a) should include at least those aspects known to strongly influence full-fidelity outcomes: number of trainable parameters, per-layer adapter placements, ranks/widths, any sharing constraints, and (for routing) expected active fraction or entropy measures under random or early-trained gates. When the goal includes device or serving constraints, it is useful (though not required for the quality predictor) to include static compute and memory features such as FLOPs and activation footprint. We treat all such quantities as deterministic functions of a, hence valid low-fidelity inputs.
Given proxy vectors z(a, b0), we posit a hypothesis class 𝒢 ⊆ {g : ℝd → ℝ} from which we fit a predictor ĝ. The analysis downstream depends on a capacity parameter, taken here as pdim(𝒢) ≤ d, and on the existence of some g* ∈ 𝒢 achieving bounded distortion. Concretely, 𝒢 may be instantiated as a linear class over z(a, b0), kernel ridge regression, gradient-boosted trees, or a small neural regressor, each with controlled effective capacity.
We fit ĝ by empirical risk
minimization (ERM) or regularized ERM on the observed high-fidelity
labels. Writing yi = HiFi(ai)
and zi = z(ai, b0),
we solve
$$
\hat g \in \arg\min_{g\in\mathcal{G}} \frac{1}{m}\sum_{i=1}^m
(g(z_i)-y_i)^2 + \lambda \Omega(g),
$$
where Ω is a complexity
penalty (e.g., ∥w∥22 in the
linear case) and λ is tuned on
a hold-out subset of the m
high-fidelity samples or via nested cross-validation within D. The role of regularization is not
only predictive accuracy but also stability of uncertainty estimates
used by the selection logic.
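A minimal instantiation with a linear class, using scikit-learn's RidgeCV so that the regularization strength is selected by internal cross-validation on the m labeled pairs; the class choice and grid are illustrative.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_g_hat(Z, y, alphas=(1e-2, 1e-1, 1.0, 10.0)):
    # Regularized ERM over a linear class; `alphas` plays the role of the
    # lambda grid, tuned by RidgeCV's internal cross-validation.
    model = RidgeCV(alphas=alphas).fit(np.asarray(Z), np.asarray(y))
    return lambda z: float(model.predict(np.asarray(z).reshape(1, -1))[0])
```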
The core structural assumption is that proxy features admit a
predictor within 𝒢 that is uniformly
close to the truth:
∃g* ∈ 𝒢 s.t. supa ∈ 𝒜|F(a) − g*(z(a, b0))| ≤ η.
The parameter η measures the irreducible proxy mismatch:
even with infinite high-fidelity labels, the best proxy-based predictor
can deviate from F by at most
η. This explicitly separates
what can be improved by more HiFi
evaluations (estimation error) from what cannot (misspecification and
proxy insufficiency). The regime of interest for ϵ-optimal selection is η ≪ ϵ, and in particular
the later guarantee will require η ≤ ϵ/4.
Under sub-Gaussian noise and capacity control, ERM yields a uniform
estimation bound over the finite set 𝒜.
Informally, if a1, …, am
are drawn uniformly from 𝒜, then with
probability at least 1 − δ,
$$
\sup_{a\in\mathcal{A}} |\hat g(z(a,b_0))-F(a)|
\ \le\
\eta + C\,\sigma\sqrt{\frac{d+\log(n/\delta)}{m}},
$$
for a universal constant C
depending on the choice of loss and boundedness conditions. The key
point is that the error decomposes additively into η and an estimation term scaling as
$1/\sqrt{m}$. This is the quantitative
reason that a small number of high-fidelity labels can ``calibrate'' a
proxy table computed for all a ∈ 𝒜.
For the adaptive algorithm, we require not only a point predictor but
also calibrated uncertainty. We therefore construct a confidence
interval CI(a) around ĝ(z(a, b0))
using conformal prediction on a held-out calibration subset. Concretely,
split the m labeled pairs into
a proper training set Dtrain and a calibration
set Dcal. Fit ĝ on Dtrain, compute
residuals
rj = |yj − ĝ(zj)|, (zj, yj) ∈ Dcal,
and let q̂ be the (1 − α)-quantile of {rj} with the
standard finite-sample correction. Then define
CI(a) = [ĝ(z(a, b0)) − q̂, ĝ(z(a, b0)) + q̂].
By exchangeability of the labeled examples, this yields marginal
coverage
Pr (F(a) ∈ CI(a)) ≥ 1 − α
for a new architecture a drawn
from the same sampling scheme as the calibration points. In our setting
𝒜 is finite and the algorithm evaluates
only architectures in 𝒜, so we use two
refinements when needed: (i) Mondrian (stratified) calibration to maintain
coverage across salient groups (e.g., parameter-count bins), and (ii) a
union-bound adjustment to target simultaneous coverage over a candidate
set S ⊆ 𝒜 maintained by the
algorithm, by setting α = δ/|S| when
S is fixed or by using a
conservative schedule over rounds.
We stress that conformal intervals do not require correctness of 𝒢; they adapt to misspecification by inflating q̂. The cost is that wide intervals may preclude aggressive elimination, which is appropriate when proxies are weak.
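A split-conformal sketch in Python matching the construction above; `fit` is any trainer returning a predictor, and the clamped index is an illustrative way to handle a too-small calibration set.

```python
import numpy as np

def split_conformal(Z, y, fit, alpha=0.1, train_frac=0.5, seed=0):
    # Split the m labeled pairs, fit g_hat on one half, and take the
    # finite-sample-corrected (1 - alpha) residual quantile on the other.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_tr = int(train_frac * len(y))
    tr, cal = idx[:n_tr], idx[n_tr:]
    g_hat = fit([Z[i] for i in tr], [y[i] for i in tr])
    resid = np.sort([abs(y[j] - g_hat(Z[j])) for j in cal])
    k = int(np.ceil((len(cal) + 1) * (1 - alpha))) - 1  # 0-indexed rank
    q_hat = resid[min(k, len(resid) - 1)]               # clamp if k == n_cal
    return g_hat, q_hat

# CI(a) = [g_hat(z_a) - q_hat, g_hat(z_a) + q_hat] then has marginal
# coverage at least 1 - alpha under exchangeability of the labeled pairs.
```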
Assumption (A1) is structural and cannot be verified from low-fidelity data alone, but it can be falsified (or at least stress-tested) with a modest number of high-fidelity labels. In practice we recommend diagnostics computed on an initial random batch of m0 high-fidelity evaluations: for example, comparing the held-out residuals of ĝ against the noise scale σ to lower-bound η, checking rank correlation (e.g., Kendall-τ) between predicted and observed values, and auditing the empirical coverage of the conformal intervals. These checks do not prove (A1), but they provide actionable feedback: either the proxy-to-true linkage is sufficiently tight to justify the calibrated predictor approach, or the procedure should be modified (richer low-fidelity signals, larger b0, alternative 𝒢, or a fallback to more direct high-fidelity exploration).
The next section uses the calibrated predictor and its confidence intervals to define an adaptive allocation rule over HiFi queries, combining exploitation of low predicted loss with elimination safeguards derived from calibrated uncertainty.
We now specify the multi-fidelity procedure that combines (i) an exhaustive low-fidelity sweep at budget b0 to construct proxy features {z(a, b0)}a ∈ 𝒜, with (ii) an adaptive allocation of a limited number m of high-fidelity evaluations guided by a calibrated predictor and its confidence intervals. The algorithm is designed to support two modes of operation: a simple one-shot variant (sufficient for the non-adaptive upper bound), and an adaptive promotion/elimination variant that safely eliminates clearly suboptimal architectures and concentrates high-fidelity evaluations on a shrinking candidate set.
The algorithm takes as input 𝒜, the low-fidelity budget b0, a high-fidelity query budget m, and target parameters (ϵ, δ). It outputs a final choice â ∈ 𝒜, and optionally a ranked shortlist S ⊆ 𝒜 with calibrated confidence intervals {CI(a)}a ∈ S. We treat ϵ as the target additive optimality gap and δ as an overall failure probability.
We first query LoFi(a, b0) for
every a ∈ 𝒜 (or for as many
architectures as is feasible under the low-fidelity compute budget), and
compute z(a, b0).
This phase is embarrassingly parallel and produces a fixed proxy table
that is reused throughout.
Importantly, the subsequent high-fidelity allocation depends on each architecture only through z(a, b0),
so we can choose which architectures to evaluate using HiFi without further low-fidelity cost.
Adaptive selection based on a predictor requires that the predictor
be calibrated on at least a small number of high-fidelity labels. We
therefore select an initial batch of m0 architectures a1, …, am0
uniformly at random from 𝒜 (or from the
current candidate set S if the
procedure is restarted), query yi = HiFi(ai),
and form a labeled dataset
Dm0 = {(z(ai, b0), yi)}i = 1m0.
The role of this step is purely statistical: it ensures that early
decisions are not driven by an uncalibrated model, and it provides data
to estimate uncertainty (e.g., via conformal residuals).
For t > m0, we
proceed iteratively. At iteration t, we fit a predictor ĝt ∈ 𝒢 on the
current labeled set Dt − 1 (using
ERM/regularized ERM as above), and we construct a confidence interval
CIt(a) = [LCBt(a), UCBt(a)]
for each a in the current
candidate set St − 1. The
particular calibration mechanism is not fixed by the analysis; we
require only that, with probability at least 1 − δ, the intervals provide valid
coverage for all architectures on which we base elimination decisions.
Conformal prediction provides a concrete instantiation: we fit ĝt on a proper
training subset and compute a quantile of residuals on a calibration
subset to obtain a data-dependent half-width q̂t,
yielding
LCBt(a) = ĝt(z(a, b0)) − q̂t, UCBt(a) = ĝt(z(a, b0)) + q̂t.
We then choose the next high-fidelity query as a lower-confidence-bound
minimizer,
at ∈ arg mina ∈ St − 1LCBt(a),
query yt = HiFi(at),
and set Dt = Dt − 1 ∪ {(z(at, b0), yt)}.
This LCB rule is the natural choice for minimization: it prioritizes
architectures that are either predicted to be good or have high
uncertainty, and it is compatible with elimination rules that rely on
simultaneous upper/lower bounds.
To concentrate the m
high-fidelity queries, we optionally interleave querying with
elimination. At certain iterations (or at the end of each ``round'' in a
staged schedule), we update ĝt and CIt(⋅)
and shrink the candidate set according to a safety criterion. The basic
elimination rule is to remove an architecture a from the current candidate set as soon as
$$
\mathrm{LCB}_t(a) \;>\; \min_{a' \in S_{t-1}} \mathrm{UCB}_t(a') \;+\; \frac{\epsilon}{2}.
$$
Under valid calibration, any architecture removed by this rule cannot be ϵ/2-optimal: its best plausible value
exceeds the best plausible value among survivors by more than ϵ/2. The margin ϵ/2 is chosen so that, once the
confidence intervals are sufficiently tight, elimination and final
selection together yield an ϵ-optimal output.
To obtain a successive-halving-style behavior, we couple the elimination rule with an explicit round schedule. For
example, define rounds r = 0, 1, …, R with
candidate sets S(r) and
cumulative high-fidelity counts m(r), where
m(0) = m0
and m(r + 1) − m(r)
is the number of new HiFi labels
acquired in round r. At the
end of each round, we apply the elimination rule and, if it does not reduce the set
sufficiently (e.g., because the intervals are still wide), we perform a hard halving step
by retaining only the ⌈|S(r)|/2⌉
architectures with smallest LCB, i.e.,
S(r + 1) = Top⌈|S(r)|/2⌉(S(r); a ↦ LCB(a)),
which enforces computational focus while remaining conservative when
uncertainty dominates. In regimes where proxies are informative, the
safety rule typically drives most of the reduction, and the explicit
halving is rarely activated.
The same mechanism can be viewed as a promotion algorithm: each architecture in S starts with a proxy score ĝ(z(a, b0)), and only architectures with sufficiently low confidence bounds are ``promoted'' to receive additional high-fidelity evaluations. Because HiFi(a) is noisy, the algorithm may query the same a multiple times; in practice this can be implemented either as repeated fine-tuning with different seeds or as repeated measurements of a stochastic evaluation pipeline. The confidence interval construction should reflect the effective sample size for repeatedly evaluated architectures (e.g., by incorporating empirical variance or by treating repeats as additional exchangeable calibration points).
At termination (after m
high-fidelity queries), we output either (i) the empirical best among
queried points, â ∈ arg mina ∈ {a1, …, am}y(a),
or (ii) the proxy-calibrated minimizer over the surviving set,
â ∈ arg mina ∈ Smĝm(z(a, b0)),
together with {CIm(a)}a ∈ Sm.
The latter is preferable when |Sm| is still
large but the predictor is well calibrated; the former is robust when
misspecification is severe and we prefer to trust only direct
measurements.
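Putting the pieces together, a compact sketch of the promotion/elimination loop; it reuses `split_conformal` from the earlier sketch, omits the explicit halving fallback for brevity, and returns the empirical-best output rule. All names and defaults are illustrative assumptions.

```python
import numpy as np

def cash_nas(arch_set, Z, hifi, fit, m, m0, epsilon, alpha=0.05, seed=0):
    # `Z` is the precomputed proxy table {a: z(a, b0)}; `fit` trains a
    # predictor on labeled (feature, value) pairs.
    rng = np.random.default_rng(seed)
    S = list(arch_set)
    init = [S[i] for i in rng.choice(len(S), size=m0, replace=False)]
    labeled = [(a, hifi(a)) for a in init]        # warm-start calibration
    for _ in range(m - m0):
        Zl = [Z[a] for a, _ in labeled]
        yl = [y for _, y in labeled]
        g_hat, q = split_conformal(Zl, yl, fit, alpha=alpha)
        lcb = {a: g_hat(Z[a]) - q for a in S}
        ucb = {a: g_hat(Z[a]) + q for a in S}
        best_ucb = min(ucb.values())
        # Safe elimination: drop a once LCB(a) > min_a' UCB(a') + eps/2.
        S = [a for a in S if lcb[a] <= best_ucb + epsilon / 2]
        # Promote the LCB minimizer among survivors to a HiFi evaluation.
        a_next = min(S, key=lambda a: lcb[a])
        labeled.append((a_next, hifi(a_next)))
    # Robust output rule: empirical best among directly queried points.
    return min(labeled, key=lambda pair: pair[1])[0]
```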
When the objective aggregates performance across tasks t ∈ [T], we may
define
$$
F(a)=\frac{1}{T}\sum_{t=1}^T F_t(a),
\qquad
\mathsf{HiFi}(a)=F(a)+\xi,
$$
where Ft(a)
is the full-fidelity metric on task t. Operationally, we can either (i)
compute HiFi(a) by running all
tasks and averaging, or (ii) query tasks sequentially and maintain an
online estimate of F(a) with task-level
uncertainty. In case (ii), we can incorporate task-level partial
averages into the confidence bounds, e.g., defining ya, s
as the mean over the first s
tasks and using a union bound over s to preserve correctness of
elimination. For proxy features, we may either featurize each
task-specific prefix and concatenate, or include task-invariant
summaries (e.g., mean/variance across tasks) when T is large.
If we impose a hard device constraint (latency, memory, parameter count), let C(a) be a deterministic cost computed from the architecture descriptor, and restrict to the feasible set 𝒜feas = {a ∈ 𝒜 : C(a) ≤ Cmax} before running CASH-NAS. If the device metric is itself noisy (e.g., measured latency), we treat it as an additional high-fidelity output and maintain a calibrated interval CIC(a) in parallel, eliminating architectures that are infeasible with high confidence. For soft tradeoffs, we may optimize a scalarized objective Fλ(a) = F(a) + λC(a) for a chosen λ, or maintain a small Pareto set by running CASH-NAS for several λ values using the same proxy table. These extensions preserve the central structure: low-fidelity features provide a global covariate map over 𝒜, while a small number of calibrated high-fidelity measurements controls both selection and safe elimination.
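A sketch of the constrained interface under these conventions; `cost`, `C_max`, and `lam` are deployment-chosen stand-ins rather than prescribed quantities.

```python
def feasible(arch_set, cost, C_max):
    # Hard filtering under a deterministic, descriptor-derived cost C(a).
    return [a for a in arch_set if cost(a) <= C_max]

def scalarized_objective(g_hat, Z, cost, lam):
    # Soft tradeoff F_lambda(a) = g_hat(z_a) + lam * C(a); sweeping `lam`
    # over a grid traces an approximate Pareto set from one proxy table.
    return lambda a: g_hat(Z[a]) + lam * cost(a)
```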
The subsequent section formalizes the statistical guarantees that connect calibrated prediction error to ranking consistency and ϵ-optimal selection, and quantifies the required number of high-fidelity evaluations in terms of (d, σ2, ϵ, δ) and |𝒜|.
We now formalize the sense in which a small number of calibrated high-fidelity labels suffices to (i) recover a near-correct ranking over 𝒜 from proxy features, and (ii) select an ϵ-optimal architecture. We also record an anytime (promotion-style) guarantee for the confidence-bound variant that repeatedly queries and eliminates.
Throughout, we work under the standing assumptions of the enclosing scope: there exists g* ∈ 𝒢 such that supa ∈ 𝒜|F(a) − g*(z(a, b0))| ≤ η, the oracle noise in HiFi(a) is conditionally σ2-sub-Gaussian, and pdim(𝒢) ≤ d. We write n := |𝒜|.
Let a1, …, am
be drawn i.i.d. uniformly from 𝒜, and
let yi = HiFi(ai).
Consider an empirical risk minimizer (or regularized ERM) ĝ ∈ 𝒢 fit on {(z(ai, b0), yi)}i = 1m.
Standard uniform convergence for real-valued hypothesis classes (via
pseudo-dimension control) yields a uniform deviation bound of the
form
$$
\sup_{a\in\mathcal{A}} \bigl|\hat g(z(a,b_0))-F(a)\bigr|
\;\le\;
\eta + c\,\sigma\sqrt{\frac{d+\log(n/\delta)}{m}}
\;=:\; \gamma_m,
$$
with probability at least 1 − δ, for a universal constant
c > 0. The decomposition
γm = η + (estimation)
is conceptually important: the low-fidelity budget b0 and feature map z(⋅, b0) control
η (a misspecification term),
while m controls only the
estimation term. In particular, the high-fidelity sample complexity can
be small only insofar as η is
commensurately small.
We next show that a uniform additive error guarantee immediately
implies that the predicted ranking is correct except possibly on pairs
whose true gaps are small. Fix any ĝ satisfying supa|ĝ(z(a, b0)) − F(a)| ≤ γ
for some γ ≥ 0. Consider two
architectures a, a′ ∈ 𝒜. If
the ordering induced by ĝ
disagrees with that induced by F, then necessarily
$$
|F(a)-F(a')| \;\le\; 2\gamma.
$$
Indeed, the disagreement implies
F(a) ≥ ĝ(z(a, b0)) − γ > ĝ(z(a′, b0)) − γ ≥ F(a′) − 2γ,
which rearranges to F(a) − F(a′) > −2γ.
Symmetrizing over which item is truly better yields |F(a) − F(a′)| ≤ 2γ.
Thus, large true gaps are preserved under uniform prediction error.
A convenient aggregate summary is obtained by translating to a
Kendall-τ disagreement bound.
Let τ(ĝ, F)
denote the fraction of pairs (a, a′) whose
ordering differs under ĝ and
F (equivalently, normalized
Kendall-τ distance). Define
the gap-tail function
ρ(γ) := Pr(a, a′)[ |F(a) − F(a′)| ≤ 2γ ],
where (a, a′) is
uniform over unordered pairs from 𝒜.
Then a uniform error bound supa ∈ 𝒜|ĝ(z(a, b0)) − F(a)| ≤ γ implies the deterministic inequality
$$
\tau(\hat g, F)\;\le\;\rho(\gamma).
$$
Combining with the uniform deviation bound (taking γ = γm)
yields a quantitative statement: as m increases, the estimation term
shrinks as m−1/2,
and the fraction of misranked pairs is controlled by the prevalence of
near-ties in the true objective landscape at scale 2γm. In
particular, if the objective has few near-ties, the ranking induced by
the calibrated predictor is accurate even when γm is not yet
small enough for ϵ-optimal
selection.
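The pairwise statement is easy to check numerically; the following self-contained snippet verifies τ(ĝ, F) ≤ ρ(γ) under a uniform perturbation of size γ. The values are synthetic, for illustration only.

```python
import numpy as np
from itertools import combinations

def kendall_disagreement(F, G):
    # Fraction of unordered pairs ordered differently by F and G.
    pairs = list(combinations(range(len(F)), 2))
    bad = sum((F[i] - F[j]) * (G[i] - G[j]) < 0 for i, j in pairs)
    return bad / len(pairs)

def gap_tail(F, gamma):
    # rho(gamma): fraction of pairs with |F(a) - F(a')| <= 2 * gamma.
    pairs = list(combinations(range(len(F)), 2))
    return sum(abs(F[i] - F[j]) <= 2 * gamma for i, j in pairs) / len(pairs)

rng = np.random.default_rng(0)
F = rng.normal(size=200)                      # synthetic true objective
gamma = 0.05
G = F + rng.uniform(-gamma, gamma, size=200)  # uniform-error predictor
assert kendall_disagreement(F, G) <= gap_tail(F, gamma)
```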
We now connect uniform prediction error to optimization. Let a* ∈ arg mina ∈ 𝒜F(a)
be any optimal architecture. If â ∈ arg mina ∈ 𝒜ĝ(z(a, b0)),
then whenever supa|ĝ(z(a, b0)) − F(a)| ≤ γ,
$$
F(\hat a)\;\le\;\hat g(z(\hat a,b_0))+\gamma\;\le\;\hat g(z(a^{*},b_0))+\gamma\;\le\;F(a^{*})+2\gamma.
$$
Thus, any procedure that produces a predictor with a uniform error bound
γ yields a 2γ-optimal architecture by simply
minimizing ĝ over the proxy
table.
To obtain an ϵ-optimality
guarantee, it therefore suffices to enforce γm ≤ ϵ/2.
Under the distortion condition η ≤ ϵ/4, we may choose
m large enough so that the
estimation term in the uniform bound is at most ϵ/4, i.e.,
$$
c\,\sigma\sqrt{\frac{d+\log(n/\delta)}{m}}
\;\le\;\frac{\epsilon}{4}.
$$
Equivalently, it suffices to take
$$
m \;=\; \tilde O\!\left(\sigma^2\,\frac{d+\log(n/\delta)}{\epsilon^2}\right),
$$
up to universal constants (and up to polylogarithmic factors if one uses
a more refined capacity measure or a more elaborate calibration
mechanism). Plugging γ = ϵ/2 into the selection bound yields F(â) ≤ F(a*) + ϵ
with probability at least 1 − δ. Notably, the dependence on
n enters only through log n, reflecting that the
high-fidelity labels are used primarily to learn a predictor over a
finite covariate set rather than to exhaustively evaluate all
architectures.
We finally record a bound for the confidence-bound
(promotion/elimination) variant that adaptively concentrates
high-fidelity queries on promising architectures. Let CIt(a) = [LCBt(a), UCBt(a)]
be calibrated intervals such that, with probability at least 1 − δ, we have simultaneous
coverage F(a) ∈ CIt(a)
for all a that are used in
elimination or selection decisions up to time t. Suppose we eliminate an
architecture a once
$$
\mathrm{LCB}_t(a) \;>\; \min_{a'} \mathrm{UCB}_t(a') \;+\; \frac{\epsilon}{2}.
$$
Under the stated coverage event, the optimal a* is never eliminated:
if a* were removed,
then the elimination condition would imply
F(a*) ≥ LCBt(a*) > mina′UCBt(a′) + ϵ/2 ≥ F(a*) + ϵ/2,
a contradiction. Hence the candidate set always retains an ϵ-optimal element once intervals are
sufficiently tight.
To quantify the number of queries devoted to suboptimal
architectures, let Δa := F(a) − F(a*) ≥ 0.
In typical constructions (including sub-Gaussian concentration with
union bounds, or conformal intervals with an appropriate correction),
the half-width of CIt(a)
scales as
$\mathrm{width}_t(a)\asymp
\sigma\sqrt{\log(n/\delta)/N_t(a)}$
where Nt(a)
is the number of high-fidelity evaluations of a up to time t. Under such a width law, once
Nt(a)
is large enough that widtht(a) ≪ Δa,
the elimination condition will trigger and remove a. Consequently, each suboptimal
architecture is queried only on the order of
$$
N(a)\;=\;\tilde O\!\left(\frac{\sigma^2}{\Delta_a^2}\right)
$$
times before elimination, where Õ(⋅) hides logarithmic factors in
n and 1/δ. This gap-dependent behavior
implies an anytime simple-regret guarantee: after any budget t, the best surviving architecture
(or the LCB-minimizer) is near-optimal whenever all architectures with
Δa ≫ widtht(a)
have been eliminated. In particular, after m queries, the output is ϵ-optimal with probability at least
1 − δ, and the total effort
concentrates on the near-tied region around a*, with far-from-optimal
architectures incurring only O(1/Δa2)
high-fidelity cost.
Taken together, these results show that calibrated multi-fidelity NAS can attain ϵ-optimality using Õ((d + log n)σ2/ϵ2) high-fidelity evaluations, while the ranking and gap-dependent elimination analyses explain why, in practice, accurate ranking and aggressive pruning may occur substantially earlier than the worst-case ϵ-optimality threshold.
We complement the upper bounds of the preceding section with two lower-bound statements. The first isolates a structural necessity: without an explicit linkage between the low-fidelity proxy information and the full-fidelity objective, no algorithm that relies only on low-fidelity data can offer nontrivial worst-case guarantees. The second shows that even under the strongest possible linkage (no distortion, i.e. η = 0), the number of high-fidelity evaluations must scale as Ω(1/ϵ2) in the target accuracy due to the irreducible noise in HiFi. Together, these results justify both components of a calibrated multi-fidelity approach: (i) an assumption such as bounded distortion to connect proxy features to the target metric, and (ii) a non-negligible number of high-fidelity evaluations to overcome stochastic uncertainty.
Consider any algorithm ℳ which, after observing the full low-fidelity transcript (e.g. {LoFi(a, b0)}a ∈ 𝒜 and hence the entire proxy table {z(a, b0)}a ∈ 𝒜), outputs an architecture â ∈ 𝒜. If we impose no relationship between the mapping a ↦ z(a, b0) and the full-fidelity values F(a), then the proxy transcript alone does not identify the minimizer of F even in principle. Formally, one may construct two full-fidelity objectives F1 and F2 that are both consistent with the same low-fidelity observations yet have disjoint minimizers, forcing ℳ to fail on at least one instance.
A particularly direct indistinguishability construction suffices. Fix
any proxy table {z(a, b0)}a ∈ 𝒜
(or even let all proxies be identical, z(a, b0) ≡ z0),
and pick two distinct architectures a(1) ≠ a(2).
Define two objectives F1, F2 : 𝒜 → ℝ
by
F1 (a(1)) = 0, F1(a) = 1 ∀a ≠ a(1), F2 (a(2)) = 0, F2(a) = 1 ∀a ≠ a(2).
By construction, the low-fidelity transcript (hence the observed
proxies) is identical under F1 and F2, since we have not
assumed that LoFi encodes any
information about F. Therefore
ℳ must output the same (possibly
randomized) distribution over â under both objectives. In
particular, letting ℙj denote probability
under objective Fj and the
algorithmic randomness of ℳ, we
have
ℙ1 [â = a(1)] = ℙ2 [â = a(1)], ℙ1 [â = a(2)] = ℙ2 [â = a(2)].
Hence at least one of ℙ1[â ≠ a(1)]
or ℙ2[â ≠ a(2)]
is at least 1/2. On that objective, the
returned architecture suffers suboptimality at least 1. This yields a worst-case constant lower
bound:
$$
\max\Bigl\{\mathbb{E}_{F_1}\bigl[F_1(\hat a)-\min_a F_1(a)\bigr],\;
\mathbb{E}_{F_2}\bigl[F_2(\hat a)-\min_a
F_2(a)\bigr]\Bigr\}\;\ge\;\frac{1}{2}.
$$
The same argument extends to any desired constant gap c > 0 by scaling the non-optimal
values. Crucially, this is not a limitation of any particular choice of
proxy features; it is a limitation of the information pattern. If the
proxy transcript is not assumed to constrain F (e.g. via bounded distortion or a
related structural condition), then the minimizer of F is not identifiable from proxy
data alone, and no nontrivial approximation guarantee is possible
uniformly over all F.
This impossibility result should be read as a necessity statement for assumptions like (A1). In our setting, (A1) can be viewed as a minimal identifiability condition: it excludes adversarial instances in which low-fidelity observations are statistically independent of F. Without such an exclusion, the problem degenerates to a no-free-lunch regime.
We next show that high-fidelity evaluation is information-theoretically necessary at the scale 1/ϵ2, even if proxies are perfectly informative in the sense that η = 0 and 𝒢 contains the true mapping F(a) = g*(z(a, b0)). The point is that, in the presence of σ2-sub-Gaussian noise in HiFi, identifying an ϵ-optimal architecture requires enough repetitions to statistically resolve gaps of size ϵ. This remains true regardless of how informative z(⋅, b0) is, because the algorithm still must determine where the true optimum lies among near-tied candidates, and the only access to F is via noisy labels.
We formalize this by reduction to binary hypothesis testing
(equivalently, the two-arm best-arm identification lower bound).
Consider the special case |𝒜| = 2, with
architectures a1, a2.
Fix any proxy features z(a1, b0), z(a2, b0);
we may even allow them to be distinct and known to the learner. Define
two full-fidelity instances F(+) and F(−) such that the
identity of the better architecture flips:
F(+)(a1) = 0, F(+)(a2) = ϵ, F(−)(a1) = ϵ, F(−)(a2) = 0.
Assume the high-fidelity oracle returns y = HiFi(a) = F(±)(a) + ξ,
where ξ is centered σ2-sub-Gaussian. Any
algorithm that is ϵ-optimal
with probability at least 1 − δ must, on instance F(+), output a1 with probability at
least 1 − δ, and on instance
F(−), output a2 with probability at
least 1 − δ; otherwise the
error exceeds ϵ.
Let the algorithm make a (possibly adaptive) total of m high-fidelity queries, producing a
transcript T consisting of
queried indices and observed noisy values. Denote by ℙ+ and ℙ− the distributions of T under F(+) and F(−), respectively.
Standard change-of-measure arguments (e.g. Le Cam, Bretagnolle–Huber
inequalities, or bandit lower bounds of Mannor–Tsitsiklis type) imply
that if a decision rule distinguishes two hypotheses with error
probability at most δ, then
the KL divergence satisfies
$$
\mathrm{KL}\bigl(\mathbb{P}_{+}\,\|\,\mathbb{P}_{-}\bigr)\;\ge\;\log\frac{1}{4\delta}.
$$
On the other hand, for sub-Gaussian (in particular Gaussian) noise
models, each query to a
contributes at most on the order of (F(+)(a) − F(−)(a))2/σ2
to the KL divergence between the two induced observation laws. In our
two-instance construction, the mean shift is ϵ for whichever arm is queried, and
thus the per-sample KL is O(ϵ2/σ2).
Summing over m queries yields
the upper bound
$$
\mathrm{KL}\bigl(\mathbb{P}_{+}\,\|\,\mathbb{P}_{-}\bigr)\;\le\;C\,\frac{m\,\epsilon^{2}}{\sigma^{2}},
$$
for a universal constant C > 0 (with C = 1/2 in the Gaussian case).
Combining the two displays gives
$$
m \;\ge\;
c'\,\frac{\sigma^2}{\epsilon^2}\,\log\!\left(\frac{1}{\delta}\right)
$$
for a universal constant c′ > 0. This
establishes the claimed lower bound Ω(σ2log (1/δ)/ϵ2)
on the number of high-fidelity evaluations required to guarantee ϵ-optimality with confidence 1 − δ, even in the most favorable
proxy regime.
The foregoing lower bounds align with the upper bounds of the preceding section in two complementary ways. First, they show that the bounded-distortion linkage (or an equivalent structural assumption) is not merely convenient but logically necessary to avoid proxy-only no-free-lunch behavior. Second, they show that the ϵ−2 dependence in the high-fidelity sample complexity is not an artifact of our analysis: it is forced by the stochastic nature of HiFi and persists even when proxies are fully informative up to zero distortion. Thus, the upper bound m = Õ(σ2(d + log n)/ϵ2) can be interpreted as near-tight: improvements can only come from reducing the effective complexity term (e.g. via a smaller hypothesis class or additional structure) and from exploiting favorable gap structure (as in promotion/elimination analyses), but not from beating the fundamental 1/ϵ2 scaling in the worst case.
We outline an experimental program whose purpose is twofold: (i) to instantiate the multi-fidelity protocol in a way that respects realistic adapter-training practice, and (ii) to stress-test the theoretical levers (distortion, calibration, and high-fidelity sample complexity) under controlled ablations. Throughout, we fix a frozen base model M0 and a discrete architecture set 𝒜 of adapter/routing configurations, and we treat full-fidelity evaluation as a standardized fine-tuning recipe whose output metric is F(a) for each a ∈ 𝒜.
We construct 𝒜 as a Cartesian product of (a) , (b) , and (c) options. Placements include (i) attention-only adapters, (ii) MLP-only adapters, and (iii) adapters at both sublayers; we further allow layer subsets (e.g. every layer, every other layer, last k layers) to expose depth/compute tradeoffs. Parameterizations include LoRA-style low-rank updates with rank r ∈ {2, 4, 8, 16, 32}, bottleneck adapters with bottleneck width h ∈ {8, 16, 32, 64}, and optionally IA3-style multiplicative vectors. Routing choices include (i) a fixed shared adapter across tasks, (ii) task-conditional selection among K experts, and (iii) soft mixture-of-experts gates constrained to a small set of templates (to keep 𝒜 finite). We record for each a a descriptor vector (layer mask, rank/bottleneck, parameter count, and routing template) which becomes part of z(a, b0). The resulting n = |𝒜| should be large enough to make exhaustive high-fidelity search unrealistic while still enabling a full low-fidelity sweep (e.g. n ∈ [103, 105], depending on device availability).
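A hypothetical enumeration of such a product space is sketched below; the option lists mirror the text, while the field names and counts are illustrative (a real pipeline would also expand the layer-subset and routing templates into concrete masks).

```python
from itertools import product

PLACEMENTS = ("attn_only", "mlp_only", "both")
LAYER_SUBSETS = ("all", "every_other", "last_k")
PARAMS = tuple(("lora", r) for r in (2, 4, 8, 16, 32)) \
       + tuple(("bottleneck", h) for h in (8, 16, 32, 64)) \
       + (("ia3", None),)
ROUTING = ("shared", "task_select", "soft_moe_template")

ARCH_SET = [
    {"placement": p, "layers": s, "param": q, "routing": r}
    for p, s, q, r in product(PLACEMENTS, LAYER_SUBSETS, PARAMS, ROUTING)
]
print(len(ARCH_SET))   # 3 * 3 * 10 * 3 = 270 base combinations
```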
We evaluate on a heterogeneous task suite to ensure that proxy behavior is not an artifact of a single benchmark. Concretely, we include: (i) sentence-level classification (e.g. GLUE-style), (ii) token-level tasks (e.g. named entity recognition), (iii) generation (summarization and question answering), and (iv) instruction-following or preference-aligned generation if applicable to M0. For each task we fix preprocessing, maximum sequence length, and a metric definition; F(a) is the validation loss (or negative metric) over tasks under a common adapter training recipe, to match the theory’s single scalar objective. We standardize training hyperparameters across a (learning rate schedule, batch size, optimizer, number of steps) except for any hyperparameters that are intrinsically coupled to the architecture (e.g. parameter count-dependent weight decay), which we either fix or explicitly include as a controlled factor in 𝒜. To prevent leakage of low-fidelity information into the final selection criterion, we compute F(a) only from full-fidelity runs and reserve a held-out evaluation split (or held-out tasks) for final reporting.
Because the algorithmic motivation is compute efficiency, we measure cost on multiple devices: a high-end datacenter GPU (e.g. A100/H100), a mid-range GPU, and at least one alternative accelerator (e.g. TPU or an inference-optimized GPU) if available. For each device we report (i) wall-clock time per low-fidelity query cL(b0), (ii) wall-clock time per high-fidelity query cH, and (iii) optional energy consumption (via vendor power telemetry). When hardware-aware considerations are included, we also compute per-architecture latency/memory proxies (parameter count, activation checkpointing overhead) and, for a subset, measured end-to-end inference latency. This enables subsequent extension to Pareto selection, while keeping the primary objective F fixed for the present section.
We set a fixed low-fidelity budget b0 (e.g. a small number of steps/epochs) used for all a. From LoFi(a, b0) we compute learning-curve statistics: initial loss, slope over early steps, smoothed minimum, curvature, and (when applicable) gradient-norm summaries. We concatenate these with architecture descriptors to form z(a, b0) ∈ ℝd. We emphasize that the proxy is not simply an ``early stopping score''; rather, it is a feature representation intended to support learning a predictor ĝ ∈ 𝒢 that is calibrated against HiFi. We predefine several feature sets of increasing richness to support ablations: descriptors-only, curve-only, and combined.
We implement several choices of 𝒢, ranging from linear models and kernel ridge regression to gradient-boosted trees and small neural regressors, each with controlled capacity (effective d). For calibration, we recommend a split-conformal or cross-conformal procedure on the growing high-fidelity dataset D, producing prediction intervals CIt(a) and thus LCBt(a), UCBt(a). We evaluate calibration by (i) empirical coverage versus nominal coverage across queried points and (ii) coverage conditional on predicted score quantiles (reliability-style curves). We also report interval width as a function of m and examine whether the elimination rule would have removed any architecture later found to be near-optimal (a practical check of elimination safety). Since the theory is phrased in terms of uniform error, we additionally compute the maximum absolute error over a held-out set of architectures with full-fidelity labels (when budget permits) and relate it to observed Kendall-τ rank correlation.
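For the coverage audit, a minimal helper is sketched below, assuming a hypothetical table `F_true` of full-fidelity labels on a held-out architecture subset.

```python
import numpy as np

def empirical_coverage(F_true, g_hat, Z, q_hat, archs):
    # Fraction of architectures whose true value lands in CI(a); compare
    # against the nominal 1 - alpha level to audit calibration quality.
    hits = [abs(F_true[a] - g_hat(Z[a])) <= q_hat for a in archs]
    return float(np.mean(hits))
```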
We include ablations that each remove one ingredient of the multi-fidelity method. We compare against baselines that cover common NAS practice under adapter tuning: (i) random search with m full-fidelity trials; (ii) proxy-only selection using early validation loss at b0 (no calibration); (iii) single-fidelity Bayesian optimization (Gaussian process or tree-structured Parzen estimators) using only high-fidelity evaluations; (iv) multi-fidelity bandit methods such as Successive Halving/ASHA/Hyperband using partial budgets, with a fairness constraint matching total compute (sketched below); and (v) regression-based surrogates trained on the same m high-fidelity points but without conformal calibration or without low-fidelity features. We report both optimization performance (best F achieved) and resource usage (time/energy), emphasizing matched-budget comparisons.
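For concreteness, a minimal version of the successive-halving baseline (iv) with an explicit compute tally is sketched below; `train_eval` is a hypothetical oracle mapping an architecture and a partial budget to a validation loss, and the tally is what we check against the matched total-compute constraint.

```python
# Sketch of successive halving with a compute tally for matched-budget comparison.
def successive_halving(candidates, train_eval, min_budget, eta=3):
    """Keep the best 1/eta fraction at each rung; multiply the budget by eta."""
    rung, budget, total_cost = list(candidates), min_budget, 0
    while len(rung) > 1:
        scores = {a: train_eval(a, budget) for a in rung}  # lower is better
        total_cost += budget * len(rung)
        rung = sorted(rung, key=scores.get)[: max(1, len(rung) // eta)]
        budget *= eta
    return rung[0], total_cost  # compare total_cost against m * c_H for fairness
```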
We will release: (i) the explicit definition of 𝒜 (enumeration code and counts), (ii) the exact low- and high-fidelity training recipes (optimizer, schedule, batch size, gradient clipping, mixed precision, checkpointing), (iii) the featurization code for z(a, b0) and the predictor training configuration, (iv) random seeds for architecture sampling and training, (v) dataset versions and preprocessing scripts, and (vi) device specifications and software environment (CUDA/cuDNN, framework versions). All results will be accompanied by confidence intervals over multiple runs (at least over the algorithm’s randomness and, when feasible, over training seeds), and we will log all queried architectures and their observed HiFi values to permit exact replay of the selection trajectory.
We conclude by clarifying which aspects of the analysis are specific to the adapter regime, what failure modes we expect when those conditions are violated, and how the same multi-fidelity viewpoint can be extended to hardware-aware and safety-aware architecture selection.
The central practical advantage of restricting to adapter/routing architectures on a frozen M0 is that the search space 𝒜 can be made large while keeping both fidelities well-defined and reasonably comparable across a ∈ 𝒜. In particular, the low-fidelity oracle LoFi(a, b0) is meaningful only insofar as the early learning dynamics share a stable relationship with the full-fidelity outcome F(a). Freezing M0 and varying only lightweight modules tends to reduce catastrophic non-stationarities in optimization (e.g. divergent runs, sharp regime changes in loss curves), and it keeps the per-step compute cost closer across architectures. Both effects directly support the bounded-distortion premise (A1), namely that there exists a single g* ∈ 𝒢 such that F(a) ≈ g*(z(a, b0)) uniformly over a. When we move outside this regime, the primary risk is not simply that prediction becomes harder, but that the mismatch parameter η ceases to be small in any hypothesis class with controllable complexity.
Our guarantees are additive and explicitly pay the price η, so the statement “what breaks” is concrete: if η is on the order of the target ϵ (or larger), no amount of additional high-fidelity sampling can compensate, because even perfect estimation within 𝒢 cannot identify an ϵ-optimal architecture. There are several mechanisms by which η can increase outside adapter-style search: (i) if we unfreeze substantial parts of M0, early training curves can be dominated by transient effects (e.g. rapid representation drift) that are weakly predictive of the final fine-tuned state; (ii) if architectures change optimization conditioning (e.g. adding deep non-residual modules, changing normalization, or altering effective depth), then the same budget b0 may correspond to qualitatively different training phases across a, breaking the comparability of LoFi(a, b0); (iii) if we allow optimizer or schedule choices to vary across a in ways not represented in z(a, b0), then two architectures may share similar prefixes for incidental reasons while diverging later. In all cases, a pragmatic symptom is that the empirical Kendall-τ saturates well below what would be implied by the observed proxy noise alone, indicating structural ranking errors consistent with a large η; a simple diagnostic is sketched below.
The two-fidelity abstraction implicitly assumes cH ≫ cL(b0) and that we can afford O(|𝒜|) low-fidelity runs. This can fail in regimes where (a) the model is so large that even a short prefix has high wall-clock and memory cost, (b) the search includes architectures whose per-step cost varies widely (e.g. heavy routing or large experts), or (c) the system is I/O-bound, so that launching many partial runs is itself expensive. When these deviations occur, the appropriate modification is to treat low-fidelity queries as budgeted and selective rather than exhaustive. Conceptually, we then have a third optimization layer: we must decide which architectures to probe (i.e., acquire z(a, b0)) before deciding which to query at high fidelity. This suggests a triage protocol in which cheap descriptor-only features are used to choose a subset for partial training, and only then do we proceed to calibrated high-fidelity querying. The theoretical counterpart would replace uniform low-fidelity access with an adaptive design that controls the induced selection bias in ĝ, which is nontrivial because both the training set and the evaluation set become adaptively filtered.
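The following minimal sketch makes the three layers explicit; `descriptor_score`, `lofi_features`, and `hifi_select` are hypothetical placeholders for descriptor-only ranking, partial training, and calibrated high-fidelity selection, respectively.

```python
# Three-layer triage: cheap descriptors -> budgeted LoFi -> calibrated HiFi.
def triage(candidates, descriptor_score, lofi_features, hifi_select,
           k_lofi, k_hifi):
    shortlist = sorted(candidates, key=descriptor_score)[:k_lofi]  # layer 1
    feats = {a: lofi_features(a) for a in shortlist}  # layer 2: budgeted, not exhaustive
    return hifi_select(feats, k_hifi)                 # layer 3: e.g. LCB-based selection
```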
Although we have treated F as a scalar objective, adapter architectures are often deployed under explicit constraints on latency, memory, or energy. A natural extension is to define a vector-valued objective J(a) = (F(a), L(a), M(a), E(a)), where L, M, E denote device-specific inference latency, peak memory, and energy (or proxy measurements), and to seek a Pareto-optimal set rather than a single minimizer. The multi-fidelity structure remains useful, but the selection logic must change: instead of eliminating architectures whose lower confidence bound is worse than the current best upper confidence bound on a scalar, we eliminate architectures that are Pareto-dominated with high confidence. Concretely, if we can construct calibrated sets CItF(a), CItL(a), etc., then we may declare a safely dominated by a′ whenever sup CItF(a′) ≤ inf CItF(a), sup CItL(a′) ≤ inf CItL(a), and so on for each objective, with at least one strict inequality, and remove a from the candidate set (see the sketch after this paragraph). We emphasize that this is more delicate than scalarization: weighted sums can miss non-convex Pareto fronts and can behave poorly when objectives have different noise scales. From an experimental standpoint, this motivates measuring not only F(a) but also device-dependent metrics for a small subset and learning predictors for each component using shared features z(a, b0) augmented with hardware descriptors.
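A minimal implementation of the interval-based domination rule might look as follows; every objective is assumed to be minimized, and each design carries one calibrated interval (lo, hi) per objective.

```python
# Eliminate designs that are Pareto-dominated with high confidence.
def safely_dominated(ci_a, ci_b):
    """True if b dominates a: sup CI^j(b) <= inf CI^j(a) on every
    objective j, with strict inequality on at least one."""
    leq = all(hi_b <= lo_a for (lo_a, _), (_, hi_b) in zip(ci_a, ci_b))
    strict = any(hi_b < lo_a for (lo_a, _), (_, hi_b) in zip(ci_a, ci_b))
    return leq and strict

def prune(ci):  # ci: design -> [(lo, hi) per objective]
    return {a for a in ci
            if not any(b != a and safely_dominated(ci[a], ci[b]) for b in ci)}
```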
A further complication is that both accuracy and cost can be device- and batch-dependent. For example, routing decisions may be cheap on one accelerator and expensive on another due to kernel support, and quantization-aware or compilation-aware effects can reorder architectures by latency even when parameter counts are similar. If the deployment environment is uncertain (or heterogeneous), the appropriate objective is closer to a distributional criterion, such as minimizing expected F(a) subject to high-probability latency constraints across a device distribution. Methodologically, this suggests conditioning the predictor on a device descriptor u and learning g(z, u) ≈ F (and similarly for L, M, E), or learning a family of predictors indexed by device. The bounded-distortion requirement then becomes uniform over both a and u, which is plausible only if u captures the relevant hardware variation. Practically, this motivates reporting transfer: fit ĝ on one device, then evaluate calibration and ranking error on another to quantify how far we are from such uniformity.
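One plausible instantiation is sketched below: a single predictor conditioned on a device descriptor u, together with the suggested transfer report (fit on device A, evaluate calibration and ranking on device B). The ridge model and the concatenated feature layout are assumptions for illustration.

```python
# Device-conditioned predictor g(z, u) and a cross-device transfer report.
import numpy as np
from scipy.stats import kendalltau
from sklearn.linear_model import Ridge

def fit_device_conditioned(Z, U, y):
    """Learn g(z, u) on proxy features Z concatenated with device descriptors U."""
    return Ridge(alpha=1.0).fit(np.hstack([Z, U]), y)

def transfer_report(model, Z_b, U_b, y_b):
    """Evaluate on a device unseen at fit time to probe uniformity over u."""
    pred = model.predict(np.hstack([Z_b, U_b]))
    return {"max_abs_err": float(np.max(np.abs(pred - y_b))),
            "kendall_tau": float(kendalltau(pred, y_b)[0])}
```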
Many realistic adapter-tuning scenarios are federated: low-fidelity training (or even high-fidelity training) occurs on private data silos, and only limited statistics can be shared. In this setting, LoFi(a, b0) may be available at many clients, but HiFi(a) may be rare or centralized. The algorithmic question becomes: can we aggregate proxy features z(a, b0) and a small number of high-fidelity labels without leaking private information? One direction is to design z to be privacy-preserving by construction (e.g. sharing only coarse learning-curve summaries, clipped and noised) and then apply the same calibration-and-selection logic. Another is to treat each client as a separate task distribution and extend the objective to a robust criterion (e.g. worst-client performance), in which case we would require a proxy-to-true linkage that holds uniformly across clients. This regime makes explicit the need for uncertainty quantification: even if mean performance is predictable, tail behavior across clients may not be.
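A sketch of the first direction (sharing only clipped, noised curve summaries) follows; the clip bound and noise scale are placeholders, and a real deployment would set them via a formal differential-privacy accountant rather than the ad hoc values shown here.

```python
# Client-side privatization of low-fidelity curve summaries before sharing.
import numpy as np

def privatize_summary(stats, clip, sigma, rng):
    """Clip the summary vector to L2 norm `clip`, then add Gaussian noise
    scaled to the clip bound (Gaussian-mechanism style)."""
    norm = np.linalg.norm(stats)
    clipped = stats * min(1.0, clip / max(norm, 1e-12))
    return clipped + rng.normal(0.0, sigma * clip, size=stats.shape)

# Example: shared = privatize_summary(curve_stats, clip=1.0, sigma=0.5,
#                                     rng=np.random.default_rng(0))
```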
Finally, we note that practical model selection increasingly includes safety-related criteria (toxicity, bias metrics, jailbreak robustness, hallucination rate) that may be expensive to evaluate and may not correlate with early training curves. In our framework, these can be incorporated either as additional components of a vector-valued objective (leading again to Pareto selection) or as constraints (e.g. minimize F(a) subject to safety score ≤ τ). The main conceptual warning is that bounded distortion may hold for F but fail for safety metrics, even within adapters: two architectures with indistinguishable learning-curve prefixes can differ materially in generation behavior. Thus, a realistic protocol may require separate high-fidelity evaluation channels for safety, with their own noise levels and sample-complexity demands, and may need to decouple the predictor classes for performance and safety rather than forcing a single z-based model to explain both.
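Once separate calibrated intervals exist for performance and safety, the constrained variant admits a simple selection rule, as in the sketch below; both interval maps are hypothetical inputs from the two evaluation channels, and the pessimistic argmin over the performance UCB is one defensible choice among several.

```python
# Constrained selection: minimize F subject to a certified safety bound.
def constrained_select(ci_F, ci_safety, tau):
    """Keep designs whose safety UCB meets the constraint with confidence,
    then return the one with the smallest performance UCB (pessimistic)."""
    feasible = [a for a in ci_F if ci_safety[a][1] <= tau]  # UCB_safety <= τ
    if not feasible:
        return None  # nothing certified safe at this confidence level
    return min(feasible, key=lambda a: ci_F[a][1])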
On the theory side, the most pressing extension is to relax the static bounded-distortion condition into something that can be estimated and validated from data. In practice we can estimate a data-dependent distortion proxy (e.g. cross-validated residuals of ĝ on held-out high-fidelity points) and use it to decide whether the multi-fidelity approach is justified, or whether one should revert to more conservative high-fidelity exploration; a sketch of such a proxy appears below. A second direction is to analyze adaptively sampled calibration sets (our pseudocode uses LCB-based selection after warm-start), where uniform-convergence arguments must account for the dependence between {at} and the observed noise. A third direction is to develop guarantees for structured search spaces (combinatorial 𝒜) where one cannot afford even O(|𝒜|) low-fidelity sweeps; here, one would want to combine proxy learning with combinatorial bandits or structured Bayesian optimization while retaining explicit distortion parameters. These extensions would bring the formal model closer to the full range of NAS practice without sacrificing the clarity of the multi-fidelity separation: proxies can guide search only to the extent that their structural linkage to F is stable and calibratable.
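As promised above, an empirical distortion proxy η̂ can be read off out-of-fold residuals, with a ridge model again standing in for ĝ; comparing η̂ to the target ϵ then gates the multi-fidelity pipeline.

```python
# Data-dependent distortion proxy η̂ from cross-validated residuals of ĝ.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def distortion_proxy(Z, y, n_splits=5, quantile=0.95, seed=0):
    """High quantile of out-of-fold |ĝ(z) - F| as an estimate of η."""
    resid = np.empty(len(y), dtype=float)
    for tr, te in KFold(n_splits, shuffle=True, random_state=seed).split(Z):
        g_hat = Ridge(alpha=1.0).fit(Z[tr], y[tr])
        resid[te] = np.abs(y[te] - g_hat.predict(Z[te]))
    return float(np.quantile(resid, quantile))

# Gate: proceed with proxy-guided search only if η̂ is comfortably below the
# target ϵ; otherwise revert to conservative high-fidelity exploration.
```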