One-Round Federated Few-Shot Personalization via Prompt–Adapter Meta-Learning with Communication Lower Bounds

Metadata

Total Words: 8,891
Export Date: 2026-01-20 02:57:47
Description: Few-shot federated learning (FS-FL) arises when each client holds scarce, private, and heterogeneous data and must personalize quickly. Contemporary few-shot learning offers two complementary adaptation mechanisms: in-context learning (ICL), which adapts using demonstrations without gradient updates, and parameter-efficient fine-tuning (PEFT), which adapts using a small set of trainable parameters. We formalize a one-round-per-episode FS-FL model motivated by real 2026 constraints (bandwidth/latency/energy), and propose a prompt–adapter method that shares a global prompt prior and a low-rank adapter basis while allowing each client to compute a closed-form local adapter from its K-shot support set. We provide tight communication and risk bounds in a linearized foundation-model regime: our algorithm achieves oracle-rate excess risk scaling as O(r/K) under convex smooth losses, while communicating O(r) scalars, and we prove a matching Ω(r) (up to logs) lower bound via a reduction from distributed linear estimation. We also show that uncertainty-calibrated aggregation is minimax-optimal among one-round linear aggregators under heteroscedastic client noise. The work connects FS-FL and modern few-shot paradigms highlighted in recent surveys: moving beyond centralized supervised FSL toward privacy-preserving, cost-aware adaptation with principled guarantees. Experiments (recommended) on heterogeneous-domain benchmarks would validate robustness, bandwidth savings, and energy proxies against multi-round baselines.

1. Introduction: FS-FL motivation (privacy + scarce data), 2026 constraints (bandwidth/energy), and why combining ICL priors with PEFT is natural; summary of contributions and guarantees.
2. Related Work: FS-FL via meta-learning (FedAvg-as-meta, MAML-style), PEFT in FL, ICL/prompting, and uncertainty/calibration; position relative to survey taxonomy (meta-learning/transfer/ICL/hybrid).
3. Problem Setup: Formal one-round-per-episode federated few-shot model; episodic data (support/query) per client; objective as bilevel/meta objective with communication constraints.
4. Model Class: Linearized foundation model abstraction (fixed φ_{θ0}); low-rank adapter family; optional prompt prior as a feature shift; assumptions (convexity/smoothness/noise).
5. Algorithm: FedPrompt-Adapter (FPA): server broadcast of (prompt prior, adapter basis), client closed-form adaptation, uncertainty estimation, and one-shot upload; server uncertainty-weighted aggregation and meta-update.
6. Guarantees I (Upper Bounds): Excess query risk decomposition; O(r/K) local adaptation rate; impact of client heterogeneity; convergence in outer rounds R.
7. Guarantees II (Lower Bounds & Tightness): Communication lower bound for one-round personalization; show near-matching upper bound; discuss necessity of sending Ω(r) information.
8. Extensions: (a) semi-supervised support sets (unlabeled examples) via pseudo-labeling/EM; (b) class-incremental sessions; (c) robustness under cross-domain shifts via shift-aware weighting; (d) privacy add-ons (DP noise) and resulting risk/comm tradeoffs.
9. Experimental Plan (Recommended): Benchmarks, federated splits, baselines (ICL-only, adapter-only, multi-round FL/meta-FL), and metrics (accuracy, bandwidth, wall-clock, energy proxy, calibration).
10. Discussion & Limitations: When linearization is valid, where ICL priors help/hurt, open questions on nonconvex deep regimes, and implications for Green AI reporting.
11. Conclusion: What the one-round FS-FL framing changes, and the main theoretical takeaways.

Content

1. Introduction: FS-FL motivation (privacy + scarce data), 2026 constraints (bandwidth/energy), and why combining ICL priors with PEFT is natural; summary of contributions and guarantees.

Federated learning has matured into a standard mechanism for training predictive models under data-locality constraints, yet a substantial fraction of deployments remain : each client i ∈ [n] can label only a small support set S of size K per task episode, while wishing to achieve low risk on the corresponding query distribution. This regime is enforced both by privacy—raw (x, y) cannot be centralized—and by operational scarcity: for many applications, new users, devices, or clinical sites provide limited supervision before predictions are needed. The central mathematical difficulty is that, with K ≪ d, naive per-client fine-tuning in the ambient parameter space is statistically ill-posed and, in federated settings, also communication- and energy-intensive.

The practical constraints anticipated for near-term deployments (which we treat as design axioms) sharpen the problem. First, bandwidth is not only limited but across clients, and server–client interaction is often constrained to a single request–response per episode; this is typical when connectivity is intermittent, when participation windows are short, or when protocol simplicity is prioritized. Second, energy is a first-order resource: on-device compute and radio transmission costs must be kept small and predictable. Third, foundation models are now commonly available as shared backbones, so we may assume access to a pretrained embedding map ϕ_θ₀ : 𝒳 → ℝ^d that is fixed during the core adaptation loop. Consequently, the relevant question is no longer whether one can train a large model in a federated manner, but whether one can a shared representation from a handful of labeled examples per client using only one-round communication.

We adopt the viewpoint that two recent paradigms should be combined rather than treated as competing. On the one hand, parameter-efficient fine-tuning (PEFT) constrains adaptation to a low-dimensional family, typically via adapters or low-rank updates. On the other hand, in-context learning (ICL) suggests that prompts or prefixes can be interpreted as a prior over task-specific predictors: a prompt p conditions the backbone or embedding in a manner that biases the downstream linear predictor toward task-relevant features. In the few-shot federated regime, this complementarity is logically forced. PEFT supplies a low-dimensional parameterization that makes K-shot estimation feasible; prompting supplies a mechanism for encoding shared inductive bias without updating the backbone. In our formalization, the personalized predictor takes the form
w_i = w_g + Ba_i,
with unknown global component w_g ∈ ℝ^d, unknown adapter basis B ∈ ℝ^d × r of rank at most r ≪ d, and client coefficients a_i ∈ ℝ^r estimated from the local support set S (optionally under prompt-conditioning p). The statistical role of this decomposition is to replace an intractable d-dimensional few-shot regression by an r-dimensional one, while retaining enough flexibility to model heterogeneous client tasks.

The algorithmic constraint we impose is communication: in each server round, the server broadcasts a global state (consisting of (w_g, B, p)), and each participating client returns a single message computed from its sampled episode (S, Q) ∼ 𝒯_i. No additional interaction is permitted within the episode. Under this constraint, the only viable personalization rule is one that admits a closed-form or near-closed-form local computation in the r-dimensional space. For square loss, this naturally leads to a ridge estimator for a_i(S) based on projected features (B^⊤ϕ_θ₀(x)). For general convex ℓ(⋅, ⋅) that is L-Lipschitz and β-smooth in the prediction, we may replace the closed form by a small number of stable updates in ℝ^r; the key point is that local computation remains independent of d beyond forming projections.

A second design choice concerns aggregation. Since clients have heterogeneous noise levels and heterogeneous local conditioning, uniform averaging of client messages is suboptimal even in linear models. We therefore attach to each client message an uncertainty proxy, computed from local sufficient statistics; in the square-loss case this is naturally the posterior covariance Σ_i of the ridge estimator. The server uses these uncertainty estimates to compute weights ω_i and aggregates in a generalized least-squares manner. This choice is not merely heuristic: in the linearized regime it is minimax-optimal among unbiased one-round linear aggregators, and it yields a principled method for allocating influence across clients under heteroscedasticity.

Our contributions are thus the following.

Taken together, these results isolate a regime in which federated few-shot personalization is simultaneously (dependence σ²r/K), (dependence Ω(r)), and with modern foundation-model deployments (frozen embeddings, adapter-style PEFT, and prompt-induced priors). The remainder of the paper situates this approach relative to prior work in meta-learning-based federated adaptation, PEFT in distributed settings, and prompt-based conditioning, and then develops the algorithmic and theoretical details.

We situate our results at the intersection of (i) few-shot federated learning via meta-learning, (ii) parameter-efficient fine-tuning in federated settings, (iii) in-context learning and prompt-based conditioning, and (iv) uncertainty-aware aggregation and calibration. Our goal in this discussion is not to provide an exhaustive survey, but to isolate the specific methodological gap induced by the constraint and the presence of a large frozen backbone.

A standard approach to personalization in federated learning is to reinterpret the federated objective as a meta-objective, in which a server-learned initialization or shared representation enables rapid client adaptation. This viewpoint appears both implicitly and explicitly in works that treat FedAvg-style local SGD as an inner loop and view the server aggregation as an outer-loop update (often described as ``FedAvg-as-meta’’). MAML-style methods and their first-order variants adapt this principle more directly by differentiating through (or approximating) a small number of client adaptation steps, yielding meta-updates intended to improve post-adaptation performance on held-out data; see, e.g., . In federated settings, variants such as Per-FedAvg, FedMAML, and related algorithms pursue the same end: learning a global parameter that is a good starting point for client fine-tuning . These methods are empirically effective, but they typically assume either multiple local steps and/or multiple communication rounds per episode, or at least the ability to communicate full-model gradients/parameters. Our setting makes a different design axiom explicit: within an episode, each client may only (i) download a global state once and (ii) upload a single message once. This pushes us toward inner-loop adaptation rules with closed form (or low-dimensional stable updates) and toward outer-loop updates that can be estimated from one-shot summaries.

Personalization has also been addressed through multi-task learning, partial model sharing, mixture/clustered FL, and regularized local objectives. Representative examples include methods that decompose parameters into shared and private blocks (e.g., ``FedPer’’-style head personalization), methods that learn a client similarity structure or clusters, and proximal/regularized variants that control client drift (e.g., FedProx) . These approaches can reduce negative transfer under heterogeneity, but they typically operate in the ambient parameter dimension and do not by themselves resolve the few-shot ill-posedness when K ≪ d. Our formulation instead treats the low-dimensionality of adaptation as a in the few-shot regime and as a under one-round constraints.

The rise of large pretrained models has made parameter-efficient fine-tuning (PEFT) the dominant practical mechanism for adaptation, including adapters, low-rank updates (e.g., LoRA), prefix/prompt tuning, and bias-only updates . Federated analogues aim to reduce both upload size and on-device compute by restricting training to a small set of parameters while keeping the backbone frozen or mostly frozen. Existing FL-PEFT systems often (a) fine-tune a low-rank module per client and average these modules across rounds, or (b) train a shared adapter while optionally allowing client-specific heads. Our focus differs in two respects. First, we explicitly model few-shot adaptation: the per-client quantity of interest is not a slowly trained personalized model, but an episode-conditioned coefficient vector a_i(S) computed from a support set. Second, we distinguish a B ∈ ℝ^d × r from a_i ∈ ℝ^r, so that the client-facing trainable degrees of freedom remain O(r) and admit a ridge-style estimator in the square-loss regime. This decomposition is aligned with the algorithmic requirement that client computation in each episode be predictable and independent of full-model backpropagation.

In-context learning (ICL) and prompt engineering suggest that conditioning a foundation model on a small demonstration set can yield task adaptation without updating weights. In our formalization, we do not attempt to capture the full complexity of ICL; instead, we treat a prompt or prefix p as a global conditioning mechanism that induces a shared bias over downstream predictors (e.g., via a feature shift or a change in embedding geometry). This perspective is compatible with recent interpretations of prompting as learning a low-dimensional control vector and with prefix-tuning as a parameter-efficient adaptation mechanism. The federated dimension adds a constraint: prompt parameters are naturally shared across clients, whereas the information remains local. Consequently, the appropriate abstraction for our purposes is a of prompting (global prior) and low-rank adaptation (local estimator), rather than prompting as a complete replacement for parameter updates.

Uncertainty-aware aggregation is classical in statistics and appears in modern FL through Bayesian FL, federated variational inference, and methods that reweight clients based on local confidence, noise estimates, or gradient variance. In heterogeneous regimes, uniform averaging is generally not minimax optimal even for linear models, and generalized least squares suggests weighting updates inversely proportional to their covariance. Our contribution is to make this principle operational under one-round constraints: the client computes an uncertainty proxy from local sufficient statistics (e.g., a posterior covariance for ridge regression) and transmits it alongside a low-dimensional update, enabling uncertainty-calibrated weighting without revealing raw data. This viewpoint also interfaces naturally with calibration concerns, since heteroscedasticity across clients induces miscalibration if treated as homogeneous.

In survey taxonomies of personalized federated learning, our method lies between (learning to adapt from few data) and (learning a shared embedding), and it incorporates an global prior through prompts. Concretely, we operate with a frozen representation ϕ_θ₀ (transfer), learn a low-rank adaptation family (PEFT), and analyze a one-step adaptation rule from K samples (meta-learning). The remainder of the paper makes this hybrid precise by specifying the episodic one-round protocol, the induced bilevel/meta objective, and the corresponding statistical and communication limits.

3. Problem Setup: Formal one-round-per-episode federated few-shot model; episodic data (support/query) per client; objective as bilevel/meta objective with communication constraints.

We consider a population of n clients indexed by i ∈ [n]. Client i observes an unbounded private data stream of labeled examples (x, y) ∈ 𝒳 × 𝒴 drawn from an unknown distribution D_i. The distributions {D_i}_i = 1ⁿ may be heterogeneous and need not share label marginals, class sets, or noise levels; the only commonality we assume is that each client can evaluate a fixed prediction rule broadcast by the server and can form empirical losses on locally observed samples.

Rather than optimizing a single global risk over pooled samples, we adopt an episodic formulation appropriate for few-shot personalization. For each client i, an consists of a support set
S = {(x_j, y_j)}_j = 1^K
of K labeled examples (the ) together with a query set
Q = {(x_t^′, y_t^′)}_t = 1^m
used to evaluate post-adaptation performance. We let 𝒯_i denote the episode distribution for client i, i.e.,
(S, Q) ∼ 𝒯_i,
where one may take (S, Q) to be drawn i.i.d. from D_i and then split, or more generally as being induced by a client-specific task generator (e.g., class-conditional sampling). The query set is local: it is used by the client to compute an evaluation loss or a learning signal, but raw query examples are never transmitted.

Given a predictor parameter w (or more generally a model state), we write the empirical query loss as
$$ \widehat{L}_{Q}(w)\;=\;\frac{1}{m}\sum_{(x,y)\in Q}\ell\!\left(\mathrm{pred}(w;x),y\right), $$
and similarly define L̂_S(w) on the support set. We assume throughout that ℓ(⋅, ⋅) is convex in the prediction and is L-Lipschitz and β-smooth in the linear predictor (precise parameterization is deferred to the model class in the sequel).

The server maintains a global state θ (e.g., shared parameters, a representation, or a low-dimensional adaptation structure). When client i receives θ, it computes an personalized model by applying a local adaptation map
𝒜: (S, θ) ↦ u_i(S; θ),
where u_i(S; θ) denotes the client-side degrees of freedom produced from the K shots (for instance, an adapter coefficient vector). This yields a predictor
w_i(S; θ) = 𝒲(θ, u_i(S; θ)),
where 𝒲 is the (known) mechanism that combines global and adapted components. Crucially, u_i(S; θ) depends only on (S, θ) and is computed entirely on-device within an episode.

The performance criterion is the expected post-adaptation query risk. Formally, the federated meta-objective is

and the learning goal is to find θ^⋆ ∈ arg min_θF(θ) subject to the communication and computation constraints described next. The bilevel nature arises because w_i(S; θ) is itself the result of an inner adaptation procedure operating on S.

Time proceeds in server rounds (meta-iterations) t = 0, 1, …, R − 1. In round t, the server selects a subset of participants ℐ_t ⊆ [n] (possibly at random), and broadcasts message to each i ∈ ℐ_t. This download contains the current global state θ^t (or an encoded/compressed representation thereof). After receiving θ^t, each participating client i samples a fresh episode (S_i^t, Q_i^t) ∼ 𝒯_i, computes u_i^t = u_i(S_i^t; θ^t), optionally evaluates L̂_{Q_i^t}(w_i(S_i^t; θ^t)) and/or associated derivatives, and returns upload message
M_i^t = Enc(S_i^t, Q_i^t; θ^t)
to the server. No further interaction is permitted within the episode: in particular, the client may not request additional information after inspecting S_i^t, and the server may not adaptively query the client beyond receiving M_i^t. The server then updates
θ^t + 1 = Agg({M_i^t}_{i ∈ ℐ_t}, θ^t),
where Agg is a prescribed aggregation/update rule (e.g., stochastic gradient descent on using one-shot sketches).

The episodic one-round constraint is enforced to model realistic systems in which on-device personalization must complete within a single connectivity window and cannot sustain inner-loop communication. Accordingly, we require (i) one download and one upload per participating client per round; (ii) bounded message sizes, with the intended operating regime being C_up = O(r) real numbers (or the corresponding bits to a fixed precision), and C_down determined by the representation of θ but amortizable over rounds; and (iii) predictable client computation, where the adaptation map 𝒜 should be implementable in time polynomial in K and a small intrinsic dimension (e.g., r), avoiding full-model backpropagation through a large backbone. Privacy is modeled at the protocol level: clients never transmit raw (x, y) pairs, and the server observes only the uploaded summaries {M_i^t}.

In deployments where bandwidth/energy is part of the objective, one may augment with a cost penalty,
min_θ F(θ) + γ ⋅ 𝔼[C_up(θ) + C_down(θ)],
for a chosen tradeoff parameter γ > 0. Our main focus, however, is the feasibility frontier imposed by one-round interaction: achieving low expected query risk while keeping uploads proportional to the intrinsic adaptation dimension rather than the ambient model size.

This setup defines the objects we will analyze: an episodic, heterogeneous federated environment; a bilevel personalization objective; and a strict one-round-per-episode protocol that constrains what information can be exchanged and hence what statistical rates are achievable.

4. Model Class: Linearized foundation model abstraction (fixed φ_{θ0}); low-rank adapter family; optional prompt prior as a feature shift; assumptions (convexity/smoothness/noise).

We model the pretrained foundation model as a fixed feature map
ϕ_θ₀ : 𝒳 → ℝ^d,
which is shared across all clients and is not updated in the core analysis. Given an input x ∈ 𝒳, the client computes the embedding ϕ_θ₀(x) by a forward pass through the backbone. To ensure well-posedness of the subsequent linear prediction layer, we assume a bounded-feature condition: there exists R_ϕ > 0 such that ∥ϕ_θ₀(x)∥₂ ≤ R_ϕ almost surely under each D_i. This abstraction captures the common practical regime in which personalization modifies only a lightweight head (or a low-rank adapter) while leaving the expensive representation fixed.

Given a weight vector w ∈ ℝ^d, the prediction is the scalar
u = w^⊤ϕ_θ₀(x),
and the per-example loss is ℓ(u, y). We work with losses that are convex in u and satisfy the regularity assumptions used later for stability and meta-convergence: ℓ(⋅, y) is L-Lipschitz and β-smooth uniformly over y ∈ 𝒴. Canonical examples include square loss $\ell(u,y)=\tfrac12(u-y)^2$ (with bounded u) and logistic loss ℓ(u, y) = log (1 + exp (−yu)) for y ∈ {±1}. For client i, the induced population risk is
L_i(w) = 𝔼_{(x, y) ∼ D_i}[ℓ(w^⊤ϕ_θ₀(x), y)],
and the corresponding empirical losses on finite samples are denoted L̂_S(w) and L̂_Q(w).

Client predictors are constrained to lie in a shared low-dimensional personalization family:

where w_g ∈ ℝ^d is a global component, B ∈ ℝ^d × r is a global adapter basis with rank(B) ≤ r ≪ d, and a_i ∈ ℝ^r are client-specific coefficients. The role of is twofold: statistically, it enforces a shared inductive bias across heterogeneous clients by restricting personalization to an r-dimensional subspace; operationally, it makes it possible to communicate and optimize client adaptations using O(r) numbers rather than O(d). When needed, we fix a normalization convention (e.g., B^⊤B = I_r) to separate the scale of B from that of a_i.

To accommodate prompt- or prefix-based conditioning without leaving the linearized regime, we introduce a global prompt parameter p that acts as a shared, low-dimensional perturbation of the frozen embedding. Concretely, we posit an effective feature map

for some fixed matrix U ∈ ℝ^d × d_p (or more generally any differentiable map with bounded Jacobian), so that the predictor becomes u = (w_g + Ba)^⊤ψ(x; p). This model captures the first-order effect of prompt tuning around θ₀: changes in prompt parameters induce approximately linear shifts in the representation seen by the linear head. In the sequel, p is treated as part of the server state and is shared across clients; the client-side degrees of freedom remain confined to a_i.

Given server parameters (w_g, B, p) and a K-shot support set S = {(x_j, y_j)}_j = 1^K, client i adapts by solving a regularized empirical problem over a ∈ ℝ^r:

where λ > 0 controls the bias–variance tradeoff of personalization from few shots. Under the convexity of ℓ(⋅, y), the objective in is convex in a (for fixed (w_g, B, p)), and β-smoothness yields stable solutions. In the square-loss case, reduces to ridge regression in r dimensions: with projected design rows
z_j^⊤ = (B^⊤ψ(x_j; p))^⊤ ∈ ℝ^r, Z ∈ ℝ^K × r,
and residual targets r_j = y_j − w_g^⊤ψ(x_j; p), we obtain the closed form

Thus, the client computes only projected features B^⊤ψ(x_j; p), never gradients through the backbone.

To state sharp excess-risk rates, we compare against a low-rank oracle model in which there exist (w_g^⋆, B^⋆, p^⋆) and client coefficients a_i^⋆ such that labels follow a linearized data-generating mechanism

with ξ conditionally sub-Gaussian (or bounded-moment) and possibly heteroscedastic across clients via σ_i². Under mild conditions on the projected covariance 𝔼[zz^⊤] (e.g., bounded condition number), standard regression theory yields that estimating a_i^⋆ from K shots incurs excess prediction risk on the order of σ_i²r/K, matching the intrinsic dimension r rather than d. For general convex smooth losses, we view as a local quadratic approximation: smoothness controls the deviation between the true empirical objective and its second-order expansion in the r-dimensional adapter coordinates.

The model class above isolates the information that must be extracted from an episode to personalize effectively: the client-specific quantity is the r-vector a_i (or an equivalent r-dimensional sufficient statistic), while the server coordinates the shared objects (w_g, B, p). This separation is precisely what enables one-round-per-episode interaction: each client can compute â(S) (and uncertainty information derived from Z^⊤Z + λI_r) locally, and communicate only O(r) numbers that summarize its episode in the intrinsic adaptation space.

5. Algorithm: FedPrompt-Adapter (FPA): server broadcast of (prompt prior, adapter basis), client closed-form adaptation, uncertainty estimation, and one-shot upload; server uncertainty-weighted aggregation and meta-update.

We now specify a concrete one-round protocol that instantiates the separation suggested by –: the server maintains the shared state (w_g, B, p), while each client computes an episode-specific coefficient vector in the intrinsic r-dimensional adapter coordinates and transmits only an O(r)-size summary.

At server round t ∈ {0, …, R − 1}, the server holds parameters
θ^t = (w_g^t, B^t, p^t),
where B^t ∈ ℝ^d × r is represented in a rank-r form (and, when convenient, re-normalized to satisfy (B^t)^⊤B^t = I_r). The server selects a participating subset ℐ_t ⊆ [n] and broadcasts the same message to each i ∈ ℐ_t. This message consists of (w_g^t, B^t, p^t) (or a delta / compressed representation in implementations), and constitutes the sole downlink communication for the episode. No additional interaction is permitted within the episode beyond the client reply described below.

Upon receiving (w_g^t, B^t, p^t), client i samples a fresh episode (S_i, Q_i) ∼ 𝒯_i, where S_i = {(x_ij, y_ij)}_j = 1^K. The client computes embeddings ψ(x_ij; p^t) (cf. ) and forms the projected features
z_ij = (B^t)^⊤ψ(x_ij; p^t) ∈ ℝ^r, Z_i^t ∈ ℝ^K × r.
Crucially, all computations beyond the frozen backbone forward pass occur only in dimension r. In particular, the client does not backpropagate through ϕ_θ₀, and it never constructs or communicates a d-dimensional gradient.

The client computes an adapted coefficient vector by solving the inner problem with (w_g, B, p) = (w_g^t, B^t, p^t). In the square-loss case, this is exactly the ridge solution ,
â_i^t = ((Z_i^t)^⊤Z_i^t + λI_r)⁻¹(Z_i^t)^⊤r_i^t, (r_i^t)_j = y_ij − (w_g^t)^⊤ψ(x_ij; p^t).
For general convex β-smooth losses, we view as an r-dimensional convex problem and allow the client to compute an approximate minimizer using a constant number of steps (e.g., one Newton step or a few gradient steps) initialized at a = 0; the point is that the local compute budget scales with r, not with d.

To calibrate the reliability of the episode-specific update, the client forms an uncertainty proxy derived from the local curvature in adapter space. In the square-loss model , the posterior covariance of a_i under ridge regression is proportional to
Σ_i^t ∝ ((Z_i^t)^⊤Z_i^t + λI_r)⁻¹.
Rather than communicating Σ_i^t in full (which would cost O(r²) scalars), we compress uncertainty into a single scalar weight ω_i^t = f(Σ_i^t), for instance
$$ \omega_i^t \;=\; \frac{1}{\mathrm{tr}(\Sigma_i^t)} \quad \text{or} \quad \omega_i^t \;=\; \big(\det(\Sigma_i^t)\big)^{-1/r}, $$
both of which increase when the projected design is better conditioned and the estimator variance is smaller. This scalar is sufficient to realize uncertainty-weighted averaging at the server while preserving an O(r) uplink.

Client i uploads a single message
M_i^t = (â_i^t, ω_i^t),
whose size is r + 1 real numbers. Optionally, if the client evaluates the adapted predictor on the query set Q_i locally, it may instead upload an O(r)-dimensional g_i^t along with ω_i^t, where g_i^t is an unbiased (or controlled-bias) estimator of ∇_θL̂_{Q_i}(w_g^t + B^tâ_i^t ; p^t) computed via the envelope theorem / implicit differentiation through the r-dimensional adaptation map. In either case, the protocol remains one-round and uplink cost remains O(r).

Upon receiving {M_i^t}_{i ∈ ℐ_t}, the server normalizes weights
$$ \bar\omega_i^t \;=\; \frac{\omega_i^t}{\sum_{j\in\mathcal{I}_t}\omega_j^t}. $$
If clients upload gradient sketches, the server performs the stochastic meta-update
θ^t + 1 = Π (θ^t − η∑_{i ∈ ℐ_t}ω̄_i^t g_i^t),
where Π denotes any operation enforcing the rank/normalization constraints on B (e.g., re-orthogonalization of columns). If clients upload coefficients â_i^t, the server uses them as sufficient statistics for fitting the shared objects (w_g, B, p) via a low-rank regression surrogate: informally, â_i^t identifies how client i wishes to move within the shared subspace, while ω̄_i^t discounts episodes whose K shots provide weak identifiability in that subspace. This uncertainty weighting is the algorithmic counterpart of generalized least squares and is aligned with the optimal linear aggregation principle stated later.

The protocol is intentionally structured so that all episode-specific information is summarized by an r-dimensional object (the adapted coefficients or an equivalent sketch), while the server coordinates only global degrees of freedom. The one-round constraint is met because the client can compute â_i^t and ω_i^t from S_i alone, and any query-based evaluation is performed locally prior to the single upload. This yields a per-episode communication pattern compatible with bandwidth-limited federated deployments and prepares the ground for the excess-risk and communication guarantees that follow.

6. Guarantees I (Upper Bounds): Excess query risk decomposition; O(r/K) local adaptation rate; impact of client heterogeneity; convergence in outer rounds R.

We record the basic performance guarantees that justify the one-round design at the level of excess query risk. Throughout, we measure performance on the episodic query distribution and compare against the that knows the best shared state (w_g, B, p) (and, within an episode, the best coefficients a_i) subject to the family w_i = w_g + Ba_i.

Fix a round t and client i, and abbreviate θ^t = (w_g^t, B^t, p^t). Let a_i^⋆(θ^t) denote the exact inner minimizer of the regularized support objective (cf. ) at θ^t, and let â_i^t be the coefficient produced by the client (exact ridge for square loss, or an approximate minimizer for a general convex loss). Writing w_i^t(S) = w_g^t + B^tâ_i^t(S), we decompose

where w_i^⋆(S) = w_g^⋆ + B^⋆a_i^⋆ is the oracle predictor under the low-rank model and

Term (I) is the (statistical and/or computational) incurred when estimating coefficients from K shots under one-round constraints; term (II) is the due to θ^t not yet matching the oracle global state; term (III) is the (typically negligible) gap between the episode-wise inner minimizer and the fixed oracle coefficients, and vanishes in the square-loss linear model with correctly specified ridge objective (or can be absorbed into (I) as regularization bias). In particular, under correct specification and exact ridge, (III) reduces to the usual λ∥a_i^⋆∥₂² bias term.

In the square-loss regime with the linearized noise model and with (w_g, B) = (w_g^⋆, B^⋆) fixed, term (I) admits the standard ridge bound in intrinsic dimension r. Concretely, letting Z_i ∈ ℝ^K × r denote the projected design with rows z_ij = (B^⋆)^⊤ψ(x_ij; p^⋆), Theorem~1 yields

where the hidden constants depend on mild regularity of the projected covariance (e.g., bounded condition number or eigenvalue concentration for (1/K)Z_i^⊤Z_i). The salient point is that the K-shot rate scales with r rather than d, because only the r adapter coordinates must be inferred within an episode.

For general convex L-Lipschitz and β-smooth losses, we may not have a closed form, but the same intrinsic scaling persists: if the client produces â_i^t satisfying an optimization error L̂_{S_i}(w_g^t + B^tâ_i^t) − min_aL̂_{S_i}(w_g^t + B^ta) ≤ δ, then by standard stability arguments in strongly convex regularized ERM,

and a constant number of first/second-order steps in ℝ^r can ensure δ is of the same order as the statistical term when K is moderate. Thus, one-round computation in r dimensions is sufficient to attain the correct r/K scaling up to smoothness/Lipschitz constants.

Client heterogeneity enters the bound through (σ_i², ∥a_i^⋆∥₂) and through the conditioning of the projected design Z_i. In particular, implies that high-noise clients or clients whose K examples are uninformative in the adapter subspace induce larger estimation variance. The scalar weight ω_i^t = f(Σ_i^t) is designed to downweight exactly these episodes: in the square-loss model, Σ_i^t ∝ ((Z_i^t)^⊤Z_i^t + λI)⁻¹, and choosing ω_i^t increasing in (tr Σ_i^t)⁻¹ or (det Σ_i^t)^−1/r implements a one-round analogue of generalized least squares. Consequently, when the server aggregates O(r)-dimensional sketches, the effective noise in the meta-update is governed by an inverse-variance combination rather than an unweighted average, aligning the contribution of each client with its identifiability in the shared subspace.

Let $F(\theta):=\frac{1}{n}\sum_{i=1}^n \mathbb{E}_{(S,Q)\sim\mathcal{T}_i}\big[\widehat L_Q(w_g+B\,A(S;\theta);p)\big]$ denote the meta-objective, with θ = (w_g, B, p) under a fixed rank-r parameterization of B. Under the convexity and β-smoothness assumptions, if clients upload unbiased (or controlled-bias) one-round gradient sketches and the server performs the weighted stochastic update described in Section~, Theorem~2 yields an outer optimization term of the form

after R rounds with M := |ℐ_t| clients per round. We summarize as an opt_gap(R) term and note that it is independent of K; the role of K is confined to the inner adaptation accuracy captured by (I).

Combining – and averaging over clients, we obtain the canonical guarantee promised in the enclosing scope:
$$ \frac{1}{n}\sum_{i=1}^n \mathbb{E}\Big[\widehat L_{Q_i}(w_g^R+B^R\hat a_i(S_i);p^R)\Big] \;\le\; \frac{1}{n}\sum_{i=1}^n \widehat L_{Q_i}(w_g^\star+B^\star a_i^\star;p^\star) \;+\; O\!\left(\frac{1}{n}\sum_{i=1}^n \frac{\sigma_i^2 r}{K}\right) \;+\; \mathrm{opt\_gap}(R), $$
up to the regularization bias O(λ∥a_i^⋆∥₂²) (or its convex-loss analogue). With the usual tuning λ ≃ σ_i²/K (or a robust pooled choice), the excess query risk is Θ(σ_i²r/K) per client, achieved with one-round communication and an O(r)-size uplink. The next section shows that this O(r) information transfer is not merely sufficient but (up to logarithmic factors) necessary for worst-case one-round personalization at target excess risk ε.

7. Guarantees II (Lower Bounds & Tightness): Communication lower bound for one-round personalization; show near-matching upper bound; discuss necessity of sending Ω(r) information.

We now justify that the one-round design is not merely a convenient engineering constraint but is, in a precise sense, information-theoretically tight: achieving nontrivial personalization accuracy in the low-rank family w_i = w_g + Ba_i requires each client to convey Ω(r) degrees of freedom per episode (up to logarithmic factors in the target error). The conclusion is that the O(r)-dimensional uplink used by our protocol is essentially necessary whenever the goal is to compete with the centralized low-rank oracle under worst-case client heterogeneity.

To isolate the communication bottleneck, we consider the square-loss regime and restrict attention to a class of episodic tasks in which the global state (w_g, B) is already known to the server, and the unknown per-client quantity is the coefficient vector a_i^⋆ ∈ [−1, 1]^r. This restriction can only make the problem easier; hence any lower bound in this setting applies to the general meta-learning problem in which (w_g, B, p) must also be learned.

Concretely, fix an embedding distribution such that the projected features z = (B^⊤ϕ_θ₀(x)) ∈ ℝ^r are well-conditioned (e.g., isotropic subgaussian with 𝔼[zz^⊤] = I). The client observes a K-shot support set S = {(z_j, y_j)}_j = 1^K with
y_j = ⟨a_i^⋆, z_j⟩ + ξ_j, ξ_j ∼ 𝒩(0, σ²),
and aims to produce (after one server download and one upload) a predictor whose query risk is within ε of the oracle that knows a_i^⋆. Since the server already knows (w_g, B) in this reduction, the client-to-server message is the sole channel by which the server can infer a_i^⋆.

The following statement formalizes the necessity of transmitting Ω(r) information in one round.

The core mechanism is that, on a well-conditioned design, prediction accuracy forces coefficient accuracy. Indeed, under 𝔼[zz^⊤] = I and square loss,
𝔼_z[(⟨a, z⟩ − ⟨a^⋆, z⟩)²] = ∥a − a^⋆∥₂²,
so achieving excess risk ≤ ε requires ∥a − a^⋆∥₂² ≲ ε (up to constants and noise-floor terms). Thus any successful protocol must enable the server to localize a^⋆ to an ℓ₂ ball of radius on the order of $\sqrt{\varepsilon}$.

To convert localization into bits, we choose a packing 𝒜 ⊂ [−1, 1]^r with pairwise separation $\|a-a'\|_2\ge c\sqrt{\varepsilon}$ and cardinality $|\mathcal{A}|\ge (c'/\sqrt{\varepsilon})^{r}$ for absolute constants c, c^′ > 0 (a standard volumetric bound). Let A be uniform on 𝒜 and let the support set S be generated from A as above. If a protocol outputs a predictor with excess risk ≤ ε for all A ∈ 𝒜, then the server can decode (with constant probability) which element of 𝒜 generated the data, because two distinct hypotheses are separated by more than the tolerance implied by the risk guarantee. By Fano’s inequality, successful decoding implies
I(A; M) ≥ Ω(log |𝒜|) = Ω(rlog (1/ε)),
where M is the client message. Since mutual information is bounded by the number of communicated bits in any protocol, the stated lower bound follows.

Theorem~ matches our one-round scheme up to logarithmic factors. In particular, sending an r-dimensional coefficient vector (or an r-dimensional gradient sketch) with constant precision uses O(r) real numbers; quantizing each coordinate to b = Θ(log (1/ε)) bits yields an O(rlog (1/ε))-bit uplink sufficient to preserve an ε-level risk target, aligning with the lower bound scaling. Conversely, any attempt to reduce the uplink dimension below r in the worst case necessarily sacrifices the ability to personalize across an r-dimensional family of client shifts.

The lower bound should be read as an ``intrinsic dimension’’ statement: personalization accuracy is bottlenecked not by the ambient embedding dimension d but by the adaptation rank r. This is precisely the regime targeted by low-rank adapters: we may choose r ≪ d to make both statistical adaptation (the r/K phenomenon) and one-round communication feasible, but we cannot, in general, push below Ω(r) information transfer without weakening the target guarantee or imposing additional structure (e.g., sparsity, shared clustering, or multi-round interaction). This establishes the sense in which our upper bounds are near-minimax optimal under the one-round constraint.

8. Extensions: (a) semi-supervised support sets (unlabeled examples) via pseudo-labeling/EM; (b) class-incremental sessions; (c) robustness under cross-domain shifts via shift-aware weighting; (d) privacy add-ons (DP noise) and resulting risk/comm tradeoffs.

We record several natural variants of the one-round template that preserve the core structural constraints (single download, single upload; local adaptation restricted to an r-dimensional subspace) while addressing common deviations from the basic few-shot supervised episodic setting.

In many deployments, the client can cheaply obtain additional unlabeled examples in an episode. We model an episode as (S, U, Q) where S = {(x_j, y_j)}_j = 1^K and U = {u_ℓ}_ℓ = 1^K_u with unknown labels. The client may incorporate U without additional communication by performing a local self-training step in the adapter space. Concretely, define projected features z = B^⊤ϕ_θ₀(x) ∈ ℝ^r and let ỹ(u_ℓ) be a pseudo-label produced by the current global predictor (or a soft label in [0, 1] under logistic loss). The client then solves the augmented regularized problem
$$ \hat{a}(S,U)\in\arg\min_{a\in\mathbb{R}^r}\;\sum_{(x,y)\in S}\ell(\langle w_g+Ba,\phi(x)\rangle,y) \;+\;\alpha\sum_{u\in U}\ell(\langle w_g+Ba,\phi(u)\rangle,\tilde{y}(u)) \;+\;\frac{\lambda}{2}\|a\|_2^2, $$
for a fixed weight α ∈ [0, 1] controlling the influence of pseudo-labels. For square loss, this remains a ridge problem with K + K_u effective samples and admits a closed form; for general convex smooth ℓ, one may run a small, fixed number of Newton/gradient steps in ℝ^r (still purely local, hence compatible with one-round communication). A simple EM interpretation is available when ỹ(u) is replaced by a latent variable: alternately update pseudo-labels using the current a and update a given pseudo-labels. Since all inner iterations occur on-device, the uplink message size is unchanged (still O(r) reals, e.g. â or an r-dimensional sketch). Statistically, if the pseudo-label error rate is controlled (e.g. 𝔼[∥ỹ − y∥²] ≤ η²), the induced excess risk decomposes into an estimation term benefiting from improved conditioning of the empirical Gram matrix (scaling like r/(K + K_u) under standard assumptions) plus an additional bias term on the order of α η²; thus the method interpolates between pure supervision (α = 0) and aggressive self-training.

In class-incremental personalization, a client encounters a sequence of sessions, each introducing new classes. A convenient formulation is a multi-class linear head W_i ∈ ℝ^d × c_i with low-rank personalization
W_i = W_g + BA_i, A_i ∈ ℝ^r × c_i,
where c_i grows over time and B is shared. Within a session that introduces c_new new classes, the client adapts only the corresponding columns of A_i using the session support set, with ridge regularization to prevent drift:
$$ \hat{A}_{\mathrm{new}} \in \arg\min_{A_{\mathrm{new}}}\;\widehat{L}_{S}\!\big(W_g + B[A_{\mathrm{old}},A_{\mathrm{new}}]\big) \;+\;\frac{\lambda}{2}\|A_{\mathrm{new}}\|_F^2 \;+\;\frac{\lambda_{\mathrm{st}}}{2}\|A_{\mathrm{old}}-\bar{A}_{\mathrm{old}}\|_F^2, $$
where Ā_old denotes the previous session’s coefficients retained locally (or a server-provided anchor). This yields one-round updates in which the client uploads only the newly adapted columns (communication O(r c_new) reals), while the server continues to refine (W_g, B) across clients. The same uncertainty-weighting principle applies by estimating a posterior covariance for each new column (or for the block A_new), ensuring that sessions with insufficient shots are downweighted in the meta-update.

When the client’s episode distribution differs substantially from the training mixture implicit in the current global state, naive aggregation can overfit to outlier domains or underfit underrepresented ones. Since the server does not observe raw data, we restrict to shift diagnostics computable locally and transmissible as a low-dimensional scalar. A canonical choice is a discrepancy between the client’s support embedding statistics and a global reference, e.g.
$$ \mu_i \;=\; \frac{1}{K}\sum_{(x,y)\in S_i}\phi_{\theta_0}(x),\qquad D_i \;=\; \|\mu_i-\mu_0\|_2, $$
where μ₀ is a running global mean maintained by the server. The client may upload a single additional scalar D_i (or a quantized version) and the server may modify the aggregation weights to
$$ \omega_i \;=\; \omega_i^{\mathrm{unc}}\cdot \frac{1}{1+\gamma D_i}, $$
with ω_i^unc derived from posterior variance as in the square-loss analysis and γ ≥ 0 a tunable robustness parameter. More generally, one can replace D_i by an MMD proxy computed in a fixed random feature space of dimension O(1), preserving one-round communication. In smooth convex losses, such reweighting can be interpreted as optimizing a distributionally robust surrogate in which heavily shifted episodes contribute less, and it yields stability gains when the per-episode gradients have heavy tails.

If the client message is an r-vector (e.g. â_i or a gradient sketch), the client can enforce record-level differential privacy by adding Gaussian noise and, if needed, clipping in ℓ₂:
m̃_i = clip(m_i; S) + 𝒩(0, τ²I_r),
where τ is calibrated to (ε_dp, δ_dp) and the sensitivity induced by the clipping threshold S. In the square-loss linearized regime, this is equivalent to inflating the effective observation noise, and the excess risk bounds acquire an additional term scaling with τ²; heuristically, one replaces σ_i² by σ_i² + τ² in the O(σ_i²r/K) contribution, while the communication lower bound in r remains unchanged (privacy does not reduce the intrinsic dimension). On the quantization side, if each coordinate is transmitted with b bits, then b = Θ(log (1/ε)) remains sufficient to preserve an ε-level accuracy target, but DP noise may force larger K (or smaller r) to compensate for the added variance. This delineates a three-way tradeoff between personalization rank r, statistical budget K, and privacy level (ε_dp, δ_dp) under the one-round constraint.

9. Experimental Plan (Recommended): Benchmarks, federated splits, baselines (ICL-only, adapter-only, multi-round FL/meta-FL), and metrics (accuracy, bandwidth, wall-clock, energy proxy, calibration).

To validate the one-round template under realistic heterogeneity, we recommend an experimental protocol that (i) instantiates a fixed foundation embedding ϕ_θ₀, (ii) constructs federated episodic task distributions {𝒯_i}_i = 1ⁿ with controlled domain and label shift, and (iii) reports not only accuracy but also communication, wall-clock, energy proxies, and calibration. Throughout, we keep the backbone frozen and measure performance as a function of the few-shot budget K, adapter rank r, and number of server rounds R (with M participating clients per round). Each participating client draws an episode (S_i, Q_i) ∼ 𝒯_i with |S_i| = K and |Q_i| = m, adapts locally to obtain a_i(S_i), and uploads a single summary, thereby respecting the single-download/single-upload constraint.

We advocate covering at least two regimes: (1) few-shot classification using a frozen CLIP/ViT embedding ϕ_θ₀ with standard datasets (e.g. miniImageNet, CIFAR-FS, tieredImageNet), and (2) intent/topic classification using a frozen sentence encoder (e.g. a BERT-family embedding or an instruction-tuned encoder). In both cases, the prediction head is linear in the embedding: x ↦ ⟨w, ϕ_θ₀(x)⟩ (binary) or a multiclass linear head. To stress the federated aspect, we recommend at least one naturally federated benchmark (e.g. LEAF-style user partitions for text or handwriting) where each client corresponds to a user/device, and at least one benchmark where each client corresponds to a domain (camera type, location, topic) so that D_i changes primarily through covariate shift in ϕ_θ₀(x). Since the claims are statistical and communication-theoretic, synthetic controls are also useful: generate z = B^*⊤ϕ_θ₀(x) ∈ ℝ^r with known B^* and heteroscedastic noise σ_i² to verify the predicted Θ(σ_i²r/K) scaling under square loss.

We recommend three families of splits. First, : assign each client a subset of labels or a Dirichlet-mixed label distribution, yielding heterogeneous D_i(y) while keeping ϕ_θ₀ fixed. Second, : partition by domain such that ∥μ_i − μ₀∥₂ differs markedly, with $\mu_i=\frac{1}{K}\sum_{(x,y)\in S_i}\phi_{\theta_0}(x)$, to test shift-aware weighting. Third, : vary the rate at which clients can sample episodes and/or vary K across clients to mimic intermittent participation and unequal data availability. For each setting, we define 𝒯_i by sampling K labeled support points i.i.d. from D_i and sampling m queries from the same distribution; evaluation uses fresh episodes disjoint from those used for meta-updates. We recommend reporting both (a) the average query metric across clients and (b) tail metrics (e.g. the 10th percentile across clients), since uncertainty-weighting is intended to stabilize performance when some episodes are poorly conditioned.

We recommend including baselines spanning the design space: (i) (prompt/prefix p only, no learned adapter basis B and no client-specific a_i beyond what can be encoded in the prompt), evaluated with the same support examples S as context; (ii) where each client fits an r-dimensional adapter coefficient a_i but with a fixed, non-meta-learned basis (e.g. random B, or B obtained once by PCA on a public/held-out embedding stream), isolating the contribution of federated meta-learning of B; (iii) baselines that relax the one-round constraint, such as FedAvg on (w_g, B) with multiple local steps and multiple communication rounds per episode, to quantify the cost of one-roundness; and (iv) baselines (e.g. MAML/Reptile-style adaptations in the r-dimensional space, or FedPer-like shared head plus local fine-tuning) run with comparable total client compute. When possible, we include a centralized oracle that trains B and w_g using pooled episodes (or pooled embeddings) to upper-bound attainable performance under the same low-rank parameterization.

Primary predictive metrics should include accuracy (or macro-F1 for imbalanced labels) and, when probabilistic outputs are available, negative log-likelihood and Brier score. Calibration is essential in few-shot personalization; we recommend expected calibration error (ECE) on Q, reliability diagrams aggregated across clients, and per-client calibration variability. Communication should be reported as (C_down, C_up) in both scalars and bytes: for instance, uploading a_i ∈ ℝ^r costs O(r) reals (or rb/8 bytes at b-bit quantization), while downloading B ∈ ℝ^d × r costs O(dr) reals but can be amortized by infrequent updates (hence we report amortized cost per episode). Wall-clock time should be separated into on-device adaptation time (projection O(Kdr) plus solve O(Kr² + r³)) and server aggregation time, measured on representative hardware. For an energy proxy, we recommend a simple linear model
E_i = α_flop ⋅ FLOPs_i + α_byte ⋅ (C_up + C_down),
with stated coefficients α_flop, α_byte and sensitivity analysis across plausible ratios. Finally, we recommend ablations over (K, r, R), over weighting schemes (uniform versus uncertainty-weighted ω_i), and over message precision (float32 versus quantized), since these directly test the predicted statistical–communication tradeoffs.

10. Discussion & Limitations: When linearization is valid, where ICL priors help/hurt, open questions on nonconvex deep regimes, and implications for Green AI reporting.

Our claims are predicated on a regime in which the frozen embedding ϕ_θ₀ renders each client episode amenable to a low-dimensional, approximately linear predictor. This premise is plausible when (i) the downstream label depends smoothly on semantic features already separated by the foundation model, and (ii) the client-specific variability can be expressed as a low-rank shift Ba_i in the head. Under these conditions, local adaptation is statistically well-behaved and admits closed-form or near-closed-form estimators in the r-dimensional space. However, linearization can fail when the relevant decision boundary is not linearly separable in ϕ_θ₀(x), when spurious correlations differ substantially across clients, or when the task requires compositional generalization not captured by the frozen embedding. In such cases, the observed excess risk may saturate as K increases, and the Θ(σ_i²r/K) scaling should be viewed as describing the error in the adapter space rather than the irreducible error induced by the low-rank model class.

Even within the linear regime, the oracle-type bounds implicitly assume that the projected design Z = (B^⊤ϕ_θ₀(x)) is sufficiently well-conditioned at the episode scale. Few-shot episodes can be ill-posed: for small K, Z^⊤Z may be close to singular, and the ridge estimator then depends delicately on λ and on the spectrum of Z^⊤Z. Our uncertainty-based weighting partially addresses this issue at aggregation time by downweighting high-variance episodes, but it does not eliminate the fact that some clients may systematically operate in poorly conditioned subspaces (e.g. clients with narrowly distributed inputs). Thus, while uncertainty weighting can stabilize averages, it may also shift the effective objective toward clients with easier geometry, raising a fairness question: whether to prioritize minimax (tail) risk or average risk, and how to choose weights to reflect the desired social objective rather than purely statistical efficiency.

The role of in-context learning (ICL) priors and prompts p is similarly double-edged. On one hand, prompts can be interpreted as inducing a task-conditioned feature shift or prior over predictors, which may reduce the effective dimension of adaptation and improve few-shot performance, especially when labels are aligned with natural language descriptions. On the other hand, prompts can introduce that is orthogonal to the client’s data: if the prompt encodes a global prior that mismatches a client domain, it may systematically distort logits and inflate the bias term in the excess risk decomposition. This is most visible under label shift or when the prompt contains demonstrations drawn from a population that under-represents certain clients. Moreover, prompts are not merely parameters: they often carry semantic content, and prompt engineering choices can create hidden degrees of freedom that complicate reproducibility. From a privacy standpoint, one must also ensure that prompts or retrieved demonstrations do not inadvertently memorize or leak client-specific information; the one-round constraint limits interaction but does not by itself prevent leakage through the content of messages.

A further limitation is that our cleanest statements rely on convexity and smoothness in the adapted parameters. While the local problem in a can be convex for generalized linear losses, end-to-end systems that update B or p through a deep network are nonconvex, and the meta-objective may exhibit sharp curvature and instability under client heterogeneity. The present framing can be viewed as an attempt to isolate a tractable subproblem: keep ϕ_θ₀ frozen, restrict personalization to a low-rank linear family, and enforce one-round communication. An open question is whether analogous communication-optimality statements hold when personalization requires updates, e.g. shallow adapters inside the backbone, or when the best response map A(S; ⋅) is computed by several steps of nonconvex optimization. Relatedly, for modern large models, the empirical success of low-rank adaptation suggests that the effective dimension may indeed be small, but translating this phenomenon into a verifiable condition (e.g. a notion of task-dependent intrinsic dimension) remains unresolved.

The one-round requirement itself is both a feature and a constraint. It is well-motivated by intermittent connectivity and by the cost of interactive protocols, but it restricts the richness of information that can flow from client to server. Our lower bound formalizes this in a stylized setting: to achieve oracle-level personalization one must communicate at least Ω(r) degrees of freedom (up to logarithms). Nevertheless, there may be intermediate regimes in which messages—random sketches, quantized sufficient statistics, or client-side coresets in the embedding space—yield better constants than naive coefficient transmission while still respecting the one-shot format. Understanding the optimal tradeoff between message structure, robustness to heteroscedasticity, and privacy mechanisms (secure aggregation, differential privacy noise) is largely open; in particular, privacy noise effectively increases σ_i², and thus interacts directly with the predicted σ_i²r/K scaling.

Finally, the implications for Green AI reporting are immediate. The one-round framing makes communication and on-device adaptation a first-class part of the learning objective, rather than an afterthought. Yet in practice, papers often report only accuracy improvements without the amortized download cost of distributing B (or a prompt) and without energy proxies for repeated on-device projection and solves. We therefore regard it as a limitation of the current culture, not only of our method, that results are rarely accompanied by transparent accounting of C_up, C_down, quantization level, participation rate, and the resulting energy/carbon implications. Because our guarantees explicitly tie excess risk to K, r, and one-round communication, they also provide a natural template for such reporting: when a method improves accuracy, one should also state whether it did so by increasing r, increasing R, increasing message precision, or relaxing one-roundness, and how those changes translate into concrete resource consumption.

11. Conclusion: What the one-round FS-FL framing changes, and the main theoretical takeaways.

The one-round FS-FL framing forces us to treat as a primitive, not as a byproduct of running many rounds of distributed optimization. Concretely, each episode provides a client with only a K-shot support set S and at most one exchange of messages with the server. Under this constraint, the usual separation between local training'' andglobal averaging’’ is no longer adequate: the client must produce, from S alone, a complete description of how it should be specialized, and the server must update a shared state using summaries that are necessarily low bandwidth. The structural assumption w_i = w_g + Ba_i makes this feasible by reducing personalization to an r-dimensional estimation problem.

Within this model, our main statistical takeaway is that few-shot personalization admits an oracle-type rate when the effective adaptation dimension is small. In the square-loss linearized regime, once (w_g, B) are aligned with the underlying low-rank structure, the ridge estimator â_i(S) achieves expected query excess risk scaling as
$$ \mathbb{E}\big[\widehat{L}_Q(w_g+B\hat a_i)-\widehat{L}_Q(w_g+Ba_i^*)\big] = \Theta\!\left(\frac{\sigma_i^2\, r}{K}\right) $$
up to regularization bias. This rate is informative because it isolates the dependence on the personalization degrees of freedom r and on the episode budget K. In particular, it suggests a reporting template in which gains in accuracy are interpreted through reductions in effective dimension (smaller r), reductions in noise (better embeddings), or increases in labeled shots K, rather than through opaque algorithmic complexity.

Our algorithmic takeaway is that one-round personalization is compatible with principled aggregation. Since each client can return only a small message, it is natural to transmit either (i) the adapted coefficients a_i (or a sketch thereof) and a scalar uncertainty proxy, or (ii) a compact meta-gradient estimate together with an uncertainty weight. The uncertainty-calibrated weighting is not merely heuristic: in the linear-Gaussian specialization it is the generalized least-squares solution, and hence minimax-optimal among unbiased linear aggregators given client-specific covariances. Operationally, this provides a clean bridge between what clients can compute locally (posterior covariances in the r-dimensional ridge problem) and what the server should do globally (downweight high-variance or ill-conditioned episodes).

Our optimization takeaway concerns the role of server rounds R. One-round communication is an constraint; it does not preclude repeated participation across rounds. Thus the meta-learning problem becomes a stochastic approximation problem over θ = (w_g, B, p), where each round observes independent per-client episodes and returns a single summary per client. Under convexity and smoothness assumptions on the induced meta-objective, standard SGD analysis yields an opt_gap(R) = O(1/R)-type term (with the usual variance dependence on the number of participating clients per round). In this sense, the framing decouples two sources of error: the estimation error σ_i²r/K that no amount of server-side iteration can remove, and the optimization error that vanishes as R grows.

Our communication takeaway is that the above dependence on r is unavoidable under one-round protocols. The lower bound Ω(rlog (1/ε)) bits per client per episode formalizes a constraint that is often implicit in practice: if personalization genuinely requires r independent degrees of freedom, then any method that achieves oracle-level excess risk must receive Ω(r) worth of information from each client (up to logarithmic factors and precision conventions). This does not dictate a unique message format, but it does delimit what is possible. In particular, it rules out the hope that clever server-side processing can systematically compensate for arbitrarily small client uploads when the client-specific component is intrinsically r-dimensional.

The framing also clarifies how prompts p and low-rank adapters B should be interpreted. Both serve to reduce the effective complexity of the client’s best response map A(S; ⋅): B restricts adaptation to an r-dimensional subspace, while p can be viewed as a shared prior or conditioning variable that reshapes the embedding-to-logit map before adaptation. From the perspective of the theorems, these ingredients are valuable precisely when they reduce the variance term (by improving conditioning or alignment) without inducing a large approximation penalty. The one-round constraint makes this tradeoff explicit: since the client cannot iteratively refine its state through interaction, any misalignment must be compensated by either larger K or larger r, both of which have direct resource consequences.

We therefore regard the principal contribution of the one-round FS-FL viewpoint as a disciplined accounting identity: to obtain low query risk, one must pay either in shots K, in adaptation dimension r, in server rounds R, or in communicated precision—and these payments can be compared on a common axis of on-device compute and bandwidth. The upper and lower bounds together provide a reference envelope against which practical systems can be positioned: if a method claims improvements, one can ask whether it effectively decreases σ_i², decreases r, improves the constants via better weighting and conditioning, or relaxes the one-round restriction. This is the sense in which the theory ``changes the framing’’: it turns personalization under communication constraints from an implementation detail into a measurable, optimizable object.