Offline-to-online reinforcement learning begins with an initialization produced from a fixed dataset and ends with a policy deployed under the true dynamics of the environment. The central difficulty is that the offline objective is optimized under the sampling distribution induced by the dataset, whereas online interaction is governed by the state–action visitation of the deployed policy. Even when the offline policy π0 is strong on the support of 𝒟, online fine-tuning can be destabilizing because exploratory updates move the policy into regions where value estimation is unreliable, the critic is weakly constrained by data, and the resulting actor updates may be driven by extrapolation error. Conversely, suppressing all change (e.g., by severe regularization toward π0) can preserve performance but prevent improvement when π0 is suboptimal for the true MDP. We therefore treat offline-to-online transfer as a constrained optimization problem in which we must negotiate performance preservation and improvement simultaneously.
We formalize this negotiation as a stability–plasticity trade-off along the deployed policy sequence {πt}. Stability demands that the return of the deployed policy does not fall materially below an offline baseline that we can justify from 𝒟. Plasticity demands that the learning procedure can depart from the offline solution sufficiently to exploit new online evidence and surpass that baseline. In the offline-to-online setting, the trade-off is acute because instability is incurred in real interaction: a single catastrophic deployment can dominate discounted return, violate safety constraints, or incur irreversible costs. Meanwhile, plasticity is often needed precisely in the regime where the offline dataset is incomplete or biased, so that π0 is not the correct answer to the online task.
A common heuristic response to poor fine-tuning behavior is to ``reset and relearn,'' either by reinitializing large parts of the network or by training a new policy from scratch using online interaction. This is a coherent strategy when the online budget is large and when performance during learning is not itself constrained. However, for modern modular policies, full resets are structurally misaligned with the problem constraints. First, a full reset typically destroys the pretrained representation learned from 𝒟; if the encoder θE already contains task-relevant features, then reinitializing θE forces the online learner to pay an avoidable sample complexity proportional to the full parameter dimension D rather than the head dimension d. Second, a full reset tends to induce an immediate return collapse if the reset policy is ever deployed before it re-achieves competence; in sparse-reward or brittle-control MDPs, even one episode of near-random behavior can imply an Ω(1) drop from any meaningful offline baseline. Third, from an engineering viewpoint, 2026-scale policies are explicitly modular (encoders, heads, adapters, auxiliary branches), and the hypothesis that all modules must be relearned online is usually unwarranted: the distribution shift from π𝒟 to the deployed online policy is substantial, but it does not follow that the representation itself must be relearned in order to recover or improve return.
These observations suggest that the relevant design degree of freedom is not whether to reset, but what to reset and when to deploy the result. We advocate a partial-reset perspective: we keep the encoder fixed (or nearly fixed) and reinitialize only a small submodule (typically the actor and/or critic heads, or lightweight adapters), thereby injecting optimization mobility while preserving the pretrained representation. This choice is justified when the optimal online policy lies in the hypothesis class obtained by varying only the head parameters, i.e., when representational realizability holds for the frozen encoder. In that case, the online learning problem reduces to estimating a low-dimensional parameter under fixed features, and the relevant statistical rates scale with d, not D. Partial reset is thus a controlled form of plasticity: it permits the learner to escape a suboptimal basin induced by offline pretraining without paying the cost of re-learning features that are already correct.
Partial reset alone does not resolve the deployment risk, because even a head-only reset may produce a temporarily incompetent policy while it is being re-optimized online. We therefore combine partial reset with safe deployment gating: the reset learner is trained ``in shadow'' using mixed replay, and we switch the deployed policy only when a high-probability lower bound L̂(π) certifies that the candidate meets a pre-specified floor derived from offline baselines. The logical effect is that instability is bounded by construction at deployment time, while plasticity is recovered through the reset-induced mobility in the learner. Our central claim is that, under standard assumptions making L̂ valid and making head optimization statistically efficient, this combination dominates both extremes: it strictly improves stability relative to immediate full reset, and it strictly improves plasticity (and sample complexity) relative to no reset or to encoder reset. The improvement is most pronounced in the Inferior regime, where J(π0) underperforms the dataset knowledge level J(π𝒟): here, naive fine-tuning can be both unstable and unproductive, whereas partial reset with gated deployment can preserve the best offline behavior while permitting reliable online improvement.
We work in a discounted Markov decision process (MDP) ℳ = (𝒮, 𝒜, P, r, γ) with bounded rewards r(s, a) ∈ [0, 1] and discount γ ∈ (0, 1). For any (possibly stochastic) policy π, we write the discounted return as
$$
J(\pi) \;=\; \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t)\Big], \qquad a_t \sim \pi(\cdot \mid s_t),\quad s_{t+1} \sim P(\cdot \mid s_t, a_t),
$$
where the expectation includes the initial state distribution (suppressed for brevity). Our focus is the offline-to-online setting: we are given a fixed offline dataset 𝒟 of trajectories or transitions collected by an unknown mixture of behavior policies. We denote by π𝒟 an abstract policy whose induced visitation is representative of 𝒟; this symbol is not meant to be operationally known, but it is convenient for separating (i) what is achievable using only data support and (ii) what is achievable after online interaction.
From 𝒟 we obtain an
offline-pretrained actor–critic initialization (π0, Q0)
with parameters θ0.
We will compare online performance not only to J(π0), but also
to a dataset-derived baseline. Concretely, we define a dataset knowledge
level J(π𝒟) as the
mean return of trajectories contained in 𝒟 (or an analogous estimate when only
transitions are available). We then set the best offline baseline
Joff* := max (J(π0), J(π𝒟)),
which captures the strongest behavior we can justify before online
interaction. This choice is conservative in the sense that it does not
presuppose that π0
is necessarily better than the behavior embedded in the dataset.
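To make the baseline concrete, here is a minimal sketch of how Joff* can be computed when 𝒟 is stored as a list of per-trajectory reward sequences; the function names and data layout are illustrative assumptions, not part of the formal setup.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted return of one trajectory: sum_t gamma^t * r_t."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.dot(gamma ** np.arange(len(rewards)), rewards))

def best_offline_baseline(dataset_rewards, j_pi0, gamma=0.99):
    """J_off^* = max(J(pi_0), J(pi_D)), where J(pi_D) is estimated as the
    mean discounted return of the trajectories stored in the dataset."""
    j_pi_d = float(np.mean([discounted_return(r, gamma) for r in dataset_rewards]))
    return max(j_pi0, j_pi_d), j_pi_d
```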
During online fine-tuning we are allowed N environment steps (or, equivalently, a bounded number of episodes). The learner may update parameters off-policy using replay, possibly mixing offline and online samples. Since performance during learning matters, we distinguish between (i) the parameters being optimized and (ii) the parameters that actually generate interaction. The deployed policies form a sequence {πt}t = 0N, where πt is the policy used at online step t (or over a short block of steps between checkpoints). Our guarantees and metrics are stated in terms of the returns J(πt) along this deployed sequence.
Given a reference level l ∈ ℝ, we quantify stability by the
signed violation
$$
\mathrm{Stability}(l) \;:=\; \min\Big(\min_{0 \le t \le N} J(\pi_t) \,-\, l,\ 0\Big).
$$
Thus Stability(l) = 0
indicates that the deployed sequence never falls below l, whereas Stability(l) < 0 measures the
worst drop below l. In our
setting the relevant floor is l = Joff* − ε,
where ε ≥ 0 is a user-chosen
slack capturing tolerable degradation relative to the best offline
baseline.
Plasticity is intended to capture the capacity to improve during
fine-tuning. We record the range of achieved deployed performance,
$$
\mathrm{Plasticity} \;:=\; \max_{0 \le t \le N} J(\pi_t) \;-\; \min_{0 \le t \le N} J(\pi_t),
$$
and we will also consider the improvement over the offline baseline,
maxtJ(πt) − Joff*,
as an objective-level summary. When stability is enforced by design (so
the minimum is controlled), these quantities are closely aligned with
the ability to surpass Joff* within
the interaction budget.
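Given the logged returns of the deployed sequence {πt}, the stability and plasticity summaries reduce to a few lines; a small sketch with illustrative names:

```python
import numpy as np

def stability(deployed_returns, floor):
    """Signed violation: 0 if the deployed sequence never falls below the
    floor, otherwise the (negative) worst drop below it."""
    j = np.asarray(deployed_returns, dtype=float)
    return float(min(j.min() - floor, 0.0))

def plasticity(deployed_returns):
    """Range of achieved deployed performance along {pi_t}."""
    j = np.asarray(deployed_returns, dtype=float)
    return float(j.max() - j.min())

def improvement_over_baseline(deployed_returns, j_off_star):
    """Objective-level summary: max_t J(pi_t) - J_off^*."""
    return float(np.max(deployed_returns) - j_off_star)
```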
Because offline initialization quality varies substantially across
tasks and datasets, it is useful to stratify instances by comparing
J(π0) and
J(π𝒟).
Fixing a tolerance τ ≥ 0 to
ignore statistical noise, we define:
Superior: J(π0) ≥ J(π𝒟) + τ;
Comparable: |J(π0) − J(π𝒟)| ≤ τ;
Inferior: J(π0) ≤ J(π𝒟) − τ.
The Inferior regime is the one in which naive fine-tuning is most
delicate: the initialization underperforms the dataset behavior, yet the
online procedure must still preserve at least Joff* while
searching for improvements.
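For experiment bookkeeping, the stratification is a one-line comparison once J(π0), J(π𝒟), and the tolerance τ have been estimated; a small illustrative helper:

```python
def offline_regime(j_pi0, j_pi_d, tau):
    """Classify an instance as Superior / Comparable / Inferior by comparing
    J(pi_0) and J(pi_D) up to the noise tolerance tau."""
    if j_pi0 >= j_pi_d + tau:
        return "superior"
    if j_pi0 <= j_pi_d - tau:
        return "inferior"
    return "comparable"
```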
We assume the policy is parametrized by θ = (θE, θH),
where θE
denotes an encoder or representation block (possibly shared by actor and
critic) and θH denotes a
task-specific head (actor head and/or critic head). Let D be the total parameter dimension
and d ≪ D the
effective dimension of the head submodule to be reset. For any subset
S ⊆ {E, H}
(or a finer module partition), we define a reset operator RS that
reinitializes θS while leaving
the complement fixed:
$$
R_S(\theta)\;=\;\big(\theta_{\overline{S}},\
\mathrm{Init}(\theta_S)\big),
$$
where Init(⋅) denotes the chosen random
or heuristic initialization. We write ℛ = {RS} for the
allowed family of such operators (e.g., head-only, adapters only,
last-layer only). Finally, since deployment decisions will rely on
performance certification, we assume access to a high-probability
lower-bound estimator L̂(π) for J(π), obtained via limited
rollouts or conservative off-policy evaluation; we will use L̂ only as a certificate, not as a
training signal.
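As an illustration of the reset operator RS, the sketch below reinitializes only named submodules of a modular PyTorch policy; the child-module names (encoder, actor_head, critic_head) and the choice of orthogonal initialization for Init(⋅) are our assumptions, not requirements of the formulation.

```python
import copy
import torch.nn as nn

def reset_modules(policy, subset):
    """Partial-reset operator R_S: reinitialize the submodules whose names are
    in `subset` (e.g., {"actor_head", "critic_head"}) and leave the rest fixed."""
    new_policy = copy.deepcopy(policy)          # do not mutate the pretrained parameters
    for name, module in new_policy.named_children():
        if name in subset:
            for layer in module.modules():
                if isinstance(layer, nn.Linear):
                    nn.init.orthogonal_(layer.weight)     # Init(.): any standard scheme
                    if layer.bias is not None:
                        nn.init.zeros_(layer.bias)
    return new_policy

# Head-only reset, S = {H}: the encoder is untouched.
# learner_policy = reset_modules(pretrained_policy, {"actor_head", "critic_head"})
```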
These definitions allow us to state the online fine-tuning task as a stability-constrained optimization problem over policy updates and reset choices, which we formalize next.
We formalize offline-to-online fine-tuning as a constrained control problem in which we may modify the initialization by applying a reset operator RS ∈ ℛ and then perform online updates under a strict stability requirement relative to the best offline baseline. The central tension is that resets can increase optimization mobility (plasticity) but may catastrophically reduce immediate performance unless deployment is handled cautiously.
Fix an online interaction budget of N environment steps. We consider any fine-tuning procedure that produces a (possibly piecewise-constant) deployed policy sequence {πt}t = 0N while training via off-policy updates on replay. The procedure may maintain auxiliary state such as target networks, a learner/deployed parameter split, and replay buffers containing offline and online transitions; these details are abstracted away in the formulation, except insofar as they affect the feasible set of policy sequences.
The procedure is additionally allowed to apply a reset operator RS ∈ ℛ to some subset of parameters at (or near) the beginning of online fine-tuning, thereby selecting an initialization in a restricted manner. We view the choice of S as part of the algorithmic decision.
Let the stability floor be
ℓ := Joff* − ε,
where ε ≥ 0 is user-specified
slack. The primary constraint is a high-probability lower bound on the
worst deployed performance:
$$
\Pr\Big[\min_{0 \le t \le N} J(\pi_t) \;\ge\; \ell\Big] \;\ge\; 1 - \delta.
$$
Equivalently, the probability of violating the floor during
fine-tuning is at most δ. This
constraint is intended to capture the operational requirement that
online fine-tuning should not underperform the best offline baseline by
more than ε at any point in
deployment.
In many settings J(πt) cannot be observed exactly at every t, hence we allow a variant of this constraint based on a lower-bound estimator L̂(π). Concretely, at a finite collection of deployment checkpoints 1 ≤ k ≤ M (with M determined by the procedure), the algorithm may compute L̂(π) using either limited rollouts or conservative off-policy evaluation, and the stability requirement is enforced by restricting policy deployments to those that pass the certificate L̂(π) ≥ ℓ. This yields a tractable way to satisfy the constraint without continuously estimating J(πt).
Subject to stability, we seek maximal online improvement. We consider
the objective
$$
\max_{0 \le t \le N} J(\pi_t),
$$
and, equivalently, the improvement-over-baseline objective maxt ≤ NJ(πt) − Joff*.
We emphasize that the objective is evaluated on the deployed sequence, since only deployed
performance is operationally relevant under the stability
constraint.
Because resets can induce transient underperformance even if eventual
recovery is possible, we also track a secondary desideratum: fast return
to the offline baseline after any reset-induced disruption. One
formalization is the recovery time
Trec := inf {t ∈ {0, …, N}: J(πt) ≥ Joff*},
with the convention Trec = ∞ if recovery does
not occur within budget. While not always optimized explicitly, Trec is a useful
diagnostic for comparing reset choices under the same stability
floor.
More generally, we may refine {E, H} into a module partition and allow S to range over a restricted collection (e.g., only critic-side modules), reflecting that stability risk differs across components.
Putting the pieces together, we can write the abstract design problem
as
$$
\max_{R_S \in \mathcal{R}}\ \ \max_{\{\pi_t\}\ \text{feasible under budget}\ N}\ \ \max_{0 \le t \le N} J(\pi_t)
\qquad \text{subject to} \qquad
\Pr\Big[\min_{0 \le t \le N} J(\pi_t) \;\ge\; J^{*}_{\mathrm{off}} - \varepsilon\Big] \;\ge\; 1 - \delta,
$$
where feasibility includes the algorithm’s permissible use of 𝒟, online replay, and any chosen
offline/online mixing schedule (e.g., ratio α). This formulation makes explicit
that reset selection is not merely an optimization trick but a
first-class decision coupled to a stability-constrained deployment
policy. It also motivates an algorithmic separation between (i) an
exploratory, possibly reset learner trained in the background and (ii) a
guarded deployment mechanism that only admits policies certified to meet
the stability floor.
We now instantiate the design principles of partial reset and deployment gating as an explicit procedure, SPaR (Safe Partial Reset). The algorithm maintains two parameter states: a deployed state θdep, which alone interacts with the environment, and a learner state θlrn, which may be aggressively modified (including by resets) and trained in the background using off-policy updates. The key separation is that resets are applied only to the learner, while the deployed policy is updated only through a certification gate.
We start from the offline-pretrained parameters θ0 = (θE, 0, θH, 0).
The deployed parameters are set to the offline-safe incumbent,
θdep ← θ0,
so the first deployed policy is πθ0.
Independently, we choose a reset subset S from an allowed family (e.g.,
critic head, actor head, adapters), and initialize the learner by
applying the corresponding reset operator:
θlrn ← RS(θ0).
In the canonical head-only case S = {H}, we keep θE fixed at
θE, 0 and
reinitialize θH; this injects
plasticity while preserving the offline representation.
SPaR keeps two replay sources: an offline buffer Boff containing 𝒟, and an online buffer Bon that accumulates transitions collected under πθdep. Updates sample mini-batches from the union of these sources using a mixing ratio α ∈ [0, 1]: e.g., each learner update draws an α-fraction from Boff and a (1 − α)-fraction from Bon. This mechanism serves two roles: (i) it regularizes learning toward the offline support to mitigate distribution shift, and (ii) it ensures that the learner can improve even when online data are initially sparse.
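A minimal sketch of the mixing rule; the buffer interface (indexable sequences of transitions) and the fallback to offline data while Bon is still nearly empty are our assumptions.

```python
import numpy as np

def sample_mixed_batch(buf_off, buf_on, batch_size, alpha, rng):
    """Draw an alpha-fraction of the minibatch from the offline buffer and the
    remainder from the online buffer, topping up from offline data if the
    online buffer does not yet contain enough transitions."""
    n_on = min(batch_size - int(round(alpha * batch_size)), len(buf_on))
    n_off = batch_size - n_on
    idx_off = rng.integers(0, len(buf_off), size=n_off)
    batch = [buf_off[i] for i in idx_off]
    if n_on > 0:
        idx_on = rng.integers(0, len(buf_on), size=n_on)
        batch += [buf_on[i] for i in idx_on]
    return batch

# rng = np.random.default_rng(0)
# batch = sample_mixed_batch(offline_buffer, online_buffer, 256, alpha=0.5, rng=rng)
```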
At each environment step, we collect a transition (s, a, r, s′) with a ∼ πθdep(⋅ ∣ s) (optionally with exploration noise) and append it to Bon. The learner performs U off-policy gradient steps per environment step (UTD), updating critic and actor parameters using any standard offline-to-online objective (e.g., TD/Bellman residual for Q and an actor objective). Importantly, SPaR does not constrain the learner to remain safe at intermediate times; safety is enforced only at the deployment interface.
Every K steps (or at a
chosen schedule), we compute a conservative performance certificate
L̂(πθlrn),
intended as a high-probability lower bound on J(πθlrn).
This may be obtained via (i) limited evaluation rollouts (if feasible)
with concentration bounds, or (ii) conservative off-policy evaluation
(OPE) on the mixture replay. Define the stability floor ℓ := Joff* − ε.
The gating rule is:
deploy
πθlrn only
if L̂(πθlrn) ≥ ℓ.
When the gate passes, we update the deployed parameters. The simplest
choice is a hard switch θdep ← θlrn.
In practice and in some analyses, it is also natural to use a
conservative interpolation,
θdep ← Mix (θdep, θlrn),
where Mix may be Polyak averaging or a
trust-region step; this reduces oscillations without changing the
certification logic (the deployed policy is either unchanged or replaced
by a policy whose performance has been certified).
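The deployment interface can be written compactly; the sketch below assumes a rollout-based Hoeffding certificate (one concrete choice; conservative OPE is an alternative) and an optional caller-supplied Mix. If mixing is used, the evaluation returns must come from the post-mix candidate so that the certificate applies to what is actually deployed.

```python
import numpy as np

def hoeffding_lower_bound(eval_returns, g_max, delta_prime):
    """One-sided high-probability lower bound on J(pi) from m evaluation rollouts."""
    g = np.asarray(eval_returns, dtype=float)
    return float(g.mean() - g_max * np.sqrt(np.log(1.0 / delta_prime) / (2 * len(g))))

def gate_and_deploy(theta_dep, theta_lrn, eval_returns, floor, g_max, delta_prime, mix=None):
    """Deployment gate: replace the incumbent only if the candidate's certified
    lower bound clears the stability floor; otherwise keep deploying theta_dep."""
    if hoeffding_lower_bound(eval_returns, g_max, delta_prime) < floor:
        return theta_dep, False                  # gate fails: incumbent unchanged
    if mix is None:
        return theta_lrn, True                   # hard switch
    return mix(theta_dep, theta_lrn), True       # e.g., Polyak averaging (certify the post-mix policy)
```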
With these invariants, the subsequent theory reduces stability of the entire deployed sequence to correctness of the lower-bound estimator and the gating protocol, while plasticity and sample complexity depend primarily on the dimension of the reset submodule.
We formalize the sense in which deployment gating converts a per-checkpoint performance certificate into a high-probability stability guarantee for the deployed policies. The key point is that SPaR constrains the interface between learner and deployed parameters: the learner may be arbitrarily unsafe while training, but the deployed parameters change only when we can certify that the candidate policy exceeds a fixed stability floor.
Fix a stability floor ℓ := Joff* − ε.
We assume access to an estimator L̂(π) satisfying a one-sided
validity property: for any policy π queried at a deployment
checkpoint,
$$
\Pr\big[\widehat{L}(\pi) \;\le\; J(\pi)\big] \;\ge\; 1 - \delta'.
$$
The content of this one-sided validity property is that L̂ is
conservative with high probability; it may underestimate J(π), but (except with
probability δ′) it
does not overestimate it.
This property can be realized in multiple ways. If we can afford
m on-policy evaluation
rollouts of π, then for
bounded per-episode return G ∈ [0, Gmax] we
can take
$$
\widehat{L}(\pi)\ :=\ \frac{1}{m}\sum_{i=1}^m G_i\ -\
G_{\max}\sqrt{\frac{\log(1/\delta')}{2m}},
$$
which satisfies the validity property by Hoeffding's inequality. Alternatively, L̂ may be produced by conservative
off-policy evaluation on the replay mixture (e.g.,
concentration-corrected importance sampling, doubly robust estimators
with pessimism, or other certified OPE schemes); our stability argument
uses only this one-sided validity property and is agnostic to the particular construction.
Let πkdep
denote the deployed policy after the k-th checkpoint decision (so π0dep = π0),
and let πkcand
denote the learner policy evaluated at that checkpoint. The gating rule
is:
if
L̂(πkcand) ≥ ℓ, then
deploy πkcand; else keep
πkdep = πk − 1dep.
If we deploy a mixed policy Mix(πk − 1dep, πkcand)
rather than πkcand
directly, we require the certificate to be computed for the deployed
candidate (i.e., the post-mix policy), so that the validity guarantee applies to what is
deployed.
Suppose the one-sided validity property holds for every checkpoint query, and suppose there are at
most M checkpoints over the
N online interaction steps.
Then, under the gating rule above, with probability at least 1 − Mδ′,
$$
\min_{0 \le t \le N} J\big(\pi_t^{\mathrm{deploy}}\big) \;\ge\; \ell.
$$
Equivalently, the stability violation event {mint ≤ NJ(πtdeploy) < ℓ}
occurs with probability at most Mδ′. In
particular, choosing δ′ = δ/M
yields Pr [mint ≤ NJ(πtdeploy) < ℓ] ≤ δ.
We argue inductively over checkpoints. At checkpoint k, either (i) the gate fails and πkdep = πk − 1dep, so the deployed return is unchanged, or (ii) the gate passes and we deploy a candidate policy π̃k (either πkcand or a post-mix variant) satisfying L̂(π̃k) ≥ ℓ. On the event that the validity property holds for π̃k, we have J(π̃k) ≥ L̂(π̃k) ≥ ℓ, hence the newly deployed policy is safe. Therefore, the only way to deploy an unsafe policy at checkpoint k is for the lower-bound validity event to fail at that checkpoint. A union bound over at most M checkpoints gives probability at most Mδ′ of any such failure, which implies the stated bound. Since the deployed policy is constant between checkpoints, mint ≤ NJ(πtdeploy) is attained at (or equals) a checkpoint policy value, and the bound extends to all t ≤ N.
The theorem isolates the stability mechanism: sequence-level safety reduces to the calibration of L̂ and the discipline of gating. Importantly, this guarantee does not require the learner to be stable during training, nor does it require monotone improvement. All instability risk is confined to the certification procedure; correspondingly, the evaluation budget m (for rollouts) or the degree of conservatism in OPE directly controls δ′ and thus the overall failure probability via the Mδ′ factor. This separation is what allows SPaR to inject plasticity through resets while maintaining a high-probability floor relative to the best offline baseline Joff*.
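The budget accounting implied by the theorem is a short calculation: with a checkpoint every K steps we have M ≈ N/K, the union bound suggests δ′ = δ/M, and the Hoeffding penalty then determines how many rollouts m per checkpoint keep the certificate within a chosen margin of the floor. A sketch under these assumptions:

```python
import math

def certification_budget(n_steps, k_interval, delta, g_max, margin):
    """Split the failure probability across M checkpoints (delta' = delta / M) and
    size the per-checkpoint rollout count m so that the Hoeffding penalty
    g_max * sqrt(log(1/delta') / (2m)) does not exceed `margin`."""
    m_checkpoints = max(1, n_steps // k_interval)
    delta_prime = delta / m_checkpoints
    m_rollouts = math.ceil(g_max ** 2 * math.log(1.0 / delta_prime) / (2 * margin ** 2))
    return m_checkpoints, delta_prime, m_rollouts

# Example: N = 100_000 steps, checkpoints every K = 5_000 steps, delta = 0.05,
# G_max = 100, margin = 10  ->  M = 20, delta' = 0.0025, m = 300 rollouts per checkpoint.
```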
We now quantify the statistical benefit of partial reset in a setting
where freezing the encoder makes the online learning problem essentially
linear. Throughout, we regard the encoder parameters as inducing a fixed
feature map
$$
\phi_{\theta_E}(s, a) \in \mathbb{R}^{d}, \qquad \big\|\phi_{\theta_E}(s, a)\big\|_{2} \le 1,
$$
and we analyze learning when θE = θE, 0
is held fixed and only a head parameter (critic head, actor head, or
both) is trained from a reset initialization. This isolates the effect
of ``plasticity injection’’ as an increase in optimization mobility
within a low-dimensional hypothesis class, rather than as a wholesale
change of representation.
Assume the MDP is realizable with the frozen encoder in the sense that there
exists w* ∈ ℝd
such that the optimal action-value function is linear:
$$
Q^{*}(s, a) \;=\; \big\langle \phi_{\theta_{E,0}}(s, a),\, w^{*} \big\rangle \qquad \text{for all } (s, a) \in \mathcal{S} \times \mathcal{A}.
$$
This is the standard linear MDP or linear value-function approximation
model; in either case, once ϕθE, 0
is fixed, the critic-learning problem reduces to estimating w* from
temporal-difference (TD) targets. The head-reset operation RH simply
reinitializes w (and any small
actor head parameters), without altering ϕ. Consequently, the learning
dynamics and concentration are governed by dimension d, not by the full network size
D.
Let Σ denote the feature
covariance under the sampling distribution induced by the replay mixture
(offline data 𝒟 and online data
collected during fine-tuning):
Σ := 𝔼[ϕ(s, a)ϕ(s, a)⊤],
where the expectation is taken over the (time-averaged) state–action
marginal of the update batches. We assume λmin(Σ) ≥ λ > 0,
which may be ensured by mild online exploration together with coverage
already present in 𝒟. This condition is
the linear analogue of requiring that the head parameters are
identifiable from data; it is also the point where the offline-to-online
mixture ratio α matters, since
too small an α may reduce
coverage early, while too large an α may slow adaptation to novel
online regions.
Consider fitted Q-iteration or least-squares TD (possibly with target
networks) applied to mixed replay. Standard self-normalized
concentration bounds for linear regression/TD yield, after n effectively independent samples,
an estimation guarantee of the form
$$
\big\|w_n - w^{*}\big\|_{\Sigma} \;\lesssim\; \sqrt{\frac{d\,\log(1/\delta)}{n}},
$$
up to problem-dependent constants and logarithmic factors. Translating
this bound into a uniform value-function error introduces the factor λ−1/2:
$$
\sup_{s,a}\bigl|Q_{w_n}(s,a)-Q^*(s,a)\bigr|\ \lesssim\
\lambda^{-1/2}\sqrt{\frac{d\log(1/\delta)}{n}}.
$$
In discounted control, converting value-function error to return
suboptimality incurs additional factors of (1 − γ)−1 via standard
performance-difference or approximate dynamic programming arguments.
Aggregating these effects yields a head-only sample complexity scaling
as
$$
n_{\mathrm{head}} \;=\; \widetilde{O}\!\left(\frac{d}{\lambda\,\varepsilon^{2}}\cdot \mathrm{poly}\big((1-\gamma)^{-1}\big)\right)
$$
for an ε-suboptimal deployed policy, where Õ(⋅) suppresses
polylogarithmic terms. The salient point is that the dependence is
linear in the head dimension d
and independent of the encoder dimension D.
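For concreteness, the head-only critic fit reduces to a d-dimensional regression on TD targets once the features are frozen; the sketch below uses a ridge-regularized fitted-Q/LSTD-style step, and the array shapes and greedy-next-feature convention are our assumptions.

```python
import numpy as np

def lstd_head_update(phi_sa, rewards, phi_next, w_target, gamma, ridge=1e-3):
    """One fitted-Q / LSTD-style update of the critic head with frozen features:
    regress TD targets r + gamma * <phi(s', a'), w_target> onto phi(s, a).
    phi_sa, phi_next: (n, d) feature matrices; only the d-dimensional head moves."""
    targets = rewards + gamma * (phi_next @ w_target)
    a_mat = phi_sa.T @ phi_sa + ridge * np.eye(phi_sa.shape[1])   # d x d, independent of D
    return np.linalg.solve(a_mat, phi_sa.T @ targets)

def feature_coverage(phi_sa):
    """lambda_min of the empirical feature covariance Sigma under the replay mixture."""
    sigma = (phi_sa.T @ phi_sa) / len(phi_sa)
    return float(np.linalg.eigvalsh(sigma).min())
```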
In the fixed-feature regime, resetting the head does not change the approximation class {Qw : w ∈ ℝd}; it only changes the starting point of optimization. Thus, the statistical rate is unaffected by the reset, while the optimization trajectory may improve: a fresh head can rapidly move toward a different greedy policy without being trapped near the offline-pretrained head, yet it cannot ``forget’’ the representation encoded by θE, 0. In SPaR, this plasticity is exploited in the shadow learner; the deployed policy is updated only after the candidate is certified safe, so the stability mechanism remains orthogonal to the head-learning rate.
If we instead reset (and train) the encoder, then the learning problem includes representation identification. Even in stylized cases where the encoder is a linear map from raw inputs in ℝD to features ϕθE(s, a) ∈ ℝd, the unknown representation effectively introduces D degrees of freedom that must be inferred from data. Information-theoretic lower bounds for linear bandits and linear prediction imply that, without prior knowledge pinning down the representation, achieving ε-accurate value estimates requires at least Ω(D/ε2) samples in worst-case instances. When D ≫ d, this separates head-only adaptation from encoder relearning: under realizability with the frozen encoder, the encoder-reset learner pays a dimension-dependent price that is unnecessary for control performance.
Taken together, the head-only rate above and the encoder-relearning lower bound formalize the intended benefit of partial reset: when a competent representation is already available offline, we can obtain online improvement with sample complexity controlled by the small head dimension d, while reserving encoder updates (and the associated D-scaling cost) for regimes where realizability with θE, 0 genuinely fails.
We complement the head-only upper bounds by exhibiting instances in which (i) full reset with immediate deployment necessarily violates stability by a constant margin, and (ii) encoder reset provably incurs a dimension-dependent sample complexity penalty relative to head-only adaptation. These two phenomena formalize the sense in which partial reset yields a strict improvement in the stability–plasticity trade-off when the offline representation is already competent.
The key point is that stability is a property of the deployed sequence {πt}, not merely of the final policy. If an algorithm fully resets at time 0 and then uses the reset policy to interact before any certification step, then in some environments the deployed policy will, with high probability, take catastrophic actions at least once, forcing mintJ(πt) below any nontrivial offline baseline.
We sketch a canonical construction. Consider an episodic sparse-reward MDP (or a continuing MDP with episodic resets) with a ``narrow corridor'': a unique action a⋆ at the initial state avoids transition to an absorbing failure state with reward 0, while any other action transitions to failure. Let the offline data 𝒟 contain trajectories generated by a behavior mixture that selects a⋆ with constant probability, implying J(π𝒟) ≥ c0, and suppose π0 is at least as good as π𝒟, so that Joff* ≥ c0. A fully reset policy, before learning, behaves essentially randomly at the decision point; hence it selects a⋆ with probability bounded away from 1, causing an episode with near-zero return with constant probability. Since the definition of stability depends on the minimum return along the deployed sequence, a single such episode enforces an Ω(1) drop. The conclusion does not rely on our specific algorithm: it is an information-free obstruction to ``reset-and-deploy'' when rewards encode safety-critical constraints.
We next isolate a statistical separation: when the pretrained encoder already induces a realizable feature map, any procedure that insists on relearning the encoder pays the full representation dimension D, while head-only adaptation depends only on d. The phenomenon can be formalized already in a contextual bandit (a one-step MDP), where value learning reduces to linear prediction.
The proof reduces to classical lower bounds for linear bandits / linear
regression. We construct contexts x ∈ ℝD and
rewards r = ⟨β⋆, x⟩ + ξ
with sub-Gaussian noise, where identifying β⋆ to accuracy ε requires Ω(D/ε2)
samples. We then embed this into an MDP in which the encoder implements
a linear map producing the correct low-dimensional sufficient statistic
ϕ ∈ ℝd. If
the learner preserves θE, 0, it solves
a d-dimensional problem; if it
resets θE,
it effectively reintroduces the D-dimensional identification burden.
Crucially, we may choose the instance so that realizability with the frozen encoder holds and d ≪ D; thus the lower bound
reflects avoidable statistical work induced by encoder reset.
Taken together, these results yield a strict separation between three regimes: (i) full reset with immediate deployment suffers an unavoidable stability loss on some problems; (ii) encoder reset may preserve stability but can require Ω(D/ε2) samples to recover and improve; (iii) partial (head-only) reset with gated deployment preserves stability while enabling improvement at a rate controlled by d. In particular, in the regime J(π0) < J(π𝒟), naive fine-tuning without additional plasticity can be slow or stuck, whereas full reset is unsafe to deploy, and encoder reset is statistically expensive; partial reset therefore occupies a region of the stability–plasticity plane that is unattainable by these baselines in the worst case. This motivates an experimental evaluation that reports not only final performance but also stability-floor violations and time-to-recover as first-class metrics.
Our experimental goal is to test, in the strongest form compatible with finite interaction budgets, the claim suggested by the preceding separations: partial reset can increase online plasticity without paying either (i) a path-wise stability violation (as in reset-and-deploy) or (ii) a representation-dimension sample complexity cost (as in encoder reset). We therefore evaluate algorithms as deployed sequences {πt}t = 0N rather than only by their final performance. Concretely, each method is run for a fixed interaction budget N, with periodic checkpoints every K steps at which we compute a conservative performance certificate L̂(π) (via a small rollout budget m when allowed, or via conservative OPE when rollouts are restricted). Methods that support gating deploy a stable incumbent policy between checkpoints and switch deployment only when the learner passes the floor Joff* − ε; methods without gating deploy their continually updated parameters.
We select a benchmark suite spanning (i) continuous-control locomotion, (ii) sparse-reward or long-horizon control, and (iii) domains with sharp failure modes, since the stability definition is most meaningful when sub-baseline behavior is qualitatively undesirable. For each domain we use standard offline datasets 𝒟 (with mixed-quality behavior) and compute J(π𝒟) as the mean return of trajectories in 𝒟 (or the best available estimate thereof), thereby defining Joff* = max (J(π0), J(π𝒟)). We emphasize the regime J(π0) < J(π𝒟) by constructing initializations π0 that are plausibly obtained in practice yet underperform the dataset: e.g., (a) offline pretraining with an overly conservative objective (excessive penalty or pessimism), (b) partial distribution shift between 𝒟 and the online environment, or (c) controlled corruption of the policy head while retaining the pretrained encoder θE, 0. This regime isolates the setting in which additional plasticity is needed to improve beyond the best offline baseline, but naive aggressive updates risk transient collapse.
We compare a family of reset operators RS that isolate where plasticity is injected: no reset (naive fine-tuning), head-only reset, adapter/LoRA reset, encoder reset, and full reset. All methods are trained with a common off-policy backbone and identical replay settings, mixing offline and online samples with a specified ratio α, to ensure that observed differences can be attributed to reset location and deployment logic rather than optimizer idiosyncrasies.
We report three primary metrics aligned with our definitions.
First, stability is summarized by the empirical frequency and magnitude of floor
violations relative to Joff* − ε,
including the realized minimum return mint ≤ NJ(πt)
(estimated from evaluation episodes) and its gap to the floor. Second,
we measure recovery time: after applying a reset (or after training begins, for
methods without an explicit reset event), we record the smallest t such that the deployed policy
satisfies J(πt) ≥ Joff* − ε;
for gated methods this corresponds to the first safe switch time, while
for ungated methods it measures how long the algorithm spends below
baseline. Third, we report plasticity via maxt ≤ NJ(πt) − mint ≤ NJ(πt)
and the improvement maxt ≤ NJ(πt) − Joff*.
We aggregate these into a Pareto-style view by plotting achieved
improvement against worst-case stability loss, thereby making explicit
when gains are purchased by unacceptable dips.
To isolate the role of each design choice, we ablate along two axes that mirror the algorithmic invariants. (i) No gating: we remove the certification-and-switch rule and deploy the learner continuously, keeping the reset choice fixed; this tests whether observed stability is due to the reset itself or due to cautious deployment. (ii) No offline replay: we set α = 0 to train only on online data after the reset, testing the extent to which the offline buffer acts as an anchor preventing drift below Joff*. Additional diagnostics include sensitivity to ε and m (evaluation budget), the effect of resetting progressively larger head submodules (controlling the effective dimension d), and calibration of L̂, assessed by checking the empirical coverage Pr [L̂(π) ≤ J(π)]. Collectively, these choices ensure that our conclusions are expressed in the stability–plasticity language rather than solely as end performance, thereby enabling a direct empirical counterpart to the worst-case phenomena established above.
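The calibration diagnostic for L̂ is itself a one-line empirical coverage check over checkpoints; in the sketch below, the reference returns would in practice be Monte Carlo estimates from held-out evaluation episodes.

```python
import numpy as np

def certificate_coverage(l_hat_values, true_returns):
    """Fraction of checkpoints at which the certified lower bound did not
    overestimate the (separately estimated) return, i.e. Pr[L_hat <= J]."""
    l_hat = np.asarray(l_hat_values, dtype=float)
    j = np.asarray(true_returns, dtype=float)
    return float(np.mean(l_hat <= j))
```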
A growing line of work studies how to combine an offline dataset 𝒟 with a limited online interaction budget to obtain rapid improvement while avoiding catastrophic degradation. Representative approaches include fine-tuning with offline replay and conservative regularization, e.g., by penalizing value overestimation or constraining the learned policy toward the data distribution. Recent algorithms explicitly target the offline-to-online setting by interleaving offline replay with online data collection, often with a large update-to-data ratio, as in RLPD and related pipelines that treat offline data as an anchor during early online learning. Calibrated conservative objectives such as Cal-QL refine value pessimism to improve online fine-tuning stability. Methods such as ReBRAC revisit behavior-regularized actor–critic fine-tuning with improved empirical robustness. Our emphasis differs in two respects: (i) we formalize evaluation in terms of the deployed sequence {πt} under a stability floor tied to Joff*, rather than only final performance; and (ii) we treat reset location (head vs. encoder) as a first-class design variable controlling the effective dimension that must be relearned online. This framing is complementary to conservative objectives: SPaR can be instantiated with the same underlying losses while changing only the parameter-reset and deployment logic.
The requirement that online fine-tuning not underperform a baseline connects to safe policy improvement (SPI), where one seeks performance guarantees relative to a reference policy using offline data and conservative estimation. Many SPI methods impose explicit constraints (e.g., trust regions, policy constraints around π𝒟, or uncertainty-aware pessimism) to guarantee monotonic improvement under modeling assumptions. Our deployment-gating mechanism is closer in spirit to high-confidence policy improvement: we maintain a stable incumbent for data collection and only deploy a new candidate when a high-probability lower bound L̂(π) exceeds a specified floor. The use of L̂ allows us to express stability directly as a path-wise property of {πt}, separating (a) learning dynamics in a shadow learner from (b) what is actually deployed. This separation is natural in practical systems where evaluation rollouts or conservative OPE are available at checkpoints, and it parallels conservative selection procedures in bandits and RL that rely on confidence bounds for action/policy choice.
The practical viability of gating depends on constructing a lower bound L̂(π) with meaningful coverage under distribution shift. Classical OPE estimators include importance sampling and its variants, doubly robust estimators, and model-based approaches . In long-horizon continuous control, direct importance weighting is often brittle, motivating fitted Q evaluation (FQE), marginalized importance sampling, and pessimistic value learning . Conservative OPE methods can be interpreted as producing lower confidence bounds on J(π) by combining function approximation with uncertainty penalties . Our use of L̂ is intentionally modular: SPaR requires only a calibrated lower bound at checkpoints, obtained either from a small rollout budget m or from a conservative OPE procedure consistent with available data and compute constraints. The theoretical stability statement then reduces to a union bound over checkpoints, decoupling estimation from control.
Resetting parameters is a standard technique in continual learning and non-stationary optimization, where partial reinitialization can restore plasticity after convergence or mitigate interference . In deep RL, resets have been used to address value-function pathologies (e.g., critic drift) and to escape suboptimal basins, sometimes via optimizer restarts or periodic target-network refreshes. Our setting differs in that the environment is fixed, but the learner transitions from an offline-pretrained initialization to online adaptation under a stability constraint tied to Joff*. Here, reset is not merely a training heuristic: it explicitly trades off statistical efficiency and optimization mobility through the dimension of the reset submodule. This perspective aligns with classical results where learning rates depend on the number of unknown parameters, and it motivates distinguishing head-only resets (small effective dimension d) from encoder resets (large dimension D).
Empirically, offline pretraining can induce primacy effects: early-learned representations and action preferences constrain subsequent learning, especially under limited online data. In deep networks, related phenomena include dormant or inactive neurons, saturation, and feature collapse, which reduce gradient signal and impede adaptation . In actor–critic methods, critic miscalibration can also bias policy updates, leading to conservative or unstable behavior when transitioning online. Partial reset targets these mechanisms by injecting plasticity where optimization is most constrained (typically in the head), while preserving pretrained features that are still informative. Our assumptions formalize this decoupling as realizability with frozen encoder θE = θE, 0, thereby isolating the regime where representation reuse is statistically beneficial.
Partial reset is also conceptually related to parameter-efficient fine-tuning, where one freezes a backbone and trains small adapters or low-rank updates (e.g., LoRA) . While these methods are most developed in supervised and language-model settings, the underlying motivation is shared: restrict the trainable subspace to control sample complexity and preserve pretrained knowledge. Our adapter/LoRA reset variant imports this idea into offline-to-online RL and evaluates it through stability–plasticity metrics under deployment gating. In contrast to standard adapter training, we emphasize the constraint and the need to certify safety relative to Joff* during the adaptation process.
Our formulation treats RS as an explicit control knob, yet in practice the most effective subset may be task- and dataset-dependent. A direct approach is to view ℛ = {RS} as a finite hypothesis class and perform reset-operator selection: for each candidate S, we run a shadow learner initialized by RS(θ0), and we deploy the best candidate whose certified lower bound exceeds the floor Joff* − ε. This converts reset selection into a resource-allocation problem: we must decide how to spend the online budget N and evaluation budget m across candidates while maintaining Invariant I1. Since exact subset selection is combinatorial (cf. the knapsack-style hardness intuition), we anticipate that practical systems will rely on restricted families (e.g., {H}, critic-head only, actor-head only, adapters only) or structured choices (e.g., reset the last k layers) that admit efficient search. A promising direction is to couple gating with bandit-style allocation: treat each S as an arm with reward proxy L̂(πSlrn) and cost measured in interaction steps, and allocate data adaptively subject to the constraint that only certified policies are deployed. Even without a full theory, this perspective suggests concrete heuristics, illustrated in the sketch below: begin with head-only resets (small d), escalate to larger resets only if certified progress stalls, and amortize certification by reusing shared rollouts across candidates.
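One simple instantiation of this allocation idea, written as a round-robin sketch rather than a full bandit algorithm; eval_and_train is a hypothetical callback that advances the shadow learner for candidate S by one block of training and returns its current certified lower bound.

```python
import numpy as np

def select_reset_candidate(candidates, eval_and_train, floor, rounds):
    """Round-robin allocation over reset subsets S: train each shadow learner in
    turn, keep its best certified lower bound, and return the best candidate that
    clears the stability floor (None if no candidate is ever certified)."""
    best = {s: -np.inf for s in candidates}
    for _ in range(rounds):
        for s in candidates:
            best[s] = max(best[s], eval_and_train(s))
    certified = {s: v for s, v in best.items() if v >= floor}
    return max(certified, key=certified.get) if certified else None

# Example: candidates could be ("actor_head", "critic_head", "both_heads", "adapters"),
# ordered from smallest to largest effective dimension d.
```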
Our guarantees are stated for a single discounted MDP ℳ. Many deployment settings violate this assumption, either because the online environment differs from the offline data-generating process, or because the task changes gradually over time. The partial-reset view remains relevant, but the stability floor must be reinterpreted. One option is to define a time-indexed baseline Joff*(t) derived from a moving window of recent performance (or from a library of offline policies), and to impose Pr [mint ≤ NJt(πtdeploy) < Joff*(t) − ε] ≤ δ for an appropriate notion of Jt. Another is to cast the problem as transfer learning: offline data provides representations θE that are broadly useful, while resets determine how rapidly the agent can adapt its head to a shifted reward or dynamics model. In such settings, it is natural to allow repeated or scheduled resets (e.g., upon drift detection in the value residuals), together with a gating rule that uses either conservative OPE under shift or a small number of online rollouts to re-establish a certified floor. Establishing end-to-end stability under nonstationarity likely requires new assumptions (e.g., bounded total variation drift, or slowly-varying optimal policies), but we expect the dimension-based separation (relearning d head parameters versus D encoder parameters) to persist as the main statistical lever.
We have measured knowledge by the return J(π), which is canonical but incomplete. In safety-critical or risk-sensitive applications, we may prefer constraints on tail risk (e.g., CVaR), constraint violation probabilities, or worst-case return under a disturbance set. These objectives interact nontrivially with gating: a lower bound L̂(π) on mean return does not imply a lower bound on risk-sensitive performance, and vice versa. A direct extension is to replace J with a vector of criteria and require certification of a feasible region, or to gate on a conservative bound for a coherent risk measure. Separately, return may fail to capture goal coverage and behavioral diversity in multi-goal settings; here, a more faithful knowledge measure could be the set of goals achieved above threshold, or the entropy of visited states subject to safety constraints. Finally, when π is a conditional policy (e.g., language-conditioned), it may be appropriate to certify per-context performance, yielding a family of bounds L̂(π; c) indexed by context c, and to gate deployment only on the subset of contexts for which certification is available.
The encoder–head decomposition is especially natural for foundation-model agents: a large pretrained backbone (vision, language, or multimodal) serves as θE, while task-specific control and value heads (or lightweight adapters) comprise θH. In this regime D ≫ d, so the dimension-based lower bounds provide a concrete justification for parameter-efficient online adaptation. Moreover, deployment gating aligns with how such agents are used in practice: one may maintain a stable, pretrained ``incumbent'' policy for user-facing interaction, while training candidate adapters in the background and switching only when certified. Two technical challenges become central. First, certification must scale: computing L̂ for large, partially observed systems may require compositional OPE (e.g., decomposing long-horizon interaction into skill-level segments) or conservative model-based evaluation. Second, the action space may be structured (tools, programs, or natural language); then resets may target not only numeric heads but also components such as tool-selection logits, memory modules, or planning temperature parameters. We view SPaR as a template for these agents: the reset operator determines where plasticity is injected, and gating determines when a candidate is exposed through deployment.