Offline-to-online reinforcement learning begins with an initialization produced from a fixed dataset and ends with a policy deployed under the true dynamics of the environment. The central difficulty is that the offline objective is optimized under the sampling distribution induced by the dataset, whereas online interaction is governed by the state–action visitation of the deployed policy. Even when the offline policy π0 is strong on the support of 𝒟, online fine-tuning can be destabilizing because exploratory updates move the policy into regions where value estimation is unreliable, the critic is weakly constrained by data, and the resulting actor updates may be driven by extrapolation error. Conversely, suppressing all change (e.g., by severe regularization toward π0) can preserve performance but prevent improvement when π0 is suboptimal for the true MDP. We therefore treat offline-to-online transfer as a constrained optimization problem in which we must negotiate performance preservation and improvement simultaneously.
We formalize this negotiation as a stability–plasticity trade-off along the deployed policy sequence {πt}. Stability demands that the return of the deployed policy does not fall materially below an offline baseline that we can justify from 𝒟. Plasticity demands that the learning procedure can depart from the offline solution sufficiently to exploit new online evidence and surpass that baseline. In the offline-to-online setting, the trade-off is acute because instability is incurred in real interaction: a single catastrophic deployment can dominate discounted return, violate safety constraints, or incur irreversible costs. Meanwhile, plasticity is often needed precisely in the regime where the offline dataset is incomplete or biased, so that π0 is not the correct answer to the online task.
A common heuristic response to poor fine-tuning behavior is to ``reset and relearn,'' either by reinitializing large parts of the network or by training a new policy from scratch using online interaction. This is a coherent strategy when the online budget is large and when performance during learning is not itself constrained. However, for modern modular policies, full resets are structurally misaligned with the problem constraints. First, a full reset typically destroys the pretrained representation learned from 𝒟; if the encoder θE already contains task-relevant features, then reinitializing θE forces the online learner to pay an avoidable sample complexity proportional to the full parameter dimension D rather than the head dimension d. Second, a full reset tends to induce an immediate return collapse if the reset policy is ever deployed before it re-achieves competence; in sparse-reward or brittle-control MDPs, even one episode of near-random behavior can imply an Ω(1) drop from any meaningful offline baseline. Third, from an engineering viewpoint, 2026-scale policies are explicitly modular (encoders, heads, adapters, auxiliary branches), and the hypothesis that all modules must be relearned online is usually unwarranted: the distribution shift from π𝒟 to the deployed online policy is substantial, but it does not follow that the representation itself must be relearned in order to recover or improve return.
These observations suggest that the relevant design degree of freedom is not whether to reset, but what to reset and when to deploy the result. We advocate a partial-reset perspective: we keep the encoder fixed (or nearly fixed) and reinitialize only a small submodule (typically the actor and/or critic heads, or lightweight adapters), thereby injecting optimization mobility while preserving the pretrained representation. This choice is justified when the optimal online policy lies in the hypothesis class obtained by varying only the head parameters, i.e., when representational realizability holds for the frozen encoder. In that case, the online learning problem reduces to estimating a low-dimensional parameter under fixed features, and the relevant statistical rates scale with d, not D. Partial reset is thus a controlled form of plasticity: it permits the learner to escape a suboptimal basin induced by offline pretraining without paying the cost of re-learning features that are already correct.
Partial reset alone does not resolve the deployment risk, because even a head-only reset may produce a temporarily incompetent policy while it is being re-optimized online. We therefore combine partial reset with safe deployment gating: the reset learner is trained ``in shadow'' using mixed replay, and we switch the deployed policy only when a high-probability lower bound L̂(π) certifies that the candidate meets a pre-specified floor derived from offline baselines. The logical effect is that instability is bounded by construction at deployment time, while plasticity is recovered through the reset-induced mobility in the learner. Our central claim is that, under standard assumptions making L̂ valid and making head optimization statistically efficient, this combination dominates both extremes: it strictly improves stability relative to immediate full reset, and it strictly improves plasticity (and sample complexity) relative to no reset or to encoder reset. The improvement is most pronounced in the Inferior regime, where J(π0) underperforms the dataset knowledge level J(π𝒟): here, naive fine-tuning can be both unstable and unproductive, whereas partial reset with gated deployment can preserve the best offline behavior while permitting reliable online improvement.
We work in a discounted Markov decision process (MDP) ℳ = (𝒮, 𝒜, P, r, γ) with bounded rewards r(s, a) ∈ [0, 1] and discount γ ∈ (0, 1). For any (possibly stochastic) policy π, we write the discounted return as
$$
J(\pi) \;=\; \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t)\Big], \qquad a_t \sim \pi(\cdot \mid s_t),\quad s_{t+1} \sim P(\cdot \mid s_t, a_t),
$$
where the expectation includes the initial state distribution (suppressed for brevity). Our focus is the offline-to-online setting: we are given a fixed offline dataset 𝒟 of trajectories or transitions collected by an unknown mixture of behavior policies. We denote by π𝒟 an abstract policy whose induced visitation is representative of 𝒟; this symbol is not meant to be operationally known, but it is convenient for separating (i) what is achievable using only data support and (ii) what is achievable after online interaction.
From 𝒟 we obtain an
offline-pretrained actor–critic initialization (π0, Q0)
with parameters θ0.
We will compare online performance not only to J(π0), but also
to a dataset-derived baseline. Concretely, we define a dataset knowledge
level J(π𝒟) as the
mean return of trajectories contained in 𝒟 (or an analogous estimate when only
transitions are available). We then set the best offline baseline
Joff* := max (J(π0), J(π𝒟)),
which captures the strongest behavior we can justify before online
interaction. This choice is conservative in the sense that it does not
presuppose that π0
is necessarily better than the behavior embedded in the dataset.
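To make the baseline concrete, here is a minimal sketch of how Joff* can be computed when 𝒟 is stored as a list of per-trajectory reward sequences; the function names and data layout are illustrative assumptions, not part of the formal setup.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted return of one trajectory: sum_t gamma^t * r_t."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.dot(gamma ** np.arange(len(rewards)), rewards))

def best_offline_baseline(dataset_rewards, j_pi0, gamma=0.99):
    """J_off^* = max(J(pi_0), J(pi_D)), where J(pi_D) is estimated as the
    mean discounted return of the trajectories stored in the dataset."""
    j_pi_d = float(np.mean([discounted_return(r, gamma) for r in dataset_rewards]))
    return max(j_pi0, j_pi_d), j_pi_d
```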
During online fine-tuning we are allowed N environment steps (or, equivalently, a bounded number of episodes). The learner may update parameters off-policy using replay, possibly mixing offline and online samples. Since performance during learning matters, we distinguish between (i) the parameters being optimized and (ii) the parameters that actually generate interaction. The deployed policies form a sequence {πt}t = 0N, where πt is the policy used at online step t (or over a short block of steps between checkpoints). Our guarantees and metrics are stated in terms of the returns J(πt) along this deployed sequence.
Given a reference level l ∈ ℝ, we quantify stability by the
signed violation
$$
\mathrm{Stability}(l) \;:=\; \min\Big(\min_{0 \le t \le N} J(\pi_t) \,-\, l,\ 0\Big).
$$
Thus Stability(l) = 0
indicates that the deployed sequence never falls below l, whereas Stability(l) < 0 measures the
worst drop below l. In our
setting the relevant floor is l = Joff* − ε,
where ε ≥ 0 is a user-chosen
slack capturing tolerable degradation relative to the best offline
baseline.
Plasticity is intended to capture the capacity to improve during
fine-tuning. We record the range of achieved deployed performance,
$$
\mathrm{Plasticity} \;:=\; \max_{0 \le t \le N} J(\pi_t) \;-\; \min_{0 \le t \le N} J(\pi_t),
$$
and we will also consider the improvement over the offline baseline,
maxtJ(πt) − Joff*,
as an objective-level summary. When stability is enforced by design (so
the minimum is controlled), these quantities are closely aligned with
the ability to surpass Joff* within
the interaction budget.
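Given the logged returns of the deployed sequence {πt}, the stability and plasticity summaries reduce to a few lines; a small sketch with illustrative names:

```python
import numpy as np

def stability(deployed_returns, floor):
    """Signed violation: 0 if the deployed sequence never falls below the
    floor, otherwise the (negative) worst drop below it."""
    j = np.asarray(deployed_returns, dtype=float)
    return float(min(j.min() - floor, 0.0))

def plasticity(deployed_returns):
    """Range of achieved deployed performance along {pi_t}."""
    j = np.asarray(deployed_returns, dtype=float)
    return float(j.max() - j.min())

def improvement_over_baseline(deployed_returns, j_off_star):
    """Objective-level summary: max_t J(pi_t) - J_off^*."""
    return float(np.max(deployed_returns) - j_off_star)
```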
Because offline initialization quality varies substantially across
tasks and datasets, it is useful to stratify instances by comparing
J(π0) and
J(π𝒟).
Fixing a tolerance τ ≥ 0 to
ignore statistical noise, we define:
Superior: J(π0) ≥ J(π𝒟) + τ;
Comparable: |J(π0) − J(π𝒟)| ≤ τ;
Inferior: J(π0) ≤ J(π𝒟) − τ.
The Inferior regime is the one in which naive fine-tuning is most
delicate: the initialization underperforms the dataset behavior, yet the
online procedure must still preserve at least Joff* while
searching for improvements.
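For experiment bookkeeping, the stratification is a one-line comparison once J(π0), J(π𝒟), and the tolerance τ have been estimated; a small illustrative helper:

```python
def offline_regime(j_pi0, j_pi_d, tau):
    """Classify an instance as Superior / Comparable / Inferior by comparing
    J(pi_0) and J(pi_D) up to the noise tolerance tau."""
    if j_pi0 >= j_pi_d + tau:
        return "superior"
    if j_pi0 <= j_pi_d - tau:
        return "inferior"
    return "comparable"
```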
We assume the policy is parametrized by θ = (θE, θH),
where θE
denotes an encoder or representation block (possibly shared by actor and
critic) and θH denotes a
task-specific head (actor head and/or critic head). Let D be the total parameter dimension
and d ≪ D the
effective dimension of the head submodule to be reset. For any subset
S ⊆ {E, H}
(or a finer module partition), we define a reset operator RS that
reinitializes θS while leaving
the complement fixed:
$$
R_S(\theta)\;=\;\big(\theta_{\overline{S}},\
\mathrm{Init}(\theta_S)\big),
$$
where Init(⋅) denotes the chosen random
or heuristic initialization. We write ℛ = {RS} for the
allowed family of such operators (e.g., head-only, adapters only,
last-layer only). Finally, since deployment decisions will rely on
performance certification, we assume access to a high-probability
lower-bound estimator L̂(π) for J(π), obtained via limited
rollouts or conservative off-policy evaluation; we will use L̂ only as a certificate, not as a
training signal.
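As an illustration of the reset operator RS, the sketch below reinitializes only named submodules of a modular PyTorch policy; the child-module names (encoder, actor_head, critic_head) and the choice of orthogonal initialization for Init(⋅) are our assumptions, not requirements of the formulation.

```python
import copy
import torch.nn as nn

def reset_modules(policy, subset):
    """Partial-reset operator R_S: reinitialize the submodules whose names are
    in `subset` (e.g., {"actor_head", "critic_head"}) and leave the rest fixed."""
    new_policy = copy.deepcopy(policy)          # do not mutate the pretrained parameters
    for name, module in new_policy.named_children():
        if name in subset:
            for layer in module.modules():
                if isinstance(layer, nn.Linear):
                    nn.init.orthogonal_(layer.weight)     # Init(.): any standard scheme
                    if layer.bias is not None:
                        nn.init.zeros_(layer.bias)
    return new_policy

# Head-only reset, S = {H}: the encoder is untouched.
# learner_policy = reset_modules(pretrained_policy, {"actor_head", "critic_head"})
```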
These definitions allow us to state the online fine-tuning task as a stability-constrained optimization problem over policy updates and reset choices, which we formalize next.
We formalize offline-to-online fine-tuning as a constrained control problem in which we may modify the initialization by applying a reset operator RS ∈ ℛ and then perform online updates under a strict stability requirement relative to the best offline baseline. The central tension is that resets can increase optimization mobility (plasticity) but may catastrophically reduce immediate performance unless deployment is handled cautiously.
Fix an online interaction budget of N environment steps. We consider any fine-tuning procedure that produces a (possibly piecewise-constant) deployed policy sequence {πt}t = 0N while training via off-policy updates on replay. The procedure may maintain auxiliary state such as target networks, a learner/deployed parameter split, and replay buffers containing offline and online transitions; these details are abstracted away in the formulation, except insofar as they affect the feasible set of policy sequences.
The procedure is additionally allowed to apply a reset operator RS ∈ ℛ to some subset of parameters at (or near) the beginning of online fine-tuning, thereby selecting an initialization in a restricted manner. We view the choice of S as part of the algorithmic decision.
Let the stability floor be
ℓ := Joff* − ε,
where ε ≥ 0 is user-specified
slack. The primary constraint is a high-probability lower bound on the
worst deployed performance:
$$
\Pr\Big[\min_{0 \le t \le N} J(\pi_t) \;\ge\; \ell\Big] \;\ge\; 1 - \delta.
$$
Equivalently, the probability of violating the floor during
fine-tuning is at most δ. This
constraint is intended to capture the operational requirement that
online fine-tuning should not underperform the best offline baseline by
more than ε at any point in
deployment.
In many settings J(πt) cannot be observed exactly at every t, hence we allow a variant of this constraint based on a lower-bound estimator L̂(π). Concretely, at a finite collection of deployment checkpoints 1 ≤ k ≤ M (with M determined by the procedure), the algorithm may compute L̂(π) using either limited rollouts or conservative off-policy evaluation, and the stability requirement is enforced by restricting policy deployments to those that pass the certificate L̂(π) ≥ ℓ. This yields a tractable way to satisfy the constraint without continuously estimating J(πt).
Subject to stability, we seek maximal online improvement. We consider
the objective
$$
\max_{0 \le t \le N} J(\pi_t),
$$
and, equivalently, the improvement-over-baseline objective maxt ≤ NJ(πt) − Joff*.
We emphasize that the objective is evaluated on the deployed sequence, since only deployed
performance is operationally relevant under the stability
constraint.
Because resets can induce transient underperformance even if eventual
recovery is possible, we also track a secondary desideratum: fast return
to the offline baseline after any reset-induced disruption. One
formalization is the recovery time
Trec := inf {t ∈ {0, …, N}: J(πt) ≥ Joff*},
with the convention Trec = ∞ if recovery does
not occur within budget. While not always optimized explicitly, Trec is a useful
diagnostic for comparing reset choices under the same stability
floor.
More generally, we may refine {E, H} into a module partition and allow S to range over a restricted collection (e.g., only critic-side modules), reflecting that stability risk differs across components.
Putting the pieces together, we can write the abstract design problem
as
$$
\max_{R_S \in \mathcal{R}}\ \ \max_{\{\pi_t\}\ \text{feasible under budget}\ N}\ \ \max_{0 \le t \le N} J(\pi_t)
\qquad \text{subject to} \qquad
\Pr\Big[\min_{0 \le t \le N} J(\pi_t) \;\ge\; J^{*}_{\mathrm{off}} - \varepsilon\Big] \;\ge\; 1 - \delta,
$$
where feasibility includes the algorithm’s permissible use of 𝒟, online replay, and any chosen
offline/online mixing schedule (e.g., ratio α). This formulation makes explicit
that reset selection is not merely an optimization trick but a
first-class decision coupled to a stability-constrained deployment
policy. It also motivates an algorithmic separation between (i) an
exploratory, possibly reset learner trained in the background and (ii) a
guarded deployment mechanism that only admits policies certified to meet
the stability floor.
We now instantiate the design principles of partial reset and deployment gating as an explicit procedure, SPaR (Safe Partial Reset). The algorithm maintains two parameter states: a deployed state θdep, which alone interacts with the environment, and a learner state θlrn, which may be aggressively modified (including by resets) and trained in the background using off-policy updates. The key separation is that resets are applied only to the learner, while the deployed policy is updated only through a certification gate.
We start from the offline-pretrained parameters θ0 = (θE, 0, θH, 0).
The deployed parameters are set to the offline-safe incumbent,
θdep ← θ0,
so the first deployed policy is πθ0.
Independently, we choose a reset subset S from an allowed family (e.g.,
critic head, actor head, adapters), and initialize the learner by
applying the corresponding reset operator:
θlrn ← RS(θ0).
In the canonical head-only case S = {H}, we keep θE fixed at
θE, 0 and
reinitialize θH; this injects
plasticity while preserving the offline representation.
SPaR keeps two replay sources: an offline buffer Boff containing 𝒟, and an online buffer Bon that accumulates transitions collected under πθdep. Updates sample mini-batches from the union of these sources using a mixing ratio α ∈ [0, 1]: e.g., each learner update draws an α-fraction from Boff and a (1 − α)-fraction from Bon. This mechanism serves two roles: (i) it regularizes learning toward the offline support to mitigate distribution shift, and (ii) it ensures that the learner can improve even when online data are initially sparse.
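A minimal sketch of the mixing rule; the buffer interface (indexable sequences of transitions) and the fallback to offline data while Bon is still nearly empty are our assumptions.

```python
import numpy as np

def sample_mixed_batch(buf_off, buf_on, batch_size, alpha, rng):
    """Draw an alpha-fraction of the minibatch from the offline buffer and the
    remainder from the online buffer, topping up from offline data if the
    online buffer does not yet contain enough transitions."""
    n_on = min(batch_size - int(round(alpha * batch_size)), len(buf_on))
    n_off = batch_size - n_on
    idx_off = rng.integers(0, len(buf_off), size=n_off)
    batch = [buf_off[i] for i in idx_off]
    if n_on > 0:
        idx_on = rng.integers(0, len(buf_on), size=n_on)
        batch += [buf_on[i] for i in idx_on]
    return batch

# rng = np.random.default_rng(0)
# batch = sample_mixed_batch(offline_buffer, online_buffer, 256, alpha=0.5, rng=rng)
```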
At each environment step, we collect a transition (s, a, r, s′) with a ∼ πθdep(⋅ ∣ s) (optionally with exploration noise) and append it to Bon. The learner performs U off-policy gradient steps per environment step (UTD), updating critic and actor parameters using any standard offline-to-online objective (e.g., TD/Bellman residual for Q and an actor objective). Importantly, SPaR does not constrain the learner to remain safe at intermediate times; safety is enforced only at the deployment interface.
Every K steps (or at a
chosen schedule), we compute a conservative performance certificate
L̂(πθlrn),
intended as a high-probability lower bound on J(πθlrn).
This may be obtained via (i) limited evaluation rollouts (if feasible)
with concentration bounds, or (ii) conservative off-policy evaluation
(OPE) on the mixture replay. Define the stability floor ℓ := Joff* − ε.
The gating rule is:
deploy
πθlrn only
if L̂(πθlrn) ≥ ℓ.
When the gate passes, we update the deployed parameters. The simplest
choice is a hard switch θdep ← θlrn.
In practice and in some analyses, it is also natural to use a
conservative interpolation,
θdep ← Mix (θdep, θlrn),
where Mix may be Polyak averaging or a
trust-region step; this reduces oscillations without changing the
certification logic (the deployed policy is either unchanged or replaced
by a policy whose performance has been certified).
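The deployment interface can be written compactly; the sketch below assumes a rollout-based Hoeffding certificate (one concrete choice; conservative OPE is an alternative) and an optional caller-supplied Mix. If mixing is used, the evaluation returns must come from the post-mix candidate so that the certificate applies to what is actually deployed.

```python
import numpy as np

def hoeffding_lower_bound(eval_returns, g_max, delta_prime):
    """One-sided high-probability lower bound on J(pi) from m evaluation rollouts."""
    g = np.asarray(eval_returns, dtype=float)
    return float(g.mean() - g_max * np.sqrt(np.log(1.0 / delta_prime) / (2 * len(g))))

def gate_and_deploy(theta_dep, theta_lrn, eval_returns, floor, g_max, delta_prime, mix=None):
    """Deployment gate: replace the incumbent only if the candidate's certified
    lower bound clears the stability floor; otherwise keep deploying theta_dep."""
    if hoeffding_lower_bound(eval_returns, g_max, delta_prime) < floor:
        return theta_dep, False                  # gate fails: incumbent unchanged
    if mix is None:
        return theta_lrn, True                   # hard switch
    return mix(theta_dep, theta_lrn), True       # e.g., Polyak averaging (certify the post-mix policy)
```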
With these invariants, the subsequent theory reduces stability of the entire deployed sequence to correctness of the lower-bound estimator and the gating protocol, while plasticity and sample complexity depend primarily on the dimension of the reset submodule.
We formalize the sense in which deployment gating converts a per-checkpoint performance certificate into a high-probability stability guarantee for the deployed policies. The key point is that SPaR constrains the interface between learner and deployed parameters: the learner may be arbitrarily unsafe while training, but the deployed parameters change only when we can certify that the candidate policy exceeds a fixed stability floor.
Fix a stability floor ℓ := Joff* − ε.
We assume access to an estimator L̂(π) satisfying a one-sided
validity property: for any policy π queried at a deployment
checkpoint,
$$
\Pr\big[\widehat{L}(\pi) \;\le\; J(\pi)\big] \;\ge\; 1 - \delta'.
$$
The content of this one-sided validity property is that L̂ is
conservative with high probability; it may underestimate J(π), but (except with
probability δ′) it
does not overestimate it.
This property can be realized in multiple ways. If we can afford
m on-policy evaluation
rollouts of π, then for
bounded per-episode return G ∈ [0, Gmax] we
can take
$$
\widehat{L}(\pi)\ :=\ \frac{1}{m}\sum_{i=1}^m G_i\ -\
G_{\max}\sqrt{\frac{\log(1/\delta')}{2m}},
$$
which satisfies the validity property by Hoeffding's inequality. Alternatively, L̂ may be produced by conservative
off-policy evaluation on the replay mixture (e.g.,
concentration-corrected importance sampling, doubly robust estimators
with pessimism, or other certified OPE schemes); our stability argument
uses only this one-sided validity property and is agnostic to the particular construction.
Let πkdep
denote the deployed policy after the k-th checkpoint decision (so π0dep = π0),
and let πkcand
denote the learner policy evaluated at that checkpoint. The gating rule
is:
if
L̂(πkcand) ≥ ℓ, then
deploy πkcand; else keep
πkdep = πk − 1dep.
If we deploy a mixed policy Mix(πk − 1dep, πkcand)
rather than πkcand
directly, we require the certificate to be computed for the deployed
candidate (i.e., the post-mix policy), so that the validity guarantee applies to what is
deployed.
Suppose the one-sided validity property holds for every checkpoint query, and suppose there are at
most M checkpoints over the
N online interaction steps.
Then, under the gating rule above, with probability at least 1 − Mδ′,
$$
\min_{0 \le t \le N} J\big(\pi_t^{\mathrm{deploy}}\big) \;\ge\; \ell.
$$
Equivalently, the stability violation event {mint ≤ NJ(πtdeploy) < ℓ}
occurs with probability at most Mδ′. In
particular, choosing δ′ = δ/M
yields Pr [mint ≤ NJ(πtdeploy) < ℓ] ≤ δ.
We argue inductively over checkpoints. At checkpoint k, either (i) the gate fails and πkdep = πk − 1dep, so the deployed return is unchanged, or (ii) the gate passes and we deploy a candidate policy π̃k (either πkcand or a post-mix variant) satisfying L̂(π̃k) ≥ ℓ. On the event that the validity property holds for π̃k, we have J(π̃k) ≥ L̂(π̃k) ≥ ℓ, hence the newly deployed policy is safe. Therefore, the only way to deploy an unsafe policy at checkpoint k is for the lower-bound validity event to fail at that checkpoint. A union bound over at most M checkpoints gives probability at most Mδ′ of any such failure, which implies the stated bound. Since the deployed policy is constant between checkpoints, mint ≤ NJ(πtdeploy) is attained at (or equals) a checkpoint policy value, and the bound extends to all t ≤ N.
The theorem isolates the stability mechanism: sequence-level safety reduces to the calibration of L̂ and the discipline of gating. Importantly, this guarantee does not require the learner to be stable during training, nor does it require monotone improvement. All instability risk is confined to the certification procedure; correspondingly, the evaluation budget m (for rollouts) or the degree of conservatism in OPE directly controls δ′ and thus the overall failure probability via the Mδ′ factor. This separation is what allows SPaR to inject plasticity through resets while maintaining a high-probability floor relative to the best offline baseline Joff*.
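The budget accounting implied by the theorem is a short calculation: with a checkpoint every K steps we have M ≈ N/K, the union bound suggests δ′ = δ/M, and the Hoeffding penalty then determines how many rollouts m per checkpoint keep the certificate within a chosen margin of the floor. A sketch under these assumptions:

```python
import math

def certification_budget(n_steps, k_interval, delta, g_max, margin):
    """Split the failure probability across M checkpoints (delta' = delta / M) and
    size the per-checkpoint rollout count m so that the Hoeffding penalty
    g_max * sqrt(log(1/delta') / (2m)) does not exceed `margin`."""
    m_checkpoints = max(1, n_steps // k_interval)
    delta_prime = delta / m_checkpoints
    m_rollouts = math.ceil(g_max ** 2 * math.log(1.0 / delta_prime) / (2 * margin ** 2))
    return m_checkpoints, delta_prime, m_rollouts

# Example: N = 100_000 steps, checkpoints every K = 5_000 steps, delta = 0.05,
# G_max = 100, margin = 10  ->  M = 20, delta' = 0.0025, m = 300 rollouts per checkpoint.
```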
We now quantify the statistical benefit of partial reset in a setting
where freezing the encoder makes the online learning problem essentially
linear. Throughout, we regard the encoder parameters as inducing a fixed
feature map
$$
\phi_{\theta_E}(s, a) \in \mathbb{R}^{d}, \qquad \big\|\phi_{\theta_E}(s, a)\big\|_{2} \le 1,
$$
and we analyze learning when θE = θE, 0
is held fixed and only a head parameter (critic head, actor head, or
both) is trained from a reset initialization. This isolates the effect
of ``plasticity injection’’ as an increase in optimization mobility
within a low-dimensional hypothesis class, rather than as a wholesale
change of representation.
Assume the MDP is realizable with the frozen encoder in the sense that there
exists w* ∈ ℝd
such that the optimal action-value function is linear:
$$
Q^{*}(s, a) \;=\; \big\langle \phi_{\theta_{E,0}}(s, a),\, w^{*} \big\rangle \qquad \text{for all } (s, a) \in \mathcal{S} \times \mathcal{A}.
$$
This is the standard linear MDP or linear value-function approximation
model; in either case, once ϕθE, 0
is fixed, the critic-learning problem reduces to estimating w* from
temporal-difference (TD) targets. The head-reset operation RH simply
reinitializes w (and any small
actor head parameters), without altering ϕ. Consequently, the learning
dynamics and concentration are governed by dimension d, not by the full network size
D.
Let Σ denote the feature
covariance under the sampling distribution induced by the replay mixture
(offline data 𝒟 and online data
collected during fine-tuning):
Σ := 𝔼[ϕ(s, a)ϕ(s, a)⊤],
where the expectation is taken over the (time-averaged) state–action
marginal of the update batches. We assume λmin(Σ) ≥ λ > 0,
which may be ensured by mild online exploration together with coverage
already present in 𝒟. This condition is
the linear analogue of requiring that the head parameters are
identifiable from data; it is also the point where the offline-to-online
mixture ratio α matters, since
too small an α may reduce
coverage early, while too large an α may slow adaptation to novel
online regions.
Consider fitted Q-iteration or least-squares TD (possibly with target
networks) applied to mixed replay. Standard self-normalized
concentration bounds for linear regression/TD yield, after n effectively independent samples,
an estimation guarantee of the form
$$
\big\|w_n - w^{*}\big\|_{\Sigma} \;\lesssim\; \sqrt{\frac{d\,\log(1/\delta)}{n}},
$$
up to problem-dependent constants and logarithmic factors. Translating
this bound into a uniform value-function error introduces the factor λ−1/2:
$$
\sup_{s,a}\bigl|Q_{w_n}(s,a)-Q^*(s,a)\bigr|\ \lesssim\
\lambda^{-1/2}\sqrt{\frac{d\log(1/\delta)}{n}}.
$$
In discounted control, converting value-function error to return
suboptimality incurs additional factors of (1 − γ)−1 via standard
performance-difference or approximate dynamic programming arguments.
Aggregating these effects yields a head-only sample complexity scaling
as
$$
n_{\mathrm{head}} \;=\; \widetilde{O}\!\left(\frac{d}{\lambda\,\varepsilon^{2}}\cdot \mathrm{poly}\big((1-\gamma)^{-1}\big)\right)
$$
for an ε-suboptimal deployed policy, where Õ(⋅) suppresses
polylogarithmic terms. The salient point is that the dependence is
linear in the head dimension d
and independent of the encoder dimension D.
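For concreteness, the head-only critic fit reduces to a d-dimensional regression on TD targets once the features are frozen; the sketch below uses a ridge-regularized fitted-Q/LSTD-style step, and the array shapes and greedy-next-feature convention are our assumptions.

```python
import numpy as np

def lstd_head_update(phi_sa, rewards, phi_next, w_target, gamma, ridge=1e-3):
    """One fitted-Q / LSTD-style update of the critic head with frozen features:
    regress TD targets r + gamma * <phi(s', a'), w_target> onto phi(s, a).
    phi_sa, phi_next: (n, d) feature matrices; only the d-dimensional head moves."""
    targets = rewards + gamma * (phi_next @ w_target)
    a_mat = phi_sa.T @ phi_sa + ridge * np.eye(phi_sa.shape[1])   # d x d, independent of D
    return np.linalg.solve(a_mat, phi_sa.T @ targets)

def feature_coverage(phi_sa):
    """lambda_min of the empirical feature covariance Sigma under the replay mixture."""
    sigma = (phi_sa.T @ phi_sa) / len(phi_sa)
    return float(np.linalg.eigvalsh(sigma).min())
```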
In the fixed-feature regime, resetting the head does not change the approximation class {Qw : w ∈ ℝd}; it only changes the starting point of optimization. Thus, the statistical rate is unaffected by the reset, while the optimization trajectory may improve: a fresh head can rapidly move toward a different greedy policy without being trapped near the offline-pretrained head, yet it cannot ``forget’’ the representation encoded by θE, 0. In SPaR, this plasticity is exploited in the shadow learner; the deployed policy is updated only after the candidate is certified safe, so the stability mechanism remains orthogonal to the head-learning rate.
If we instead reset (and train) the encoder, then the learning problem includes representation identification. Even in stylized cases where the encoder is a linear map from raw inputs in ℝD to features ϕθE(s, a) ∈ ℝd, the unknown representation effectively introduces D degrees of freedom that must be inferred from data. Information-theoretic lower bounds for linear bandits and linear prediction imply that, without prior knowledge pinning down the representation, achieving ε-accurate value estimates requires at least Ω(D/ε2) samples in worst-case instances. When D ≫ d, this separates head-only adaptation from encoder relearning: under realizability with the frozen encoder, the encoder-reset learner pays a dimension-dependent price that is unnecessary for control performance.
Taken together, the head-only rate above and the encoder-relearning lower bound formalize the intended benefit of partial reset: when a competent representation is already available offline, we can obtain online improvement with sample complexity controlled by the small head dimension d, while reserving encoder updates (and the associated D-scaling cost) for regimes where realizability with θE, 0 genuinely fails.
We complement the head-only upper bounds by exhibiting instances in which (i) full reset with immediate deployment necessarily violates stability by a constant margin, and (ii) encoder reset provably incurs a dimension-dependent sample complexity penalty relative to head-only adaptation. These two phenomena formalize the sense in which partial reset yields a strict improvement in the stability–plasticity trade-off when the offline representation is already competent.
The key point is that stability is a property of the deployed sequence {πt}, not merely of the final policy. If an algorithm fully resets at time 0 and then uses the reset policy to interact before any certification step, then in some environments the deployed policy will, with high probability, take catastrophic actions at least once, forcing mintJ(πt) below any nontrivial offline baseline.
We sketch a canonical construction. Consider an episodic sparse-reward MDP (or a continuing MDP with episodic resets) with a ``narrow corridor'': a unique action a⋆ at the initial state avoids transition to an absorbing failure state with reward 0, while any other action transitions to failure. Let the offline data 𝒟 contain trajectories generated by a behavior mixture that selects a⋆ with constant probability, implying J(π𝒟) ≥ c0, and suppose π0 is at least as good as π𝒟, so that Joff* ≥ c0. A fully reset policy, before learning, behaves essentially randomly at the decision point; hence it selects a⋆ with probability bounded away from 1, causing an episode with near-zero return with constant probability. Since the definition of stability depends on the minimum return along the deployed sequence, a single such episode enforces an Ω(1) drop. The conclusion does not rely on our specific algorithm: it is an information-free obstruction to ``reset-and-deploy'' when rewards encode safety-critical constraints.
We next isolate a statistical separation: when the pretrained encoder already induces a realizable feature map, any procedure that insists on relearning the encoder pays the full representation dimension D, while head-only adaptation depends only on d. The phenomenon can be formalized already in a contextual bandit (a one-step MDP), where value learning reduces to linear prediction.
The proof reduces to classical lower bounds for linear bandits / linear
regression. We construct contexts x ∈ ℝD and
rewards r = ⟨β⋆, x⟩ + ξ
with sub-Gaussian noise, where identifying β⋆ to accuracy ε requires Ω(D/ε2)
samples. We then embed this into an MDP in which the encoder implements
a linear map producing the correct low-dimensional sufficient statistic
ϕ ∈ ℝd. If
the learner preserves θE, 0, it solves
a d-dimensional problem; if it
resets θE,
it effectively reintroduces the D-dimensional identification burden.
Crucially, we may choose the instance so that realizability with the frozen encoder holds and d ≪ D; thus the lower bound
reflects avoidable statistical work induced by encoder reset.
Taken together, these results yield a strict separation between three regimes: (i) full reset with immediate deployment suffers an unavoidable stability loss on some problems; (ii) encoder reset may preserve stability but can require Ω(D/ε2) samples to recover and improve; (iii) partial (head-only) reset with gated deployment preserves stability while enabling improvement at a rate controlled by d. In particular, in the regime J(π0) < J(π𝒟), naive fine-tuning without additional plasticity can be slow or stuck, whereas full reset is unsafe to deploy, and encoder reset is statistically expensive; partial reset therefore occupies a region of the stability–plasticity plane that is unattainable by these baselines in the worst case. This motivates an experimental evaluation that reports not only final performance but also stability-floor violations and time-to-recover as first-class metrics.
Our experimental goal is to test, in the strongest form compatible with finite interaction budgets, the claim suggested by the preceding separations: partial reset can increase online plasticity without paying either (i) a path-wise stability violation (as in reset-and-deploy) or (ii) a representation-dimension sample complexity cost (as in encoder reset). We therefore evaluate algorithms as deployed sequences {πt}t = 0N rather than only by their final performance. Concretely, each method is run for a fixed interaction budget N, with periodic checkpoints every K steps at which we compute a conservative performance certificate L̂(π) (via a small rollout budget m when allowed, or via conservative OPE when rollouts are restricted). Methods that support gating deploy a stable incumbent policy between checkpoints and switch deployment only when the learner passes the floor Joff* − ε; methods without gating deploy their continually updated parameters.
We select a benchmark suite spanning (i) continuous-control locomotion, (ii) sparse-reward or long-horizon control, and (iii) domains with sharp failure modes, since the stability definition is most meaningful when sub-baseline behavior is qualitatively undesirable. For each domain we use standard offline datasets 𝒟 (with mixed-quality behavior) and compute J(π𝒟) as the mean return of trajectories in 𝒟 (or the best available estimate thereof), thereby defining Joff* = max (J(π0), J(π𝒟)). We emphasize the regime J(π0) < J(π𝒟) by constructing initializations π0 that are plausibly obtained in practice yet underperform the dataset: e.g., (a) offline pretraining with an overly conservative objective (excessive penalty or pessimism), (b) partial distribution shift between 𝒟 and the online environment, or (c) controlled corruption of the policy head while retaining the pretrained encoder θE, 0. This regime isolates the setting in which additional plasticity is needed to improve beyond the best offline baseline, but naive aggressive updates risk transient collapse.
We compare a family of reset operators RS that isolate where plasticity is injected: no reset (naive fine-tuning), head-only reset, adapter/LoRA reset, encoder reset, and full reset. All methods are trained with a common off-policy backbone and identical replay settings, mixing offline and online samples with a specified ratio α, to ensure that observed differences can be attributed to reset location and deployment logic rather than optimizer idiosyncrasies.
We report three primary metrics aligned with our definitions.
First, stability is summarized by the empirical frequency and magnitude of floor
violations relative to Joff* − ε,
including the realized minimum return mint ≤ NJ(πt)
(estimated from evaluation episodes) and its gap to the floor. Second,
we measure recovery time: after applying a reset (or after training begins, for
methods without an explicit reset event), we record the smallest t such that the deployed policy
satisfies J(πt) ≥ Joff* − ε;
for gated methods this corresponds to the first safe switch time, while
for ungated methods it measures how long the algorithm spends below
baseline. Third, we report plasticity via maxt ≤ NJ(πt) − mint ≤ NJ(πt)
and the improvement maxt ≤ NJ(πt) − Joff*.
We aggregate these into a Pareto-style view by plotting achieved
improvement against worst-case stability loss, thereby making explicit
when gains are purchased by unacceptable dips.
To isolate the role of each design choice, we ablate along two axes that mirror the algorithmic invariants. (i) No gating: we remove the certification-and-switch rule and deploy the learner continuously, keeping the reset choice fixed; this tests whether observed stability is due to the reset itself or due to cautious deployment. (ii) No offline replay: we set α = 0 to train only on online data after the reset, testing the extent to which the offline buffer acts as an anchor preventing drift below Joff*. Additional diagnostics include sensitivity to ε and m (evaluation budget), the effect of resetting progressively larger head submodules (controlling the effective dimension d), and calibration of L̂, assessed by checking the empirical coverage Pr [L̂(π) ≤ J(π)]. Collectively, these choices ensure that our conclusions are expressed in the stability–plasticity language rather than solely as end performance, thereby enabling a direct empirical counterpart to the worst-case phenomena established above.
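The calibration diagnostic for L̂ is itself a one-line empirical coverage check over checkpoints; in the sketch below, the reference returns would in practice be Monte Carlo estimates from held-out evaluation episodes.

```python
import numpy as np

def certificate_coverage(l_hat_values, true_returns):
    """Fraction of checkpoints at which the certified lower bound did not
    overestimate the (separately estimated) return, i.e. Pr[L_hat <= J]."""
    l_hat = np.asarray(l_hat_values, dtype=float)
    j = np.asarray(true_returns, dtype=float)
    return float(np.mean(l_hat <= j))
```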
A growing line of work studies how to combine an offline dataset 𝒟 with a limited online interaction budget to obtain rapid improvement while avoiding catastrophic degradation. Representative approaches include fine-tuning with offline replay and conservative regularization, e.g., by penalizing value overestimation or constraining the learned policy toward the data distribution. Recent algorithms explicitly target the offline-to-online setting by interleaving offline replay with online data collection, often with a large update-to-data ratio, as in RLPD and related pipelines that treat offline data as an anchor during early online learning. Calibrated conservative objectives such as Cal-QL refine value pessimism to improve online fine-tuning stability. Methods such as ReBRAC revisit behavior-regularized actor–critic fine-tuning with improved empirical robustness. Our emphasis differs in two respects: (i) we formalize evaluation in terms of the deployed sequence {πt} under a stability floor tied to Joff*, rather than only final performance; and (ii) we treat reset location (head vs. encoder) as a first-class design variable controlling the effective dimension that must be relearned online. This framing is complementary to conservative objectives: SPaR can be instantiated with the same underlying losses while changing only the parameter-reset and deployment logic.
The requirement that online fine-tuning not underperform a baseline connects to safe policy improvement (SPI), where one seeks performance guarantees relative to a reference policy using offline data and conservative estimation. Many SPI methods impose explicit constraints (e.g., trust regions, policy constraints around π𝒟, or uncertainty-aware pessimism) to guarantee monotonic improvement under modeling assumptions. Our deployment-gating mechanism is closer in spirit to high-confidence policy improvement: we maintain a stable incumbent for data collection and only deploy a new candidate when a high-probability lower bound L̂(π) exceeds a specified floor. The use of L̂ allows us to express stability directly as a path-wise property of {πt}, separating (a) learning dynamics in a shadow learner from (b) what is actually deployed. This separation is natural in practical systems where evaluation rollouts or conservative OPE are available at checkpoints, and it parallels conservative selection procedures in bandits and RL that rely on confidence bounds for action/policy choice.
The practical viability of gating depends on constructing a lower bound L̂(π) with meaningful coverage under distribution shift. Classical OPE estimators include importance sampling and its variants, doubly robust estimators, and model-based approaches . In long-horizon continuous control, direct importance weighting is often brittle, motivating fitted Q evaluation (FQE), marginalized importance sampling, and pessimistic value learning . Conservative OPE methods can be interpreted as producing lower confidence bounds on J(π) by combining function approximation with uncertainty penalties . Our use of L̂ is intentionally modular: SPaR requires only a calibrated lower bound at checkpoints, obtained either from a small rollout budget m or from a conservative OPE procedure consistent with available data and compute constraints. The theoretical stability statement then reduces to a union bound over checkpoints, decoupling estimation from control.
Resetting parameters is a standard technique in continual learning and non-stationary optimization, where partial reinitialization can restore plasticity after convergence or mitigate interference . In deep RL, resets have been used to address value-function pathologies (e.g., critic drift) and to escape suboptimal basins, sometimes via optimizer restarts or periodic target-network refreshes. Our setting differs in that the environment is fixed, but the learner transitions from an offline-pretrained initialization to online adaptation under a stability constraint tied to Joff*. Here, reset is not merely a training heuristic: it explicitly trades off statistical efficiency and optimization mobility through the dimension of the reset submodule. This perspective aligns with classical results where learning rates depend on the number of unknown parameters, and it motivates distinguishing head-only resets (small effective dimension d) from encoder resets (large dimension D).
Empirically, offline pretraining can induce primacy effects: early-learned representations and action preferences constrain subsequent learning, especially under limited online data. In deep networks, related phenomena include dormant or inactive neurons, saturation, and feature collapse, which reduce gradient signal and impede adaptation . In actor–critic methods, critic miscalibration can also bias policy updates, leading to conservative or unstable behavior when transitioning online. Partial reset targets these mechanisms by injecting plasticity where optimization is most constrained (typically in the head), while preserving pretrained features that are still informative. Our assumptions formalize this decoupling as realizability with frozen encoder θE = θE, 0, thereby isolating the regime where representation reuse is statistically beneficial.
Partial reset is also conceptually related to parameter-efficient fine-tuning, where one freezes a backbone and trains small adapters or low-rank updates (e.g., LoRA) . While these methods are most developed in supervised and language-model settings, the underlying motivation is shared: restrict the trainable subspace to control sample complexity and preserve pretrained knowledge. Our adapter/LoRA reset variant imports this idea into offline-to-online RL and evaluates it through stability–plasticity metrics under deployment gating. In contrast to standard adapter training, we emphasize the constraint and the need to certify safety relative to Joff* during the adaptation process.
Our formulation treats RS as an explicit control knob, yet in practice the most effective subset may be task- and dataset-dependent. A direct approach is to view ℛ = {RS} as a finite hypothesis class and perform reset-operator selection: for each candidate S, we run a shadow learner initialized by RS(θ0), and we deploy the best candidate whose certified lower bound exceeds the floor Joff* − ε. This converts reset selection into a resource-allocation problem: we must decide how to spend the online budget N and evaluation budget m across candidates while maintaining Invariant I1. Since exact subset selection is combinatorial (cf. the knapsack-style hardness intuition), we anticipate that practical systems will rely on restricted families (e.g., {H}, critic-head only, actor-head only, adapters only) or structured choices (e.g., reset the last k layers) that admit efficient search. A promising direction is to couple gating with bandit-style allocation: treat each S as an arm with reward proxy L̂(πSlrn) and cost measured in interaction steps, and allocate data adaptively subject to the constraint that only certified policies are deployed. Even without a full theory, this perspective suggests concrete heuristics, illustrated in the sketch below: begin with head-only resets (small d), escalate to larger resets only if certified progress stalls, and amortize certification by reusing shared rollouts across candidates.
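One simple instantiation of this allocation idea, written as a round-robin sketch rather than a full bandit algorithm; eval_and_train is a hypothetical callback that advances the shadow learner for candidate S by one block of training and returns its current certified lower bound.

```python
import numpy as np

def select_reset_candidate(candidates, eval_and_train, floor, rounds):
    """Round-robin allocation over reset subsets S: train each shadow learner in
    turn, keep its best certified lower bound, and return the best candidate that
    clears the stability floor (None if no candidate is ever certified)."""
    best = {s: -np.inf for s in candidates}
    for _ in range(rounds):
        for s in candidates:
            best[s] = max(best[s], eval_and_train(s))
    certified = {s: v for s, v in best.items() if v >= floor}
    return max(certified, key=certified.get) if certified else None

# Example: candidates could be ("actor_head", "critic_head", "both_heads", "adapters"),
# ordered from smallest to largest effective dimension d.
```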
Our guarantees are stated for a single discounted MDP ℳ. Many deployment settings violate this assumption, either because the online environment differs from the offline data-generating process, or because the task changes gradually over time. The partial-reset view remains relevant, but the stability floor must be reinterpreted. One option is to define a time-indexed baseline Joff*(t) derived from a moving window of recent performance (or from a library of offline policies), and to impose Pr [mint ≤ NJt(πtdeploy) < Joff*(t) − ε] ≤ δ for an appropriate notion of Jt. Another is to cast the problem as transfer learning: offline data provides representations θE that are broadly useful, while resets determine how rapidly the agent can adapt its head to a shifted reward or dynamics model. In such settings, it is natural to allow repeated or scheduled resets (e.g., upon drift detection in the value residuals), together with a gating rule that uses either conservative OPE under shift or a small number of online rollouts to re-establish a certified floor. Establishing end-to-end stability under nonstationarity likely requires new assumptions (e.g., bounded total variation drift, or slowly-varying optimal policies), but we expect the dimension-based separation (relearning d head parameters versus D encoder parameters) to persist as the main statistical lever.
We have measured knowledge by the return J(π), which is canonical but incomplete. In safety-critical or risk-sensitive applications, we may prefer constraints on tail risk (e.g., CVaR), constraint violation probabilities, or worst-case return under a disturbance set. These objectives interact nontrivially with gating: a lower bound L̂(π) on mean return does not imply a lower bound on risk-sensitive performance, and vice versa. A direct extension is to replace J with a vector of criteria and require certification of a feasible region, or to gate on a conservative bound for a coherent risk measure. Separately, return may fail to capture goal coverage and behavioral diversity in multi-goal settings; here, a more faithful knowledge measure could be the set of goals achieved above threshold, or the entropy of visited states subject to safety constraints. Finally, when π is a conditional policy (e.g., language-conditioned), it may be appropriate to certify per-context performance, yielding a family of bounds L̂(π; c) indexed by context c, and to gate deployment only on the subset of contexts for which certification is available.
The encoder–head decomposition is especially natural for foundation-model agents: a large pretrained backbone (vision, language, or multimodal) serves as θE, while task-specific control and value heads (or lightweight adapters) comprise θH. In this regime D ≫ d, so the dimension-based lower bounds provide a concrete justification for parameter-efficient online adaptation. Moreover, deployment gating aligns with how such agents are used in practice: one may maintain a stable, pretrained ``incumbent'' policy for user-facing interaction, while training candidate adapters in the background and switching only when certified. Two technical challenges become central. First, certification must scale: computing L̂ for large, partially observed systems may require compositional OPE (e.g., decomposing long-horizon interaction into skill-level segments) or conservative model-based evaluation. Second, the action space may be structured (tools, programs, or natural language); then resets may target not only numeric heads but also components such as tool-selection logits, memory modules, or planning temperature parameters. We view SPaR as a template for these agents: the reset operator determines where plasticity is injected, and gating determines when a candidate is exposed through deployment.