Model-based offline reinforcement learning aims to extract a
high-performing policy from a fixed batch of environment experience
while optionally leveraging a learned dynamics model to generate
additional synthetic rollouts.
The difficulty is that the optimization procedure is intrinsically subject to
distribution shift in at least two distinct senses.
First, the state–action distribution induced by the learned policy
typically differs from that of the unknown behavior policy that
generated the offline dataset; second, synthetic transitions produced by
the learned model need not match the true environment dynamics.
Both shifts can be beneficial when used prudently (e.g., for improved
exploration of the dataset support), yet they are also precisely the
mechanism by which offline RL methods can become arbitrarily
over-optimistic.
Our objective is therefore not merely to regularize shift, but to measure it,
to correct for it in a principled manner, and to account explicitly for the
residual uncertainty introduced by the correction.
A convenient organizing principle is to view distribution shift as a
multiplicative reweighting of trajectory probabilities.
If we denote by p the true
transition kernel and by m a
learned model used to generate synthetic transitions, then any method
that trains on a mixture of environment transitions and model
transitions implicitly mixes samples from different Markov chains.
Likewise, the policy we optimize, π, induces visitation frequencies
that generally differ from those of the behavior policy πb that produced
the offline dataset.
At a formal level, this suggests a decomposition of the relevant
Radon–Nikodym derivative into a model component and a policy component,
which, in log form, yields additive terms of the type
$$
\ell_p(s,a,s') \;=\; \log\frac{p(s'\mid s,a)}{m(s'\mid s,a)},
\qquad
\ell_\pi(s,a) \;=\; \log\frac{\pi(a\mid s)}{\pi_b(a\mid s)}.
$$
The Shift-Aware Reward (SAR) perspective proposes to incorporate such
log-ratio terms directly into reward shaping so that maximizing the
shaped return aligns with a pessimistic objective that accounts for the
mismatch between training and evaluation distributions.
In particular, when rewards are strictly positive and bounded, it is
natural to work with log r(s, a) (for
numerical stability and additivity across time), and to augment it with
weighted shift-correction terms.
This provides a single scalar learning signal that can be optimized by
standard off-policy actor–critic machinery, while still encoding a
precise notion of how far a transition or action is from the
distribution under which we have reliable evidence.
However, SAR is only as trustworthy as the density-ratio estimators
that supply ℓp and ℓπ.
In modern offline RL pipelines, these ratios are typically obtained via
binary classification: we train a discriminator to distinguish
environment transitions from model transitions (for ℓp) and to
distinguish actions sampled from the current policy from actions in the
offline dataset (for ℓπ).
Under standard equal-prior or known-prior reductions, the discriminator
logit is (up to an additive constant) the desired log-density
ratio.
Thus, in the idealized large-sample regime with perfect probability
estimates, the SAR correction is well-founded.
In the finite-sample, function-approximation regime, the bottleneck is
not merely classification error but miscalibration: the classifier may rank
examples correctly while still outputting probabilities whose numerical
values are systematically biased.
Since the logit map $u \mapsto \log
\frac{u}{1-u}$ amplifies errors near 0 and 1,
even modest miscalibration can produce large errors in the inferred
log-ratio, which then appear additively in the shaped reward and
accumulate over an effective horizon H ≍ (1 − γ)−1.
Consequently, without explicit control of calibration, the SAR
correction can be either overly aggressive (destroying useful learning
signal) or falsely reassuring (failing to enforce pessimism where it is
needed).
The contribution of this work is to replace heuristic
discriminator-based correction with a procedure that propagates
finite-sample uncertainty through the classification-to-logit-to-reward
pipeline.
Concretely, we treat each discriminator as a probabilistic predictor
that is subsequently calibrated (e.g., by temperature scaling and/or
conformal prediction), and we extract for each input an interval of
plausible probabilities with prescribed coverage.
By mapping this interval through the logit transform and then applying
explicit clipping, we obtain both (i) a point estimate of the log-ratio
used for learning and (ii) a certificate radius that upper bounds its
error with high probability on a specified target distribution
(typically a training mixture distribution).
We then enforce pessimism by subtracting these certificate radii inside
the shaped reward, yielding a Certified SAR objective whose optimism is
quantitatively controlled.
This addresses the core fragility: rather than hoping the classifier is
well-calibrated, we ensure that whatever uncertainty remains is
explicitly paid for in the learning signal.
The remainder of the paper proceeds as follows.
In the next section we fix notation and describe the offline–synthetic
data mixture setting, including the two distinct sources of shift (model
bias and policy shift) and the SAR reward form that motivates our
construction.
We also explain why the logit-based implementation of SAR is numerically
and statistically fragile, thereby motivating calibration, clipping, and
certification as essential components rather than optional
refinements.
Subsequent sections present the certified estimators, the certified
reward shaping rule, and the corresponding performance guarantees.
We consider an infinite-horizon discounted Markov decision process
(MDP) M = (𝒮, 𝒜, p, r, μ0, γ)
with discount factor γ ∈ (0, 1) and initial-state
distribution μ0.
At each time t, the agent
observes st ∈ 𝒮, selects
at ∼ π(⋅ ∣ st),
receives reward r(st, at),
and transitions to st + 1 ∼ p(⋅ ∣ st, at).
Throughout we assume rewards are strictly positive and bounded,
0 < rmin ≤ r(s, a) ≤ rmax < ∞,
so that log r(s, a) is
well-defined and uniformly bounded.
We write H for an effective
horizon, which in the discounted case scales as H ≍ (1 − γ)−1.
Our starting point is an offline dataset
Denv = {(s, a, r, s′)}
collected in the true environment by an unknown behavior policy πb.
We do not assume access to πb(a ∣ s)
nor to the transition density p(s′ ∣ s, a);
the only information available is through samples in Denv.
In addition, we fit a dynamics model m(⋅ ∣ s, a) from
Denv and use it as
a generative simulator to produce synthetic transitions.
Given a current policy π, we
may generate synthetic rollouts by initializing from states drawn from
Denv (or an
estimate of the state marginal), then iterating a ∼ π(⋅ ∣ s) and
s′ ∼ m(⋅ ∣ s, a).
The resulting synthetic buffer is denoted
Dm = {(s, a, r̃, s′)},
where r̃ may be the true reward
r(s, a) if
reward is known as a function of (s, a), or a modeled reward
otherwise.
For the purposes of this section, it suffices that both Denv and Dm provide
tuples of the form (s, a, r, s′)
and that Dm is
conditionally distributed according to m rather than p.
It is convenient to describe training as occurring under a mixture of
transition sources.
Fix a mixing fraction f ∈ (0, 1), and let dmix denote the induced
distribution over transitions when we sample from Denv with probability
f and from Dm with
probability 1 − f (with the
understanding that Dm itself
depends on π through the
rollout procedure).
We emphasize that dmix is a training distribution;
the evaluation distribution is instead the discounted occupancy induced by π in the true MDP.
Both shifts are multiplicative at the trajectory level.
To make this precise, consider a finite prefix τ0 : T = (s0, a0, …, sT).
Under a policy π and kernel
p, its probability density is
proportional to
$$
\mu_0(s_0)\prod_{t=0}^{T-1}\pi(a_t\mid s_t)\,p(s_{t+1}\mid s_t,a_t).
$$
If, instead, actions are compared against πb and
transitions against m, then
the trajectory likelihood ratio factorizes pointwise as
$$
\prod_{t=0}^{T-1}\frac{\pi(a_t\mid s_t)}{\pi_b(a_t\mid s_t)}\cdot
\frac{p(s_{t+1}\mid s_t,a_t)}{m(s_{t+1}\mid s_t,a_t)}.
$$
Taking logs yields an additive decomposition into per-step terms,
$$
\sum_{t=0}^{T-1}\ell_\pi(s_t,a_t) + \ell_p(s_t,a_t,s_{t+1}),
\qquad
\ell_\pi(s,a):=\log\frac{\pi(a\mid s)}{\pi_b(a\mid s)},\quad
\ell_p(s,a,s'):=\log\frac{p(s'\mid s,a)}{m(s'\mid s,a)}.
$$
This decomposition is the basic mechanism by which we can represent, and
potentially correct for, the mismatch between the distributions that
generate training samples and the distribution under which we evaluate a
learned policy.
The SAR viewpoint uses the preceding log-ratios as reward-shaping
terms.
Working with log r(s, a) (which
is bounded thanks to rmin, rmax)
is convenient because it converts multiplicative reweighting at the
trajectory level into additivity across time.
At an informal level, a shaped reward of the form
r̃(s, a, s′) = log r(s, a) + α ℓp(s, a, s′) + β ℓπ(s, a),
with weights α, β ≥ 0, encourages
policies whose returns remain high after accounting for both model
discrepancy and policy deviation from the dataset.
In practice, training uses a mixture of environment and synthetic
transitions; thus one may interpret α as modulating how aggressively
synthetic rollouts are corrected for model bias, and β as modulating how aggressively
policy improvement is corrected for deviation from the behavior
policy.
The concrete form of SAR used in a given algorithm depends on where the
correction is applied (only on Dm, only on
Denv, or on both),
but the central structural feature is that the correction enters
additively via ℓp and ℓπ.
The practical appeal of SAR is that ℓp and ℓπ can be
estimated by binary classification.
For instance, to estimate ℓp, we train a
transition classifier to distinguish samples (s, a, s′)
coming from Denv
(hence distributed according to p(⋅ ∣ s, a))
versus samples coming from Dm (distributed
according to m(⋅ ∣ s, a)).
Under standard equal-prior (or known-prior) reductions, the optimal
discriminator satisfies
$$
\logit\big(\mathbb P(\mathrm{env}\mid s,a,s')\big) \;=\;
\log\frac{p(s'\mid s,a)}{m(s'\mid s,a)} \;+\; c_p,
$$
where cp
is a constant determined by class priors.
An analogous action discriminator on (s, a) yields ℓπ(s, a)
up to an additive prior-dependent constant.
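To make this reduction concrete, the following minimal Python sketch trains an ``env vs. model'' transition discriminator and reads off a clipped log-ratio estimate from its logit; the classifier choice, feature layout, and clipping level are illustrative assumptions rather than the implementation analyzed later.

```python
# Minimal sketch (illustrative, not the paper's implementation): recover a
# clipped log-density-ratio estimate from a binary env-vs-model classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_transition_discriminator(env_sas, model_sas):
    """env_sas, model_sas: arrays of concatenated (s, a, s') features."""
    X = np.vstack([env_sas, model_sas])
    y = np.concatenate([np.ones(len(env_sas)), np.zeros(len(model_sas))])
    return LogisticRegression(max_iter=1000).fit(X, y)

def estimated_log_ratio(clf, x, n_env, n_model, clip_level=5.0):
    """Discriminator logit minus the class-prior offset c_p, then clipped.

    Under the known-prior reduction, logit P(env | x) equals
    log p(s'|s,a)/m(s'|s,a) + log(n_env/n_model)."""
    u = clf.predict_proba(np.atleast_2d(x))[:, 1]
    u = np.clip(u, 1e-6, 1.0 - 1e-6)            # avoid infinite logits
    logit = np.log(u) - np.log1p(-u)
    prior_offset = np.log(n_env) - np.log(n_model)
    return np.clip(logit - prior_offset, -clip_level, clip_level)
```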
This reduction highlights the primary numerical and statistical
fragility.
First, the map $u\mapsto
\logit(u)=\log\frac{u}{1-u}$ is steep near 0 and 1, so
a small absolute probability error can induce a large logit error.
Second, modern discriminators trained by cross-entropy may be accurate
in ranking while miscalibrated in probability, especially under
distribution shift between training and deployment; such miscalibration
is precisely what SAR cannot tolerate, since logits enter additively
into rewards and thus accumulate over horizon.
Third, limited data in tail regions (which are exactly the regions where
offline RL is most vulnerable) encourages overconfident discriminator
outputs close to 0 or 1, amplifying errors and potentially
producing extreme shaped rewards.
A standard stabilizing device is to clip logits at a level L > 0,
clipL(x) := max (−L, min (L, x)),
which ensures that shaped rewards remain bounded and that Bellman
backups do not propagate unbounded optimism.
However, clipping alone does not resolve the statistical issue: it caps
worst-case magnitude but provides no explicit accounting of how
uncertain a given logit estimate is on the distribution where the policy
is being optimized.
This motivates a formal treatment of ratio estimation as an inference
problem with explicit uncertainty quantification, so that any residual
error in ℓp and ℓπ is
represented and controlled when constructing the learning signal.
We now formalize the inference task underlying SAR: we wish to
estimate the per-transition log-likelihood ratios
$$
\ell_p(s,a,s') \;:=\; \log\frac{p(s'\mid s,a)}{m(s'\mid s,a)},
\qquad
\ell_\pi(s,a) \;:=\; \log\frac{\pi(a\mid s)}{\pi_b(a\mid s)},
$$
together with explicit, finite-sample uncertainty quantification that
remains meaningful on the distribution where the resulting shaped reward
is optimized.
Since neither p(⋅ ∣ s, a) nor
πb(⋅ ∣ s)
is available in likelihood form, both ratios must be inferred from
samples, and any performance guarantee must therefore be expressed in
terms of (i) where these ratios are certified and (ii) how certification
error propagates through Bellman backups.
We regard training as drawing transitions from two sources.
With probability f ∈ (0, 1) we
draw an environment transition (s, a, r, s′)
from the offline buffer Denv, and with
probability 1 − f we draw a
synthetic transition from Dm obtained by
rolling out the current policy in the model m.
This induces a (policy-dependent) training mixture distribution over
tuples, which we denote by dmix.
The key point is that our guarantees will be stated with respect to a target
distribution ν that is either
dmix itself or a
dominating distribution chosen to upper bound the deployment distribution
of interest.
In later results we move from dmix to the true
discounted occupancy dπ by a standard
concentrability assumption; here we only record that such a change of
measure is necessary.
For the transition ratio, we introduce a binary label Y ∈ {0, 1} indicating whether a
triple x = (s, a, s′)
originates from Denv (Y = 1) or from Dm (Y = 0).
Let P and Q denote the (unknown) distributions
of (s, a, s′)
under these two sources.
A classifier Cϕ(x) ≈ ℙ(Y = 1 ∣ x)
induces a logit
$$
z_\phi(x)\;:=\;\logit(\mathbb{P}(Y=1\mid x)) \;=\;
\log\frac{\mathbb{P}(Y=1\mid x)}{\mathbb{P}(Y=0\mid x)}.
$$
By Bayes’ rule,
$$
z_\phi(x) \;=\; \log\frac{P(x)}{Q(x)} \;+\;
\log\frac{\mathbb{P}(Y=1)}{\mathbb{P}(Y=0)}.
$$
Under the usual reduction used in density-ratio estimation, the
likelihood ratio P(x)/Q(x)
corresponds to p(s′ ∣ s, a)/m(s′ ∣ s, a)
up to the state–action marginal induced by the sampling scheme.
In particular, if the two buffers are constructed so that the class
prior ℙ(Y = 1) is known (e.g.,
by balancing minibatches), then ℓp differs from
zϕ by a
known additive constant cp accounting
for priors and any mismatch in (s, a) marginals:
ℓp(s, a, s′) = zϕ(s, a, s′) + cp,
with an analogous identity for the action ratio obtained from an action
classifier Cψ(s, a) ≈ ℙ(“current
policy” ∣ s, a):
ℓπ(s, a) = zψ(s, a) + cπ.
In what follows we treat cp, cπ
as known constants (or, more generally, as bounded quantities absorbed
into the clipping level); the substantive difficulty is to certify zϕ and zψ as functions
of their inputs under ν.
Fix a confidence level δ ∈ (0, 1) and clipping level L > 0.
We seek estimators ℓ̂p, ℓ̂π
and radii εp, επ
(potentially input-dependent) such that, with probability at least 1 − δ over the randomness of the
data splits and calibration procedure, inputs drawn from a target distribution ν satisfy
$$
\big|\mathrm{clip}_L(\ell_p(s,a,s')) - \mathrm{clip}_L(\widehat\ell_p(s,a,s'))\big| \;\le\; \varepsilon_p(s,a,s'),
\qquad
\big|\mathrm{clip}_L(\ell_\pi(s,a)) - \mathrm{clip}_L(\widehat\ell_\pi(s,a))\big| \;\le\; \varepsilon_\pi(s,a).
$$
The distribution ν should be
read as the distribution on which the shaped reward is optimized during policy
optimization; a canonical choice is ν = dmix.
We emphasize that this is a distributional certificate (coverage under ν), which is the natural object
delivered by calibration and conformal prediction procedures; it is
weaker than a uniform-in-x
guarantee but is the appropriate notion once learning and evaluation are
both distributional.
Operationally, we implement these estimators and radii through calibrated
classifier logits.
Given a calibrated probability interval for u(x) = ℙ(Y = 1 ∣ x)
of the form $[\underline u(x),\overline
u(x)]$, we obtain a logit interval $[\underline z(x),\overline z(x)]$ by
monotonicity of $\logit(\cdot)$, then
define the midpoint estimate and radius
$$
\widehat z(x) \;:=\; \mathrm{clip}_L\!\left(\frac{\underline
z(x)+\overline z(x)}{2}\right),
\qquad
\varepsilon_z(x) \;:=\; \min\!\left\{L,\frac{\overline z(x)-\underline
z(x)}{2}\right\}.
$$
Finally, we set ℓ̂p(x) = ẑϕ(x) + cp
and εp(x) = εzϕ(x)
(and similarly for ℓπ), noting that
additive constants do not affect radii.
Clipping is essential: it yields bounded shaped rewards and prevents
vacuous radii when calibrated intervals approach probabilities near
0 or 1.
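A minimal sketch of this interval-to-(estimate, radius) step, assuming a calibrated probability interval is already available; names and the clipping level are illustrative.

```python
# Sketch: transport a calibrated probability interval through the logit map,
# then clip the midpoint and cap the radius (as in the construction above).
import numpy as np

def logit(u):
    return np.log(u) - np.log1p(-u)

def certified_logit(u_lo, u_hi, clip_level, prior_offset=0.0):
    """Return (clipped midpoint estimate, certificate radius) in logit space."""
    u_lo = np.clip(u_lo, 1e-8, 1.0 - 1e-8)
    u_hi = np.clip(u_hi, 1e-8, 1.0 - 1e-8)
    z_lo, z_hi = logit(u_lo), logit(u_hi)        # monotone transport
    z_hat = np.clip(0.5 * (z_lo + z_hi), -clip_level, clip_level)
    radius = np.minimum(clip_level, 0.5 * (z_hi - z_lo))
    return z_hat + prior_offset, radius          # additive constants leave the radius unchanged
```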
Given (ℓ̂p, εp)
and (ℓ̂π, επ),
we define a shaped reward by subtracting the certificate radius
corresponding to the correction term applied on a given sample.
Concretely, for an environment transition (s, a, r, s′) ∈ Denv,
we apply the policy-shift correction together with its certificate penalty,
whereas for a synthetic transition in Dm we apply the
model-bias correction together with its certificate penalty.
This choice matches the provenance of the shift: policy shift is
relevant when leveraging Denv, and model bias is
relevant when leveraging Dm.
If desired, one may apply both penalties uniformly on mixed batches to
enforce pessimism regardless of source; our analysis accommodates either
convention, as it only changes constants in the reward perturbation
bound.
Two structural properties are immediate.
First, the shaped reward is uniformly bounded:
|r̃C| ≤ max {|log rmin|,|log rmax|} + (α + β)L.
Second, at any (s, a, s′)
where the certificates hold, r̃C is a lower
bound on the corresponding ``ideal’’ SAR reward that would use the true
ratios (after the same clipping convention).
Thus, the certificates enter the learning problem exactly as a
controlled pessimism term, ensuring that any residual uncertainty in
log-ratio estimation is paid for explicitly in the reward signal rather
than implicitly through uncontrolled optimism.
The resulting problem is: choose π̂ ∈ Π by applying an
off-policy RL method (e.g., SAC) to the surrogate MDP defined by the
shaped reward r̃C on the
mixture replay stream, while selecting calibration procedures so that
holds for ν = dmix (or a
dominating distribution), and ensuring that the target policy occupancy
is not too far from dmix (formalized later
via concentrability).
Under these conditions, the value gap between π̂ and the optimal policy decomposes
into intrinsic shift terms and an unavoidable term proportional to the
average certificate radii, which is precisely the quantity our
calibration stage is designed to estimate.
Our certificates are only as trustworthy as the probabilistic statements produced by the discriminators. Accordingly, we separate the construction into two layers: (i) training a scoring model (discriminator) that orders examples correctly and (ii) post-processing its scores into calibrated probabilities, augmented with finite-sample prediction intervals under a designated target distribution ν (typically dmix). We then transport these probability intervals through the $\logit(\cdot)$ map (and finally through clipping) to obtain the log-ratio intervals used by C-SAR.
Let gθ(x) ∈ ℝ
denote the raw score of a discriminator on input x (either x = (s, a, s′)
for transitions or x = (s, a) for
actions), and define the associated probability model σ(gθ(x)),
where σ(t) = (1 + e−t)−1.
We train gθ by minimizing
an empirical risk built from a strictly proper scoring rule; the
canonical choice is logistic loss
$$
\widehat{\mathcal{L}}_{\mathrm{log}}(\theta)
\;:=\;
\frac{1}{n}\sum_{i=1}^n
\Bigl(-y_i\log\sigma(g_\theta(x_i))-(1-y_i)\log(1-\sigma(g_\theta(x_i)))\Bigr),
$$
but alternatives such as the Brier score are equally admissible.
Properness ensures that, in the realizable limit and under the training
distribution, the Bayes-optimal predictor satisfies σ(gθ*(x)) = ℙ(Y = 1 ∣ x),
so that the score gθ*(x)
equals the desired logit up to an additive constant induced by class
priors.
This separation is convenient: the scoring model may be optimized by any
standard classification pipeline, while calibration (below) is
responsible for turning scores into trustworthy probabilities.
In practice the label prior πY := ℙ(Y = 1)
in the discriminator training stream is seldom equal to the population
prior under the target ν. To
make the logit-to-ratio correspondence explicit, we either (a) enforce a
known πY
by balancing minibatches or (b) record the empirical π̂Y and correct
for it.
Concretely, if a calibrated estimate ũ(x) ≈ ℙ(Y = 1 ∣ x)
is obtained under the training prior πYtr,
then the corresponding logit under a desired prior πYν
is
$$
\logit(u^{\nu}(x))
\;=\;
\logit(\tilde u(x))
\;+\;
\log\frac{\pi_Y^{\nu}(1-\pi_Y^{\mathrm{tr}})}{(1-\pi_Y^{\nu})\pi_Y^{\mathrm{tr}}},
$$
provided the conditional class-likelihoods are unchanged.
For our purposes, this is precisely the additive offset that is absorbed
into the constants relating logits to log-ratios; the essential point is
that any valid interval for ũ(x)
transports to an interval for $\logit(u^{\nu}(x))$ with the same coverage
after adding the corresponding constant.
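For completeness, a one-line helper for the prior-offset formula above; the training and target priors are assumed to be known or logged.

```python
# Illustrative helper for the prior-shift offset above (valid when the
# class-conditional likelihoods are unchanged between training and target).
import numpy as np

def prior_offset(pi_tr, pi_nu):
    """Additive logit correction moving from training prior pi_tr to target prior pi_nu."""
    return np.log(pi_nu * (1.0 - pi_tr)) - np.log((1.0 - pi_nu) * pi_tr)
```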
We begin with a holdout split (or cross-fitting) so that calibration
does not reuse samples employed to fit gθ.
Let s(x) := gθ(x)
be the frozen score.
Temperature scaling fits a single scalar T > 0 by minimizing negative
log-likelihood on the calibration set:
T̂ ∈ arg minT > 0∑i ∈ ℐcal( − yilog σ(si/T) − (1 − yi)log (1 − σ(si/T))), si := s(xi),
and outputs ũ(x) = σ(s(x)/T̂).
This preserves the score ordering and typically suffices when the
classifier is well-specified but overconfident.
Isotonic regression instead learns a nondecreasing function ĥ : ℝ → [0, 1] such that ũ(x) = ĥ(s(x))
minimizes squared error on the calibration set.
Isotonic calibration is nonparametric and robust to misspecification, at
the cost of requiring enough calibration data to avoid staircase
artifacts in the tails (which are exactly the regions where logits can
explode without clipping).
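The two point-calibration routes can be sketched as follows on held-out scores; the optimizer, its bounds, and the use of scikit-learn's isotonic regression are assumptions of this example, not prescriptions.

```python
# Sketch of the two point-calibration options discussed above.
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.isotonic import IsotonicRegression

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_temperature(scores_cal, labels_cal):
    """Fit a single temperature T > 0 by negative log-likelihood on the calibration split."""
    def nll(log_T):
        p = np.clip(sigmoid(scores_cal / np.exp(log_T)), 1e-12, 1.0 - 1e-12)
        return -np.mean(labels_cal * np.log(p) + (1 - labels_cal) * np.log(1 - p))
    res = minimize_scalar(nll, bounds=(-5.0, 5.0), method="bounded")
    return np.exp(res.x)

def fit_isotonic(scores_cal, labels_cal):
    """Nonparametric monotone map from raw scores to calibrated probabilities."""
    return IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(scores_cal, labels_cal)

# usage (hypothetical arrays s_cal, y_cal, s_test):
# T_hat = fit_temperature(s_cal, y_cal); u_tilde = sigmoid(s_test / T_hat)
# iso   = fit_isotonic(s_cal, y_cal);    u_tilde = iso.predict(s_test)
```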
Point calibration alone does not yield a certificate.
We therefore require an interval $[\underline
u(x),\overline u(x)]$ that is valid under the target ν.
A convenient route is split conformal prediction, in which we treat the calibrated probability
ũ(x) as a base
regressor for the label Y ∈ {0, 1} and conformalize its
residuals on a calibration set.
One simple construction uses nonconformity scores
αi := |yi − ũ(xi)|, i ∈ ℐcal,
and sets q̂1 − δ to be the
(1 − δ) empirical quantile of
{αi}.
Then we may take
$$
\underline u(x)\;:=\;\max\{0,\tilde u(x)-\widehat{q}_{1-\delta}\},
\qquad
\overline u(x)\;:=\;\min\{1,\tilde u(x)+\widehat{q}_{1-\delta}\}.
$$
By standard split-conformal arguments, the interval contains Y with probability at least 1 − δ under exchangeability.
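A sketch of this split-conformal construction follows; note that the quantile level below includes the standard finite-sample (n+1) correction, whereas the text above uses the plain (1 − δ) empirical quantile.

```python
# Sketch: residual-based split-conformal interval around a calibrated probability.
import numpy as np

def conformal_probability_interval(u_cal, y_cal, u_test, delta):
    """Intervals with marginal 1-delta coverage of the label Y under exchangeability."""
    scores = np.abs(y_cal - u_cal)                          # nonconformity scores
    n = len(scores)
    q_level = min(1.0, np.ceil((n + 1) * (1.0 - delta)) / n)
    q_hat = np.quantile(scores, q_level)
    u_lo = np.maximum(0.0, u_test - q_hat)
    u_hi = np.minimum(1.0, u_test + q_hat)
    return u_lo, u_hi
```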
To convert this into an interval for the conditional probability ℙ(Y = 1 ∣ x) itself, we may
instead employ binary-probability-specific variants such as Venn–Abers predictors,
which return (p0(x), p1(x))
forming an interval with calibration guarantees; we then set $[\underline u(x),\overline
u(x)]=[\min\{p_0(x),p_1(x)\},\max\{p_0(x),p_1(x)\}]$.
The advantage is that the output is natively an interval in [0, 1], well-suited for monotone transport
through $\logit(\cdot)$.
The discriminator is trained on a distribution determined by buffer
construction and rollout policies, whereas certificates are required on
ν.
If ν differs from the
calibration distribution νcal, we require a shift
assumption.
A standard choice is dominated covariate shift: there exists a known or
estimable weight function $w(x)\propto
\frac{d\nu}{d\nu_{\mathrm{cal}}}(x)$ with 0 ≤ w(x) ≤ wmax.
In this setting we may use weighted conformal calibration, replacing the empirical quantile
by a weighted quantile computed with weights {w(xi)}i ∈ ℐcal.
Under the usual conditions for weighted conformal prediction, the
resulting interval attains marginal coverage at level 1 − δ under ν.
When w is only approximately
known, we propagate its estimation error into a slightly inflated δ (or, equivalently, into a
conservative enlargement of the interval), which ultimately appears as a
larger ε in the logit
domain.
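A simplified sketch of the weighted-quantile step; a full weighted-conformal implementation also accounts for the test-point weight, which is omitted here for brevity.

```python
# Sketch: weighted (1 - delta) quantile of nonconformity scores under covariate shift.
import numpy as np

def weighted_quantile(scores, weights, level):
    """Smallest score q whose normalized cumulative weight reaches `level`."""
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)
    idx = np.searchsorted(cdf, level, side="left")
    return s[min(idx, len(s) - 1)]

# q_hat = weighted_quantile(np.abs(y_cal - u_cal), w_cal, 1 - delta)
# and then proceed exactly as in the unweighted split-conformal construction.
```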
Given any valid probability interval $[\underline u(x),\overline u(x)]$,
monotonicity yields the logit interval
$$
\underline z(x)\;:=\;\logit(\underline u(x)),\qquad \overline
z(x)\;:=\;\logit(\overline u(x)),
$$
and after incorporating the (known or bounded) prior-offset constant we
obtain an interval for the desired log-ratio.
Near {0, 1}, the logit map is
unbounded; thus we always apply clipping at level L to both the midpoint estimate and
to the radius.
This step is not merely a technical convenience: it ensures the shaped
rewards remain bounded uniformly and prevents the certification radii
from becoming vacuous due to rare but extreme calibration outputs in the
tails.
Under these constructions, the calibration stage delivers exactly the
objects required by our later value analysis: distributional logit
(hence log-ratio) intervals with explicit confidence, stable under the
mixture sampling and robust to moderate prior mismatch and covariate
shift.
We now assemble the preceding components into an end-to-end offline model-based RL pipeline. The algorithm maintains three coupled objects: a world model m, a policy–critic pair (e.g., SAC) optimized on a shaped reward, and two discriminators whose calibrated logits provide log-ratio estimates. At a high level, we alternate between generating synthetic experience under m, updating discriminators to measure the discrepancy between real and synthetic transitions and between π and the (unknown) behavior policy, calibrating these discrepancies into logit intervals, and performing actor–critic updates using a clipped-logit, certificate-penalized reward.
C-SAR operates with two replay buffers. The first is the fixed offline dataset Denv = {(s, a, r, s′)} sampled from the environment under πb. The second is a growing synthetic buffer Dm obtained by rolling out the current policy π in the learned model m starting from states sampled from Denv. We train the critic and actor on a mixture distribution dmix induced by sampling transitions from fDenv + (1 − f)Dm for some user-chosen mixing fraction f ∈ (0, 1]. The parameter f has a dual role: it controls the degree of extrapolation (smaller f uses more synthetic rollouts) and determines the reference distribution on which the certificates are required to hold.
Given a minibatch of starting states {s0} from Denv, we simulate short-horizon rollouts of length h in m under the current policy π, producing tuples (st, at, r̂t, st + 1) where at ∼ π(⋅ ∣ st) and st + 1 ∼ m(⋅ ∣ st, at). In the simplest instantiation we set r̂t = r(st, at) if rewards are modeled as part of m, or we reuse the logged reward model otherwise; our analysis only requires that the shaped reward used for optimization is bounded, which we ensure by operating on log r and clipping the ratio terms. Short rollouts are not incidental: they restrict compounding model bias and stabilize both discriminator training (by limiting the support drift of Dm) and actor–critic updates.
We train two binary classifiers on labeled membership data: a transition discriminator that separates Denv from Dm, and an action discriminator that separates current-policy actions from dataset actions. We emphasize that neither p nor πb need be evaluable; only membership labels are used. The discriminators may be updated more frequently than the actor–critic, but in practice it is often beneficial to decouple timescales: a small number of discriminator steps per epoch suffices once the classifiers track the slowly changing policy distribution.
At regular intervals we calibrate each discriminator (with a holdout
split or cross-fitting) to obtain a probability interval $[\underline u(x),\overline u(x)]$ for the
relevant conditional probability under the target distribution ν (typically ν = dmix). We
then transport this interval through the logit map and incorporate the
prior-offset constant, producing a logit interval $[\underline z(x),\overline z(x)]$ for the
desired log-ratio. From this interval we define
$$
\widehat \ell(x)\;:=\;\mathrm{clip}_L\!\left(\frac{\underline
z(x)+\overline z(x)}{2}\right),
\qquad
\varepsilon(x)\;:=\;\min\!\left\{L,\frac{\overline z(x)-\underline
z(x)}{2}\right\},
$$
where clipL(t) = max {−L, min {L, t}}.
The clipping level L is a
design parameter that simultaneously bounds the shaped reward and
prevents rare calibration failures in the tails from injecting unbounded
signals into Bellman backups.
The core of C-SAR is a shaped reward that inserts the estimated
log-ratio terms and subtracts their certificate radii to enforce
pessimism. We define the per-transition certified reward r̃C differently
depending on whether the transition comes from Denv or Dm:
r̃C(s, a, r, s′) = log r + 1{(s, a, r, s′) ∈ Denv} (β ℓ̂π(s, a) − β επ(s, a)) + 1{(s, a, r, s′) ∈ Dm} (α ℓ̂p(s, a, s′) − α εp(s, a, s′)).
The weights α, β ≥ 0
tune the relative strength of model-bias correction and policy-shift
correction. Optionally, one may subtract both penalty terms on mixed
batches (i.e., always include −αεp − βεπ)
to obtain a uniformly pessimistic surrogate irrespective of the data
source; this can simplify downstream analysis at the cost of additional
conservatism. The use of log r
rather than r is compatible
with our standing assumption r ∈ [rmin, rmax]
and yields an additive reward-shaping form naturally aligned with
log-ratios.
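The per-transition certified reward above can be computed with a helper along the following lines; `is_env` flags transition provenance, and all names are illustrative.

```python
# Sketch: certified shaped reward r_tilde_C for a batch of mixed transitions.
import numpy as np

def certified_reward(log_r, is_env, lhat_pi, eps_pi, lhat_p, eps_p, alpha, beta):
    """log r + beta*(lhat_pi - eps_pi) on env transitions,
       log r + alpha*(lhat_p - eps_p) on model transitions."""
    env_term = beta * (lhat_pi - eps_pi)
    model_term = alpha * (lhat_p - eps_p)
    return log_r + np.where(is_env, env_term, model_term)
```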
We run a standard off-policy algorithm (e.g., SAC) on the replay
mixture, replacing the reward in the Bellman target by r̃C. Concretely,
if Qω
denotes the critic and πθ the actor, we
compute targets using
y = r̃C + γ 𝔼a′ ∼ πθ(⋅ ∣ s′)[Qω̄(s′, a′) − τlog πθ(a′ ∣ s′)],
with the usual target network Qω̄ and
temperature τ, and apply SGD
updates on squared TD error. The only modification relative to the base
algorithm is the certified reward, which is computed on-the-fly from
discriminator outputs and calibration-derived intervals.
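For concreteness, the modified Bellman target can be sketched as below, with the expectation over a′ approximated by a single sampled action; the function signature is an assumption of this example.

```python
# Sketch: soft Bellman target with the certified reward substituted in.
def certified_td_target(r_tilde_c, gamma, q_target_next, tau, logpi_next):
    """y = r_tilde_C + gamma * (Q_target(s', a') - tau * log pi(a' | s'))."""
    return r_tilde_c + gamma * (q_target_next - tau * logpi_next)
```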
Two invariants are enforced by construction. First, boundedness: for
all transitions,
|r̃C| ≤ max {|log rmin|, |log rmax|} + (α + β)L,
ensuring that value targets remain uniformly bounded and that
contraction-based arguments apply with an effective horizon scaling as
H ≍ (1 − γ)−1.
Second, pessimism at certified points: whenever |ℓ̂ − ℓ| ≤ ε holds
(as guaranteed with high probability under the target distribution), we
have
αℓ̂p − αεp ≤ αℓp, βℓ̂π − βεπ ≤ βℓπ,
so r̃C
lower-bounds the corresponding ideal shaped reward termwise, up to
clipping effects. In implementation, we additionally recommend (i)
balancing discriminator minibatches to fix class priors (thereby making
the logit-to-ratio constant explicit), (ii) limiting rollout horizon
h to keep Dm within the
region where both m and the
certificates are informative, and (iii) cross-fitting calibration to
avoid reusing data for score fitting and interval construction. These
choices do not alter the formal definitions above, but they materially
improve numerical stability and prevent the certificates from becoming
trivially large due to uncontrolled distribution drift.
In this section we formalize the passage from calibrated classification uncertainty to certified errors on the log-ratio terms used by SAR. The logical structure is modular: (i) a calibration procedure produces, for each input, a prediction interval for a conditional class probability under a designated target distribution, and (ii) the standard density-ratio-by-classification identity converts that probability (equivalently its logit) into a log density ratio up to an additive constant determined by class priors. Combining the two yields pointwise log-ratio certificates on the training mixture distribution, which will be the sole interface to the value analysis in Section~7.
Let ν denote the
distribution on which we require valid uncertainty statements. In our
use case ν is the
replay-mixture induced by sampling transitions from fDenv + (1 − f)Dm,
i.e. ν = dmix for
transition inputs x = (s, a, s′)
and similarly for action inputs x = (s, a). A
(possibly cross-fitted) calibration method takes a trained classifier
C(x) ∈ (0, 1) and
returns an interval-valued predictor
$$
x \longmapsto [\underline u(x),\overline u(x)] \subset (0,1)
$$
such that, with probability at least 1 − δ over the calibration
randomness and sample draw (in the sense appropriate to the chosen
calibration tool), we have ℙ(Y = 1 ∣ x) ∈ $[\underline u(x),\overline u(x)]$ for x drawn from ν.
We do not fix a particular calibration technique; conformal prediction
(including split conformal) or other distribution-free methods are admissible
provided they deliver valid coverage under ν.
The first step is purely analytic: the logit map $\logit(t)=\log\frac{t}{1-t}$ is monotone, hence it transports probability intervals to logit intervals. The difficulty is that $\logit(\cdot)$ is unbounded near 0 and 1, so we incorporate clipping to obtain bounded shaped rewards and bounded error radii.
By monotonicity of $\logit$, the probability-interval event implies
$z(x)\in[\underline z(x),\overline
z(x)]$. Therefore $|z(x)-\tfrac{1}{2}(\underline z+\overline z)|\le
\tfrac{1}{2}(\overline z-\underline z)$. Clipping can only reduce
distances (it is 1-Lipschitz onto [−L, L])
and we cap the radius at L,
yielding the stated bound. ▫
We next recall the standard identity linking the optimal class
probability to a density ratio. Let P and Q be two distributions on a common
space 𝒳 with densities p and q (with respect to a dominating
measure). Consider a binary label Y ∈ {1, 0} with class priors ρ := ℙ(Y = 1) and 1 − ρ = ℙ(Y = 0), and
conditional X ∣ (Y = 1) ∼ P,
X ∣ (Y = 0) ∼ Q.
Then the Bayes conditional is
$$
u(x)=\mathbb{P}(Y=1\mid X=x)=\frac{\rho p(x)}{\rho p(x)+(1-\rho)q(x)}.
$$
Algebra yields
$$
\logit(u(x)) \;=\; \log\frac{p(x)}{q(x)} \;+\; \log\frac{\rho}{1-\rho},
\qquad\text{equivalently}\qquad
\log\frac{p(x)}{q(x)} \;=\; \logit(u(x)) + c(\rho).
$$
Thus, up to the additive constant $c(\rho):=\log\frac{1-\rho}{\rho}$ determined
by the class balance used in discriminator training, the desired log
density ratio is the logit of the true conditional class
probability.
For the transition discriminator, x = (s, a, s′) and the two class-conditionals are the real and synthetic transition sources restricted to the training mixture support. Under the idealized picture in which (s, a) is drawn from a common marginal and only the conditional next-state differs, this identity reduces pointwise to $\ell_p(s,a,s')=\log\frac{p(s'\mid s,a)}{m(s'\mid s,a)}$ up to the prior constant. For the action discriminator, x = (s, a) and we analogously obtain $\ell_\pi(s,a)=\log\frac{\pi(a\mid s)}{\pi_b(a\mid s)}$ up to the corresponding constant. In both cases we either (i) enforce balanced minibatches so that $\rho=\tfrac{1}{2}$ and c(ρ) = 0, or (ii) record the sampling ratio used to train the discriminator and correct by the known c(ρ).
Combine with Lemma~ and note that adding the constant c(ρ) commutes with the
midpoint construction; clipping is handled identically to Lemma~. ▫
Applying Lemma~ to the transition discriminator with target
distribution ν = dmix on
x = (s, a, s′)
yields, at level 1 − δp,
|clipL(ℓp(s, a, s′)) − clipL(ℓ̂p(s, a, s′))| ≤ εp(s, a, s′).
Applying the same lemma to the action discriminator on x = (s, a) yields,
at level 1 − δπ,
|clipL(ℓπ(s, a)) − clipL(ℓ̂π(s, a))| ≤ επ(s, a).
A union bound gives simultaneous validity at level at least 1 − (δp + δπ)
on dmix, and
cross-fitting ensures that the events above hold for the distributions
induced by the current epoch without reusing the same samples for both
fitting and interval construction. These are precisely the certificates
required to justify the pessimistic reward adjustment in C-SAR: on the
(high-probability) event of validity, subtracting εp and επ produces
termwise lower bounds on the corresponding unclipped log-ratio
contributions, and clipping ensures the entire shaped reward remains
bounded. In Section~7 we treat (ℓ̂p, εp)
and (ℓ̂π, επ)
as primitive certified inputs and propagate their effect through Bellman
operators to obtain explicit value guarantees.
In this section we take the certified log-ratio inputs produced in Section~ and propagate their effect through Bellman operators to obtain explicit performance guarantees. The guiding viewpoint is that C-SAR induces a surrogate control problem: we optimize a shaped reward which (i) corrects for model bias and policy shift via log-ratio terms, and (ii) is made pessimistic by subtracting certificate radii. The analysis therefore has two separable components: a mismatch term reflecting intrinsic distribution shift (which would persist even with exact ratios), and a certificate term reflecting finite-sample uncertainty in the ratios.
Let r̃* denote the (unimplementable) shift-aware shaped reward which uses the true log-ratios ℓp, ℓπ (and the same clipping level L as in the algorithm). Let r̃C denote the shaped reward used by C-SAR, i.e. the same functional form but with (ℓp, ℓπ) replaced by (ℓ̂p, ℓ̂π) and with the certificate radii subtracted (as in Algorithm C-SAR). Since the training batches are drawn from the replay mixture dmix, it is convenient to write both rewards as functions of a generic transition input x sampled from dmix; the precise dependence on whether x originated from Denv or Dm is immaterial to the algebra below, except through which ratio term is active.
On the high-probability event ℰ that the certificates hold on dmix, we obtain a pointwise sandwich: the certified reward is pessimistic with a controlled gap.
The first inequality is immediate from the construction (subtracting
radii), while the second inequality uses the certificate bounds and the
triangle inequality. We emphasize that this is a pointwise statement and does
not require any Bellman-style argument.
Let Ṽπ, * and Ṽπ, C denote the discounted values of a fixed policy π under the same transition dynamics used for training (i.e. the kernel implicit in dmix), but with rewards r̃* and r̃C, respectively. Since r̃* and r̃C are bounded by clipping (and log r is bounded by r ∈ [rmin, rmax]), both value functions are well-defined and satisfy standard contraction properties.
A direct consequence of Lemma~ is that the value loss due solely to certification is at most horizon-linear in the radii.
The proof is a one-line application of the pointwise sandwich along trajectories
and the geometric series ∑t ≥ 0γt = (1 − γ)−1;
no further structure is needed.
We now connect the surrogate (C-SAR) control problem to the true
environment objective. This step is necessarily assumption-dependent: no
offline method can control the environment value without some overlap
between the target policy visitation and the training distribution. We
therefore assume a standard concentrability condition: there exists
κ ≥ 1 such that, for any candidate policy π under consideration,
$$
\frac{d^{\pi}(x)}{d_{\mathrm{mix}}(x)} \;\le\; \kappa \qquad \text{for all inputs } x.
$$
In addition, we assume that the SAR construction yields a controlled
mismatch between the surrogate backups and the true environment backups.
We keep this component abstract and quantify it by nonnegative terms
Δmodel and Δpolicy (depending on the
intrinsic discrepancy between p and m and between π and πb), scaled by
(α, β). Under these
premises, the only additional degradation introduced by certification is
the horizon-linear certificate term from Lemma~, transferred from dmix to dπ using the concentrability condition above.
Finally, we must account for the fact that we do not run exact value iteration: the algorithm updates an actor–critic (e.g. SAC) on samples. We represent this by an optimization/approximation residual opt_err, meaning that the output policy π̂ is near-optimal for the certified surrogate objective up to opt_err.
The proof is a standard contraction-based argument: we compare (a) the
optimal value in the environment, (b) the optimal value under the ideal
SAR reward (incurring the intrinsic mismatch terms), and (c) the optimal
value under the certified reward (incurring the certificate terms via
Lemma~); we then include opt_err to
reflect approximate solution of the certified surrogate control problem.
The role of κ is only to move
expectations from the training mixture (where certificates are valid) to
the occupancy of the comparator policy; without such a transfer
inequality, the bound is necessarily vacuous.
Theorem~ isolates the precise price of certification: even if the intrinsic shift terms vanish (perfect model and no policy shift), we cannot beat an O(H(αε̄p + βε̄π)) degradation when we insist on high-probability correctness of the log-ratio terms. In Section~8 we show that this dependence is not an artifact of the analysis but is information-theoretically unavoidable.
We now justify, in an information-theoretic sense, why the horizon-linear dependence on certified log-ratio uncertainty that appears in Theorem~ is not merely a proof artifact. The conclusion is twofold. First, even if we grant perfect function approximation and exact optimization of the surrogate objective, any method that relies on estimated log-ratios must pay a price proportional to the uncertainty in those ratios. Second, no such statement can avoid an overlap condition: without support coverage, no offline method (ratio-based or otherwise) can provide non-vacuous guarantees, and this impossibility composes with our certificate-driven one.
We consider algorithms that receive only samples from Denv and Dm and may post-process them arbitrarily (including training m, training discriminators, calibrating them, and performing any actor–critic updates). Fix any such algorithm Alg which outputs a policy π̂ together with any shaped reward or pessimism mechanism that depends on the data only through these samples. The lower bounds we state are of the following form: we construct two instances ℐ0, ℐ1 such that (i) the data distributions they induce are statistically close, yet (ii) their shift-aware optimal decisions differ by an amount governed by the log-ratio uncertainty. The indistinguishability forces Alg to behave similarly on both instances with nontrivial probability, while the difference in the correct objective forces it to be suboptimal in at least one of them. This is the standard ``two-point method'' (Le Cam), specialized to the particular nuisance parameters that C-SAR tries to estimate, namely ℓp and ℓπ.
We sketch the core construction behind Theorem~4. For simplicity,
consider a finite-horizon H
MDP (or discounted with H ≍ (1 − γ)−1)
with a chain structure in which the agent repeatedly encounters a state
st and
must choose between two actions a ∈ {0, 1}. Action 0 is
``safe'' and yields a deterministic next state and moderate reward. Action 1 is ``risky''
and transitions to a high-reward absorbing state if a particular
transition probability is large, and to a low-reward absorbing state if
it is small. We choose the two instances ℐ0, ℐ1 so that they
agree on everything except the transition kernel of the risky action on
a small region; in particular,
p0(⋅ ∣ s, a = 1) ≠ p1(⋅ ∣ s, a = 1), p0(⋅ ∣ s, a = 0) = p1(⋅ ∣ s, a = 0),
and the difference is calibrated so that the induced likelihood ratio
between p and the learned
model m differs by an additive
log amount εp on the risky
transition: ℓp, 1 − ℓp, 0 ≈ 2εp
(after clipping). We then arrange Denv and Dm so that the
samples contain too little information to resolve which of ℐ0, ℐ1 holds on the
risky region (e.g., by making visits to (s, a = 1) sufficiently
rare under πb, and by
ensuring m produces similar
synthetic transitions there). Formally, one ensures that the total
variation (or KL) between the induced data distributions satisfies TV(ℙ0, ℙ1) ≤ c < 1,
so that any test has error bounded away from 0.
Under ℐ0, the correct shift-aware correction would justify selecting the risky action; under ℐ1, it would not. Since the per-step shaped reward discrepancy is Θ(αεp) on the risky branch, the value discrepancy between the two instances under the two competing policies accumulates over horizon, yielding a gap of order Θ(Hαεp). Le Cam’s inequality then implies that for any Alg there exists i ∈ {0, 1} such that, with constant probability under ℐi, the returned policy is suboptimal by at least Ω(Hαεp) in environment value. An analogous argument applies to the policy-shift term ℓπ by constructing two behavior policies πb, 0, πb, 1 that induce indistinguishable (s, a) marginals on the observed data but different true action log-ratios ℓπ on the relevant region, yielding Ω(Hβεπ).
The substantive point is that this lower bound is information-theoretic rather than algorithmic: if the data (plus calibration) can only certify |ℓ̂p − ℓp| ≤ εp on the relevant region, then no downstream control algorithm can guarantee value loss o(Hαεp) uniformly over compatible instances, because the ambiguity in ℓp is itself compatible with multiple environments that imply different optimal decisions.
Clipping and pessimism are necessary for stability and valid high-probability control of errors, but they do not create information. Clipping merely bounds the influence of regions where the ratio is extreme or poorly estimated; pessimism (subtracting radii) protects against over-optimistic errors. The lower bound above precisely matches this logic: if the algorithm chooses to be pessimistic by an amount comparable to ε, it will avoid catastrophic errors but necessarily sacrifices Θ(ε) shaped reward on the ambiguous region, and therefore Θ(Hε) value in the worst case. Conversely, if it refuses to be pessimistic, it must be wrong on at least one indistinguishable instance.
Independently of ratio uncertainty, we recall the classical offline RL barrier: without overlap, policy improvement is impossible in general. Concretely, one may embed a two-armed bandit into the first step of an MDP, with action a = 1 unobserved (or extremely rare) in Denv. Two environments that differ only in the reward of a = 1 then induce identical offline data with nontrivial probability, forcing any algorithm to have large regret on at least one of them. This establishes that some form of concentrability assumption, such as the condition used in our analysis, is not merely technical but logically necessary for non-vacuous guarantees.
Our contribution in the present work is orthogonal: even under overlap, the use of model rollouts and policy improvement introduces additional shift terms that must be corrected, and the correction cannot be more accurate than the certified uncertainty in the implied log-ratios. Thus, the final guarantee must contain (i) an overlap-dependent transfer factor κ and (ii) an uncertainty-dependent term of order H(αεp + βεπ), up to logarithmic factors. Theorem~4 shows that, once the certificates are fixed, this latter dependence is unavoidable.
We outline implementation choices that make the certified shaping terms operational in modern offline RL pipelines, and we propose an experimental plan designed to isolate the roles of calibration, clipping, and pessimism under realistic distribution shift.
Both discriminators are trained on membership labels derived from buffers whose class proportions may vary over time. Concretely, the transition discriminator Cϕ sees positives from Denv and negatives from Dm, while the action discriminator Cψ sees positives from on-policy samples Dπ and negatives from Denv. Since the log-ratio identity includes an additive constant depending on class priors, we recommend either (i) explicit class balancing in each minibatch to enforce equal priors, or (ii) explicit prior correction by adding $c_p=\log\frac{\pi_{\mathrm{neg}}}{\pi_{\mathrm{pos}}}$ (and analogously for cπ) to the calibrated logits. The second option becomes necessary when |Dm| grows much larger than |Denv| or when we use non-uniform sampling from replay. In both cases we log the effective sampling priors used by the discriminator to ensure the constant offsets are not silently drifting across epochs.
We split data for each discriminator into (a) a training subset for fitting raw logits, and (b) a calibration subset (held-out) for constructing prediction intervals for ℙ(Y = 1 ∣ x). We will compare two calibration families: (i) temperature scaling (or Platt-style scaling) producing a calibrated point probability C̃(x), augmented with bootstrap-based intervals; and (ii) conformal calibration producing finite-sample marginal coverage intervals $[\underline p(x),\overline p(x)]$ under a specified target distribution (taken to be the replay sampling distribution for the discriminator). In either case, we report the empirical coverage of the induced intervals on a second holdout set (not used in either training or calibration), since our certificates are applied after the $\logit(\cdot)$ transform and clipping. We emphasize that calibration is assessed on the same distribution on which the shaped reward is optimized; if we change the mixture ratio f or the rollout horizon h, we recalibrate to avoid invalidating the intended coverage level.
The clipping parameter L serves two roles: it bounds shaped rewards and prevents extreme logits (near 0 or 1 probabilities) from dominating learning. In practice, saturation is common precisely on the transitions of greatest interest (where model rollouts deviate from the environment), so clipping must be treated as a primary design parameter rather than an afterthought. We will sweep L on a logarithmic grid and report not only final return but also the fraction of samples whose logits are clipped and the resulting average certificate radius after clipping, $\mathbb{E}[\min\{L,(\overline z-\underline z)/2\}]$. This makes explicit the tradeoff between expressivity of the shaping term and the stability of pessimistic backups.
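The two clipping diagnostics mentioned above (the fraction of clipped logits and the average post-clipping certificate radius) can be computed directly from the interval endpoints; the sketch below assumes per-sample arrays of logit bounds.

```python
# Sketch: saturation and pessimism diagnostics for a batch of logit intervals.
import numpy as np

def clipping_diagnostics(z_lo, z_hi, L):
    z_mid = 0.5 * (z_lo + z_hi)
    frac_clipped = np.mean(np.abs(z_mid) >= L)                  # saturation rate
    avg_radius = np.mean(np.minimum(L, 0.5 * (z_hi - z_lo)))    # mean certificate radius
    return frac_clipped, avg_radius
```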
We will include targeted ablations that correspond to removing individual logical components (for example, using raw uncalibrated logits, removing clipping, and removing the certificate penalty while keeping the point estimates). On controlled synthetic environments (described below) where true ratios are computable, we will add an ablation that uses the true clipped log-ratios. This separates estimation error from the effect of the SAR functional form.
Class imbalance affects both discrimination and calibration: a discriminator can achieve high accuracy while producing poorly calibrated probabilities when the base rate is extreme. We will systematically vary the effective class prior in training by changing the sampling ratio of Denv to Dm (and of Denv to Dπ), and we will compare (i) explicit rebalancing, (ii) prior correction constants, and (iii) importance-weighted calibration procedures. The primary outcome is whether the reported logit intervals maintain coverage under the replay distribution used by the actor–critic update. Secondary outcomes include stability of learning (variance across seeds) and the magnitude of the pessimism penalties induced by widened intervals under severe imbalance.
Because Dm is generated by rolling out the evolving policy π in the learned model m, the negative class for Cϕ is nonstationary, and similarly the positive class for Cψ evolves with π. We will therefore treat discriminator retraining and recalibration as part of the control loop: every K policy updates we refresh Dm (possibly with a sliding window) and refit/calibrate (Cϕ, Cψ). We will vary K and the rollout horizon h to quantify how drift degrades calibration, measured by holdout ECE and by interval under-coverage. We will also test whether shorter rollouts (smaller h) reduce drift sufficiently to yield tighter certificates, trading off synthetic data diversity against certificate tightness.
When m is misspecified, synthetic transitions can fall outside regions where either discriminator generalizes, producing near-deterministic predictions with wide or unreliable intervals. Our plan is to treat the certificates as a diagnostic: large εp concentrated on specific state–action regions indicates model mismatch (or insufficient data) that cannot be corrected by downstream optimization. We will report spatial statistics of εp (e.g., by binning in state features or using learned embeddings) and correlate them with empirical model error metrics (one-step prediction error and multi-step rollout discrepancy). Operationally, we will also evaluate a conservative gating heuristic that drops or downweights synthetic transitions with εp above a threshold, to test whether certificates can be used not only for shaping but also for data selection.
We will evaluate on standard offline RL suites (D4RL locomotion and navigation tasks, including regimes known to stress extrapolation) and NeoRL tasks that include stochasticity and structured dataset shifts. In parallel, we will include controlled synthetic environments: (i) tabular chain MDPs and gridworlds where p is known and m can be perturbed to induce a tunable ℓp gap; and (ii) synthetic behavior-policy shifts where πb is known, allowing direct computation of ℓπ. These controlled settings enable direct measurement of certificate validity and tightness against ground truth.
We will report: (a) policy performance (normalized return and worst-seed return); (b) calibration metrics for both discriminators (ECE, reliability curves, and AUC as a non-calibration baseline); (c) empirical coverage of the logit intervals and average radius ε̄p, ε̄π; and (d) the fraction of clipped logits and the induced average pessimism penalty in shaped reward units. This set of metrics is designed to link performance changes to specific failure modes (miscalibration, saturation, drift, or mismatch) rather than attributing them generically to ``model bias.’’
Our guarantees are deliberately instance-conditional: they certify performance only to the extent that (i) the shaped reward is a valid pessimistic surrogate on the distribution actually used for learning, and (ii) the target policy does not place substantial mass outside that distribution. This section records what is covered by the present analysis, when the certificates become vacuous, and which extensions appear technically plausible.
The calibration statements underpinning εp and επ are formulated under a specified target distribution ν (in our implementation, the replay sampling distribution, i.e., dmix). Consequently, even if the calibration procedure yields finite-sample marginal coverage under ν, it does not imply that |ℓ̂p − ℓp| or |ℓ̂π − ℓπ| are controlled uniformly over all (s, a, s′). The performance bound therefore necessarily depends on a concentrability factor κ relating dπ to dmix. When κ is large (or infinite), the bound can be loose (or void) regardless of how tight the certificates are on dmix. In particular, no discriminator-based ratio correction can manufacture support where none exists: if dmix(s, a) = 0 but dπ(s, a) > 0, then neither calibration nor clipping can prevent extrapolation error from dominating.
Our development is stated for an MDP over states s ∈ 𝒮 with transition kernel p(s′ ∣ s, a). In partially observed settings, one observes ot ∼ 𝒪(⋅ ∣ st) and acts based on histories ht = (o0 : t, a0 : t − 1) or a learned belief/latent state. If we naïvely apply C-SAR with s replaced by o, then the transition discriminator estimates a ratio between transition mixtures rather than the underlying state transitions, and the implied shaping term need not correspond to a valid correction of model bias in the latent dynamics. More subtly, even if we use recurrent policies and critics, the relevant ``state'' becomes ht (or a sufficient statistic thereof), and the overlap assumption must hold in this enlarged space. Establishing an analogue of Thm.~3 in POMDPs thus requires (i) a precise choice of information state, (ii) calibration and certificates for ratios defined on that information state, and (iii) an analysis of approximation error when the learned representation is not sufficient. None of these steps is automatic, and we do not claim MDP-style guarantees under arbitrary state aliasing.
A common modern choice is an implicit or simulator-style dynamics model m (e.g., a diffusion model for next states) for which m(s′ ∣ s, a) is samplable but not tractably evaluable. Our approach already treats m as a black box for sampling and never requires likelihood evaluation; however, the mathematical meaning of $\ell_p=\log\frac{p}{m}$ remains that of a Radon–Nikodym derivative, hence it exists only when p(⋅ ∣ s, a) ≪ m(⋅ ∣ s, a) (or vice versa, depending on the direction). When p and m have near-disjoint support for some (s, a), the true log-ratio is unbounded and any bounded surrogate necessarily incurs irreducible error. Clipping at level L makes optimization stable, but it also means we are optimizing a clipped objective whose relationship to the unclipped correction saturates in precisely the hard regions. Thus, in regimes of severe model misspecification or support mismatch, certificates do not ``fix'' the problem; they merely quantify (often pessimistically) that the correction is unreliable.
There are several concrete failure modes in which the shaped reward
becomes overly pessimistic or uninformative.
First, calibrated probability intervals that approach [0, 1] yield logit intervals of essentially
unbounded width; after clipping, this manifests as εp ≈ L
or επ ≈ L,
so the penalty term αεp + βεπ
can dominate the signal log r.
This occurs under severe class imbalance, nonstationarity of
discriminator inputs, insufficient calibration data, or simply when the
classification task is intrinsically hard on the replay distribution.
Second, even with tight per-sample intervals, the overall bound in
Thm.~3 scales with κH; for long effective
horizons H ≍ (1 − γ)−1,
moderate per-step uncertainty compounds additively. Third, our use of
log r presumes r(s, a) > 0.
While one may shift rewards by a constant to enforce positivity, such
transformations alter the effective objective and can interact
nontrivially with entropy regularization and function approximation. In
short, when pessimism penalties are large or horizon/coverage constants
are unfavorable, the certificate-induced bound correctly indicates that
meaningful guarantees are unattainable from the available data/model
pair.
Our analysis isolates uncertainty arising from estimating ℓp and ℓπ, but it does not provide a tight decomposition of (i) approximation error from the critic and policy classes, (ii) optimization error from finite SGD, and (iii) instability induced by bootstrapping. These effects are subsumed into an aggregate term (cf. opt_err in Thm.~3) and can dominate in practice. Moreover, the discriminators are trained on data that are themselves generated by the evolving policy via Dm and Dπ; although recalibration mitigates drift empirically, our theorems do not model this feedback loop as an adaptive data-collection process with time-uniform guarantees.
A natural extension is to perform C-SAR in a learned latent space
z = fθ(s)
(or z = fθ(o)
in POMDPs), using a latent dynamics model mϑ(z′ ∣ z, a)
and discriminators defined on (z, a, z′).
Technically, one must address two issues. First, the density-ratio
identity applies to the distributions induced by the encoder; thus, the
relevant correction becomes
$$
\ell_{p,z}(z,a,z')=\log\frac{p_z(z'\mid z,a)}{m_\vartheta(z'\mid z,a)},
$$
which coincides with the desired correction only if z is (approximately) sufficient for
control and the encoder is stable across the environment/model
distributions. Second, calibration must be performed under the replay
distribution in latent space, which may shift as the encoder is updated;
this suggests either freezing the encoder during calibration epochs or
calibrating conditionally on the encoder parameters (a substantially
harder problem). On the likelihood-free side, one can replace explicit
logits with simulation-based inference objectives (e.g.,
classifier-based mutual information estimators or noise-contrastive
estimation) and then apply conformal calibration to the resulting
scores. The conceptual requirement remains unchanged: we need
finite-sample intervals for a monotone transform of the true likelihood
ratio under the distribution on which the actor–critic trains.
We view C-SAR as a method for making a common heuristic, reward shaping by model-bias and policy-shift penalties, certifiably pessimistic: whenever the discriminators are uncertain, the algorithm is forced to be pessimistic by an explicit, measurable amount. The corresponding limitation is equally explicit: in low-coverage or high-shift regimes, the only valid certificate may be a large one, and the resulting performance guarantee can be unavoidably weak. Extending these ideas beyond fully observed MDPs and toward latent, implicit, and partially observed models appears feasible, but will require careful redefinition of the ratio objects, as well as calibration procedures robust to representation drift.