
Certified Shifts-Aware Rewards: Calibration-to-Value Guarantees for Model-Based Offline RL

Table of Contents

  1. Introduction: distribution shift in model-based offline RL; SAR as density-ratio reward correction; why classifier miscalibration is the bottleneck; contributions and roadmap.
  2. Background and Setup: MDPs, offline + synthetic data mixture, trajectory shift weighting decomposition (model bias vs policy shift), SAR reward form, and why logits are numerically/statistically fragile.
  3. Problem Formulation: Certified Ratio Estimation for SAR — define target log-ratios, training distributions, certificate guarantees, and how certificates are used to construct pessimistic rewards.
  4. Calibration Methods for Log-Ratio Estimation: proper scoring rules, temperature scaling, isotonic regression, conformalized calibration; deriving high-probability logit intervals under class imbalance and covariate shift in the discriminator training distribution.
  5. The Certified SAR Algorithm (C-SAR): end-to-end pipeline; clipped-logit reward construction; integration into MBPO/SAMBO training with synthetic rollouts; design invariants and stability considerations.
  6. Main Theory I — From Calibration to Certified Log-Ratio Error: lemmas converting probability calibration to bounded logit error and to bounded log density-ratio error on the training mixture.
  7. Main Theory II — From Certified Log-Ratio Error to Value Guarantees: pessimistic value iteration viewpoint; Bellman error bounds; suboptimality theorem with explicit dependence on (εp, επ), horizon, and mismatch divergences.
  8. Lower Bounds / Impossibility: show unavoidable dependence on log-ratio uncertainty; connect to off-policy evaluation lower bounds and coverage limitations.
  9. Practical Considerations and Experimental Plan: ablations (no calibration, calibration without clipping, clipping without calibration), robustness under class imbalance, rollout-induced drift, and implicit model mismatch; benchmarks (D4RL/NeoRL + controlled synthetic environments) and metrics (ECE, logit CI coverage, performance).
  10. Discussion and Limitations: what the guarantees do not cover (POMDPs, implicit models without overlap), when certificates become vacuous, and how to extend to likelihood-free/latent world models.

Content

1. Introduction: distribution shift in model-based offline RL; SAR as density-ratio reward correction; why classifier miscalibration is the bottleneck; contributions and roadmap.

Model-based offline reinforcement learning aims to extract a high-performing policy from a fixed batch of environment experience while optionally leveraging a learned dynamics model to generate additional synthetic rollouts.
The difficulty is that the optimization procedure is intrinsically subject to distribution shift in at least two distinct senses.
First, the state–action distribution induced by the learned policy typically differs from that of the unknown behavior policy that generated the offline dataset; second, synthetic transitions produced by the learned model need not match the true environment dynamics.
Both shifts can be beneficial when used prudently (e.g., for improved exploration of the dataset support), yet they are also precisely the mechanism by which offline RL methods can become arbitrarily over-optimistic.
Our objective is therefore not merely to regularize shift, but to measure it, correct for it in a principled manner, and quantify the residual uncertainty introduced by the correction.

A convenient organizing principle is to view distribution shift as a multiplicative reweighting of trajectory probabilities.
If we denote by p the true transition kernel and by m a learned model used to generate synthetic transitions, then any method that trains on a mixture of environment transitions and model transitions implicitly mixes samples from different Markov chains.
Likewise, the policy we optimize, π, induces visitation frequencies that generally differ from those of the behavior policy πb that produced the offline dataset.
At a formal level, this suggests a decomposition of the relevant Radon–Nikodym derivative into a model-bias component and a policy-shift component, which, in log form, yields additive terms of the type
$$ \ell_p(s,a,s') \;=\; \log\frac{p(s'\mid s,a)}{m(s'\mid s,a)}, \qquad \ell_\pi(s,a) \;=\; \log\frac{\pi(a\mid s)}{\pi_b(a\mid s)}. $$
The Shift-Aware Reward (SAR) perspective proposes to incorporate such log-ratio terms directly into reward shaping so that maximizing the shaped return aligns with a pessimistic objective that accounts for the mismatch between training and evaluation distributions.
In particular, when rewards are strictly positive and bounded, it is natural to work with log r(s, a) (for numerical stability and additivity across time), and to augment it with weighted shift-correction terms.
This provides a single scalar learning signal that can be optimized by standard off-policy actor–critic machinery, while still encoding a precise notion of how far a transition or action is from the distribution under which we have reliable evidence.

However, SAR is only as trustworthy as the density-ratio estimators that supply ℓp and ℓπ.
In modern offline RL pipelines, these ratios are typically obtained via binary classification: we train a discriminator to distinguish environment transitions from model transitions (for p) and to distinguish actions sampled from the current policy from actions in the offline dataset (for π).
Under standard equal-prior or known-prior reductions, the discriminator logit is (up to an additive constant) the desired log-density ratio.
Thus, in the idealized large-sample regime with perfect probability estimates, the SAR correction is well-founded.
In the finite-sample, function-approximation regime, the bottleneck is not merely classification error but miscalibration: the classifier may rank examples correctly while still outputting probabilities whose numerical values are systematically biased.
Since the logit map $u \mapsto \log \frac{u}{1-u}$ amplifies errors near 0 and 1, even modest miscalibration can produce large errors in the inferred log-ratio, which then appear additively in the shaped reward and accumulate over an effective horizon H ≍ (1 − γ)−1.
Consequently, without explicit control of calibration, the SAR correction can be either overly aggressive (destroying useful learning signal) or falsely reassuring (failing to enforce pessimism where it is needed).
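
To make this amplification concrete, the following minimal numerical sketch (the specific probabilities are illustrative only, not taken from our experiments) compares absolute probability errors with the logit errors they induce at the center versus the tail of the probability scale.

```python
import numpy as np

def logit(u):
    return np.log(u) - np.log(1.0 - u)

# Comparable absolute probability errors induce very different logit errors
# depending on where they occur on the probability scale.
for p_true, p_est in [(0.50, 0.52), (0.98, 0.999)]:
    print(f"p_true={p_true:.3f}  p_est={p_est:.3f}  "
          f"|dprob|={abs(p_est - p_true):.3f}  "
          f"|dlogit|={abs(logit(p_est) - logit(p_true)):.2f}")
```

The first case yields a logit error of about 0.08, the second about 3.0, even though the probability errors are of the same order.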

The contribution of this work is to replace heuristic discriminator-based correction with a procedure that propagates finite-sample uncertainty through the classification-to-logit-to-reward pipeline.
Concretely, we treat each discriminator as a probabilistic predictor that is subsequently calibrated (e.g., by temperature scaling and/or conformal prediction), and we extract for each input an interval of plausible probabilities with prescribed coverage.
By mapping this interval through the logit transform and then applying explicit clipping, we obtain both (i) a point estimate of the log-ratio used for learning and (ii) a certificate radius that upper bounds its error with high probability on a specified target distribution (typically a training mixture distribution).
We then enforce pessimism by subtracting these certificate radii inside the shaped reward, yielding a Certified SAR objective whose optimism is quantitatively controlled.
This addresses the core fragility: rather than hoping the classifier is well-calibrated, we ensure that whatever uncertainty remains is explicitly paid for in the learning signal.

At a high level, our technical and algorithmic contributions may be summarized as follows:

The remainder of the paper proceeds as follows.
In the next section we fix notation and describe the offline–synthetic data mixture setting, including the two distinct sources of shift (model bias and policy shift) and the SAR reward form that motivates our construction.
We also explain why the logit-based implementation of SAR is numerically and statistically fragile, thereby motivating calibration, clipping, and certification as essential components rather than optional refinements.
Subsequent sections present the certified estimators, the certified reward shaping rule, and the corresponding performance guarantees.


2. Background and Setup: MDPs, offline + synthetic data mixture, trajectory shift weighting decomposition (model bias vs policy shift), SAR reward form, and why logits are numerically/statistically fragile.

We consider an infinite-horizon discounted Markov decision process (MDP) M = (𝒮, 𝒜, p, r, μ0, γ) with discount factor γ ∈ (0, 1) and initial-state distribution μ0.
At each time t, the agent observes st ∈ 𝒮, selects at ∼ π(⋅ ∣ st), receives reward r(st, at), and transitions to st + 1 ∼ p(⋅ ∣ st, at).
Throughout we assume rewards are strictly positive and bounded,
0 < rmin ≤ r(s, a) ≤ rmax < ∞,
so that log r(s, a) is well-defined and uniformly bounded.
We write H for an effective horizon, which in the discounted case scales as H ≍ (1 − γ)−1.

Our starting point is an offline dataset
Denv = {(s, a, r, s′)}
collected in the true environment by an unknown behavior policy πb.
We do not assume access to πb(a ∣ s) nor to the transition density p(s′ ∣ s, a); the only information available is through samples in Denv.
In addition, we fit a dynamics model m(⋅ ∣ s, a) from Denv and use it as a generative simulator to produce synthetic transitions.
Given a current policy π, we may generate synthetic rollouts by initializing from states drawn from Denv (or an estimate of the state marginal), then iterating a ∼ π(⋅ ∣ s) and s′ ∼ m(⋅ ∣ s, a).
The resulting synthetic buffer is denoted
Dm = {(s, a, r̃, s′)},
where r̃ may be the true reward r(s, a) if reward is known as a function of (s, a), or a modeled reward otherwise.
For the purposes of this section, it suffices that both Denv and Dm provide tuples of the form (s, a, r, s′) and that Dm is conditionally distributed according to m rather than p.

It is convenient to describe training as occurring under a mixture of transition sources.
Fix a mixing fraction f ∈ (0, 1), and let dmix denote the induced distribution over transitions when we sample from Denv with probability f and from Dm with probability 1 − f (with the understanding that Dm itself depends on π through the rollout procedure).
We emphasize that dmix is a training distribution; the evaluation distribution is instead the discounted occupancy induced by π in the true MDP.

For a policy π, let dπ(s, a) denote the discounted state–action occupancy under the true kernel p:
$$ d^{\pi}(s,a) \;:=\; (1-\gamma)\sum_{t\ge 0}\gamma^{t}\,\mathbb{P}_{\pi}(s_t=s,\,a_t=a), $$
with the analogous definition for state occupancies dπ(s).
Offline RL is difficult because dπ is not under our direct control during training: the dataset comes from πb, and synthetic data come from m rather than p.
Accordingly, there are (at least) two distinct sources of distribution shift relevant for learning: model bias, because synthetic transitions are drawn from m rather than from the true kernel p, and policy shift, because the policy π being optimized visits state–action pairs with frequencies that differ from those of the behavior policy πb underlying Denv.

Both shifts are multiplicative at the trajectory level.
To make this precise, consider a finite prefix τ0 : T = (s0, a0, …, sT).
Under a policy π and kernel p, its probability density is proportional to
$$ \mu_0(s_0)\prod_{t=0}^{T-1}\pi(a_t\mid s_t)\,p(s_{t+1}\mid s_t,a_t). $$
If, instead, actions are compared against πb and transitions against m, then the trajectory likelihood ratio factorizes pointwise as
$$ \prod_{t=0}^{T-1}\frac{\pi(a_t\mid s_t)}{\pi_b(a_t\mid s_t)}\cdot \frac{p(s_{t+1}\mid s_t,a_t)}{m(s_{t+1}\mid s_t,a_t)}. $$
Taking logs yields an additive decomposition into per-step terms,
$$ \sum_{t=0}^{T-1}\ell_\pi(s_t,a_t) + \ell_p(s_t,a_t,s_{t+1}), \qquad \ell_\pi(s,a):=\log\frac{\pi(a\mid s)}{\pi_b(a\mid s)},\quad \ell_p(s,a,s'):=\log\frac{p(s'\mid s,a)}{m(s'\mid s,a)}. $$
This decomposition is the basic mechanism by which we can represent, and potentially correct for, the mismatch between the distributions that generate training samples and the distribution under which we evaluate a learned policy.

The SAR viewpoint uses the preceding log-ratios as reward-shaping terms.
Working with log r(s, a) (which is bounded thanks to rmin, rmax) is convenient because it converts multiplicative reweighting at the trajectory level into additivity across time.
At an informal level, a shaped reward of the form
$$ \tilde r(s,a,s') \;=\; \log r(s,a) \;+\; \alpha\,\ell_p(s,a,s') \;+\; \beta\,\ell_\pi(s,a), $$
with weights α, β ≥ 0, encourages policies whose returns remain high after accounting for both model discrepancy and policy deviation from the dataset.
In practice, training uses a mixture of environment and synthetic transitions; thus one may interpret α as modulating how aggressively synthetic rollouts are corrected for model bias, and β as modulating how aggressively policy improvement is corrected for deviation from the behavior policy.
The concrete form of SAR used in a given algorithm depends on where the correction is applied (only on Dm, only on Denv, or on both), but the central structural feature is that the correction enters additively via ℓp and ℓπ.

The practical appeal of SAR is that ℓp and ℓπ can be estimated by binary classification.
For instance, to estimate ℓp, we train a transition classifier to distinguish samples (s, a, s′) coming from Denv (hence distributed according to p(⋅ ∣ s, a)) versus samples coming from Dm (distributed according to m(⋅ ∣ s, a)).
Under standard equal-prior (or known-prior) reductions, the optimal discriminator satisfies
$$ \logit\big(\mathbb P(\mathrm{env}\mid s,a,s')\big) \;=\; \log\frac{p(s'\mid s,a)}{m(s'\mid s,a)} \;+\; c_p, $$
where cp is a constant determined by class priors.
An analogous action discriminator on (s, a) yields ℓπ(s, a) up to an additive prior-dependent constant.
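
As a self-contained sanity check of this reduction (a synthetic Gaussian example using scikit-learn, illustrative only and not part of the C-SAR pipeline), a balanced classifier's logits approximately recover a known log density ratio:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20000
x_p = rng.normal(0.0, 1.0, size=n)   # "environment" samples, P = N(0, 1)
x_q = rng.normal(1.0, 1.0, size=n)   # "model" samples,       Q = N(1, 1)

X = np.concatenate([x_p, x_q]).reshape(-1, 1)
y = np.concatenate([np.ones(n), np.zeros(n)])   # Y = 1 for P, Y = 0 for Q

clf = LogisticRegression().fit(X, y)

# With balanced classes the prior offset vanishes, so the classifier logit
# estimates log P(x)/Q(x).  For these Gaussians the true log-ratio is 0.5 - x.
x_test = np.array([[-1.0], [0.0], [1.0]])
estimated = clf.decision_function(x_test)       # logit of P(Y=1 | x)
true_ratio = 0.5 - x_test.ravel()
print(np.column_stack([x_test.ravel(), estimated, true_ratio]))
```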

This reduction highlights the primary numerical and statistical fragility.
First, the map $u\mapsto \logit(u)=\log\frac{u}{1-u}$ is steep near 0 and 1, so a small absolute probability error can induce a large logit error.
Second, modern discriminators trained by cross-entropy may be accurate in ranking while miscalibrated in probability, especially under distribution shift between training and deployment; such miscalibration is precisely what SAR cannot tolerate, since logits enter additively into rewards and thus accumulate over horizon.
Third, limited data in tail regions (which are exactly the regions where offline RL is most vulnerable) encourages overconfident discriminator outputs close to 0 or 1, amplifying errors and potentially producing extreme shaped rewards.

A standard stabilizing device is to clip logits at a level L > 0,
clipL(x) := max (−L, min (L, x)),
which ensures that shaped rewards remain bounded and that Bellman backups do not propagate unbounded optimism.
However, clipping alone does not resolve the statistical issue: it caps worst-case magnitude but provides no explicit accounting of how uncertain a given logit estimate is on the distribution where the policy is being optimized.
This motivates a formal treatment of ratio estimation as an inference problem with explicit uncertainty quantification, so that any residual error in ℓp and ℓπ is represented and controlled when constructing the learning signal.


3. Problem Formulation: Certified Ratio Estimation for SAR — define target log-ratios, training distributions, certificate guarantees, and how certificates are used to construct pessimistic rewards.

We now formalize the inference task underlying SAR: we wish to estimate the per-transition log-likelihood ratios
$$ \ell_p(s,a,s') \;:=\; \log\frac{p(s'\mid s,a)}{m(s'\mid s,a)}, \qquad \ell_\pi(s,a) \;:=\; \log\frac{\pi(a\mid s)}{\pi_b(a\mid s)}, $$
together with explicit, finite-sample uncertainty quantification that remains meaningful on the distribution where the resulting shaped reward is optimized.
Since neither p(⋅ ∣ s, a) nor πb(⋅ ∣ s) is available in likelihood form, both ratios must be inferred from samples, and any performance guarantee must therefore be expressed in terms of (i) where these ratios are certified and (ii) how certification error propagates through Bellman backups.

We regard training as drawing transitions from two sources.
With probability f ∈ (0, 1) we draw an environment transition (s, a, r, s′) from the offline buffer Denv, and with probability 1 − f we draw a synthetic transition from Dm obtained by rolling out the current policy in the model m.
This induces a (policy-dependent) training mixture distribution over tuples, which we denote by dmix.
The key point is that our certificates will be stated with respect to a target distribution ν that is either dmix itself or a dominating distribution chosen to upper bound the deployment distribution of interest.
In later results we move from dmix to the true discounted occupancy dπ by a standard concentrability assumption; here we only record that such a change of measure is necessary.

For the transition ratio, we introduce a binary label Y ∈ {0, 1} indicating whether a triple x = (s, a, s′) originates from Denv (Y = 1) or from Dm (Y = 0).
Let P and Q denote the (unknown) distributions of (s, a, s) under these two sources.
A classifier Cϕ(x) ≈ ℙ(Y = 1 ∣ x) induces a logit
$$ z_\phi(x)\;:=\;\logit(\mathbb{P}(Y=1\mid x)) \;=\; \log\frac{\mathbb{P}(Y=1\mid x)}{\mathbb{P}(Y=0\mid x)}. $$
By Bayes’ rule,
$$ z_\phi(x) \;=\; \log\frac{P(x)}{Q(x)} \;+\; \log\frac{\mathbb{P}(Y=1)}{\mathbb{P}(Y=0)}. $$
Under the usual reduction used in density-ratio estimation, the likelihood ratio P(x)/Q(x) corresponds to p(s′ ∣ s, a)/m(s′ ∣ s, a) up to the state–action marginal induced by the sampling scheme.
In particular, if the two buffers are constructed so that the class prior ℙ(Y = 1) is known (e.g., by balancing minibatches), then ℓp differs from zϕ by a known additive constant cp accounting for priors and any mismatch in (s, a) marginals:
ℓp(s, a, s′) = zϕ(s, a, s′) + cp,
with an analogous identity for the action ratio obtained from an action classifier Cψ(s, a) ≈ ℙ(“current policy” ∣ s, a):
ℓπ(s, a) = zψ(s, a) + cπ.
In what follows we treat cp, cπ as known constants (or, more generally, as bounded quantities absorbed into the clipping level); the substantive difficulty is to certify zϕ and zψ as functions of their inputs under ν.

Fix a confidence level δ ∈ (0, 1) and clipping level L > 0.
We seek estimators ℓ̂p, ℓ̂π and radii εp, επ (potentially input-dependent) such that, with probability at least 1 − δ over the randomness of the data splits and calibration procedure,
$$ \big|\mathrm{clip}_L(\ell_p(x)) - \mathrm{clip}_L(\widehat\ell_p(x))\big| \;\le\; \varepsilon_p(x) \quad\text{and}\quad \big|\mathrm{clip}_L(\ell_\pi(x)) - \mathrm{clip}_L(\widehat\ell_\pi(x))\big| \;\le\; \varepsilon_\pi(x) \qquad \text{for } x \sim \nu. $$
The distribution ν should be read as the distribution on which the shaped reward is optimized during policy learning; a canonical choice is ν = dmix.
We emphasize that this is a distributional certificate (coverage under ν), which is the natural object delivered by calibration and conformal prediction procedures; it is weaker than a uniform-in-x guarantee but is the appropriate notion once learning and evaluation are both distributional.

Operationally, we implement these certificates through calibrated classifier logits.
Given a calibrated probability interval for u(x) = ℙ(Y = 1 ∣ x) of the form $[\underline u(x),\overline u(x)]$, we obtain a logit interval $[\underline z(x),\overline z(x)]$ by monotonicity of $\logit(\cdot)$, then define the midpoint estimate and radius
$$ \widehat z(x) \;:=\; \mathrm{clip}_L\!\left(\frac{\underline z(x)+\overline z(x)}{2}\right), \qquad \varepsilon_z(x) \;:=\; \min\!\left\{L,\frac{\overline z(x)-\underline z(x)}{2}\right\}. $$
Finally, we set ℓ̂p(x) = ẑ(x) + cp and εp(x) = εz(x) for the transition classifier (and similarly for ℓ̂π, επ from the action classifier), noting that additive constants do not affect radii.
Clipping is essential: it yields bounded shaped rewards and prevents vacuous radii when calibrated intervals approach probabilities near 0 or 1.
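
A minimal vectorized sketch of this interval-to-certificate map (the helper name and the small numerical floor on the probabilities are ours; otherwise the construction follows the definitions above):

```python
import numpy as np

def certified_log_ratio(u_lo, u_hi, c, L, floor=1e-6):
    """Map a probability interval [u_lo, u_hi] for P(Y=1|x) to a clipped
    midpoint log-ratio estimate and a certificate radius; c is the known
    prior-offset constant (c_p or c_pi), L the clipping level."""
    u_lo = np.clip(np.asarray(u_lo, float), floor, 1.0 - floor)  # numerical guard only
    u_hi = np.clip(np.asarray(u_hi, float), floor, 1.0 - floor)
    z_lo = np.log(u_lo) - np.log(1.0 - u_lo)
    z_hi = np.log(u_hi) - np.log(1.0 - u_hi)
    z_hat = np.clip(0.5 * (z_lo + z_hi), -L, L)    # clipped midpoint estimate
    radius = np.minimum(L, 0.5 * (z_hi - z_lo))    # certificate radius
    return z_hat + c, radius                       # constants do not affect radii
```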

Given (ℓ̂p, εp) and (ℓ̂π, επ), we define a shaped reward by subtracting the certificate radius corresponding to the correction term applied on a given sample.
Concretely, for an environment transition (s, a, r, s′) ∈ Denv, we apply policy-shift correction and its penalty,
$$ \tilde r_C(s,a,r,s') \;=\; \log r(s,a) \;+\; \beta\,\widehat\ell_\pi(s,a) \;-\; \beta\,\varepsilon_\pi(s,a), $$
whereas for a synthetic transition in Dm we apply model-bias correction and its penalty,
$$ \tilde r_C(s,a,\tilde r,s') \;=\; \log r(s,a) \;+\; \alpha\,\widehat\ell_p(s,a,s') \;-\; \alpha\,\varepsilon_p(s,a,s'). $$
This choice matches the provenance of the shift: policy shift is relevant when leveraging Denv, and model bias is relevant when leveraging Dm.
If desired, one may apply both penalties uniformly on mixed batches to enforce pessimism regardless of source; our analysis accommodates either convention, as it only changes constants in the reward perturbation bound.

Two structural properties are immediate.
First, the shaped reward is uniformly bounded:
|r̃C| ≤ max {|log rmin|, |log rmax|} + (α + β)L.
Second, at any (s, a, s′) where the certificates hold, r̃C is a lower bound on the corresponding “ideal” SAR reward that would use the true ratios (after the same clipping convention).
Thus, the certificates enter the learning problem exactly as a controlled pessimism term, ensuring that any residual uncertainty in log-ratio estimation is paid for explicitly in the reward signal rather than implicitly through uncontrolled optimism.

The resulting problem is: choose π̂ ∈ Π by applying an off-policy RL method (e.g., SAC) to the surrogate MDP defined by the shaped reward r̃C on the mixture replay stream, while selecting calibration procedures so that the certificate guarantee holds for ν = dmix (or a dominating distribution), and ensuring that the target policy occupancy is not too far from dmix (formalized later via concentrability).
Under these conditions, the value gap between π̂ and the optimal policy decomposes into intrinsic shift terms and an unavoidable term proportional to the average certificate radii, which is precisely the quantity our calibration stage is designed to estimate.


4. Calibration Methods for Log-Ratio Estimation: proper scoring rules, temperature scaling, isotonic regression, conformalized calibration; deriving high-probability logit intervals under class imbalance and covariate shift in the discriminator training distribution.

Our certificates are only as trustworthy as the probabilistic statements produced by the discriminators. Accordingly, we separate the construction into two layers: (i) training a scoring model that orders examples correctly and (ii) post-processing its scores into calibrated probabilities, augmented with finite-sample prediction intervals under a target distribution ν (typically dmix). We then transport these probability intervals through the $\logit(\cdot)$ map (and finally through clipping) to obtain the log-ratio intervals used by C-SAR.

Let gθ(x) ∈ ℝ denote the raw score of a discriminator on input x (either x = (s, a, s) for transitions or x = (s, a) for actions), and define the associated probability model σ(gθ(x)), where σ(t) = (1 + et)−1.
We train gθ by minimizing an empirical risk built from a strictly proper scoring rule; the canonical choice is logistic loss
$$ \widehat{\mathcal{L}}_{\mathrm{log}}(\theta) \;:=\; \frac{1}{n}\sum_{i=1}^n \Bigl(-y_i\log\sigma(g_\theta(x_i))-(1-y_i)\log(1-\sigma(g_\theta(x_i)))\Bigr), $$
but alternatives such as the Brier score are equally admissible.
Properness ensures that, in the realizable limit and under the training distribution, the Bayes-optimal predictor satisfies σ(gθ*(x)) = ℙ(Y = 1 ∣ x), so that the score gθ*(x) equals the desired logit up to an additive constant induced by class priors.
This separation is convenient: the scoring model may be optimized by any standard classification pipeline, while calibration (below) is responsible for turning scores into trustworthy probabilities.

In practice the label prior πY := ℙ(Y = 1) in the discriminator training stream is seldom equal to the population prior under the target ν. To make the logit-to-ratio correspondence explicit, we either (a) enforce a known πY by balancing minibatches or (b) record the empirical π̂Y and correct for it.
Concretely, if a calibrated estimate ũ(x) ≈ ℙ(Y = 1 ∣ x) is obtained under the training prior πYtr, then the corresponding logit under a desired prior πYν is
$$ \logit(u^{\nu}(x)) \;=\; \logit(\tilde u(x)) \;+\; \log\frac{\pi_Y^{\nu}(1-\pi_Y^{\mathrm{tr}})}{(1-\pi_Y^{\nu})\pi_Y^{\mathrm{tr}}}, $$
provided the conditional class-likelihoods are unchanged.
For our purposes, this is precisely the additive offset that is absorbed into the constants relating logits to log-ratios; the essential point is that any interval for ũ(x) transports to an interval for $\logit(u^{\nu}(x))$ with the same coverage after adding the corresponding constant.

We begin with a holdout split (or cross-fitting) so that calibration does not reuse samples employed to fit gθ.
Let s(x) := gθ(x) be the frozen score.
Temperature scaling fits a single scalar T > 0 by minimizing negative log-likelihood on the calibration set:
$$ \widehat T \;\in\; \arg\min_{T>0}\ \sum_{i\in\mathcal I_{\mathrm{cal}}} \Big(-y_i\log\sigma(s_i/T)-(1-y_i)\log\big(1-\sigma(s_i/T)\big)\Big), \qquad s_i := s(x_i), $$
and outputs ũ(x) = σ(s(x)/T̂).
This preserves the score ordering and typically suffices when the classifier is well-specified but overconfident.
Isotonic regression instead learns a nondecreasing function ĥ : ℝ → [0, 1] such that ũ(x) = ĥ(s(x)) minimizes squared error on the calibration set.
Isotonic calibration is nonparametric and robust to misspecification, at the cost of requiring enough calibration data to avoid staircase artifacts in the tails (which are exactly the regions where logits can explode without clipping).
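
A sketch of the temperature-scaling step, assuming NumPy and SciPy are available (the bounded one-dimensional search over log T is an implementation convenience, not part of the formal definition):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(scores, labels):
    """Fit a single temperature T > 0 by minimizing calibration-set NLL of
    sigma(s / T); scores are frozen discriminator logits, labels in {0, 1}."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)

    def nll(log_T):
        z = scores / np.exp(log_T)
        # numerically stable binary cross-entropy with logits
        return np.mean(np.logaddexp(0.0, z) - labels * z)

    res = minimize_scalar(nll, bounds=(-5.0, 5.0), method="bounded")
    return np.exp(res.x)

def calibrated_prob(scores, T):
    return 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float) / T))
```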

Point calibration alone does not yield a certificate.
We therefore require an interval $[\underline u(x),\overline u(x)]$ that is valid under the target ν.
A convenient route is conformalized calibration, in which we treat the calibrated probability ũ(x) as a base regressor for the label Y ∈ {0, 1} and conformalize its residuals on a calibration set.
One simple construction uses nonconformity scores
$$ \alpha_i \;:=\; \big|y_i - \tilde u(x_i)\big|, \qquad i \in \mathcal I_{\mathrm{cal}}, $$
and sets $\widehat{q}_{1-\delta}$ to be the (1 − δ) empirical quantile of {αi}.
Then we may take
$$ \underline u(x)\;:=\;\max\{0,\tilde u(x)-\widehat{q}_{1-\delta}\}, \qquad \overline u(x)\;:=\;\min\{1,\tilde u(x)+\widehat{q}_{1-\delta}\}. $$
By standard split-conformal arguments, the interval contains Y with probability at least 1 − δ under exchangeability.
To convert this into an interval for the conditional probability ℙ(Y = 1 ∣ x), we employ binary-probability-specific variants such as Venn–Abers predictors, which return (p0(x), p1(x)) forming an interval with calibration guarantees; we then set $[\underline u(x),\overline u(x)]=[\min\{p_0(x),p_1(x)\},\max\{p_0(x),p_1(x)\}]$.
The advantage is that the output is natively an interval in [0, 1], well-suited for monotone transport through $\logit(\cdot)$.
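
The residual-based split-conformal construction above can be sketched as follows (Venn–Abers calibration, which we prefer because it natively outputs probability intervals, is not reproduced here; the helper name is ours):

```python
import numpy as np

def split_conformal_interval(u_cal, y_cal, u_test, delta):
    """Residual-based split-conformal interval around calibrated probabilities:
    alpha_i = |y_i - u_i| on the calibration set, then expand by the
    finite-sample-corrected (1 - delta) quantile."""
    u_cal = np.asarray(u_cal, dtype=float)
    y_cal = np.asarray(y_cal, dtype=float)
    alpha = np.abs(y_cal - u_cal)                    # nonconformity scores
    n = len(alpha)
    level = min(1.0, np.ceil((n + 1) * (1.0 - delta)) / n)
    q = np.quantile(alpha, level)
    u_test = np.asarray(u_test, dtype=float)
    return np.clip(u_test - q, 0.0, 1.0), np.clip(u_test + q, 0.0, 1.0)
```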

The discriminator is trained on a distribution determined by buffer construction and rollout policies, whereas certificates are required on ν.
If ν differs from the calibration distribution νcal, we require a shift assumption.
A standard choice is dominated covariate shift: there exists a known or estimable weight function $w(x)\propto \frac{d\nu}{d\nu_{\mathrm{cal}}}(x)$ with 0 ≤ w(x) ≤ wmax.
In this setting we may use weighted conformal calibration, replacing the empirical quantile by a weighted quantile computed with weights {w(xi)}i ∈ ℐcal.
Under the usual conditions for weighted conformal prediction, the resulting interval attains marginal coverage at level 1 − δ under ν.
When w is only approximately known, we propagate its estimation error into a slightly inflated δ (or, equivalently, into a conservative enlargement of the interval), which ultimately appears as a larger ε in the logit domain.
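
A sketch of the weighted quantile used in this setting (simplified: it omits the extra point mass at the test input that exact weighted conformal prediction places on the candidate score; the function name is ours):

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, level):
    """Quantile of nonconformity scores under normalized covariate-shift
    weights w(x_i) proportional to d(nu)/d(nu_cal)."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    cdf = np.cumsum(w) / np.sum(w)                 # weighted empirical CDF
    idx = int(np.searchsorted(cdf, level, side="left"))
    return s[min(idx, len(s) - 1)]
```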

Given any valid probability interval $[\underline u(x),\overline u(x)]$, monotonicity yields the logit interval
$$ \underline z(x)\;:=\;\logit(\underline u(x)),\qquad \overline z(x)\;:=\;\logit(\overline u(x)), $$
and after incorporating the (known or bounded) prior-offset constant we obtain an interval for the desired log-ratio.
Near {0, 1}, the logit map is unbounded; thus we always apply clipping at level L to both the midpoint estimate and to the radius.
This step is not merely a technical convenience: it ensures the shaped rewards remain bounded uniformly and prevents the certification radii from becoming vacuous due to rare but extreme calibration outputs in the tails.
Under these constructions, the calibration stage delivers exactly the objects required by our later value analysis: distributional logit (hence log-ratio) intervals with explicit confidence, stable under the mixture sampling and robust to moderate prior mismatch and covariate shift.


5. The Certified SAR Algorithm (C-SAR): end-to-end pipeline; clipped-logit reward construction; integration into MBPO/SAMBO training with synthetic rollouts; design invariants and stability considerations.

We now assemble the preceding components into an end-to-end offline model-based RL pipeline. The algorithm maintains three coupled objects: a world model m, a policy–critic pair (e.g., SAC) optimized on a shaped reward, and two discriminators whose calibrated logits provide log-ratio estimates. At a high level, we alternate between generating synthetic experience under m, updating discriminators to measure the discrepancy between real and synthetic transitions and between π and the (unknown) behavior policy, calibrating these discrepancies into logit intervals, and performing actor–critic updates using a clipped-logit, certificate-penalized reward.

C-SAR operates with two replay buffers. The first is the fixed offline dataset Denv = {(s, a, r, s′)} sampled from the environment under πb. The second is a growing synthetic buffer Dm obtained by rolling out the current policy π in the learned model m starting from states sampled from Denv. We train the critic and actor on a mixture distribution dmix induced by sampling transitions from fDenv + (1 − f)Dm for some user-chosen mixing fraction f ∈ (0, 1]. The parameter f has a dual role: it controls the degree of extrapolation (smaller f uses more synthetic rollouts) and determines the reference distribution on which the certificates are required to hold.

Given a minibatch of starting states {s0} from Denv, we simulate short-horizon rollouts of length h in m under the current policy π, producing tuples (st, at, r̃t, st + 1) where at ∼ π(⋅ ∣ st) and st + 1 ∼ m(⋅ ∣ st, at). In the simplest instantiation we set r̃t = r(st, at) if rewards are modeled as part of m, or we reuse the logged reward model otherwise; our analysis only requires that the shaped reward used for optimization is bounded, which we ensure by operating on log r and clipping the ratio terms. Short rollouts are not incidental: they restrict compounding model bias and stabilize both discriminator training (by limiting the support drift of Dm) and actor–critic updates.

We train two binary classifiers on labeled membership data.
The transition classifier Cϕ(s, a, s′) receives positives from Denv and negatives from Dm and, after calibration, supplies ℓ̂p; the action classifier Cψ(s, a) receives positives formed by pairing replay states with actions freshly sampled from the current policy π and negatives taken directly from Denv, and supplies ℓ̂π.

We emphasize that neither p nor πb need be evaluable; only membership labels are used. The discriminators may be updated more frequently than the actor–critic, but in practice it is often beneficial to decouple timescales: a small number of discriminator steps per epoch suffices once the classifiers track the slowly changing policy distribution.

At regular intervals we calibrate each discriminator (with a holdout split or cross-fitting) to obtain a probability interval $[\underline u(x),\overline u(x)]$ for the relevant conditional probability under the target distribution ν (typically ν = dmix). We then transport this interval through the logit map and incorporate the prior-offset constant, producing a logit interval $[\underline z(x),\overline z(x)]$ for the desired log-ratio. From this interval we define
$$ \widehat \ell(x)\;:=\;\mathrm{clip}_L\!\left(\frac{\underline z(x)+\overline z(x)}{2}\right), \qquad \varepsilon(x)\;:=\;\min\!\left\{L,\frac{\overline z(x)-\underline z(x)}{2}\right\}, $$
where clipL(t) = max {−L, min {L, t}}. The clipping level L is a design parameter that simultaneously bounds the shaped reward and prevents rare calibration failures in the tails from injecting unbounded signals into Bellman backups.

The core of C-SAR is a shaped reward that inserts the estimated log-ratio terms and subtracts their certificate radii to enforce pessimism. We define the per-transition certified reward C differently depending on whether the transition comes from Denv or Dm:
$$ \tilde r_C(s,a,r,s') \;=\; \log r \;+\; \mathbf{1}\{(s,a,r,s')\in D_{\mathrm{env}}\}\,\big(\beta\,\widehat\ell_\pi(s,a)-\beta\,\varepsilon_\pi(s,a)\big) \;+\; \mathbf{1}\{(s,a,r,s')\in D_m\}\,\big(\alpha\,\widehat\ell_p(s,a,s')-\alpha\,\varepsilon_p(s,a,s')\big). $$
The weights α, β ≥ 0 tune the relative strength of model-bias correction and policy-shift correction. Optionally, one may subtract both penalty terms on mixed batches (i.e., always include −αεp − βεπ) to obtain a uniformly pessimistic surrogate irrespective of the data source; this can simplify downstream analysis at the cost of additional conservatism. The use of log r rather than r is compatible with our standing assumption r ∈ [rmin, rmax] and yields an additive reward-shaping form naturally aligned with log-ratios.
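
A literal transcription of this reward rule as a batch operation (a sketch; the array names and the boolean source flag are ours):

```python
import numpy as np

def certified_reward(log_r, is_env, l_hat_pi, eps_pi, l_hat_p, eps_p, alpha, beta):
    """Per-transition certified SAR reward: the source-matched clipped log-ratio
    estimate enters with its certificate radius subtracted."""
    policy_term = beta * (l_hat_pi - eps_pi)   # environment transitions: policy-shift correction
    model_term = alpha * (l_hat_p - eps_p)     # synthetic transitions: model-bias correction
    return log_r + np.where(is_env, policy_term, model_term)
```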

We run a standard off-policy algorithm (e.g., SAC) on the replay mixture, replacing the reward in the Bellman target by r̃C. Concretely, if Qω denotes the critic and πθ the actor, we compute targets using
$$ y \;=\; \tilde r_C \;+\; \gamma\,\mathbb{E}_{a'\sim\pi_\theta(\cdot\mid s')}\big[\,Q_{\bar\omega}(s',a') - \tau\log\pi_\theta(a'\mid s')\,\big], $$
with the usual target network Qω̄ and temperature τ, and apply SGD updates on squared TD error. The only modification relative to the base algorithm is the certified reward, which is computed on-the-fly from discriminator outputs and calibration-derived intervals.

Two invariants are enforced by construction. First, boundedness: for all transitions,
|r̃C| ≤ max {|log rmin|, |log rmax|} + (α + β)L,
ensuring that value targets remain uniformly bounded and that contraction-based arguments apply with an effective horizon scaling as H ≍ (1 − γ)−1. Second, pessimism at certified points: whenever |ℓ̂ − ℓ| ≤ ε holds (as guaranteed with high probability under the target distribution), we have
αℓ̂p − αεp ≤ αℓp,   βℓ̂π − βεπ ≤ βℓπ,
so r̃C lower-bounds the corresponding ideal shaped reward termwise, up to clipping effects. In implementation, we additionally recommend (i) balancing discriminator minibatches to fix class priors (thereby making the logit-to-ratio constant explicit), (ii) limiting rollout horizon h to keep Dm within the region where both m and the certificates are informative, and (iii) cross-fitting calibration to avoid reusing data for score fitting and interval construction. These choices do not alter the formal definitions above, but they materially improve numerical stability and prevent the certificates from becoming trivially large due to uncontrolled distribution drift.


6. Main Theory I — From Calibration to Certified Log-Ratio Error: lemmas converting probability calibration to bounded logit error and to bounded log density-ratio error on the training mixture.

In this section we formalize the passage from calibrated classification uncertainty to certified errors on the log-ratio terms used by SAR. The logical structure is modular: (i) a calibration procedure produces, for each input, a prediction interval for a conditional class probability under a designated target distribution, and (ii) the standard density-ratio-by-classification identity converts that probability (equivalently its logit) into a log density ratio up to an additive constant determined by class priors. Combining the two yields pointwise log-ratio certificates on the training mixture distribution, which will be the sole interface to the value analysis in Section~7.

Let ν denote the distribution on which we require valid uncertainty statements. In our use case ν is the replay-mixture induced by sampling transitions from fDenv + (1 − f)Dm, i.e. ν = dmix for transition inputs x = (s, a, s′) and similarly for action inputs x = (s, a). A (possibly cross-fitted) calibration method takes a trained classifier C(x) ∈ (0, 1) and returns an interval-valued predictor
$$ x \longmapsto [\underline u(x),\overline u(x)] \subset (0,1) $$
such that, with probability at least 1 − δ over the calibration randomness and sample draw (in the sense appropriate to the chosen calibration tool), we have
$$ \mathbb{P}(Y=1\mid x) \;\in\; \big[\underline u(x),\,\overline u(x)\big] \qquad \text{for } x \sim \nu. $$
We do not fix a particular calibration technique; conformal prediction, split conformal, or other distribution-free methods are admissible provided they deliver valid coverage on ν.

The first step is purely analytic: the logit map $\logit(t)=\log\frac{t}{1-t}$ is monotone, hence it transports probability intervals to logit intervals. The difficulty is that $\logit(\cdot)$ is unbounded near 0 and 1, so we incorporate clipping to obtain bounded shaped rewards and bounded error radii.

In particular, write z(x) := logit(ℙ(Y = 1 ∣ x)), and let ẑ(x) and εz(x) be the clipped midpoint and capped radius constructed from $[\underline u(x),\overline u(x)]$ as in Section~3; then, on the coverage event,
$$ \big|\mathrm{clip}_L(z(x)) - \widehat z(x)\big| \;\le\; \varepsilon_z(x). $$
By monotonicity of $\logit$, the probability-interval coverage implies $z(x)\in[\underline z(x),\overline z(x)]$ under the same event. Therefore $|z(x)-\tfrac{1}{2}(\underline z+\overline z)|\le \tfrac{1}{2}(\overline z-\underline z)$. Clipping can only reduce distances to [−L, L] and we cap the radius by L, yielding the stated bound.

We next recall the standard identity linking the optimal class probability to a density ratio. Let P and Q be two distributions on a common space 𝒳 with densities p and q (with respect to a dominating measure). Consider a binary label Y ∈ {1, 0} with class priors ρ := ℙ(Y = 1) and 1 − ρ = ℙ(Y = 0), and conditional X ∣ (Y = 1) ∼ P, X ∣ (Y = 0) ∼ Q. Then the Bayes conditional is
$$ u(x)=\mathbb{P}(Y=1\mid X=x)=\frac{\rho p(x)}{\rho p(x)+(1-\rho)q(x)}. $$
Algebra yields
$$ \logit\big(u(x)\big) \;=\; \log\frac{p(x)}{q(x)} \;+\; \log\frac{\rho}{1-\rho}, \qquad\text{equivalently}\qquad \log\frac{p(x)}{q(x)} \;=\; \logit\big(u(x)\big) \;+\; \log\frac{1-\rho}{\rho}. $$
Thus, up to the additive constant $c(\rho):=\log\frac{1-\rho}{\rho}$ determined by the class balance used in discriminator training, the desired log density ratio is the logit of the true conditional class probability.

For the transition discriminator, x = (s, a, s′) and the two class-conditionals are the real and synthetic transition sources restricted to the training mixture support. Under the idealized picture in which (s, a) is drawn from a common marginal and only the conditional next-state differs, the identity reduces pointwise to $\ell_p(s,a,s')=\log\frac{p(s'\mid s,a)}{m(s'\mid s,a)}$ up to the prior constant. For the action discriminator, x = (s, a) and we analogously obtain $\ell_\pi(s,a)=\log\frac{\pi(a\mid s)}{\pi_b(a\mid s)}$ up to the corresponding constant. In both cases we either (i) enforce balanced minibatches so that $\rho=\tfrac{1}{2}$ and c(ρ) = 0, or (ii) record the sampling ratio used to train the discriminator and correct by the known c(ρ).



Combining the density-ratio identity with the clipped-logit bound above therefore certifies the log density ratio itself; adding the constant c(ρ) commutes with the midpoint construction, and clipping is handled identically.

Applying this construction to the transition discriminator with target distribution ν = dmix on x = (s, a, s′) yields, at level 1 − δp,
|clipL(ℓp(s, a, s′)) − clipL(ℓ̂p(s, a, s′))| ≤ εp(s, a, s′).
Applying the same construction to the action discriminator on x = (s, a) yields, at level 1 − δπ,
|clipL(ℓπ(s, a)) − clipL(ℓ̂π(s, a))| ≤ επ(s, a).
A union bound gives simultaneous validity at level at least 1 − (δp + δπ) on dmix, and cross-fitting ensures that the events above hold for the distributions induced by the current epoch without reusing the same samples for both fitting and interval construction. These are precisely the certificates required to justify the pessimistic reward adjustment in C-SAR: on the (high-probability) event of validity, subtracting εp and επ produces termwise lower bounds on the corresponding unclipped log-ratio contributions, and clipping ensures the entire shaped reward remains bounded. In Section~7 we treat (ℓ̂p, εp) and (ℓ̂π, επ) as primitive certified inputs and propagate their effect through Bellman operators to obtain explicit value guarantees.


7. Main Theory II — From Certified Log-Ratio Error to Value Guarantees: pessimistic value iteration viewpoint; Bellman error bounds; suboptimality theorem with explicit dependence on (εp, επ), horizon, and mismatch divergences.

In this section we take the certified log-ratio inputs produced in Section~6 and propagate their effect through Bellman operators to obtain explicit performance guarantees. The guiding viewpoint is that C-SAR induces a surrogate control problem: we optimize a shaped reward which (i) corrects for model bias and policy shift via log-ratio terms, and (ii) is made pessimistic by subtracting certificate radii. The analysis therefore has two separable components: a mismatch term reflecting intrinsic distribution shift (which would persist even with exact ratios), and a certificate term reflecting finite-sample uncertainty in the ratios.

Let r̃* denote the (unimplementable) shift-aware shaped reward which uses the true log-ratios ℓp, ℓπ (and the same clipping level L as in the algorithm). Let r̃C denote the shaped reward used by C-SAR, i.e. the same functional form but with (ℓp, ℓπ) replaced by (ℓ̂p, ℓ̂π) and with the certificate radii subtracted (as in Algorithm C-SAR). Since the training batches are drawn from the replay mixture dmix, it is convenient to write both rewards as functions of a generic transition input x sampled from dmix; the precise dependence on whether x originated from Denv or Dm is immaterial to the algebra below, except through which ratio term is active.

On the high-probability event that the certificates hold on dmix, we obtain a pointwise sandwich: the certified reward is pessimistic with a controlled gap,
$$ \tilde r_C(x) \;\le\; \tilde r_*(x) \;\le\; \tilde r_C(x) \;+\; 2\big(\alpha\,\varepsilon_p(x) + \beta\,\varepsilon_\pi(x)\big). $$
The first inequality is immediate from the construction (subtracting radii), while the second inequality uses the certificate bounds and the triangle inequality. We emphasize that this statement is pointwise and does not require any Bellman-style argument.

Let $V^{\pi,*}$ and $V^{\pi,C}$ denote the discounted values of a fixed policy π under the same transition dynamics used for training (i.e. the kernel implicit in dmix), but with rewards r̃* and r̃C, respectively. Since r̃* and r̃C are bounded by clipping (and log r is bounded by r ∈ [rmin, rmax]), both value functions are well-defined and satisfy standard contraction properties.

A direct consequence of the pointwise sandwich is that the value loss due solely to certification is at most horizon-linear in the radii:
$$ 0 \;\le\; V^{\pi,*} - V^{\pi,C} \;\le\; \frac{2\big(\alpha\,\bar\varepsilon_p + \beta\,\bar\varepsilon_\pi\big)}{1-\gamma}, $$
where ε̄p, ε̄π denote the certificate radii averaged over the discounted occupancy of π under the training dynamics.
The proof is a one-line application of the sandwich along trajectories and the geometric series $\sum_{t\ge 0}\gamma^{t}=(1-\gamma)^{-1}$; no further structure is needed.
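
Spelling out that one line, with $\mathbb{E}^{\pi}$ denoting expectation over trajectories of π under the training dynamics and $x_t$ the transition at step t (notation used only for this sketch):
$$ V^{\pi,*}-V^{\pi,C} \;=\; \mathbb{E}^{\pi}\Big[\sum_{t\ge 0}\gamma^{t}\big(\tilde r_{*}(x_t)-\tilde r_{C}(x_t)\big)\Big] \;\le\; \mathbb{E}^{\pi}\Big[\sum_{t\ge 0}\gamma^{t}\,2\big(\alpha\,\varepsilon_p(x_t)+\beta\,\varepsilon_\pi(x_t)\big)\Big] \;=\; \frac{2\big(\alpha\,\bar\varepsilon_p+\beta\,\bar\varepsilon_\pi\big)}{1-\gamma}, $$
with non-negativity of the difference following from the first inequality of the sandwich.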

We now connect the surrogate (C-SAR) control problem to the true environment objective. This step is necessarily assumption-dependent: no offline method can control the environment value without some overlap between the target policy visitation and the training distribution. We therefore assume a standard concentrability condition: there exists κ ≥ 1 such that for any candidate policy π under consideration,
$$ \frac{d^{\pi}(s,a)}{d_{\mathrm{mix}}(s,a)} \;\le\; \kappa \qquad \text{for all } (s,a). $$
In addition, we assume that the SAR construction yields a controlled mismatch between the surrogate backups and the true environment backups. We keep this component abstract and quantify it by nonnegative terms Δmodel and Δpolicy (depending on the intrinsic discrepancy between p and m and between π and πb), scaled by (α, β). Under these premises, the only additional degradation introduced by certification is the horizon-linear certificate term above, transferred from dmix to dπ using the concentrability condition.

Finally, we must account for the fact that we do not run exact value iteration: the algorithm updates an actor–critic (e.g. SAC) on samples. We represent this by an optimization/approximation residual opt_err, meaning that the output policy π̂ is near-optimal for the certified surrogate objective up to opt_err.

Informally, on the event that the certificates hold, the resulting guarantee states that the environment suboptimality of π̂ is bounded by the sum of (i) intrinsic mismatch terms driven by Δmodel and Δpolicy, scaled by (α, β) and the horizon, (ii) a certification term of order κH(αε̄p + βε̄π) driven by the average certificate radii, and (iii) the optimization residual opt_err.

The proof is a standard contraction-based argument: we compare (a) the optimal value in the environment, (b) the optimal value under the ideal SAR reward (incurring the intrinsic mismatch terms), and (c) the optimal value under the certified reward (incurring the certificate terms via the horizon-linear bound above); we then include opt_err to reflect approximate solution of the certified surrogate control problem. The role of κ is only to move expectations from the training mixture (where certificates are valid) to the occupancy of the comparator policy; without such a transfer inequality, the bound is necessarily vacuous.

This theorem isolates the precise price of certification: even if the intrinsic shift terms vanish (perfect model and no policy shift), we cannot beat an O(H(αε̄p + βε̄π)) degradation when we insist on high-probability correctness of the log-ratio terms. In Section~8 we show that this dependence is not an artifact of the analysis but is information-theoretically unavoidable.


8. Lower Bounds / Impossibility: show unavoidable dependence on log-ratio uncertainty; connect to off-policy evaluation lower bounds and coverage limitations.

We now justify, in an information-theoretic sense, why the horizon-linear dependence on certified log-ratio uncertainty that appears in our suboptimality bound is not merely a proof artifact. The conclusion is twofold. First, even if we grant perfect function approximation and exact optimization of the surrogate objective, any method that relies on estimated log-ratios must pay a price proportional to the uncertainty in those ratios. Second, no such statement can avoid an overlap condition: without support coverage, no offline method (ratio-based or otherwise) can provide non-vacuous guarantees, and this impossibility composes with our certificate-driven one.

We consider algorithms that receive only samples from Denv and Dm and may post-process them arbitrarily (including training m, training discriminators, calibrating them, and performing any actor–critic updates). Fix any such algorithm Alg which outputs a policy π̂ together with any shaped reward or pessimism mechanism that depends on the data only through these samples. The lower bounds we state are of the following form: we construct two instances ℐ0, ℐ1 such that (i) the distributions they induce over the observable data (Denv, Dm) are statistically close, so that no test can reliably distinguish them, and (ii) their optimal policies differ enough that any single output policy is substantially suboptimal in at least one instance.

The indistinguishability forces Alg to behave similarly on both instances with nontrivial probability, while the difference in the correct objective forces it to be suboptimal in at least one of them. This is the standard “two-point method” (Le Cam), specialized to the particular nuisance parameters that C-SAR tries to estimate, namely ℓp and ℓπ.
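
One standard form of this argument, stated here for orientation (the notation $\mathrm{SubOpt}_i$, Δ, and $\mathbb{P}_i$ for the induced data distributions is introduced only for this sketch): if the two instances are built so that $\mathrm{SubOpt}_0(\pi)+\mathrm{SubOpt}_1(\pi)\ge 2\Delta$ for every policy π, then for any data-dependent output π̂,
$$ \mathbb{E}_{0}\big[\mathrm{SubOpt}_0(\widehat\pi)\big]+\mathbb{E}_{1}\big[\mathrm{SubOpt}_1(\widehat\pi)\big] \;\ge\; 2\Delta\int \min\{\mathrm{d}\mathbb{P}_0,\mathrm{d}\mathbb{P}_1\} \;=\; 2\Delta\,\big(1-\mathrm{TV}(\mathbb{P}_0,\mathbb{P}_1)\big), $$
so at least one instance suffers expected suboptimality $\Delta\,(1-\mathrm{TV}(\mathbb{P}_0,\mathbb{P}_1))$; taking Δ ≍ Hαεp and TV(ℙ0, ℙ1) ≤ c < 1 gives the Ω(Hαεp) rate used below.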

We sketch the core construction behind Theorem~4. For simplicity, consider a finite-horizon H MDP (or discounted with H ≍ (1 − γ)−1) with a chain structure in which the agent repeatedly encounters a state st and must choose between two actions a ∈ {0, 1}. Action 0 is “safe” and yields a deterministic next state and moderate reward. Action 1 is “risky” and transitions to a high-reward absorbing state if a particular transition probability is large, and to a low-reward absorbing state if it is small. We choose the two instances ℐ0, ℐ1 so that they agree on everything except the transition kernel of the risky action on a small region; in particular,
p0(⋅ ∣ s, a = 1) ≠ p1(⋅ ∣ s, a = 1),   p0(⋅ ∣ s, a = 0) = p1(⋅ ∣ s, a = 0),
and the difference is calibrated so that the induced likelihood ratio between p and the learned model m differs by an additive log amount εp on the risky transition: ℓp, 1 − ℓp, 0 ≈ 2εp (after clipping). We then arrange Denv and Dm so that the samples contain too little information to resolve which of ℐ0, ℐ1 holds on the risky region (e.g., by making visits to (s, a = 1) sufficiently rare under πb, and by ensuring m produces similar synthetic transitions there). Formally, one ensures that the total variation (or KL) between the induced data distributions satisfies TV(ℙ0, ℙ1) ≤ c < 1, so that any test has error bounded away from 0.

Under ℐ0, the correct shift-aware correction would justify selecting the risky action; under ℐ1, it would not. Since the per-step shaped reward discrepancy is Θ(αεp) on the risky branch, the value discrepancy between the two instances under the two competing policies accumulates over horizon, yielding a gap of order Θ(Hαεp). Le Cam’s inequality then implies that for any Alg there exists i ∈ {0, 1} such that, with constant probability under ℐi, the returned policy is suboptimal by at least Ω(Hαεp) in environment value. An analogous argument applies to the policy-shift term ℓπ by constructing two behavior policies πb, 0, πb, 1 that induce indistinguishable (s, a) marginals on the observed data but different true action log-ratios ℓπ on the relevant region, yielding Ω(Hβεπ).

The substantive point is that this lower bound is information-theoretic rather than algorithmic: if the data (plus calibration) can only certify |ℓ̂p − ℓp| ≤ εp on the relevant region, then no downstream control algorithm can guarantee value loss o(Hαεp) uniformly over compatible instances, because the ambiguity in ℓp is itself compatible with multiple environments that imply different optimal decisions.

Clipping and pessimism are necessary for stability and valid high-probability control of errors, but they do not create information. Clipping merely bounds the influence of regions where the ratio is extreme or poorly estimated; pessimism (subtracting radii) protects against over-optimistic errors. The lower bound above precisely matches this logic: if the algorithm chooses to be pessimistic by an amount comparable to ε, it will avoid catastrophic errors but necessarily sacrifices Θ(ε) shaped reward on the ambiguous region, and therefore Θ(Hε) value in the worst case. Conversely, if it refuses to be pessimistic, it must be wrong on at least one indistinguishable instance.

Independently of ratio uncertainty, we recall the classical offline RL barrier: without overlap, policy improvement is impossible in general. Concretely, one may embed a two-armed bandit into the first step of an MDP, with action a = 1 unobserved (or extremely rare) in Denv. Two environments that differ only in the reward of a = 1 then induce identical offline data with nontrivial probability, forcing any algorithm to have large regret on at least one of them. This establishes that some form of concentrability, such as the condition assumed in Section~7, is not merely technical but logically necessary for non-vacuous guarantees.

Our contribution in the present work is orthogonal: even with overlap, the use of model rollouts and policy improvement introduces additional shift terms that must be corrected, and the correction cannot be more accurate than the certified uncertainty in the implied log-ratios. Thus, the final guarantee must contain (i) an overlap-dependent transfer factor κ and (ii) an uncertainty-dependent term of order H(αεp + βεπ), up to logarithmic factors. Theorem~4 shows that, once the certificates are fixed, this latter dependence is unavoidable.


9. Practical Considerations and Experimental Plan: ablations (no calibration, calibration without clipping, clipping without calibration), robustness under class imbalance, rollout-induced drift, and implicit model mismatch; benchmarks (D4RL/NeoRL + controlled synthetic environments) and metrics (ECE, logit CI coverage, performance).

We outline implementation choices that make the certified shaping terms operational in modern offline RL pipelines, and we propose an experimental plan designed to isolate the roles of calibration, clipping, and pessimism under realistic distribution shift.

Both discriminators are trained on membership labels derived from buffers whose class proportions may vary over time. Concretely, the transition discriminator Cϕ sees positives from Denv and negatives from Dm, while the action discriminator Cψ sees positives from on-policy samples Dπ and negatives from Denv. Since the log-ratio identity includes an additive constant depending on class priors, we recommend either (i) explicit class balancing in each minibatch to enforce equal priors, or (ii) explicit prior correction by adding $c_p=\log\frac{\pi_{\mathrm{neg}}}{\pi_{\mathrm{pos}}}$ (and analogously for cπ) to the calibrated logits. The second option becomes necessary when |Dm| grows much larger than |Denv| or when we use non-uniform sampling from replay. In both cases we log the effective sampling priors used by the discriminator to ensure the constant offsets are not silently drifting across epochs.

We split data for each discriminator into (a) a training subset for fitting raw logits, and (b) a calibration subset (held-out) for constructing prediction intervals for ℙ(Y = 1 ∣ x). We will compare two calibration families: (i) temperature scaling (or Platt-style scaling) producing a calibrated point probability (x), augmented with bootstrap-based intervals; and (ii) conformal calibration producing finite-sample marginal coverage intervals $[\underline p(x),\overline p(x)]$ under a specified target distribution (taken to be the replay sampling distribution for the discriminator). In either case, we report the empirical coverage of the induced intervals on a second holdout set (not used in either training or calibration), since our certificates are applied after the $\logit(\cdot)$ transform and clipping. We emphasize that calibration is assessed on the same distribution on which the shaped reward is optimized; if we change the mixture ratio f or the rollout horizon h, we recalibrate to avoid invalidating the intended coverage level.

The clipping parameter L serves two roles: it bounds shaped rewards and prevents extreme logits (near 0 or 1 probabilities) from dominating learning. In practice, saturation is common precisely on the transitions of greatest interest (where model rollouts deviate from the environment), so clipping must be treated as a primary design parameter rather than an afterthought. We will sweep L on a logarithmic grid and report not only final return but also the fraction of samples whose logits are clipped and the resulting average certificate radius after clipping, $\mathbb{E}[\min\{L,(\overline z-\underline z)/2\}]$. This makes explicit the tradeoff between expressivity of the shaping term and the stability of pessimistic backups.
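
The two diagnostics just described can be computed directly from the calibrated logit intervals; a small helper sketch (names are ours):

```python
import numpy as np

def clipping_diagnostics(z_lo, z_hi, L):
    """Fraction of midpoint logits that saturate the clipping level, and the
    average certificate radius after clipping, E[min{L, (z_hi - z_lo)/2}]."""
    z_lo = np.asarray(z_lo, dtype=float)
    z_hi = np.asarray(z_hi, dtype=float)
    mid = 0.5 * (z_lo + z_hi)
    frac_clipped = float(np.mean(np.abs(mid) >= L))
    avg_radius = float(np.mean(np.minimum(L, 0.5 * (z_hi - z_lo))))
    return frac_clipped, avg_radius
```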

We will include targeted ablations that correspond to removing individual logical components: (i) no calibration — raw discriminator logits inserted directly as log-ratio estimates; (ii) calibration without clipping — calibrated logit intervals used without a clipping level L; (iii) clipping without calibration — clipped raw logits with no calibrated intervals and hence no certificate penalty; and (iv) the full C-SAR pipeline combining calibration, clipping, and certificate-based pessimism.

On controlled synthetic environments (described below) where true ratios are computable, we will add an ablation that uses the true clipped log-ratios. This separates estimation error from the effect of the SAR functional form.

Class imbalance affects both discrimination and calibration: a discriminator can achieve high accuracy while producing poorly calibrated probabilities when the base rate is extreme. We will systematically vary the effective class prior in training by changing the sampling ratio of Denv to Dm (and of Denv to Dπ), and we will compare (i) explicit rebalancing, (ii) prior correction constants, and (iii) importance-weighted calibration procedures. The primary outcome is whether the reported logit intervals maintain coverage under the replay distribution used by the actor–critic update. Secondary outcomes include stability of learning (variance across seeds) and the magnitude of the pessimism penalties induced by widened intervals under severe imbalance.

Because Dm is generated by rolling out the evolving policy π in the learned model m, the negative class for Cϕ is nonstationary, and similarly the positive class for Cψ evolves with π. We will therefore treat discriminator retraining and recalibration as part of the control loop: every K policy updates we refresh Dm (possibly with a sliding window) and refit/calibrate (Cϕ, Cψ). We will vary K and the rollout horizon h to quantify how drift degrades calibration, measured by holdout ECE and by interval under-coverage. We will also test whether shorter rollouts (smaller h) reduce drift sufficiently to yield tighter certificates, trading off synthetic data diversity against certificate tightness.

When m is misspecified, synthetic transitions can fall outside regions where either discriminator generalizes, producing near-deterministic predictions with wide or unreliable intervals. Our plan is to treat the certificates as a diagnostic: large εp concentrated on specific state–action regions indicates model mismatch (or insufficient data) that cannot be corrected by downstream optimization. We will report spatial statistics of εp (e.g., by binning in state features or using learned embeddings) and correlate them with empirical model error metrics (one-step prediction error and multi-step rollout discrepancy). Operationally, we will also evaluate a conservative gating heuristic that drops or downweights synthetic transitions with εp above a threshold, to test whether certificates can be used not only for shaping but also for data filtering.

We will evaluate on standard offline RL suites (D4RL locomotion and navigation tasks, including regimes known to stress extrapolation) and NeoRL tasks that include stochasticity and structured dataset shifts. In parallel, we will include controlled synthetic environments: (i) tabular chain MDPs and gridworlds where p is known and m can be perturbed to induce a tunable ℓp gap; and (ii) synthetic behavior-policy shifts where πb is known, allowing direct computation of ℓπ. These controlled settings enable direct measurement of certificate validity and tightness against ground truth.

We will report: (a) policy performance (normalized return and worst-seed return); (b) calibration metrics for both discriminators (ECE, reliability curves, and AUC as a non-calibration baseline); (c) empirical coverage of the logit intervals and average radius ε̄p, ε̄π; and (d) the fraction of clipped logits and the induced average pessimism penalty in shaped reward units. This set of metrics is designed to link performance changes to specific failure modes (miscalibration, saturation, drift, or mismatch) rather than attributing them generically to “model bias”.
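
For reference, the binned ECE reported for each discriminator can be computed as follows (a standard equal-width-bin sketch; for a binary discriminator the per-bin “accuracy” is the empirical frequency of the positive class):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: occupancy-weighted average of |empirical positive rate -
    mean predicted probability| over equal-width probability bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # include the right edge in the last bin only
        mask = (probs >= lo) & ((probs < hi) if i < n_bins - 1 else (probs <= hi))
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece
```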


10. Discussion and Limitations: what the guarantees do not cover (POMDPs, implicit models without overlap), when certificates become vacuous, and how to extend to likelihood-free/latent world models.

Our guarantees are deliberately instance-conditional: they certify performance only to the extent that (i) the shaped reward is a valid pessimistic surrogate on the distribution actually used for learning, and (ii) the target policy does not place substantial mass outside that distribution. This section records what is covered by the present analysis, when the certificates become vacuous, and which extensions appear technically plausible.

The calibration statements underpinning εp and επ are formulated under a specified target distribution ν (in our implementation, the replay sampling distribution, i.e., dmix). Consequently, even if the calibration procedure yields finite-sample marginal coverage under ν, it does not imply that |ℓ̂p − ℓp| or |ℓ̂π − ℓπ| are controlled uniformly over all (s, a, s′). The performance bound therefore necessarily depends on a concentrability factor κ relating dπ to dmix. When κ is large (or infinite), the bound can be loose (or void) regardless of how tight the certificates are on dmix. In particular, no discriminator-based ratio correction can manufacture support where none exists: if dmix(s, a) = 0 but dπ(s, a) > 0, then neither calibration nor clipping can prevent extrapolation error from dominating.
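For concreteness, one common instantiation of the concentrability factor (the bound may use a weighted or Bellman-restricted variant; this sup-ratio form is only illustrative) is
$$ \kappa \;=\; \sup_{(s,a)}\ \frac{d^{\pi}(s,a)}{d^{\mathrm{mix}}(s,a)}, $$
which is infinite precisely in the no-support case described above.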

Our development is stated for an MDP over states s ∈ 𝒮 with transition kernel p(s′ ∣ s, a). In partially observed settings, one observes ot ∼ 𝒪(⋅ ∣ st) and acts based on histories ht = (o0:t, a0:t−1) or a learned belief/latent state. If we naïvely apply C-SAR with s replaced by o, then the transition discriminator estimates a ratio between transition mixtures rather than the underlying state transitions, and the implied shaping term need not correspond to a valid correction of model bias in the latent dynamics. More subtly, even if we use recurrent policies and critics, the relevant "state" becomes ht (or a sufficient statistic thereof), and the overlap assumption must hold in this enlarged space. Establishing an analogue of Thm. 3 in POMDPs thus requires (i) a precise choice of information state, (ii) calibration and certificates for ratios defined on that information state, and (iii) an analysis of approximation error when the learned representation is not sufficient. None of these steps is automatic, and we do not claim MDP-style guarantees under arbitrary state aliasing.

A common modern choice is an implicit or simulator-style dynamics model m (e.g., a diffusion model for next states) for which m(s′ ∣ s, a) is samplable but not tractably evaluable. Our approach already treats m as a black box for sampling and never requires likelihood evaluation; however, the definition of $\ell_p=\log\frac{p}{m}$ remains that of a Radon–Nikodym derivative, hence it exists only when p(⋅ ∣ s, a) ≪ m(⋅ ∣ s, a) (or vice versa, depending on the direction). When p and m have near-disjoint support for some (s, a), the true log-ratio is unbounded and any bounded surrogate necessarily incurs irreducible error. Clipping at level L makes optimization stable, but it also means we are optimizing a clipped objective whose relationship to the unclipped correction saturates in precisely the hard regions. Thus, in regimes of severe model misspecification or support mismatch, certificates do not "fix" the problem; they merely quantify (often pessimistically) that the correction is unreliable.

There are several concrete failure modes in which the shaped reward becomes overly pessimistic or uninformative.
First, calibrated probability intervals that approach [0, 1] yield logit intervals of essentially unbounded width; after clipping, this manifests as εp ≈ L or επ ≈ L, so the penalty term αεp + βεπ can dominate the signal log r. This occurs under severe class imbalance, nonstationarity of discriminator inputs, insufficient calibration data, or simply when the classification task is intrinsically hard on the replay distribution. Second, even with tight per-sample intervals, the overall bound in Thm. 3 scales with κH; for long effective horizons H ≍ (1 − γ)⁻¹, even moderate per-step uncertainty accumulates into a large aggregate penalty. Third, our use of log r presumes r(s, a) > 0. While one may shift rewards by a constant to enforce positivity, such transformations alter the effective objective and can interact nontrivially with entropy regularization and function approximation. In short, when pessimism penalties are large or horizon/coverage constants are unfavorable, the certificate-induced bound correctly indicates that meaningful guarantees are unattainable from the available data/model pair.
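As a tiny numeric illustration of the first failure mode (arbitrary values; we assume only that the shaped reward subtracts the penalty αεp + βεπ from the log r signal, as described above), certificate radii near the clipping level swamp the reward term:

```python
import numpy as np

# Arbitrary illustrative values: radii near the clipping level L dwarf log r.
L, alpha, beta = 5.0, 1.0, 1.0
log_r = np.log(1.5)                      # reward signal on the log scale
eps_p, eps_pi = 0.96 * L, 0.90 * L       # near-saturated certificate radii
penalty = alpha * eps_p + beta * eps_pi
print(f"log r = {log_r:.2f}, penalty = {penalty:.2f}")   # ~0.41 vs ~9.30
```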

Our analysis isolates uncertainty arising from estimating p and π, but it does not provide a tight decomposition of (i) approximation error from the critic and policy classes, (ii) optimization error from finite SGD, and (iii) instability induced by bootstrapping. These effects are subsumed into an aggregate term (cf. opt_err in Thm. 3) and can dominate in practice. Moreover, the discriminators are trained on data that are themselves generated by the evolving policy via Dm and Dπ; although recalibration mitigates drift empirically, our theorems do not model this feedback loop as an adaptive data-collection process with time-uniform guarantees.

A natural extension is to perform C-SAR in a learned latent space z = fθ(s) (or z = fθ(o) in POMDPs), using a latent dynamics model mϑ(z′ ∣ z, a) and discriminators defined on (z, a, z′). Technically, one must address two issues. First, the density-ratio identity applies to the distributions induced by the encoder; thus, the relevant correction becomes
$$ \ell_{p,z}(z,a,z')=\log\frac{p_z(z'\mid z,a)}{m_\vartheta(z'\mid z,a)}, $$
which coincides with the desired correction only if z is (approximately) sufficient for control and the encoder is stable across the environment/model distributions. Second, calibration must be performed under the replay distribution in latent space, which may shift as the encoder is updated; this suggests either freezing the encoder during calibration epochs or calibrating conditionally on the encoder parameters (a substantially harder problem). On the likelihood-free side, one can replace explicit logits with simulation-based inference objectives (e.g., classifier-based mutual information estimators or noise-contrastive estimation) and then apply conformal calibration to the resulting scores. The conceptual requirement remains unchanged: we need finite-sample intervals for a monotone transform of the true likelihood ratio under the distribution on which the actor–critic trains.
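A minimal sketch of the latent-space discriminator described above (Python with scikit-learn; random vectors stand in for encoder outputs, and the prior-odds correction assumes a Bayes-optimal classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def latent_log_ratio_estimator(z_env, z_model):
    """Fit a logistic discriminator on latent triples and return a callable that maps
    encoded (z, a, z') features to an estimated log density ratio (the discriminator
    logit, corrected for the training class prior)."""
    x = np.concatenate([z_env, z_model])
    y = np.concatenate([np.ones(len(z_env)), np.zeros(len(z_model))])
    clf = LogisticRegression(max_iter=1000).fit(x, y)
    prior_log_odds = np.log(len(z_env) / len(z_model))
    return lambda z: clf.decision_function(z) - prior_log_odds

# Dummy usage: random features stand in for the encoder applied to (s, a, s').
rng = np.random.default_rng(0)
z_env = rng.normal(loc=0.2, scale=1.0, size=(500, 8))
z_model = rng.normal(loc=0.0, scale=1.0, size=(2000, 8))
log_ratio = latent_log_ratio_estimator(z_env, z_model)
print(log_ratio(rng.normal(size=(3, 8))))
```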

We view C-SAR as a method for making a common heuristic, reward shaping by model-bias and policy-shift penalties, certifiable: whenever the discriminators are uncertain, the algorithm is forced to be pessimistic by an explicit, measurable amount. The corresponding limitation is equally explicit: in low-coverage or high-shift regimes, the only valid certificate may be a large one, and the resulting performance guarantee can be unavoidably weak. Extending these ideas beyond fully observed MDPs and toward latent, implicit, and partially observed models appears feasible, but it will require careful redefinition of the ratio objects, as well as calibration procedures robust to representation drift.