
Anchors are Necessary: Tight Lower/Upper Bounds for Episodic Sensor Bias in Offline RL


Table of Contents

  1. Introduction: episodic non-stationary observation shifts, identifiability gap in offline RL deployment; why FORL’s success still leaves a fundamental question (when is correction possible at all?).
  2. Problem Setup and Symmetry: formalize episodic offset POMDPs, no-offset-reveal regime; define symmetry groups of MDPs and induced equivalence classes of latent states under unknown offsets.
  3. Impossibility Results (No Anchors): construct symmetric MDP families where observation histories cannot distinguish distinct states/offsets; prove constant minimax estimation error and non-vanishing control loss.
  4. Anchor Models: define three minimal anchoring mechanisms (absolute coordinate, offset probe, trusted landmark channel) and formal conditions under which anchors break the symmetry group.
  5. Anchor Filter Wrapper (Upper Bounds): present a simple belief/elimination algorithm over symmetry hypotheses; prove Θ(1/p) expected localization time and corresponding regret bounds; discuss continuous vs finite state.
  6. Matching Lower Bounds: prove Ω(1/p) localization lower bound for any algorithm under each anchor model via information arguments / optional stopping / indistinguishability until first informative anchor.
  7. Extensions: noisy anchors, affine (diagonal) observation transforms, partial observability beyond offsets, and dependence on policy excitation.
  8. (Optional) Experiments: anchor-augmented FORL on long-horizon rollouts; verify scaling with p, demonstrate stabilization and tail-risk reduction vs no-anchor baselines.
  9. Related Work: FORL, robust offline RL, POMDP identifiability, SLAM/loop-closure analogies, non-stationary RL drivers.
  10. Discussion and Design Guidance: practical recommendations on how many anchors, what kind, and where to place them; limitations and open problems.

Content

1. Introduction: episodic non-stationary observation shifts, identifiability gap in offline RL deployment; why FORL’s success still leaves a fundamental question (when is correction possible at all?).

We consider the deployment phase of an offline-learned decision rule in an environment whose latent Markovian dynamics are unchanged, yet whose observation stream is subject to episodic, non-stationary shifts. Concretely, a policy is trained (or otherwise specified) in a reference coordinate system in which the agent would observe the true state (or a known function thereof). At deployment, however, each episode may begin with an unobserved change of reference frame—for instance, a drift in localization, a reset of a global coordinate origin, a camera calibration shift, or a map alignment error—and this shift remains constant for the duration of the episode. The agent thus faces a partially observed process even when the underlying environment is fully observed in principle: the time series of observations is consistent with multiple latent trajectories differing by an unknown episodic transformation. Our purpose is to isolate when such shifts can be corrected online, and when they cannot be corrected by any algorithm, regardless of computational power.

This question arises in a particular form in offline reinforcement learning and policy deployment. An offline policy is typically trained on a fixed dataset collected under some observation convention (e.g., a particular global frame for positions, or a particular map alignment). At deployment we often disallow further learning, either because new data cannot be stored, or because the policy is certified and must remain fixed. In this setting, the only admissible modifications are wrappers: online procedures that transform the incoming observations into a form suitable for the fixed policy, or that maintain a belief over latent states and feed state estimates to the policy. Such wrappers are attractive because they separate representation or estimation from control: one may hope to correct sensor or coordinate shifts at test time without re-training the policy. The empirical success of several “foundation-model”-style approaches for online refinement and filtering (which we use as an informal point of reference) suggests that strong priors over trajectories can indeed repair certain observation corruptions. Yet this practical success does not settle the underlying identifiability question: even with perfect priors and unlimited computation, when does the observation stream contain enough information to determine the latent state (or determine it well enough to act optimally)?

To articulate the difficulty, consider the simplest episodic shift model in which observations are additive translations of the latent state:
ot = st + bj,
where j indexes the episode and the offset bj ∈ ℝd is fixed within an episode but may vary arbitrarily across episodes. Even in this elementary model, the agent’s interaction history (o0 : t, a0 : t − 1) is consistent with multiple candidate sequences (s0 : t, bj), because the same observation path may be generated by shifting all latent states and compensating by an opposing shift in bj. A natural hope is that the dynamics and rewards break this ambiguity: if the environment has a unique “origin” or a uniquely labeled landmark, then the trajectory might reveal which shift is correct. Conversely, if the environment is symmetric—for instance a torus navigation task, a translation-invariant gridworld, or a maze composed of repeated motifs—then the ambiguity might persist indefinitely. The core question is thus not how to design a sophisticated estimator, but whether the model class admits identifiability in principle.

The tension becomes sharper once we impose the “no-offset-reveal” regime: we do not allow bj to be observed at any time, not even after an episode terminates, and we do not allow cross-episode supervision that would label offsets retroactively. In such a regime, it is tempting to treat the problem as mere state estimation in a POMDP and to invoke standard filtering intuition. However, classical POMDP filtering relies on the observation map being known up to stochastic noise, whereas here the observation map itself depends on an unknown episode-level nuisance parameter. This nuisance parameter is not simply a latent state component with Markovian evolution; it is constant within an episode and resets adversarially (or at least arbitrarily) across episodes. Consequently, one cannot generally “average it out” over time, nor treat it as i.i.d. noise. One may instead attempt to exploit temporal differences
Δot = ot − ot − 1 = (st + bj) − (st − 1 + bj) = st − st − 1,
which indeed cancel the offset. But cancellation of the offset in differences does not by itself recover absolute position: it yields only relative motion, and relative motion may be insufficient to determine the absolute state when the environment admits nontrivial symmetries. In other words, learning the path up to an unknown global transformation may still leave an irreducible ambiguity in where the agent is “on the map,” and this ambiguity can induce a persistent control error when rewards depend (directly or indirectly) on absolute location.

This leads to an identifiability gap that is easy to overlook if one focuses exclusively on empirical performance. In many real systems, benign asymmetries—unique landmarks, boundary effects, or slight irregularities in dynamics—do break the symmetry and permit eventual localization. Yet for minimax reasoning, and for designing robust wrappers with provable guarantees, we must confront the possibility that the environment family contains exact or approximate symmetries. In that case, no amount of computation can extract information that is not present in the data distribution. A filtering procedure that appears to “work” on typical instances may still fail catastrophically on a worst-case instance drawn from a symmetric class, and such failures may not be detectable online because they correspond to alternative latent explanations that are observationally equivalent.

The central lesson is that correction of episodic observation shifts requires more than temporal coherence: it requires symmetry-breaking information. One way to obtain such information is through anchoring mechanisms that occasionally provide absolute cues. These cues can be weak: a single coordinate of the absolute state revealed intermittently, a noisy probe of the offset itself, or a trusted landmark observation unaffected by the offset. The important point is not the particular sensor model but the information structure: anchors must distinguish among otherwise equivalent hypotheses induced by the symmetry of the dynamics and rewards. When such symmetry-breaking events occur stochastically at rate p > 0, one expects a qualitative transition: before the first informative anchor, multiple symmetry hypotheses remain viable; after a distinguishing anchor occurs, the ambiguity can be resolved and the wrapper can thereafter behave as if it observed the true state (up to residual noise).

In the remainder of the paper we formalize this intuition in a way that cleanly separates what is impossible from what is achievable. We begin by defining episodic offset POMDPs induced by an underlying MDP together with an unknown additive offset, and we specify the no-offset-reveal regime in which the offset is never directly observed. We then introduce the role of symmetries of the underlying MDP and the induced equivalence classes (orbits) of latent states under these symmetries, which characterize precisely the source of non-identifiability. This formulation will allow us to state lower bounds that do not depend on any particular algorithmic choice, as well as upper bounds showing that minimal anchoring suffices to regain identifiability with optimal dependence on the anchor rate.


2. Problem Setup and Symmetry: formalize episodic offset POMDPs, no-offset-reveal regime; define symmetry groups of MDPs and induced equivalence classes of latent states under unknown offsets.

We fix an underlying Markov decision process (MDP) M = (S, A, T, R, ρ0) with state space S (either finite or a compact subset of ℝd), action space A, transition kernel T(⋅ ∣ s, a), reward function R(s, a), and initial state distribution ρ0. Deployment proceeds in episodes j = 1, 2, …, each of horizon H. The latent state trajectory within episode j is (s0, s1, …, sH), generated by
s0 ∼ ρ0,   st + 1 ∼ T(⋅ ∣ st, at),
where the actions (at) are chosen online.

The distinguishing feature is an episodic observation shift: at the beginning of each episode j an offset bj ∈ ℝd is drawn (or selected) and then remains fixed for the entire episode. The agent does not observe bj. Instead, at each time t in episode j it receives an observation
ot = st + bj,
where we interpret st as embedded in ℝd (e.g., for S ⊆ ℝd, or for discrete grids via the canonical embedding). One may regard this additive model as the core model; more general observation maps could be treated similarly, but the additive form isolates the identifiability obstruction we will later formalize.

The interaction history available to an online algorithm at time t is the sequence
ht := (o0, a0, r0, o1, a1, r1, …, ot),   rτ := R(sτ, aτ),
and the algorithm may be randomized. We distinguish two roles. An estimator outputs ŝt = ŝt(ht) intended to approximate the latent st. A controller outputs actions at = at(ht). Of particular interest is the wrapper regime: we are given a fixed (offline-learned) policy π : S → Δ(A) defined on the unshifted state space, and we are allowed to modify only the input to π by feeding it an online estimate ŝt (or a belief), but not to update π itself. Formally, the wrapper chooses at ∼ π(ŝt) with ŝt computed from ht.

The no-offset-reveal regime imposes that bj is never observed, neither during the episode nor after termination. Thus there is no supervised signal that would label or retroactively infer bj across episodes. In particular, although the additive model implies the offset cancels in temporal differences,
Δot := ot − ot − 1 = st − st − 1,
this cancellation yields only relative displacement information and does not, by itself, specify the absolute latent state. Our minimax questions therefore concern whether the pair (dynamics, rewards) provides sufficient asymmetry to recover absolute state online from (ot), and, if not, what performance losses are unavoidable.

To reason about impossibility uniformly over environment instances, we consider a family of MDPs sharing a common structural symmetry. The relevant notion is a symmetry group acting on the state (and possibly action) spaces. Let G be a group, and let ϕg : S → S be a group action on states. When actions carry semantics tied to coordinates (e.g., “move north”), symmetries may also induce a corresponding action transform ψg : A → A. We say that M is (G, ϕ, ψ)-equivariant if for all g ∈ G, all s, s′ ∈ S, and all a ∈ A,
T(ϕg(s′) ∣ ϕg(s), ψg(a)) = T(s′ ∣ s, a),
and the rewards are invariant in the same sense,
R(ϕg(s), ψg(a)) = R(s, a).
In many canonical navigation problems one has ψg = id (e.g., symmetries that permute labels but preserve action meanings), whereas in translation or rotation symmetries one typically has a nontrivial ψg induced by the coordinate change. We also record that ρ0 is often taken G-invariant (e.g., uniform over S when S is finite and G acts transitively), which is the natural regime for minimax lower bounds.

The action of G partitions S into orbits. For s ∈ S, define
[s]G := {ϕg(s) : g ∈ G}.
If G acts transitively, then all states lie in a single orbit; more generally, the orbit decomposition captures which states are indistinguishable up to symmetry. In the additive offset model, symmetries arise in two related ways. First, the additive offset itself generates a continuous translation ambiguity: ot is unchanged if we replace (st, bj) by (st + c, bj − c) whenever S is translation-closed (e.g., a torus). Second, and more importantly for our lower bounds, even when S is not translation-closed (e.g., a finite maze), if the MDP admits a nontrivial symmetry group G satisfying the equivariance conditions above, then there exist distinct latent explanations that produce identical observation-action-reward distributions by jointly transforming states/actions and adjusting the offset. The resulting obstruction is best stated as an orbit ambiguity: from the corrupted observations one may hope to recover st only up to its orbit, unless additional symmetry-breaking information is present.

Accordingly, for estimation we allow either absolute error metrics (when meaningful) or orbit-based metrics. Given a metric d on S, we define the induced orbit distance
dG(s, s′) := infg ∈ G d(s, ϕg(s′)),
and we evaluate an estimator ŝt by 𝔼 d(ŝt, st) or 𝔼 dG(ŝt, st) depending on whether absolute identification is possible in the model class. In the no-offset-reveal regime, our principal lower bounds will show that for certain symmetric families, even orbit recovery does not improve beyond a constant unless symmetry is broken by additional side information. For control, we compare the value achieved by a history-dependent controller to that of an oracle-offset agent that observes st (equivalently, knows bj and can invert the observation map). The gap between these performances will be traced to the inability to select the correct orbit element online.
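To make the orbit metric concrete, here is a minimal Python sketch on a toy instance of our own choosing (a cycle under translations; none of the names come from the paper): with a transitive group the orbit distance collapses to zero, while a proper subgroup leaves residual separation.

```python
# Toy illustration of the orbit distance d_G on the cycle Z_L under
# translations; the instance and names are illustrative assumptions.
L = 12

def d(s1, s2):
    """Shortest-path metric on the cycle Z_L."""
    diff = abs(s1 - s2) % L
    return min(diff, L - diff)

def d_G(s1, s2, subgroup=tuple(range(L))):
    """Orbit distance: min over g in the subgroup of d(s1, phi_g(s2))."""
    return min(d(s1, (s2 + g) % L) for g in subgroup)

print(d(1, 7))            # 6: antipodal points on a 12-cycle
print(d_G(1, 7))          # 0: a transitive group identifies all states
print(d_G(1, 7, (0, 3)))  # 3: a proper subgroup leaves residual separation
```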

The remainder of the paper makes these definitions operational. We will first exhibit symmetric families for which equivariance implies exact observational indistinguishability among distinct orbit elements under the additive offset model, yielding constant minimax estimation error and non-vanishing value loss. We will then show that introducing even minimal symmetry-breaking information (formalized later as anchors) restores identifiability with optimal dependence on the anchor rate.


3. Impossibility Results (No Anchors): construct symmetric MDP families where observation histories cannot distinguish distinct states/offsets; prove constant minimax estimation error and non-vanishing control loss.

We work in the no-anchor, no-offset-reveal regime, where the learner observes only (ot, rt) with ot = st + b for an unknown episode-constant b ∈ ℝd. The central obstruction is that, for symmetric families, there exist distinct latent explanations—different absolute states, and even different underlying MDP instances within the family—that induce the same distribution over interaction histories. Consequently, neither state estimation nor control can be uniformly consistent.

Let G act on S via ϕg. For each g ∈ G, we define the g-transform of an MDP M = (S, A, T, R, ρ0) by
M(g) := (S, A, T(g), R(g), ρ0(g)),   T(g)(s′ ∣ s, a) := T(ϕg−1(s′) ∣ ϕg−1(s), ψg−1(a)),   R(g)(s, a) := R(ϕg−1(s), ψg−1(a)),   ρ0(g) := ρ0 ∘ ϕg−1.
When M is (G, ϕ, ψ)-equivariant and ρ0 is G-invariant, these transforms satisfy M(g) ≡ M; however, for control lower bounds it is useful to allow the family ℳ to contain the full orbit {M(g) : g ∈ G} of a base instance, e.g. by translating a goal or maze motif.

We specialize to families where the state embedding is compatible with the group action in the sense that there exists a map c : G → ℝd such that
ϕg(s) = s + cg   for all s ∈ S and g ∈ G,
interpreting S as a discrete torus or a subset of d closed under the relevant translations. In this case the observation shift can exactly absorb the group action: for any g, setting b(g) := b − cg yields
ϕg(st) + b(g) = (st + cg) + (b − cg) = st + b = ot.
Thus a change of latent coordinate by ϕg can be hidden entirely inside a compensating change of offset.

Fix any (possibly randomized) controller that selects at as a measurable function of the history ht = (o0, a0, r0, …, ot). Consider two worlds:

  1. World 0: the environment is the base instance M, the latent trajectory is (st), and the offset is b.
  2. World g: the environment is the transformed instance M(g), the latent trajectory is (s̃t), and the offset is the compensated b(g) = b − cg.

Couple the latent processes by setting s̃t := ϕg(st) = st + cg and ãt := ψg(at). By construction, õt = s̃t + b(g) = ot for all t under the coupled offsets. Moreover, by the definition of M(g) and the coupling, the transition law of s̃t + 1 given (s̃t, ãt) matches that of st + 1 given (st, at), and similarly the rewards coincide:
r̃t = R(g)(s̃t, ãt) = R(st, at) = rt.
Hence, for any controller, the induced distribution over observed histories (o0, a0, r0, …, ot) is identical in World 0 and World g. In information-theoretic terms, the KL divergence between these two hypotheses is exactly 0 at every time t.
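The coupling can be checked mechanically. The following sketch (a toy 1-D torus with translation dynamics; the instance is an illustrative assumption, not the paper's construction) runs World 0 and World g side by side and confirms that the two observation streams coincide sample path by sample path:

```python
import numpy as np

# World 0 runs (s_t, b); World g runs (phi_g(s_t), b - c_g). On a torus with
# translation-equivariant dynamics the observation streams are identical.
rng = np.random.default_rng(0)
L, H, c_g = 16, 50, 5
s0, b = int(rng.integers(L)), int(rng.integers(L))
actions = rng.choice([-1, 1], size=H)     # stand-in for any controller

s, s_tilde = s0, (s0 + c_g) % L
for a in actions:
    o = (s + b) % L                       # World 0 observation
    o_tilde = (s_tilde + (b - c_g)) % L   # World g observation
    assert o == o_tilde                   # indistinguishable at every step
    s, s_tilde = (s + a) % L, (s_tilde + a) % L

print("identical histories; KL between the two worlds is 0")
```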

We apply a two-point (Le Cam) argument. Let d be a metric on S and choose any g ≠ e with nontrivial displacement. Under a G-invariant ρ0, the separation
δg := 𝔼s ∼ ρ0 d(s, ϕg(s))
is strictly positive for suitable families (e.g. on an n-torus with g corresponding to translation by n/2 in one coordinate, δg is a constant fraction of the diameter). Let ŝt = ŝt(ht) be any estimator. Consider the two coupled worlds above, which generate identical histories yet have latent states related by s̃t = ϕg(st). Since the estimator sees the same ht in both worlds, it outputs the same ŝt; therefore, by the triangle inequality,
d(st, s̃t) ≤ d(st, ŝt) + d(ŝt, s̃t).
Taking expectation under the coupling and using symmetry of the two hypotheses (they are observationally indistinguishable), we obtain
$$ \max\Bigl\{\mathbb{E}\,d(\hat{s}_t,s_t),\ \mathbb{E}\,d(\hat{s}_t,\tilde{s}_t)\Bigr\} \;\ge\; \frac{1}{2}\,\mathbb{E}\,d(s_t,\tilde{s}_t) \;=\; \frac{1}{2}\,\delta_g. $$
Since the two hypotheses correspond to valid choices of (M, b) within the minimax envelope, this yields a uniform lower bound
$$ \inf_{\hat{s}_t}\ \sup_{M\in\mathcal{M}}\ \sup_{b}\ \mathbb{E}\,d(\hat{s}_t,s_t)\;\ge\;\varepsilon, \qquad \varepsilon:=\frac{1}{2}\min_{g\neq e}\delta_g>0, $$
for appropriately chosen symmetric families. Notably, access to temporal differences Δot = st − st − 1 does not help: in the coupled worlds these differences agree identically, and thus cannot break the ambiguity in absolute phase.

For control, we use the same indistinguishability to force a constant probability of “acting under the wrong symmetry.” Fix a base instance M(e) (e.g. a deterministic torus navigation task) and let ℳ = {M(g) : g ∈ G} be its symmetry orbit, where the reward landscape (or goal set) is shifted by g. Let H be small enough that, from a random initial state, only a strict subset of possible goal locations can be reached. An oracle-offset agent that observes st (equivalently, knows b) and knows the true instance M(g) attains some value Voracle by moving toward the correct goal. In contrast, any history-dependent controller without anchors induces the same distribution over observed histories under multiple g’s paired with compensating offsets, and hence cannot reliably condition its actions on the true g. By Yao’s minimax principle, choosing g uniformly at random implies that any deterministic controller achieves at most the average of its performance over indistinguishable instances, which is bounded away from Voracle by a constant depending on (H, G); therefore there exists c > 0 such that
infπ supM ∈ ℳ supb (Voracle − Vπ) ≥ c.
This formalizes the intuitive statement that, absent symmetry-breaking information, the learner cannot select the correct orbit element online, and value is lost whenever the optimal action depends on that selection.

The conclusion is that the no-anchor model admits an information-theoretic obstruction: even with unbounded computation and arbitrary memory, absolute localization and oracle-level control are impossible uniformly over symmetric families. In the next section we introduce minimal anchoring mechanisms that break this indistinguishability.


4. Anchor Models: define three minimal anchoring mechanisms (absolute coordinate, offset probe, trusted landmark channel) and formal conditions under which anchors break the symmetry group.

We now augment the no-anchor interaction stream with a minimal side channel whose sole role is to inject occasional absolute information that is not invariant under the symmetry transformations underlying the lower bound. We keep the core observation model ot = st + bj with unknown episode-constant bj ∈ ℝd, but assume that, at random times, the learner additionally receives an anchor variable zt generated by a known mechanism.

Fix an episode and suppress the episode index j. At each time t there is an indicator αt ∈ {0, 1} such that
αt ∼ Bernoulli(p),   independently over t and independent of the trajectory given (st, at).
If αt = 0 no anchor is emitted; if αt = 1 the learner observes an additional symbol zt whose conditional law depends on the current latent state (and, in one of the models below, on the offset). Formally, for a specified anchor channel Q we have
zt ∼ Q(⋅ ∣ st)  or  zt ∼ Q(⋅ ∣ st, b),
with conditional independence across time given the latent trajectory (and b when applicable). The learner observes (ot, αt, zt), with zt present only when αt = 1.

Let G act on S via ϕg. The role of anchors is to render distinct orbit elements distinguishable. A convenient sufficient condition is that the family of anchor likelihoods separates every nontrivial symmetry:
for every g ∈ G with g ≠ e, there exists s ∈ S such that Q(⋅ ∣ s) ≠ Q(⋅ ∣ ϕg(s)),
with the obvious modification when Q depends on b. In words, for each g ≠ e there is at least one state at which the anchor distribution changes under ϕg. Equivalently, if we define the anchor-invariant subgroup
GQ := {g ∈ G : ∀s ∈ S, Q(⋅ ∣ s) = Q(⋅ ∣ ϕg(s))},
then the separation condition is precisely the requirement GQ = {e}. When GQ is nontrivial, anchors only reduce the ambiguity from G to cosets of GQ; in that case our upper bounds apply with |G| replaced by |G/GQ|, and exact identification is information-theoretically impossible.

We next instantiate three minimal anchor models. Each can be viewed as a special case of the generic channel Q above and is natural in navigation-style settings.

Model 1 (absolute-coordinate anchor). Assume that states admit an absolute coordinate representation s = (s(1), …, s(d)) (e.g. a grid/torus embedding). An absolute-coordinate anchor reveals one coordinate, or a low-dimensional projection, of the state at the current time:
zt = st(kt),   kt ∈ [d],
where kt may be fixed, chosen uniformly from [d], or chosen by a known distribution independent of (st, b). This anchor is not affected by the additive corruption b; it corresponds to a sporadic trusted readout of an absolute sensor (e.g. a GPS latitude coordinate, an absolute hallway index, or a ceiling camera with a known map).

For translation actions ϕg(s) = s + cg on a torus, the separation condition reduces to the requirement that for every g ≠ e there exists some coordinate k with (cg)(k) ≢ 0 (modulo the torus period) and that the anchor occasionally queries such a coordinate. In particular, if the action is full d-dimensional translation and kt cycles through all coordinates (or is uniform on [d]), then the condition holds. When zt is noiseless as in this model, a single anchor observation typically eliminates a large fraction of symmetry hypotheses, and, under determinism, subsequent consistency checks prevent eliminated hypotheses from reappearing.

Model 2 (offset probe). Here the anchor directly probes the unknown offset b rather than the state. Concretely, whenever αt = 1 we observe
zt = b + ηt,
where ηt is mean-zero noise with known law (often ηt ≡ 0 in the noiseless variant, or sub-Gaussian in the noisy variant). This model captures sporadic calibration events: the agent briefly sees a fixed fiducial in its sensor frame, or receives a trusted telemetry packet describing the current sensor bias.

In the translation-compatible setting ϕg(s) = s + cg of the previous section, knowledge of b immediately restores absolute state via st = ot − b. Consequently, the relevant symmetry-breaking property is that different orbit explanations correspond to different offsets. In the coupled indistinguishability construction, hypothesis g uses the compensated offset b(g) = b − cg. Thus any probe of b distinguishes g ≠ e whenever cg ≠ 0. Formally, the anchor likelihood under hypothesis g is the distribution of b(g) + ηt, which differs from that of b + ηt unless cg = 0 (or unless the noise is so degenerate as to make these distributions identical). Therefore, under mild nondegeneracy, the separation condition holds for all nontrivial translations. This model is especially convenient because it decouples identifiability from where the agent happens to be: any anchor occurrence is potentially informative.

Model 3 (trusted landmark channel). The third model assumes a collection of labeled landmarks L1, …, Lm ⊆ S and a recognition channel that is unaffected by the additive offset. When αt = 1, the environment returns a landmark label according to
zt = i  whenever st ∈ Li   (i.e., Q(i ∣ s) = 1{s ∈ Li}),
or, in a noisy variant, Q(i ∣ s) concentrates on the index of the region containing s but may occasionally misclassify. This captures, for instance, the agent recognizing a uniquely textured object, a QR code, or a room number, even though its proprioceptive observation stream is offset.

The symmetry-breaking condition becomes a structural requirement on the landmark layout: for each g ≠ e there must exist s such that the landmark distributions differ between s and ϕg(s). In the noiseless case this is simply that the landmark labeling is not g-invariant, i.e.,
∃ i with ϕg(Li) ≠ Li.
Intuitively, the landmarks must not repeat with the same pattern under the environment symmetries one wishes to remove. When satisfied, any anchor observed at a suitably chosen (or randomly encountered) landmark rules out all symmetry hypotheses that map that landmark to a different label.
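For finite instances the anchor-invariant subgroup GQ is directly computable by enumeration. The sketch below uses a hypothetical landmark layout on a cycle (our own toy example): a unique landmark breaks every translation, while a layout that repeats under a shift leaves that shift in GQ.

```python
# Enumerate G_Q = {g : Q(.|s) = Q(.|phi_g(s)) for all s} for a noiseless
# landmark channel on Z_L under translations; layouts are made-up examples.
L = 12

def G_Q(label):
    return [g for g in range(L)
            if all(label[s] == label[(s + g) % L] for s in range(L))]

unique = [1 if s == 2 else 0 for s in range(L)]           # one landmark
repeated = [1 if s in (2, 8) else 0 for s in range(L)]    # repeats under +6

print(G_Q(unique))    # [0]: only the identity, anchors break all symmetry
print(G_Q(repeated))  # [0, 6]: ambiguity persists up to cosets of G_Q
```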

All three mechanisms fit the same abstract template: an anchor channel Q observed at rate p, independent of the offset corruption affecting ot, and satisfying GQ = {e} (or, more generally, yielding a small quotient G/GQ). In the next section we exploit this structure to maintain a finite set of symmetry hypotheses and to show that, in deterministic dynamics, the expected time until the correct hypothesis is isolated is on the order of 1/p, leading to matching control guarantees for a wrapper around a fixed base policy.


5. Anchor Filter Wrapper (Upper Bounds): present a simple belief/elimination algorithm over symmetry hypotheses; prove Θ(1/p) expected localization time and corresponding regret bounds; discuss continuous vs finite state.

We now describe an online wrapper, referred to below as SHEA, which, given a fixed base policy π : S → A, produces actions using the offset-corrupted stream (ot) and the anchors (αt, zt). Throughout this section we work in the finite-symmetry, deterministic setting: G is finite, T is deterministic, and the dynamics are G-equivariant with (for simplicity) trivial induced action map ψg = id, so that
T(ϕg(s), a) = ϕg(T(s, a))   ∀g ∈ G, ∀s ∈ S, ∀a ∈ A.
We further assume the offset ambiguity can be indexed by G in the sense that there exists an (unknown) g⋆ ∈ G such that the learner’s observation can be interpreted as a g⋆-transformed latent state. Concretely, in the translation-on-torus example one takes ϕg(s) = s + cg and ot = st + cg⋆, so that different g correspond to different candidate offsets.

For each g ∈ G we define a decoding map decg from observations to candidate latent states. In the translation case one may take decg(o) = o − cg. We write
stg := decg(ot),
and interpret g as the hypothesis that stg equals the true latent state at time t. Under the above alignment assumption, the true hypothesis g⋆ satisfies stg⋆ = st for all t.

We maintain a feasible set (or, in the noisy case, a weight vector) over hypotheses,
𝒢t ⊆ G,   𝒢0 := G.
At each time t the wrapper performs three operations.

(i) Anchor elimination. If αt = 1, we remove every hypothesis whose decoded state is inconsistent with the realized anchor. In the noiseless elimination variant this takes the form
𝒢t ← {g ∈ 𝒢t : Q(zt ∣ stg) > 0}.
In the noisy variant we instead update weights multiplicatively, wg ← wg ⋅ Q(zt ∣ stg), followed by renormalization.

(ii) Action selection. Choose any representative ĝt ∈ 𝒢t (e.g. the smallest index, or arg maxg wg in the weighted variant) and execute
at := π(stĝt).

(iii) Transition-consistency elimination. After observing ot + 1 we decode st + 1g := decg(ot + 1) and enforce deterministic transition consistency:
𝒢t + 1 ← {g ∈ 𝒢t : st + 1g = T(stg, at)}.
This step is optional in the noiseless-anchor setting (anchors alone suffice), but it is useful for ruling out hypotheses that may survive weak or noisy anchors, and for ensuring that once a hypothesis is removed it cannot re-enter.
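Putting the three operations together, the following minimal sketch runs the wrapper on a toy torus instance with a noiseless offset-probe anchor (so m0 = 1). The environment, base policy, and constants are illustrative assumptions, not specifics from the paper:

```python
import numpy as np

# Elimination wrapper on the cycle Z_L: deterministic translation dynamics,
# hypothesis set G = Z_L (candidate offsets c_g = g), offset probe w.p. p.
rng = np.random.default_rng(1)
L, H, p, goal = 24, 400, 0.05, 0

def T(s, a):                      # deterministic, translation-equivariant
    return (s + a) % L

def pi(s):                        # fixed base policy: step toward the goal
    return -1 if (goal - s) % L > L // 2 else 1

s, b = int(rng.integers(L)), int(rng.integers(L))
feasible = set(range(L))          # hypothesis g decodes s_hat = (o - g) % L
localized_at = None

for t in range(H):
    o = (s + b) % L
    if rng.random() < p:          # (i) anchor elimination: probe z = b
        feasible = {g for g in feasible if g == b}
    g_hat = min(feasible)         # (ii) act under any surviving hypothesis
    a = pi((o - g_hat) % L)
    s = T(s, a)
    o_next = (s + b) % L          # (iii) transition consistency; vacuous on
    feasible = {g for g in feasible             # this fully equivariant
                if (o_next - g) % L == T((o - g) % L, a)}  # instance
    if localized_at is None and feasible == {b}:
        localized_at = t

print("localized at step", localized_at, "~ expected 1/p =", int(1 / p))
```

Before the first probe, step (iii) never shrinks the feasible set on this fully equivariant instance, consistent with the Section 3 lower bound; localization time is exactly the Geometric(p) waiting time to the first anchor.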

The next lemma records the key monotonicity property.

Lemma (monotonicity). The feasible sets are nested, 𝒢t + 1 ⊆ 𝒢t for all t, and in the noiseless-anchor setting the true hypothesis is never eliminated: g⋆ ∈ 𝒢t for all t.

To obtain a Θ(1/p) localization guarantee, we impose an identifiability condition ensuring that a constant number of anchor emissions pins down the symmetry. A convenient formulation is as follows: there exists an integer m0 = O(1) such that for any realization of m0 anchor events, the combined anchor constraints eliminate all g ≠ g⋆ almost surely (e.g. m0 = 1 for the offset-probe anchor; m0 = d for a coordinate anchor revealing one coordinate per emission with known kt).

We quantify performance against an oracle wrapper that observes st (equivalently, knows g⋆) and hence executes at⋆ = π(st) at all times. Let rewards be bounded, say R ∈ [0, 1], and define the horizon-H regret
$$ \mathrm{Reg}(H)\;:=\;\sum_{t=0}^{H-1}\Bigl(R(s_t,a_t^\star)-R(s_t,a_t)\Bigr). $$

Theorem 3 (localization and regret, deterministic finite-G setting). Under the identifiability condition above, the expected number of steps until 𝒢t = {g⋆} is at most m0/p, and from that point on the wrapper’s actions coincide with the oracle’s; consequently 𝔼[Reg(H)] ≤ m0/p for every horizon H.

When G is infinite (e.g. continuous translations with b ∈ ℝd), exact elimination is unavailable. A standard substitute is a particle filter or discretized grid over candidate offsets b, updated by the same anchor likelihoods. In the offset-probe model with zt = b + ηt, one may estimate b by averaging the observed probes; after n steps, the number of probes is Binomial(n, p) and the estimation error typically scales as $O(\sigma/\sqrt{np})$ for sub-Gaussian ηt (up to logs). In turn, if π (or the value function under π) is Lipschitz in the state, such residual offset error translates directly into a controlled value gap, yielding the same qualitative dependence on p as in the finite case, with additional approximation terms depending on discretization resolution or particle count.
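As a minimal sketch of the averaging route in the continuous case (toy parameters of our own choosing; only the σ/√(np) scaling is the point):

```python
import numpy as np

# Estimate a continuous offset b by averaging Bernoulli-timed noisy probes
# z = b + eta; the error shrinks roughly like sigma / sqrt(n p).
rng = np.random.default_rng(2)
d, p, sigma = 2, 0.05, 0.5
b = np.array([3.0, -1.0])

for n in (200, 2000, 20000):
    probes = np.array([b + sigma * rng.standard_normal(d)
                       for _ in range(n) if rng.random() < p])
    err = np.linalg.norm(probes.mean(axis=0) - b)
    print(f"n={n:6d}  probes={len(probes):5d}  error={err:.4f}")
```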


6. Matching Lower Bounds: prove Ω(1/p) localization lower bound for any algorithm under each anchor model via information arguments / optional stopping / indistinguishability until first informative anchor.

We now show that the O(m0/p) localization bound from Theorem 3 is information-theoretically tight, in the sense that no (possibly randomized) online algorithm can, in general, identify the latent orbit element substantially faster than the time needed to witness the requisite number of symmetry-breaking anchor emissions. The argument is an indistinguishability claim up to the first informative anchor, formalized by a two-hypothesis reduction and a stopping-time lower bound.

Fix a finite group G and pick two distinct elements g0, g1 ∈ G. We construct an MDP with symmetry group G in which the sole information about the active hypothesis g ∈ {g0, g1} enters through the anchor channel. Concretely, let S := G and let the dynamics be absorbing,
T(s, a) = s   for all s ∈ S, a ∈ A,
with arbitrary action space A (actions are irrelevant for identifiability). Let ρ0 be uniform on S. We couple the offset ambiguity to G by defining the learner’s ordinary observation as
ot = ϕg(s0)   for all t ≥ 0,
where ϕ is the (left) regular action of G on itself, ϕg(s) = gs. Since s0 ∼ Unif(G), the random variable ϕg(s0) is also uniform on G and hence its law is identical under g = g0 and g = g1. In particular, in this construction (o0, o1, …) carries no information about g.

Write αt ∼ Bernoulli(p) for the anchor indicator and let zt be the anchor emission (when αt = 1). We choose the anchor model so that, after m0 anchor emissions, the hypothesis g is uniquely determined, whereas with fewer than m0 emissions it remains ambiguous between g0 and g1. This matches the upper-bound identifiability parameter m0.

In each of the following, the randomness in (αt) is i.i.d. and independent of the MDP trajectory (which is static in our hard instance).

Offset probe. Set m0 = 1 and let zt reveal g (equivalently, reveal b in the additive-offset parametrization). Then one anchor emission suffices, and without any anchor emission no information is available.

Trusted landmark. Again m0 = 1. Let zt reveal an absolute, symmetry-breaking label of the current latent state s0 in the canonical coordinate system (for instance, a unique landmark identifier tied to a fixed state in S). Since the learner’s non-anchor observations are equivariant transforms, a single trusted landmark observation pins down the orbit element.

Coordinate anchor. Consider S embedded as a discrete d-dimensional torus so that the group contains coordinate-wise translations. Let an anchor emission reveal one absolute coordinate of s0, with the index chosen uniformly from [d]. Choose g0, g1 that differ in exactly one coordinate; an anchor emission is then informative only when it queries that coordinate, which happens with probability 1/d, so the informative-anchor rate is p/d. In this case the lower bound scales as Ω(d/p), matching the natural m0 = d upper bound.

Let Ht denote the full interaction history up to time t available to an arbitrary algorithm (including its own internal randomness), and let
$$ N_t \;:=\; \sum_{u=0}^{t}\alpha_u $$
be the number of anchor emissions up to time t. By construction, conditional on the event {Nt < m0}, the distribution of Ht is the same under g = g0 and g = g1: the non-anchor observations are identically distributed, and fewer than m0 anchors are, by design, insufficient to separate the two hypotheses. In particular, for any estimator ĝt = ĝt(Ht),
$$ \max_{g\in\{g_0,g_1\}}\ \Pr_g\bigl(\hat{g}_t \neq g\bigr) \;\ge\; \tfrac{1}{2}\,\Pr\bigl(N_t < m_0\bigr), $$
since under {Nt < m0} the two hypotheses induce the same law on Ht and thus no test can beat random guessing (Le Cam’s two-point method with zero total variation).
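To pass from the testing bound to an expected-time bound, one can argue via optional stopping; the following is a hedged sketch. The indicators (αu) are i.i.d. Bernoulli(p), so Wald's identity gives 𝔼[Nτ] = p ⋅ 𝔼[τ + 1] for any stopping time τ with finite mean; reliable identification forces Nτ ≥ m0, hence

$$ p\,\bigl(\mathbb{E}[\tau]+1\bigr)\;=\;\mathbb{E}[N_\tau]\;\ge\;m_0, \qquad\text{i.e.}\qquad \mathbb{E}[\tau]\;\ge\;\frac{m_0}{p}-1. $$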

The above instance already forces an Ω(m0/p) lower bound on expected localization time (Theorem 4). To obtain an Ω(m0/p) regret lower bound (relative to an oracle-offset agent) one may augment the instance with a reward structure in which the oracle can secure unit reward immediately upon knowing g, while any algorithm incurs a constant expected per-step loss whenever it has not yet distinguished g0 from g1; then the regret lower bound follows by the same tail-sum argument applied to the localization time. We defer the noisy and approximate-identification variants, where the loss depends continuously on residual uncertainty, to the next section.


7. Extensions: noisy anchors, affine (diagonal) observation transforms, partial observability beyond offsets, and dependence on policy excitation.

We record several extensions of the basic model and analysis that recur in deployments. Throughout, the underlying theme is that the unknown episodic corruption induces an equivalence class (an orbit) of latent explanations consistent with the non-anchor transcript, and that anchors (together with sufficient dynamical excitation) provide the only mechanism to shrink this class.

In many applications, anchors are not noiseless labels but rather noisy side measurements. A canonical model is
zt = h(st) + ηt,   ηt sub-Gaussian with parameter σ2,
emitted only when αt = 1. In this regime, hard elimination is no longer appropriate: an incorrect hypothesis may match a single noisy anchor by chance. We instead maintain weights (wg)g ∈ G and update by likelihood ratios,
wg ∝ wg ⋅ Pr(zt ∣ stg)   on anchor times,
and, in stochastic dynamics, optionally include a transition likelihood factor as well. Under the symmetry-breaking condition that for each g ≠ e there exists an s with Pr (z ∣ s) ≠ Pr (z ∣ ϕg(s)), standard concentration for log-likelihood ratios implies that after n anchor emissions we obtain
Pr(arg maxg wg ≠ g⋆) ≤ |G| exp(−cn),
for a constant c depending on the anchor signal-to-noise ratio through, e.g., a χ2- or KL-separation between h(s) and h(ϕg(s)). Since informative anchors arrive at rate p, we recover a localization time scaling of order (1/p)log |G| to drive the posterior error probability below a prescribed δ. For control, if the base policy π and value function are Lipschitz in the state metric (or more generally if the advantage of incorrect orbit selection is bounded), we obtain a value-loss bound proportional to a residual uncertainty radius rather than a binary “identified/not identified” delay. This yields the familiar tradeoff: smaller δ or larger |G| increases the anchor budget logarithmically, while the 1/p dependence remains unavoidable by the stopping-time argument of Theorem 4.
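A sketch of the weighted variant under Gaussian anchor noise, on a toy finite-G torus instance (constants and the behavior policy are illustrative assumptions):

```python
import numpy as np

# Multiplicative log-likelihood updates over G = Z_L with a noisy absolute
# anchor z = s_t + eta; the posterior concentrates on the true offset at
# rate exp(-c n) in the number n of anchor emissions.
rng = np.random.default_rng(3)
L, p, sigma, H = 12, 0.1, 1.0, 2000
b = int(rng.integers(L))               # true hypothesis (offset)
log_w = np.zeros(L)                    # log-weights, uniform prior

s = int(rng.integers(L))
for t in range(H):
    o = (s + b) % L
    if rng.random() < p:               # anchor time
        z = s + sigma * rng.standard_normal()
        for g in range(L):
            s_g = (o - g) % L          # decoded state under hypothesis g
            log_w[g] -= 0.5 * ((z - s_g) / sigma) ** 2
    s = (s + int(rng.choice([-1, 1]))) % L   # any exciting behavior

print("argmax hypothesis:", int(np.argmax(log_w)), "truth:", b)
```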

The additive-offset model ot = st + bj is the simplest episodic nuisance. A common generalization is an affine, episode-dependent observation transform
ot = Djst + bj,
where Dj is unknown but structured (e.g., diagonal with positive entries, or a signed diagonal representing per-coordinate gain and possible sign flips). Such transforms arise from uncalibrated sensors or coordinate-wise normalization shifts across episodes. When Dj belongs to a finite set 𝒟 (e.g., sign flips), we may absorb it into an enlarged symmetry set G̃ and apply the same hypothesis-tracking filter with |G̃| = |G| ⋅ |𝒟|. When Dj is continuous (positive diagonal), exact identification is impossible without further assumptions; nevertheless, two practically useful regimes admit clean statements.

First, if anchors directly observe a subset of absolute state coordinates (or linear functionals thereof), then Dj becomes identifiable along those directions, and the remaining ambiguity reduces to a smaller group acting on the unanchored subspace. Second, if the dynamics are known and sufficiently rich (e.g., controlled movements along coordinate axes with known magnitudes), then from delta-observations Δot = ot − ot − 1 = Dj(st − st − 1) we can estimate Dj up to the excitation present in (st − st − 1), while anchors resolve the residual absolute alignment. In both cases, the analysis decomposes into (i) identifiability of the continuous nuisance from either anchors or excitation and (ii) orbit selection over a discrete symmetry component. The anchor-rate dependence continues to enter through the waiting time to collect the necessary number of informative anchor emissions, while the continuous component contributes an additional estimation error term (typically of order $1/\sqrt{n}$ in the number of informative events).
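A sketch of the two-step recipe in the positive-diagonal case (the unit-step excitation and noiseless probe are simplifying assumptions of ours): estimate D from delta-observations, then resolve absolute alignment with a single offset probe.

```python
import numpy as np

# Affine episodic transform o = D s + b with unknown positive diagonal D:
# recover D from delta-observations under known unit moves, then invert the
# map once a single (noiseless) offset probe reveals b.
rng = np.random.default_rng(4)
d = 3
D = np.diag(rng.uniform(0.5, 2.0, size=d))   # unknown per-coordinate gains
b = rng.uniform(-5.0, 5.0, size=d)           # unknown offset

s = np.zeros(d)
D_hat = np.zeros(d)
for k in range(d):                 # excite axis k with a known unit step
    o_before = D @ s + b
    s[k] += 1.0
    o_after = D @ s + b
    D_hat[k] = (o_after - o_before)[k]       # equals D[k, k]

z = b.copy()                       # one offset-probe anchor
s_hat = (D @ s + b - z) / D_hat    # coordinate-wise inversion of o = D s + b
print(np.allclose(D_hat, np.diag(D)), np.allclose(s_hat, s))  # True True
```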

Offsets are only one source of partial observability. Suppose instead that the agent observes
ot = h(st) + bj + ξt,
with a possibly non-invertible map h and observation noise ξt. If h is many-to-one, then even an oracle knowing bj cannot generally recover st; the relevant goal becomes recovery up to h-indistinguishability, or identifying the orbit of st under the indistinguishability relation induced by h. Our framework remains applicable provided the environment family admits a group action G such that both h and the dynamics are equivariant:
h(ϕg(s)) = ϕgobs(h(s)),   T(ϕg(s′) ∣ ϕg(s), ψg(a)) = T(s′ ∣ s, a),
for some induced action ϕgobs on the observation space. Then, absent anchors, one again obtains non-identifiability by transporting trajectories through g while adjusting the nuisance parameters to match the observation law. Anchors need only break the symmetry of the induced POMDP; equivalently, they must separate the orbits that remain after accounting for h. In practice this means that anchors can be weaker than absolute state reveals: it suffices that the anchor channel is not invariant under the remaining symmetries.

Our upper bounds implicitly assume that the data stream contains opportunities to rule out incorrect hypotheses. Deterministic transition consistency is maximally informative when the policy induces diverse state-action pairs; conversely, a fixed base policy may be poorly exciting. Two distinct obstructions can arise.

First, if the base policy drives the system into a subset S′ ⊆ S on which the group action is not distinguishable (e.g., the action of G restricted to S′ is transitive and the anchor channel is invariant on S′), then no amount of time suffices: even with anchors, the hypotheses remain observationally equivalent. This is not a failure of the filter but an identifiability failure of the deployment configuration.

Second, even when identifiability holds globally, the localization time depends on how quickly the policy visits states that yield symmetry-breaking anchor likelihoods. One may formalize this by replacing the raw anchor rate p with an effective informative-anchor rate pinfo := p ⋅ ν, where ν is the long-run fraction of time the trajectory spends in states where anchors separate the competing hypotheses by a fixed amount. The expected localization time then scales as Θ(1/pinfo) rather than Θ(1/p). This refinement is unavoidable: the lower bound of Theorem 4 can be strengthened by conditioning not merely on Nt < m0 but on the absence of informative anchor emissions, which may be rarer than anchor emissions themselves under a poorly exciting policy. Consequently, when wrapping a fixed offline policy, it is natural to treat m0 (and, more generally, the anchor informativeness constants) as parameters, and to certify them either analytically from the model or empirically from pre-deployment rollouts.


8. (Optional) Experiments: anchor-augmented FORL on long-horizon rollouts; verify scaling with p, demonstrate stabilization and tail-risk reduction vs no-anchor baselines.

We briefly describe an empirical protocol whose purpose is not to optimize constants, but to verify the qualitative predictions of Theorems 1–5 in long-horizon deployments: (i) without anchors, orbit ambiguity persists and induces a non-vanishing performance gap; (ii) with symmetry-breaking anchors arriving at rate p, localization and regret scale on the order of 1/p (up to logarithmic factors under noisy anchors); and (iii) anchors reduce tail risk by preventing rare but catastrophic orbit-mismatch episodes.

We consider two representative symmetric families. The first is translation-equivariant navigation on a discrete torus S = (ℤ/Lℤ)2 with deterministic dynamics and G the translation group; the observation is ot = st + bj modulo L. Rewards are shaped to be G-invariant while still requiring correct absolute alignment for optimality, e.g. by defining a set of absolute goals 𝒢 ⊂ S and paying 1 when st ∈ 𝒢, with 𝒢 chosen so that distinct translations induce distinct goal configurations up to a small overlap (so that orbit selection matters for average return). The second is a symmetric maze family constructed by tiling a motif and taking G to be the automorphism group of the tiling; the base policy is trained offline on the uncorrupted maze, and deployment perturbs observations by an episode-wise additive offset in a learned latent embedding (in which G acts approximately by permutations), so that the wrapper must identify the correct orbit element rather than a Euclidean translation.

We implement three anchor mechanisms, each tuned by a per-timestep anchor probability p ∈ (0, 1]: (i) absolute-coordinate anchors, where at anchor times we reveal one coordinate of st exactly (uniformly choosing which coordinate in d = 2); (ii) offset probes, where zt = bj + ηt with ηt Gaussian; and (iii) trusted landmark channels, where zt is an indicator of membership in a distinguished landmark set L ⊂ S unaffected by the offset. In all cases we choose anchor channels that violate invariance under every nontrivial g ∈ G on a non-negligible subset of states, so that the symmetry-breaking condition holds. To expose the dependence on visitation of informative states, we also include an ablation where landmarks are placed only in a region rarely visited by the base policy, so that pinfo ≪ p.

For finite G we run the eliminative version of SHEA in deterministic environments and the weighted (Bayesian) version under noisy anchors. For the continuous-offset setting (where bj ∈ ℝd is not discretized), we implement an “anchor-augmented FORL” variant consistent with the computational model described earlier: we maintain K particles for b and update them (i) by deterministic propagation using Δot when b is constant within an episode and (ii) by anchor likelihood weighting when αt = 1, with resampling when the effective sample size falls below a threshold. When using a learned prior (e.g. a score-based model over plausible state embeddings), we incorporate it only through per-particle likelihood evaluations; the control policy remains fixed.
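A sketch of the per-particle update just described (thresholds, jitter scale, and constants are illustrative choices, not prescriptions from the paper):

```python
import numpy as np

# Particle filter over a continuous episode offset b: anchor-likelihood
# weighting at probe times, resampling when the effective sample size drops.
rng = np.random.default_rng(5)
d, K, p, sigma = 2, 512, 0.05, 0.3
b = rng.uniform(-5, 5, size=d)                 # true episode offset

particles = rng.uniform(-5, 5, size=(K, d))    # prior over b
weights = np.full(K, 1.0 / K)

for t in range(2000):
    if rng.random() < p:                       # offset probe z = b + eta
        z = b + sigma * rng.standard_normal(d)
        ll = -0.5 * np.sum(((z - particles) / sigma) ** 2, axis=1)
        weights *= np.exp(ll - ll.max())
        weights /= weights.sum()
        if 1.0 / np.sum(weights ** 2) < K / 2:  # low effective sample size
            idx = rng.choice(K, size=K, p=weights)
            particles = particles[idx] + 0.01 * rng.standard_normal((K, d))
            weights = np.full(K, 1.0 / K)

print("residual error:", np.round(weights @ particles - b, 3))
```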

We report (a) a localization time τδ := inf {t : d(ŝt, st) ≤ δ} (or the analogous criterion for offset particles), (b) cumulative regret relative to an oracle-offset agent,
$$ \mathrm{Reg}(H)\;:=\;\sum_{t=0}^{H-1}\bigl(r_t^{\mathrm{oracle}}-r_t\bigr), $$
and (c) tail metrics of the per-episode return $G_j:=\sum_{t=0}^{H-1} r_t$, such as CVaRα(G) for α ∈ {0.05, 0.1}, which captures rare but severe orbit-mismatch failures. In long-horizon runs we take H ≫ 1/p so that both the pre-localization transient and the post-localization steady state are visible.

Across both the torus and symmetric-maze families, we observe that 𝔼[τδ] grows approximately linearly in 1/p when anchors are noiseless and distinguishing on the visited state distribution. Under noisy anchors, the same plots become linear in (1/p)log (1/δ) (and, in finite-G experiments, in (1/p)log |G| when we drive the posterior misclassification probability below a fixed threshold), matching the concentration discussion in the noisy-anchor extension. Correspondingly, when H is large, the empirical regret decomposes into a transient proportional to τδ plus a small post-localization residual; the transient component dominates, yielding Reg(H) ≈ c/p over the range of p explored. In contrast, in no-anchor baselines (including naive “treat ot as st” controllers and delta-only integration with an arbitrary initial alignment), the orbit ambiguity persists and the regret curve flattens to a nonzero constant gap per unit time, consistent with the non-identifiability lower bound.

The most salient qualitative effect of anchors in long-horizon rollouts is stabilization: once the wrapper concentrates on a single orbit element, the deployed actions coincide (in the noiseless-anchor case) with those of the oracle-offset agent, and return trajectories exhibit markedly reduced variance. This manifests in tail statistics: without anchors, a fixed fraction of episodes behave as though the agent were deployed in the wrong orbit element, producing sporadic but severe failures (e.g. repeatedly navigating to the translate of the intended goal region); with anchors, the lower tail of Gj improves substantially even when mean return changes modestly. Empirically, CVaRα(G) approaches the oracle value as p increases, and the convergence is typically faster than that of the mean because anchors primarily eliminate catastrophic orbit-mismatch events rather than fine-tune within-orbit behavior.

Finally, the landmark-placement ablation illustrates the distinction between p and pinfo. When anchors are frequent but (under the base policy’s visitation distribution) rarely symmetry-breaking, increasing p alone does not accelerate localization: τδ tracks 1/pinfo rather than 1/p, and tail-risk reduction is correspondingly delayed. This behavior is consistent with the policy-excitation discussion above and motivates reporting not only anchor frequency but also an empirical proxy for informativeness, e.g. the observed log-likelihood ratio separation between leading symmetry hypotheses on anchor timesteps.


9. Related Work: FORL, robust offline RL, POMDP identifiability, SLAM/loop-closure analogies, non-stationary RL drivers.

Our deployment model is motivated by “wrapper” approaches that leave a base policy unchanged and instead attempt to filter latent variables online so as to present approximately correct inputs to the policy. This viewpoint appears in work on test-time adaptation of perception modules, system-identification layers attached to controllers, and filtering in latent-space policies; see, e.g., the broad family of “policy + estimator” decompositions in partially observed control. The specific “FORL”-style instantiation (in which one maintains a belief over an offset or a finite hypothesis class and updates it with online likelihood information) is closest in spirit to Bayesian filtering in HMMs and state-space models, except that here the unobserved variable is an additive corruption that induces a global symmetry ambiguity. Our contribution relative to generic filtering is to isolate a minimax regime in which filtering is impossible without symmetry-breaking side information, and to give a wrapper with sharp scaling in the anchor rate once such information is present.

There is a substantial literature on robust and conservative offline RL under dataset shift, including pessimism-based methods, uncertainty-regularized objectives, and adversarial formulations. These methods typically address uncertainty about the transition/reward model induced by limited coverage of the offline data distribution, and they aim to prevent catastrophic overestimation when the learned Q-function is queried out of distribution. Our setting is complementary: the underlying MDP is fixed across deployment, and the performance failure is driven by an unobserved confounder that is constant within an episode but arbitrary across episodes. In particular, even if the base policy is optimal for the uncorrupted MDP and the dynamics are known, the policy may be rendered ineffective by orbit ambiguity. The lower bounds we present therefore do not follow from classical offline RL uncertainty arguments; rather, they reflect a structural non-identifiability induced by symmetry. Conversely, robust offline RL techniques can be combined with our wrapper, but they do not substitute for symmetry-breaking anchors when the observation channel erases absolute alignment.

Identifiability questions for latent-state models are classical in statistics and control. For HMMs and controlled HMMs/POMDPs, generic identifiability typically requires richness conditions (full-rank emissions, sufficient excitation, etc.) and yields recovery only up to permutation/relabelling of latent states. In reinforcement learning, related notions appear under “block MDPs” and “observable POMDPs,” where observation embeddings are sufficient to recover latent state up to an invertible transform. Our setting may be viewed as an adversarially chosen non-identifiable observation map s ↦ s + bj that changes between episodes. The salient distinction is that the ambiguity is structured: a group G acting on S, and our lower bound is stated directly in terms of orbit separation under G (rather than generic rank conditions). This makes precise an obstruction that is often implicit in POMDP identifiability: even with unlimited interaction, one cannot identify latent state beyond an equivalence class if the observation model collapses those classes and the dynamics/rewards respect the same invariances.

Exploiting symmetries in MDPs has a long history, including state aggregation via automorphisms, equivariant representations, and invariant policy classes. Many of these works use symmetry to reduce the effective state space or improve sample efficiency by tying parameters across orbit elements. Our use of symmetry is different: we use it to construct hard instances (non-identifiability without anchors) and to motivate a simple hypothesis-elimination filter when anchors are available. In particular, the orbit-indistinguishability argument underlying our lower bound is a controlled analogue of the fact that, absent symmetry-breaking observations, any estimator can at best select a canonical representative of an orbit, which is insufficient for tasks that depend on absolute alignment.

The anchor mechanisms we formalize are analogous to absolute measurements in simultaneous localization and mapping (SLAM). In SLAM, dead-reckoning from relative motion integrates drift, while intermittent absolute observations (GPS fixes, loop closures, recognized landmarks) provide constraints that collapse global pose ambiguity. Our model is intentionally simpler: the offset bj is constant within an episode rather than slowly drifting, and the “map” is replaced by known MDP dynamics. Nevertheless, the qualitative phenomenon is shared: relative information (here, Δot and transition consistency) is insufficient to determine global alignment, whereas a sparse set of absolute constraints yields localization with waiting time governed by their arrival process. The Θ(1/p) scaling in our theorems can thus be interpreted as a control-theoretic analogue of loop-closure sample complexity under a Bernoulli landmark process.

The per-episode offset can be viewed as a latent context variable that changes between episodes, placing us near work on hidden-parameter MDPs, contextual MDPs, and meta-RL. There is also extensive work on non-stationary and piecewise-stationary RL, where transitions or rewards change over time and one seeks detection and adaptation. Our regime differs in that the underlying MDP is stationary and only the observation channel changes, but the learning-theoretic difficulty is of similar flavor: without side information that pins down the latent episode variable, performance can be bounded away from optimal uniformly over the change sequence. Anchors play the role of a minimal “reveal” signal that renders the latent context identifiable; our results quantify how the frequency and informativeness of such reveals control both localization time and regret.

Finally, our offset model is a structured instance of observation corruption. In contrast to general adversarial noise models (where one typically seeks worst-case robustness guarantees), we exploit the fact that the corruption lies in a known symmetry family and is constant within an episode. This places the problem between classical robust control and statistical estimation under group actions. The main message is that structural knowledge alone does not guarantee recoverability: if the environment is itself symmetric, then the corruption may be unrecoverable without explicit symmetry-breaking information. This perspective clarifies when “test-time calibration” is fundamentally limited and when sparse anchoring is sufficient.


10. Discussion and Design Guidance: practical recommendations on how many anchors, what kind, and where to place them; limitations and open problems.

We summarize practical implications of the lower/upper bounds and translate the anchor assumptions into concrete design guidelines. The central message is that, in the symmetric regimes captured by Theorems 1–5, relative information (transition consistency, differences Δot, rewards that respect the symmetry) does not determine absolute alignment; hence performance hinges on introducing side information at a rate that controls the waiting time to disambiguation.

In the idealized Bernoulli model where an informative anchor occurs each step with probability p, the expected time to the first anchor is 1/p, and more generally
ℙ(no anchor by time t) = (1 − p)t ≤ e−pt.
Thus, to ensure at least one anchor by time t with probability at least 1 − δ, it suffices that $t \ge \frac{1}{p}\log\frac{1}{\delta}$. In the deterministic finite-G setting of Theorem 3, once a distinguishing anchor occurs, the hypothesis set collapses and does not re-expand, so the performance transition is sharp: pre-anchor behavior is orbit-ambiguous, post-anchor behavior matches the oracle (noise-free case).
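As a quick worked instance of this budget (a two-line helper we add for illustration; the function name is ours):

```python
import math

# Horizon guaranteeing a first anchor with probability >= 1 - delta,
# from P(no anchor by time t) = (1 - p)^t <= exp(-p t).
def first_anchor_horizon(p: float, delta: float) -> int:
    return math.ceil(math.log(1.0 / delta) / p)

print(first_anchor_horizon(p=0.01, delta=0.01))  # 461
```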

In applications, one rarely controls a literal per-step anchor probability. A more useful abstraction is an effective anchor rate
peff = p ⋅ q,
where p is the probability that the anchoring channel is available, and q is the probability that the agent is in an anchor-informative situation when it is available. For instance, if an anchor reveals an absolute coordinate only at certain locations (e.g., a corridor with a fiducial), then q is the visitation frequency of those locations under the deployed policy. The Θ(1/p) conclusions should then be read as Θ(1/peff): anchoring mechanisms that are frequent but almost always symmetry-preserving can be nearly useless.

Our assumptions allow several qualitatively different anchor channels, but they share a single requirement: for each nontrivial symmetry g ≠ e, there must exist states s for which the anchor statistics distinguish s from ϕg(s). Concretely, the following design patterns are robust:

  1. Absolute coordinates. Occasional measurements of one coordinate in an external frame (GPS-like) are powerful when the symmetry acts nontrivially on that coordinate. On a translation-equivariant torus, a single absolute coordinate breaks all translations along that axis but leaves residual ambiguity along the orthogonal axis; this is not a failure of the filter but a statement that the remaining subgroup is still unidentifiable. The practical guideline is to ensure that the anchor resolves the full symmetry group relevant to the task.

  2. Offset probes. Direct (possibly noisy) measurements of bj are maximally informative when available, but even a biased or partial probe can suffice if its distribution is not G-invariant. When probes are noisy, Theorem 5 indicates that one should budget on the order of (1/p)log |G| anchor occurrences to concentrate the posterior over G; correspondingly, decreasing the probe noise variance can be exchanged against anchor frequency.

  3. Trusted landmarks. Landmark identities that are unaffected by the offset (e.g., AprilTags in the world frame, uniquely textured regions) implement symmetry breaking if and only if landmark labels are not repeated across orbit elements. If the environment contains repeated motifs, then landmarks that look “distinct” locally may still be globally symmetric; in such cases, one must either place landmarks asymmetrically or encode global identifiers.

Because the effective rate is peff = p ⋅ q, spatial placement aims to increase q under the deployed behavior. If we cannot modify the base policy, then anchors should be placed along trajectories that the policy visits reliably (near the start-state distribution ρ0, along common funnels, or at bottlenecks). If minor action interventions are allowed, one can add a short “localization prelude” that steers toward anchor-rich regions; the analysis then becomes two-phase, with an explicit tradeoff between the prelude cost and the reduced expected localization time.

A second placement principle concerns anchor informativeness. An anchor at a state s that satisfies z(s) ≈ z(ϕg(s)) for many g is weak even if frequent. In grid/maze instances, anchors near highly symmetric regions (center of a symmetric room, repeated corridor segments) tend to be less informative than anchors near asymmetric boundary features. For finite G, one may formalize this by maximizing, over candidate anchor locations, the expected number of eliminated hypotheses per anchor event under the anticipated visitation distribution.
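One toy instantiation of this scoring rule (the layouts and uniform visitation are made-up assumptions): it averages, over anchor events, the number of nontrivial translations whose label predictions disagree, and must be read alongside the requirement GQ = {e}.

```python
# Score a landmark layout by the expected number of hypotheses g != e that a
# single anchor event eliminates, under an assumed visitation distribution.
L = 12
visitation = [1.0 / L] * L                       # assumed uniform visitation

def expected_eliminations(label):
    return sum(visitation[s] *
               sum(label[s] != label[(s + g) % L] for g in range(1, L))
               for s in range(L))

unique = [1 if s == 2 else 0 for s in range(L)]          # one landmark
repeated = [1 if s in (2, 8) else 0 for s in range(L)]   # repeats under +6

print(expected_eliminations(unique), expected_eliminations(repeated))
# Caveat: the repeated layout can score higher on average eliminations yet
# never rules out the shift by 6; combine the score with G_Q = {e}.
```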

When implementing a hypothesis filter such as SHEA, we recommend tracking not only weights wg but also a scalar confidence summary (e.g., 1 − maxg wg) and using it to gate downstream behavior. Before confidence crosses a threshold, the wrapper may (a) act conservatively (risk-averse actions that are near-invariant across hypotheses), (b) request additional anchors if the system supports active sensing, or (c) abstain in safety-critical contexts. This addresses a practical gap between the information-theoretic statements (which permit constant pre-anchor loss) and systems that require bounded risk at all times.

We close with limitations and open problems. First, our tight Θ(1/p) localization picture relies on determinism (or, more generally, on transition tests that reliably eliminate inconsistent hypotheses). Extending sharp elimination-style guarantees to stochastic dynamics requires likelihood-based updates and careful control of false elimination; one expects additional factors depending on mixing, KL separations, and horizon.

Second, real environments exhibit approximate rather than exact symmetries. In that case, non-identifiability becomes “ill-conditioning”: without anchors, one can sometimes localize slowly via tiny asymmetries, but the sample complexity may be prohibitive and brittle. Characterizing the interpolation between exact-symmetry lower bounds and approximate-symmetry rates remains open.

Third, we assumed known G (or at least a known hypothesis class). In many robotics and perception settings, the relevant nuisance transformations are only partially modeled. Jointly learning the symmetry structure, the anchor model, and the filter online is an important direction, as is replacing finite G with continuous groups and analyzing discretization effects.

Finally, our guarantees are wrapper-style and therefore inherit limitations of the base policy: if π never visits distinguishing regions, then q ≈ 0 and anchors are ineffectual. Designing policies that are simultaneously task-effective and “anchor-seeking” (or more generally, information-seeking under deployment-time confounding) is a natural bridge between our identifiability focus and classic exploration in partially observed control.