Retrieval-augmented generation (RAG) is commonly presented as a behavioral fix for factuality: we attach a retriever to a language model, provide relevant passages at inference time, and expect the model to condition on those passages when producing an answer. Empirically, however, many prominent failures of RAG are not well described as mere insufficiency of retrieved information. Rather, they present as routing failures: the system has access to a channel that contains the needed evidence, yet the final prediction is routed through a different computational pathway that is only weakly constrained by that evidence. In such cases the output may be fluent and even locally consistent with the retrieved text, while still being unsupported by it, or it may contradict retrieved evidence while remaining stable under retrieval perturbations. These phenomena suggest that improving RAG requires not only better retrieval and better decoding, but also explicit guarantees that the model uses retrieved evidence in a causal sense.
Two observations motivate this work. First, standard accuracy metrics on question answering with retrieval conflate at least three distinct regimes: (i) supported answering where the model reads and applies the retrieved evidence; (ii) unsupported answering where the model relies on parametric memory or priors (and may coincidentally be correct); and (iii) spurious answering where the model is driven by superficial cues in the retrieved passages (formatting, lexical overlap, distractor entities) rather than the intended evidence. Purely behavioral evaluations (e.g. exact match on held-out questions) cannot disentangle these regimes, because the label does not specify the causal pathway. A model that answers correctly by parametric recall may score identically to a model that reads the passage. Conversely, a model that is correct on average may remain brittle under small changes to retrieval content, precisely because its computation is not anchored to the evidence-bearing tokens.
Second, several practical mitigation strategies in RAG implicitly assume that the model routes computation through retrieval: increasing context length, adding citation prompts, ranking passages more carefully, or training on instruction-following. These methods may improve outcomes, yet they do not certify the intended causal dependence. Indeed, we often observe cases where providing retrieval context increases hallucination: additional distractors introduce high-salience but irrelevant spans, and the model becomes less sensitive to the true evidence. This is consistent with the view that the model implements a latent mixture of mechanisms, only one of which corresponds to direct evidence use, and that the mixture weights can shift as the surface statistics of the retrieved context change.
Our central claim is that RAG should be audited and trained at the level of internal causal pathways, not merely at the level of input–output behavior. Concretely, we regard the retrieval interface—cross-attention over retrieved tokens, a memory read vector, or an explicit compression module—as a designated component through which evidence must flow in supported examples. If the prediction is truly evidence-based, then intervening on this interface should have a controlled and predictable effect: removing or corrupting evidence should reduce the model’s likelihood of the correct answer, and restoring the interface state corresponding to the clean evidence should restore that likelihood. Conversely, swapping in interface states corresponding to distractors should not restore the correct answer likelihood. These desiderata are inherently counterfactual; they refer to what the model would have done under controlled changes to evidence while holding other factors fixed.
To operationalize this idea we draw on interchange interventions, which have emerged as a useful tool for mechanistic evaluation in neural models. The interchange principle is simple: we identify an internal activation (or set of activations) that is hypothesized to carry a certain information channel, and we replace it with the activation obtained under a different input that isolates that channel. In the retrieval setting, the natural interchange target is the retrieval read component at the answer position(s). Given a clean retrieved context and a corrupted one, we can run the model on each, record the retrieval-interface activations, and then run an intervened forward pass that uses the corrupted input while substituting in the clean retrieval activation. If the model’s probability of the correct answer is substantially restored by this substitution, then the retrieval interface is causally mediating the prediction; if not, then the prediction is being computed largely elsewhere (e.g. by parametric memory, or by features of the corrupted context that bypass the intended evidence).
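To make the operation concrete, the following is a minimal sketch of an interchange intervention implemented with PyTorch forward hooks. The model, the choice of module, and the assumption that the recorded activation has the same shape in both runs (e.g. a read vector at a fixed answer position) are illustrative, not prescriptive.

```python
import torch

def record_activation(model, module, inputs):
    """Run the model on `inputs` and cache the output of `module`."""
    cache = {}

    def save_hook(mod, hook_in, hook_out):
        cache["act"] = hook_out.detach()

    handle = module.register_forward_hook(save_hook)
    try:
        with torch.no_grad():
            model(**inputs)
    finally:
        handle.remove()
    return cache["act"]

def interchange_forward(model, module, inputs, donor_act):
    """Forward pass on `inputs` with the output of `module` replaced by
    `donor_act`, which was recorded on a different input. Returning a value
    from a forward hook overrides the module's output."""

    def patch_hook(mod, hook_in, hook_out):
        return donor_act  # assumes donor and recipient activations share a shape

    handle = module.register_forward_hook(patch_hook)
    try:
        with torch.no_grad():
            out = model(**inputs)
    finally:
        handle.remove()
    return out
```

In the retrieval setting, `module` would be the retrieval read component discussed below, the donor input would provide the clean context, and the recipient input the corrupted one.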
This causal perspective also clarifies the relationship between RAG and two mechanisms that have been widely discussed in transformer interpretability: direct retrieval from context and induction. In the direct retrieval regime, the model computes an answer by attending to specific tokens in the provided context and copying or transforming them. In the induction regime, the model uses patterns in context (e.g. repeated key–value mappings) to infer a rule and apply it, potentially without relying on a particular evidence span. Both mechanisms can produce correct outputs, but they differ in their sensitivity to evidence swaps and in their failure modes under distractors. A model that has learned a strong parametric association between a query pattern and an answer may behave like an ``induction-only'' system even when retrieval is present, treating the retrieved text as optional decoration. Such a model is especially dangerous in RAG deployments because it can appear grounded (the retrieved context contains relevant content) while its computation ignores that content.
We therefore frame the RAG problem as one of enforcing and measuring a desired routing constraint: on supported queries, a nontrivial fraction of the answer likelihood should be mediated by the retrieval interface and, more specifically, by the evidence-bearing tokens. This leads to a metric that we can audit per example and aggregate over distributions: a score derived from restoration under interchange interventions. The metric is not intended to replace accuracy; rather, it refines accuracy into a mechanistically meaningful statement about why the model is correct when it is correct, and how it fails when it fails. It also provides an immediate diagnostic for common pathologies: (i) high accuracy but low causal usage (parametric memorization masquerading as grounding), (ii) high sensitivity to distractors (spurious routing through non-evidence features), and (iii) insensitivity to evidence identity (retrieval presence is used as a generic signal, but not the specific evidence).
Finally, this viewpoint suggests a training principle: if we can compute a causal usage score with a small number of additional forward passes, then we can regularize the model to increase evidence-mediated routing and suppress distractor-mediated routing. This is conceptually analogous to contrastive learning, but the contrast is imposed at the level of internal causal channels rather than at the level of representations or outputs alone. The remainder of this paper makes this precise: we define the restoration-based causal metric at a designated retrieval component, we introduce a corruption process that preserves surface cues while swapping evidence, and we propose a training objective that encourages the model to depend on evidence in the required mechanistic sense. The resulting framework is intended to be both auditable and actionable: it produces per-example causal diagnostics and, under explicit assumptions, yields distributional bounds on unsupported answering under retriever shift.
Our use of interchange interventions places this work in the broader program of causal analysis of neural networks, where one evaluates candidate internal variables by intervening on their values and measuring downstream effects. The basic operation is to run two forward passes on distinct inputs, record an activation at a designated site, and then perform a third forward pass in which the activation is patched (replaced) from one run into the other. This family of methods appears under several names—activation patching, causal tracing, representation swapping, and interchange interventions—and is closely related to the causal abstraction viewpoint in which internal states are treated as endogenous variables in a structural causal model over the computation graph. A key methodological advantage is that these interventions can be implemented with lightweight hooks at module boundaries (e.g. attention outputs, MLP pre-activations, or designated read vectors) and do not require training a separate explainer model.
Among patching-based methods, restoration-based scores are particularly natural in retrieval settings. Fix an input pair (q, D+) and (q, D−) that differ in the evidence-bearing content, and fix a component f that we hypothesize carries the retrieval-mediated signal into the answer computation. We may compare (i) the model’s probability of the gold answer on the clean context, (ii) the probability on the corrupted context, and (iii) the probability on the corrupted context under an intervention that substitutes the clean activation at f. The resulting restoration fraction can be interpreted as a measure of how much of the model’s preference for the correct answer is mediated by the intervened component, relative to the preference change induced by the corruption. This idea is conceptually aligned with mediation analysis: the corruption changes both the value of the proposed mediator and possibly other features, and the interchange intervention isolates the mediated pathway by holding the rest of the computation fixed. In practice, restoration-based scores have been used to localize information flow for factual recall and to identify attention heads and layers implicated in particular behaviors; we adopt the same operational stance but tailor the intervention site to the explicit retrieval interface of a RAG system.
A recurring difficulty in auditing retrieval usage on natural QA is that real corpora entangle evidence with confounds such as lexical overlap, answer frequency, and topical priors. For this reason, synthetic tasks have been used to study whether transformers can implement content-addressable retrieval and whether attention mechanisms act as retrieval primitives. In particular, associative-style recall tasks instantiate a set of key–value bindings placed in context and query the value for a key; attentive retrieval variants control where and how the relevant binding appears, often inserting distractor bindings whose surface statistics closely match the true one. These settings permit clean causal tests: one can swap the evidence binding while keeping formatting, entity types, and positional cues fixed, thereby distinguishing models that truly route through the intended binding from models that rely on shortcuts (e.g. positional heuristics, spurious token correlations, or query-only priors). Such tasks have also been used to study ``induction'' behavior, where models generalize a mapping pattern rather than copying a specific span. For our purposes, the main lesson is methodological: carefully controlled evidence swaps and distractor injections are necessary to prevent trivial detection of corruption and to ensure that any measured restoration reflects evidence identity, not merely retrieval presence.
A large literature addresses grounding and faithfulness in retrieval-augmented generation, typically at the level of input–output behavior. Common approaches include: prompting the model to cite sources, training with supervised rationales, reranking passages to improve evidence quality, and post-hoc verification via entailment models or secondary checkers. Evaluation protocols often measure whether generated answers are supported by retrieved text (attribution accuracy), whether citations point to relevant spans, or whether the answer is stable under minor retrieval perturbations. While these methods are valuable, they do not by themselves establish that the model’s computation uses evidence in a causally mediated sense: a model may generate plausible citations without conditioning strongly on them, or it may answer correctly from parametric memory even when evidence is present. Conversely, behavioral sensitivity tests can be ambiguous when the perturbation changes many correlates at once. Our position is that mechanistic auditing complements behavioral faithfulness: by intervening at the retrieval read path we can quantify whether the retrieved tokens have a causal route into the answer, and we can separate evidence-mediated effects from spurious correlations carried by other parts of the network.
Retrieval augmentation is often implemented by concatenating passages to the query and relying on standard self-attention. However, many modern systems introduce explicit retrieval interfaces that mediate information flow from retrieved text into generation: cross-attention blocks over external memory, late-fusion architectures that aggregate passage-wise representations, and retrieval-to-generation adapters. Architectures in the RETRO family and related nearest-neighbor augmentation schemes similarly impose a distinct channel through which retrieved neighbors influence decoding. These design choices are relevant because they expose natural intervention sites: rather than patching arbitrary internal activations, we may patch the read vector(s) of a designated retrieval module, which is precisely the component intended to carry retrieved evidence.
Long-context transformers and memory-augmented models further complicate the notion of ``retrieval,'' since information may enter the answer computation via learned summaries, compressed representations, or persistent memory tokens rather than direct attention to raw retrieved spans. Recurrent memory mechanisms, segment-level recurrence, and learned compressive modules create an explicit bottleneck through which past context is distilled; similarly, architectures that allocate special memory tokens (or latent arrays) act as a structured interface between high-volume context and the decoding stream. From a causal standpoint, these interfaces are attractive because they define a small set of ports whose activations summarize the evidence pathway. At the same time, they introduce a failure mode: compression may preserve superficial cues while discarding the discriminative fact needed for a particular query, yielding a system that appears to have ``seen'' the evidence but cannot causally use it at answer time. This motivates auditing not only raw-token attention patterns but also the internal memory readout and compression outputs, treating them as candidate f sites for interchange interventions.
The common thread across these lines of work is the need to distinguish the presence of information from its causal use. Interchange interventions and restoration attribution provide an operational tool for this distinction; synthetic AR/ATR-style tasks provide controlled environments where evidence swaps are well-defined; and the diversity of retrieval and memory interfaces in modern architectures motivates defining evidence usage at explicit read ports rather than at the level of output behavior alone. We synthesize these ideas by proposing a retrieval-specific restoration metric and a training-time regularizer that acts directly on the designated retrieval channel.
We formalize retrieval-augmented prediction as inference under a partially observed evidence variable, with explicit control over both the retriever and the retrieved context. Let 𝒟 be a distribution over triples (q, y*, e), where q is a query, y* is the gold answer, and e is a latent evidence item (typically a short span or atomic fact) that suffices to justify y* given q. We assume that e is well-defined up to an equivalence class of paraphrases; in synthetic settings e may be an exact span, whereas in natural QA e may correspond to a minimal supporting sentence. When available, we represent an evidence label inside a retrieved context by a token index set E(D) ⊆ {1, …, |D|} indicating the location(s) of e within D.
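For concreteness, a minimal sketch of the data record this setup assumes follows; the field names are illustrative rather than a fixed schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RagExample:
    """One draw (q, y*, e) together with a retrieved context D.

    `evidence_token_indices` plays the role of E(D): token positions inside the
    concatenated retrieved context that carry the evidence item e; it is left
    empty when span labels are unavailable."""
    query: str                  # q
    gold_answer: str            # y*
    retrieved_context: str      # D, treated as one concatenated token sequence
    evidence_token_indices: List[int] = field(default_factory=list)    # E(D)
    distractor_token_indices: List[int] = field(default_factory=list)  # optional distractor labels
```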
A retriever R maps a query to a set of passages D = R(q), which we treat either as a multiset of documents or as a single concatenated token sequence. We consider a stochastic retriever (via sampling, approximate search, or nondeterminism in the corpus snapshot) and assume a miss probability bound Pr[e ∉ R(q)] ≤ η over draws (q, y*, e) ∼ 𝒟, where η is an explicit parameter of the environment. For analysis and training we distinguish two regimes: a supported regime in which the retrieved context contains the relevant evidence, and an unsupported regime in which it does not. Concretely, for each example we write D+ for a retrieved context containing e, and we allow that D+ may also contain distractor spans that are irrelevant to y* but similar in surface form. At evaluation time we will consider retriever shift, in which the test-time miss rate and distractor profile differ from those encountered during training.
Given (q, D), the model outputs a conditional distribution pθ(y ∣ q, D) over answers. We allow answers to be sequences (e.g. token strings), but we keep notation at the level of a single output variable y; the log-likelihood of a multi-token answer is interpreted as the sum of per-token log-probabilities under teacher forcing. The total input length (query plus retrieved context and any special delimiters) is denoted by n. We emphasize that the model is not assumed to be purely extractive: it may synthesize y from D and its parametric memory. Our objective, made precise later, is to enforce that when evidence is present the model’s preference for y* is causally mediated by the retrieved evidence through an explicit interface, rather than being primarily driven by query-only priors or by confounded correlates inside D.
We assume the architecture contains a designated retrieval interface component f through which retrieved tokens (or their summaries) influence the answer computation. We use hf(q, D) for the activation at that component on input (q, D), and we require that hf be accessible for recording and replacement at module boundaries. The definition of f depends on the RAG design:
(i) in a two-stream encoder–decoder with cross-attention over retrieved tokens, f may be the cross-attention output vector at the answer position (or the collection of per-head outputs) in a specified layer;
(ii) in late-fusion or passage-wise aggregation, f may be the pooled passage representation or the mixture weights over passages;
(iii) in memory-augmented designs with read ports, f may be the memory read vector(s) or key–value retrieval output at each decoding step;
(iv) in long-context compression settings, f may be the output of a compression module or the states of learned memory tokens that summarize the retrieved text.
In all cases we conceptualize f as the intended bottleneck for retrieval-mediated information, so that interventions at f isolate the retrieval pathway more directly than arbitrary internal patching.
To disentangle causal usage of evidence identity from superficial features of retrieval, we introduce a family of corruption operators 𝒞 that map a context D to a corrupted context D− = 𝒞(D). We parameterize 𝒞 by a corruption strength δ controlling how aggressively evidence is altered or how many distractors are injected, and we require that 𝒞 preserve surface statistics to the extent possible (format, length, entity types, positional cues), thereby preventing trivial detection.
We consider two canonical corruption types. Evidence swapping replaces the evidence item e (or its span) with an alternative evidence-like span e′ drawn so that the resulting context remains plausible and stylistically matched. In synthetic key–value tasks, this corresponds to swapping the value paired with a queried key while keeping the key and formatting fixed. In factual QA, this may be approximated by swapping a sentence containing the answer with a sentence about a different entity of the same type. The resulting D− is designed so that pθ(y* ∣ q, D−) should decrease relative to pθ(y* ∣ q, D+) for models that genuinely use e. Distractor injection adds (or replaces spans with) spans that mimic the evidence span’s lexical and structural properties but are independent of y* given q. This corruption targets models that over-rely on shallow cues, such as topical overlap or answer frequency, by ensuring that these cues can be matched without providing the correct fact.
In addition, we allow a deletion corruption that removes evidence-bearing passages entirely, modeling the event e ∉ D and capturing the unavoidable failure mode where the retrieved context is genuinely unsupported.
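As an illustration, the sketch below implements the swap and injection corruptions for a synthetic key–value context. The line format, helper names, and the convention that the caller supplies distractor keys distinct from the queried key are assumptions of this example, not part of the formal definition.

```python
import random
import re

def swap_evidence_value(context_lines, evidence_idx, value_vocab, rng=None):
    """Evidence swap: replace the VALUE field of the evidence line with a
    different value from the same vocabulary, keeping ID, KEY, and formatting
    fixed so that surface statistics are preserved."""
    rng = rng or random.Random()
    lines = list(context_lines)
    line = lines[evidence_idx]
    match = re.search(r"VALUE:\s*(\S+)", line)
    if match is None:
        raise ValueError("evidence line has no VALUE field")
    old_value = match.group(1)
    new_value = rng.choice([v for v in value_vocab if v != old_value])
    lines[evidence_idx] = line[:match.start(1)] + new_value + line[match.end(1):]
    return lines

def inject_distractors(context_lines, line_template, distractor_keys, values, m, rng=None):
    """Distractor injection: append m evidence-like lines whose keys differ from
    the queried key (the caller excludes it), matching the evidence line's format."""
    rng = rng or random.Random()
    lines = list(context_lines)
    for _ in range(m):
        lines.append(line_template.format(iid=rng.randrange(1000),
                                          key=rng.choice(distractor_keys),
                                          value=rng.choice(values)))
    return lines
```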
We distinguish the training distribution, in which D+ is obtained by running (or simulating) the retriever and filtering for support, from the test distribution, in which retrieval quality may degrade. Formally, at test time we may face a shift in (η, δ): the miss rate η may increase (e.g. due to corpus drift or domain shift) and the distractor rate δ may increase (e.g. due to larger candidate sets or noisier retrieval). Our downstream evaluation therefore considers not only task accuracy on (q, D), but also the model’s behavior conditioned on whether D contains evidence. In particular, we will later bound the probability of producing an answer unsupported by the retrieved context under such shift, in terms of η and a mechanistic evidence-usage condition enforced during training.
Finally, we assume access to paired contexts (q, D+) and (q, D−) during training (constructed via 𝒞), and access to activations hf(q, D) for a small number of probed components. This is the only white-box requirement: we neither assume span-level supervision in general nor require a separate verifier. The next section uses this setup to define an intervention protocol that compares behavior across (q, D+), (q, D−), and activation-patched forward passes, yielding a causal evidence usage metric specialized to retrieval interfaces and evidence spans.
We now define an intervention protocol and an associated metric intended to quantify whether the model’s preference for the gold answer y* is causally mediated by retrieved evidence, as opposed to being primarily driven by query-only priors or by non-evidence features of the retrieved context. The central object is an interchange intervention at a designated retrieval interface component f, implemented by recording the activation hf(q, D) on one input and substituting it into a forward pass on another input.
Fix an example (q, y*, D+) in the supported regime, and let D− = 𝒞(D+) be a corruption that removes or swaps the evidence identity while preserving superficial cues. We define three log-likelihoods (single-token or teacher-forced sequence likelihoods as described in the setup):
L+ = log pθ(y* ∣ q, D+), L− = log pθ(y* ∣ q, D−),
and an intervened log-likelihood
Lfint = log pθf ← f*(y* ∣ q, D−, D+),
where pθf ← f* denotes the distribution induced by running a forward pass on (q, D−) but replacing the activation at f by its clean counterpart:
hf(q, D−) ← hf(q, D+).
Operationally, this corresponds to (i) running (q, D+) once to cache hf(q, D+), (ii) running (q, D−) once to obtain baseline behavior and, if needed, tensor shapes, and (iii) running (q, D−) again while patching the cached activation into the module boundary for f.
We quantify how much of the corruption-induced degradation in likelihood can be restored solely by fixing the retrieval interface. Let Δ = L+ − L− be the clean–corrupted log-likelihood gap for y*. Provided Δ is not degenerate, we define the restoration fraction
$$
\mathrm{Attrib}_f
\;=\;
\frac{L^{\mathrm{int}}_f - L^-}{L^+ - L^-}.
$$
In practice we use a stabilized and clipped version,
$$
\mathrm{Attrib}_f^{(\varepsilon)}
\;=\;
\mathrm{clip}_{[0,1]}\!\left(
\frac{L^{\mathrm{int}}_f - L^-}{\max\{L^+ - L^-,\varepsilon\}}
\right),
$$
with ε > 0 chosen to avoid
division by near-zero gaps (e.g. when the corruption has negligible
effect on y*).
Clipping is not part of the idealized definition, but it prevents rare
numerical pathologies where the intervention overshoots the clean
likelihood due to nonlinear interactions.
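In code, the stabilized and clipped score is a direct transcription of the formula above; a sketch:

```python
def restoration_fraction(L_plus, L_minus, L_int, eps=1e-3):
    """Clipped restoration fraction Attrib_f^(eps).

    L_plus:  log p(y* | q, D+)   from the clean run
    L_minus: log p(y* | q, D-)   from the corrupted run
    L_int:   log p(y* | q, D-) with h_f patched from the clean run
    """
    gap = max(L_plus - L_minus, eps)     # stabilized denominator
    frac = (L_int - L_minus) / gap
    return min(max(frac, 0.0), 1.0)      # clip to [0, 1]
```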
The goal of CEU is not merely to attribute the prediction to the presence of retrieval, but to attribute it specifically to the evidence-bearing tokens (or read ports) within the retrieval interface. We therefore refine Attribf by restricting the intervention to a designated subset of retrieval inputs.
Assume we have a token index set S ⊆ {1, …, |D|} in the
retrieved context (e.g. S = E(D+)
for evidence, or S
corresponding to a known distractor span). For many retrieval
interfaces, the activation hf(q, D)
can be decomposed additively or by masking into contributions from
retrieved tokens; for instance, in cross-attention one may view the
per-head output as a weighted sum of value vectors from retrieved
positions. Abstractly, we assume an operator ΠS that extracts
the contribution attributable to retrieved indices S at the interface, with complement
ΠS̄, so
that
hf(q, D) ≈ ΠShf(q, D) + ΠS̄hf(q, D),
where the approximation is exact in common linear readouts
(e.g. attention output before the MLP) and can be implemented by
recomputing the interface with attention masked to S or by masking the retrieved-side
representations prior to the interface. We then define a span-restricted interchange
intervention that replaces only the S-contribution:
ΠShf(q, D−) ← ΠShf(q, D+), ΠS̄hf(q, D−) unchanged.
Let Lf, Sint
be the resulting log-likelihood for y*. The span-restricted
causal evidence usage score is
$$
\mathrm{CEU}_{f,S}
\;=\;
\mathrm{clip}_{[0,1]}\!\left(
\frac{L^{\mathrm{int}}_{f,S} - L^-}{\max\{L^+ - L^-,\varepsilon\}}
\right).
$$
We instantiate the evidence score as CEUf, E(D+) when span labels are available, and define the analogous score for a designated distractor span set.
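The following sketch shows one way to realize the span-restricted patch for a single cross-attention head whose readout is a weighted sum of value vectors. It assumes the clean and corrupted contexts are position-aligned (as they are under in-place evidence swaps) and that per-position attention weights and value vectors from both runs are available; all tensor names are illustrative.

```python
import torch

def span_restricted_patch(attn_clean, V_clean, attn_corr, V_corr, span_idx):
    """Build the span-restricted interchange activation for a single-head
    cross-attention readout h_f = sum_j attn[j] * V[j].

    The S-contribution from the clean run replaces the S-contribution of the
    corrupted run; the complement is left unchanged."""
    S = torch.zeros(V_corr.shape[0], dtype=torch.bool)
    S[span_idx] = True
    clean_S = (attn_clean[S].unsqueeze(-1) * V_clean[S]).sum(dim=0)   # Pi_S h_f(q, D+)
    corr_S  = (attn_corr[S].unsqueeze(-1) * V_corr[S]).sum(dim=0)     # Pi_S h_f(q, D-)
    h_corr  = (attn_corr.unsqueeze(-1) * V_corr).sum(dim=0)           # h_f(q, D-)
    return h_corr - corr_S + clean_S
```

The returned vector is then written into the interface activation during the intervened forward pass, for example via the hook mechanism sketched earlier.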
In architectures with explicit read ports (e.g. multiple memory reads,
passage-mixture heads, or per-layer retrieval adapters), hf naturally
decomposes into a tuple (hf, 1, …, hf, r).
For a subset of ports P ⊆ {1, …, r} we define an
intervention that replaces (hf, i)i ∈ P
from the clean run while leaving the remaining ports unchanged. Denoting
the resulting likelihood by Lf, Pint,
we define
$$
\mathrm{CEU}_{f,P}
\;=\;
\mathrm{clip}_{[0,1]}\!\left(
\frac{L^{\mathrm{int}}_{f,P} - L^-}{\max\{L^+ - L^-,\varepsilon\}}
\right).
$$
This form is convenient when ports have a direct semantic meaning
(e.g. one port per retrieved passage), allowing auditing of whether the
model is causally using the correct passage rather than merely any
retrieved text.
When y* is a
sequence (y1*, …, yT*)
evaluated under teacher forcing, we define
$$
\log p_\theta(y^*\mid q,D)\;=\;\sum_{t=1}^T \log p_\theta(y^*_t\mid
q,D,y^*_{<t}),
$$
and we apply the same restoration fraction to the summed
log-likelihoods. This choice aligns CEU with the training objective
(maximum likelihood) and avoids ambiguities that arise if one uses only
the first token or an end-to-end exact-match metric. If desired, one can
also compute token-wise CEUf, S(t)
by applying the intervention only at decoding step t (or by caching step-specific hf), but we
treat this as an optional finer-grained diagnostic due to increased
compute.
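A sketch of the teacher-forced sequence log-likelihood used here, assuming a HuggingFace-style causal language model and tokenizer; the concatenation convention (prompt followed by gold answer tokens) is an assumption of this example.

```python
import torch
import torch.nn.functional as F

def gold_answer_logprob(model, tokenizer, prompt, answer):
    """Summed log-probability of the gold answer under teacher forcing:
    log p(y* | q, D) = sum_t log p(y*_t | q, D, y*_{<t})."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits                 # (1, T, vocab)
    ans_len = answer_ids.shape[1]
    pred_logits = logits[0, -ans_len - 1:-1, :]          # positions predicting answer tokens
    logprobs = F.log_softmax(pred_logits, dim=-1)
    token_lp = logprobs.gather(1, answer_ids[0].unsqueeze(1)).squeeze(1)
    return token_lp.sum().item()
```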
When evidence spans E(D+) are provided, CEU is directly defined as above. When span labels are unavailable, we require a proxy procedure E(D+) = Ê(q, D+) to nominate a small set of candidate indices for intervention, subject to the compute constraint that only k probes are feasible. Typical choices include: (i) heuristic sentence selection (e.g. the highest-overlap sentence with q), (ii) retriever-provided highlights or passage-level provenance, and (iii) model-internal saliency to tokens (e.g. top-ℓ retrieved tokens by cross-attention mass at the answer position), with the caveat that saliency is not itself causal and is used only to choose where to intervene.
If an example admits multiple disjoint evidence items e(1), …, e(m) (or multiple supporting sentences), we may define E(D+) as their union, or compute per-item scores CEUf, E(i) and aggregate by maxi (capturing whether correct evidence is causally used) or by an average (capturing whether the model spreads causal reliance across supports). The appropriate aggregation depends on whether the task requires a single sufficient fact or multi-hop composition.
By construction, CEUf, S measures a effect: holding the corrupted input fixed, it asks how much the likelihood of y* recovers when only the retrieval-interface contribution attributable to S is restored to its clean value. Evidence CEU is therefore high only when (i) the corruption materially harms the likelihood of y* and (ii) the harm is mediated through the retrieval interface contribution of the evidence span (or evidence ports), rather than through unrelated computation. Distractor CEU provides the complementary check that superficially similar but semantically irrelevant spans do not become dominant causal drivers of the output. These quantities will serve both as (a) per-example diagnostics and (b) training-time regularization targets in the objective we introduce next.
We now analyze a stylized model in which the preference for the gold answer decomposes into a query-only channel and a retrieval-mediated channel. The purpose of the abstraction is (i) to make precise when restoration-based interventions identify the retrieval-mediated contribution and (ii) to connect a lower bound on that contribution to robustness against distractors and sensitivity to evidence swaps, while deriving an upper bound on unsupported answering under retriever noise.
Fix an answer candidate y (in particular y = y*). Let ℓθ(y ∣ q, D) denote the pre-softmax logit for y under teacher forcing at the relevant position(s), or the sum of per-token logits for the gold sequence under a locally normalized factorization. We assume a decomposition
$$
\ell_\theta(y \mid q, D) \;=\; \ell_{\mathrm{par}}(y \mid q) \;+\; \ell_{\mathrm{ret}}\bigl(y \mid h_f(q, D)\bigr),
$$
where ℓpar is independent of D (capturing parametric priors and query-only cues), and ℓret depends on D only through the retrieval interface activation hf(q, D). We further assume that ℓret is (locally) affine in hf, i.e.,
$$
\ell_{\mathrm{ret}}\bigl(y \mid h_f\bigr) \;=\; \langle w_y,\, h_f \rangle + b_y,
$$
which is exact if f is taken to be a linear read vector into the final logit, and is a standard local approximation when f is chosen at a late module boundary (e.g. post-attention, pre-MLP). The role of these two assumptions is to ensure that interchange interventions at f isolate a well-defined additive contribution.
Let D+ denote retrieved context containing a specific evidence item e for (q, y*). A corruption operator 𝒞 produces D− = 𝒞(D+) by swapping or removing the of e while preserving superficial statistics (lexical overlap, formatting, passage length, entity type). Formally, we may posit a surface statistic map σ(⋅) such that σ(D−) = σ(D+) and e ∉ D−, while the substituted span is conditionally independent of y* given (q, σ(D+)). This is the regime in which we intend the clean–corrupted gap to reflect evidence identity rather than retrieval presence.
We relate the restoration fraction at f to the retrieval share of the
evidence-dependent logit gap. Let ℓ+ and ℓ− denote ℓθ(y* ∣ q, D+)
and ℓθ(y* ∣ q, D−),
and let ℓfint
denote the logit under the intervened run that replaces hf(q, D−)
by hf(q, D+).
By the decomposition and affineness assumptions,
ℓ+ − ℓ− = ⟨wy*, hf(q, D+) − hf(q, D−)⟩, ℓfint − ℓ− = ⟨wy*, hf(q, D+) − hf(q, D−)⟩,
so at the logit level the intervention restores the entire evidence-dependent
difference whenever the only difference between D+ and D− that matters for y* flows through hf. The subtlety
is that our metric is defined in terms of (log-)likelihoods after
softmax normalization rather than raw logits. To address this, we assume
bounded logit ranges and local Lipschitzness of log pθ(y* ∣ ⋅)
as a function of the gold logit gap. Under these standard
calibration-style conditions, a first-order Taylor expansion implies
that the restoration fraction computed in log-likelihood space equals
the corresponding restoration fraction in logit space up to a bounded
error term that vanishes as the logit perturbation shrinks. Concretely,
if other logits remain within a bounded interval and the softmax map has
Lipschitz constant Lsm on that interval,
then
$$
\left|\,\mathrm{Attrib}_f \;-\; \frac{\ell^{\mathrm{int}}_f - \ell^-}{\ell^+ - \ell^-}\,\right|
\;\le\;
C\, L_{\mathrm{sm}}\,\bigl|\ell^+ - \ell^-\bigr|
$$
for a constant C determined by the bounded-logit assumption, which yields the identifiability claim: restoration at f recovers the retrieval-mediated
share of the evidence-dependent effect, up to a calibration-dependent
discrepancy. This is the sense in which CEUf, S is an
estimator of causal mediation through the retrieval interface restricted
to S.
We formalize robustness as stability of the correct prediction under injection of distractor spans that match superficial cues. Consider a process that, given (q, D+), produces D′ by adding m distractor spans d1, …, dm sampled so that σ(di) = σ(e) and such that di ⟂ y* ∣ q (i.e., distractors carry no true information about the label beyond query-only features). In the two-channel model, any degradation induced by distractors must arise because hf(q, D′) moves in a direction that the model treats as predictive of some answer, despite the distractors being label-independent.
Suppose 𝔼[CEUf] < α over supported examples. By definition, this means that when we swap evidence identity (moving from D+ to D−), patching hf back restores less than an α fraction of the clean–corrupted likelihood gap on average, so a (1 − α) fraction of the gold preference is mediated by pathways other than the evidence-bearing retrieval contribution (e.g. query-only priors, or non-evidence features of retrieved text). Because distractors are constructed to match surface statistics, we may choose them adversarially to preserve those non-evidence features while altering hf in a way that counteracts the evidence direction. Under mild richness assumptions on the distractor family (that it spans a neighborhood in the space of attainable hf values while maintaining σ), we can construct D′ such that ℓret(y* ∣ hf(q, D′)) decreases by a constant fraction of the evidence advantage while ℓpar(y* ∣ q) remains unchanged. The resulting drop in the gold logit yields a constant-probability error event, giving an accuracy reduction of order Ω(1 − α) on supported queries. The same construction implies a lower bound: if evidence identity is swapped but CEU is small, then the model’s output distribution cannot change sufficiently in the correct direction, because too little of the decision is mediated by the retrieval pathway that the swap affects.
We now connect a CEU lower bound to the probability of producing an
answer unsupported by retrieved evidence when the retriever sometimes
misses. Let η be the miss
probability Pr [e ∉ D], and let δ parameterize additional test-time
distractor corruption. Condition on the event E that evidence is present in the retrieved
set. Assume (i) 𝔼[CEUf ∣ E] ≥ α,
and (ii) evidence swaps reduce the retrieval-mediated gold logit by at
least a margin γ > 0 in
expectation (a property enforced by swap-based contrastive training, but
taken here as an assumption). Under (i) and (ii), the gold logit advantage
attributable to evidence is then at least αγ in expectation. If we
additionally assume a standard logit-to-error calibration bound
(e.g. that the probability of selecting an unsupported alternative
decays exponentially in the gold-vs-best-alternative logit gap), we
obtain
Pr [unsupported output ∣ E] ≤ exp (−c αγ)
for a constant c > 0
determined by the calibration inequality and bounded-logit assumptions;
test-time distractors at rate δ can be absorbed into a reduction
of the effective margin, yielding the same form with γ replaced by γ(δ). Unconditioning and
using Pr [¬E] ≤ η yields
Pr [unsupported
output] ≤ η + (1 − η)exp (−c αγ),
which separates unavoidable hallucinations due to retrieval misses from
those due to insufficient causal usage of present evidence. This bound
is loose but mechanistically interpretable: increasing α tightens the supported-regime
error exponentially in the evidence margin, while improvements in the
retriever (smaller η) reduce
the irreducible term.
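To illustrate the shape of the bound with purely hypothetical constants (not measured values), take η = 0.05, c = 1, and margin γ = 3:
$$
\alpha = 0.6:\;\; 0.05 + 0.95\,e^{-1.8} \approx 0.21,
\qquad
\alpha = 0.9:\;\; 0.05 + 0.95\,e^{-2.7} \approx 0.11,
$$
so raising the enforced mediation level α shrinks the supported-regime term exponentially, while the η floor contributed by retrieval misses is unaffected.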
We now specify a training objective that encourages causal reliance on retrieved evidence rather than on query-only priors or superficial retrieval artifacts. The key ingredient is a restoration-based causal mediation score (CEU) computed by a small number of interchange interventions at an explicit retrieval interface component f. We combine the standard task loss with (i) a hinge-style constraint enforcing a minimum causal evidence usage on supported examples and (ii) a penalty discouraging causal influence from distractor spans, together with an optional contrastive corruption term that preserves sensitivity to evidence identity.
Each training instance consists of a triple (q, y*, D+)
where D+ contains
an evidence item e supporting
y*. A corruption
operator 𝒞 produces a matched corrupted
context D− = 𝒞(D+)
in which e is swapped/removed
while surface statistics are preserved. Let
L+ = log pθ(y* ∣ q, D+), L− = log pθ(y* ∣ q, D−)
denote the gold log-likelihoods under teacher forcing (or summed over
gold tokens). Let hf(q, D)
be the activation at the designated retrieval interface~f for input (q, D). We assume we can
(a) record hf(q, D+)
and (b) run an intervened forward pass on (q, D−) where we
replace the activation with the clean one. We denote the intervened
likelihood by
Lfint = log pθf ← f*(y* ∣ q, D−, D+),
where the superscript indicates that we forward on (q, D−) while
setting hf(q, D−) ← hf(q, D+).
For examples where the corruption changes the model confidence (so
that L+ − L−
is non-negligible), we define a restoration fraction
$$
\mathrm{Attrib}_f
\;=\;
\frac{L^{\mathrm{int}}_f - L^-}{L^+ - L^-},
$$
with the convention that we ignore the term or set it to 0 whenever L+ − L− falls below a small threshold to avoid numerical instability. When span/port structure is available, we refine this restoration fraction into a score CEUf, S by
restricting the intervention to a designated subset S of evidence-bearing tokens/ports.
Concretely, for token-level cross-attention, we implement restriction by
masking the patch to apply only to the attention readout attributable to
tokens in~S (equivalently,
patching a per-span read vector); for memory-port architectures, we
patch only the evidence-designated ports. We write CEUf, ev for the score
restricted to evidence spans/ports and CEUf, dis for the
analogous score restricted to distractor spans/ports. In all cases we
optionally clip to [0, 1] to keep the
score interpretable as a fraction:
CEUf, ev = clip[0, 1](Attribf, ev), CEUf, dis = clip[0, 1](Attribf, dis).
Let α ∈ (0, 1] be a target
minimum evidence mediation level. Our base task loss is
ℓtask(q, y*, D+) = − log pθ(y* ∣ q, D+).
We add a mechanistic regularizer consisting of an evidence hinge and a
distractor penalty:
$$
\ell_{\mathrm{CER}}
\;=\;
\lambda_{\mathrm{ev}}\,\bigl[\alpha - \mathrm{CEU}_{f,\mathrm{ev}}\bigr]_{+}
\;+\;
\lambda_{\mathrm{dis}}\,\mathrm{CEU}_{f,\mathrm{dis}},
$$
where [t]+ = max {0, t}
and λev, λdis ≥ 0
are weights. Intuitively, the first term enforces that a fixed fraction
of the clean–corrupted likelihood gap can be restored by repairing the
evidence-mediated pathway through~f; the second term discourages the
model from using distractor spans as if they were evidence.
Optionally, we include an explicit swap-sensitivity (contrastive
corruption) term to ensure that the identity of evidence matters and not
merely the presence of retrieval:
$$
\ell_{\mathrm{swap}} \;=\; -\,(L^{+} - L^{-}),
$$
or, when a margin is desired,
ℓswaphinge = [γ − (L+ − L−)]+,
with margin γ > 0. The
overall objective per example is then
$$
\ell_{\mathrm{total}}
\;=\;
\ell_{\mathrm{task}}(q, y^{*}, D^{+})
\;+\;
\ell_{\mathrm{CER}}
\;+\;
\beta\,\ell_{\mathrm{swap}},
$$
with β ≥ 0. We emphasize that the regularizer is a mechanistic constraint: it depends on counterfactual forward passes that isolate the causal channel through f, rather than on correlations between token presence and outputs.
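A minimal sketch of the per-example regularizer, written directly in terms of the cached log-likelihoods; the function signature and default weights are illustrative, and gradient flow through the intervened likelihoods is assumed to be handled by the surrounding training loop as discussed in the implementation notes below.

```python
import torch

def cer_regularizer(L_plus, L_minus, L_int_ev, L_int_dis,
                    alpha=0.5, lam_ev=1.0, lam_dis=1.0,
                    beta=0.0, gamma=None, eps=1e-3):
    """Evidence hinge + distractor penalty (+ optional swap-sensitivity term).

    L_plus, L_minus:      clean / corrupted gold log-likelihoods (scalar tensors)
    L_int_ev, L_int_dis:  intervened log-likelihoods with the evidence / distractor
                          contributions at f patched from the clean run
    """
    gap = torch.clamp(L_plus - L_minus, min=eps)              # stabilized denominator
    ceu_ev = torch.clamp((L_int_ev - L_minus) / gap, 0.0, 1.0)
    ceu_dis = torch.clamp((L_int_dis - L_minus) / gap, 0.0, 1.0)

    loss = lam_ev * torch.clamp(alpha - ceu_ev, min=0.0)      # evidence hinge
    loss = loss + lam_dis * ceu_dis                           # distractor penalty
    if beta > 0:
        swap = -(L_plus - L_minus) if gamma is None \
               else torch.clamp(gamma - (L_plus - L_minus), min=0.0)
        loss = loss + beta * swap                             # swap-sensitivity term
    return loss

# Per-example total: task NLL on (q, D+) plus the regularizer, e.g.
# total = -L_plus + cer_regularizer(L_plus, L_minus, L_int_ev, L_int_dis, ...)
```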
We implement CER with a constant number of forward evaluations per example. For each minibatch, we (i) sample corruptions D− = 𝒞(D+), (ii) compute L+ and cache the clean activation(s) at~f, (iii) compute L−, and (iv) run up to k intervened passes to estimate CEUf, ev and CEUf, dis (and optionally additional probes such as random-span patches to sanity-check localization). A typical configuration uses k ∈ {1, 2, 3}, corresponding to evidence-only, distractor-only, and (optionally) a control patch.
Relative to standard training (one forward/backward on (q, D+)), CER requires one additional forward on (q, D−) and O(k) intervened forwards to estimate the CEU terms. Thus the forward-pass count per example is approximately (2 + k), with the backward pass taken only through the loss actually optimized (i.e., we backpropagate through the intervened runs as needed to shape the parameters that control hf and its downstream effects). In wall-clock terms, the overhead factor is close to (2 + k) for compute-dominated regimes, plus a small constant associated with activation capture and replacement at module boundary~f. The memory overhead is dominated by storing the clean activations at~f for each probed span/port; this is O(dim (hf)) per probe per example (or per microbatch when using activation checkpointing). When memory is constrained, we can recompute the clean activation on demand, trading memory for an extra forward pass; this does not change the asymptotic constant-factor accounting and keeps the intervention budget explicit.
We require benchmarks in which (i) evidence is explicitly localized so that evidence-restricted interventions are well-defined, (ii) corruptions can remove or swap the supporting fact while approximately preserving surface cues, and (iii) evaluation can separate causally supported answering from unsupported answering, as measured by CEUf, ev and related sensitivity diagnostics. We therefore provide a two-part suite: a synthetic binding-with-retrieval task with guaranteed evidence localization, and a lightweight factual QA task augmented with controlled retrieval corruption.
The synthetic component is designed to be mechanistically
unambiguous: the gold answer is a deterministic function of a short
evidence span inserted into the retrieved context, and the query is
constructed to be uninformative without retrieval. Concretely, for each
instance we sample (a) a query template, (b) a key–value binding, and
(c) a retrieval bundle D+ consisting of m short passages. Exactly one
passage contains the evidence item e in a designated format,
e.g.,
$$
\texttt{ID: \#742 \;\; KEY: k\_9 \;\; VALUE: v\_3}
$$
and the query asks for the value associated with a key (or the key
associated with a value) under an ID constraint. We then define y* as the unique value
implied by the evidence line. This construction yields a latent evidence
item e that is also directly
labelable as a token span E(D+) by design
(e.g., the token indices of the evidence line itself).
To prevent shortcut learning from superficial artifacts, we randomize (i) passage order, (ii) distractor passage formats (matching punctuation and token lengths), and (iii) lexical overlap between distractors and the query. We additionally include hard distractors that contain plausible competing bindings with mismatched IDs or keys. The resulting task admits clear failure modes: a model that answers from a query-only prior is near-chance, a model that keys on spurious lexical overlap between query and distractor passages will be brittle under swaps, and a model that truly reads the evidence should exhibit high CEUf, ev.
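A sketch of an ATR-insert instance generator under the format shown above; the vocabulary sizes, query phrasing, and dictionary layout are illustrative choices.

```python
import random

def make_atr_insert_instance(num_passages=5, num_keys=20, num_values=20, rng=None):
    """Generate one synthetic example: num_passages short passages, exactly one
    of which contains the evidence binding resolved by the query."""
    rng = rng or random.Random()
    fmt = "ID: #{iid} KEY: k_{key} VALUE: v_{value}"

    gold_key, gold_value = rng.randrange(num_keys), rng.randrange(num_values)
    gold_id = rng.randrange(1000)
    evidence_line = fmt.format(iid=gold_id, key=gold_key, value=gold_value)

    passages = [evidence_line]
    for _ in range(num_passages - 1):
        # Distractor bindings: identical format, but a key different from the
        # queried one, so only the evidence line resolves the query.
        key = rng.choice([k for k in range(num_keys) if k != gold_key])
        passages.append(fmt.format(iid=rng.randrange(1000), key=key,
                                   value=rng.randrange(num_values)))
    rng.shuffle(passages)                       # randomize passage order
    evidence_idx = passages.index(evidence_line)

    return {"query": f"What VALUE is bound to KEY k_{gold_key} under ID #{gold_id}?",
            "answer": f"v_{gold_value}",
            "passages": passages,
            "evidence_idx": evidence_idx}
```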
We implement a controlled corruption operator 𝒞 for ATR-insert in three modes (evidence swapping, distractor injection, and evidence deletion), each producing a D− that preserves surface statistics. Because E(D+) is known exactly, we can compute CEUf, ev by restricting interventions to the evidence span, and CEUf, dis by restricting to distractor spans of identical format. This enables fine-grained sanity checks, including control patches to random non-evidence spans, for which we expect near-zero restoration.
The second component targets factual question answering with natural language evidence, while retaining controllability over corruption and evaluation. We begin with a set of short, single-hop questions (e.g., entity–attribute queries) for which we can retrieve one or more passages containing a single sentence that supports the answer. Each example is stored as (q, y*, D+) along with an evidence span E(D+) obtained either from dataset-provided annotations when available or from a deterministic matching procedure (e.g., locating an answer-bearing sentence containing the answer string and a query entity, then expanding to a minimal clause boundary). We restrict to instances where (i) the answer is extractive or can be normalized to a short canonical string, and (ii) the evidence span is short relative to the retrieved context, so that span-restricted interventions are meaningful.
We define corruption operators that preserve fluency and topicality, mirroring the three modes above: swapping the answer-bearing sentence for a stylistically matched sentence about a different entity, injecting distractor sentences with matched style and topic, and deleting the evidence-bearing sentence. We emphasize that these corruptions are applied to the retrieved context rather than to the query, so that the task remains ``answer from retrieval.'' In addition, we parameterize corruption strength (e.g., number of injected distractors, similarity constraints) so that robustness curves can be measured as a function of δ.
We report standard predictive quality metrics alongside mechanistic and causal-sensitivity metrics, with the goal of separating correct answers that are causally supported by retrieval from correct answers produced by priors or artifacts.
We compute exact-match accuracy (and token-level F1 when applicable)
under a fixed answer normalization. We also report negative
log-likelihood on D+,
NLL = − log pθ(y* ∣ q, D+),
to capture calibration-relevant changes not reflected in exact
match.
We compute expected calibration error (ECE) over binned confidence scores, where the confidence for a predicted answer ŷ is pθ(ŷ ∣ q, D) (or the product of token probabilities for multi-token answers). We also report the Brier score for the indicator of correctness when the answer space is small (e.g., ATR-insert classification-style variants). These metrics detect training regimes that increase accuracy by sharpening overconfident priors rather than by improving evidence use.
For each example we compute CEUf, ev and CEUf, dis via restoration fractions from span- or port-restricted interventions at component f, following the definitions in Section~. We summarize by reporting the distribution (mean, median, and lower quantiles) and by conditioning on correctness, since the relevant failure mode is high-confidence correctness with low CEUf, ev.
We report two swap-based sensitivity measures. First, the clean–corrupted gap
ΔL = L+ − L− = log pθ(y* ∣ q, D+) − log pθ(y* ∣ q, D−),
which should be positive and large when the evidence is swapped or removed. Second, we report the evidence-restoration and distractor-restoration gaps:
Δev = Levint − L−, Δdis = Ldisint − L−,
which diagnose whether restoring evidence through f repairs the model, and whether
restoring distractors incorrectly repairs it. For ATR-insert we
additionally evaluate robustness under increasing distractor rate δ, plotting accuracy and CEUf, ev as functions of
δ and context length n.
Finally, we define an operational notion of ``causally supported'' answering: an answer is counted as supported if it is correct and satisfies CEUf, ev ≥ τ for a chosen threshold τ (e.g., τ = 0.5). Reporting the supported-answer rate prevents models that achieve high accuracy with systematically low mediation from being scored favorably.
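Operationally, the supported-answer rate is a simple aggregate over per-example records of correctness and evidence CEU; a sketch:

```python
def supported_answer_rate(records, tau=0.5):
    """Fraction of examples whose answer is both correct and causally supported,
    i.e. correct with CEU_{f,ev} >= tau. `records` is an iterable of dicts with
    a boolean `correct` and a float `ceu_ev`."""
    records = list(records)
    if not records:
        return 0.0
    supported = sum(1 for r in records if r["correct"] and r["ceu_ev"] >= tau)
    return supported / len(records)
```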
Together, these benchmarks and metrics provide both a controlled setting (ATR-insert) in which evidence mediation can be verified with minimal ambiguity and a natural-language setting (factual QA) in which controlled corruptions approximate real retriever failures and distractor-heavy contexts.
We design experiments to test the central empirical claims implied by our causal formulation: (i) a model can achieve high predictive performance while exhibiting low causal mediation of retrieved evidence, and (ii) the proposed CER objective increases causal evidence mediation at a designated retrieval interface f while improving robustness to evidence-preserving distractors and sensitivity to evidence swaps. All experiments are conducted on the benchmark suite in Section~ with corruption operators 𝒞 and evidence labels E(D+) when available.
We vary five factors: (a) retrieval interface choice and probe location f, (b) model family, (c) retriever quality (miss probability η and ranking noise), (d) distractor rate δ, and (e) total context length n. Our primary hypotheses mirror claims (i) and (ii) above: that high accuracy can coexist with low causal mediation in the absence of causal regularization, and that CER increases CEUf, ev while improving distractor robustness and swap sensitivity. We additionally test whether these effects depend on architectural inductive bias (attention vs. SSM) and on the degree to which retrieval is fused into the computation graph.
We instantiate multiple explicit retrieval interfaces and define a corresponding component f for interventions, following the designs enumerated in the problem setup (cross-attention readouts over retrieved tokens, late-fusion passage aggregation, memory read ports, and compression-module outputs). For each interface, we compare (i) training without causal regularization, (ii) CER with evidence-only regularization, and (iii) CER with both evidence and distractor terms. We also vary the intervention granularity: span-restricted (tokens in E(D+)) versus port-restricted (a fixed subset of retrieval read channels). The goal is to verify that improvements in CEUf, ev persist across reasonable definitions of f and are not an artifact of a particular hook.
To evaluate whether causal evidence usage is architecture-dependent, we train parameter-matched models across three architecture classes, including attention-based and SSM-style designs. We keep training compute approximately constant and report performance as a function of regularization weight λ and probe budget k. This isolates whether CER is effective when retrieval is consumed through different computational primitives.
We construct retriever-quality regimes that vary η and the prevalence of near-miss distractors. We evaluate whether models trained with CER degrade more gracefully as η increases, and whether the supported-answer rate tracks the bound structure suggested by Theorem~3. Empirically, we test for a monotone relation between retriever quality and the fraction of correct answers that remain causally supported by retrieved evidence.
We measure robustness under increasing distractor rate δ and context length n by expanding D with additional distractor passages or sentences while preserving evidence presence. For each δ and n, we report (i) accuracy on D+, (ii) causal metrics CEUf, ev and CEUf, dis, and (iii) swap-based sensitivity under 𝒞. The key diagnostic is whether CER-trained models maintain both accuracy and evidence mediation as irrelevant context grows. We also test a control in which D− and D+ have identical token counts, ensuring that any sensitivity is to evidence identity rather than to length or formatting artifacts.
For each configuration we train with fixed seeds, report mean and variance across runs, and sweep λ and the CEU target α in the hinge term. We track the Pareto frontier between clean accuracy and causal support, operationalized as the supported-answer rate at several thresholds τ. To confirm that CER does not merely induce brittleness, we include a control corruption condition (evidence paraphrase) and require that ΔL remains small under paraphrase while remaining large under evidence swaps. Finally, we include a compute accounting ablation varying k ∈ {1, 2, 3} to quantify how much causal auditing and training benefit is obtained per additional intervened forward pass.
We isolate several limitations of our current formulation and outline extensions that appear technically natural within the structured intervention view.
Our metric CEUf
is defined relative to a designated component f and (when available) a designated
evidence span E(D+). This is
well-suited to single-span factual questions and synthetic binding
tasks, but it does not, by itself, resolve settings where correctness
requires multiple disjoint evidence items, possibly spread across passages and
requiring compositional reasoning (multi-hop QA, comparison, temporal
aggregation).
A first extension is to generalize the evidence notion from a single
span to a set of spans {ei}i = 1, …, r
and define a group attribution score via grouped interventions:
$$
\mathrm{CEU}_{f,\mathrm{group}} := \frac{\log p_\theta^{f\leftarrow
f^*}(y^*\mid q, D^-, D^+) - \log p_\theta(y^*\mid q, D^-)}{\log
p_\theta(y^*\mid q, D^+) - \log p_\theta(y^*\mid q, D^-)}\,,
$$
where f* is
computed from a context in which all required evidence spans are
simultaneously restored.
However, this grouped score collapses the structure of how evidence
items interact. If we wish to distinguish disjunctive evidence (either span suffices) from conjunctive evidence (both spans needed), we must estimate
interaction terms, e.g. by Shapley-style or inclusion–exclusion
decompositions over evidence subsets. This introduces an exponential
dependence on the number of evidence sources unless we assume restricted
forms (e.g. bounded-order interactions) or accept stochastic subset
sampling. Under the compute constraints we impose (bounded k), the practical compromise is to
report (i) marginal CEU per evidence span and (ii) a small number of
pairwise interaction probes for a fixed budget. This yields an auditing
signal but not a complete identifiability result for arbitrary
multi-evidence support.
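One simple instantiation of such a pairwise probe, stated here only as an example of the inclusion–exclusion idea, is the interaction
$$
I_{f,\{i,j\}}
\;=\;
\mathrm{CEU}_{f,\,E^{(i)}\cup E^{(j)}}
\;-\;
\mathrm{CEU}_{f,\,E^{(i)}}
\;-\;
\mathrm{CEU}_{f,\,E^{(j)}},
$$
where strongly positive values suggest conjunctive evidence (neither span restores the answer alone) and small or negative values alongside large marginals suggest disjunctive use; each such probe costs one additional intervened forward pass inside the budget k.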
Our two-channel decomposition explicitly allows a parametric term
ℓpar(q),
and we treat high CEUf, ev as a mechanistic
desideratum when evidence is present. This does not preclude the model from also knowing y* parametrically, and
indeed in many domains we expect substantial parametric knowledge. Two
issues follow.
First, CEUf is not
a correctness guarantee: it measures mediation through a designated
retrieval pathway, not semantic truth. Second, if y* is strongly supported
by priors, then even correct evidence may contribute only a small
incremental logit, making CEU small
while the model remains accurate and non-hallucinatory in a semantic
sense. We can respond in two ways. (i) We can calibrate evaluation to
emphasize evidence-conditional identification by selecting queries where
priors are weak or by constructing contrast sets where parametric
knowledge is uninformative (e.g. counterfactual entity names, synthetic
keys). (ii) We can modify the objective to encourage on retrieval when
retrieval is available, for example by adding a penalty on answering
confidently under evidence removal:
ℓno-ret = max (0, log pθ(y* ∣ q, ∅) − β),
or by introducing a ``selective abstain'' output that is rewarded when evidence is missing. These changes alter the product specification: the desirable behavior is no longer ``answer whenever possible'' but ``answer when supported.'' In deployments where parametric knowledge is acceptable, we should interpret CEU as an auditing signal rather than a universal requirement.
A practical limitation is that interchange interventions require
access to activations hf(q, D)
at a module boundary. Many efficient inference/training stacks fuse
attention, MLPs, and normalization into kernels where intermediate
values are not materialized, and retrieval augmentation may be
implemented via custom CUDA ops or external memory systems.
There are at least three engineering-level mitigations: (i) choose f at an accessible boundary
(e.g. residual stream before/after a block), (ii) implement a ``shadow''
hook that recomputes the relevant subgraph in unfused form for auditing
only, or (iii) instrument the kernel to expose the read vector or
attention output for a small fraction of steps. Each option incurs a
different overhead–faithfulness trade-off. Conceptually, our formalism
does not require f to be
uniquely defined; it requires only that f capture the retrieval-to-answer
channel of interest. Nevertheless, the empirical value of CEUf depends on how well
the chosen boundary aligns with the actual causal bottleneck. A useful
extension is to treat f as a family of candidate sites and to report a profile {CEUfℓ}ℓ
across layers/ports, thereby reducing the risk that we certify the wrong
interface.
Our interventions operate at the level of internal activations and evidence spans, which suggests a natural connection to causal abstraction: we seek a high-level variable ``evidence content'' whose influence on the answer is mediated through a designated mechanism. From this perspective, CEUf is a restricted mediation estimate under a particular abstraction map (tokens ↦ retrieval states ↦ answer logits). A limitation is that the abstraction is currently specified by hand (via E(D) and f). An extension is to learn the abstraction map jointly, e.g. by learning a sparse selector over retrieved tokens/ports that maximizes restoration while minimizing selected mass, subject to compute constraints. However, the hardness discussion already implies that exact minimal selection is intractable in general, so we expect only approximate abstractions, and we should evaluate them by stability across corruptions and by agreement with human-labeled evidence where available.
Our experiments focus on single-step retrieval-augmented generation
with fixed D = R(q).
Tool-augmented agents introduce two additional causal pathways: (i)
actions influence future observations (queries issued to tools) and (ii)
intermediate chain-of-thought or scratchpad states may route information
in ways not localized to a single f. Extending our approach requires
deciding what counts as ``evidence’’ (tool outputs, web snippets,
database rows) and what the relevant interface is (the memory write/read
boundary, the tool-call result embedding, or the planner state).
A direct extension is to define CEU over tool-call transcripts: we treat the agent as
inducing a distribution over tool transcripts T, and we measure whether
swapping/removing a transcript segment restores or destroys answer
likelihood through the designated read component. This suggests an
auditing protocol for 2026-era systems: certify not only that an answer
depends on retrieved text, but that it depends on the evidence-bearing transcript segments rather than on
spurious correlations in the prompt or on parametric recall. The main
open problem is controlling confounding introduced by the agent’s policy
(tool choices depend on the model state). Addressing this likely
requires randomized tool-response corruptions and policy-aware
estimators, which we leave as an explicit direction.
We have argued that retrieval augmentation is not, by itself, a guarantee of evidence-conditioned behavior. A retrieval-augmented language model can achieve high task accuracy while routing decisive information through channels that are only weakly coupled to the retrieved evidence (e.g. parametric priors, prompt heuristics, or distractor-sensitive features). For this reason, we treat causal evidence usage—in the narrow sense of counterfactual intervention at an explicit retrieval interface—as a primary design and evaluation primitive rather than a post hoc interpretability add-on.
Our central object is the causal evidence usage score CEUf, defined by
restoration under interchange interventions at a designated retrieval
read component f.
Operationally, CEUf answers the question: how much of the corruption-induced loss in gold-answer likelihood is recovered when only the designated retrieval read component is restored to its clean state? The definition is intentionally local: it is tied to a chosen interface f and (when available) designated evidence spans or ports. This locality is a feature. It yields a computable signal with constant-factor overhead (Theorem~4), and it permits targeted claims of the form ``the model uses retrieval through this mechanism'' rather than global, ill-posed statements about ``using evidence.''
We then introduced CER, a training objective that treats mechanistic support as a constraint. The regularizer simultaneously (i) enforces a minimum evidence-mediated restoration CEUf, ev ≥ α on supported examples and (ii) penalizes restoration attributable to distractor spans. The resulting training signal is not merely correlational: it is defined in terms of counterfactual intervention, and thus it aligns directly with the causal question that downstream users care about (whether the output is supported by retrieved evidence). Under the additive logit model, the restoration fraction corresponds to an identifiable share of the evidence-dependent logit difference (Theorem~1). Although this stylized model is not a literal description of modern transformers, it provides a tractable hypothesis class in which the metric is interpretable and the regularizer has a clear semantic target.
The accompanying bounds motivate why this style of audit should be considered a primitive for robust RAG. Theorem~2 formalizes a lower-bound phenomenon: if a system exhibits low causal mediation of evidence through the retrieval interface, then distractor injection processes exist that degrade accuracy by a constant amount, even when evidence is present and superficially indistinguishable from distractors. Conversely, Theorem~3 gives an upper bound on unsupported answering under retriever noise: when the retriever miss probability is η and the model satisfies a nontrivial mediation constraint together with a swap-sensitivity margin, the hallucination probability (with respect to the retrieved context) is controlled by a function that separates error from retriever misses (η) and error from weak evidence dependence (captured by α and the margin γ). The point is not that these constants are sharp in practice, but that the qualitative dependency is the correct one: without an explicit mechanism-level constraint, there is no reason to expect monotone improvements in evidence-conditioned behavior as we scale models or retrievers.
From an engineering perspective, the conclusion is that we should treat CEU-style audits analogously to unit tests for a software interface. In a 2026 deployment, one does not merely evaluate end-task accuracy on a benchmark. One also specifies (a) a corruption family 𝒞 that preserves superficial cues while altering the underlying evidence identity, (b) an interface site f that represents the intended retrieval-to-answer bottleneck, and (c) acceptance thresholds on evidence restoration and distractor suppression. Concretely, for a chosen supported-query slice, we can maintain dashboards of (i) the distribution of CEUf, ev, (ii) the distribution of CEUf, dis, and (iii) swap-sensitivity margins log pθ(y* ∣ q, D+) − log pθ(y* ∣ q, D−). These are not substitutes for accuracy; rather, they are orthogonal observables that diagnose failure modes that accuracy alone cannot detect.
We also emphasize a product-level implication. If the intended specification is ``answer whenever possible,'' then strong parametric knowledge is acceptable and low CEU may be tolerable. If the intended specification is ``answer only when supported by retrieved evidence,'' then mechanistic mediation becomes a requirement, and it is natural to couple CER-style training with abstention or confidence penalties under evidence removal. The virtue of CEU is that it makes these specification choices explicit and measurable at the mechanism boundary, rather than leaving them implicit in loosely phrased desiderata about ``groundedness.''
Finally, we view the main conceptual contribution as a shift from behavioral evaluation to mechanistic evaluation. Retrieval augmentation introduces a distinguished causal pathway, and we can interrogate that pathway with controlled counterfactuals. The resulting metrics and objectives are compute-feasible (constant additional passes), compatible with black-box probability queries plus white-box hooks at f, and extensible to richer settings by expanding the intervention family. We do not claim that a single scalar CEUf certifies truthfulness or completeness; rather, we claim that mechanistic audits provide a stable intermediate target between unverifiable semantic criteria and purely behavioral benchmarks.
In summary, we advocate the following stance for RAG systems: (i) define the retrieval interface as an explicit causal bottleneck, (ii) enforce evidence mediation and distractor suppression during training when the product specification requires support, and (iii) report intervention-based audit statistics as first-class evaluation outputs. This stance converts ``grounding’’ from an aspirational property into a testable, optimizable constraint, thereby making evidence-conditioned behavior an object of routine model design rather than an after-the-fact hope.