Retrieval-augmented generation (RAG) is commonly presented as a behavioral fix for factuality: we attach a retriever to a language model, provide relevant passages at inference time, and expect the model to condition on those passages when producing an answer. Empirically, however, many prominent failures of RAG are not well described as mere insufficiency of retrieved information. Rather, they present as routing failures: the system has access to a channel that contains the needed evidence, yet the final prediction is routed through a different computational pathway that is only weakly constrained by that evidence. In such cases the output may be fluent and even locally consistent with the retrieved text, while still being unsupported by it, or it may contradict retrieved evidence while remaining stable under retrieval perturbations. These phenomena suggest that improving RAG requires not only better retrieval and better decoding, but also explicit guarantees that the model uses retrieved evidence in a causal sense.
Two observations motivate this work. First, standard accuracy metrics on question answering with retrieval conflate at least three distinct regimes: (i) supported answering where the model reads and applies the retrieved evidence; (ii) unsupported answering where the model relies on parametric memory or priors (and may coincidentally be correct); and (iii) spurious answering where the model is driven by superficial cues in the retrieved passages (formatting, lexical overlap, distractor entities) rather than the intended evidence. Purely behavioral evaluations (e.g. exact match on held-out questions) cannot disentangle these regimes, because the label does not specify the causal pathway. A model that answers correctly by parametric recall may score identically to a model that reads the passage. Conversely, a model that is correct on average may remain brittle under small changes to retrieval content, precisely because its computation is not anchored to the evidence-bearing tokens.
Second, several practical mitigation strategies in RAG implicitly assume that the model routes computation through retrieval: increasing context length, adding citation prompts, ranking passages more carefully, or training on instruction-following. These methods may improve outcomes, yet they do not certify the intended causal dependence. Indeed, we often observe cases where providing retrieval context increases hallucination: additional distractors introduce high-salience but irrelevant spans, and the model becomes less sensitive to the true evidence. This is consistent with the view that the model implements a latent mixture of mechanisms, only one of which corresponds to direct evidence use, and that the mixture weights can shift as the surface statistics of the retrieved context change.
Our central claim is that RAG should be audited and trained at the level of internal causal pathways, not merely at the level of input–output behavior. Concretely, we regard the retrieval interface—cross-attention over retrieved tokens, a memory read vector, or an explicit compression module—as a designated component through which evidence must flow in supported examples. If the prediction is truly evidence-based, then intervening on this interface should have a controlled and predictable effect: removing or corrupting evidence should reduce the model’s likelihood of the correct answer, and restoring the interface state corresponding to the clean evidence should restore that likelihood. Conversely, swapping in interface states corresponding to distractors should not restore the correct answer likelihood. These desiderata are inherently counterfactual; they refer to what the model would have done under controlled changes to evidence while holding other factors fixed.
To operationalize this idea we draw on interchange interventions, which have emerged as a useful tool for mechanistic evaluation in neural models. The interchange principle is simple: we identify an internal activation (or set of activations) that is hypothesized to carry a certain information channel, and we replace it with the activation obtained under a different input that isolates that channel. In the retrieval setting, the natural interchange target is the retrieval read component at the answer position(s). Given a clean retrieved context and a corrupted one, we can run the model on each, record the retrieval-interface activations, and then run an intervened forward pass that uses the corrupted input while substituting in the clean retrieval activation. If the model’s probability of the correct answer is substantially restored by this substitution, then the retrieval interface is causally mediating the prediction; if not, then the prediction is being computed largely elsewhere (e.g. by parametric memory, or by features of the corrupted context that bypass the intended evidence).
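To make the operation concrete, the following is a minimal sketch of an interchange intervention implemented with PyTorch forward hooks. The model, the choice of module, and the assumption that the recorded activation has the same shape in both runs (e.g. a read vector at a fixed answer position) are illustrative, not prescriptive.

```python
import torch

def record_activation(model, module, inputs):
    """Run the model on `inputs` and cache the output of `module`."""
    cache = {}

    def save_hook(mod, hook_in, hook_out):
        cache["act"] = hook_out.detach()

    handle = module.register_forward_hook(save_hook)
    try:
        with torch.no_grad():
            model(**inputs)
    finally:
        handle.remove()
    return cache["act"]

def interchange_forward(model, module, inputs, donor_act):
    """Forward pass on `inputs` with the output of `module` replaced by
    `donor_act`, which was recorded on a different input. Returning a value
    from a forward hook overrides the module's output."""

    def patch_hook(mod, hook_in, hook_out):
        return donor_act  # assumes donor and recipient activations share a shape

    handle = module.register_forward_hook(patch_hook)
    try:
        with torch.no_grad():
            out = model(**inputs)
    finally:
        handle.remove()
    return out
```

In the retrieval setting, `module` would be the retrieval read component discussed below, the donor input would provide the clean context, and the recipient input the corrupted one.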
This causal perspective also clarifies the relationship between RAG and two mechanisms that have been widely discussed in transformer interpretability: direct retrieval from context and induction. In the direct retrieval regime, the model computes an answer by attending to specific tokens in the provided context and copying or transforming them. In the induction regime, the model uses patterns in context (e.g. repeated key–value mappings) to infer a rule and apply it, potentially without relying on a particular evidence span. Both mechanisms can produce correct outputs, but they differ in their sensitivity to evidence swaps and in their failure modes under distractors. A model that has learned a strong parametric association between a query pattern and an answer may behave like an ``induction-only'' system even when retrieval is present, treating the retrieved text as optional decoration. Such a model is especially dangerous in RAG deployments because it can appear grounded (the retrieved context contains relevant content) while its computation ignores that content.
We therefore frame the RAG problem as one of enforcing and measuring a desired routing constraint: on supported queries, a nontrivial fraction of the answer likelihood should be mediated by the retrieval interface and, more specifically, by the evidence-bearing tokens. This leads to a metric that we can audit per example and aggregate over distributions: a score derived from restoration under interchange interventions. The metric is not intended to replace accuracy; rather, it refines accuracy into a mechanistically meaningful statement about why the model is correct when it is correct, and how it fails when it fails. It also provides an immediate diagnostic for common pathologies: (i) high accuracy but low causal usage (parametric memorization masquerading as grounding), (ii) high sensitivity to distractors (spurious routing through non-evidence features), and (iii) insensitivity to evidence identity (retrieval presence is used as a generic signal, but not the specific evidence).
Finally, this viewpoint suggests a training principle: if we can compute a causal usage score with a small number of additional forward passes, then we can regularize the model to increase evidence-mediated routing and suppress distractor-mediated routing. This is conceptually analogous to contrastive learning, but the contrast is imposed at the level of internal causal channels rather than at the level of representations or outputs alone. The remainder of this paper makes this precise: we define the restoration-based causal metric at a designated retrieval component, we introduce a corruption process that preserves surface cues while swapping evidence, and we propose a training objective that encourages the model to depend on evidence in the required mechanistic sense. The resulting framework is intended to be both auditable and actionable: it produces per-example causal diagnostics and, under explicit assumptions, yields distributional bounds on unsupported answering under retriever shift.
Our use of interchange interventions places this work in the broader program of causal analysis of neural networks, where one evaluates candidate internal variables by intervening on their values and measuring downstream effects. The basic operation is to run two forward passes on distinct inputs, record an activation at a designated site, and then perform a third forward pass in which the activation is patched (replaced) from one run into the other. This family of methods appears under several names—activation patching, causal tracing, representation swapping, and interchange interventions—and is closely related to the causal abstraction viewpoint in which internal states are treated as endogenous variables in a structural causal model over the computation graph. A key methodological advantage is that these interventions can be implemented with lightweight hooks at module boundaries (e.g. attention outputs, MLP pre-activations, or designated read vectors) and do not require training a separate explainer model.
Among patching-based methods, restoration-based scores are particularly natural in retrieval settings. Fix an input pair (q, D+) and (q, D−) that differ in the evidence-bearing content, and fix a component f that we hypothesize carries the retrieval-mediated signal into the answer computation. We may compare (i) the model’s probability of the gold answer on the clean context, (ii) the probability on the corrupted context, and (iii) the probability on the corrupted context under an intervention that substitutes the clean activation at f. The resulting restoration fraction can be interpreted as a measure of how much of the model’s preference for the correct answer is mediated by the intervened component, relative to the preference change induced by the corruption. This idea is conceptually aligned with mediation analysis: the corruption changes both the value of the proposed mediator and possibly other features, and the interchange intervention isolates the mediated pathway by holding the rest of the computation fixed. In practice, restoration-based scores have been used to localize information flow for factual recall and to identify attention heads and layers implicated in particular behaviors; we adopt the same operational stance but tailor the intervention site to the explicit retrieval interface of a RAG system.
A recurring difficulty in auditing retrieval usage on natural QA is that real corpora entangle evidence with confounds such as lexical overlap, answer frequency, and topical priors. For this reason, synthetic tasks have been used to study whether transformers can implement content-addressable retrieval and whether attention mechanisms act as retrieval primitives. In particular, associative-style recall tasks instantiate a set of key–value bindings placed in context and query the value for a key; attentive retrieval variants control where and how the relevant binding appears, often inserting distractor bindings whose surface statistics closely match the true one. These settings permit clean causal tests: one can swap the evidence binding while keeping formatting, entity types, and positional cues fixed, thereby distinguishing models that truly route through the intended binding from models that rely on shortcuts (e.g. positional heuristics, spurious token correlations, or query-only priors). Such tasks have also been used to study ``induction'' behavior, where models generalize a mapping pattern rather than copying a specific span. For our purposes, the main lesson is methodological: carefully controlled evidence swaps and distractor injections are necessary to prevent trivial detection of corruption and to ensure that any measured restoration reflects evidence identity, not merely retrieval presence.
A large literature addresses grounding and faithfulness in retrieval-augmented generation, typically at the level of input–output behavior. Common approaches include: prompting the model to cite sources, training with supervised rationales, reranking passages to improve evidence quality, and post-hoc verification via entailment models or secondary checkers. Evaluation protocols often measure whether generated answers are supported by retrieved text (attribution accuracy), whether citations point to relevant spans, or whether the answer is stable under minor retrieval perturbations. While these methods are valuable, they do not by themselves establish that the model’s computation uses evidence in a causally mediated sense: a model may generate plausible citations without conditioning strongly on them, or it may answer correctly from parametric memory even when evidence is present. Conversely, behavioral sensitivity tests can be ambiguous when the perturbation changes many correlates at once. Our position is that mechanistic auditing complements behavioral faithfulness: by intervening at the retrieval read path we can quantify whether the retrieved tokens have a causal route into the answer, and we can separate evidence-mediated effects from spurious correlations carried by other parts of the network.
Retrieval augmentation is often implemented by concatenating passages to the query and relying on standard self-attention. However, many modern systems introduce explicit retrieval interfaces that mediate information flow from retrieved text into generation: cross-attention blocks over external memory, late-fusion architectures that aggregate passage-wise representations, and retrieval-to-generation adapters. Architectures in the RETRO family and related nearest-neighbor augmentation schemes similarly impose a distinct channel through which retrieved neighbors influence decoding. These design choices are relevant because they expose natural intervention sites: rather than patching arbitrary internal activations, we may patch the read vector(s) of a designated retrieval module, which is precisely the component intended to carry retrieved evidence.
Long-context transformers and memory-augmented models further complicate the notion of ``retrieval,'' since information may enter the answer computation via learned summaries, compressed representations, or persistent memory tokens rather than direct attention to raw retrieved spans. Recurrent memory mechanisms, segment-level recurrence, and learned compressive modules create an explicit bottleneck through which past context is distilled; similarly, architectures that allocate special memory tokens (or latent arrays) act as a structured interface between high-volume context and the decoding stream. From a causal standpoint, these interfaces are attractive because they define a small set of ports whose activations summarize the evidence pathway. At the same time, they introduce a failure mode: compression may preserve superficial cues while discarding the discriminative fact needed for a particular query, yielding a system that appears to have ``seen'' the evidence but cannot causally use it at answer time. This motivates auditing not only raw-token attention patterns but also the internal memory readout and compression outputs, treating them as candidate f sites for interchange interventions.
The common thread across these lines of work is the need to distinguish the presence of information from its causal use. Interchange interventions and restoration attribution provide an operational tool for this distinction; synthetic AR/ATR-style tasks provide controlled environments where evidence swaps are well-defined; and the diversity of retrieval and memory interfaces in modern architectures motivates defining evidence usage at explicit read ports rather than at the level of output behavior alone. We synthesize these ideas by proposing a retrieval-specific restoration metric and a training-time regularizer that acts directly on the designated retrieval channel.
We formalize retrieval-augmented prediction as inference under a partially observed evidence variable, with explicit control over both the retriever and the retrieved context. Let 𝒟 be a distribution over triples (q, y*, e), where q is a query, y* is the gold answer, and e is a latent evidence item (typically a short span or atomic fact) that suffices to justify y* given q. We assume that e is well-defined up to an equivalence class of paraphrases; in synthetic settings e may be an exact span, whereas in natural QA e may correspond to a minimal supporting sentence. When available, we represent an evidence label inside a retrieved context by a token index set E(D) ⊆ {1, …, |D|} indicating the location(s) of e within D.
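For concreteness, a minimal sketch of the data record this setup assumes follows; the field names are illustrative rather than a fixed schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RagExample:
    """One draw (q, y*, e) together with a retrieved context D.

    `evidence_token_indices` plays the role of E(D): token positions inside the
    concatenated retrieved context that carry the evidence item e; it is left
    empty when span labels are unavailable."""
    query: str                  # q
    gold_answer: str            # y*
    retrieved_context: str      # D, treated as one concatenated token sequence
    evidence_token_indices: List[int] = field(default_factory=list)    # E(D)
    distractor_token_indices: List[int] = field(default_factory=list)  # optional distractor labels
```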
A retriever R maps a query to a set of passages D = R(q), which we treat either as a multiset of documents or as a single concatenated token sequence. We consider a stochastic retriever (via sampling, approximate search, or nondeterminism in the corpus snapshot) and assume a miss probability bound Pr[e ∉ R(q)] ≤ η over draws (q, y*, e) ∼ 𝒟, where η is an explicit parameter of the environment. For analysis and training we distinguish two regimes: a supported regime in which the retrieved context contains the relevant evidence, and an unsupported regime in which it does not. Concretely, for each example we write D+ for a retrieved context containing e, and we allow that D+ may also contain distractor spans that are irrelevant to y* but similar in surface form. At evaluation time we will consider retriever shift, in which the test-time miss rate and distractor profile differ from those encountered during training.
Given (q, D), the model outputs a conditional distribution pθ(y ∣ q, D) over answers. We allow answers to be sequences (e.g. token strings), but we keep notation at the level of a single output variable y; the log-likelihood of a multi-token answer is interpreted as the sum of per-token log-probabilities under teacher forcing. The total input length (query plus retrieved context and any special delimiters) is denoted by n. We emphasize that the model is not assumed to be purely extractive: it may synthesize y from D and its parametric memory. Our objective, made precise later, is to enforce that when evidence is present the model’s preference for y* is causally mediated by the retrieved evidence through an explicit interface, rather than being primarily driven by query-only priors or by confounded correlates inside D.
We assume the architecture contains a designated retrieval interface component f through which retrieved tokens (or their summaries) influence the answer computation. We use hf(q, D) for the activation at that component on input (q, D), and we require that hf be accessible for recording and replacement at module boundaries. The definition of f depends on the RAG design:
(i) in a two-stream encoder–decoder with cross-attention over retrieved tokens, f may be the cross-attention output vector at the answer position (or the collection of per-head outputs) in a specified layer;
(ii) in late-fusion or passage-wise aggregation, f may be the pooled passage representation or the mixture weights over passages;
(iii) in memory-augmented designs with read ports, f may be the memory read vector(s) or key–value retrieval output at each decoding step;
(iv) in long-context compression settings, f may be the output of a compression module or the states of learned memory tokens that summarize the retrieved text.
In all cases we conceptualize f as the intended bottleneck for retrieval-mediated information, so that interventions at f isolate the retrieval pathway more directly than arbitrary internal patching.
To disentangle causal usage of evidence identity from superficial features of retrieval, we introduce a family of corruption operators 𝒞 that map a context D to a corrupted context D− = 𝒞(D). We parameterize 𝒞 by a corruption strength δ controlling how aggressively evidence is altered or how many distractors are injected, and we require that 𝒞 preserve surface statistics to the extent possible (format, length, entity types, positional cues), thereby preventing trivial detection.
We consider two canonical corruption types. Evidence swapping replaces the evidence item e (or its span) with an alternative evidence-like span e′ drawn so that the resulting context remains plausible and stylistically matched. In synthetic key–value tasks, this corresponds to swapping the value paired with a queried key while keeping the key and formatting fixed. In factual QA, this may be approximated by swapping a sentence containing the answer with a sentence about a different entity of the same type. The resulting D− is designed so that pθ(y* ∣ q, D−) should decrease relative to pθ(y* ∣ q, D+) for models that genuinely use e. Distractor injection adds (or replaces spans with) spans that mimic the evidence span’s lexical and structural properties but are independent of y* given q. This corruption targets models that over-rely on shallow cues, such as topical overlap or answer frequency, by ensuring that these cues can be matched without providing the correct fact.
In addition, we allow a deletion corruption that removes evidence-bearing passages entirely, modeling the event e ∉ D and capturing the unavoidable failure mode where the retrieved context is genuinely unsupported.
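As an illustration, the sketch below implements the swap and injection corruptions for a synthetic key–value context. The line format, helper names, and the convention that the caller supplies distractor keys distinct from the queried key are assumptions of this example, not part of the formal definition.

```python
import random
import re

def swap_evidence_value(context_lines, evidence_idx, value_vocab, rng=None):
    """Evidence swap: replace the VALUE field of the evidence line with a
    different value from the same vocabulary, keeping ID, KEY, and formatting
    fixed so that surface statistics are preserved."""
    rng = rng or random.Random()
    lines = list(context_lines)
    line = lines[evidence_idx]
    match = re.search(r"VALUE:\s*(\S+)", line)
    if match is None:
        raise ValueError("evidence line has no VALUE field")
    old_value = match.group(1)
    new_value = rng.choice([v for v in value_vocab if v != old_value])
    lines[evidence_idx] = line[:match.start(1)] + new_value + line[match.end(1):]
    return lines

def inject_distractors(context_lines, line_template, distractor_keys, values, m, rng=None):
    """Distractor injection: append m evidence-like lines whose keys differ from
    the queried key (the caller excludes it), matching the evidence line's format."""
    rng = rng or random.Random()
    lines = list(context_lines)
    for _ in range(m):
        lines.append(line_template.format(iid=rng.randrange(1000),
                                          key=rng.choice(distractor_keys),
                                          value=rng.choice(values)))
    return lines
```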
We distinguish the training distribution, in which D+ is obtained by running (or simulating) the retriever and filtering for support, from the test distribution, in which retrieval quality may degrade. Formally, at test time we may face a shift in (η, δ): the miss rate η may increase (e.g. due to corpus drift or domain shift) and the distractor rate δ may increase (e.g. due to larger candidate sets or noisier retrieval). Our downstream evaluation therefore considers not only task accuracy on (q, D), but also the model’s behavior conditioned on whether D contains evidence. In particular, we will later bound the probability of producing an answer unsupported by the retrieved context under such shift, in terms of η and a mechanistic evidence-usage condition enforced during training.
Finally, we assume access to paired contexts (q, D+) and (q, D−) during training (constructed via 𝒞), and access to activations hf(q, D) for a small number of probed components. This is the only white-box requirement: we neither assume span-level supervision in general nor require a separate verifier. The next section uses this setup to define an intervention protocol that compares behavior across (q, D+), (q, D−), and activation-patched forward passes, yielding a causal evidence usage metric specialized to retrieval interfaces and evidence spans.
We now define an intervention protocol and an associated metric intended to quantify whether the model’s preference for the gold answer y* is causally mediated by retrieved evidence, as opposed to being primarily driven by query-only priors or by non-evidence features of the retrieved context. The central object is an interchange intervention at a designated retrieval interface component f, implemented by recording the activation hf(q, D) on one input and substituting it into a forward pass on another input.
Fix an example (q, y*, D+) in the supported regime, and let D− = 𝒞(D+) be a corruption that removes or swaps the evidence identity while preserving superficial cues. We define three log-likelihoods (single-token or teacher-forced sequence likelihoods as described in the setup):
L+ = log pθ(y* ∣ q, D+), L− = log pθ(y* ∣ q, D−),
and an intervened log-likelihood
Lfint = log pθf ← f*(y* ∣ q, D−, D+),
where pθf ← f* denotes the distribution induced by running a forward pass on (q, D−) but replacing the activation at f by its clean counterpart:
hf(q, D−) ← hf(q, D+).
Operationally, this corresponds to (i) running (q, D+) once to cache hf(q, D+), (ii) running (q, D−) once to obtain baseline behavior and, if needed, tensor shapes, and (iii) running (q, D−) again while patching the cached activation into the module boundary for f.
We quantify how much of the corruption-induced degradation in likelihood can be restored solely by fixing the retrieval interface. Let Δ = L+ − L− be the clean–corrupted log-likelihood gap for y*. Provided Δ is not degenerate, we define the restoration fraction
$$
\mathrm{Attrib}_f
\;=\;
\frac{L^{\mathrm{int}}_f - L^-}{L^+ - L^-}.
$$
In practice we use a stabilized and clipped version,
$$
\mathrm{Attrib}_f^{(\varepsilon)}
\;=\;
\mathrm{clip}_{[0,1]}\!\left(
\frac{L^{\mathrm{int}}_f - L^-}{\max\{L^+ - L^-,\varepsilon\}}
\right),
$$
with ε > 0 chosen to avoid
division by near-zero gaps (e.g. when the corruption has negligible
effect on y*).
Clipping is not part of the idealized definition, but it prevents rare
numerical pathologies where the intervention overshoots the clean
likelihood due to nonlinear interactions.
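In code, the stabilized and clipped score is a direct transcription of the formula above; a sketch:

```python
def restoration_fraction(L_plus, L_minus, L_int, eps=1e-3):
    """Clipped restoration fraction Attrib_f^(eps).

    L_plus:  log p(y* | q, D+)   from the clean run
    L_minus: log p(y* | q, D-)   from the corrupted run
    L_int:   log p(y* | q, D-) with h_f patched from the clean run
    """
    gap = max(L_plus - L_minus, eps)     # stabilized denominator
    frac = (L_int - L_minus) / gap
    return min(max(frac, 0.0), 1.0)      # clip to [0, 1]
```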
The goal of CEU is not merely to attribute the prediction to the presence of retrieval, but to attribute it specifically to the evidence-bearing tokens (or read ports) within the retrieval interface. We therefore refine Attribf by restricting the intervention to a designated subset of retrieval inputs.
Assume we have a token index set S ⊆ {1, …, |D|} in the
retrieved context (e.g. S = E(D+)
for evidence, or S
corresponding to a known distractor span). For many retrieval
interfaces, the activation hf(q, D)
can be decomposed additively or by masking into contributions from
retrieved tokens; for instance, in cross-attention one may view the
per-head output as a weighted sum of value vectors from retrieved
positions. Abstractly, we assume an operator ΠS that extracts
the contribution attributable to retrieved indices S at the interface, with complement
ΠS̄, so
that
hf(q, D) ≈ ΠShf(q, D) + ΠS̄hf(q, D),
where the approximation is exact in common linear readouts
(e.g. attention output before the MLP) and can be implemented by
recomputing the interface with attention masked to S or by masking the retrieved-side
representations prior to the interface. We then define a span-restricted interchange
intervention that replaces only the S-contribution:
ΠShf(q, D−) ← ΠShf(q, D+), ΠS̄hf(q, D−) unchanged.
Let Lf, Sint
be the resulting log-likelihood for y*. The span-restricted
causal evidence usage score is
$$
\mathrm{CEU}_{f,S}
\;=\;
\mathrm{clip}_{[0,1]}\!\left(
\frac{L^{\mathrm{int}}_{f,S} - L^-}{\max\{L^+ - L^-,\varepsilon\}}
\right).
$$
We instantiate the evidence score as CEUf, E(D+) when span labels are available, and define the analogous score for a designated distractor span set.
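The following sketch shows one way to realize the span-restricted patch for a single cross-attention head whose readout is a weighted sum of value vectors. It assumes the clean and corrupted contexts are position-aligned (as they are under in-place evidence swaps) and that per-position attention weights and value vectors from both runs are available; all tensor names are illustrative.

```python
import torch

def span_restricted_patch(attn_clean, V_clean, attn_corr, V_corr, span_idx):
    """Build the span-restricted interchange activation for a single-head
    cross-attention readout h_f = sum_j attn[j] * V[j].

    The S-contribution from the clean run replaces the S-contribution of the
    corrupted run; the complement is left unchanged."""
    S = torch.zeros(V_corr.shape[0], dtype=torch.bool)
    S[span_idx] = True
    clean_S = (attn_clean[S].unsqueeze(-1) * V_clean[S]).sum(dim=0)   # Pi_S h_f(q, D+)
    corr_S  = (attn_corr[S].unsqueeze(-1) * V_corr[S]).sum(dim=0)     # Pi_S h_f(q, D-)
    h_corr  = (attn_corr.unsqueeze(-1) * V_corr).sum(dim=0)           # h_f(q, D-)
    return h_corr - corr_S + clean_S
```

The returned vector is then written into the interface activation during the intervened forward pass, for example via the hook mechanism sketched earlier.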
In architectures with explicit read ports (e.g. multiple memory reads,
passage-mixture heads, or per-layer retrieval adapters), hf naturally
decomposes into a tuple (hf, 1, …, hf, r).
For a subset of ports P ⊆ {1, …, r} we define an
intervention that replaces (hf, i)i ∈ P
from the clean run while leaving the remaining ports unchanged. Denoting
the resulting likelihood by Lf, Pint,
we define
$$
\mathrm{CEU}_{f,P}
\;=\;
\mathrm{clip}_{[0,1]}\!\left(
\frac{L^{\mathrm{int}}_{f,P} - L^-}{\max\{L^+ - L^-,\varepsilon\}}
\right).
$$
This form is convenient when ports have a direct semantic meaning
(e.g. one port per retrieved passage), allowing auditing of whether the
model is causally using the correct passage rather than merely any
retrieved text.
When y* is a
sequence (y1*, …, yT*)
evaluated under teacher forcing, we define
$$
\log p_\theta(y^*\mid q,D)\;=\;\sum_{t=1}^T \log p_\theta(y^*_t\mid
q,D,y^*_{<t}),
$$
and we apply the same restoration fraction to the summed
log-likelihoods. This choice aligns CEU with the training objective
(maximum likelihood) and avoids ambiguities that arise if one uses only
the first token or an end-to-end exact-match metric. If desired, one can
also compute token-wise CEUf, S(t)
by applying the intervention only at decoding step t (or by caching step-specific hf), but we
treat this as an optional finer-grained diagnostic due to increased
compute.
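A sketch of the teacher-forced sequence log-likelihood used here, assuming a HuggingFace-style causal language model and tokenizer; the concatenation convention (prompt followed by gold answer tokens) is an assumption of this example.

```python
import torch
import torch.nn.functional as F

def gold_answer_logprob(model, tokenizer, prompt, answer):
    """Summed log-probability of the gold answer under teacher forcing:
    log p(y* | q, D) = sum_t log p(y*_t | q, D, y*_{<t})."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits                 # (1, T, vocab)
    ans_len = answer_ids.shape[1]
    pred_logits = logits[0, -ans_len - 1:-1, :]          # positions predicting answer tokens
    logprobs = F.log_softmax(pred_logits, dim=-1)
    token_lp = logprobs.gather(1, answer_ids[0].unsqueeze(1)).squeeze(1)
    return token_lp.sum().item()
```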
When evidence spans E(D+) are provided, CEU is directly defined as above. When span labels are unavailable, we require a proxy procedure E(D+) = Ê(q, D+) to nominate a small set of candidate indices for intervention, subject to the compute constraint that only k probes are feasible. Typical choices include: (i) heuristic sentence selection (e.g. the highest-overlap sentence with q), (ii) retriever-provided highlights or passage-level provenance, and (iii) model-internal saliency to tokens (e.g. top-ℓ retrieved tokens by cross-attention mass at the answer position), with the caveat that saliency is not itself causal and is used only to choose where to intervene.
If an example admits multiple disjoint evidence items e(1), …, e(m) (or multiple supporting sentences), we may define E(D+) as their union, or compute per-item scores CEUf, E(i) and aggregate by maxi (capturing whether correct evidence is causally used) or by an average (capturing whether the model spreads causal reliance across supports). The appropriate aggregation depends on whether the task requires a single sufficient fact or multi-hop composition.
By construction, CEUf, S measures a effect: holding the corrupted input fixed, it asks how much the likelihood of y* recovers when only the retrieval-interface contribution attributable to S is restored to its clean value. Evidence CEU is therefore high only when (i) the corruption materially harms the likelihood of y* and (ii) the harm is mediated through the retrieval interface contribution of the evidence span (or evidence ports), rather than through unrelated computation. Distractor CEU provides the complementary check that superficially similar but semantically irrelevant spans do not become dominant causal drivers of the output. These quantities will serve both as (a) per-example diagnostics and (b) training-time regularization targets in the objective we introduce next.
We now analyze a stylized model in which the preference for the gold answer decomposes into a query-only channel and a retrieval-mediated channel. The purpose of the abstraction is (i) to make precise when restoration-based interventions identify the retrieval-mediated contribution and (ii) to connect a lower bound on that contribution to robustness against distractors and sensitivity to evidence swaps, while deriving an upper bound on unsupported answering under retriever noise.
Fix an answer candidate y (in particular y = y*). Let ℓθ(y ∣ q, D) denote the pre-softmax logit for y under teacher forcing at the relevant position(s), or the sum of per-token logits for the gold sequence under a locally normalized factorization. We assume a decomposition
$$
\ell_\theta(y \mid q, D) \;=\; \ell_{\mathrm{par}}(y \mid q) \;+\; \ell_{\mathrm{ret}}\bigl(y \mid h_f(q, D)\bigr),
$$
where ℓpar is independent of D (capturing parametric priors and query-only cues), and ℓret depends on D only through the retrieval interface activation hf(q, D). We further assume that ℓret is (locally) affine in hf, i.e.,
$$
\ell_{\mathrm{ret}}\bigl(y \mid h_f\bigr) \;=\; \langle w_y,\, h_f \rangle + b_y,
$$
which is exact if f is taken to be a linear read vector into the final logit, and is a standard local approximation when f is chosen at a late module boundary (e.g. post-attention, pre-MLP). The role of these two assumptions is to ensure that interchange interventions at f isolate a well-defined additive contribution.
Let D+ denote retrieved context containing a specific evidence item e for (q, y*). A corruption operator 𝒞 produces D− = 𝒞(D+) by swapping or removing the of e while preserving superficial statistics (lexical overlap, formatting, passage length, entity type). Formally, we may posit a surface statistic map σ(⋅) such that σ(D−) = σ(D+) and e ∉ D−, while the substituted span is conditionally independent of y* given (q, σ(D+)). This is the regime in which we intend the clean–corrupted gap to reflect evidence identity rather than retrieval presence.
We relate the restoration fraction at f to the retrieval share of the
evidence-dependent logit gap. Let ℓ+ and ℓ− denote ℓθ(y* ∣ q, D+)
and ℓθ(y* ∣ q, D−),
and let ℓfint
denote the logit under the intervened run that replaces hf(q, D−)
by hf(q, D+).
By the decomposition and affineness assumptions,
ℓ+ − ℓ− = ⟨wy*, hf(q, D+) − hf(q, D−)⟩, ℓfint − ℓ− = ⟨wy*, hf(q, D+) − hf(q, D−)⟩,
so at the logit level the intervention restores the entire evidence-dependent
difference whenever the only difference between D+ and D− that matters for y* flows through hf. The subtlety
is that our metric is defined in terms of (log-)likelihoods after
softmax normalization rather than raw logits. To address this, we assume
bounded logit ranges and local Lipschitzness of log pθ(y* ∣ ⋅)
as a function of the gold logit gap. Under these standard
calibration-style conditions, a first-order Taylor expansion implies
that the restoration fraction computed in log-likelihood space equals
the corresponding restoration fraction in logit space up to a bounded
error term that vanishes as the logit perturbation shrinks. Concretely,
if other logits remain within a bounded interval and the softmax map has
Lipschitz constant Lsm on that interval,
then
$$
\left|\,\mathrm{Attrib}_f \;-\; \frac{\ell^{\mathrm{int}}_f - \ell^-}{\ell^+ - \ell^-}\,\right|
\;\le\;
C\, L_{\mathrm{sm}}\,\bigl|\ell^+ - \ell^-\bigr|
$$
for a constant C determined by the bounded-logit assumption, which yields the identifiability claim: restoration at f recovers the retrieval-mediated
share of the evidence-dependent effect, up to a calibration-dependent
discrepancy. This is the sense in which CEUf, S is an
estimator of causal mediation through the retrieval interface restricted
to S.
We formalize robustness as stability of the correct prediction under injection of distractor spans that match superficial cues. Consider a process that, given (q, D+), produces D′ by adding m distractor spans d1, …, dm sampled so that σ(di) = σ(e) and such that di ⟂ y* ∣ q (i.e., distractors carry no true information about the label beyond query-only features). In the two-channel model, any degradation induced by distractors must arise because hf(q, D′) moves in a direction that the model treats as predictive of some answer, despite the distractors being label-independent.
Suppose 𝔼[CEUf] < α over supported examples. By definition, this means that when we swap evidence identity (moving from D+ to D−), patching hf back restores less than an α fraction of the clean–corrupted likelihood gap on average, so a (1 − α) fraction of the gold preference is mediated by pathways other than the evidence-bearing retrieval contribution (e.g. query-only priors, or non-evidence features of retrieved text). Because distractors are constructed to match surface statistics, we may choose them adversarially to preserve those non-evidence features while altering hf in a way that counteracts the evidence direction. Under mild richness assumptions on the distractor family (that it spans a neighborhood in the space of attainable hf values while maintaining σ), we can construct D′ such that ℓret(y* ∣ hf(q, D′)) decreases by a constant fraction of the evidence advantage while ℓpar(y* ∣ q) remains unchanged. The resulting drop in the gold logit yields a constant-probability error event, giving an accuracy reduction of order Ω(1 − α) on supported queries. The same construction implies a lower bound: if evidence identity is swapped but CEU is small, then the model’s output distribution cannot change sufficiently in the correct direction, because too little of the decision is mediated by the retrieval pathway that the swap affects.
We now connect a CEU lower bound to the probability of producing an
answer unsupported by retrieved evidence when the retriever sometimes
misses. Let η be the miss
probability Pr [e ∉ D], and let δ parameterize additional test-time
distractor corruption. Condition on the event E that evidence is present in the retrieved
set. Assume (i) 𝔼[CEUf ∣ E] ≥ α,
and (ii) evidence swaps reduce the retrieval-mediated gold logit by at
least a margin γ > 0 in
expectation (a property enforced by swap-based contrastive training, but
taken here as an assumption). Under (i) and (ii), the gold logit advantage
attributable to evidence is then at least αγ in expectation. If we
additionally assume a standard logit-to-error calibration bound
(e.g. that the probability of selecting an unsupported alternative
decays exponentially in the gold-vs-best-alternative logit gap), we
obtain
Pr [unsupported output ∣ E] ≤ exp (−c αγ)
for a constant c > 0
determined by the calibration inequality and bounded-logit assumptions;
test-time distractors at rate δ can be absorbed into a reduction
of the effective margin, yielding the same form with γ replaced by γ(δ). Unconditioning and
using Pr [¬E] ≤ η yields
Pr [unsupported
output] ≤ η + (1 − η)exp (−c αγ),
which separates unavoidable hallucinations due to retrieval misses from
those due to insufficient causal usage of present evidence. This bound
is loose but mechanistically interpretable: increasing α tightens the supported-regime
error exponentially in the evidence margin, while improvements in the
retriever (smaller η) reduce
the irreducible term.
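To illustrate the shape of the bound with purely hypothetical constants (not measured values), take η = 0.05, c = 1, and margin γ = 3:
$$
\alpha = 0.6:\;\; 0.05 + 0.95\,e^{-1.8} \approx 0.21,
\qquad
\alpha = 0.9:\;\; 0.05 + 0.95\,e^{-2.7} \approx 0.11,
$$
so raising the enforced mediation level α shrinks the supported-regime term exponentially, while the η floor contributed by retrieval misses is unaffected.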
We now specify a training objective that encourages causal reliance on retrieved evidence rather than on query-only priors or superficial retrieval artifacts. The key ingredient is a restoration-based causal mediation score (CEU) computed by a small number of interchange interventions at an explicit retrieval interface component f. We combine the standard task loss with (i) a hinge-style constraint enforcing a minimum causal evidence usage on supported examples and (ii) a penalty discouraging causal influence from distractor spans, together with an optional contrastive corruption term that preserves sensitivity to evidence identity.
Each training instance consists of a triple (q, y*, D+)
where D+ contains
an evidence item e supporting
y*. A corruption
operator 𝒞 produces a matched corrupted
context D− = 𝒞(D+)
in which e is swapped/removed
while surface statistics are preserved. Let
L+ = log pθ(y* ∣ q, D+), L− = log pθ(y* ∣ q, D−)
denote the gold log-likelihoods under teacher forcing (or summed over
gold tokens). Let hf(q, D)
be the activation at the designated retrieval interface~f for input (q, D). We assume we can
(a) record hf(q, D+)
and (b) run an intervened forward pass on (q, D−) where we
replace the activation with the clean one. We denote the intervened
likelihood by
Lfint = log pθf ← f*(y* ∣ q, D−, D+),
where the superscript indicates that we forward on (q, D−) while
setting hf(q, D−) ← hf(q, D+).
For examples where the corruption changes the model confidence (so
that L+ − L−
is non-negligible), we define a restoration fraction
$$
\mathrm{Attrib}_f
\;=\;
\frac{L^{\mathrm{int}}_f - L^-}{L^+ - L^-},
$$
with the convention that we ignore the term or set it to 0 whenever L+ − L− falls below a small threshold to avoid numerical instability. When span/port structure is available, we refine this restoration fraction into a score CEUf, S by
restricting the intervention to a designated subset S of evidence-bearing tokens/ports.
Concretely, for token-level cross-attention, we implement restriction by
masking the patch to apply only to the attention readout attributable to
tokens in~S (equivalently,
patching a per-span read vector); for memory-port architectures, we
patch only the evidence-designated ports. We write CEUf, ev for the score
restricted to evidence spans/ports and CEUf, dis for the
analogous score restricted to distractor spans/ports. In all cases we
optionally clip to [0, 1] to keep the
score interpretable as a fraction:
CEUf, ev = clip[0, 1](Attribf, ev), CEUf, dis = clip[0, 1](Attribf, dis).
Let α ∈ (0, 1] be a target
minimum evidence mediation level. Our base task loss is
ℓtask(q, y*, D+) = − log pθ(y* ∣ q, D+).
We add a mechanistic regularizer consisting of an evidence hinge and a
distractor penalty:
$$
\ell_{\mathrm{CER}}
\;=\;
\lambda_{\mathrm{ev}}\,\bigl[\alpha - \mathrm{CEU}_{f,\mathrm{ev}}\bigr]_{+}
\;+\;
\lambda_{\mathrm{dis}}\,\mathrm{CEU}_{f,\mathrm{dis}},
$$
where [t]+ = max {0, t}
and λev, λdis ≥ 0
are weights. Intuitively, the first term enforces that a fixed fraction
of the clean–corrupted likelihood gap can be restored by repairing the
evidence-mediated pathway through~f; the second term discourages the
model from using distractor spans as if they were evidence.
Optionally, we include an explicit swap-sensitivity (contrastive
corruption) term to ensure that the identity of evidence matters and not
merely the presence of retrieval:
$$
\ell_{\mathrm{swap}} \;=\; -\,(L^{+} - L^{-}),
$$
or, when a margin is desired,
ℓswaphinge = [γ − (L+ − L−)]+,
with margin γ > 0. The
overall objective per example is then
$$
\ell_{\mathrm{total}}
\;=\;
\ell_{\mathrm{task}}(q, y^{*}, D^{+})
\;+\;
\ell_{\mathrm{CER}}
\;+\;
\beta\,\ell_{\mathrm{swap}},
$$
with β ≥ 0. We emphasize that the regularizer is a mechanistic constraint: it depends on counterfactual forward passes that isolate the causal channel through f, rather than on correlations between token presence and outputs.
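A minimal sketch of the per-example regularizer, written directly in terms of the cached log-likelihoods; the function signature and default weights are illustrative, and gradient flow through the intervened likelihoods is assumed to be handled by the surrounding training loop as discussed in the implementation notes below.

```python
import torch

def cer_regularizer(L_plus, L_minus, L_int_ev, L_int_dis,
                    alpha=0.5, lam_ev=1.0, lam_dis=1.0,
                    beta=0.0, gamma=None, eps=1e-3):
    """Evidence hinge + distractor penalty (+ optional swap-sensitivity term).

    L_plus, L_minus:      clean / corrupted gold log-likelihoods (scalar tensors)
    L_int_ev, L_int_dis:  intervened log-likelihoods with the evidence / distractor
                          contributions at f patched from the clean run
    """
    gap = torch.clamp(L_plus - L_minus, min=eps)              # stabilized denominator
    ceu_ev = torch.clamp((L_int_ev - L_minus) / gap, 0.0, 1.0)
    ceu_dis = torch.clamp((L_int_dis - L_minus) / gap, 0.0, 1.0)

    loss = lam_ev * torch.clamp(alpha - ceu_ev, min=0.0)      # evidence hinge
    loss = loss + lam_dis * ceu_dis                           # distractor penalty
    if beta > 0:
        swap = -(L_plus - L_minus) if gamma is None \
               else torch.clamp(gamma - (L_plus - L_minus), min=0.0)
        loss = loss + beta * swap                             # swap-sensitivity term
    return loss

# Per-example total: task NLL on (q, D+) plus the regularizer, e.g.
# total = -L_plus + cer_regularizer(L_plus, L_minus, L_int_ev, L_int_dis, ...)
```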
We implement CER with a constant number of forward evaluations per example. For each minibatch, we (i) sample corruptions D− = 𝒞(D+), (ii) compute L+ and cache the clean activation(s) at~f, (iii) compute L−, and (iv) run up to k intervened passes to estimate CEUf, ev and CEUf, dis (and optionally additional probes such as random-span patches to sanity-check localization). A typical configuration uses k ∈ {1, 2, 3}, corresponding to evidence-only, distractor-only, and (optionally) a control patch.
Relative to standard training (one forward/backward on (q, D+)), CER requires one additional forward on (q, D−) and O(k) intervened forwards to estimate the CEU terms. Thus the forward-pass count per example is approximately (2 + k), with the backward pass taken only through the loss actually optimized (i.e., we backpropagate through the intervened runs as needed to shape the parameters that control hf and its downstream effects). In wall-clock terms, the overhead factor is close to (2 + k) for compute-dominated regimes, plus a small constant associated with activation capture and replacement at module boundary~f. The memory overhead is dominated by storing the clean activations at~f for each probed span/port; this is O(dim (hf)) per probe per example (or per microbatch when using activation checkpointing). When memory is constrained, we can recompute the clean activation on demand, trading memory for an extra forward pass; this does not change the asymptotic constant-factor accounting and keeps the intervention budget explicit.
We require benchmarks in which (i) evidence is explicitly localized so that evidence-restricted interventions are well-defined, (ii) corruptions can remove or swap the supporting fact while approximately preserving surface cues, and (iii) evaluation can separate causally supported answering from unsupported answering, as measured by CEUf, ev and related sensitivity diagnostics. We therefore provide a two-part suite: a synthetic binding-with-retrieval task with guaranteed evidence localization, and a lightweight factual QA task augmented with controlled retrieval corruption.
The synthetic component is designed to be mechanistically
unambiguous: the gold answer is a deterministic function of a short
evidence span inserted into the retrieved context, and the query is
constructed to be uninformative without retrieval. Concretely, for each
instance we sample (a) a query template, (b) a key–value binding, and
(c) a retrieval bundle D+ consisting of m short passages. Exactly one
passage contains the evidence item e in a designated format,
e.g.,
$$
\texttt{ID: \#742 \;\; KEY: k\_9 \;\; VALUE: v\_3}
$$
and the query asks for the value associated with a key (or the key
associated with a value) under an ID constraint. We then define y* as the unique value
implied by the evidence line. This construction yields a latent evidence
item e that is also directly
labelable as a token span E(D+) by design
(e.g., the token indices of the evidence line itself).
To prevent shortcut learning from superficial artifacts, we randomize (i) passage order, (ii) distractor passage formats (matching punctuation and token lengths), and (iii) lexical overlap between distractors and the query. We additionally include hard distractors that contain plausible competing bindings with mismatched IDs or keys. The resulting task admits clear failure modes: a model that answers from a query-only prior is near-chance, a model that keys on spurious lexical overlap between query and distractor passages will be brittle under swaps, and a model that truly reads the evidence should exhibit high CEUf, ev.
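A sketch of an ATR-insert instance generator under the format shown above; the vocabulary sizes, query phrasing, and dictionary layout are illustrative choices.

```python
import random

def make_atr_insert_instance(num_passages=5, num_keys=20, num_values=20, rng=None):
    """Generate one synthetic example: num_passages short passages, exactly one
    of which contains the evidence binding resolved by the query."""
    rng = rng or random.Random()
    fmt = "ID: #{iid} KEY: k_{key} VALUE: v_{value}"

    gold_key, gold_value = rng.randrange(num_keys), rng.randrange(num_values)
    gold_id = rng.randrange(1000)
    evidence_line = fmt.format(iid=gold_id, key=gold_key, value=gold_value)

    passages = [evidence_line]
    for _ in range(num_passages - 1):
        # Distractor bindings: identical format, but a key different from the
        # queried one, so only the evidence line resolves the query.
        key = rng.choice([k for k in range(num_keys) if k != gold_key])
        passages.append(fmt.format(iid=rng.randrange(1000), key=key,
                                   value=rng.randrange(num_values)))
    rng.shuffle(passages)                       # randomize passage order
    evidence_idx = passages.index(evidence_line)

    return {"query": f"What VALUE is bound to KEY k_{gold_key} under ID #{gold_id}?",
            "answer": f"v_{gold_value}",
            "passages": passages,
            "evidence_idx": evidence_idx}
```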
We implement a controlled corruption operator 𝒞 for ATR-insert in three modes (evidence swapping, distractor injection, and evidence deletion), each producing a D− that preserves surface statistics. Because E(D+) is known exactly, we can compute CEUf, ev by restricting interventions to the evidence span, and CEUf, dis by restricting to distractor spans of identical format. This enables fine-grained sanity checks, including control patches to random non-evidence spans, for which we expect near-zero restoration.
The second component targets factual question answering with natural language evidence, while retaining controllability over corruption and evaluation. We begin with a set of short, single-hop questions (e.g., entity–attribute queries) for which we can retrieve one or more passages containing a single sentence that supports the answer. Each example is stored as (q, y*, D+) along with an evidence span E(D+) obtained either from dataset-provided annotations when available or from a deterministic matching procedure (e.g., locating an answer-bearing sentence containing the answer string and a query entity, then expanding to a minimal clause boundary). We restrict to instances where (i) the answer is extractive or can be normalized to a short canonical string, and (ii) the evidence span is short relative to the retrieved context, so that span-restricted interventions are meaningful.
We define corruption operators that preserve fluency and topicality, mirroring the three modes above: swapping the answer-bearing sentence for a stylistically matched sentence about a different entity, injecting distractor sentences with matched style and topic, and deleting the evidence-bearing sentence. We emphasize that these corruptions are applied to the retrieved context rather than to the query, so that the task remains ``answer from retrieval.'' In addition, we parameterize corruption strength (e.g., number of injected distractors, similarity constraints) so that robustness curves can be measured as a function of δ.
We report standard predictive quality metrics alongside mechanistic and causal-sensitivity metrics, with the goal of separating correct answers that are causally supported by retrieval from correct answers produced by priors or artifacts.
We compute exact-match accuracy (and token-level F1 when applicable)
under a fixed answer normalization. We also report negative
log-likelihood on D+,
NLL = − log pθ(y* ∣ q, D+),
to capture calibration-relevant changes not reflected in exact
match.
We compute expected calibration error (ECE) over binned confidence scores, where the confidence for a predicted answer ŷ is pθ(ŷ ∣ q, D) (or the product of token probabilities for multi-token answers). We also report the Brier score for the indicator of correctness when the answer space is small (e.g., ATR-insert classification-style variants). These metrics detect training regimes that increase accuracy by sharpening overconfident priors rather than by improving evidence use.
For each example we compute CEUf, ev and CEUf, dis via restoration fractions from span- or port-restricted interventions at component f, following the definitions in Section~. We summarize by reporting the distribution (mean, median, and lower quantiles) and by conditioning on correctness, since the relevant failure mode is high-confidence correctness with low CEUf, ev.
We report two swap-based sensitivity measures. First, the clean–corrupted gap
ΔL = L+ − L− = log pθ(y* ∣ q, D+) − log pθ(y* ∣ q, D−),
which should be positive and large when the evidence is swapped or removed. Second, we report the evidence-restoration and distractor-restoration gaps:
Δev = Levint − L−, Δdis = Ldisint − L−,
which diagnose whether restoring evidence through f repairs the model, and whether
restoring distractors incorrectly repairs it. For ATR-insert we
additionally evaluate robustness under increasing distractor rate δ, plotting accuracy and CEUf, ev as functions of
δ and context length n.
Finally, we define an operational notion of ``causally supported'' answering: an answer is counted as supported if it is correct and satisfies CEUf, ev ≥ τ for a chosen threshold τ (e.g., τ = 0.5). Reporting the supported-answer rate prevents models that achieve high accuracy with systematically low mediation from being scored favorably.
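Operationally, the supported-answer rate is a simple aggregate over per-example records of correctness and evidence CEU; a sketch:

```python
def supported_answer_rate(records, tau=0.5):
    """Fraction of examples whose answer is both correct and causally supported,
    i.e. correct with CEU_{f,ev} >= tau. `records` is an iterable of dicts with
    a boolean `correct` and a float `ceu_ev`."""
    records = list(records)
    if not records:
        return 0.0
    supported = sum(1 for r in records if r["correct"] and r["ceu_ev"] >= tau)
    return supported / len(records)
```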
Together, these benchmarks and metrics provide both a controlled setting (ATR-insert) in which evidence mediation can be verified with minimal ambiguity and a natural-language setting (factual QA) in which controlled corruptions approximate real retriever failures and distractor-heavy contexts.
We design experiments to test the central empirical claims implied by our causal formulation: (i) a model can achieve high predictive performance while exhibiting low causal mediation of retrieved evidence, and (ii) the proposed CER objective increases causal evidence mediation at a designated retrieval interface f while improving robustness to evidence-preserving distractors and sensitivity to evidence swaps. All experiments are conducted on the benchmark suite in Section~ with corruption operators 𝒞 and evidence labels E(D+) when available.
We vary five factors: (a) retrieval interface choice and probe location f, (b) model family, (c) retriever quality (miss probability η and ranking noise), (d) distractor rate δ, and (e) total context length n. Our primary hypotheses mirror claims (i) and (ii) above: that high accuracy can coexist with low causal mediation in the absence of causal regularization, and that CER increases CEUf, ev while improving distractor robustness and swap sensitivity. We additionally test whether these effects depend on architectural inductive bias (attention vs. SSM) and on the degree to which retrieval is fused into the computation graph.
We instantiate multiple explicit retrieval interfaces and define a corresponding component f for interventions, following the designs enumerated in the problem setup (cross-attention readouts over retrieved tokens, late-fusion passage aggregation, memory read ports, and compression-module outputs). For each interface, we compare (i) training without causal regularization, (ii) CER with evidence-only regularization, and (iii) CER with both evidence and distractor terms. We also vary the intervention granularity: span-restricted (tokens in E(D+)) versus port-restricted (a fixed subset of retrieval read channels). The goal is to verify that improvements in CEUf, ev persist across reasonable definitions of f and are not an artifact of a particular hook.
To evaluate whether causal evidence usage is architecture-dependent, we train parameter-matched models across three architecture classes, including attention-based and SSM-style designs. We keep training compute approximately constant and report performance as a function of regularization weight λ and probe budget k. This isolates whether CER is effective when retrieval is consumed through different computational primitives.
We construct retriever-quality regimes that vary η and the prevalence of near-miss distractors. We evaluate whether models trained with CER degrade more gracefully as η increases, and whether the supported-answer rate tracks the bound structure suggested by Theorem~3. Empirically, we test for a monotone relation between retriever quality and the fraction of correct answers that remain causally supported by retrieved evidence.
We measure robustness under increasing distractor rate δ and context length n by expanding D with additional distractor passages or sentences while preserving evidence presence. For each δ and n, we report (i) accuracy on D+, (ii) causal metrics CEUf, ev and CEUf, dis, and (iii) swap-based sensitivity under 𝒞. The key diagnostic is whether CER-trained models maintain both accuracy and evidence mediation as irrelevant context grows. We also test a control in which D− and D+ have identical token counts, ensuring that any sensitivity is to evidence identity rather than to length or formatting artifacts.
For each configuration we train with fixed seeds, report mean and variance across runs, and sweep λ and the CEU target α in the hinge term. We track the Pareto frontier between clean accuracy and causal support, operationalized as the supported-answer rate at several thresholds τ. To confirm that CER does not merely induce brittleness, we include a control corruption condition (evidence paraphrase) and require that ΔL remains small under paraphrase while remaining large under evidence swaps. Finally, we include a compute accounting ablation varying k ∈ {1, 2, 3} to quantify how much causal auditing and training benefit is obtained per additional intervened forward pass.
We isolate several limitations of our current formulation and outline extensions that appear technically natural within the structured intervention view.
Our metric CEUf
is defined relative to a designated component f and (when available) a designated
evidence span E(D+). This is
well-suited to single-span factual questions and synthetic binding
tasks, but it does not, by itself, resolve settings where correctness
requires multiple disjoint evidence items, possibly spread across passages and
requiring compositional reasoning (multi-hop QA, comparison, temporal
aggregation).
A first extension is to generalize the evidence notion from a single
span to a set of spans {ei}i = 1, …, r
and define a group attribution score via grouped interventions:
$$
\mathrm{CEU}_{f,\mathrm{group}} := \frac{\log p_\theta^{f\leftarrow
f^*}(y^*\mid q, D^-, D^+) - \log p_\theta(y^*\mid q, D^-)}{\log
p_\theta(y^*\mid q, D^+) - \log p_\theta(y^*\mid q, D^-)}\,,
$$
where f* is
computed from a context in which all required evidence spans are
simultaneously restored.
However, this grouped score collapses the structure of how evidence
items interact. If we wish to distinguish disjunctive evidence (either span suffices) from conjunctive evidence (both spans needed), we must estimate
interaction terms, e.g. by Shapley-style or inclusion–exclusion
decompositions over evidence subsets. This introduces an exponential
dependence on the number of evidence sources unless we assume restricted
forms (e.g. bounded-order interactions) or accept stochastic subset
sampling. Under the compute constraints we impose (bounded k), the practical compromise is to
report (i) marginal CEU per evidence span and (ii) a small number of
pairwise interaction probes for a fixed budget. This yields an auditing
signal but not a complete identifiability result for arbitrary
multi-evidence support.
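One simple instantiation of such a pairwise probe, stated here only as an example of the inclusion–exclusion idea, is the interaction
$$
I_{f,\{i,j\}}
\;=\;
\mathrm{CEU}_{f,\,E^{(i)}\cup E^{(j)}}
\;-\;
\mathrm{CEU}_{f,\,E^{(i)}}
\;-\;
\mathrm{CEU}_{f,\,E^{(j)}},
$$
where strongly positive values suggest conjunctive evidence (neither span restores the answer alone) and small or negative values alongside large marginals suggest disjunctive use; each such probe costs one additional intervened forward pass inside the budget k.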
Our two-channel decomposition explicitly allows a parametric term
ℓpar(q),
and we treat high CEUf, ev as a mechanistic
desideratum when evidence is present. This does not preclude the model from also knowing y* parametrically, and
indeed in many domains we expect substantial parametric knowledge. Two
issues follow.
First, CEUf is not
a correctness guarantee: it measures mediation through a designated
retrieval pathway, not semantic truth. Second, if y* is strongly supported
by priors, then even correct evidence may contribute only a small
incremental logit, making CEU small
while the model remains accurate and non-hallucinatory in a semantic
sense. We can respond in two ways. (i) We can calibrate evaluation to
emphasize evidence-conditional identification by selecting queries where
priors are weak or by constructing contrast sets where parametric
knowledge is uninformative (e.g. counterfactual entity names, synthetic
keys). (ii) We can modify the objective to encourage on retrieval when
retrieval is available, for example by adding a penalty on answering
confidently under evidence removal:
ℓno-ret = max (0, log pθ(y* ∣ q, ∅) − β),
or by introducing a ``selective abstain'' output that is rewarded when evidence is missing. These changes alter the product specification: the desirable behavior is no longer ``answer whenever possible'' but ``answer when supported.'' In deployments where parametric knowledge is acceptable, we should interpret CEU as an auditing signal rather than a universal requirement.
A practical limitation is that interchange interventions require
access to activations hf(q, D)
at a module boundary. Many efficient inference/training stacks fuse
attention, MLPs, and normalization into kernels where intermediate
values are not materialized, and retrieval augmentation may be
implemented via custom CUDA ops or external memory systems.
There are at least three engineering-level mitigations: (i) choose f at an accessible boundary
(e.g. residual stream before/after a block), (ii) implement a ``shadow''
hook that recomputes the relevant subgraph in unfused form for auditing
only, or (iii) instrument the kernel to expose the read vector or
attention output for a small fraction of steps. Each option incurs a
different overhead–faithfulness trade-off. Conceptually, our formalism
does not require f to be
uniquely defined; it requires only that f capture the retrieval-to-answer
channel of interest. Nevertheless, the empirical value of CEUf depends on how well
the chosen boundary aligns with the actual causal bottleneck. A useful
extension is to treat f as a family of candidate sites and to report a profile {CEUfℓ}ℓ
across layers/ports, thereby reducing the risk that we certify the wrong
interface.
Our interventions operate at the level of internal activations and evidence spans, which suggests a natural connection to causal abstraction: we seek a high-level variable ``evidence content'' whose influence on the answer is mediated through a designated mechanism. From this perspective, CEUf is a restricted mediation estimate under a particular abstraction map (tokens ↦ retrieval states ↦ answer logits). A limitation is that the abstraction is currently specified by hand (via E(D) and f). An extension is to learn the abstraction map jointly, e.g. by learning a sparse selector over retrieved tokens/ports that maximizes restoration while minimizing selected mass, subject to compute constraints. However, the hardness discussion already implies that exact minimal selection is intractable in general, so we expect only approximate abstractions, and we should evaluate them by stability across corruptions and by agreement with human-labeled evidence where available.
Our experiments focus on single-step retrieval-augmented generation
with fixed D = R(q).
Tool-augmented agents introduce two additional causal pathways: (i)
actions influence future observations (queries issued to tools) and (ii)
intermediate chain-of-thought or scratchpad states may route information
in ways not localized to a single f. Extending our approach requires
deciding what counts as ``evidence’’ (tool outputs, web snippets,
database rows) and what the relevant interface is (the memory write/read
boundary, the tool-call result embedding, or the planner state).
A direct extension is to define CEU over tool-call transcripts: we treat the agent as
inducing a distribution over tool transcripts T, and we measure whether
swapping/removing a transcript segment restores or destroys answer
likelihood through the designated read component. This suggests an
auditing protocol for 2026-era systems: certify not only that an answer
depends on retrieved text, but that it depends on the evidence-bearing transcript segments rather than on
spurious correlations in the prompt or on parametric recall. The main
open problem is controlling confounding introduced by the agent’s policy
(tool choices depend on the model state). Addressing this likely
requires randomized tool-response corruptions and policy-aware
estimators, which we leave as an explicit direction.
We have argued that retrieval augmentation is not, by itself, a guarantee of evidence-conditioned behavior. A retrieval-augmented language model can achieve high task accuracy while routing decisive information through channels that are only weakly coupled to the retrieved evidence (e.g. parametric priors, prompt heuristics, or distractor-sensitive features). For this reason, we treat causal evidence usage—in the narrow sense of counterfactual intervention at an explicit retrieval interface—as a primary design and evaluation primitive rather than a post hoc interpretability add-on.
Our central object is the causal evidence usage score CEUf, defined by
restoration under interchange interventions at a designated retrieval
read component f.
Operationally, CEUf answers the question: how much of the corruption-induced loss in gold-answer likelihood is recovered when only the designated retrieval read component is restored to its clean state? The definition is intentionally local: it is tied to a chosen interface f and (when available) designated evidence spans or ports. This locality is a feature. It yields a computable signal with constant-factor overhead (Theorem~4), and it permits targeted claims of the form ``the model uses retrieval through this mechanism'' rather than global, ill-posed statements about ``using evidence.''
We then introduced CER, a training objective that treats mechanistic support as a constraint. The regularizer simultaneously (i) enforces a minimum evidence-mediated restoration CEUf, ev ≥ α on supported examples and (ii) penalizes restoration attributable to distractor spans. The resulting training signal is not merely correlational: it is defined in terms of counterfactual intervention, and thus it aligns directly with the causal question that downstream users care about (whether the output is supported by retrieved evidence). Under the additive logit model, the restoration fraction corresponds to an identifiable share of the evidence-dependent logit difference (Theorem~1). Although this stylized model is not a literal description of modern transformers, it provides a tractable hypothesis class in which the metric is interpretable and the regularizer has a clear semantic target.
The accompanying bounds motivate why this style of audit should be considered a primitive for robust RAG. Theorem~2 formalizes a lower-bound phenomenon: if a system exhibits low causal mediation of evidence through the retrieval interface, then distractor injection processes exist that degrade accuracy by a constant amount, even when evidence is present and superficially indistinguishable from distractors. Conversely, Theorem~3 gives an upper bound on unsupported answering under retriever noise: when the retriever miss probability is η and the model satisfies a nontrivial mediation constraint together with a swap-sensitivity margin, the hallucination probability (with respect to the retrieved context) is controlled by a function that separates error from retriever misses (η) and error from weak evidence dependence (captured by α and the margin γ). The point is not that these constants are sharp in practice, but that the qualitative dependency is the correct one: without an explicit mechanism-level constraint, there is no reason to expect monotone improvements in evidence-conditioned behavior as we scale models or retrievers.
From an engineering perspective, the conclusion is that we should treat CEU-style audits analogously to unit tests for a software interface. In a 2026 deployment, one does not merely evaluate end-task accuracy on a benchmark. One also specifies (a) a corruption family 𝒞 that preserves superficial cues while altering the underlying evidence identity, (b) an interface site f that represents the intended retrieval-to-answer bottleneck, and (c) acceptance thresholds on evidence restoration and distractor suppression. Concretely, for a chosen supported-query slice, we can maintain dashboards of (i) the distribution of CEUf, ev, (ii) the distribution of CEUf, dis, and (iii) swap-sensitivity margins log pθ(y* ∣ q, D+) − log pθ(y* ∣ q, D−). These are not substitutes for accuracy; rather, they are orthogonal observables that diagnose failure modes that accuracy alone cannot detect.
We also emphasize a product-level implication. If the intended specification is ``answer whenever possible,'' then strong parametric knowledge is acceptable and low CEU may be tolerable. If the intended specification is ``answer only when supported by retrieved evidence,'' then mechanistic mediation becomes a requirement, and it is natural to couple CER-style training with abstention or confidence penalties under evidence removal. The virtue of CEU is that it makes these specification choices explicit and measurable at the mechanism boundary, rather than leaving them implicit in loosely phrased desiderata about ``groundedness.''
Finally, we view the main conceptual contribution as a shift from behavioral evaluation to mechanistic evaluation. Retrieval augmentation introduces a distinguished causal pathway, and we can interrogate that pathway with controlled counterfactuals. The resulting metrics and objectives are compute-feasible (constant additional passes), compatible with black-box probability queries plus white-box hooks at f, and extensible to richer settings by expanding the intervention family. We do not claim that a single scalar CEUf certifies truthfulness or completeness; rather, we claim that mechanistic audits provide a stable intermediate target between unverifiable semantic criteria and purely behavioral benchmarks.
In summary, we advocate the following stance for RAG systems: (i) define the retrieval interface as an explicit causal bottleneck, (ii) enforce evidence mediation and distractor suppression during training when the product specification requires support, and (iii) report intervention-based audit statistics as first-class evaluation outputs. This stance converts ``grounding’’ from an aspirational property into a testable, optimizable constraint, thereby making evidence-conditioned behavior an object of routine model design rather than an after-the-fact hope.