Mechanistic Certificates: PAC-Style Robust Causal Guarantees for Interpretability via Component Replacement

Table of Contents

  1. Introduction and Motivation: why interpretability needs certifiable robustness in 2026-era deployment; limitations of anecdotal circuits and streetlight interpretability; overview of mechanistic certificates.
  2. Background: mechanistic interpretability primitives (features/circuits), intervention methods (activation/path patching), causal abstraction/scrubbing; relationship to robustness and auditing.
  3. Formal Setup: models as causal DAGs; intervention families; behavior functionals; OOD families; notion of explanation via component replacement.
  4. Problem Statement (RMC): define Robust Mechanistic Certification; specify inputs/outputs; discuss what counts as success and what is explicitly not claimed.
  5. Certification Protocol: modular pipeline (candidate mediator search, abstraction fitting, replacement construction, validation). Provide pseudocode and invariants.
  6. Soundness Theorems (PAC Guarantees): finite-sample bounds for replacement error under explicit sampling assumptions; how to handle multiple hypotheses (union bounds).
  7. Completeness in Compiled/Algorithmic Testbeds: conditions under which a sparse ground-truth abstraction is recoverable; proof sketch; discussion of identifiability assumptions and how they can fail.
  8. Computational Complexity and Hardness: NP-hardness of minimum mediator selection; implications; approximation strategies and what we can still guarantee.
  9. Practical Instantiations: (planned) SAE-based mediator variables; automated circuit discovery; causal scrubbing validation; guidelines for designing on-distribution interventions.
  10. Experimental Plan (to strengthen contribution): compiled transformers (Tracr/InterpBench/TracrBench-like) for ground truth; progressively relax assumptions (noise, redundancy, superposition); small open LMs for external validity; falsification-first reporting.
  11. Limitations and Failure Modes: off-distribution patching artifacts; hydra/self-repair; non-linear/manifold features; adversarially evasive models; what certificates do and do not imply.
  12. Related Work and Positioning: contrast with causal abstraction alone, circuit discovery alone, and black-box evals; connect to auditing and governance standards.
  13. Conclusion and Outlook: toward standardized mechanistic audit artifacts; open problems (dynamic/agentic settings, adversarial robustness, universality up to abstraction).

Content

1. Introduction and Motivation: why interpretability needs certifiable robustness in 2026-era deployment; limitations of anecdotal circuits and streetlight interpretability; overview of mechanistic certificates.

Deployed neural systems in 2026 increasingly occupy roles where a failure is not merely a misclassification but a consequential deviation: an agent makes an unanticipated tool call, a code model introduces a subtle vulnerability, or a content model exhibits a policy-violating behavior under distributional shift. In such settings we are typically not satisfied with a post hoc story that "a model uses feature f" or that "head h attends to token t"; rather, we require a statement of the form: if the identified mechanism is substituted for the corresponding internal components, the behavior of interest is preserved on a declared distribution, up to a quantified error and confidence. The gap between descriptive interpretability and such a statement is the central motivation for our framework.

Mechanistic interpretability has produced a useful vocabulary—features, circuits, attribution graphs, and patched pathways—that can illuminate how a model may compute. However, these artifacts are often presented as case studies: a circuit is exhibited on a handful of prompts, validated by a small number of interventions, and then informally generalized. This style is informative, yet it does not produce a guarantee that the identified mechanism is sufficient for the behavior on a specified distribution, nor that it remains sufficient under the intervention operations used to probe it. In particular, when a circuit is discovered by exploring a narrow region of prompt space, we risk a form of streetlight interpretability: we understand what is easiest to elicit and visualize, while the model may implement alternative pathways elsewhere, including in rare but consequential OOD regimes.

The problem is sharpened by two facts. First, modern systems are routinely evaluated under synthetic stress tests (prompt-injection templates, red-team generators, compositional generalization suites) that define an explicit 𝒟OOD, and the deployment consequences are governed by behaviors B that are discrete and auditable (e.g. pass/fail of a policy checker, success/failure of a sandboxed task, or a forbidden tool invocation). Second, mechanistic claims are increasingly used as inputs to governance and engineering decisions (e.g. whether a safety mitigation is "understood"), which suggests that interpretability should support the same kind of error accounting common in robustness and verification: explicit assumptions, explicit sampling, and explicit confidence parameters.

We therefore pursue mechanistic certificates: statements that tie an interpretable explanation to observed behavior through a formal replacement test. Concretely, we treat the trained model M as inducing a causal graph over internal computations, and we allow interventions from a declared family ℐ that set, patch, or replace internal node values. An explanation is not merely a subgraph; it is a triple (A, π, S) where A is an abstract causal model with variables U, π : U → V maps abstract variables to concrete mediators inside M, and S is a replacement scheme producing a hybrid model M ∘ S that runs M except that mediator values are computed by A and injected according to S. The operational question is then well-posed: how often does M ∘ S disagree with M on the behavior of interest,
Prx ∼ 𝒟OOD [B(M(x)) ≠ B((M ∘ S)(x))] ?
If this probability is small, the explanation is not only plausible but sufficient for reproducing the behavior on the declared OOD family.

The certificate component enters by insisting that we upper-bound the above probability using held-out samples and a confidence parameter δ. The point is not to claim universal validity over all possible prompts, but to make the scope explicit: the guarantee is relative to 𝒟OOD and to the intervention interface ℐ. This mirrors the standard discipline of robustness evaluation, where a claim is meaningful only after specifying a threat model. Here the threat model is twofold: distributional shift (captured by 𝒟OOD) and permissible mechanistic perturbations (captured by ℐ). Mechanistic certificates thereby reconcile interpretability with auditing practice: the same generator that defines the red-team distribution also defines the distribution on which the explanation must be behavior-preserving.

This perspective clarifies the limitations of purely anecdotal circuit validation. A circuit can appear necessary on a curated set while failing to be sufficient on an OOD generator, because the model may implement an alternative computation that coincides on the curated set. Conversely, a discovered circuit can be sufficient on a distribution even if it is not unique; the certificate records sufficiency rather than uniqueness, which is often the property required for safe replacement, modular analysis, or controlled editing. Moreover, replacement-based validation discourages a common failure mode in interpretability experiments: measuring large causal effects under interventions that implicitly move the input into an unnatural regime. By committing to ℐ and 𝒟OOD, we separate questions of genuine mechanism from artifacts of intervention-induced distribution shift.

Finally, the certificate framework imposes an optimization objective that matches practical desiderata. Explanations should be small (few mediators, simple A) and they should come with tight error bounds. This yields a principled tradeoff between interpretability and fidelity: we may accept a slightly larger mediator set to reduce certified error, or accept a modest error ε to obtain a compact abstraction. The result is a protocol in which interpretability artifacts are not endpoints but hypotheses subjected to a controlled test: we search for (A, π, S), we evaluate the induced replacement model on held-out OOD samples, and we report a quantitatively meaningful certificate (ε̂, δ). This is the sense in which we aim to make mechanistic interpretability robust in deployment-relevant regimes.


2. Background: mechanistic interpretability primitives (features/circuits), intervention methods (activation/path patching), causal abstraction/scrubbing; relationship to robustness and auditing.

Mechanistic interpretability typically begins with a choice of internal objects regarded as candidate explanatory units. At the finest granularity one may take individual neurons or scalar coordinates in the residual stream; at coarser granularity one may take attention heads, MLP blocks, or entire layers. A central empirical difficulty is polysemanticity: single coordinates frequently participate in multiple computations, and interventions on them may entangle effects that are hard to attribute. This has motivated learned bases in which coordinates are intended to be more semantically "atomic," including sparse autoencoders (SAEs) or transcoders trained on activation distributions to yield sparse feature dictionaries. In that representation, the basic explanatory claim becomes: a small set of features (or heads) mediates the relevant dependence of the output on the input.

Given primitives, a circuit is usually presented as a subset of components together with a qualitative dependency structure: information is extracted from particular input tokens, stored in intermediate states, and then combined to influence a downstream logit or decision. Circuit discovery methods include gradient-based attributions, attention pattern inspection, logit lens style probes, and graph-building procedures that connect upstream components to downstream readouts using activation correlations or linearized influence estimates. These methods are valuable as hypothesis generators, but they are not, by themselves, tests of causal sufficiency; indeed, correlations can be induced by shared causes or by alternative pathways that coincide on a narrow prompt slice. Consequently, mechanistic work increasingly pairs circuit proposals with explicit intervention experiments.

The most common family of interventions is activation patching. One chooses a "clean" input x(c) and a "corrupt" input x(r) that differ in some targeted property, runs the model on both to obtain internal activations, and then constructs hybrid runs in which the value of a chosen node is replaced by the corresponding activation from the other run. The resulting change in some scalar quantity (a logit, a probability, or a discrete predicate evaluated on the output) is taken as evidence that the node carries causal information relevant to the change. Variants include mean ablation or resampling ablation (replacing an activation by a sample from a reference distribution), "zeroing" interventions, and subspace patching (e.g. replacing only the component of an activation in a learned feature subspace). In practice these interventions can be implemented with cached activations and a modified forward pass, and they support efficient sweeps over many candidate nodes.
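As a concrete sketch, activation patching can be implemented with forward hooks in PyTorch; the toy Sequential model, the chosen layer index, and the helper name below are illustrative assumptions rather than any particular library's patching API:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy stand-in for a real network: indexable layers so we can name a "node".
model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 2),
)

def run_with_patch(model, x, layer_idx, patch_value=None, cache=None):
    """Run the model on x. If patch_value is given, overwrite the output of
    layer `layer_idx` (a node intervention do(v <- z)); if cache is a dict,
    record the clean activation at that layer instead."""
    def hook(module, inputs, output):
        if cache is not None:
            cache[layer_idx] = output.detach().clone()
        if patch_value is not None:
            return patch_value  # a forward hook may return a replacement output
    handle = model[layer_idx].register_forward_hook(hook)
    try:
        return model(x)
    finally:
        handle.remove()

x_clean, x_corrupt = torch.randn(1, 8), torch.randn(1, 8)
cache = {}
run_with_patch(model, x_clean, layer_idx=2, cache=cache)       # cache clean activation
y_corrupt = run_with_patch(model, x_corrupt, layer_idx=2)      # unpatched corrupt run
y_patched = run_with_patch(model, x_corrupt, layer_idx=2,
                           patch_value=cache[2])               # clean -> corrupt patch
print(y_corrupt.argmax(-1).item(), y_patched.argmax(-1).item())
```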

A known limitation of naive activation patching is that it may conflate where information is represented with whether it is used. If we replace a late-layer activation, we may restore the correct output even if earlier computations were incorrect, because downstream computation is overpowered by the patch; conversely, patching a node may fail to change the output even if the node is part of the intended mechanism, because other redundant pathways compensate. To localize causal influence along specific computational routes, one may use path patching (sometimes called edge patching): we patch only the contribution transmitted along selected edges while keeping other incoming influences fixed. In transformer terms, this often means patching the output of a particular attention head into a downstream residual stream while leaving other residual contributions unmodified, or patching the value vectors while holding attention patterns fixed. Such techniques approximate edge interventions in an internal causal graph and provide a more granular notion of "which connection matters."
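A minimal toy illustration of the distinction, assuming a two-"head" residual computation defined in the snippet itself: only the contribution flowing along the head_a → residual edge is replaced, while the other residual contributions keep their clean values:

```python
import numpy as np

rng = np.random.default_rng(0)
W_a, W_b, W_out = rng.normal(size=(4, 4)), rng.normal(size=(4, 4)), rng.normal(size=(4, 2))

def head_a(x): return np.tanh(x @ W_a)
def head_b(x): return np.tanh(x @ W_b)

def forward(x, head_a_contrib=None):
    """Residual = input + head_a + head_b; optionally replace only head_a's
    additive contribution (an edge patch), leaving the other parents intact."""
    a = head_a(x) if head_a_contrib is None else head_a_contrib
    resid = x + a + head_b(x)
    return resid @ W_out

x_clean, x_corrupt = rng.normal(size=4), rng.normal(size=4)
y_clean = forward(x_clean)
y_edge_patched = forward(x_clean, head_a_contrib=head_a(x_corrupt))
print(np.argmax(y_clean), np.argmax(y_edge_patched))
```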

These intervention methodologies connect naturally to the literature on causal abstraction and causal scrubbing. The guiding idea is that an explanation should not merely label components but should define an abstract causal model whose variables correspond to interpretable quantities (e.g. a parsed token identity, a parity bit, a latent state in an algorithm), together with a mapping from those variables to internal representations. One then evaluates whether the abstract model is faithful to the original system under a specified set of interventions: interventions on abstract variables should correspond (via the mapping) to interventions on internal nodes that induce the same downstream behavior distribution, up to an allowed tolerance. "Scrubbing" procedures operationalize this by replacing internal activations with samples conditioned on abstract variables, thereby attempting to remove all information not carried by the abstraction; if downstream behavior is preserved, this provides evidence that the abstraction is sufficient for that behavior under the scrubbing distribution. The same viewpoint clarifies why intervention design matters: if interventions induce severe distribution shift on downstream activations, then observed effects may reflect unnatural regimes rather than genuine causal dependence.

Finally, we note an alignment between these interpretability tools and standard robustness and auditing practice. Robust evaluation insists on an explicit threat model and an explicit test distribution; similarly, mechanistic claims become meaningful only relative to a declared family of inputs and a declared family of perturbations. In auditing settings, one frequently has an explicit generator of adversarial or stress-test prompts and a discrete pass/fail criterion; mechanistic experiments can be organized so that intervention sweeps are conducted on one set of generated samples, while any quantitative claim about sufficiency is evaluated on held-out samples from the same generator. This separation is essential to avoid adaptive overfitting of mechanistic hypotheses to a small number of hand-selected prompts, and it parallels multiple-comparisons issues familiar from hypothesis testing: when we test many candidate circuits, we must account for selection when reporting confidence.

The preceding primitives—features and circuits as hypotheses, patching and scrubbing as interventional tests, and robustness-style discipline in specifying distributions and threat models—motivate a unified formalization. We therefore proceed to model the internal computation as a causal directed acyclic graph with an explicit intervention interface, to define behavioral predicates evaluated on outputs, and to specify OOD input families via explicit samplers. Within that formal setup we can state, and later certify, when a proposed abstraction is sufficient in the operational sense that replacing the putative mediators preserves the behavior on the declared OOD family.


3. Formal Setup: models as causal DAGs; intervention families; behavior functionals; OOD families; notion of explanation via component replacement.

We model a trained neural network as a causal system by identifying its forward computation with a directed acyclic graph. Concretely, let M be a (possibly randomized) model mapping an input x to an output y = M(x), where y may be a token distribution, a tool-call decision, or a class label. We fix a decomposition of the forward pass into internal nodes (e.g. residual stream states at each layer and position, attention head outputs, MLP outputs, or coordinates in a learned feature basis such as SAE features). This induces a DAG G = (V, E) whose vertices V are these internal nodes and whose edges E follow the dataflow of the computation graph. In the deterministic case, each node value is a deterministic function of its parents and the input; in the randomized case (e.g. dropout, sampling, stochastic tools) we may equivalently incorporate exogenous noise variables, but our certification target will always be stated as an expectation over the declared input distribution (and any fixed model randomness protocol).

We treat G as a structural causal model in Pearl’s sense: for each v ∈ V we write
v = fv(pa(v), x),
where pa(v) denotes the parent set of v in G. We emphasize that G is not an additional assumption about the world; it is simply a convenient representation of the already-defined forward computation, chosen at a granularity appropriate for explanation. The freedom to choose primitives (neurons, heads, features, and so on) is part of the modeling interface, and later bounds and algorithms will be stated relative to this choice.

A mechanistic claim is only meaningful relative to a declared family of permitted perturbations. We therefore fix an intervention family ℐ consisting of operations that modify internal computations. The basic primitive is a node intervention do(v ← z), which sets the value of an internal node v ∈ V to a specified value z during the forward pass, holding the rest of the computation unchanged except for downstream consequences. In practice, z may be obtained by running M on a different input and caching activations (activation patching), by sampling z from a reference distribution (resampling or mean ablation), or by restricting the modification to a subspace (e.g. patching only selected SAE coordinates). We also allow structured interventions that approximate edge-level manipulations (path/edge patching) when the implementation permits decomposing a node into additive contributions from distinct parents. The role of ℐ is twofold: it defines what causal questions we are allowed to ask, and it constrains the space of explanations we can validate without leaving the audit threat model.
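A minimal sketch of such a declared interface, with hypothetical node names and a toy forward pass; the point is that do(v ← z) is accepted only for nodes in the declared family:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))
ALLOWED_NODES = {"h1"}  # the declared intervention family: nodes that may be set

def forward(x, interventions=None):
    """Compute named internal nodes; `interventions` maps node name -> value z,
    implementing do(v <- z) for nodes in the declared family only."""
    interventions = interventions or {}
    for v in interventions:
        assert v in ALLOWED_NODES, "intervention outside the declared family"
    nodes = {"h1": np.tanh(x @ W1)}
    if "h1" in interventions:
        nodes["h1"] = interventions["h1"]
    nodes["logits"] = nodes["h1"] @ W2
    return nodes

x = rng.normal(size=4)
observational = forward(x)
interventional = forward(x, {"h1": np.zeros(4)})   # zero-ablation of node h1
print(np.argmax(observational["logits"]), np.argmax(interventional["logits"]))
```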

We assume a white-box interface sufficient to evaluate both observational and interventional behavior: given x, we can compute M(x), access internal node values along the run, and evaluate interventional variants Mdo(v ← z)(x) for any (v, z) permitted by ℐ. This interface is the resource that distinguishes mechanistic certification from black-box testing: it enables targeted causal probes whose sample complexity would otherwise be prohibitive.

We decouple mechanistic analysis from the full output distribution by fixing a behavior functional
B : (x, y) ↦ B(x, y) ∈ ℬ,
where ℬ is a finite set, typically {0, 1} (pass/fail) or a small label set. The functional B specifies what we intend to preserve under replacement. Examples include: whether a generated answer satisfies a constraint, whether a tool call is made, whether a classifier predicts a target class, or whether a trajectory prefix leads to a particular discrete decision. This choice determines what our certificate will and will not claim: we certify preservation of B, not necessarily equality of logits, hidden states, or free-form text. We write B(M(x)) as shorthand when B depends only on the output (and optionally on x in the background).

Robustness claims require an explicit test distribution. We therefore fix an out-of-distribution family 𝒟OOD given as an explicit sampler. In the simplest case 𝒟OOD is a single distribution over inputs; more generally it may be a finite union $\mathcal{D}_{\mathrm{OOD}}=\bigcup_{k=1}^K \mathcal{D}_k$ corresponding to multiple generators (e.g. different prompt templates, adversarial transformations, or environments). Our guarantees will be stated only with respect to sampling from this declared family, and any empirical evaluation used for certification is required to draw i.i.d. samples from it. We may also have access to an in-distribution source 𝒟 for fitting representations (e.g. SAEs) or proposing hypotheses, but 𝒟OOD is the distribution on which the certificate is ultimately judged.

An explanation in our framework is not merely a set of important components; it is an explanation via component replacement whose adequacy is judged by behavioral invariance under replacement. Formally, an explanation consists of an abstract causal model A with nodes U and edges EA, a mapping π : U → V identifying a set of concrete mediator nodes in M, and a replacement scheme S specifying how A is executed and how its values are injected into the corresponding internal nodes. We write M ∘ S for the replacement model: on input x, we run M except that whenever a mediator node v = π(u) would be computed by the original forward pass, we instead compute the corresponding abstract variable u according to A (possibly using upstream activations or other permitted inputs to A) and then set v to the injected value prescribed by S. The remainder of M is left unchanged.

This definition subsumes several common mechanistic tests. If A samples mediator activations from a conditional distribution given an abstract variable, we recover a form of causal scrubbing; if A deterministically computes mediator values from a small set of interpretable variables, we obtain a program-like abstraction. In all cases, the operative criterion is the replacement disagreement risk on the declared OOD family:
ε(A, π, S) := Prx ∼ 𝒟OOD [B(M(x)) ≠ B((M ∘ S)(x))].
Our subsequent problem statement will ask for procedures that, using only interventions from ℐ and samples from 𝒟OOD, output an explanation (A, π, S) together with a high-confidence upper bound on ε(A, π, S).
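A minimal runnable sketch of these definitions, with a toy model M, a hypothetical abstraction A, behavior functional B, and OOD sampler standing in for the real objects:

```python
import numpy as np

rng = np.random.default_rng(2)
W1, W2 = rng.normal(size=(6, 6)), rng.normal(size=(6, 2))

def M(x, mediator_override=None):
    h = np.tanh(x @ W1)              # mediator node v = pi(u)
    if mediator_override is not None:
        h = mediator_override        # replacement scheme S injects A's value here
    return h @ W2

def B(y):                            # behavior functional with finite range {0, 1}
    return int(np.argmax(y) == 0)

def A(x):                            # abstract model: a sparsified approximation of v
    h = np.tanh(x @ W1)
    return h * (np.abs(h) > 0.5)

def sample_D_ood(n):                 # declared OOD sampler (toy shifted Gaussian)
    return rng.normal(loc=1.0, size=(n, 6))

xs = sample_D_ood(2000)
disagree = [B(M(x)) != B(M(x, mediator_override=A(x))) for x in xs]
print("empirical replacement disagreement:", float(np.mean(disagree)))
```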


4. Problem Statement (RMC): define Robust Mechanistic Certification; specify inputs/outputs; discuss what counts as success and what is explicitly not claimed.

We now state the certification problem that our protocol is designed to solve. Informally, we seek an explanation via component replacement—an abstract mechanism inserted into a small set of internal mediator nodes—together with a high-confidence bound on the induced behavioral disagreement over a declared out-of-distribution family.

We assume as given:

  1. a trained model M with white-box access to activations and interventional forward passes;
  2. a behavior functional B with finite range ℬ;
  3. an explicit sampler for the OOD family 𝒟OOD (possibly a union of generators);
  4. an intervention family ℐ specifying the admissible internal perturbations; and
  5. target parameters ε (tolerable replacement error) and δ (confidence).

All guarantees are relative to these declared objects. In particular, ℐ is part of the threat model: it determines what causal probes are admissible, and it bounds what can be validated without stepping outside the auditing interface.

A candidate explanation is a triple (A, π, S) where A is an abstract causal model with nodes U, π : U → V maps abstract variables to concrete mediator nodes, and S specifies how to execute A and inject its values into the corresponding mediator nodes during a forward pass, yielding the replacement model M ∘ S. We measure adequacy solely by agreement under replacement on 𝒟OOD. Concretely, the (true) replacement disagreement risk is
ε(A, π, S) := Prx ∼ 𝒟OOD [B(M(x)) ≠ B((M ∘ S)(x))].
When M is randomized under a fixed randomness protocol (e.g. sampling), the probability above is taken over both x ∼ 𝒟OOD and the model randomness, unless stated otherwise.

A robust mechanistic certifier is a procedure that outputs either:

  1. an explanation (A, π, S) together with a certificate (ε̂, δ); or
  2. FAIL, an abstention that carries no claim.

In the first case, the probability statement is a soundness requirement: whenever the procedure returns (A, π, S, ε̂), the bound must upper-bound the true replacement error over 𝒟OOD with confidence at least 1 − δ.

We emphasize the asymmetry between certifying and returning FAIL. Soundness constrains only the former: returning FAIL is always permitted and is interpreted as "no certified sparse explanation was obtained under the given interface and budget," not as evidence that no such explanation exists.

The reported certificate is meaningful only if it is computed from evaluation samples that are independent of the hypothesis selection and fitting steps. Thus, we require that any data used to propose mediators, fit A, or tune hyperparameters be disjoint from the data used to estimate replacement disagreement, or else that a valid uniform-convergence correction be applied over the effective hypothesis set tested. In particular, if multiple candidate explanations are adaptively searched, the reported confidence level must account for this adaptivity (e.g. by union bounds over a finite tested set, or by allocating fresh held-out samples per hypothesis).

Among all explainers that can be certified, we prefer those with small mediator sets, i.e. small |U| (equivalently, sparse π(U) ⊆ V), and with tight bounds ε̂. This preference is not itself a guarantee: by Theorem~8.1 (NP-hardness of minimum mediator selection), exact minimality is intractable in general. Accordingly, we view mediator sparsity as an optimization target under compute constraints, not a property we can promise.

Our certification target is intentionally narrow. In particular:

  1. We certify preservation of the declared behavior B under replacement, not equality of logits, hidden states, or free-form text.
  2. We claim nothing about inputs outside the support of 𝒟OOD or about perturbations outside ℐ.
  3. We do not claim that the certified mediator set is unique, minimal, or the "true" mechanism, only that it is behaviorally sufficient under replacement on the declared family.

With these boundaries fixed, the RMC problem asks for procedures that leverage white-box interventions to produce PAC-style bounds on behavioral invariance under mechanistic replacement, with sparsity treated as the principal explanatory objective subject to computational feasibility.
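To fix the interface concretely, the following is a minimal sketch of the RMC inputs and outputs in Python-style signatures; all names and types are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

@dataclass
class Explanation:
    A: Any              # abstract causal model
    pi: dict            # abstract variable -> concrete mediator node
    S: Callable         # replacement scheme: how A's values are injected into M

@dataclass
class Certificate:
    eps_hat: float      # certified upper bound on replacement disagreement
    delta: float        # confidence parameter
    m: int              # number of held-out evaluation samples used

def rmc(model, behavior, sample_ood, interventions,
        eps_target: float, delta: float) -> Optional[Tuple[Explanation, Certificate]]:
    """Return (explanation, certificate) with eps_hat <= eps_target, or None (FAIL).
    Returning None is always sound; it does not claim that no explanation exists."""
    raise NotImplementedError  # instantiated by the protocol of Section 5
```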


5. Certification Protocol: modular pipeline (candidate mediator search, abstraction fitting, replacement construction, validation). Provide pseudocode and invariants.

We now describe a concrete modular protocol that searches for a sparse set of mediators and then certifies behavioral preservation by replacement validation on held-out samples from the declared sampler for 𝒟OOD. The protocol is intentionally structured so that (i) all causal probes are explicit interventions from ℐ, and (ii) the only statistical guarantee required downstream is a concentration bound for a Bernoulli disagreement indicator. The resulting soundness statements are proved in Section~6.

The protocol comprises four stages:

  1. CandidateSearch: propose a small pool of candidate mediator nodes by scoring interventional influence on B.
  2. FitAbstraction: fit an abstract causal model A with nodes U and a mapping π onto the selected mediators.
  3. Replacement construction: specify the scheme S that executes A and injects its values into M, yielding M ∘ S.
  4. ValidateByReplacement: estimate the replacement disagreement of M ∘ S on held-out samples from 𝒟OOD and convert it into a certificate.

Stages (1)–(3) may use arbitrary modeling choices subject to the interface; only stage (4) is constrained to produce a reportable certificate.

We begin by sampling a fit set Sfit and a disjoint evaluation set Seval from the declared sampler for 𝒟OOD. The fit set may be used for all adaptive steps, including selecting mediators, fitting A, and tuning hyperparameters. The evaluation set is reserved exclusively for estimating replacement disagreement. If the search tests multiple hypotheses (distinct explanations), we either (a) allocate disjoint evaluation subsets per hypothesis, or (b) apply an explicit finite-family correction (formalized in Section~6).

We treat candidate discovery as a causal attribution problem: which internal nodes v ∈ V have large interventional influence on B? Concretely, for each probe we draw x ∼ Sfit, choose an admissible intervention do(v ← z) ∈ ℐ (for example, patching v from a counterfactual run, or setting coordinates in an SAE basis), and measure the induced change in B(M(x)). Aggregating across samples yields a score Score(v) intended as a proxy for the causal effect of v on B under the intervention family. We then select a small mediator set by greedy addition, using the marginal gain of adding a node to explain (or stabilize) B under interventions. This step is heuristic and may be viewed as an approximation algorithm for a combinatorial selection problem; our guarantees do not require that it succeed, only that its outputs be evaluated soundly.
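A sketch of this stage under assumed callables (a `model(x, interventions=...)` interface, a `behavior` predicate, a patch generator, and a `replacement_err` evaluator); the greedy loop mirrors the marginal-gain selection described above:

```python
import numpy as np

def score_node(v, fit_inputs, model, behavior, make_patch):
    """Fraction of fit inputs on which the intervention do(v <- z) flips B."""
    flips = 0
    for x in fit_inputs:
        y = model(x)
        y_patched = model(x, interventions={v: make_patch(v, x)})
        flips += int(behavior(y) != behavior(y_patched))
    return flips / len(fit_inputs)

def greedy_select(candidate_nodes, fit_inputs, replacement_err, k):
    """Greedy addition: at each step add the node whose inclusion most reduces
    replacement disagreement of the current hybrid model on the fit split."""
    selected = []
    for _ in range(k):
        best_v, best_err = None, np.inf
        for v in candidate_nodes:
            if v in selected:
                continue
            err = replacement_err(selected + [v], fit_inputs)
            if err < best_err:
                best_v, best_err = v, err
        selected.append(best_v)
    return selected
```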

Given a selected mediator set V̂ ⊆ V, we fit an abstract model A with node set U and mapping π : U → V̂. The fitting objective depends on the chosen hypothesis class (e.g. bounded-size DAGs, constrained programs, or regression models on SAE features). A typical objective is to predict mediator values (or sufficient statistics thereof) as functions of upstream observed quantities, optionally under interventions, so that when A is substituted into the computation of π(U), the downstream behavior B is preserved. Importantly, A is permitted to be imperfect; the certificate will quantify the induced behavioral error rather than assuming exact equivalence.

A replacement scheme S specifies when and how A is executed and injected. Operationally, a forward pass of M ∘ S proceeds as in M except that, when the computation reaches a mediator node v = π(u), the protocol computes u using A (given the agreed-upon inputs to A) and overwrites the value at v with the injected representation of u. The injection operation must be legal under ℐ; for instance, we may overwrite an activation vector, replace only a projection onto a learned subspace, or set selected SAE coordinates. We emphasize that S is part of the explanation: distinct choices of where and how to inject can yield different replacement errors even for the same abstract graph.

On Seval = {x1, …, xm} we compute the empirical replacement disagreement
$$ \hat e\ :=\ \frac{1}{m}\sum_{i=1}^m \mathbf{1}\!\left[B(M(x_i))\neq B\big((M\circ S)(x_i)\big)\right]. $$
We then report a bound ε̂ obtained by adding an explicit concentration term to ê (and, if applicable, an adaptivity correction over the tested hypothesis set). The protocol returns (A, π, S) with certificate (ε̂, δ) only if ε̂ ≤ ε; otherwise it returns FAIL.
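A sketch of the validation step, assuming `model`, `replacement_model`, and `behavior` callables; the optional `n_hypotheses` argument anticipates the finite-family correction of Section 6:

```python
import math

def certify(eval_inputs, model, replacement_model, behavior,
            delta, eps_target, n_hypotheses=1):
    """One-shot validation on held-out samples: empirical disagreement plus a
    Hoeffding term (with an optional union-bound factor for n_hypotheses)."""
    m = len(eval_inputs)
    disagreements = sum(
        behavior(model(x)) != behavior(replacement_model(x)) for x in eval_inputs
    )
    e_hat = disagreements / m
    slack = math.sqrt(math.log(2 * n_hypotheses / delta) / (2 * m))
    eps_hat = e_hat + slack
    return (eps_hat, delta, m) if eps_hat <= eps_target else None   # None == FAIL
```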

The modularity above is useful only if certain invariants are maintained throughout:

  I1. Declared interface: every causal probe in every stage is an intervention from ℐ, and every input is drawn from the declared samplers (Sfit and Seval from 𝒟OOD, plus any declared fit distribution).
  I2. Evaluation independence: Seval is disjoint from all data used to propose mediators, fit A, or tune hyperparameters, and every hypothesis evaluated on Seval is fixed before Seval is inspected (or an explicit multiplicity correction is applied).
  I3. Full accounting: the set of hypotheses whose disagreement is estimated on Seval is recorded, so that the reported confidence level can reflect the multiplicity of tests.

Given these invariants, the remaining work is to formalize the finite-sample bound used in step (4) and the correction needed when multiple explanations are tested, which we do next.


6. Soundness Theorems (PAC Guarantees): finite-sample bounds for replacement error under explicit sampling assumptions; how to handle multiple hypotheses (union bounds).

We now formalize the statistical guarantee used in the validation stage: from a held-out evaluation set drawn i.i.d. from the sampler for 𝒟OOD, we compute an empirical replacement-disagreement rate and convert it into a high-confidence upper bound on the replacement-disagreement probability. The salient point is that the certificate concerns only a Bernoulli random variable, so we require no modeling assumptions beyond independent sampling and the protocol invariants from Section~5.

Fix a model M, a behavior functional B, a declared OOD sampling procedure for 𝒟OOD, and a fixed explanation (A, π, S) inducing a replacement model M ∘ S. We allow that M and/or M ∘ S may be randomized (e.g. dropout, stochastic tools, or randomized abstraction A); we interpret all probabilities below as additionally marginalizing over any internal randomness, unless explicitly conditioned.

Define the disagreement indicator
Z(x) := 1 [B(M(x)) ≠ B((M ∘ S)(x))] ∈ {0, 1},
and its true risk over the declared OOD distribution,
ε := 𝔼x ∼ 𝒟OODZ(x).
Given an evaluation set Seval = {x1, …, xm} drawn i.i.d. from the declared sampler, we compute the empirical risk
$$ \hat e\ :=\ \frac{1}{m}\sum_{i=1}^m Z(x_i). $$
Soundness is the statement that, with probability at least 1 − δ over the draw of Seval (and any internal randomness), the reported ε̂ satisfies ε ≤ ε̂.

For a fixed (A, π, S) chosen independently of Seval, Hoeffding’s inequality directly yields a one-sided upper confidence bound.

Theorem 6.1 (single-hypothesis certificate). Let (A, π, S) be fixed without access to Seval. Let $x_1,\dots,x_m\overset{i.i.d.}{\sim}\mathcal{D}_{\mathrm{OOD}}$ and define Z and ê as above. Then for any δ ∈ (0, 1),
$$ \Pr\!\left[\varepsilon \le \hat e + \sqrt{\frac{\log(2/\delta)}{2m}}\right]\ \ge\ 1-\delta. $$
Accordingly, the certificate
$$ \hat\varepsilon\ :=\ \hat e + \sqrt{\frac{\log(2/\delta)}{2m}} $$
upper-bounds the true replacement disagreement probability over the declared 𝒟OOD with confidence at least 1 − δ.

Proof. Since the Z(xi) ∈ [0, 1] are i.i.d. with mean ε, Hoeffding's inequality implies
Pr [ε − ê ≥ t] ≤ exp(−2mt²) for all t > 0. Setting
$t=\sqrt{\frac{\log(2/\delta)}{2m}}$ yields the claim.

The protocol may search over many candidate explanations (distinct mediator sets, abstractions, or injection schemes). If we compute ê on Seval for several hypotheses and then select a favorable one, we must account for this multiplicity. A clean correction applies when the set of hypotheses tested on Seval is fixed before inspecting Seval (Invariant~I2), e.g. by generating candidates using Sfit only.

Theorem 6.2 (finite-family certificate via union bound). Let ℋtest be a finite set of explanations (A, π, S) produced without access to Seval. For each h ∈ ℋtest, compute
$$ \hat\varepsilon_h\ :=\ \hat e_h + \sqrt{\frac{\log(2|\mathcal{H}_{\mathrm{test}}|/\delta)}{2m}}, \qquad \hat e_h := \frac{1}{m}\sum_{i=1}^m Z_h(x_i). $$
Then, with probability at least 1 − δ, we have simultaneously for all h ∈ ℋtest that εh ≤ ε̂h.

Proof. Apply Theorem~6.1 to each fixed h with confidence parameter δ/|ℋtest| and take a union bound over the failure events.

If hypotheses are generated using feedback from Seval (e.g. repeatedly evaluating on the same holdout while tweaking A), the i.i.d. risk estimate is no longer valid without additional machinery. Our protocol therefore enforces one of the following: (i) strict holdout integrity, in which all adaptive choices are confined to Sfit and Seval is touched once; or (ii) an explicit accounting scheme, such as allocating disjoint evaluation subsets to successive tests, so that each reported bound conditions on a fresh independent sample. Either approach suffices for Invariant~I2 and thus for Theorems~6.1–6.2 to apply as stated.

Rearranging Theorem~6.1, to achieve an additive slack at most α in the certificate term it suffices that
$$ m\ \ge\ \frac{1}{2\alpha^2}\log\frac{2}{\delta}. $$
We emphasize that this dependence is intrinsic for estimating Bernoulli risks in the worst case; improving constants is possible (e.g. exact binomial confidence intervals), but does not change the qualitative scaling.
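As a concrete illustration, targeting additive slack α = 0.02 at confidence δ = 0.01 requires m ≥ log(200)/(2 · 0.02²) ≈ 6.6 × 10³ held-out evaluations, i.e. a few thousand paired forward passes of M and M ∘ S; the requirement depends only on (α, δ), not on model size or on the number of mediators.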

These soundness statements reduce certification to a transparent statistical test whose only input is the declared OOD sampler and the evaluated disagreement events. In the next section we complement soundness with a completeness discussion in compiled settings, where ground-truth abstractions exist and can be recovered under additional identifiability assumptions.


7. Completeness in Compiled/Algorithmic Testbeds: conditions under which a sparse ground-truth abstraction is recoverable; proof sketch; discussion of identifiability assumptions and how they can fail.

We next record a completeness statement for settings in which the model is known to be generated from an interpretable algorithmic specification. The intended use is not that arbitrary frontier models satisfy these assumptions, but that our protocol can be validated end-to-end on testbeds where a sparse ground-truth abstraction exists (and is in principle recoverable). Typical examples include Tracr-compiled transformers, RASP-style programs, or synthetic models whose computation is implemented by a small number of identifiable attention heads and MLP features.

Assume there exists an abstract causal model A* with nodes U* = {u1, …, us} and edges EA* such that, for every input x in the support of 𝒟OOD, the behavior B(M(x)) is a (possibly randomized) function of the abstract variables U*(x), and moreover each abstract variable uj is realized inside M by a node vj ∈ V in the computation graph. Formally, we assume a mapping π* : U* → V and a replacement scheme S* such that, if we compute mediator values using A* and inject them at π*(U*), then the resulting replacement model M ∘ S* is behaviorally equivalent to M on 𝒟OOD:
Prx ∼ 𝒟OOD [B(M(x)) ≠ B((M ∘ S*)(x))] = 0,
or at most η if the compilation injects bounded noise (e.g. quantization or stochasticity) that cannot be removed by mediator replacement.

Completeness requires that the mediators be identifiable by the intervention family ℐ. We encode this in two assumptions, which are standard in causal identification arguments but here instantiated at internal nodes of M.

  1. Mediator identifiability: for each mediator vj = π*(uj), the family ℐ contains interventions that set vj across its relevant range of values while holding the remaining mediators fixed in distribution, so that the causal effect of each mediator on B can be isolated.
  2. On-distribution intervention semantics: the interventions used to probe and fit mediators keep non-mediator activations within (or close to) the conditional distributions induced by 𝒟OOD, so that measured effects reflect the mediator mechanism rather than damage from the intervention itself.

Under these assumptions, interventions furnish a separation oracle: if a node is causally upstream of B through the mediator mechanism, then targeted interventions reveal a consistent causal effect on B; if a node is irrelevant (or only correlationally associated), then such effects vanish when mediators are held fixed.

Assume a compiled setting with a ground-truth explanation (A*, π*, S*) of mediator size s, and assume the identifiability and on-distribution intervention conditions above. Suppose further that (i) the candidate-generation subroutine returns a set containing π*(U*), and (ii) the abstraction fitting stage searches a hypothesis class containing A* (up to observational equivalence on the mediator distributions induced by 𝒟OOD and ). Then, given sufficient interventional query budget to estimate causal effects reliably on the fit set, the protocol can return an explanation (A, π, S) with true replacement disagreement at most η (often η = 0), and the subsequent validation stage certifies this with m = O(log (1/δ)) evaluation samples for any fixed target confidence 1 − δ.

Proof sketch. The argument follows directly from the assumptions. Since the candidate set contains the true mediator nodes and the mediators are identifiable, a greedy (or otherwise consistent) mediator-selection procedure that ranks nodes by estimated causal effect on B under interventions will include all mediators in π*(U*) once the estimation error is sufficiently small; the on-distribution intervention semantics ensure these causal effects are not artifacts of broken activations. With the correct mediator set available, the fitting step can represent A* within the hypothesis class, and hence can choose A and π that match the true mediator mechanism on the fit distribution (and under the interventions used to constrain the fit). Substituting this A into M via S yields M ∘ S that is behaviorally equivalent to M on 𝒟OOD, up to the irreducible compilation noise level η. Finally, the validation bound of Section~6 converts the observed disagreement on m fresh OOD samples into a high-confidence upper bound; if the true disagreement is 0 (or η), m = O(log (1/δ)) suffices to make the statistical slack small.

The above argument is deliberately brittle: it clarifies where completeness should and should not be expected.
First, identifiability fails if mediators are entangled, so that no intervention in ℐ can vary one mediator while holding the others fixed in distribution. This arises when two abstract variables are encoded in a rotated subspace of the residual stream and ℐ only permits coordinate-wise patching, or when layerwise operations (e.g. attention and MLP blocks) mix the mediators before any accessible node boundary.
Second, on-distribution intervention semantics can fail when the intervention induces a distribution shift in non-mediator nodes. For instance, hard-setting a residual-stream vector may disrupt layernorm and thereby change downstream computations in ways not attributable to the intended mediator; in such cases causal effect estimates confound mechanism with intervention damage.
Third, completeness can fail due to basis mismatch: even if a sparse A* exists, the learned feature basis used for mediator discovery (e.g. SAE features) may not align with π*(U*), so that no small mediator set in the chosen coordinates is sufficient.
Finally, there are failures of sparsity: if B is sensitive to microscopic differences (e.g. exact token probabilities rather than discrete decisions), then replacement equivalence may require preserving many degrees of freedom, and the smallest faithful mediator set may be large even in compiled settings.

These failure modes delimit the scope of completeness claims: they do not undermine soundness of any certificate, but they explain why the protocol may return FAIL despite the existence of an intuitive high-level story, and why we treat sparse mechanistic recovery as an assumption-dependent property rather than a universal guarantee.


8. Computational Complexity and Hardness: NP-hardness of minimum mediator selection; implications; approximation strategies and what we can still guarantee.

The protocol described above separates certification from the search for a small mediator set and a compatible abstraction. In this section we make explicit that the latter problem is intractable in the worst case, even under generous oracle access to internal interventions. Consequently, we do not interpret FAIL as evidence that no sparse mechanistic explanation exists; rather, it may reflect the inherent combinatorial difficulty of mediator selection under finite query budgets.

Fix a model M, behavior functional B, an OOD family 𝒟OOD, and an intervention family ℐ. Consider the decision problem:
$$ \textsc{SmallMediator}(k):\quad \exists~(A,\pi,S)\ \text{with}\ |\mathrm{im}(\pi)|\le k\ \text{and}\ \Pr_{x\sim \mathcal{D}_{\mathrm{OOD}}}\!\left[B(M(x))\neq B((M\circ S)(x))\right]=0? $$
Even if we restrict A to be a simple Boolean circuit over mediator variables (or a bounded-depth program) and grant oracle access to interventional evaluations of internal nodes, the following holds.

Theorem 8.1 (NP-hardness of minimum mediator selection). The optimization problem of finding the smallest mediator set W ⊆ V for which there exists an abstraction A and replacement scheme S achieving zero replacement disagreement (or disagreement ≤ ε for a fixed ε < 1/2) is NP-hard. Equivalently, SmallMediator(k) is NP-complete under standard encodings.

Proof sketch (reduction from Set Cover). Let (𝒰, 𝒮) be a Set Cover instance with universe 𝒰 = {e1, …, em} and sets 𝒮 = {S1, …, Sn}. We construct a model M whose internal nodes include n candidate components v1, …, vn, each corresponding to a set Si. Inputs x encode an element index j ∈ [m] (drawn from 𝒟OOD uniformly over indices), and the behavior B is defined so that B(M(x)) = 1 if and only if at least one active component vi with ej ∈ Si is preserved by the replacement scheme. More concretely, the computation is arranged so that for input j the output depends on the disjunction
yj = ⋁i : ej ∈ Si vi,
and the replacement mechanism succeeds on all j if and only if the chosen mediator set contains, for every ej, some vi with ej ∈ Si. Under an intervention family permitting node replacement (or clean-to-corrupt patching) on each vi, an explanation with mediator set W ⊆ {v1, …, vn} can preserve B on all indices if and only if the corresponding collection {Si : vi ∈ W} covers 𝒰. Thus the smallest mediator set corresponds to a minimum set cover, implying NP-hardness.
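The correspondence at the heart of the reduction can be checked in a few lines; the instance below is hypothetical and purely illustrative of the proof, not part of the protocol:

```python
def behavior_preserved(mediators, universe, sets):
    """True iff preserving components {v_i : i in mediators} keeps B(M(x_j)) = 1
    for every element index j, i.e. iff the chosen sets cover the universe."""
    return all(any(j in sets[i] for i in mediators) for j in universe)

universe = {0, 1, 2, 3}
sets = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {0, 3}}
print(behavior_preserved({0, 2}, universe, sets))   # True:  S_0 and S_2 cover U
print(behavior_preserved({0, 1}, universe, sets))   # False: element 3 is uncovered
```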

Theorem~8.1 rules out worst-case guarantees of minimality (unless P = NP) for any algorithm that, given (M, B, 𝒟OOD, ℐ), always returns an explanation with minimum mediator size. This includes algorithms that have white-box access and can query interventional effects: the obstacle is not statistical estimation alone but the combinatorics of selecting a minimal causally sufficient subset among many interacting candidates. Accordingly, we treat mediator discovery as an approximation problem, and we interpret the certificate as a guarantee about behavioral preservation of the replacement model, not about optimality of the mediator set.

A standard relaxation is to select mediators by maximizing a tractable proxy for "mechanistic coverage." One such proxy is a monotone set function F(S) measuring the estimated causal effect of intervening on S on the behavior (e.g. reduction in replacement disagreement when S is preserved, or mutual information between S and B under interventions). When F is (approximately) submodular and monotone, greedy forward selection yields the classical (1 − 1/e) approximation ratio with respect to F:
$$ F(S_{\mathrm{greedy}})\ \ge\ \left(1-\tfrac{1}{e}\right)\max_{|S|\le k}F(S). $$
We emphasize that this guarantee is only as meaningful as the proxy: submodularity may fail when mediators interact non-additively, and high proxy value does not by itself imply low replacement error without additional assumptions connecting F to replacement behavior.
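A minimal sketch of this greedy step; the proxy F below is a toy coverage function and the attribution dictionary is a hypothetical example:

```python
def greedy_max(F, candidates, k):
    """Greedy forward selection of k elements maximizing a monotone proxy F."""
    S = []
    for _ in range(k):
        gains = {v: F(S + [v]) - F(S) for v in candidates if v not in S}
        S.append(max(gains, key=gains.get))
    return S

# Toy proxy: each node is credited with a set of behavior-relevant effects,
# and F counts how many distinct effects the selected nodes cover.
effects = {"h3": {1, 2}, "h7": {2, 3, 4}, "f12": {5}, "f40": {1, 5}}
F = lambda S: len(set().union(*[effects[v] for v in S])) if S else 0
print(greedy_max(F, list(effects), k=2))   # e.g. ['h7', 'f40']
```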

A second strategy is to incorporate sparsity directly into abstraction fitting by penalizing mediator usage, e.g. by adding an ℓ0-style budget (implemented by stepwise selection) or an ℓ1 surrogate over mediator coefficients in a parametric A. This does not evade NP-hardness in general, but it yields practical tradeoffs and provides a well-defined sequence of hypotheses whose behavior can be certified on held-out OOD samples.

Hardness of search does not weaken the soundness guarantee of the validation stage. For any fixed candidate (A, π, S) produced by any heuristic, the replacement disagreement on 𝒟OOD is a Bernoulli risk, and concentration bounds yield a high-confidence upper bound from m independent evaluation samples. Thus, while we cannot generally promise to find the smallest mediator set, we can promise that whenever the protocol returns (A, π, S) with certificate (ε̂, δ), the inequality
Prx ∼ 𝒟OOD [B(M(x)) ≠ B((M ∘ S)(x))] ≤ ε̂
holds with probability at least 1 − δ over the evaluation sampling (and any internal randomness of M and the procedure). In particular, FAIL is conservative: it indicates that no tested hypothesis met the target bound under the declared sampling and hypothesis-testing corrections, not that the model lacks a sparse mechanism.

Finally, we record the operational cost drivers. Let Q be the number of interventional forward passes used for candidate scoring and mediator selection, and let m be the held-out evaluation size for certification. Then the end-to-end runtime is dominated by Q + m forward evaluations (plus abstraction fitting time), and the memory footprint is dominated by activation caching for the selected nodes. Theorem~8.1 and related lower bounds therefore motivate allocating compute to search heuristics only insofar as they improve the chance of finding a certifiable explanation under the fixed, sound validation procedure.


9. Practical Instantiations: (planned) SAE-based mediator variables; automated circuit discovery; causal scrubbing validation; guidelines for designing on-distribution interventions.

We now record concrete instantiations of the abstract objects (U, A, π, S) and of the intervention family ℐ that make the protocol executable on contemporary transformer models. The guiding constraint is that, while theorem-level guarantees concern only the validation stage, the discovery stage must expose enough causal structure to make it plausible that a small mediator set exists and can be found under realistic query budgets.

A natural choice of mediator coordinates is provided by sparse autoencoders (SAEs) trained on residual-stream (or MLP/attention output) activations. Fix a layer ℓ and a site t (token position). Let rℓ, t ∈ ℝd denote the corresponding activation. An SAE gives an encoding map Enc : ℝd → ℝp and decoding map Dec : ℝp → ℝd with a sparse code aℓ, t = Enc(rℓ, t) and reconstruction r̂ℓ, t = Dec(aℓ, t). We treat individual coordinates (or small groups) of aℓ, t as candidate mediator nodes. Concretely, we may take
U = {u1, …, us},   π(uj) = (ℓj, tj, kj),
where (ℓ, t, k) indexes the k-th SAE feature at layer ℓ and position t. The replacement scheme S then operates at the level of codes: we run M up to the chosen site(s), overwrite the selected coordinates of aℓ, t with values produced by the abstraction A (or resampled from a reference cache), decode back to ℝd, and inject the resulting activation into the residual stream. This yields an intervention family that is both expressive (feature-level edits) and localized (affecting a small subspace), which empirically improves stability relative to full-vector replacement.
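A minimal sketch of the code-level injection, with an untrained toy SAE standing in for a trained one; the dimensions, feature indices, and injected values are illustrative assumptions:

```python
import torch
import torch.nn as nn

d, p = 64, 256
enc, dec = nn.Linear(d, p), nn.Linear(p, d)   # untrained stand-ins for Enc/Dec

def inject_features(resid, feature_ids, new_values):
    """Encode the residual activation, overwrite selected SAE coordinates with
    values supplied by the abstraction, decode, and return the injected vector.
    (A common variant injects resid + dec(code') - dec(code) to preserve the
    SAE reconstruction error; omitted here for brevity.)"""
    code = torch.relu(enc(resid))             # a = Enc(r)
    code[..., feature_ids] = new_values       # set mediator coordinates from A
    return dec(code)                          # injected activation in R^d

resid = torch.randn(1, d)                     # cached residual-stream activation
patched = inject_features(resid, feature_ids=[3, 41],
                          new_values=torch.tensor([1.5, 0.0]))
print(patched.shape)                          # torch.Size([1, 64])
```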

Given a behavior functional B, we require a mechanism to propose a small pool Vcand ⊆ V of causally relevant internal sites prior to fitting A. We anticipate using automated circuit discovery methods instantiated as interventional attribution sweeps: for each candidate node v ∈ V and input x, we compare B(M(x)) to B(Mdo(v ← z)(x)) under legal z drawn from an appropriate distribution (e.g. a clean cache, a corrupt cache, or an SAE-feature perturbation). Aggregating over x ∼ 𝒟fit yields an estimated causal score
$$ \widehat{\mathrm{CE}}(v)\;=\;\mathbb{E}\big[\mathbf{1}\{B(M(x))\neq B(M_{\mathrm{do}(v\leftarrow z)}(x))\}\big], $$
or, when B is graded, a corresponding loss difference. We then form Vcand by thresholding and by enforcing diversity constraints (across layers/heads/features) to reduce redundancy. For feature-level mediators, a useful variant is to intervene on small groups of SAE features that co-activate, since single-feature edits may understate causal influence when the representation is distributed.

From Vcand we select a mediator set of size s by a greedy procedure on a proxy objective consistent with the replacement goal: at each step we choose the node whose preservation most reduces empirical replacement disagreement on a small split. This yields a sequence of hypotheses indexed by s, suitable for Theorem~6.2-style corrections (or for a fixed held-out evaluation set used only once at the end). Given a mediator set, we fit A as a constrained predictor of mediator values and/or of B under interventions. In compiled settings, A may be taken as a small DAG whose structural equations are (piecewise) linear maps on discrete features; in uncompiled settings, we anticipate using sparse regression or small decision programs. The mapping π is then defined by the chosen internal sites (or SAE coordinates), and the replacement scheme S is the deterministic procedure that computes A on its inputs and injects its mediator outputs into M.

The "causal scrubbing" paradigm can be expressed within our framework by taking S to resample mediator nodes from a reference distribution conditioned on an abstract variable, leaving all other computation intact. In our notation, we compare M to M ∘ S where, for each mediator v = π(u), the replacement value is drawn as z ∼ P̂(v ∣ u) estimated from a cache. Our certificate then bounds the probability that this scrubbing operation changes the behavior B over 𝒟OOD. This reframes causal scrubbing as a sound, distribution-indexed validation problem: the substantive work is in choosing interventions and conditioning variables so that the scrubbing distribution is "on-distribution" in the sense required by the invariants.
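A sketch of this resampling scheme, assuming hypothetical callables `run_model` (returning a dict of named activations), `run_model_with_patch`, and `abstract_value`:

```python
import random
from collections import defaultdict

def build_conditional_cache(reference_inputs, run_model, mediator, abstract_value):
    """cache[u] holds mediator activations observed when the abstract variable
    took value u on reference inputs, estimating P_hat(v | u)."""
    cache = defaultdict(list)
    for x in reference_inputs:
        activations = run_model(x)            # dict: node name -> activation
        cache[abstract_value(x)].append(activations[mediator])
    return cache

def scrubbed_forward(x, run_model_with_patch, mediator, abstract_value, cache):
    """Resample the mediator from its conditional cache and rerun the model."""
    z = random.choice(cache[abstract_value(x)])   # z ~ P_hat(v | u)
    return run_model_with_patch(x, {mediator: z})
```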

The practical difficulty in mechanistic certification is not defining do(v ← z) but ensuring that such interventions do not create activation states far outside the model's training manifold, which would make replacement disagreement uninterpretable. We therefore restrict ℐ to interventions that approximately preserve relevant conditional distributions of non-mediator nodes given mediators. Operationally, this suggests: (i) cache-based patching, where z is taken from activations observed on real samples (possibly matched on simple statistics such as token identity, position, or coarse latent state); (ii) sparse feature-level edits, where only a small set of sparse coordinates is edited while the remaining coordinates are left untouched, reducing distribution shift; and (iii) conditional resampling, where z is drawn from a learned conditional model P̂(v ∣ c) for a context variable c (e.g. token class or abstract state), thereby avoiding unconditional mixing of incompatible activation modes. We additionally recommend normalizing interventions to respect layer norm statistics (e.g. by patching pre-norm quantities or by reapplying normalization post-patch) when the underlying architecture is sensitive to scale.

Finally, we emphasize an accounting-compatible implementation strategy: we cache only the activations needed for the selected mediators (and any parents used by A), we separate the data used for fitting from the evaluation split used for the certificate, and we record the total number of interventional forward passes used to score candidates. This ensures that empirical certificates can be interpreted as guarantees relative to the explicitly declared 𝒟OOD and , independent of the heuristics used to propose (A, π, S).


10. Experimental Plan (to strengthen contribution): compiled transformers (Tracr/InterpBench/TracrBench-like) for ground truth; progressively relax assumptions (noise, redundancy, superposition); small open LMs for external validity; falsification-first reporting.

We begin with compiled transformers for which a ground-truth abstraction is known by construction. Concretely, we use Tracr-style compilation pipelines and related benchmark suites (e.g. InterpBench/TracrBench-like tasks) where the source program P induces an abstract causal model A with an explicit mediator set U and a known alignment π : U → V (often up to benign symmetries such as permutation of redundant features). These settings allow us to operationalize completeness claims: CandidateSearch is judged by whether it returns a superset of π(U) under a declared ℐ, and FitAbstraction is judged by whether it reconstructs a model A whose replacement M ∘ S attains near-zero empirical disagreement on a held-out 𝒟OOD evaluation set. We will vary the task families to cover both local algorithmic behaviors (e.g. copy, parity, bracket matching) and multi-step state machines (e.g. finite automata) so that mediators include both token-local and trajectory-level state variables. In each case, we pre-register the OOD generators 𝒟k (length extrapolation, vocabulary permutations, distractor insertion, controlled distribution shift in latent state occupancy) and report certificates with respect to the declared family 𝒟OOD = ∪k𝒟k.

Having established a regime where the ground truth is accessible, we progressively relax the idealized conditions underlying identifiability and on-distribution interventions. First, we inject controlled compiler noise and architectural mismatch (e.g. added residual pathways, extra attention heads, or random linear mixing) to test robustness of mediator discovery when the true abstraction is present but not cleanly isolated. Second, we introduce redundancy: we explicitly compile multiple copies of the same abstract variable into disjoint internal subcircuits, thereby creating non-unique minimal mediator sets. Here the objective is not to recover π uniquely, but to certify a replacement-preserving mediator set and to quantify how mediator size scales with redundancy. Third, we enforce superposition by mixing several abstract variables into shared internal directions before non-linearities; in SAE-based instantiations this tests whether learned feature bases Φ recover approximately disentangled coordinates adequate for stable interventions in ℐ. For each relaxation, we record (i) the query budget Q needed by CandidateSearch to reach a target certified ε̂, (ii) mediator size s versus redundancy/superposition strength, and (iii) sensitivity of the certificate to the choice of intervention distribution (cache source, conditioning granularity, and patch magnitude constraints).

We then move to small open LMs where no ground-truth abstraction exists. Here the experimental goal is not completeness relative to a known A, but rather to demonstrate that the protocol can produce nontrivial certificates under realistic constraints. We fix a small family of behaviors B (e.g. format-following, simple tool-call triggering, short-horizon reasoning templates, or safety-relevant classifiers) and define explicit OOD generators (prompt paraphrases, adversarially templated inputs, length extrapolation, and distribution shifts induced by role/system-message changes when applicable). The primary deliverable is a set of certified explanations (A, π, S) with measured mediator sizes and well-defined certificates (ε̂, δ), together with ablations showing that (a) certificates degrade predictably when interventions become more off-manifold (e.g. unconditional resampling), and (b) certificates tighten when interventions are made more on-distribution (e.g. conditional caches). We also report the stability of π and A across random seeds and across modest changes to 𝒟OOD, as a sanity check against overfitting to a narrow generator.

Because mechanistic claims are susceptible to subtle leakage between fitting and evaluation, we adopt falsification-first reporting. We include negative controls in which S replaces purported mediators with mismatched caches or with mediators from unrelated inputs, and we confirm that the estimated replacement disagreement increases, thereby validating that the test is sensitive. We also include mediator sets of matching cardinality sampled from layers/sites with low estimated causal score, to bound the risk that certificates are obtained merely by injecting generic noise that is behaviorally inert for accidental reasons. Finally, we explicitly separate: (i) performance of the search heuristic (how often it finds a good candidate), from (ii) soundness of the certificate conditional on a fixed hypothesis. When multiple hypotheses are adaptively tried, we either reserve a single final evaluation set untouched by selection, or we apply explicit hypothesis-testing corrections consistent with Theorem~6.2. Across all experiments, we will report not only successes but also systematic failure cases (no certificate under the compute budget, or only certificates with large mediator sets), since these delineate the practical boundary of the method under declared (𝒟OOD, ℐ) rather than suggesting unearned generality.


11. Limitations and Failure Modes: off-distribution patching artifacts; hydra/self-repair; non-linear/manifold features; adversarially evasive models; what certificates do and do not imply.

Our certificates are inherently relative to two declared objects: the OOD family 𝒟OOD and the intervention family ℐ. Theorem~6.1 upper-bounds the true disagreement risk Prx ∼ 𝒟OOD[B(M(x)) ≠ B((M ∘ S)(x))] for a fixed explanation (A, π, S), but it does not speak to inputs outside the support of 𝒟OOD, nor to perturbations outside ℐ. Consequently, a small ε̂ should be read as: "under the declared sampling assumptions, replacing mediators according to S preserves the behavioral predicate B on 𝒟OOD," and not as a global equivalence claim about M and M ∘ S. In particular, if 𝒟OOD fails to assign non-negligible probability to a problematic regime (rare triggers, long-horizon states, or atypical tool-call contexts), then no finite-m evaluation can yield a meaningful bound for that regime, regardless of white-box access.

Many practically convenient interventions (e.g. activation patching from unrelated cache sources, unconditional resampling, or large-magnitude subspace edits) are off-manifold relative to the conditional activation distribution induced by M on 𝒟OOD. When do(v ← z) ∈ ℐ injects values z that are atypical given upstream context, the model may enter activation regions where downstream computations are poorly behaved (saturation, attention collapse, or spurious feature co-activations). In this case, observed sensitivity of B to patched nodes can reflect intervention artifacts rather than causal mediation in the unperturbed forward pass. This failure mode affects both stages: CandidateSearch may overestimate a node’s causal relevance because the intervention violates implicit invariances, and ValidateByReplacement may underestimate disagreement because both M and M ∘ S become similarly degraded under the same off-manifold perturbation patterns induced indirectly by the replacement. We mitigate but do not eliminate this risk by restricting ℐ to more on-distribution schemes (e.g. conditional caches keyed on upstream summaries), yet such restrictions reduce identifiability and can increase mediator size s.
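
As an illustration of the distinction, the sketch below contrasts unconditional resampling with a conditional cache keyed on a coarse upstream summary. The hooks `get_activation`, `run_with_patch`, and `upstream_key` are hypothetical interfaces into the forward pass and the conditioning scheme, not a committed API.

```python
# Contrast between an off-manifold intervention (unconditional resampling of a
# node's value) and a more on-distribution one (a conditional cache keyed on a
# coarse summary of upstream context). All hooks are hypothetical.
import random
from collections import defaultdict

def build_conditional_cache(inputs, node, get_activation, upstream_key):
    cache = defaultdict(list)
    for x in inputs:
        cache[upstream_key(x)].append(get_activation(x, node))
    return cache

def patch_unconditional(x, node, cache, run_with_patch, rng: random.Random):
    # Off-manifold: donor value drawn irrespective of upstream context.
    donors = [v for values in cache.values() for v in values]
    return run_with_patch(x, node, rng.choice(donors))

def patch_conditional(x, node, cache, run_with_patch, upstream_key,
                      rng: random.Random):
    # Closer to on-distribution: donor drawn from inputs with a matching upstream
    # summary, falling back to the unconditional pool if the key is unseen.
    donors = cache.get(upstream_key(x)) or [v for vs in cache.values() for v in vs]
    return run_with_patch(x, node, rng.choice(donors))
```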

Transformer computations often exhibit redundancy and dynamic rerouting: multiple internal pathways can compute correlated summaries, and later layers can partially reconstruct missing information from residual traces. Under interventions that sever or replace a subset of mediators, the remaining graph G may implement a form of self-repair whereby downstream components compensate, leading to small measured disagreement even when the proposed abstraction A captures only one of several redundant mechanisms. This is not a soundness violation—the certificate correctly bounds replacement disagreement—but it complicates interpretability claims: the mediators selected by greedy criteria need not correspond to a unique or ``true’’ mechanism, and the abstraction may be non-minimal even when ε̂ is small. Conversely, hydra-like redundancy can also hinder CandidateSearch: if each single node has small marginal effect on B because other routes compensate, then submodular proxies may fail to select the correct set within a query budget Q, yielding false negatives (FAIL) despite the existence of a sparse ground-truth mediator set. This limitation is aligned with Theorem~5: searching for minimum mediator sets is computationally hard, and heuristic selection can be brittle when causal contribution is highly non-additive.
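
The following sketch of a budgeted greedy selector makes the failure mode explicit. Here `causal_score` is an assumed estimator of how well replacing a candidate mediator set preserves B (higher is better), and each call consumes one query from the budget Q; under hydra-like redundancy every single-node marginal gain can be near zero, so the loop exits with a false negative even when a sparse ground-truth set exists. This is a minimal sketch, not the exact CandidateSearch procedure.

```python
# Budgeted greedy mediator selection with a marginal-gain proxy (illustrative).
def greedy_mediators(candidates, causal_score, target_score, query_budget):
    selected, queries = [], 0
    best = causal_score(selected)
    queries += 1
    while queries < query_budget and best < target_score:
        gains = []
        for c in candidates:
            if c in selected or queries >= query_budget:
                continue
            gains.append((causal_score(selected + [c]) - best, c))
            queries += 1
        if not gains:
            break
        gain, choice = max(gains, key=lambda t: t[0])
        if gain <= 0:
            # Non-additive (hydra-like) contributions: no single node helps,
            # so the greedy proxy returns a false negative (FAIL).
            return None, queries
        selected.append(choice)
        best += gain
    return (selected if best >= target_score else None), queries
```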

Our framework permits mediator nodes to be internal activations, attention head outputs, or coordinates in a learned feature basis (e.g. SAE features). However, many semantically coherent variables in large models appear as non-linear features or manifolds in activation space rather than axis-aligned directions. If the relevant mediator is intrinsically nonlinear (e.g. a discrete state encoded by a polytope partition, or a feature whose representation rotates with context), then interventions that set linear coordinates may not correspond to any valid internal state of M. In such cases, fitting A to predict mediator values can be ill-posed (the conditional distribution of the mediator given inputs is multi-modal or non-smooth), and replacement via S can create inconsistent mediator configurations that do not occur in natural forward passes. Empirically, this manifests as replacement-induced brittleness: M ∘ S agrees with M on a narrow 𝒟OOD slice but diverges sharply under small changes in prompts, length, or upstream noise. The certificate remains correct for the declared distribution, but the explanation is less portable than the mechanistic language might suggest.

Because the protocol is explicit about 𝒟OOD and ℐ, an adaptive or adversarial model designer could attempt to construct M whose behavior matches a benign abstraction on the tested family while behaving differently elsewhere. This is a variant of distributional speciousness: a model can implement ``if input in test-like region then follow interpretable circuit else follow hidden circuit.’’ White-box access does not automatically prevent such evasions if the hidden circuit is activated outside 𝒟OOD or by intervention patterns not included in ℐ. More subtly, a model might detect the statistical footprint of patching (e.g. unusual activation correlations) and switch modes, causing validation to certify an artifact of the evaluation interface rather than the natural mechanism. These concerns underscore that certificates are only as strong as the governance around the declaration of 𝒟OOD and ℐ, and motivate conservative choices (broad generator unions, multiple intervention types, and negative controls), at the cost of larger m and weaker bounds.

A mechanistic certificate is a statement about replacement-behavior equivalence for a specific predicate B, not about overall task performance, truthfulness, or safety of M. In particular: (i) ε̂ does not bound differences in token-level distributions unless B is chosen to be distribution-sensitive; (ii) it does not imply that A is causally sufficient for all downstream behaviors, only for the selected B; (iii) it does not guarantee minimality of s or uniqueness of π; and (iv) it does not preclude alternative explanations in H that achieve comparable ε̂ by exploiting slack in B. Finally, when hypotheses are adaptively proposed, sound reporting requires either strict holdout discipline or explicit multiple-testing corrections (Theorem~2); otherwise, one can obtain spuriously tight empirical disagreement by selection alone, even with large m.


12. Related Work and Positioning: contrast with causal abstraction alone, circuit discovery alone, and black-box evals; connect to auditing and governance standards.

Our protocol is most naturally situated at the intersection of (i) causal abstraction and causal model explanations, (ii) mechanistic interpretability and circuit discovery in neural networks, and (iii) empirical evaluation and governance-oriented auditing. The defining feature is that we treat an explanation as an executable artifact (a replacement scheme S together with an abstract model A and mapping π) and we report a statistically valid bound on the resulting replacement-behavior disagreement over a declared OOD family 𝒟OOD and intervention interface ℐ.

The causal abstraction literature formalizes when a high-level causal model A is a valid abstraction of a low-level causal model G, typically via a mapping π and an exact or approximate commutation condition between interventions and predictions. This viewpoint is conceptually aligned with our use of π : U → V and the intervention notation do(⋅). However, classical formulations often treat the low-level model as a given SCM with a fixed interventional semantics, and they focus on the existence or verification of abstractions rather than a computable, resource-bounded procedure that returns an auditable artifact with an explicit finite-sample guarantee. Our contribution is to make the abstraction claim operational in the setting where M is a neural computation graph: we instantiate interventions via activation and path patching on internal nodes/features, define a concrete replacement model M ∘ S, and certify (via Theorem~1/2) that the induced behavioral predicate B is preserved on an explicit OOD generator family. In this sense, we treat causal abstraction as a hypothesis whose correctness is adjudicated by replacement-based risk, rather than as a purely structural relation.
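
For readers unfamiliar with the commutation condition alluded to here, one representative form is sketched below in this notation. The maps τ and ω are the state-level and intervention-level counterparts of the variable-level alignment π, and both the exact and approximate versions are assumptions about how the literature's condition would be instantiated in our setting; exact formulations differ across papers.

```latex
% Representative form of the commutation condition (assumed instantiation):
% \tau maps low-level states of M to states of A, and \omega maps low-level
% interventions in \mathcal{I} to interventions on A.
% Exact version: for all inputs x and all i \in \mathcal{I},
%     \tau( M_{do(i)}(x) ) = A_{do(\omega(i))}( \tau(x) ).
% Approximate, distribution-relative relaxation:
\Pr_{x \sim \mathcal{D}_{\mathrm{OOD}}}\!\left[
  \tau\big( M_{\mathrm{do}(i)}(x) \big) \neq A_{\mathrm{do}(\omega(i))}\big( \tau(x) \big)
\right] \;\le\; \varepsilon
\qquad \text{for all } i \in \mathcal{I}.
```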

A second line of work aims to identify computational subgraphs (``circuits’’) responsible for a behavior by combining attribution, ablation, and patching interventions in transformer internals \cite{olah2020zoom,elhage2021mathematical,wang2022interpretability}. Methods such as activation patching, path patching, causal tracing, and causal scrubbing provide practical tools for finding candidate mediators and for producing compact circuit hypotheses \cite{meng2022locating,chan2022causal}. Our CandidateSearch stage directly draws on this toolkit, but we differ in what is ultimately reported as an endpoint. Circuit discovery alone typically yields a ranked list of components or a hand-validated subgraph, often with qualitative or point-estimate metrics on a single benchmark distribution. We instead require an end-to-end replacement validation step that produces a bound on Prx ∼ 𝒟OOD[B(M(x)) ≠ B((M ∘ S)(x))] for held-out samples, with multiple-testing corrections when hypotheses are adaptively searched. Operationally, this shifts emphasis from ``is this node important under this perturbation?’’ to ``does substituting the mediators by the abstraction preserve the behavior predicate under the declared sampling process?’’ The former is a discovery heuristic; the latter is a certificate-bearing claim.

A recurrent practical obstacle in mechanistic work is that meaningful internal variables may not align with individual neurons or heads. Recent work on sparse autoencoders, dictionary learning, and transcoder-style representations aims to provide a more interpretable coordinate system for interventions and analysis. Our framework is agnostic to the particular basis: we allow ℐ to include interventions on SAE features or other learned subspaces, and we permit π to map abstract variables to such coordinates. This should be read as a modular interface: feature learning can improve mediator discoverability and reduce mediator count s, but the certificate semantics depend only on the replacement disagreement definition and the declared 𝒟OOD, not on any interpretability claim about the learned features themselves.
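
A minimal numpy sketch of what ``interventions on SAE features’’ means operationally is given below: encode an activation into the learned basis, overwrite the coordinates designated as mediators, and decode back while preserving the reconstruction error. The parameterization (W_enc, b_enc, W_dec, b_dec) and the ReLU encoder are assumptions for illustration; real SAE variants (tied weights, top-k activations, error terms) change the details and affect how faithful the edit is.

```python
# Intervention in a learned feature basis; the SAE parameterization here
# (ReLU encoder, affine decoder) is an assumption for illustration.
import numpy as np

def sae_encode(a, W_enc, b_enc):
    return np.maximum(0.0, a @ W_enc + b_enc)   # feature activations

def sae_decode(f, W_dec, b_dec):
    return f @ W_dec + b_dec

def intervene_in_feature_basis(a, mediator_idx, target_values,
                               W_enc, b_enc, W_dec, b_dec):
    """Overwrite the mediator coordinates of activation `a` in the SAE basis and
    return the edited activation, keeping the SAE reconstruction error so that
    only the designated coordinates change."""
    f = sae_encode(a, W_enc, b_enc)
    f_edited = f.copy()
    f_edited[mediator_idx] = target_values
    reconstruction_error = a - sae_decode(f, W_dec, b_dec)
    return sae_decode(f_edited, W_dec, b_dec) + reconstruction_error
```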

In contemporary safety practice, models are often evaluated by black-box benchmarks, red teaming, and adversarial prompting. These provide useful evidence about behavior, but they do not, in general, support strong conclusions about mechanism. This gap is formalized by our black-box lower bound (Theorem~4): without internal access and interventions, certifying a mechanism-level property over a rich OOD family can require exponentially many queries in the worst case. Our position is not that black-box evaluation is dispensable, but that it should be understood as answering a different question. Conversely, formal verification and robustness certification (e.g. Lipschitz bounds, randomized smoothing, reachability) typically target output stability under input perturbations and treat the model as an opaque function. Our certificate targets a different object: behavioral preservation under component replacement of a hypothesized mechanism. The statistical structure is correspondingly simpler (Bernoulli risk estimation over a declared sampler), but the access assumptions are stronger (white-box interventions).

From the perspective of model governance, our certificates are intended to function as standardized mechanistic audit artifacts, complementary to model cards, evaluation reports, and assurance cases. The output (A, π, S, ε̂, δ) is meaningful only together with the declared 𝒟OOD, ℐ, and the specification of B; these objects collectively define the scope of the claim and the conditions under which it is testable. This aligns with auditability norms that emphasize traceability of evaluation data, reproducibility of procedures, and explicit statements of coverage. Our protocol can therefore be viewed as an attempt to bring a narrow but precise form of mechanistic evidence into an audit pipeline: one can archive the abstraction code, mediator mapping, replacement implementation, and held-out evaluation seeds, and re-run ValidateByReplacement as models or OOD generators evolve.

Causal abstraction provides the language for high-level variables and interventional semantics; circuit discovery provides practical heuristics for proposing mediators; black-box evaluation provides broad behavioral signal but weak mechanism claims. We combine these into a single object whose central assertion is neither ``the circuit looks right’’ nor ``the benchmark score is high,’’ but rather a bounded-risk statement about replacement-behavior equivalence for a chosen predicate over a chosen OOD family. This positioning is deliberate: it yields claims that are limited in scope yet sufficiently formal to support cumulative, governance-relevant evidence.


13. Conclusion and Outlook: toward standardized mechanistic audit artifacts; open problems (dynamic/agentic settings, adversarial robustness, universality up to abstraction).

We have formalized a certificate-producing protocol for mechanistic explanations of a trained model M viewed as a causal DAG G = (V, E) with an explicit intervention interface ℐ. The central object is operational: an explanation is a triple (A, π, S) that induces a replacement model M ∘ S, and the claim we certify is a bounded-risk statement about behavioral preservation on a declared OOD family 𝒟OOD for a behavior functional B. Soundness is straightforward once we frame validation as estimation of a Bernoulli disagreement risk; completeness holds in compiled settings under identifiability and on-distribution-intervention assumptions; and hardness results justify why we target PAC-style certificates and approximate mediator selection rather than exact minimality.

The output (A, π, S, ε̂, δ) is best understood as an auditable artifact only when bundled with the specification of (B, 𝒟OOD, ℐ) and the evaluation procedure (including the held-out seeds and hypothesis-testing corrections). This suggests a standardization pathway: we can define a ``mechanistic audit report’’ format in which (i) 𝒟OOD is provided as a versioned sampler, (ii) ℐ is provided as an explicit API with documented constraints (which nodes, which subspaces, which patching operators), (iii) B is a deterministic functional with well-defined tie-breaking and termination rules, and (iv) (A, π, S) is provided as executable code (or a compact DSL) that can be re-run against M to reproduce M ∘ S. Under such a format, different laboratories can compare certificates across model versions and across proposed explanations while preserving the scoping discipline that makes the bound meaningful. In particular, we can separate disputes about the scope (choice of B and 𝒟OOD) from disputes about the evidence (the observed ε̂ and its concentration bound), and we can meaningfully archive mechanistic claims in a way that supports later re-evaluation as samplers and intervention tooling improve.
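
As a sketch of what such a report bundle might contain, the dataclass below lists one possible set of fields; the field names are illustrative placeholders rather than a proposed standard.

```python
# Illustrative bundle for a mechanistic audit report; field names are placeholders.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class MechanisticAuditReport:
    model_id: str                 # exact model version / checkpoint hash
    behavior_spec: str            # B: deterministic functional incl. tie-breaking rules
    d_ood_sampler: str            # versioned sampler reference (code + seed protocol)
    intervention_api: str         # I: permitted nodes, subspaces, patching operators
    abstraction_code: str         # (A, pi, S) as executable code or a compact DSL
    mediator_sites: List[str]     # where S attaches inside M
    eps_hat: float                # certified disagreement bound
    delta: float                  # confidence parameter
    m: int                        # held-out sample size
    holdout_seeds: List[int]      # seeds reserved for the final evaluation only
    corrections: str              # multiple-testing / adaptivity handling
    negative_controls: Dict[str, Any] = field(default_factory=dict)
```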

A first open problem is to extend the certificate semantics to sequential, stateful, or tool-using systems where inputs are trajectories and where the sampling process may be history-dependent. Formally, x should be replaced by a trajectory prefix and y by an action or tool call; B may depend on the entire rollout, and 𝒟OOD may represent an environment distribution rather than an i.i.d. prompt distribution. Theorem~1 is then insufficient as stated: disagreements across time are not independent, and the evaluation data may exhibit temporal dependence or adaptive stopping. One direction is to adopt martingale concentration bounds for bounded differences along trajectories, provided we can sample rollouts independently across episodes; another is to define B as a per-episode indicator and treat episodes as i.i.d. samples from an episodic simulator. A second difficulty is interventional semantics: in an agent, mediators at time t influence the future state distribution through actions, so replacement may induce distribution shift that invalidates naive risk estimation unless the evaluation process is explicitly coupled. A satisfactory account will likely require a carefully defined simulator-level coupling between M and M ∘ S, and an intervention family that specifies how internal replacements are applied across time steps.
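
Under the second of these routes (per-episode indicators with i.i.d. episodes from an episodic simulator, together with an explicit coupling of M and M ∘ S along shared environment randomness), the reduction can be written as below. This is a sketch of an assumed instantiation rather than a proved extension of Theorem~1; in particular the coupling assumption is exactly the nontrivial step flagged above.

```latex
% Per-episode reduction (assumed instantiation): B_ep is the indicator of any
% behavioral disagreement along an episode e of length T(e), with M and M \circ S
% rolled out under a shared coupling of environment randomness. With n i.i.d.
% episodes e_1, \dots, e_n, a Hoeffding bound gives, with probability >= 1 - \delta:
B_{\mathrm{ep}}(e) \;=\; \mathbf{1}\!\left[\, \exists\, t \le T(e) :\;
  B\big(M(x^{(e)}_{\le t})\big) \neq B\big((M \circ S)(x^{(e)}_{\le t})\big) \,\right],
\qquad
\Pr_{e}\big[ B_{\mathrm{ep}}(e) = 1 \big] \;\le\;
\frac{1}{n}\sum_{j=1}^{n} B_{\mathrm{ep}}(e_j) \;+\; \sqrt{\frac{\ln(1/\delta)}{2n}} .
```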

Our present guarantees are distributional: they are sound for the declared 𝒟OOD under i.i.d. sampling. This leaves open an adversarial axis: what if 𝒟OOD is adaptively chosen after seeing the explanation, or if an adversary perturbs the sampling process within some ambiguity set around a nominal generator? A principled strengthening would replace a single 𝒟OOD with a family {𝒟θ}θ ∈ Θ or an f-divergence/Wasserstein ball around a reference distribution, and seek a bound on supθ ∈ ΘPrx ∼ 𝒟θ[B(M(x)) ≠ B((M ∘ S)(x))]. Achieving such robustness will likely require either (i) explicit worst-case search over Θ (which reintroduces computational hardness akin to red-teaming), or (ii) structural assumptions about how disagreement varies with θ that permit uniform convergence bounds. A related question concerns adaptive hypothesis search: while Theorem~2 handles finite multiple testing via a union bound, mechanistic discovery pipelines are typically highly adaptive; developing reusable auditing protocols may therefore require reusable holdouts, differential privacy-style generalization control, or pre-registered hypothesis classes with strict evaluation discipline.
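
For the special case of a finite declared family {𝒟θ}θ ∈ Θ, a crude but sound instantiation already exists via a union bound over the per-generator estimates, as sketched below; the continuous ambiguity-set versions remain open. As before, `disagreement` is the per-input 0/1 indicator, and a Hoeffding-style bound is assumed.

```python
# Certified bound on the supremum over a *finite* declared generator family,
# via a union bound over per-generator Hoeffding bounds (assumed form).
import math
from typing import Callable, Dict, Sequence

def robust_certificate(samples_by_theta: Dict[str, Sequence],
                       disagreement: Callable[[object], int],
                       delta: float) -> float:
    k = len(samples_by_theta)
    worst = 0.0
    for theta, samples in samples_by_theta.items():
        m = len(samples)
        emp = sum(disagreement(x) for x in samples) / m
        # Each generator is evaluated at level delta / k.
        bound = emp + math.sqrt(math.log(k / delta) / (2 * m))
        worst = max(worst, bound)
    # With probability >= 1 - delta, sup_theta disagreement risk <= worst.
    return worst
```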

A third open problem is conceptual: to what extent should we expect a recovered abstraction to be unique, and what equivalence notion is appropriate? Even when replacement error is near zero, there may be many distinct mediator sets and abstract models that are behaviorally equivalent on 𝒟OOD. We may therefore want a theory of explanations up to refinement: a partial order where h1 ≼ h2 if h2 refines h1 by adding mediators or splitting abstract variables, together with criteria under which minimal explanations are identifiable. Here our NP-hardness result is a warning: even defining the ``smallest’’ explanation is algorithmically intractable in general, so a useful notion of universality may need to be approximate (e.g. minimal within a restricted hypothesis class H) or structural (e.g. uniqueness under a faithfulness-like assumption for interventional effects on B). Clarifying these notions would also inform how we compare certificates across groups: two certificates with different (A, π, S) may nonetheless constitute the same mechanistic claim at the appropriate abstraction level.

We view the protocol as a step toward mechanistic evidence that is both operational and statistically scoped: explanations are executable replacements, and guarantees are explicit about what they cover. The remaining work is to broaden the scope (sequential agents, adversarial distribution shift) without losing the discipline that makes certificates auditable, and to develop theory and tooling that support comparison and compositional reuse of mechanistic audit artifacts across models and time.