Tool-using language models couple a high-dimensional internal computation to an external interface consisting of (i) discrete tool-call decisions, (ii) structured arguments, (iii) tool outputs that may be stochastic or adversarially perturbed, and (iv) persistent memory state. Even when the policy πθ is implemented by a standard transformer, the induced interaction trajectory τ = (o1 : T, u1 : T, y1 : T, m1 : T, a1 : T) is no longer governed solely by the model’s token distribution. Rather, the joint law of (Ut, At) depends on a closed-loop system in which tool behavior and memory updates feed back into later observations. This setting complicates mechanistic analysis: the objects we ultimately care to preserve or monitor are distributions over tool calls and actions along trajectories, not merely next-token probabilities under a fixed prompt.
A common mechanistic strategy in text-only settings is to search for sparse ``circuits'' defined in terms of individual neurons, attention heads, or short paths between them. For tool agents, such a strategy is unreliable for two distinct reasons. First, tool use introduces discontinuities in the input distribution: the same underlying task intent can be realized by heterogeneous tool outputs, formatting differences, and memory states. A circuit identified by correlational methods on one slice of trajectories may fail under slight changes in tool formatting, or under on-distribution perturbations that preserve semantics while altering surface form. Second, the parameters \(\theta\) of deployed agents are rarely static. Fine-tuning procedures (SFT, RLHF, DPO) often preserve tool-use semantics while redistributing computation across internal components. In particular, any identification of ``the decisive attention head'' for calling a tool is unstable under even modest downstream remapping of features into logits. Thus, neuron/head-level attributions do not provide a stable interface for either verification (ensuring behavior invariances) or monitoring (detecting shifts in tool-use intent).
We therefore separate two questions that are conflated by naive circuit discovery: (i) which internal degrees of freedom are necessary for the externally visible interaction distribution, and (ii) which specific low-level implementation (neurons, heads) realizes those degrees of freedom in a particular parameterization. Our approach is to define and search for a compact set of mediator variables that function as an interface between the agent’s internal computation and the tool/action decisions. Concretely, we fix write locations in the computation graph producing internal activations ht, and we train a sparse autoencoder or transcoder f to obtain coordinates zt = f(ht) ∈ ℝd. The coordinates zt, i are candidate mediators: they are designed to be sparse, reusable across contexts, and interpretable enough to support targeted interventions. We emphasize that this is not an interpretability claim per se; it is a representational choice that gives us a discrete index set [d] over which minimality statements are well-posed.
The central requirement is a distributional invariance property stated at the level of interaction behavior. We are given a family 𝒟 of trajectory distributions (including in-domain and specified OOD slices) and an admissible intervention family ℐ that preserves on-distribution statistics (e.g., resample ablations or within-distribution patching of activations, and type-correct or semantically equivalent perturbations of tool outputs). For a candidate index set S ⊆ [d], we ask whether holding zt, S fixed renders the joint distribution of (Ut, At) approximately invariant to admissible interventions on the remaining coordinates. Formally, we measure invariance by a distance Δ between distributions (defaulting to total variation), and we accept S if the induced deviation is at most ε across times t, distributions D ∈ 𝒟, and interventions I ∈ ℐ. This choice forces the basis to be a behavioral object: it is defined by what can change in tool calls and actions, not by what is salient under a particular interpretability visualization.
A definition at this level supports two desiderata that neuron/head circuits typically fail to satisfy. The first is robustness to surface variation: because interventions are constrained to preserve activation marginals (and tool outputs to remain type-correct or semantically equivalent), the invariance criterion discounts brittle circuits that only function under a narrow formatting regime. The second is stability under fine-tuning: if fine-tuning updates are causally local in the sense that they do not alter the structural generation of the mediator variables zt, S from upstream observations, then the same S should remain sufficient even when downstream maps from zt, S to tool/action logits are modified. In other words, we aim to identify a mediator interface that can be tracked across model revisions without re-deriving an entirely new circuit decomposition.
We make five contributions, each aligned with a distinct failure mode
of existing mechanistic workflows for tool agents.
First, we formalize the problem: given πθ, 𝒟, ℐ, and
tolerances (ε, δ),
find a near-minimal subset S ⊆ [d] such that masking
or resampling all non-basis coordinates (while conditioning on zt, S)
preserves the joint law of tool calls and actions within ε, with confidence 1 − δ. This reframes mechanistic
identification as a constrained minimality problem over mediator
coordinates, avoiding reliance on unstable low-level units.
Second, we give a polynomial-time discovery algorithm based on constructing a finite coverage instance from causal effect estimates. For each context/test cj and feature coordinate i, we estimate the distributional effect of admissible interventions on zt, i on the resulting (Ut, At) via Δ. Thresholding these effects yields a binary matrix encoding which features are necessary (up to tolerance) for which contexts. The resulting combinatorial problem is a Set Cover instance, for which the greedy algorithm achieves a (1 + ln m)-approximation in the number m of contexts. This yields an explicit, auditable approximation guarantee tied to the number of causal tests.
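As an illustration, the greedy stage described above can be sketched in a few lines. The coverage matrix `C` is assumed to come from the thresholded effect estimates; the function name and layout are illustrative, not taken from our implementation:

```python
import numpy as np

def greedy_set_cover(C):
    """C: binary (m x d) matrix; C[j, i] = 1 if intervening on feature i
    measurably shifts (U_t, A_t) in context j. Greedily select features
    until every coverable context is covered; achieves the classical
    (1 + ln m)-approximation on the test set."""
    m, d = C.shape
    uncovered = {j for j in range(m) if C[j].any()}
    S = []
    while uncovered:
        # pick the feature covering the most still-uncovered contexts
        gains = [sum(1 for j in uncovered if C[j, i]) for i in range(d)]
        i_star = int(np.argmax(gains))
        if gains[i_star] == 0:
            break
        S.append(i_star)
        uncovered -= {j for j in uncovered if C[j, i_star]}
    return S
```

On a small instance such as `C = [[1,0,1],[0,1,1],[1,1,0]]`, the greedy loop returns two features covering all three contexts.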
Third, we provide finite-sample guarantees for the soundness of the constructed coverage instance. Because effect estimation is performed by interventional reruns under admissibility constraints, sampling error is controlled by standard concentration inequalities. We show that with q interventional queries per context-feature pair scaling as Õ(γ−2log (md/δ)), the thresholded coverage matrix is correct for all entries separated by margin γ, implying that the greedy solution is a valid interaction basis on the evaluated test set.
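For concreteness, a Hoeffding-plus-union-bound version of this sample-size calculation might look as follows. The constant 2/γ² below is one standard choice valid because TV effects are bounded in [0, 1]; it is an illustrative instantiation of the Õ(γ⁻² log(md/δ)) rate, not the paper's exact bound:

```python
import math

def queries_per_pair(gamma, m, d, delta):
    """Hoeffding + union bound over all m*d (context, feature) entries:
    q >= (2 / gamma^2) * ln(2*m*d / delta) interventional queries ensure
    every effect estimate lands within gamma/2 of its mean, so entries
    separated by margin gamma are thresholded correctly w.p. >= 1-delta."""
    return math.ceil((2.0 / gamma**2) * math.log(2 * m * d / delta))
```

For example, with γ = 0.1, m = 100 contexts, d = 1000 features, and δ = 0.05, this yields roughly 3000 queries per context-feature pair.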
Fourth, we prove a stability result under a class of fine-tuning updates. Modeling the agent as an SCM over mediator variables Zt and residual internal state Rt, we show that if the structural equations generating Zt, S* are unchanged by the update θ ↦ θ′ (only downstream readouts into logits change), then S* remains an interaction basis under the same admissible intervention family. Consequently, when the coverage instance remains margin-separated, the discovery procedure recovers the same basis (up to feature relabeling within an SAE equivalence class), formalizing the intuition that a mediator interface is more stable than a neuron-level circuit.
Fifth, we establish matching hardness and propose a validation protocol. We show that the minimum interaction basis problem on a finite test set is NP-hard via a reduction from Set Cover, and that logarithmic-factor approximation is essentially optimal unless P = NP. Since worst-case instances cannot be solved optimally in polynomial time, we emphasize validation: after selecting S, we perform causal scrubbing/component replacement tests in which only zt, S is preserved and all other degrees of freedom are resampled admissibly; acceptance requires that the observed (Ut, At) distributions match within ε on held-out contexts and OOD slices. This closes the loop between discovery (a surrogate combinatorial objective) and the operational interaction-level invariance criterion.
The net effect is a framework that treats tool-use behavior as the primary object, enforces on-distribution intervention semantics, and yields a principled mediator set whose size is approximately minimal relative to the tested contexts. In the next section we situate this approach relative to work on circuits and motifs, sparse feature dictionaries, patching-based causal methods, and prior analyses of fine-tuning and tool use.
A large body of mechanistic interpretability work studies internal structure in terms of neurons, attention heads, and small computation graphs (``circuits'') that explain specific behaviors in language models and vision models. Typical objectives include attributing a prediction to a sparse set of components, identifying recurring motifs (e.g., induction heads), and isolating algorithmic subroutines by analyzing attention patterns, MLP activations, or residual-stream directions. These methods are most mature in text-only settings where the input distribution is comparatively homogeneous and the object of interest is a next-token distribution under a fixed prompt. In tool-using agents, the relevant behavioral object is a distribution over interaction decisions—tool calls, arguments, and direct actions—along closed-loop trajectories, and the same semantic state may be encoded through heterogeneous tool outputs and memory states. This complicates naive circuit transfer: a circuit found by attribution on one surface realization may not persist when tool outputs change format while preserving type or semantics. Our formulation differs by defining a mediator interface at the level of interaction distributions and enforcing invariance under admissible perturbations, rather than seeking an invariant set of low-level units.
Sparse autoencoders (SAEs) and related dictionary learning methods have been used to recover sparse feature coordinates from dense activations in transformers, motivated by superposition and the desire for an indexable feature basis. Subsequent work explores transcoders and other structured encoders that align features across layers or constrain reconstructions to improve interpretability and causal manipulability. A key practical advantage of SAE coordinates is that they provide a discrete set of candidate mediators over which minimality statements and selection procedures are well-posed, while remaining compatible with white-box interventions through decoding back into activation space. Our use of SAEs is aligned with this representational program: we do not assume that any given coordinate is semantically ``pure,'' but rather treat coordinates as a convenient basis in which sparsity can be exploited for discovery and for near-distribution interventions.
Causal analysis by activation patching (also called causal tracing) replaces internal activations in a target run with activations from a source run to test necessity/sufficiency relationships for outputs. Extensions include path patching and head/MLP ablations to isolate information flow through particular edges in the computation graph. More recently, subspace patching and representation-level interventions emphasize changing only a selected subspace while maintaining the remaining activation statistics to avoid off-distribution artifacts. Our admissibility requirement is closest in spirit to these ``matched distribution'' interventions: when estimating effects of candidate mediators, we restrict to interventions that preserve empirical activation marginals/conditionals (e.g., resample ablations) so that measured changes in tool/action distributions reflect causal dependence rather than distribution shift. The distinguishing feature of our setting is that interventions must also accommodate tool-output and memory perturbations, since these are part of the closed-loop causal graph defining the trajectory distribution.
The goal of identifying a compact set of mediators is closely related to causal abstraction and causal representation learning, where one seeks a coarse-grained variable set whose interventions correspond to interventions on the underlying system. Causal scrubbing and component replacement tests operationalize this idea by replacing internal components with sampled surrogates while preserving designated mediators, and then checking whether behavior is preserved. Our validation protocol follows this tradition: after selecting a mediator set, we test whether resampling non-selected degrees of freedom (subject to admissibility) preserves the interaction distribution within tolerance. Conceptually, our ``interaction basis'' can be viewed as a minimal mediator set for a particular coarse-graining target—namely the joint law of tool calls and actions—and our acceptance criterion matches the causal-abstraction objective of preserving interventional behavior rather than correlational summaries.
Tool-using language models and agents (e.g., function calling, retrieval-augmented generation, ReAct-style agents) introduce an explicit tool channel and persistent state, and have been studied both empirically and architecturally. Tool interfaces create structured discontinuities (JSON schemas, API errors, pagination, partial results) and additional failure modes (hallucinated tool outputs, adversarially formatted returns). Parallel work on robustness and reliability in agentic settings emphasizes self-correction, reflection, critique, and error recovery when tools fail or return unexpected values. Systems sometimes incorporate redundant ``hydra''-style designs—multiple heads, multiple proposals, or verifier modules—to reduce single-point failures and to enable self-repair via re-querying and consistency checks. These lines of work motivate our focus on distributional invariance under constrained tool-output perturbations: if a policy is intended to be robust to benign formatting variation or semantically equivalent tool returns, then the mediator interface governing tool-use decisions should be stable under corresponding admissible interventions.
A recurring empirical observation is that fine-tuning (SFT/RLHF/DPO, as well as parameter-efficient adapters) can preserve externally measured capabilities while substantially changing internal representations and component-level attributions. This instability complicates any mechanistic claim phrased at the granularity of specific neurons or attention heads, since equivalent computations may be re-encoded in different subspaces after training updates. Work on probing, CCA-based alignment, and feature matching across checkpoints aims to track representational changes and identify invariants under training. In the tool-use domain, fine-tuning often changes the agent’s preference over tool usage (e.g., calling a tool more or less often) while preserving task success, further entangling mechanism with policy-level incentives. Our stability theorem formalizes a particular sufficient condition under which a mediator set remains valid across such updates: if the structural generation of selected mediators is unchanged and only downstream readouts are modified, then the interaction basis remains stable (up to feature relabeling within an equivalence class). This provides a target for mechanistic workflows that aim to remain meaningful across model revisions.
Finally, our reduction to Set Cover situates interaction-basis discovery among a broader class of problems that seek minimal sets of explanatory variables subject to behavioral constraints, including minimal test sets, hitting sets, and sparse causal explanations. Greedy set cover and its logarithmic approximation guarantees are classical, but their utility in mechanistic settings depends on constructing a sound coverage instance from causal tests. Our contribution is to specify such a construction using admissible interventional effect estimates and to connect it to a validation procedure that checks the original invariance criterion on held-out contexts, thereby decoupling the combinatorial surrogate from the operational behavioral specification.
We fix a tool-using agent policy πθ implemented as a transformer with parameters θ and a specified tool API. Our goal in later sections is to state an invariance criterion for a small set of internal mediator variables; here we set up the causal and probabilistic objects needed to make such statements precise.
We model an interaction as a finite-horizon trajectory τ of length T consisting of observations, tool
calls, tool outputs, internal memory/state, and actions. Concretely we
write
τ = (o1 : T, u1 : T, y1 : T, m1 : T, a1 : T),
where ot
is the agent’s external observation at time t (including the user prompt and any
environment feedback), ut is the tool call at time t (or a
distinguished ``no tool'' symbol), yt is the tool output returned by
the tool API (or a null output if ut is ``no tool''), mt denotes any
persistent agent-side state (e.g. scratchpad, retrieved documents cache,
or hidden memory variables exposed for analysis), and at is the
environment-facing action at time t (which may include emitting tokens
and/or committing to a final decision). We treat πθ as inducing
conditional distributions of the form
(ut, at, mt) ∼ πθ( ⋅ ∣o1 : t, y1 : t − 1, m1 : t − 1 ),
with the understanding that, internally, this distribution is realized
by a deterministic computation graph with injected randomness
(e.g. sampling from logits) and with intermediate activations.
To make counterfactual reasoning explicit, we place these variables in a structural causal model (SCM) whose exogenous noise variables include (i) environment randomness, (ii) tool randomness, and (iii) the agent’s sampling noise. The tool channel is represented by a structural equation yt := Tool(ut, ηttool), where ηttool captures tool nondeterminism and where Tool includes schema validation and error modes. The agent internal computation induces additional endogenous variables, including selected activation vectors ht at write locations we choose to instrument (e.g. residual stream vectors or pre-softmax tool-call logits).
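As a toy illustration of the structural equation yt := Tool(ut, ηttool), the sketch below includes schema validation and explicit error modes; the schema fields, error strings, and failure rate are illustrative assumptions, not part of the formal model:

```python
import json
import random

def tool(u, rng):
    """Structural tool channel y = Tool(u, eta). u is a tool-call dict
    or None (the distinguished 'no tool' symbol); rng supplies the
    exogenous tool noise eta."""
    if u is None:
        return None  # null output when no tool is called
    if "name" not in u or "args" not in u:
        return {"error": "schema_violation"}  # validation error mode
    if rng.random() < 0.05:  # eta: tool nondeterminism / transient failure
        return {"error": "timeout"}
    return {"result": "ok:" + json.dumps(u["args"], sort_keys=True)}
```

Canonicalizing arguments (here via `json.dumps(..., sort_keys=True)`) is what later lets us treat tool calls as discrete outcomes when measuring TV distance.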
A context c specifies an initial condition and/or a partial trajectory template: it can fix the initial user prompt, constrain tool availability, specify a distribution over environment states, or impose other conditions on (o1, m0). Given a context c, the environment dynamics, the tool API behavior, and πθ jointly induce a distribution over trajectories. We write Dc for this induced trajectory distribution, and we consider a family 𝒟 of such distributions, typically including an in-domain component together with specified out-of-distribution slices (e.g. alternative tool-output formats that remain type-correct). When we later quantify invariance, we will require statements to hold uniformly over D ∈ 𝒟 (or over a prescribed finite subfamily generated by test contexts).
We allow interventions on (a) internal activations, (b) memory/state variables, and (c) tool outputs. However, since off-distribution perturbations can create artifacts unrelated to the mechanism of interest, we restrict to an intervention family ℐ designed to preserve on-distribution statistics. Formally, an intervention I ∈ ℐ is a do-operator acting on a designated subset of variables in the SCM (e.g. coordinates of ht or of a derived feature vector zt), together with a rule for sampling replacement values from a matched empirical distribution. Typical admissible internal interventions include resample ablations (replacing selected activations or feature coordinates with draws from cached on-distribution rollouts) and within-distribution patching (substituting values taken from matched runs). For tool and memory interventions, admissibility means that modified outputs remain type-correct and lie in a designated equivalence class (e.g. semantically equivalent strings, alternative formatting, or randomized but schema-valid payloads), so that we probe robustness to benign channel variation rather than to arbitrary corruption. We emphasize that admissibility is not merely a convenience: it is part of the problem specification and will be monitored via divergence checks on intervened variables.
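A minimal sketch of a resample ablation under these admissibility constraints, assuming a cache of feature vectors collected from rollouts under the same context family (the cache layout and function name are illustrative):

```python
import numpy as np

def resample_ablate(z, i, cache, rng):
    """Replace coordinate i of the feature vector z with a draw from
    cached on-distribution values, so the intervened vector stays close
    to the empirical activation manifold.

    z:     current feature vector, shape (d,)
    cache: array of feature vectors from matched rollouts, shape (n, d)
    """
    z_new = z.copy()
    z_new[i] = cache[rng.integers(len(cache)), i]  # matched-marginal draw
    return z_new
```

A richer matching rule (conditioning on other coordinates or on a history summary) would replace the uniform row draw; the marginal version shown here is the simplest admissible scheme.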
Given a chosen activation stream ht ∈ ℝdh
(possibly concatenating multiple write locations), we train a sparse
autoencoder (or a transcoder) to obtain an encoder
f : ℝdh → ℝd, zt = f(ht),
where zt
is sparse in the sense of having few nonzero (or large-magnitude)
coordinates. We treat the coordinates of zt as a
discrete, indexable set of candidate mediator variables. When we
intervene on zt we implement
the intervention by decoding back into activation space via a decoder
g and patching the resulting
reconstruction into the model, or equivalently by applying an
intervention directly in ht constrained
to the subspace corresponding to the chosen coordinates. This choice
gives us two advantages used later: (i) minimality questions over
mediators become well-posed as subset selection problems over [d], and (ii) interventions can be
phrased as coordinate-wise operations (e.g. resampling zt, i)
while remaining close to the empirical activation manifold when combined
with admissibility constraints.
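The encode/intervene/decode pattern can be sketched as follows. The linear SAE and the additive patch h′ = h + g(z′) − g(z), which alters only the decoded subspace of the chosen coordinates, are one common implementation choice rather than a prescribed method; weights here are stand-ins for a trained encoder/decoder:

```python
import numpy as np

class SparseCoder:
    """Toy linear SAE: f(h) = relu(W_e h + b), g(z) = W_d z."""
    def __init__(self, W_e, b, W_d):
        self.W_e, self.b, self.W_d = W_e, b, W_d

    def encode(self, h):
        return np.maximum(self.W_e @ h + self.b, 0.0)

    def decode(self, z):
        return self.W_d @ z

def patch_coordinates(coder, h, new_vals):
    """Intervene on selected z-coordinates and map back to activation
    space: h' = h + g(z') - g(z), so only the spanned subspace changes."""
    z = coder.encode(h)
    z_new = z.copy()
    for i, v in new_vals.items():
        z_new[i] = v
    return h + coder.decode(z_new) - coder.decode(z)
```

Because the patch is additive in the decoder image, the difference h′ − h lies exactly in the decoder subspace of the intervened coordinates, which is what makes subset selection over [d] well-posed.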
The behavioral object we care about is the distribution over
interaction decisions at each time t, namely the joint distribution of
the tool call and the action (Ut, At).
Depending on the agent design, At may include
both text emission and discrete environment actions, while Ut includes a
tool name and arguments or the null call. Since these variables are
typically discrete (e.g. JSON arguments can be canonicalized), we adopt
total variation (TV) distance as our default discrepancy measure:
$$
\Delta_{\mathrm{TV}}(P,Q)=\frac12\sum_x |P(x)-Q(x)|.
$$
When a KL-based metric is convenient (e.g. for logit-level analyses), we
write $\Delta_{\mathrm{KL}}(P\|Q)=\sum_x
P(x)\log\frac{P(x)}{Q(x)}$ and use Pinsker’s inequality $\Delta_{\mathrm{TV}}(P,Q)\le \sqrt{\tfrac12
\Delta_{\mathrm{KL}}(P\|Q)}$ to translate KL control into TV
control when needed. In later definitions, we will apply Δ to distributions over (Ut, At)
under different intervention regimes; when trajectories have variable
length, we can treat termination as an additional absorbing action so
that all distributions live on a common discrete space.
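For discrete laws represented as probability vectors on a common canonicalized support, the default discrepancy and the Pinsker translation can be computed directly (a small sketch; the KL term assumes P is absolutely continuous with respect to Q):

```python
import numpy as np

def tv(P, Q):
    """Total variation distance between probability vectors."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return 0.5 * np.abs(P - Q).sum()

def kl(P, Q):
    """KL divergence KL(P || Q), skipping zero-probability outcomes of P."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    m = P > 0
    return float((P[m] * np.log(P[m] / Q[m])).sum())
```

Pinsker's inequality, TV(P, Q) ≤ sqrt(KL(P‖Q)/2), then lets logit-level KL control translate into TV control on (Ut, At).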
These preliminaries give us a common SCM and measurement interface: we can sample τ ∼ D ∈ 𝒟, extract ht, map to sparse coordinates zt, apply admissible interventions from ℐ either on the tool/memory channel or on internal coordinates, and quantify behavioral change via Δ on the induced laws of (Ut, At). The next section uses exactly these objects to define the interaction-basis invariance criterion and the minimality objective.
We now formalize the mediator notion that will be optimized in later sections. Fix a choice of write locations and an encoder f (SAE/transcoder) producing coordinates zt = f(ht) ∈ ℝd for each time t. We treat each coordinate zt, i as a candidate internal mediator variable, indexed by i ∈ [d], and we consider subsets S ⊆ [d] as candidate interaction bases.
For each context c (hence
each induced D = Dc ∈ 𝒟),
we refine the trajectory SCM by inserting mediator variables (Zt)t = 1T
defined by the deterministic equation
Zt := f(Ht),
where Ht
denotes the selected activation stream (possibly a concatenation of
multiple write locations). The remaining internal degrees of
freedom—including the part of Ht not
represented by the chosen SAE coordinates, as well as other
uninstrumented activations—are bundled into auxiliary endogenous
variables Rt.
Conceptually, (Zt, Rt)
is a reparameterization of the agent internal state at time t relative to our measurement
interface. The action/tool heads then induce conditional laws
(Ut, At, Mt) ∼ πθ( ⋅ ∣o1 : t, y1 : t − 1, m1 : t − 1; Z1 : t, R1 : t ),
where the semicolon indicates that (Z1 : t, R1 : t)
are internal to the computation graph rather than externally observed.
This refinement is only used to define interventions and invariance
statements; we do not assume that (Zt, Rt)
are uniquely identifiable from behavior.
Let S ⊆ [d]. The
defining operation for our invariance criterion is an admissible
``masking’’ intervention on the complement coordinates. Informally, we
hold the selected coordinates Zt, S
fixed while resampling Zt, [d] \ S
from a matched empirical distribution so that the intervened activations
remain typical. Formally, we assume that ℐ contains (or induces) an intervention
scheme
mask([d] \ S) ∈ ℐ
which, at each time t,
replaces Zt, [d] \ S
by a draw from a reference distribution that is (i) estimated from
cached rollouts under the same context family, and (ii) matched on an
admissible conditioning set that includes Zt, S
and may include coarse summaries of the past (o1 : t, y1 : t − 1, m1 : t − 1).
We write this schematically as
Zt, [d] \ S′ ∼ P̂(Zt, [d] \ S ∣ Zt, S, match(o1 : t, y1 : t − 1, m1 : t − 1)), Zt, S′ := Zt, S,
with the understanding that the admissibility invariant monitors that
the induced intervened distribution over internal states remains near
the empirical activation distribution (e.g. by divergence checks). The
exact matching rule is part of the specification of ℐ; our definitions below quantify over all
admissible choices.
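A crude sketch of mask([d] \ S) over cached rollouts; the nearest-neighbour match on the kept coordinates stands in for whatever matching rule the specification of ℐ actually prescribes, and is an illustrative assumption:

```python
import numpy as np

def mask_complement(z, S, cache, rng):
    """mask([d] \\ S): hold z_S fixed and resample the complement jointly
    from cached on-distribution feature vectors, matched (crudely) by
    nearest neighbour on the kept coordinates.

    z:     current feature vector, shape (d,)
    S:     iterable of kept coordinate indices
    cache: cached rollout features, shape (n, d)
    """
    S = np.asarray(sorted(S))
    comp = np.setdiff1d(np.arange(z.shape[0]), S)
    # nearest-neighbour match on Z_{t,S} (an illustrative matching rule)
    j = int(np.argmin(np.linalg.norm(cache[:, S] - z[S], axis=1)))
    z_new = z.copy()
    z_new[comp] = cache[j, comp]
    return z_new
```

Resampling the complement jointly from a single cached row (rather than coordinate-wise) keeps the intervened vector on the empirical manifold, which is exactly what the divergence checks are meant to monitor.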
Given D ∈ 𝒟 and I ∈ ℐ, let PD, I
denote the induced law of the trajectory (or of any measurable function
of it) when we run the agent/environment/tool SCM under D and apply I at its designated targets. In
particular, PD, I(Ut, At)
denotes the marginal law of (Ut, At)
at time t under D and I. When we additionally apply
masking to all non-selected coordinates, we write
PD, I, mask([d] \ S)(Ut, At)
for the corresponding marginal law in the SCM where the intervention
family I is composed with the
masking intervention at the relevant internal sites and times (with
Zt, S
held fixed by construction of mask([d] \ S)).
We can now state the central invariance requirement: a set of mediator coordinates S is sufficient if, once Zt, S is held fixed, adversarially varying (within ℐ) the remaining internal coordinates does not materially change the agent’s tool-use and action distributions.
\begin{definition}[ε-interaction basis]
A set S ⊆ [d] is an ε-interaction basis if
$$
\max_{t\in[T]}\ \sup_{D\in\mathcal{D},\, I\in\mathcal{I}} \Delta\!\big(P_{D,I}(U_t,A_t),\; P_{D,I,\mathrm{mask}([d]\setminus S)}(U_t,A_t)\big)\ \le\ \varepsilon.
$$
\end{definition}
This definition is an approximate conditional invariance: it asserts that, uniformly over admissible perturbations of non-basis coordinates, the distribution of interaction decisions is stable once the basis coordinates are preserved. The maxt ∈ [T] choice is a convenient worst-case aggregation; one may replace it by an average over t or by a trajectory-level discrepancy without changing the optimization structure used later, provided the discrepancy remains bounded.
Among all sets satisfying this definition, we prefer those with smallest cardinality. This yields an intrinsic combinatorial problem.
We emphasize that minimality is defined relative to the chosen measurement interface (write locations and encoder f) and the admissible intervention family ℐ. Enlarging ℐ typically increases the required basis size, while enriching the representation (larger d or additional write locations) can decrease it by making mediators more granular.
SAE coordinates are not canonical: even when f is fixed by training, multiple coordinate systems may represent the same effective mediator subspace, and retraining f can permute or rescale features. Accordingly, we define the interaction basis only up to a natural equivalence relation induced by transformations that preserve the represented internal information.
Let g : ℝd → ℝdh
denote the decoder (or the corresponding reconstruction map used for
patching). For S ⊆ [d], define the decoder
subspace
𝒲(S) := span{ g(ei) : i ∈ S } ⊆ ℝdh,
where ei
is the ith standard basis
vector. We say S ∼ S′ if 𝒲(S) = 𝒲(S′)
(or, in a tolerant version, if the principal angle distance between the
subspaces is below a prescribed threshold). This captures the fact that
if two coordinate sets span the same patchable subspace in activation
space, then masking the complement relative to either set yields the
same family of admissible counterfactuals up to negligible
reconstruction error. In settings where the decoder columns are
approximately orthogonal and features are identifiable up to permutation
and sign, ∼ reduces to the expected
relabeling equivalence.
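The tolerant version of ∼ can be checked numerically via principal angles between the decoder subspaces (a sketch using orthonormalized column spans; thresholds are left to the caller):

```python
import numpy as np

def principal_angle_distance(A, B):
    """A, B: (d_h x k) matrices whose columns span W(S) and W(S').
    Returns the largest principal angle (radians) between the two
    column spaces, computed from singular values of Q_A^T Q_B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    s = np.clip(s, -1.0, 1.0)  # guard against round-off outside [-1, 1]
    return float(np.arccos(s.min()))
```

Identical subspaces give distance 0; orthogonal ones give π/2, so S ∼ S′ (tolerantly) when the returned angle falls below the prescribed threshold.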
Under this convention, the ``minimum interaction basis'' is properly a minimum equivalence class [S]∼ rather than a unique subset of indices. Our algorithmic outputs later will return a concrete representative Ŝ, but all guarantees should be read as invariant under ∼ (and, where relevant, under small perturbations of 𝒲(S) induced by retraining the SAE within an equivalence class).
The definition above specifies a semantic invariance property of a coordinate set S, but it does not by itself determine how we should search over S or how we should certify that a candidate set generalizes beyond a finite battery of tests. We therefore separate (i) the underlying combinatorial problems (optimization and decision versions) from (ii) the empirical acceptance criteria we will impose when only finitely many trajectories and interventions can be executed.
Fix (πθ, 𝒟, ℐ),
horizon T, and tolerance ε. The intrinsic optimization
problem is to find the smallest coordinate subset that satisfies the
invariance constraint:
$$
\min_{S\subseteq[d]}\ |S| \quad\text{s.t.}\quad \max_{t\in[T]}\ \sup_{D\in\mathcal{D},\, I\in\mathcal{I}} \Delta\!\big(P_{D,I}(U_t,A_t),\; P_{D,I,\mathrm{mask}([d]\setminus S)}(U_t,A_t)\big)\ \le\ \varepsilon. \tag{IB-OPT}
$$
The corresponding decision version asks whether some basis of
cardinality at most K exists:
$$
\exists\, S\subseteq[d]:\ |S|\le K \ \text{ and } \ \max_{t\in[T]}\ \sup_{D\in\mathcal{D},\, I\in\mathcal{I}} \Delta\!\big(P_{D,I}(U_t,A_t),\; P_{D,I,\mathrm{mask}([d]\setminus S)}(U_t,A_t)\big)\ \le\ \varepsilon. \tag{IB-DEC}
$$
The distinction is largely conceptual: (IB-DEC) is the natural target
for complexity-theoretic reductions, while (IB-OPT) is the operational
objective for discovery. In either case, the quantification over D ∈ 𝒟 and I ∈ ℐ should be read as ``uniform
over the specified evaluation family and intervention semantics,’’ not
as a claim of distribution-free robustness.
In practice, 𝒟 is instantiated by a
finite collection of contexts {c1, …, cm}
(prompts, initial states, or trajectory templates), and ℐ is instantiated by a finite menu of
admissible intervention schemes (e.g. resample ablations with a
specified matching rule, or tool-output perturbations within a
type-correct equivalence class). Accordingly, we define a
test-restricted constraint by replacing the suprema in (IB-OPT) with maxima over
these finite families. Writing Dj := Dcj
and letting ℐtest ⊆ ℐ denote
the chosen finite intervention menu, we obtain the empirical feasibility
predicate
$$
\Phi_\varepsilon(S):\quad \max_{j\in[m]}\ \max_{I\in\mathcal{I}_{\mathrm{test}}}\ \max_{t\in[T]} \Delta\!\big(P_{D_j,I}(U_t,A_t),\; P_{D_j,I,\mathrm{mask}([d]\setminus S)}(U_t,A_t)\big)\ \le\ \varepsilon.
$$
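Given empirical distributions of (Ut, At) per context, intervention, and time step (as probability vectors on a common canonicalized support), the empirical feasibility predicate reduces to a max over TV gaps. A sketch; the nested-list layout of the inputs is an assumption of this illustration:

```python
import numpy as np

def tv(P, Q):
    """Total variation distance between probability vectors."""
    return 0.5 * np.abs(np.asarray(P, float) - np.asarray(Q, float)).sum()

def phi_feasible(baseline, masked, eps):
    """baseline[j][i][t] / masked[j][i][t]: empirical laws of (U_t, A_t)
    for context j, intervention i, time t, with and without masking.
    Returns True iff the worst-case TV gap is at most eps."""
    worst = max(
        tv(p, q)
        for bj, mj in zip(baseline, masked)
        for bi, mi in zip(bj, mj)
        for p, q in zip(bi, mi)
    )
    return bool(worst <= eps)
```

The same max-over-tests structure is what the greedy coverage surrogate approximates coordinate-by-coordinate.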
Our algorithm will explicitly optimize a surrogate derived from this empirical predicate. Later
generalization statements will therefore be relative to (i) held-out
contexts drawn from the same construction process that produced {cj} and (ii)
specified out-of-distribution slices included in 𝒟 by design.
Even on a finite test family, Φε(S) is not directly observable; we can only estimate it from finitely many rollouts and interventional reruns. We thus relax the goal from exact optimality to near-minimal, empirically validated sets: we seek an output Ŝ that (a) passes an empirical validation procedure with high probability and (b) has size within a provable approximation factor of the smallest set that would pass the same tests under the same admissible intervention semantics. This is the appropriate notion of ``optimality'' for a discovery procedure that is inherently limited by sampling noise and by the coarseness of the admissible intervention family.
A central methodological point is that we do not accept S solely because a collection of per-feature causal effect estimates suggests that S ``covers’’ all contexts. Coverage-style constructions are lossy: they summarize a complicated invariance condition by a finite table of effect magnitudes. We therefore impose an explicit validation criterion that re-checks the invariance statement more directly using the strongest available admissible interventions.
Concretely, for a proposed S we will perform a causal scrubbing (component
replacement) test: we re-run trajectories under each cj (and
additional held-out contexts) while applying mask([d] \ S) at the
specified internal sites and times, and we compare the resulting
empirical distributions of (Ut, At)
to the unmasked baseline. Let P̂ and P̂mask denote the
corresponding empirical distributions (constructed from repeated runs
with independent randomness in the environment, tools, and the masking
resampling). We accept S
if
P^{D_j,I}(U_t,A_t), P^{D_j,I,([d]S)}(U_t,A_t))
+,
\end{equation}
for a slack η chosen to
account for finite-sample error, and where (mval, ℐval)
may be larger/stronger than the families used to propose S. This separates proposal (which may use
coarse statistics) from certification (which directly targets the invariance
constraint).
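As a concrete sketch of how the acceptance test can be computed, the following minimal Python fragment estimates the per-time empirical total variation gap from sampled outcomes and applies the ε + η threshold. The helper names (`tv_distance`, `accept`) and the nested-list layout of the sampled (u_t, a_t) outcomes are illustrative choices, not part of our formal notation.

```python
from collections import Counter

def tv_distance(samples_p, samples_q):
    """Empirical total variation distance between two sample sets
    over a discrete space of (tool_call, action) outcomes."""
    p, q = Counter(samples_p), Counter(samples_q)
    n_p, n_q = len(samples_p), len(samples_q)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[x] / n_p - q[x] / n_q) for x in support)

def accept(baseline_runs, masked_runs, eps, eta):
    """Accept S iff the max-over-(context, time) empirical TV gap is
    at most eps + eta.  baseline_runs[j][t] and masked_runs[j][t] are
    lists of sampled (u_t, a_t) outcomes for context j at time t."""
    gap = max(
        tv_distance(baseline_runs[j][t], masked_runs[j][t])
        for j in range(len(baseline_runs))
        for t in range(len(baseline_runs[j]))
    )
    return gap <= eps + eta, gap
```

In a full implementation the outcome lists would be produced by the repeated masked and unmasked reruns described above, with one list per validation context, intervention, and decision time.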
It is necessary to be explicit about what can and cannot be identified from interventional access. First, even in the population limit, there need not be a unique minimizer S*: multiple disjoint coordinate sets may satisfy Definition~, and the equivalence relation S ∼ S′ defined previously further collapses solutions that span the same patchable subspace. Second, identifiability is relative to ℐ: if the admissible masking scheme conditions on a rich history summary, it may implicitly preserve information outside S via the matching variables, thereby making smaller S appear sufficient. Conversely, if admissible matching is too crude, masking may destroy legitimate correlations needed for stable behavior, inflating the apparent basis size. For this reason, the object we are computing is best viewed as an identification target relative to the admissible intervention family, rather than as a uniquely determined ``true’’ causal set.
The invariance criterion is only meaningful when the counterfactual trajectories induced by mask([d] \ S) remain in-distribution for the agent. If masking produces activation patterns that are atypical, the agent may enter regimes where its tool-call and action heads behave erratically, causing spurious large discrepancies Δ(⋅, ⋅) that do not reflect genuine dependence on the masked coordinates. This is precisely why Definition~ quantifies over an admissible intervention family ℐ and why we require the admissibility invariant stated earlier: interventions must preserve on-distribution statistics up to monitored divergence.
Tool-output and memory interventions present an analogous hazard. If a counterfactual tool output is syntactically invalid, semantically incoherent, or violates latent formatting conventions the agent has internalized, then changes in (Ut, At) may reflect distribution shift rather than causal dependence on the modified content. Accordingly, our tool-side interventions will be restricted to type-correct and, when appropriate, semantically equivalent perturbations (or randomized outputs drawn from a calibrated output model), and we will treat violations of these constraints as invalid tests rather than as evidence about the basis.
We will therefore aim to compute a set Ŝ that is (i) small, (ii) passes a stringent empirical version of the invariance constraint under admissible counterfactuals, and (iii) is reported only up to the natural SAE-induced equivalence ∼. The next section gives an explicit polynomial-time procedure achieving a logarithmic approximation factor on a constructed surrogate instance, together with a post-hoc validation step that enforces the acceptance criterion and triggers refinement when the surrogate is unsound.
We now describe a concrete discovery procedure: greedy coverage-based selection with a post-hoc certification step. The procedure takes as input the trained encoder f (or trains it as a preprocessing step), the finite context family {cj}j = 1m, and an admissible intervention menu ℐtest, and it outputs a coordinate set Ŝ ⊆ [d] intended to satisfy the empirical invariance constraint. Operationally, the method proceeds by (i) estimating, for each context–feature pair (j, i), a bounded causal effect score that quantifies how much the behavior (Ut, At) changes when feature i is masked in an admissible way, (ii) thresholding these scores to obtain a Boolean coverage instance Ĉ ∈ {0, 1}m × d, (iii) running greedy Set Cover on Ĉ, and (iv) validating the resulting set by direct scrubbing tests of the invariance statement.
Fix a context cj and let Dj denote the
induced rollout distribution. For each feature i ∈ [d], we define an
intervention scheme mask(i) ∈ ℐtest that
replaces the coordinate zt, i
at the chosen write locations and times by a resampled value drawn from
a matched distribution, while leaving the rest of the computation
unchanged. We then measure the induced change in behavior by a distance
between the joint distributions of tool calls and actions. A canonical
choice consistent with our notation is the time-aggregated total
variation distance
\begin{equation}
e_{j,i}\ :=\ \max_{t\in[T]}\ \Delta_{\mathrm{TV}}\!\left(P^{D_j}(U_t,A_t),\ P^{D_j,\mathrm{mask}(i)}(U_t,A_t)\right),
\end{equation}
where PDj
denotes the baseline distribution under cj (with
whatever stochasticity is present in the environment and tools), and
PDj, mask(i)
denotes the distribution under the same context with the single-feature
admissible masking applied. In settings where tool calls are sparse and
action distributions are high-entropy, we may replace the maxt ∈ [T]
aggregator by a weighted sum ∑twtΔTV(⋅, ⋅)
concentrating on decision times; the subsequent reduction and selection
steps depend only on the fact that ej, i ∈ [0, 1]
is a bounded score.
To estimate these effect scores efficiently, we use paired reruns: we cache a baseline
rollout for cj (including
random seeds when possible), and then rerun the same context multiple
times while applying mask(i)
with independent resampling noise but matched external randomness. Let
q denote the number of such
interventional reruns per pair (j, i). From the resulting
samples we form empirical distributions P̂Dj(Ut, At)
and P̂Dj, mask(i)(Ut, At)
and compute
\begin{equation}
\hat e_{j,i}\ :=\ \max_{t\in[T]}\ \Delta_{\mathrm{TV}}\!\left(\widehat P^{D_j}(U_t,A_t),\ \widehat P^{D_j,\mathrm{mask}(i)}(U_t,A_t)\right).
\end{equation}
The admissibility diagnostics are enforced during this step: if the
patched activations violate the monitored in-distribution criterion
(e.g. a divergence threshold on the activation stream or a calibrated
log-likelihood under an activation model), we treat the corresponding
run as invalid and resample, rather than interpreting the resulting
behavior change as causal dependence.
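The admissibility-checked rerun loop can be sketched as below. The Mahalanobis proxy is one of the diagnostics mentioned in the text (an empirical Gaussian fit); `run_masked`, `is_admissible`, and the retry budget are illustrative stand-ins for the monitored in-distribution criterion and the resample-on-violation policy.

```python
import numpy as np

def mahalanobis_admissible(z_patched, mu, cov_inv, threshold):
    """Admissibility proxy: flag a patched activation as in-distribution
    iff its Mahalanobis distance under an empirical Gaussian fit
    (mean mu, inverse covariance cov_inv) is below the threshold."""
    diff = z_patched - mu
    return float(diff @ cov_inv @ diff) <= threshold ** 2

def paired_effect_runs(run_masked, is_admissible, q, max_attempts=10):
    """Collect q admissible interventional reruns for one (j, i) pair,
    resampling (rather than keeping) runs that violate the monitored
    in-distribution criterion."""
    outcomes = []
    for _ in range(q):
        for _ in range(max_attempts):
            z, behavior = run_masked()  # one rerun, fresh resampling noise
            if is_admissible(z):
                outcomes.append(behavior)
                break
        else:
            raise RuntimeError("no admissible rerun found; invalid test")
    return outcomes
```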
Given the matrix of estimates êj, i,
we choose a threshold τ = τ(ε)
(optionally with a safety margin to accommodate estimation error) and
define a Boolean coverage matrix Ĉ ∈ {0, 1}m × d
by
\begin{equation}
\widehat{C}_{j,i}\ :=\ \mathbf{1}\{\hat e_{j,i}\ \ge\ \tau\}.
\end{equation}
The intended semantics is that Ĉj, i = 1
indicates that feature i is individually causal in
context cj
under our admissible masking scheme, in the sense that removing it
produces a detectable change in the distribution of (Ut, At).
We emphasize that Ĉ is a
surrogate object: it compresses a multi-time, distributional invariance
constraint into a finite incidence structure on (j, i). The role of the
certification step below is precisely to guard against unsoundness
introduced by this compression.
With Ĉ in hand, we solve the induced Set Cover instance: each feature i corresponds to the set of contexts it ``covers,’’ namely {j : Ĉj, i = 1}. We run the standard greedy rule: starting from S = ∅ and the uncovered context set J = [m], iteratively choose a feature maximizing the number of newly covered contexts, add it to S, and remove the contexts it covers. Implementationally, we store each column of Ĉ as a sparse bitset, allowing each greedy step to be computed by fast intersections and popcounts.
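A minimal sketch of the greedy step with bitset columns, using Python integers as bitsets and `bin(...).count("1")` as the popcount; the column encoding is our own illustrative choice.

```python
def greedy_set_cover(columns, m):
    """Greedy Set Cover on the thresholded coverage matrix.
    columns[i] is an integer bitset whose j-th bit equals C_hat[j, i];
    m is the number of contexts.  Returns selected feature indices."""
    uncovered = (1 << m) - 1                 # all m contexts start uncovered
    selected = []
    while uncovered:
        # feature covering the most still-uncovered contexts (popcount)
        gains = [bin(col & uncovered).count("1") for col in columns]
        best = max(range(len(columns)), key=gains.__getitem__)
        if gains[best] == 0:                 # remaining contexts uncoverable
            break
        selected.append(best)
        uncovered &= ~columns[best]
    return selected
```

Each iteration is an AND plus a popcount per column, matching the sparse-bitset implementation note above.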
The greedy output is only a hypothesis about which coordinates form a sufficient mediator set. We therefore subject the candidate S to a direct scrubbing test: for each validation context (which may include held-out contexts and stronger intervention schemes), we rerun the agent while applying mask([d] \ S) at the same internal sites, and we compare the empirical distribution of (Ut, At) to the unmasked baseline. The acceptance criterion is precisely of the form given earlier: we accept S if the observed discrepancy is at most ε + η uniformly over the validation family. This step is the only point at which we commit to the invariance property itself; the preceding coverage construction is treated as a proposal mechanism.
If certification rejects S, we refine rather than restarting. A simple refinement rule is residual augmentation: identify the contexts and times with the largest validation discrepancy, recompute (or reuse) êj, i restricted to those failing contexts, and add a small number of features with the largest residual effect. We may also perform local search in the opposite direction by backward elimination: once a set passes validation, we attempt to remove features one at a time (or in small batches) and re-run validation, keeping any removal that preserves acceptance. This backward elimination is computationally more expensive than greedy selection but empirically improves near-minimality, and it preserves the monotonic structure needed for the approximation analysis provided the acceptance predicate is treated as a monotone constraint with respect to adding mediators.
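Backward elimination can be sketched as follows, treating the validation procedure as a black-box predicate `passes_validation` (an illustrative stand-in for the full scrubbing rerun):

```python
def backward_eliminate(S, passes_validation):
    """Backward elimination: try removing features one at a time,
    keeping any removal that still passes scrubbing validation.
    passes_validation(S) reruns the certification test for set S."""
    S = list(S)
    changed = True
    while changed:
        changed = False
        for i in list(S):
            candidate = [f for f in S if f != i]
            if passes_validation(candidate):
                S = candidate          # removal preserved acceptance
                changed = True
    return S
```

Each attempted removal costs one full validation pass, consistent with the cost accounting later in this section.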
Finally, given an accepted Ŝ, we may optionally construct a monitor MŜ(τ) based on statistics of zt, Ŝ (e.g. thresholds on feature magnitudes, a calibrated density model, or a classifier predicting tool-call patterns). Such monitors are auxiliary outputs: they do not enter the definition of the interaction basis, but they exploit the fact that Ŝ is designed to parameterize the agent’s interaction behavior under admissible counterfactuals.
Fix the thresholded incidence matrix Ĉ ∈ {0, 1}m × d
constructed from the empirical effect scores via the thresholding rule above. For each feature
i ∈ [d] write Cov(i) := {j ∈ [m] : Ĉj, i = 1},
and for a set S ⊆ [d]
write Cov(S) := ⋃i ∈ SCov(i).
The greedy loop in Algorithm~ is exactly the classical greedy algorithm
for Set Cover on the universe [m] with candidate sets {Cov(i)}i = 1d.
Therefore, letting SĈ⋆
denote an optimal set cover for this constructed instance, the standard
analysis yields
\begin{equation}
|S_{\mathrm{greedy}}|\ \le\ (1+\ln m)\,\bigl|S^{\star}_{\widehat C}\bigr|.
\end{equation}
We emphasize what this guarantee does and does not imply. It is a purely combinatorial
statement about the surrogate object Ĉ; it does not, by itself, establish
the interaction-basis invariance constraint. To connect the surrogate
objective to invariance, we require a soundness implication: informally, that
covering all contexts in Ĉ
suffices to make masking the complement [d] \ S behavior-preserving
under the admissible scrubbing semantics.
We isolate the model-specific assumptions into a single implication,
which we treat as the contract that the effect-score construction must
satisfy. Fix a context family {cj}j = 1m
and an admissible masking operator mask(⋅) as above. We say that Ĉ is (ε, η)-sound (relative to the test family)
if every S ⊆ [d] with
Cov(S) = [m]
satisfies
\begin{equation}
\max_{j\in[m]}\ \max_{t\in[T]}\ \Delta_{\mathrm{TV}}\!\left(P^{D_j}(U_t,A_t),\ P^{D_j,\mathrm{mask}([d]\setminus S)}(U_t,A_t)\right)\ \le\ \varepsilon+\eta.
\end{equation}
Here η ≥ 0 is an explicit
slack capturing the fact that the surrogate Ĉ is built from single-feature
effects, whereas the invariance constraint concerns a joint intervention on all non-selected
coordinates. A sufficient set of semantic conditions under which (ε, η)-soundness is
plausible is: (i) detectability—for every context j there exists at least one
coordinate i such that masking
i alone produces a detectable
deviation (Ĉj, i = 1),
and any set S that includes at
least one such coordinate for each j blocks all large deviations when
the rest are jointly masked; and (ii) effect subadditivity—the causal effect of
simultaneously masking multiple coordinates not in S does not exceed (up to η) the maximum of their individual
effects as measured by the metric used to define Ĉ. Condition (ii) is a form of ``no
destructive interference’’ for the masking intervention family; it is
not automatic, and it is precisely why we insist on the post-hoc
scrubbing step as the operational arbiter of invariance.
Under (ε, η)-soundness, we may interpret the set-cover optimum |SĈ⋆| as an upper bound on the smallest mediator set that passes the empirical invariance test (at tolerance ε + η) on the given contexts. In that regime, the greedy bound becomes an explicit near-minimality statement for the recovered mediator set, up to the unavoidable logarithmic factor in m.
The preceding discussion treats Ĉ as given; we now account for
statistical error in the empirical effect estimates. Since each êj, i
is a bounded statistic in [0, 1] (by
boundedness of ΔTV
and our time aggregation), uniform concentration can be obtained by
Hoeffding plus a union bound over the md context–feature pairs.
Concretely, for any γ, δ ∈ (0, 1), if we use q independent admissible reruns per
pair (j, i) with
matched external randomness and independent resampling noise for mask(i), then
\begin{equation}
q\ \ge\ \frac{1}{2\gamma^{2}}\,\ln\!\frac{2md}{\delta}
\end{equation}
suffices to ensure maxj, i|êj, i − ej, i| ≤ γ
with probability at least 1 − δ. A standard corollary is a margin condition for
correct thresholding: if there exists γ > 0 such that, for all (j, i), either ej, i ≥ τ + γ
or ej, i ≤ τ − γ,
then the thresholded matrix Ĉ
equals the population thresholded matrix C entrywise with probability at
least 1 − δ. In this
margin-separated regime, the approximation bound holds relative to the population
coverage instance, and the only remaining gap to invariance is the
semantic soundness implication, which is exactly what the certification step empirically
enforces on a held-out validation family.
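A small helper makes the rerun budget concrete, assuming the standard Hoeffding-plus-union-bound rate q ≥ ln(2md/δ)/(2γ²) over the md context–feature pairs; the function name is our own.

```python
import math

def reruns_per_pair(m, d, gamma, delta):
    """Reruns per (j, i) pair from Hoeffding plus a union bound over
    all m*d pairs: q >= ln(2*m*d/delta) / (2*gamma**2) guarantees
    max_{j,i} |e_hat - e| <= gamma with probability >= 1 - delta."""
    return math.ceil(math.log(2 * m * d / delta) / (2 * gamma ** 2))
```

For example, m = 100 contexts, d = 1000 features, accuracy γ = 0.05, and failure probability δ = 0.01 give a few thousand reruns per pair, which is why effect estimation dominates the cost accounting below.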
Putting the pieces together, we obtain the following conditional guarantee. Suppose (a) the admissibility checks enforce that all masked reruns remain within the monitored in-distribution criterion, so that ej, i is meaningful as an in-distribution causal effect; (b) the margin-separated concentration precondition holds so that Ĉ matches C with probability 1 − δ; and (c) the induced coverage instance is (ε, η)-sound in the sense above. Then, with probability at least 1 − δ, the greedy set Sgreedy satisfies the empirical invariance constraint at tolerance ε + η, and moreover its cardinality is within a factor (1 + ln m) of the minimum set size among all sets that satisfy the coverage constraints encoded by C. Finally, even when (c) fails, the certification step prevents false acceptance: it directly tests the invariance statement and rejects Sgreedy if the surrogate construction was unsound on the validation family, at which point refinement augments the mediator set until the invariance discrepancy falls below the target tolerance.
We record the dominant costs in the natural parameters. Let T be the rollout horizon and Cfwd the cost of one rollout step of πθ including tool I/O. The effect-estimation phase performs m ⋅ d ⋅ q interventional reruns (plus baseline caching), each costing O(T Cfwd), for a total of Õ(mdqTCfwd) time, typically the leading term. Greedy Set Cover on the sparse bitset representation of Ĉ can be implemented in O(mdlog d) time (or faster with incremental updates), which is lower-order unless q is very small. Space is O(md) bits for Ĉ plus the storage required for cached activations used by the SAE and admissible resampling; streaming or checkpointing can reduce the latter at the cost of recomputation. The post-hoc validation requires V additional reruns for V validation contexts/interventions, contributing O(VTCfwd) time, and any backward-elimination pruning multiplies this cost by the number of attempted removals.
The logarithmic factor in the greedy bound is not an artifact of our analysis: even on the finite context set, the induced combinatorial problem inherits the worst-case inapproximability of Set Cover. Thus, absent additional structure (e.g. special forms of Ĉ, submodularity of a continuous relaxation, or strong causal sparsity assumptions), one should not expect a polynomial-time algorithm with a substantially better worst-case approximation ratio in m. In our setting, the meaningful avenues for improvement are therefore (i) strengthening the semantic validity of the surrogate construction (reducing η and increasing margin separation), and (ii) reducing the number of contexts m needed to achieve the desired behavioral coverage by designing tests that are maximally discriminative for tool-use semantics.
We now formalize a class of parameter updates under which the interaction basis is invariant (up to the usual SAE identifiability symmetries), and we record a corresponding approximate stability bound when the update is small in a causal sense. The guiding principle is that the basis is a property of the mediator-generating computation inside the agent—i.e. how upstream computation and observations produce the latent coordinates zt—rather than a property of any particular downstream action head. Consequently, updates that preserve the structural equations generating the selected mediators should preserve the interaction basis.
Fix write locations and an SAE/transcoder f so that zt = f(ht) ∈ ℝd.
We view a rollout as an SCM over time-indexed internal variables. For
each t, let Zt denote the
random vector of SAE coordinates and let Rt denote the
collection of remaining internal variables sufficient to render the
computation Markovian at the chosen granularity (e.g. other
residual-stream components, attention states, and any tool memory not
represented in Zt). Let Ot collect
exogenous inputs at time t
(observation tokens, tool outputs under the external tool semantics, and
any exogenous randomness). We assume the agent admits structural
equations of the form
Zt = gθ(Z < t, R < t, O ≤ t), Rt = rθ(Z ≤ t, R < t, O ≤ t),
and action/tool outputs are sampled from a head
(Ut, At) ∼ Headθ (Z ≤ t, R ≤ t, O ≤ t),
where the dependence on the past may be summarized by the usual
transformer recurrence through cached key/value states. A parameter
update θ ↦ θ′ is causally local with respect to S if,
for all t, the structural
equation for the selected mediators is unchanged,
while downstream maps (including Head
and the equations for Rt) may change
arbitrarily. Intuitively, causal locality says that fine-tuning may alter how the model
uses the mediators, but not how the mediators are generated from upstream information.
This abstraction captures common regimes in which the update is
restricted to a small set of parameters downstream of the write location
used to define ht
(e.g. tool-call head reweighting, adapter layers inserted late in the
network, or a low-rank update applied only after the mediator extraction
point), while leaving earlier computation approximately intact.
Under causal locality, the interaction-basis invariance statement is preserved
because admissible masking interventions on [d] \ S are defined at the
level of Zt (or the
corresponding subspace in ht) conditional
on Zt, S.
If the law of Zt, S
given upstream variables is unchanged, then the conditional
distributions induced by admissible resampling of non-selected
coordinates remain the same object before and after fine-tuning, and the
causal-scrubbing criterion continues to hold with the same tolerance.
Formally, suppose that for πθ there exists
a set S⋆ such that
for all D ∈ 𝒟 and all
admissible interventions I ∈ ℐ,
maxt ΔTV (PD, I(Ut, At), PD, I, mask([d] \ S⋆)(Ut, At)) ≤ ε.
If θ ↦ θ′
is causally local with respect to S⋆ in the sense above, then
the same inequality holds for πθ′
with the same S⋆
and the same ε under the same
intervention family. The proof is an SCM invariance argument: the
intervention mask([d] \ S⋆)
replaces only the structural equations (or sampled values) for
coordinates outside S⋆ while holding Zt, S⋆
fixed; since the generation of Zt, S⋆
is unchanged by causal locality, the distribution of (Ut, At)
under masked reruns differs from the unmasked reruns in the same way
pre- and post-update, hence the same bound applies.
The preceding statement concerns the existence of a stable basis; we also want stability of the set returned by our procedure. Here we use the margin-separated regime implicit in the finite-sample analysis: if the population effect scores for πθ induce a thresholded coverage instance C whose relevant entries are separated from the threshold by a margin γ, then Theorem~2 implies that Ĉ equals C with high probability given sufficient interventional reruns. Under a causally local update, the population instance C is unchanged for any effect-score definition that depends only on the causal impact of masking non-selected coordinates conditional on Zt, S⋆. Therefore, provided we reuse the same SAE coordinate system (or an equivalent representation), the greedy selection step yields the same set cover solution up to tie-breaking. In practice, when the SAE is retrained after fine-tuning, the coordinate system is identifiable only up to permutation and small within-subspace mixing; thus we interpret stability ``up to feature relabeling within an SAE equivalence class’’ by aligning features via correlation of activations or by solving a minimum-cost matching between coordinates across checkpoints. Under such alignment, the discovered set Ŝ′ for πθ′ coincides with Ŝ for πθ whenever the aligned coverage instance remains margin-separated; this is the operational content of stability.
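One way to implement the cross-checkpoint alignment is a minimum-cost matching on absolute activation correlations. The sketch below uses `scipy.optimize.linear_sum_assignment` and assumes paired activation matrices collected on shared inputs; all names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_features(Z_old, Z_new):
    """Align SAE coordinates across checkpoints by minimum-cost matching
    on cost 1 - |activation correlation|.  Z_old, Z_new have shape
    (n_samples, d): coordinate activations on the same inputs."""
    Zo = Z_old - Z_old.mean(axis=0)
    Zn = Z_new - Z_new.mean(axis=0)
    Zo /= np.linalg.norm(Zo, axis=0, keepdims=True) + 1e-12
    Zn /= np.linalg.norm(Zn, axis=0, keepdims=True) + 1e-12
    corr = np.abs(Zo.T @ Zn)                 # (d, d) |correlation| matrix
    rows, cols = linear_sum_assignment(1.0 - corr)
    return dict(zip(rows.tolist(), cols.tolist()))  # old index -> new index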
Causally local fine-tuning is an idealization. We therefore record a
perturbative statement: if the update perturbs the mediator-generation
mechanism only slightly, then the interaction-basis guarantee degrades
gracefully. Let S be a
candidate basis and suppose that, for each t and each D ∈ 𝒟, the conditional laws of the
selected mediators satisfy
\begin{equation}
\Delta_{\mathrm{TV}}\!\left(P_{\theta}\bigl(Z_{t,S}\mid Z_{<t},R_{<t},O_{\le t}\bigr),\ P_{\theta'}\bigl(Z_{t,S}\mid Z_{<t},R_{<t},O_{\le t}\bigr)\right)\ \le\ \kappa
\end{equation}
in expectation over O ≤ t (or
uniformly on a high-probability event under D). Then for any admissible masking
intervention on [d] \ S, a coupling
argument plus the triangle inequality yields
ΔTV (PD(Ut, At), PD, mask([d] \ S)(Ut, At)) ≤ ε + O(κ) + ηadm,
where ηadm captures
any additional divergence introduced by using the same empirical
admissible resampling distributions after fine-tuning (e.g. if the
activation marginals used for resample ablation are slightly
mismatched). Concretely, if admissibility is implemented by resampling
from an empirical conditional Q(⋅ ∣ Zt, S)
learned under πθ, then ηadm can be bounded by
ΔTV(Qθ, Qθ′)
terms that are themselves controlled by the κ-perturbation condition under mild regularity. The
upshot is that small causal drift in mediator generation increases the
invariance error additively, and the certification step will accept the
old S whenever ε + O(κ) + ηadm
remains below the target tolerance.
This stability view motivates a simple operational protocol: after fine-tuning, we first attempt to reuse the previously certified set S and rerun certification. Failure of validation can be interpreted as evidence that either (i) the update was not causally local with respect to the chosen mediators (i.e. κ is not small), or (ii) the admissible intervention family has become mismatched (i.e. ηadm is not small). In both cases, the remedy is the same: re-estimate effect scores on a small diagnostic context suite to identify newly causal coordinates, augment S accordingly, and re-certify. When margin separation holds, this refinement typically recovers the same basis plus a small number of additional features that account for the newly introduced dependence.
We stress what the stability guarantee does not assert. If fine-tuning changes the upstream computation that produces Zt, S in a way that introduces qualitatively new latent variables relevant for tool use, then the true minimum basis may change, and no identification method can avoid re-discovery. Our claim is narrower: for the common class of updates that preserve tool-use semantics by reweighting or reshaping downstream decision boundaries while leaving the mediator computation essentially intact, the interaction basis is an invariant object, and our procedure recovers it consistently subject to the same finite-sample and margin conditions as in the base model.
We complement the approximation and stability guarantees with matching lower bounds of two kinds: (i) computational hardness of finding a minimum interaction basis even on a finite suite of tests, and (ii) information-theoretic lower bounds on the number of admissible interventional reruns required to identify a sparse basis with high confidence. These results justify both the logarithmic approximation factor of greedy selection and the essentially unavoidable dependence of sample complexity on sparsity and confidence.
We reduce Set Cover to the finite-test decision problem: given contexts {cj}j = 1m,
candidate features [d], and a
budget K, decide whether there
exists S ⊆ [d] with
|S| ≤ K satisfying
the ε-invariance constraint on
the test set under the admissible masking interventions. Consider a Set Cover
instance with universe [m] and
sets 𝒮1, …, 𝒮d ⊆ [m].
We build a one-step (T = 1)
agent-environment pair in which the context index j is revealed as an exogenous
observation O1 = j, and the
SAE coordinates Z1 ∈ {0, 1}d
are generated deterministically by
Z1, i := 1{j ∈ 𝒮i}, i ∈ [d].
We define a single binary action A1 ∈ {0, 1} (tool call
can be ignored or taken as a fixed dummy symbol) by the head
equation
$$
A_1\ :=\ \bigvee_{i=1}^d Z_{1,i},
$$
so that A1 = 1 on
every context j that is
contained in at least one set. (If an element j is uncovered by all 𝒮i, we may remove it from
the universe as usual.) Thus, under the unmasked dynamics, P(A1 = 1 ∣ O1 = j) = 1
for all j ∈ [m].
To instantiate admissible masking, we specify ℐ to include resample ablation on any subset
of coordinates: for i ∉ S, the intervention
replaces Z1, i by an
independent draw from the empirical marginal Bernoulli(p) with some fixed p ∈ (0, 1), while holding Z1, S fixed.
This is admissible in the sense that the post-intervention distribution
of Z1 remains in
the same product family and preserves coordinate-wise statistics; in
particular, for any i the
marginal under intervention equals the reference marginal by
construction. Under this masking, if S fails to cover context j (i.e. j ∉ ∪i ∈ S𝒮i),
then all coordinates with Z1, i = 1 lie in
[d] \ S and are
resampled. Consequently,
Pmask([d] \ S)(A1 = 1 ∣ O1 = j) = 1 − (1 − p)deg (j),
where deg (j) := |{i: j ∈ 𝒮i}|.
Choosing p such that (1 − p)deg (j) ≥ 2ε
for all j (e.g. take p small and note deg (j) ≤ d) forces
ΔTV (P(A1 ∣ O1 = j), Pmask([d] \ S)(A1 ∣ O1 = j)) ≥ ε
whenever S does not cover
j, since the unmasked law is a
point mass at A1 = 1 while the masked
law has at least 2ε mass on
A1 = 0. Conversely,
if S covers j, then there exists i ∈ S with Z1, i = 1 held
fixed under masking, hence A1 = 1 deterministically
both before and after masking and the TV distance is 0. Therefore, S satisfies the invariance
constraint on all m contexts
if and only if {𝒮i}i ∈ S
is a set cover. This establishes NP-hardness of the decision problem and hence of the
optimization variant.
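The reduction can be checked mechanically using the closed-form masked law from the construction; the sketch below evaluates the per-context discrepancy and accepts exactly when S is a set cover (the function name and input encoding are our own).

```python
def invariance_ok(sets, S, m, p, eps):
    """Per-context invariance check in the reduction: a covered context
    has TV gap 0; an uncovered context j has gap (1 - p)**deg(j)
    (the closed form stated in the construction), which exceeds eps
    once p is chosen small enough."""
    for j in range(m):
        if any(j in sets[i] for i in S):
            continue                       # covered: masked law unchanged
        deg = sum(j in s for s in sets)    # number of sets containing j
        if (1.0 - p) ** deg > eps:         # uncovered: detectable gap
            return False
    return True
```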
The above reduction is approximation-preserving in the standard sense: an interaction basis of size k corresponds to a set cover of size k and vice versa, without distorting the objective. Therefore, the classical inapproximability of Set Cover transfers directly: unless P = NP, no polynomial-time algorithm can guarantee an approximation factor better than (1 − o(1))ln m in the worst case. In particular, the (1 + ln m) factor achieved by greedy selection on the constructed coverage instance is essentially tight under worst-case complexity assumptions.
A related hardness perspective arises when we view basis selection as
choosing mediators that
``hit’’ all distinguishable counterfactual discrepancies. Suppose we formulate constraints not as ``each
context must be invariant under masking’’ but as a family of pairwise
requirements: for a collection of counterfactual trajectory pairs (τ, τ′) deemed
semantically distinct, masking non-selected variables must not collapse
the induced distributions of (Ut, At)
so as to confuse the pair beyond ε. One can encode each such pair as
an element of a universe and each candidate coordinate i as a test that detects (or
preserves) that distinction; selecting a mediator set that preserves all
distinctions is then a Hitting Set instance. Since Hitting Set is equivalent to Set Cover under polynomial
reductions, this yields NP-hardness for strengthened variants in which
constraints are trajectory-level or pairwise rather than per-context.
The practical interpretation is that hardness is not an artifact of our
particular invariance definition; it persists across natural
reformulations that aim to preserve behavioral distinctions under
admissible scrubbing.
Computational hardness does not by itself preclude successful identification on structured instances; we therefore also record a statistical lower bound showing that even with unbounded computation, identifying a valid k-sparse basis among d candidates requires on the order of klog (d/k) bits of information and hence a corresponding number of interventional samples. We consider a class of one-step instances indexed by an unknown support S⋆ ⊆ [d] with |S⋆| = k. For each i ∈ [d], define a context ci whose unmasked action distribution is Bernoulli(1/2), but for which masking coordinate i (while holding all other coordinates fixed) shifts the action distribution by ±ε if and only if i ∈ S⋆; if i ∉ S⋆, masking i has zero effect. Concretely, we may implement this by letting a latent bit B ∼ Bernoulli(1/2) be part of O1, setting Z1, i = B for i ∈ S⋆ and Z1, i independent noise for i ∉ S⋆, and taking A1 = B unless masking destroys the relevant coordinate, in which case A1 is replaced by an independent fair coin. Under such a construction, the total variation effect size of masking i is exactly ε for i ∈ S⋆ and 0 otherwise, and all interventions remain admissible because masked coordinates are resampled from the same marginal family.
Any algorithm that outputs a basis Ŝ with success probability at least
1 − δ must, in particular,
distinguish among the $\binom{d}{k}$
possible supports. By Fano’s inequality, if I denotes the transcript of all adaptive
interventional queries and observations, then
$$
\mathbb{P}(\widehat S = S^\star)\ \le\ 1-\frac{H(S^\star\mid
\mathsf{I})-1}{\log \binom{d}{k}},
$$
so achieving ℙ(Ŝ = S⋆) ≥ 1 − δ
requires mutual information I(S⋆; I) ≳ klog (d/k) + log (1/δ).
On the other hand, each admissible interventional query produces an
observation whose distribution differs between two supports by at most
O(ε) in TV (and hence
by at most O(ε2) in KL for
bounded binary observations), so the information gained per query is
O(ε2). It
follows that the number of interventional samples must satisfy
$$
q\ =\
\Omega\!\left(\frac{k\log(d/k)+\log(1/\delta)}{\varepsilon^2}\right),
$$
up to universal constants, even allowing adaptivity and even when the
horizon is one. This lower bound matches, in order, the dependence
appearing in standard finite-sample analyses of effect estimation and
sparse support recovery, and it clarifies that the main bottleneck is
not merely algorithmic inefficiency but the intrinsic
indistinguishability of nearby causal instances under small tolerated TV
error.
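For reference, the order-level bound is easy to evaluate numerically; the constant `c` below stands in for the unspecified universal constant, and the function name is our own.

```python
import math

def rerun_lower_bound(d, k, delta, eps, c=1.0):
    """Order-level Fano lower bound on interventional reruns:
    q = Omega((k*log(d/k) + log(1/delta)) / eps**2); c is the
    unspecified universal constant."""
    return c * (k * math.log(d / k) + math.log(1.0 / delta)) / eps ** 2
```

For d = 1024 candidate features, k = 16, δ = 0.01, and ε = 0.1, the bound is on the order of several thousand reruns, comparable to the Hoeffding-style upper-bound budget discussed earlier.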
Taken together, these lower bounds delineate the feasible guarantees: logarithmic-factor approximation is the best possible in the worst case for polynomial-time methods, and ε−2 interventional scaling with a klog (d/k) sparsity term is unavoidable for reliable identification. The role of structure (e.g. margins, modular tool-use circuits, and restricted intervention families) is therefore to move real instances away from worst-case behavior rather than to evade the fundamental barriers.
We evaluate whether the discovered coordinate set Ŝ ⊆ [d] functions as an interaction basis in the operational sense encoded by the ε-invariance constraint, and whether the greedy construction behaves as predicted on realistic tool-using agents. Our experiments therefore separate (i) environment design (contexts/tests, trajectory distributions, and admissible interventions), (ii) identification (SAE training and effect estimation), and (iii) generalization (held-out prompts, tool-output perturbations, and long-horizon execution). Throughout, we report confidence intervals at level 1 − δ using the same sampling assumptions as in the effect-estimation step (independent reruns conditional on cached prefixes when applicable).
We instantiate a suite of tool-use benchmarks spanning increasing
compositionality. Each benchmark defines (a) a tool API
with typed arguments and structured outputs, (b) a context family {cj}j = 1m
specifying prompts, initial memory, and (optionally) environment states,
and (c) a trajectory distribution family 𝒟 induced by sampling inputs and tool
responses. We include three tiers: tasks where the correct policy calls
exactly one tool or none; tasks where tool calls must occur in a fixed
partial order (e.g. search → extract
→ compute); and tasks where
intermediate tool outputs determine which subsequent tool is needed. The
last tier is designed to exercise conditional tool selection (and hence
the joint law of (Ut, At)),
not merely final-answer accuracy. For each tier we construct both
``clean’’ contexts (short prompts, canonical formatting) and ``messy’’
contexts (distractors, irrelevant entities, and partially specified
instructions) so that the agent’s behavior is not trivially determined
by surface cues.
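To make the benchmark interface concrete, the following sketch shows one way a tier-2 context and its fixed partial order on tool calls might be encoded. All names here (`ToolSpec`, `Context`, the tool names `search`/`extract`/`compute`) are illustrative assumptions; the text does not fix an implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolSpec:
    """Typed tool signature: argument names/types and an output schema."""
    name: str
    arg_types: dict[str, type]
    output_schema: dict[str, type]

@dataclass
class Context:
    """One context c_j: prompt, initial memory, optional environment state."""
    prompt: str
    memory: list[str] = field(default_factory=list)
    env_state: Optional[dict] = None
    tier: int = 1          # 1: single tool or none, 2: fixed partial order,
                           # 3: conditional tool selection
    messy: bool = False    # distractors / partially specified instructions

# Hypothetical tier-2 tools with the fixed order search -> extract -> compute.
SEARCH = ToolSpec("search", {"query": str}, {"results": list})
EXTRACT = ToolSpec("extract", {"doc": str}, {"fields": dict})
COMPUTE = ToolSpec("compute", {"expr": str}, {"value": float})

def valid_order(calls: list[str]) -> bool:
    """Check that the observed tool-call sequence respects the partial order."""
    order = {"search": 0, "extract": 1, "compute": 2}
    idx = [order[c] for c in calls if c in order]
    return idx == sorted(idx)
```

A tier-3 benchmark would additionally make the admissible next tool a function of the previous tool's structured output, which is what exercises the joint law of (Ut, At).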
We generate counterfactuals via admissible interventions I ∈ ℐ at three loci: internal
activations, tool outputs, and memory. For internal interventions we use
resample ablation/patching at chosen write locations (e.g. residual
stream at selected blocks, tool-call head pre-softmax logits),
implemented by (i) caching a reference rollout τ under context cj, (ii)
selecting a time index t (or a
set of times), and (iii) replacing ht (or a
subspace corresponding to coordinates [d] \ S) by a draw from a
matched conditional distribution estimated from a replay buffer of
on-distribution activations. Operationally, we implement matched draws
by nearest-neighbor retrieval in activation space conditioned on a small
set of summary variables (prompt embedding bucket, tool name, and
discretized memory length), and we monitor admissibility by bounding a
divergence proxy (e.g. Mahalanobis distance under an empirical Gaussian
fit, or a classifier two-sample test) so that interventions remain
near-distribution.
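The matched-conditional resampling step can be sketched as follows, assuming cached activations are stored as NumPy arrays. The conditioning key, the neighbor pool size, and the Mahalanobis threshold are illustrative choices, not the exact settings used in our runs.

```python
import numpy as np

def matched_resample(h_ref, buffer_h, buffer_keys, key, mask_idx, rng,
                     maha_threshold=3.0):
    """Resample-ablate coordinates `mask_idx` of an activation vector h_t.

    buffer_h:    (N, d) replay buffer of on-distribution activations
    buffer_keys: (N,) discrete summary variable per buffered activation
                 (e.g. prompt-embedding bucket x tool name x memory-length bin)
    key:         summary variable of the reference rollout at time t
    mask_idx:    coordinates [d] \\ S to replace by a matched draw
    """
    # 1) Restrict the buffer to the matching conditioning cell.
    cell = buffer_h[buffer_keys == key]
    if len(cell) == 0:
        raise ValueError("no on-distribution matches for this conditioning cell")
    # 2) Nearest-neighbor retrieval on the *kept* coordinates, so the donor
    #    agrees with the reference on the unmasked subspace.
    keep = np.setdiff1d(np.arange(h_ref.shape[0]), mask_idx)
    d2 = ((cell[:, keep] - h_ref[keep]) ** 2).sum(axis=1)
    donor = cell[rng.choice(np.argsort(d2)[:10])]  # sample among 10 nearest
    # 3) Splice: masked coords from the donor, kept coords from the reference.
    h_new = h_ref.copy()
    h_new[mask_idx] = donor[mask_idx]
    # 4) Admissibility proxy: Mahalanobis distance under an empirical
    #    Gaussian fit to the buffer (ridge term for numerical stability).
    mu = buffer_h.mean(axis=0)
    cov = np.cov(buffer_h, rowvar=False) + 1e-4 * np.eye(buffer_h.shape[1])
    delta = h_new - mu
    maha = float(np.sqrt(delta @ np.linalg.solve(cov, delta)))
    return h_new, maha, maha <= maha_threshold
```

Interventions whose Mahalanobis proxy exceeds the threshold are discarded, which is how the near-distribution requirement is enforced operationally.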
For tool-output interventions, we define families of type-correct
perturbations: (a) formatting rewrites (equivalent JSON key ordering,
whitespace, and paraphrases of textual snippets), (b) representation
variants (scientific notation vs. decimal, reordered search results),
and (c) semantic perturbations within a declared tolerance (e.g. adding
irrelevant distractor documents while keeping at
least one relevant document present). For memory interventions, we apply
admissible edits such as re-serialization (equivalent formatting),
insertion of irrelevant but plausible past turns, or truncation to a
fixed window with compensating summary tokens drawn from a summarizer
distribution. These tool and memory interventions allow us to test
whether Ŝ mediates robust
tool-use decisions rather than brittle formatting dependencies.
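A minimal instance of the formatting-rewrite family (a), assuming JSON tool outputs: key reordering and whitespace changes preserve the parsed value, and value equality of the parsed objects serves as the semantic-equivalence check. (Representation variants and tolerance-based perturbations would need format- and task-specific logic beyond this sketch.)

```python
import json
import random

def perturb_tool_output(raw_json: str, rng: random.Random) -> str:
    """Type-correct, semantics-preserving rewrite of a JSON tool output:
    shuffles dict key order and toggles whitespace, leaving values intact."""
    def rewrite(x):
        if isinstance(x, dict):
            items = list(x.items())
            rng.shuffle(items)                 # equivalent key ordering
            return {k: rewrite(v) for k, v in items}
        if isinstance(x, list):
            return [rewrite(v) for v in x]     # preserve list order
        return x                               # scalars untouched

    compact = rng.random() < 0.5               # whitespace variant
    seps = (",", ":") if compact else (", ", ": ")
    return json.dumps(rewrite(json.loads(raw_json)), separators=seps)

def semantically_equal(a: str, b: str) -> bool:
    """Equality of parsed values; dict comparison ignores key order."""
    return json.loads(a) == json.loads(b)
```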
Our primary metric is the empirical invariance gap for the discovered
set Ŝ:
$$
\widehat{\mathrm{Gap}}(\hat S)\ :=\ \max_{j\in[m]}\ \max_{t\le T}\
\Delta_{\mathrm{TV}}\!\Bigl(\widehat P(U_t,A_t\mid c_j),\ \widehat
P^{\mathrm{mask}([d]\setminus \hat S)}(U_t,A_t\mid c_j)\Bigr),
$$
estimated by rerunning the agent with masking interventions on
non-selected coordinates while holding zt, Ŝ
fixed (via subspace patching in ht). We also
report a variant using a product metric over time (e.g. average TV
across t) to separate
``single catastrophic step'' failures from diffuse drift. As a complementary notion, we compute the tool-call invariance restricted to Ut and the action invariance restricted to At, since some contexts are tool-sensitive but answer-insensitive (and vice versa). For minimality, we report |Ŝ| and the greedy coverage curve (fraction of contexts covered vs. selected features), and we compare to a lower bound obtained by solving an integer program on small instances or by a dual-fitting bound from the set-cover relaxation. When we train an optional monitor MŜ(τ), we evaluate its predictive performance for downstream safety-relevant labels (e.g. ``uses external tool'', ``queries private memory'', ``executes code'') using AUROC and calibration error, but we treat this as secondary: the principal claim is invariance, not classification.
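Once the per-context, per-step empirical distributions have been tabulated, the gap estimator reduces to elementary array operations. The (m, T, K) layout below is an assumed encoding of the K discrete joint (tool-call, action) outcomes, not something prescribed above.

```python
import numpy as np

def invariance_gap(p_hat: np.ndarray, p_mask: np.ndarray,
                   aggregate: str = "max") -> float:
    """Empirical invariance gap for a candidate set S-hat.

    p_hat, p_mask: shape (m, T, K) -- per-context, per-step empirical
    distributions over K joint (tool-call, action) outcomes, estimated
    from reruns without and with masking of the non-selected coordinates.
    """
    # (m, T) table of per-step total-variation distances
    per_step = 0.5 * np.abs(p_hat - p_mask).sum(axis=-1)
    if aggregate == "max":        # Gap-hat: max over contexts j and steps t
        return float(per_step.max())
    if aggregate == "mean_t":     # diffuse-drift variant: average TV over t
        return float(per_step.mean(axis=1).max())
    raise ValueError(f"unknown aggregate: {aggregate}")
```

The tool-call and action invariances are obtained the same way after marginalizing the K outcomes onto Ut or At alone.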
We compare IB-Discover to baselines that select internal variables without the admissible interventional semantics: (i) magnitude baselines, which choose coordinates with the largest mean activation or variance; (ii) attribution baselines, which select features or heads with high gradient-based attribution to tool-call logits; (iii) random k-subsets matched in size to Ŝ; (iv) lasso feature selection, which regresses tool-call decisions on SAE coordinates with an ℓ1 penalty and selects the support; and (v) raw-component patching, which patches raw residual-stream components (or MLP neurons) chosen by their causal-effect estimates, thereby testing whether the SAE coordinate system provides a strictly better mediator basis. All baselines are evaluated with the same admissible masking protocol to avoid conflating selection with evaluation.
We ablate each design choice that can affect soundness of the constructed coverage instance. For the representation, we vary SAE dictionary size d, sparsity/regularization strength, and the choice of activation stream ht (early vs late blocks; tool-call head vs general residual). For effect estimation, we vary the number of interventional queries q, the time aggregation rule (max vs weighted sum over t), and the admissibility matching granularity (marginal resampling vs conditional resampling on prompt/tool state). For selection, we compare greedy set cover to alternatives (submodular maximization heuristics, stochastic greedy, and a simple ``top-k by effect’’ rule) to test whether coverage structure matters beyond effect magnitude. Finally, we ablate tool-output interventions: invariance may hold under internal masking but fail once tool outputs are perturbed, which indicates that the chosen mediators do not fully capture the policy’s dependence on the tool channel.
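The greedy set-cover selection rule compared above can be sketched as follows; the coverage dictionary, mapping each candidate feature to the set of (context, step) tests it mediates above threshold, is an assumed encoding of the constructed instance C.

```python
def greedy_set_cover(coverage: dict[int, set[int]], universe: set[int]):
    """Greedy surrogate for a minimal S-hat: repeatedly pick the feature
    covering the most still-uncovered tests, until all tests are covered
    or no feature makes progress (instance not coverable)."""
    uncovered = set(universe)
    selected, curve = [], []
    while uncovered:
        best = max(coverage, key=lambda f: len(coverage[f] & uncovered))
        gain = coverage[best] & uncovered
        if not gain:
            break                      # report a partial cover
        selected.append(best)
        uncovered -= gain
        curve.append(1 - len(uncovered) / len(universe))  # coverage fraction
    return selected, curve
```

The returned `curve` is the greedy coverage curve reported alongside |Ŝ|; a ``top-k by effect'' rule corresponds to ignoring `uncovered` and ranking features by |coverage[f]| alone.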
We include two classes of stress tests aimed at breaking hidden
assumptions behind 𝒟 and ℐ. The first is a multi-threaded benchmark in
which a single prompt multiplexes several independent subtasks, each
requiring a different tool pathway, and the agent must interleave tool
calls while maintaining separate scratchpad/memory threads. Concretely,
a context specifies r subtasks
with disjoint identifiers, and tool outputs are interleaved; failure
modes include tool-call leakage across threads and premature
termination. We test whether Ŝ
discovered on single-thread distributions continues to satisfy
invariance when the number of concurrent threads increases and when
distractor tool outputs are inserted.
For OOD, we evaluate on (a) prompt paraphrases and instruction
inversions, (b) distribution shifts in tool output style (newline
conventions, reordered fields), (c) longer horizons T with delayed tool-use decisions,
and (d) adversarially constructed trajectories that preserve
admissibility diagnostics (activation-distance bounds) but maximize
disagreement between masked and unmasked behavior. We report invariance
gaps separately for each OOD slice and track which additional features
are selected during refinement (Step 6) when validation fails, thereby
empirically characterizing how basis size must grow to maintain ε-invariance under increasingly
hostile but still type-correct perturbations.
The interaction-basis viewpoint reframes tool-use interpretability as a constrained mediator identification problem rather than as post hoc attribution. Once a set S satisfies the ε-invariance constraint against admissible masking of [d] \ S, we may treat zt, S as an operational bottleneck: the joint law of (Ut, At) is approximately preserved when all other internal degrees of freedom are randomized in a distribution-preserving way. This has two immediate safety implications. First, monitoring may be performed on a low-dimensional channel without requiring direct access to the full activation state. A monitor MS(τ) may be designed either as a simple statistic (e.g. thresholding a small subset of coordinates corresponding to ``calls ’’-like features) or as a learned classifier on the time series {zt, S}t ≤ T; the invariance guarantee justifies that such a monitor is not merely predictive on a fixed distribution, but is aligned with the causal interface the policy uses for tool decisions within the tested admissibility envelope. Second, editing becomes conceptually cleaner: interventions targeted to zt, S (or to the corresponding subspace in ht) can be interpreted as interventions on variables that are near-necessary for tool choice in the tested contexts. In particular, when S is small, a policy-level safety patch (e.g. suppressing a subset of coordinates associated with unsafe tool invocations) admits a causal interpretation that is typically absent from gradient-based steering on high-dimensional representations.
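The two monitor designs mentioned above can be illustrated as follows; the pooling choices and the threshold form are hypothetical, and any learned classifier would consume the pooled features.

```python
import numpy as np

def monitor_features(z_series: np.ndarray) -> np.ndarray:
    """Pool the mediator time series {z_{t,S}} of shape (T, |S|) into a
    fixed-length feature vector (per-coordinate max, mean, final value)
    for a learned monitor M_S."""
    return np.concatenate([z_series.max(axis=0),
                           z_series.mean(axis=0),
                           z_series[-1]])

def threshold_monitor(z_series: np.ndarray, coord: int, tau: float) -> bool:
    """Simple-statistic monitor M_S: flag the trajectory if the selected
    coordinate ever exceeds the threshold tau at any step."""
    return bool((z_series[:, coord] > tau).any())
```

Either monitor reads only the |S|-dimensional channel, which is the point: no access to the full activation state is required once the invariance claim holds.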
A further consequence is that stability under fine-tuning can be stated and tested at the level of the mediator representation. Theorem 3 isolates a class of parameter updates that preserve the structural equations generating Zt, S*, while permitting changes downstream. Under this causal locality condition, the interaction basis persists, and monitoring/editing rules defined on zt, S* are expected to transfer. Practically, this suggests an engineering discipline for alignment updates: if one can constrain fine-tuning procedures so that they predominantly reweight the mapping from a stable mediator interface to logits (for tool selection and action selection), then the discovered basis may remain valid across model versions. Conversely, when an update changes the upstream feature formation (e.g. by shifting attention patterns that determine whether a tool schema is recognized), we should expect basis drift; our methodology provides a diagnostic via failed invariance under masking validation.
The chief limitation is that our guarantees are only as strong as the
admissible intervention family ℐ and
the representational choice embodied by the SAE/transcoder f. An SAE provides a linear
coordinate system with sparsity; it does not in general supply a minimal
sufficient statistic for tool choice, nor does it guarantee
disentanglement of nonlinear interactions. If the true mediators are
fundamentally nonlinear (e.g. parity-like combinations, or
multiplicative gating between ``tool warranted'' and ``tool
arguments'' features), then any coordinate-wise masking test may
mischaracterize necessity/sufficiency. In such cases, a set S may pass an ε-invariance test on a finite suite
while failing in untested combinations because the relevant mediator is
a function ϕ(zt, S)
not representable as a sparse subset. One remedy is to enlarge the
candidate mediator family beyond coordinates to include low-dimensional
subspaces or learned features with explicit compositional closure
(e.g. allowing pairwise products or small neural ``readout’’ networks
whose internal units become the candidates). The cost is that the
set-cover surrogate becomes less direct: the monotonicity and
thresholding assumptions used to construct C become harder to justify when
mediator candidates overlap nonlinearly.
A second limitation concerns adversarial or strategically robust agents. Our framework presumes that admissible interventions keep the computation near-distribution and that the agent does not explicitly condition its behavior on subtle indicators of being intervened upon. If the policy can detect intervention artifacts—even those that preserve low-order activation statistics—it may route around the mediator channel being tested, causing spurious acceptance of a basis on benign distributions while retaining hidden degrees of freedom for tool misuse. This failure mode is an instance of distributional contamination: the empirical indistinguishability that motivates admissibility is not equivalent to causal indistinguishability for a sufficiently capable model. One response is to strengthen admissibility diagnostics (e.g. two-sample tests on richer features, or training a discriminator to detect intervention vs control prefixes and enforcing near-chance performance), and to include adversarially chosen I ∈ ℐ that maximize disagreement subject to the admissibility constraints. Another response is to expand 𝒟 to include deliberate red-team trajectories; the interaction basis is, by definition, distribution-relative, and safety claims must be read as ``within (𝒟, ℐ)’’ rather than as global properties.
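The discriminator-based admissibility diagnostic suggested above can be approximated with a classifier two-sample test: held-out accuracy near chance is evidence (not proof) of near-distribution interventions. The sketch below uses a least-squares linear probe rather than a trained neural discriminator, which is an assumed simplification.

```python
import numpy as np

def c2st_accuracy(x_control, x_interv, n_train_frac=0.7, seed=0):
    """Classifier two-sample test proxy: fit a linear probe to distinguish
    control vs. intervened activation prefixes; return held-out accuracy.
    Accuracy near 0.5 is consistent with admissibility."""
    rng = np.random.default_rng(seed)
    X = np.vstack([x_control, x_interv])
    y = np.r_[np.zeros(len(x_control)), np.ones(len(x_interv))]
    perm = rng.permutation(len(X))           # shuffle before splitting
    X, y = X[perm], y[perm]
    n_train = int(n_train_frac * len(X))
    Xtr = np.c_[X[:n_train], np.ones(n_train)]          # bias column
    w, *_ = np.linalg.lstsq(Xtr, y[:n_train], rcond=None)
    Xte = np.c_[X[n_train:], np.ones(len(X) - n_train)]
    pred = (Xte @ w > 0.5).astype(float)
    return float((pred == y[n_train:]).mean())
```

Enforcing near-chance accuracy for such probes (ideally with richer feature maps) operationalizes the ``near-chance discriminator'' requirement; a capable policy could in principle still condition on statistics the probe does not capture.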
The same distribution-relativity appears in long-horizon settings.
Even if maxt ΔTV(⋅, ⋅) ≤ ε
holds per step, trajectory-level divergence can accumulate with horizon
T, particularly when tool
calls influence future observations (via environment/tool outputs) and
when memory updates create feedback. For sequential composition, one may
prefer a bound on the trajectory law ΔTV(P(τ), Pmask(τ)),
or a coupling argument showing that per-step invariance implies bounded
regret in tool-use events. Achieving such statements likely requires
additional mixing or contraction assumptions on the induced interaction
dynamics, or a strengthened basis criterion that controls not only
immediate (Ut, At)
but also the distribution of memory states mt + 1 under
masking. Empirically, we expect the basis size required for ε-invariance to grow with horizon
unless the mediator coordinates encode a stable plan representation;
this motivates hierarchical variants in which one discovers bases at
multiple timescales (e.g. a
planner'' basis that governs coarse tool sequences and anexecutor’’
basis that governs argument formatting).
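The per-step-to-trajectory accumulation can be made explicit by the standard hybrid (telescoping) argument; the following is a sketch, under the assumption that per-step ε-invariance holds along prefixes drawn from the unmasked trajectory law:

$$
\Delta_{\mathrm{TV}}\bigl(P(\tau),\,P^{\mathrm{mask}}(\tau)\bigr)
\;\le\; \sum_{t=1}^{T} \mathbb{E}_{\tau_{<t}\sim P}
\Bigl[\Delta_{\mathrm{TV}}\bigl(P(\,\cdot\mid\tau_{<t}),\,P^{\mathrm{mask}}(\,\cdot\mid\tau_{<t})\bigr)\Bigr]
\;\le\; T\varepsilon.
$$

The first inequality is the telescoping step; tightening the linear Tε growth to O(ε) is precisely where mixing or contraction assumptions on the induced interaction dynamics would have to enter.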
Extensions to multimodal tools are conceptually straightforward but technically delicate. If the tool API accepts or returns images, audio, or other structured modalities, then both ht and the admissible tool-output interventions must be modality-aware. On the representation side, one may train separate SAEs on modality-specific streams (vision encoder residuals, audio features) and allow the candidate mediator set to include coordinates across modalities, yielding zt = (zttext, ztvision, ztaudio). On the intervention side, ``type-correct’’ perturbations must preserve semantics in the relevant modality (e.g. image augmentations that preserve object identity, or audio time-shifts within tolerance). The main mathematical change is that admissibility should be defined with respect to the joint distribution of multimodal activations and tool I/O; naive marginal resampling risks violating cross-modal dependencies, thereby creating artifacts detectable by the policy. Conditional resampling keyed on multimodal summaries, or generative models for on-distribution activation patching, may be required.
Finally, long-horizon memory presents a special case of tool use where the ``tool’’ is a read/write channel into a persistent state. Here it is natural to treat memory operations as part of (Ut, At) and to include the memory state in the invariance target, but this increases the dimensionality of the object being preserved. A practical compromise is to define a memory-restricted basis guaranteeing invariance of (i) whether a memory read/write occurs and (ii) a coarse descriptor of the write (topic, keys accessed), while allowing low-level formatting to vary. Formally, we replace (Ut, At) by a projection ψ(Ut, At, mt + 1) and seek invariance for ψ; the choice of ψ is a policy/safety decision. Developing principled projections and corresponding admissible interventions for memory (e.g. summary-consistent truncations) is an open direction, and it is likely necessary for deploying interaction-basis monitoring in agents whose unsafe behavior is mediated primarily through long-term state rather than immediate tool calls.