Tool-using language models couple a high-dimensional internal computation to an external interface consisting of (i) discrete tool-call decisions, (ii) structured arguments, (iii) tool outputs that may be stochastic or adversarially perturbed, and (iv) persistent memory state. Even when the policy πθ is implemented by a standard transformer, the induced interaction trajectory τ = (o1 : T, u1 : T, y1 : T, m1 : T, a1 : T) is no longer governed solely by the model’s token distribution. Rather, the joint law of (Ut, At) depends on a closed-loop system in which tool behavior and memory updates feed back into later observations. This setting complicates mechanistic analysis: the objects we ultimately care to preserve or monitor are distributions over tool calls and actions along trajectories, not merely next-token probabilities under a fixed prompt.
A common mechanistic strategy in text-only settings is to search for sparse ``circuits'' defined in terms of individual neurons, attention heads, or short paths between them. For tool agents, such a strategy is unreliable for two distinct reasons. First, tool use introduces discontinuities in the input distribution: the same underlying task intent can be realized by heterogeneous tool outputs, formatting differences, and memory states. A circuit identified by correlational methods on one slice of trajectories may fail under slight changes in tool formatting, or under on-distribution perturbations that preserve semantics while altering surface form. Second, the parameters \(\theta\) of deployed agents are rarely static. Fine-tuning procedures (SFT, RLHF, DPO) often preserve tool-use semantics while redistributing computation across internal components. In particular, any identification of ``the decisive attention head'' for calling a tool is unstable under even modest downstream remapping of features into logits. Thus, neuron/head-level attributions do not provide a stable interface for either verification (ensuring behavior invariances) or monitoring (detecting shifts in tool-use intent).
We therefore separate two questions that are conflated by naive circuit discovery: (i) which internal degrees of freedom are necessary for the externally visible interaction distribution, and (ii) which specific low-level implementation (neurons, heads) realizes those degrees of freedom in a particular parameterization. Our approach is to define and search for a compact set of mediator variables that function as an interface between the agent’s internal computation and the tool/action decisions. Concretely, we fix write locations in the computation graph producing internal activations ht, and we train a sparse autoencoder or transcoder f to obtain coordinates zt = f(ht) ∈ ℝd. The coordinates zt, i are candidate mediators: they are designed to be sparse, reusable across contexts, and interpretable enough to support targeted interventions. We emphasize that this is not an interpretability claim per se; it is a representational choice that gives us a discrete index set [d] over which minimality statements are well-posed.
The central requirement is a distributional invariance property stated at the level of interaction behavior. We are given a family 𝒟 of trajectory distributions (including in-domain and specified OOD slices) and an admissible intervention family ℐ that preserves on-distribution statistics (e.g., resample ablations or within-distribution patching of activations, and type-correct or semantically equivalent perturbations of tool outputs). For a candidate index set S ⊆ [d], we ask whether holding zt, S fixed renders the joint distribution of (Ut, At) approximately invariant to admissible interventions on the remaining coordinates. Formally, we measure invariance by a distance Δ between distributions (defaulting to total variation), and we accept S if the induced deviation is at most ε across times t, distributions D ∈ 𝒟, and interventions I ∈ ℐ. This choice forces the basis to be a behavioral object: it is defined by what can change in tool calls and actions, not by what is salient under a particular interpretability visualization.
A definition at this level supports two desiderata that neuron/head circuits typically fail to satisfy. The first is robustness to surface variation: because interventions are constrained to preserve activation marginals (and tool outputs to remain type-correct or semantically equivalent), the invariance criterion discounts brittle circuits that only function under a narrow formatting regime. The second is stability under fine-tuning: if fine-tuning updates are causally local in the sense that they do not alter the structural generation of the mediator variables zt, S from upstream observations, then the same S should remain sufficient even when downstream maps from zt, S to tool/action logits are modified. In other words, we aim to identify a mediator interface that can be tracked across model revisions without re-deriving an entirely new circuit decomposition.
We make five contributions, each aligned with a distinct failure mode
of existing mechanistic workflows for tool agents.
First, we formalize the problem: given πθ, 𝒟, ℐ, and
tolerances (ε, δ),
find a near-minimal subset S ⊆ [d] such that masking
or resampling all non-basis coordinates (while conditioning on zt, S)
preserves the joint law of tool calls and actions within ε, with confidence 1 − δ. This reframes mechanistic
identification as a constrained minimality problem over mediator
coordinates, avoiding reliance on unstable low-level units.
Second, we give a polynomial-time discovery algorithm based on constructing a finite coverage instance from causal effect estimates. For each context/test cj and feature coordinate i, we estimate the distributional effect of admissible interventions on zt, i on the resulting (Ut, At) via Δ. Thresholding these effects yields a binary matrix encoding which features are necessary (up to tolerance) for which contexts. The resulting combinatorial problem is a Set Cover instance, for which the greedy algorithm achieves a (1 + ln m)-approximation in the number m of contexts. This yields an explicit, auditable approximation guarantee tied to the number of causal tests.
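As an illustration, the greedy stage described above can be sketched in a few lines. The coverage matrix `C` is assumed to come from the thresholded effect estimates; the function name and layout are illustrative, not taken from our implementation:

```python
import numpy as np

def greedy_set_cover(C):
    """C: binary (m x d) matrix; C[j, i] = 1 if intervening on feature i
    measurably shifts (U_t, A_t) in context j. Greedily select features
    until every coverable context is covered; achieves the classical
    (1 + ln m)-approximation on the test set."""
    m, d = C.shape
    uncovered = {j for j in range(m) if C[j].any()}
    S = []
    while uncovered:
        # pick the feature covering the most still-uncovered contexts
        gains = [sum(1 for j in uncovered if C[j, i]) for i in range(d)]
        i_star = int(np.argmax(gains))
        if gains[i_star] == 0:
            break
        S.append(i_star)
        uncovered -= {j for j in uncovered if C[j, i_star]}
    return S
```

On a small instance such as `C = [[1,0,1],[0,1,1],[1,1,0]]`, the greedy loop returns two features covering all three contexts.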
Third, we provide finite-sample guarantees for the soundness of the constructed coverage instance. Because effect estimation is performed by interventional reruns under admissibility constraints, sampling error is controlled by standard concentration inequalities. We show that with q interventional queries per context-feature pair scaling as Õ(γ−2log (md/δ)), the thresholded coverage matrix is correct for all entries separated by margin γ, implying that the greedy solution is a valid interaction basis on the evaluated test set.
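For concreteness, a Hoeffding-plus-union-bound version of this sample-size calculation might look as follows. The constant 2/γ² below is one standard choice valid because TV effects are bounded in [0, 1]; it is an illustrative instantiation of the Õ(γ⁻² log(md/δ)) rate, not the paper's exact bound:

```python
import math

def queries_per_pair(gamma, m, d, delta):
    """Hoeffding + union bound over all m*d (context, feature) entries:
    q >= (2 / gamma^2) * ln(2*m*d / delta) interventional queries ensure
    every effect estimate lands within gamma/2 of its mean, so entries
    separated by margin gamma are thresholded correctly w.p. >= 1-delta."""
    return math.ceil((2.0 / gamma**2) * math.log(2 * m * d / delta))
```

For example, with γ = 0.1, m = 100 contexts, d = 1000 features, and δ = 0.05, this yields roughly 3000 queries per context-feature pair.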
Fourth, we prove a stability result under a class of fine-tuning updates. Modeling the agent as an SCM over mediator variables Zt and residual internal state Rt, we show that if the structural equations generating Zt, S* are unchanged by the update θ ↦ θ′ (only downstream readouts into logits change), then S* remains an interaction basis under the same admissible intervention family. Consequently, when the coverage instance remains margin-separated, the discovery procedure recovers the same basis (up to feature relabeling within an SAE equivalence class), formalizing the intuition that a mediator interface is more stable than a neuron-level circuit.
Fifth, we establish matching hardness and propose a validation protocol. We show that the minimum interaction basis problem on a finite test set is NP-hard via a reduction from Set Cover, and that logarithmic-factor approximation is essentially optimal unless P = NP. Since worst-case instances cannot be solved optimally in polynomial time, we emphasize validation: after selecting S, we perform causal scrubbing/component replacement tests in which only zt, S is preserved and all other degrees of freedom are resampled admissibly; acceptance requires that the observed (Ut, At) distributions match within ε on held-out contexts and OOD slices. This closes the loop between discovery (a surrogate combinatorial objective) and the operational interaction-level invariance criterion.
The net effect is a framework that treats tool-use behavior as the primary object, enforces on-distribution intervention semantics, and yields a principled mediator set whose size is approximately minimal relative to the tested contexts. In the next section we situate this approach relative to work on circuits and motifs, sparse feature dictionaries, patching-based causal methods, and prior analyses of fine-tuning and tool use.
A large body of mechanistic interpretability work studies internal structure in terms of neurons, attention heads, and small computation graphs (``circuits'') that explain specific behaviors in language models and vision models. Typical objectives include attributing a prediction to a sparse set of components, identifying recurring motifs (e.g., induction heads), and isolating algorithmic subroutines by analyzing attention patterns, MLP activations, or residual-stream directions. These methods are most mature in text-only settings where the input distribution is comparatively homogeneous and the object of interest is a next-token distribution under a fixed prompt. In tool-using agents, the relevant behavioral object is a distribution over interaction decisions—tool calls, arguments, and direct actions—along closed-loop trajectories, and the same semantic state may be encoded through heterogeneous tool outputs and memory states. This complicates naive circuit transfer: a circuit found by attribution on one surface realization may not persist when tool outputs change format while preserving type or semantics. Our formulation differs by defining a mediator interface at the level of interaction distributions and enforcing invariance under admissible perturbations, rather than seeking an invariant set of low-level units.
Sparse autoencoders (SAEs) and related dictionary learning methods have been used to recover sparse feature coordinates from dense activations in transformers, motivated by superposition and the desire for an indexable feature basis. Subsequent work explores transcoders and other structured encoders that align features across layers or constrain reconstructions to improve interpretability and causal manipulability. A key practical advantage of SAE coordinates is that they provide a discrete set of candidate mediators over which minimality statements and selection procedures are well-posed, while remaining compatible with white-box interventions through decoding back into activation space. Our use of SAEs is aligned with this representational program: we do not assume that any given coordinate is semantically ``pure,'' but rather treat coordinates as a convenient basis in which sparsity can be exploited for discovery and for near-distribution interventions.
Causal analysis by activation patching (also called causal tracing) replaces internal activations in a target run with activations from a source run to test necessity/sufficiency relationships for outputs. Extensions include path patching and head/MLP ablations to isolate information flow through particular edges in the computation graph. More recently, subspace patching and representation-level interventions emphasize changing only a selected subspace while maintaining the remaining activation statistics to avoid off-distribution artifacts. Our admissibility requirement is closest in spirit to these ``matched distribution'' interventions: when estimating effects of candidate mediators, we restrict to interventions that preserve empirical activation marginals/conditionals (e.g., resample ablations) so that measured changes in tool/action distributions reflect causal dependence rather than distribution shift. The distinguishing feature of our setting is that interventions must also accommodate tool-output and memory perturbations, since these are part of the closed-loop causal graph defining the trajectory distribution.
The goal of identifying a compact set of mediators is closely related to causal abstraction and causal representation learning, where one seeks a coarse-grained variable set whose interventions correspond to interventions on the underlying system. Causal scrubbing and component replacement tests operationalize this idea by replacing internal components with sampled surrogates while preserving designated mediators, and then checking whether behavior is preserved. Our validation protocol follows this tradition: after selecting a mediator set, we test whether resampling non-selected degrees of freedom (subject to admissibility) preserves the interaction distribution within tolerance. Conceptually, our ``interaction basis'' can be viewed as a minimal mediator set for a particular coarse-graining target—namely the joint law of tool calls and actions—and our acceptance criterion matches the causal-abstraction objective of preserving interventional behavior rather than correlational summaries.
Tool-using language models and agents (e.g., function calling, retrieval-augmented generation, ReAct-style agents) introduce an explicit tool channel and persistent state, and have been studied both empirically and architecturally. Tool interfaces create structured discontinuities (JSON schemas, API errors, pagination, partial results) and additional failure modes (hallucinated tool outputs, adversarially formatted returns). Parallel work on robustness and reliability in agentic settings emphasizes self-correction, reflection, critique, and error recovery when tools fail or return unexpected values. Systems sometimes incorporate redundant ``hydra''-style designs—multiple heads, multiple proposals, or verifier modules—to reduce single-point failures and to enable self-repair via re-querying and consistency checks. These lines of work motivate our focus on distributional invariance under constrained tool-output perturbations: if a policy is intended to be robust to benign formatting variation or semantically equivalent tool returns, then the mediator interface governing tool-use decisions should be stable under corresponding admissible interventions.
A recurring empirical observation is that fine-tuning (SFT/RLHF/DPO, as well as parameter-efficient adapters) can preserve externally measured capabilities while substantially changing internal representations and component-level attributions. This instability complicates any mechanistic claim phrased at the granularity of specific neurons or attention heads, since equivalent computations may be re-encoded in different subspaces after training updates. Work on probing, CCA-based alignment, and feature matching across checkpoints aims to track representational changes and identify invariants under training. In the tool-use domain, fine-tuning often changes the agent’s preference over tool usage (e.g., calling a tool more or less often) while preserving task success, further entangling mechanism with policy-level incentives. Our stability theorem formalizes a particular sufficient condition under which a mediator set remains valid across such updates: if the structural generation of selected mediators is unchanged and only downstream readouts are modified, then the interaction basis remains stable (up to feature relabeling within an equivalence class). This provides a target for mechanistic workflows that aim to remain meaningful across model revisions.
Finally, our reduction to Set Cover situates interaction-basis discovery among a broader class of problems that seek minimal sets of explanatory variables subject to behavioral constraints, including minimal test sets, hitting sets, and sparse causal explanations. Greedy set cover and its logarithmic approximation guarantees are classical, but their utility in mechanistic settings depends on constructing a sound coverage instance from causal tests. Our contribution is to specify such a construction using admissible interventional effect estimates and to connect it to a validation procedure that checks the original invariance criterion on held-out contexts, thereby decoupling the combinatorial surrogate from the operational behavioral specification.
We fix a tool-using agent policy πθ implemented as a transformer with parameters θ and a specified tool API. Our goal in later sections is to state an invariance criterion for a small set of internal mediator variables; here we set up the causal and probabilistic objects needed to make such statements precise.
We model an interaction as a finite-horizon trajectory τ of length T consisting of observations, tool
calls, tool outputs, internal memory/state, and actions. Concretely we
write
τ = (o1 : T, u1 : T, y1 : T, m1 : T, a1 : T),
where ot
is the agent’s external observation at time t (including the user prompt and any
environment feedback), ut is the tool call at time t (or a
distinguished ``no tool'' symbol), yt is the tool output returned by
the tool API (or a null output if ut is ``no tool''), mt denotes any
persistent agent-side state (e.g. scratchpad, retrieved documents cache,
or hidden memory variables exposed for analysis), and at is the
environment-facing action at time t (which may include emitting tokens
and/or committing to a final decision). We treat πθ as inducing
conditional distributions of the form
(ut, at, mt) ∼ πθ( ⋅ ∣o1 : t, y1 : t − 1, m1 : t − 1 ),
with the understanding that, internally, this distribution is realized
by a deterministic computation graph with injected randomness
(e.g. sampling from logits) and with intermediate activations.
To make counterfactual reasoning explicit, we place these variables in a structural causal model (SCM) whose exogenous noise variables include (i) environment randomness, (ii) tool randomness, and (iii) the agent’s sampling noise. The tool channel is represented by a structural equation yt := Tool(ut, ηttool), where ηttool captures tool nondeterminism and where Tool includes schema validation and error modes. The agent internal computation induces additional endogenous variables, including selected activation vectors ht at write locations we choose to instrument (e.g. residual stream vectors or pre-softmax tool-call logits).
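As a toy illustration of the structural equation yt := Tool(ut, ηttool), the sketch below includes schema validation and explicit error modes; the schema fields, error strings, and failure rate are illustrative assumptions, not part of the formal model:

```python
import json
import random

def tool(u, rng):
    """Structural tool channel y = Tool(u, eta). u is a tool-call dict
    or None (the distinguished 'no tool' symbol); rng supplies the
    exogenous tool noise eta."""
    if u is None:
        return None  # null output when no tool is called
    if "name" not in u or "args" not in u:
        return {"error": "schema_violation"}  # validation error mode
    if rng.random() < 0.05:  # eta: tool nondeterminism / transient failure
        return {"error": "timeout"}
    return {"result": "ok:" + json.dumps(u["args"], sort_keys=True)}
```

Canonicalizing arguments (here via `json.dumps(..., sort_keys=True)`) is what later lets us treat tool calls as discrete outcomes when measuring TV distance.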
A context c specifies an initial condition and/or a partial trajectory template: it can fix the initial user prompt, constrain tool availability, specify a distribution over environment states, or impose other conditions on (o1, m0). Given a context c, the environment dynamics, the tool API behavior, and πθ jointly induce a distribution over trajectories. We write Dc for this induced trajectory distribution, and we consider a family 𝒟 of such distributions, typically including an in-domain component together with specified out-of-distribution slices (e.g. alternative tool-output formats that remain type-correct). When we later quantify invariance, we will require statements to hold uniformly over D ∈ 𝒟 (or over a prescribed finite subfamily generated by test contexts).
We allow interventions on (a) internal activations, (b) memory/state variables, and (c) tool outputs. However, since off-distribution perturbations can create artifacts unrelated to the mechanism of interest, we restrict to an intervention family ℐ designed to preserve on-distribution statistics. Formally, an intervention I ∈ ℐ is a do-operator acting on a designated subset of variables in the SCM (e.g. coordinates of ht or of a derived feature vector zt), together with a rule for sampling replacement values from a matched empirical distribution. Typical admissible internal interventions include resample ablations (replacing selected activations or feature coordinates with draws from cached on-distribution rollouts) and within-distribution patching (substituting values taken from matched runs). For tool and memory interventions, admissibility means that modified outputs remain type-correct and lie in a designated equivalence class (e.g. semantically equivalent strings, alternative formatting, or randomized but schema-valid payloads), so that we probe robustness to benign channel variation rather than to arbitrary corruption. We emphasize that admissibility is not merely a convenience: it is part of the problem specification and will be monitored via divergence checks on intervened variables.
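A minimal sketch of a resample ablation under these admissibility constraints, assuming a cache of feature vectors collected from rollouts under the same context family (the cache layout and function name are illustrative):

```python
import numpy as np

def resample_ablate(z, i, cache, rng):
    """Replace coordinate i of the feature vector z with a draw from
    cached on-distribution values, so the intervened vector stays close
    to the empirical activation manifold.

    z:     current feature vector, shape (d,)
    cache: array of feature vectors from matched rollouts, shape (n, d)
    """
    z_new = z.copy()
    z_new[i] = cache[rng.integers(len(cache)), i]  # matched-marginal draw
    return z_new
```

A richer matching rule (conditioning on other coordinates or on a history summary) would replace the uniform row draw; the marginal version shown here is the simplest admissible scheme.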
Given a chosen activation stream ht ∈ ℝdh
(possibly concatenating multiple write locations), we train a sparse
autoencoder (or a transcoder) to obtain an encoder
f : ℝdh → ℝd, zt = f(ht),
where zt
is sparse in the sense of having few nonzero (or large-magnitude)
coordinates. We treat the coordinates of zt as a
discrete, indexable set of candidate mediator variables. When we
intervene on zt we implement
the intervention by decoding back into activation space via a decoder
g and patching the resulting
reconstruction into the model, or equivalently by applying an
intervention directly in ht constrained
to the subspace corresponding to the chosen coordinates. This choice
gives us two advantages used later: (i) minimality questions over
mediators become well-posed as subset selection problems over [d], and (ii) interventions can be
phrased as coordinate-wise operations (e.g. resampling zt, i)
while remaining close to the empirical activation manifold when combined
with admissibility constraints.
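The encode/intervene/decode pattern can be sketched as follows. The linear SAE and the additive patch h′ = h + g(z′) − g(z), which alters only the decoded subspace of the chosen coordinates, are one common implementation choice rather than a prescribed method; weights here are stand-ins for a trained encoder/decoder:

```python
import numpy as np

class SparseCoder:
    """Toy linear SAE: f(h) = relu(W_e h + b), g(z) = W_d z."""
    def __init__(self, W_e, b, W_d):
        self.W_e, self.b, self.W_d = W_e, b, W_d

    def encode(self, h):
        return np.maximum(self.W_e @ h + self.b, 0.0)

    def decode(self, z):
        return self.W_d @ z

def patch_coordinates(coder, h, new_vals):
    """Intervene on selected z-coordinates and map back to activation
    space: h' = h + g(z') - g(z), so only the spanned subspace changes."""
    z = coder.encode(h)
    z_new = z.copy()
    for i, v in new_vals.items():
        z_new[i] = v
    return h + coder.decode(z_new) - coder.decode(z)
```

Because the patch is additive in the decoder image, the difference h′ − h lies exactly in the decoder subspace of the intervened coordinates, which is what makes subset selection over [d] well-posed.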
The behavioral object we care about is the distribution over
interaction decisions at each time t, namely the joint distribution of
the tool call and the action (Ut, At).
Depending on the agent design, At may include
both text emission and discrete environment actions, while Ut includes a
tool name and arguments or the null call. Since these variables are
typically discrete (e.g. JSON arguments can be canonicalized), we adopt
total variation (TV) distance as our default discrepancy measure:
$$
\Delta_{\mathrm{TV}}(P,Q)=\frac12\sum_x |P(x)-Q(x)|.
$$
When a KL-based metric is convenient (e.g. for logit-level analyses), we
write $\Delta_{\mathrm{KL}}(P\|Q)=\sum_x
P(x)\log\frac{P(x)}{Q(x)}$ and use Pinsker’s inequality $\Delta_{\mathrm{TV}}(P,Q)\le \sqrt{\tfrac12
\Delta_{\mathrm{KL}}(P\|Q)}$ to translate KL control into TV
control when needed. In later definitions, we will apply Δ to distributions over (Ut, At)
under different intervention regimes; when trajectories have variable
length, we can treat termination as an additional absorbing action so
that all distributions live on a common discrete space.
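For discrete laws represented as probability vectors on a common canonicalized support, the default discrepancy and the Pinsker translation can be computed directly (a small sketch; the KL term assumes P is absolutely continuous with respect to Q):

```python
import numpy as np

def tv(P, Q):
    """Total variation distance between probability vectors."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return 0.5 * np.abs(P - Q).sum()

def kl(P, Q):
    """KL divergence KL(P || Q), skipping zero-probability outcomes of P."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    m = P > 0
    return float((P[m] * np.log(P[m] / Q[m])).sum())
```

Pinsker's inequality, TV(P, Q) ≤ sqrt(KL(P‖Q)/2), then lets logit-level KL control translate into TV control on (Ut, At).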
These preliminaries give us a common SCM and measurement interface: we can sample τ ∼ D ∈ 𝒟, extract ht, map to sparse coordinates zt, apply admissible interventions from ℐ either on the tool/memory channel or on internal coordinates, and quantify behavioral change via Δ on the induced laws of (Ut, At). The next section uses exactly these objects to define the interaction-basis invariance criterion and the minimality objective.
We now formalize the mediator notion that will be optimized in later sections. Fix a choice of write locations and an encoder f (SAE/transcoder) producing coordinates zt = f(ht) ∈ ℝd for each time t. We treat each coordinate zt, i as a candidate internal mediator variable, indexed by i ∈ [d], and we consider subsets S ⊆ [d] as candidate interaction bases.
For each context c (hence
each induced D = Dc ∈ 𝒟),
we refine the trajectory SCM by inserting mediator variables (Zt)t = 1T
defined by the deterministic equation
Zt := f(Ht),
where Ht
denotes the selected activation stream (possibly a concatenation of
multiple write locations). The remaining internal degrees of
freedom—including the part of Ht not
represented by the chosen SAE coordinates, as well as other
uninstrumented activations—are bundled into auxiliary endogenous
variables Rt.
Conceptually, (Zt, Rt)
is a reparameterization of the agent internal state at time t relative to our measurement
interface. The action/tool heads then induce conditional laws
(Ut, At, Mt) ∼ πθ( ⋅ ∣o1 : t, y1 : t − 1, m1 : t − 1; Z1 : t, R1 : t ),
where the semicolon indicates that (Z1 : t, R1 : t)
are internal to the computation graph rather than externally observed.
This refinement is only used to define interventions and invariance
statements; we do not assume that (Zt, Rt)
are uniquely identifiable from behavior.
Let S ⊆ [d]. The
defining operation for our invariance criterion is an admissible
``masking’’ intervention on the complement coordinates. Informally, we
hold the selected coordinates Zt, S
fixed while resampling Zt, [d] \ S
from a matched empirical distribution so that the intervened activations
remain typical. Formally, we assume that ℐ contains (or induces) an intervention
scheme
mask([d] \ S) ∈ ℐ
which, at each time t,
replaces Zt, [d] \ S
by a draw from a reference distribution that is (i) estimated from
cached rollouts under the same context family, and (ii) matched on an
admissible conditioning set that includes Zt, S
and may include coarse summaries of the past (o1 : t, y1 : t − 1, m1 : t − 1).
We write this schematically as
Zt, [d] \ S′ ∼ P̂(Zt, [d] \ S ∣ Zt, S, match(o1 : t, y1 : t − 1, m1 : t − 1)), Zt, S′ := Zt, S,
with the understanding that the admissibility invariant monitors that
the induced intervened distribution over internal states remains near
the empirical activation distribution (e.g. by divergence checks). The
exact matching rule is part of the specification of ℐ; our definitions below quantify over all
admissible choices.
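A crude sketch of mask([d] \ S) over cached rollouts; the nearest-neighbour match on the kept coordinates stands in for whatever matching rule the specification of ℐ actually prescribes, and is an illustrative assumption:

```python
import numpy as np

def mask_complement(z, S, cache, rng):
    """mask([d] \\ S): hold z_S fixed and resample the complement jointly
    from cached on-distribution feature vectors, matched (crudely) by
    nearest neighbour on the kept coordinates.

    z:     current feature vector, shape (d,)
    S:     iterable of kept coordinate indices
    cache: cached rollout features, shape (n, d)
    """
    S = np.asarray(sorted(S))
    comp = np.setdiff1d(np.arange(z.shape[0]), S)
    # nearest-neighbour match on Z_{t,S} (an illustrative matching rule)
    j = int(np.argmin(np.linalg.norm(cache[:, S] - z[S], axis=1)))
    z_new = z.copy()
    z_new[comp] = cache[j, comp]
    return z_new
```

Resampling the complement jointly from a single cached row (rather than coordinate-wise) keeps the intervened vector on the empirical manifold, which is exactly what the divergence checks are meant to monitor.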
Given D ∈ 𝒟 and I ∈ ℐ, let PD, I
denote the induced law of the trajectory (or of any measurable function
of it) when we run the agent/environment/tool SCM under D and apply I at its designated targets. In
particular, PD, I(Ut, At)
denotes the marginal law of (Ut, At)
at time t under D and I. When we additionally apply
masking to all non-selected coordinates, we write
PD, I, mask([d] \ S)(Ut, At)
for the corresponding marginal law in the SCM where the intervention
family I is composed with the
masking intervention at the relevant internal sites and times (with
Zt, S
held fixed by construction of mask([d] \ S)).
We can now state the central invariance requirement: a set of mediator coordinates S is sufficient if, once Zt, S is held fixed, adversarially varying (within ℐ) the remaining internal coordinates does not materially change the agent’s tool-use and action distributions.
\begin{definition}[ε-interaction basis]
A set S ⊆ [d] is an ε-interaction basis if
$$
\max_{t\in[T]}\ \sup_{D\in\mathcal{D},\, I\in\mathcal{I}} \Delta\!\big(P_{D,I}(U_t,A_t),\; P_{D,I,\mathrm{mask}([d]\setminus S)}(U_t,A_t)\big)\ \le\ \varepsilon.
$$
\end{definition}
This definition is an approximate conditional invariance: it asserts that, uniformly over admissible perturbations of non-basis coordinates, the distribution of interaction decisions is stable once the basis coordinates are preserved. The maxt ∈ [T] choice is a convenient worst-case aggregation; one may replace it by an average over t or by a trajectory-level discrepancy without changing the optimization structure used later, provided the discrepancy remains bounded.
Among all sets satisfying this definition, we prefer those with smallest cardinality. This yields an intrinsic combinatorial problem.
We emphasize that minimality is defined relative to the chosen measurement interface (write locations and encoder f) and the admissible intervention family ℐ. Enlarging ℐ typically increases the required basis size, while enriching the representation (larger d or additional write locations) can decrease it by making mediators more granular.
SAE coordinates are not canonical: even when f is fixed by training, multiple coordinate systems may represent the same effective mediator subspace, and retraining f can permute or rescale features. Accordingly, we define the interaction basis only up to a natural equivalence relation induced by transformations that preserve the represented internal information.
Let g : ℝd → ℝdh
denote the decoder (or the corresponding reconstruction map used for
patching). For S ⊆ [d], define the decoder
subspace
𝒲(S) := span{ g(ei) : i ∈ S } ⊆ ℝdh,
where ei
is the ith standard basis
vector. We say S ∼ S′ if 𝒲(S) = 𝒲(S′)
(or, in a tolerant version, if the principal angle distance between the
subspaces is below a prescribed threshold). This captures the fact that
if two coordinate sets span the same patchable subspace in activation
space, then masking the complement relative to either set yields the
same family of admissible counterfactuals up to negligible
reconstruction error. In settings where the decoder columns are
approximately orthogonal and features are identifiable up to permutation
and sign, ∼ reduces to the expected
relabeling equivalence.
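The tolerant version of ∼ can be checked numerically via principal angles between the decoder subspaces (a sketch using orthonormalized column spans; thresholds are left to the caller):

```python
import numpy as np

def principal_angle_distance(A, B):
    """A, B: (d_h x k) matrices whose columns span W(S) and W(S').
    Returns the largest principal angle (radians) between the two
    column spaces, computed from singular values of Q_A^T Q_B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    s = np.clip(s, -1.0, 1.0)  # guard against round-off outside [-1, 1]
    return float(np.arccos(s.min()))
```

Identical subspaces give distance 0; orthogonal ones give π/2, so S ∼ S′ (tolerantly) when the returned angle falls below the prescribed threshold.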
Under this convention, the ``minimum interaction basis'' is properly a minimum equivalence class [S]∼ rather than a unique subset of indices. Our algorithmic outputs later will return a concrete representative Ŝ, but all guarantees should be read as invariant under ∼ (and, where relevant, under small perturbations of 𝒲(S) induced by retraining the SAE within an equivalence class).
The definition above specifies a semantic invariance property of a coordinate set S, but it does not by itself determine how we should search over S or how we should certify that a candidate set generalizes beyond a finite battery of tests. We therefore separate (i) the underlying combinatorial problems (optimization and decision versions) from (ii) the empirical acceptance criteria we will impose when only finitely many trajectories and interventions can be executed.
Fix (πθ, 𝒟, ℐ),
horizon T, and tolerance ε. The intrinsic optimization
problem is to find the smallest coordinate subset that satisfies the
invariance constraint:
$$
\min_{S\subseteq[d]}\ |S| \quad\text{s.t.}\quad \max_{t\in[T]}\ \sup_{D\in\mathcal{D},\, I\in\mathcal{I}} \Delta\!\big(P_{D,I}(U_t,A_t),\; P_{D,I,\mathrm{mask}([d]\setminus S)}(U_t,A_t)\big)\ \le\ \varepsilon. \tag{IB-OPT}
$$
The corresponding decision version asks whether some basis of
cardinality at most K exists:
$$
\exists\, S\subseteq[d]:\ |S|\le K \ \text{ and } \ \max_{t\in[T]}\ \sup_{D\in\mathcal{D},\, I\in\mathcal{I}} \Delta\!\big(P_{D,I}(U_t,A_t),\; P_{D,I,\mathrm{mask}([d]\setminus S)}(U_t,A_t)\big)\ \le\ \varepsilon. \tag{IB-DEC}
$$
The distinction is largely conceptual: (IB-DEC) is the natural target
for complexity-theoretic reductions, while (IB-OPT) is the operational
objective for discovery. In either case, the quantification over D ∈ 𝒟 and I ∈ ℐ should be read as ``uniform
over the specified evaluation family and intervention semantics,’’ not
as a claim of distribution-free robustness.
In practice, 𝒟 is instantiated by a
finite collection of contexts {c1, …, cm}
(prompts, initial states, or trajectory templates), and ℐ is instantiated by a finite menu of
admissible intervention schemes (e.g. resample ablations with a
specified matching rule, or tool-output perturbations within a
type-correct equivalence class). Accordingly, we define a
test-restricted constraint by replacing the suprema in (IB-OPT) with maxima over
these finite families. Writing Dj := Dcj
and letting ℐtest ⊆ ℐ denote
the chosen finite intervention menu, we obtain the empirical feasibility
predicate
$$
\Phi_\varepsilon(S):\quad \max_{j\in[m]}\ \max_{I\in\mathcal{I}_{\mathrm{test}}}\ \max_{t\in[T]} \Delta\!\big(P_{D_j,I}(U_t,A_t),\; P_{D_j,I,\mathrm{mask}([d]\setminus S)}(U_t,A_t)\big)\ \le\ \varepsilon.
$$
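Given empirical distributions of (Ut, At) per context, intervention, and time step (as probability vectors on a common canonicalized support), the empirical feasibility predicate reduces to a max over TV gaps. A sketch; the nested-list layout of the inputs is an assumption of this illustration:

```python
import numpy as np

def tv(P, Q):
    """Total variation distance between probability vectors."""
    return 0.5 * np.abs(np.asarray(P, float) - np.asarray(Q, float)).sum()

def phi_feasible(baseline, masked, eps):
    """baseline[j][i][t] / masked[j][i][t]: empirical laws of (U_t, A_t)
    for context j, intervention i, time t, with and without masking.
    Returns True iff the worst-case TV gap is at most eps."""
    worst = max(
        tv(p, q)
        for bj, mj in zip(baseline, masked)
        for bi, mi in zip(bj, mj)
        for p, q in zip(bi, mi)
    )
    return bool(worst <= eps)
```

The same max-over-tests structure is what the greedy coverage surrogate approximates coordinate-by-coordinate.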
Our algorithm will explicitly optimize a surrogate derived from this empirical predicate. Later
generalization statements will therefore be relative to (i) held-out
contexts drawn from the same construction process that produced {cj} and (ii)
specified out-of-distribution slices included in 𝒟 by design.
Even on a finite test family, Φε(S) is not directly observable; we can only estimate it from finitely many rollouts and interventional reruns. We thus relax the goal from exact optimality to near-minimal, empirically validated sets: we seek an output Ŝ that (a) passes an empirical validation procedure with high probability and (b) has size within a provable approximation factor of the smallest set that would pass the same tests under the same admissible intervention semantics. This is the appropriate notion of ``optimality'' for a discovery procedure that is inherently limited by sampling noise and by the coarseness of the admissible intervention family.
A central methodological point is that we do not accept S solely because a collection of per-feature causal effect estimates suggests that S ``covers’’ all contexts. Coverage-style constructions are lossy: they summarize a complicated invariance condition by a finite table of effect magnitudes. We therefore impose an explicit validation criterion that re-checks the invariance statement more directly using the strongest available admissible interventions.
Concretely, for a proposed S we will perform a causal scrubbing (component
replacement) test: we re-run trajectories under each cj (and
additional held-out contexts) while applying mask([d] \ S) at the
specified internal sites and times, and we compare the resulting
empirical distributions of (Ut, At)
to the unmasked baseline. Let P̂ and P̂mask denote the
corresponding empirical distributions (constructed from repeated runs
with independent randomness in the environment, tools, and the masking
resampling). We accept S
if
P^{D_j,I}(U_t,A_t), P^{D_j,I,([d]S)}(U_t,A_t))
+,
\end{equation}
for a slack η chosen to
account for finite-sample error, and where (mval, ℐval)
may be larger/stronger than the families used to propose S. This separates proposal (which may use
coarse statistics) from certification (which directly targets the invariance
constraint).
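As a concrete sketch of how the acceptance test can be computed, the following minimal Python fragment estimates the per-time empirical total variation gap from sampled outcomes and applies the ε + η threshold. The helper names (`tv_distance`, `accept`) and the nested-list layout of the sampled (u_t, a_t) outcomes are illustrative choices, not part of our formal notation.

```python
from collections import Counter

def tv_distance(samples_p, samples_q):
    """Empirical total variation distance between two sample sets
    over a discrete space of (tool_call, action) outcomes."""
    p, q = Counter(samples_p), Counter(samples_q)
    n_p, n_q = len(samples_p), len(samples_q)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[x] / n_p - q[x] / n_q) for x in support)

def accept(baseline_runs, masked_runs, eps, eta):
    """Accept S iff the max-over-(context, time) empirical TV gap is
    at most eps + eta.  baseline_runs[j][t] and masked_runs[j][t] are
    lists of sampled (u_t, a_t) outcomes for context j at time t."""
    gap = max(
        tv_distance(baseline_runs[j][t], masked_runs[j][t])
        for j in range(len(baseline_runs))
        for t in range(len(baseline_runs[j]))
    )
    return gap <= eps + eta, gap
```

In a full implementation the outcome lists would be produced by the repeated masked and unmasked reruns described above, with one list per validation context, intervention, and decision time.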
It is necessary to be explicit about what can and cannot be identified from interventional access. First, even in the population limit, there need not be a unique minimizer S*: multiple disjoint coordinate sets may satisfy Definition~, and the equivalence relation S ∼ S′ defined previously further collapses solutions that span the same patchable subspace. Second, identifiability is relative to ℐ: if the admissible masking scheme conditions on a rich history summary, it may implicitly preserve information outside S via the matching variables, thereby making smaller S appear sufficient. Conversely, if admissible matching is too crude, masking may destroy legitimate correlations needed for stable behavior, inflating the apparent basis size. For this reason, the object we are computing is best viewed as an identification target relative to the admissible intervention family, rather than as a uniquely determined ``true’’ causal set.
The invariance criterion is only meaningful when the counterfactual trajectories induced by mask([d] \ S) remain in-distribution for the agent. If masking produces activation patterns that are atypical, the agent may enter regimes where its tool-call and action heads behave erratically, causing spurious large discrepancies Δ(⋅, ⋅) that do not reflect genuine dependence on the masked coordinates. This is precisely why Definition~ quantifies over an admissible intervention family ℐ and why we require the admissibility invariant stated earlier: interventions must preserve on-distribution statistics up to monitored divergence.
Tool-output and memory interventions present an analogous hazard. If a counterfactual tool output is syntactically invalid, semantically incoherent, or violates latent formatting conventions the agent has internalized, then changes in (Ut, At) may reflect distribution shift rather than causal dependence on the modified content. Accordingly, our tool-side interventions will be restricted to type-correct and, when appropriate, semantically equivalent perturbations (or randomized outputs drawn from a calibrated output model), and we will treat violations of these constraints as invalid tests rather than as evidence about the basis.
We will therefore aim to compute a set Ŝ that is (i) small, (ii) passes a stringent empirical version of the invariance constraint under admissible counterfactuals, and (iii) is reported only up to the natural SAE-induced equivalence ∼. The next section gives an explicit polynomial-time procedure achieving a logarithmic approximation factor on a constructed surrogate instance, together with a post-hoc validation step that enforces the acceptance criterion and triggers refinement when the surrogate is unsound.
We now describe a concrete discovery procedure: greedy coverage-based selection with a post-hoc certification step. The procedure takes as input the trained encoder f (or trains it as a preprocessing step), the finite context family {cj}j = 1m, and an admissible intervention menu ℐtest, and it outputs a coordinate set Ŝ ⊆ [d] intended to satisfy the empirical invariance constraint. Operationally, the method proceeds by (i) estimating, for each context–feature pair (j, i), a bounded causal effect score that quantifies how much the behavior (Ut, At) changes when feature i is masked in an admissible way, (ii) thresholding these scores to obtain a Boolean coverage instance Ĉ ∈ {0, 1}m × d, (iii) running greedy Set Cover on Ĉ, and (iv) validating the resulting set by direct scrubbing tests of the invariance statement.
Fix a context cj and let Dj denote the
induced rollout distribution. For each feature i ∈ [d], we define an
intervention scheme mask(i) ∈ ℐtest that
replaces the coordinate zt, i
at the chosen write locations and times by a resampled value drawn from
a matched distribution, while leaving the rest of the computation
unchanged. We then measure the induced change in behavior by a distance
between the joint distributions of tool calls and actions. A canonical
choice consistent with our notation is the time-aggregated total
variation distance
\begin{equation}
e_{j,i}\ :=\ \max_{t\in[T]}\ \Delta_{\mathrm{TV}}\!\left(P^{D_j}(U_t,A_t),\ P^{D_j,\mathrm{mask}(i)}(U_t,A_t)\right),
\end{equation}
where PDj
denotes the baseline distribution under cj (with
whatever stochasticity is present in the environment and tools), and
PDj, mask(i)
denotes the distribution under the same context with the single-feature
admissible masking applied. In settings where tool calls are sparse and
action distributions are high-entropy, we may replace the maxt ∈ [T]
aggregator by a weighted sum ∑twtΔTV(⋅, ⋅)
concentrating on decision times; the subsequent reduction and selection
steps depend only on the fact that ej, i ∈ [0, 1]
is a bounded score.
To estimate these effect scores efficiently, we use paired reruns: we cache a baseline
rollout for cj (including
random seeds when possible), and then rerun the same context multiple
times while applying mask(i)
with independent resampling noise but matched external randomness. Let
q denote the number of such
interventional reruns per pair (j, i). From the resulting
samples we form empirical distributions P̂Dj(Ut, At)
and P̂Dj, mask(i)(Ut, At)
and compute
\begin{equation}
\hat e_{j,i}\ :=\ \max_{t\in[T]}\ \Delta_{\mathrm{TV}}\!\left(\widehat P^{D_j}(U_t,A_t),\ \widehat P^{D_j,\mathrm{mask}(i)}(U_t,A_t)\right).
\end{equation}
The admissibility diagnostics are enforced during this step: if the
patched activations violate the monitored in-distribution criterion
(e.g. a divergence threshold on the activation stream or a calibrated
log-likelihood under an activation model), we treat the corresponding
run as invalid and resample, rather than interpreting the resulting
behavior change as causal dependence.
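The admissibility-checked rerun loop can be sketched as below. The Mahalanobis proxy is one of the diagnostics mentioned in the text (an empirical Gaussian fit); `run_masked`, `is_admissible`, and the retry budget are illustrative stand-ins for the monitored in-distribution criterion and the resample-on-violation policy.

```python
import numpy as np

def mahalanobis_admissible(z_patched, mu, cov_inv, threshold):
    """Admissibility proxy: flag a patched activation as in-distribution
    iff its Mahalanobis distance under an empirical Gaussian fit
    (mean mu, inverse covariance cov_inv) is below the threshold."""
    diff = z_patched - mu
    return float(diff @ cov_inv @ diff) <= threshold ** 2

def paired_effect_runs(run_masked, is_admissible, q, max_attempts=10):
    """Collect q admissible interventional reruns for one (j, i) pair,
    resampling (rather than keeping) runs that violate the monitored
    in-distribution criterion."""
    outcomes = []
    for _ in range(q):
        for _ in range(max_attempts):
            z, behavior = run_masked()  # one rerun, fresh resampling noise
            if is_admissible(z):
                outcomes.append(behavior)
                break
        else:
            raise RuntimeError("no admissible rerun found; invalid test")
    return outcomes
```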
Given the matrix of estimates êj, i,
we choose a threshold τ = τ(ε)
(optionally with a safety margin to accommodate estimation error) and
define a Boolean coverage matrix Ĉ ∈ {0, 1}m × d
by
\begin{equation}
\widehat{C}_{j,i}\ :=\ \mathbf{1}\{\hat e_{j,i}\ \ge\ \tau\}.
\end{equation}
The intended semantics is that Ĉj, i = 1
indicates that feature i is individually causal in
context cj
under our admissible masking scheme, in the sense that removing it
produces a detectable change in the distribution of (Ut, At).
We emphasize that Ĉ is a
surrogate object: it compresses a multi-time, distributional invariance
constraint into a finite incidence structure on (j, i). The role of the
certification step below is precisely to guard against unsoundness
introduced by this compression.
With Ĉ in hand, we solve the induced Set Cover instance: each feature i corresponds to the set of contexts it ``covers,’’ namely {j : Ĉj, i = 1}. We run the standard greedy rule: starting from S = ∅ and the uncovered context set J = [m], iteratively choose a feature maximizing the number of newly covered contexts, add it to S, and remove the contexts it covers. Implementationally, we store each column of Ĉ as a sparse bitset, allowing each greedy step to be computed by fast intersections and popcounts.
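A minimal sketch of the greedy step with bitset columns, using Python integers as bitsets and `bin(...).count("1")` as the popcount; the column encoding is our own illustrative choice.

```python
def greedy_set_cover(columns, m):
    """Greedy Set Cover on the thresholded coverage matrix.
    columns[i] is an integer bitset whose j-th bit equals C_hat[j, i];
    m is the number of contexts.  Returns selected feature indices."""
    uncovered = (1 << m) - 1                 # all m contexts start uncovered
    selected = []
    while uncovered:
        # feature covering the most still-uncovered contexts (popcount)
        gains = [bin(col & uncovered).count("1") for col in columns]
        best = max(range(len(columns)), key=gains.__getitem__)
        if gains[best] == 0:                 # remaining contexts uncoverable
            break
        selected.append(best)
        uncovered &= ~columns[best]
    return selected
```

Each iteration is an AND plus a popcount per column, matching the sparse-bitset implementation note above.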
The greedy output is only a hypothesis about which coordinates form a sufficient mediator set. We therefore subject the candidate S to a direct scrubbing test: for each validation context (which may include held-out contexts and stronger intervention schemes), we rerun the agent while applying mask([d] \ S) at the same internal sites, and we compare the empirical distribution of (Ut, At) to the unmasked baseline. The acceptance criterion is precisely of the form given earlier: we accept S if the observed discrepancy is at most ε + η uniformly over the validation family. This step is the only point at which we commit to the invariance property itself; the preceding coverage construction is treated as a proposal mechanism.
If certification rejects S, we refine rather than restarting. A simple refinement rule is residual augmentation: identify the contexts and times with the largest validation discrepancy, recompute (or reuse) êj, i restricted to those failing contexts, and add a small number of features with the largest residual effect. We may also perform local search in the opposite direction by backward elimination: once a set passes validation, we attempt to remove features one at a time (or in small batches) and re-run validation, keeping any removal that preserves acceptance. This backward elimination is computationally more expensive than greedy selection but empirically improves near-minimality, and it preserves the monotonic structure needed for the approximation analysis provided the acceptance predicate is treated as a monotone constraint with respect to adding mediators.
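Backward elimination can be sketched as follows, treating the validation procedure as a black-box predicate `passes_validation` (an illustrative stand-in for the full scrubbing rerun):

```python
def backward_eliminate(S, passes_validation):
    """Backward elimination: try removing features one at a time,
    keeping any removal that still passes scrubbing validation.
    passes_validation(S) reruns the certification test for set S."""
    S = list(S)
    changed = True
    while changed:
        changed = False
        for i in list(S):
            candidate = [f for f in S if f != i]
            if passes_validation(candidate):
                S = candidate          # removal preserved acceptance
                changed = True
    return S
```

Each attempted removal costs one full validation pass, consistent with the cost accounting later in this section.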
Finally, given an accepted Ŝ, we may optionally construct a monitor MŜ(τ) based on statistics of zt, Ŝ (e.g. thresholds on feature magnitudes, a calibrated density model, or a classifier predicting tool-call patterns). Such monitors are auxiliary outputs: they do not enter the definition of the interaction basis, but they exploit the fact that Ŝ is designed to parameterize the agent’s interaction behavior under admissible counterfactuals.
Fix the thresholded incidence matrix Ĉ ∈ {0, 1}m × d
constructed from the empirical effect scores via the thresholding rule above. For each feature
i ∈ [d] write Cov(i) := {j ∈ [m] : Ĉj, i = 1},
and for a set S ⊆ [d]
write Cov(S) := ⋃i ∈ SCov(i).
The greedy loop in Algorithm~ is exactly the classical greedy algorithm
for Set Cover on the universe [m] with candidate sets {Cov(i)}i = 1d.
Therefore, letting SĈ⋆
denote an optimal set cover for this constructed instance, the standard
analysis yields
\begin{equation}
|S_{\mathrm{greedy}}|\ \le\ (1+\ln m)\,\bigl|S^{\star}_{\widehat C}\bigr|.
\end{equation}
We emphasize what this guarantee does and does not imply. It is a purely combinatorial
statement about the surrogate object Ĉ; it does not, by itself, establish
the interaction-basis invariance constraint. To connect the surrogate
objective to invariance, we require a soundness implication: informally, that
covering all contexts in Ĉ
suffices to make masking the complement [d] \ S behavior-preserving
under the admissible scrubbing semantics.
We isolate the model-specific assumptions into a single implication,
which we treat as the contract that the effect-score construction must
satisfy. Fix a context family {cj}j = 1m
and an admissible masking operator mask(⋅) as above. We say that Ĉ is (ε, η)-sound (relative to the test family)
if every S ⊆ [d] with
Cov(S) = [m]
satisfies
\begin{equation}
\max_{j\in[m]}\ \max_{t\in[T]}\ \Delta_{\mathrm{TV}}\!\left(P^{D_j}(U_t,A_t),\ P^{D_j,\mathrm{mask}([d]\setminus S)}(U_t,A_t)\right)\ \le\ \varepsilon+\eta.
\end{equation}
Here η ≥ 0 is an explicit
slack capturing the fact that the surrogate Ĉ is built from single-feature
effects, whereas the invariance constraint concerns a joint intervention on all non-selected
coordinates. A sufficient set of semantic conditions under which (ε, η)-soundness is
plausible is: (i) detectability—for every context j there exists at least one
coordinate i such that masking
i alone produces a detectable
deviation (Ĉj, i = 1),
and any set S that includes at
least one such coordinate for each j blocks all large deviations when
the rest are jointly masked; and (ii) effect subadditivity—the causal effect of
simultaneously masking multiple coordinates not in S does not exceed (up to η) the maximum of their individual
effects as measured by the metric used to define Ĉ. Condition (ii) is a form of ``no
destructive interference’’ for the masking intervention family; it is
not automatic, and it is precisely why we insist on the post-hoc
scrubbing step as the operational arbiter of invariance.
Under (ε, η)-soundness, we may interpret the set-cover optimum |SĈ⋆| as an upper bound on the smallest mediator set that passes the empirical invariance test (at tolerance ε + η) on the given contexts. In that regime, the greedy bound becomes an explicit near-minimality statement for the recovered mediator set, up to the unavoidable logarithmic factor in m.
The preceding discussion treats Ĉ as given; we now account for
statistical error in the empirical effect estimates. Since each êj, i
is a bounded statistic in [0, 1] (by
boundedness of ΔTV
and our time aggregation), uniform concentration can be obtained by
Hoeffding plus a union bound over the md context–feature pairs.
Concretely, for any γ, δ ∈ (0, 1), if we use q independent admissible reruns per
pair (j, i) with
matched external randomness and independent resampling noise for mask(i), then
\begin{equation}
q\ \ge\ \frac{1}{2\gamma^{2}}\,\ln\!\frac{2md}{\delta}
\end{equation}
suffices to ensure maxj, i|êj, i − ej, i| ≤ γ
with probability at least 1 − δ. A standard corollary is a margin condition for
correct thresholding: if there exists γ > 0 such that, for all (j, i), either ej, i ≥ τ + γ
or ej, i ≤ τ − γ,
then the thresholded matrix Ĉ
equals the population thresholded matrix C entrywise with probability at
least 1 − δ. In this
margin-separated regime, the approximation bound holds relative to the population
coverage instance, and the only remaining gap to invariance is the
semantic soundness implication, which is exactly what the certification step empirically
enforces on a held-out validation family.
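A small helper makes the rerun budget concrete, assuming the standard Hoeffding-plus-union-bound rate q ≥ ln(2md/δ)/(2γ²) over the md context–feature pairs; the function name is our own.

```python
import math

def reruns_per_pair(m, d, gamma, delta):
    """Reruns per (j, i) pair from Hoeffding plus a union bound over
    all m*d pairs: q >= ln(2*m*d/delta) / (2*gamma**2) guarantees
    max_{j,i} |e_hat - e| <= gamma with probability >= 1 - delta."""
    return math.ceil(math.log(2 * m * d / delta) / (2 * gamma ** 2))
```

For example, m = 100 contexts, d = 1000 features, accuracy γ = 0.05, and failure probability δ = 0.01 give a few thousand reruns per pair, which is why effect estimation dominates the cost accounting below.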
Putting the pieces together, we obtain the following conditional guarantee. Suppose (a) the admissibility checks enforce that all masked reruns remain within the monitored in-distribution criterion, so that ej, i is meaningful as an in-distribution causal effect; (b) the margin-separated concentration precondition holds so that Ĉ matches C with probability 1 − δ; and (c) the induced coverage instance is (ε, η)-sound in the sense above. Then, with probability at least 1 − δ, the greedy set Sgreedy satisfies the empirical invariance constraint at tolerance ε + η, and moreover its cardinality is within a factor (1 + ln m) of the minimum set size among all sets that satisfy the coverage constraints encoded by C. Finally, even when (c) fails, the certification step prevents false acceptance: it directly tests the invariance statement and rejects Sgreedy if the surrogate construction was unsound on the validation family, at which point refinement augments the mediator set until the invariance discrepancy falls below the target tolerance.
We record the dominant costs in the natural parameters. Let T be the rollout horizon and Cfwd the cost of one rollout step of πθ including tool I/O. The effect-estimation phase performs m ⋅ d ⋅ q interventional reruns (plus baseline caching), each costing O(T Cfwd), for a total of Õ(mdqTCfwd) time, typically the leading term. Greedy Set Cover on the sparse bitset representation of Ĉ can be implemented in O(mdlog d) time (or faster with incremental updates), which is lower-order unless q is very small. Space is O(md) bits for Ĉ plus the storage required for cached activations used by the SAE and admissible resampling; streaming or checkpointing can reduce the latter at the cost of recomputation. The post-hoc validation requires V additional reruns for V validation contexts/interventions, contributing O(VTCfwd) time, and any backward-elimination pruning multiplies this cost by the number of attempted removals.
The logarithmic factor in the greedy bound is not an artifact of our analysis: even on the finite context set, the induced combinatorial problem inherits the worst-case inapproximability of Set Cover. Thus, absent additional structure (e.g. special forms of Ĉ, submodularity of a continuous relaxation, or strong causal sparsity assumptions), one should not expect a polynomial-time algorithm with a substantially better worst-case approximation ratio in m. In our setting, the meaningful avenues for improvement are therefore (i) strengthening the semantic validity of the surrogate construction (reducing η and increasing margin separation), and (ii) reducing the number of contexts m needed to achieve the desired behavioral coverage by designing tests that are maximally discriminative for tool-use semantics.
We now formalize a class of parameter updates under which the interaction basis is invariant (up to the usual SAE identifiability symmetries), and we record a corresponding approximate stability bound when the update is small in a causal sense. The guiding principle is that the basis is a property of the mediator-generating computation inside the agent—i.e. how upstream computation and observations produce the latent coordinates zt—rather than a property of any particular downstream action head. Consequently, updates that preserve the structural equations generating the selected mediators should preserve the interaction basis.
Fix write locations and an SAE/transcoder f so that zt = f(ht) ∈ ℝd.
We view a rollout as an SCM over time-indexed internal variables. For
each t, let Zt denote the
random vector of SAE coordinates and let Rt denote the
collection of remaining internal variables sufficient to render the
computation Markovian at the chosen granularity (e.g. other
residual-stream components, attention states, and any tool memory not
represented in Zt). Let Ot collect
exogenous inputs at time t
(observation tokens, tool outputs under the external tool semantics, and
any exogenous randomness). We assume the agent admits structural
equations of the form
Zt = gθ(Z < t, R < t, O ≤ t), Rt = rθ(Z ≤ t, R < t, O ≤ t),
and action/tool outputs are sampled from a head
(Ut, At) ∼ Headθ (Z ≤ t, R ≤ t, O ≤ t),
where the dependence on the past may be summarized by the usual
transformer recurrence through cached key/value states. A parameter
update θ ↦ θ′ is causally local with respect to S if,
for all t, the structural
equation for the selected mediators is unchanged,
while downstream maps (including Head
and the equations for Rt) may change
arbitrarily. Intuitively, causal locality says that fine-tuning may alter how the model
uses the mediators, but not how the mediators are generated from upstream information.
This abstraction captures common regimes in which the update is
restricted to a small set of parameters downstream of the write location
used to define ht
(e.g. tool-call head reweighting, adapter layers inserted late in the
network, or a low-rank update applied only after the mediator extraction
point), while leaving earlier computation approximately intact.
Under causal locality, the interaction-basis invariance statement is preserved
because admissible masking interventions on [d] \ S are defined at the
level of Zt (or the
corresponding subspace in ht) conditional
on Zt, S.
If the law of Zt, S
given upstream variables is unchanged, then the conditional
distributions induced by admissible resampling of non-selected
coordinates remain the same object before and after fine-tuning, and the
causal-scrubbing criterion continues to hold with the same tolerance.
Formally, suppose that for πθ there exists
a set S⋆ such that
for all D ∈ 𝒟 and all
admissible interventions I ∈ ℐ,
maxt ΔTV (PD, I(Ut, At), PD, I, mask([d] \ S⋆)(Ut, At)) ≤ ε.
If θ ↦ θ′
is causally local with respect to S⋆ in the sense above, then
the same inequality holds for πθ′
with the same S⋆
and the same ε under the same
intervention family. The proof is an SCM invariance argument: the
intervention mask([d] \ S⋆)
replaces only the structural equations (or sampled values) for
coordinates outside S⋆ while holding Zt, S⋆
fixed; since the generation of Zt, S⋆
is unchanged by causal locality, the distribution of (Ut, At)
under masked reruns differs from the unmasked reruns in the same way
pre- and post-update, hence the same bound applies.
The preceding statement concerns the existence of a stable basis; we also want stability of the set returned by our procedure. Here we use the margin-separated regime implicit in the finite-sample analysis: if the population effect scores for πθ induce a thresholded coverage instance C whose relevant entries are separated from the threshold by a margin γ, then Theorem~2 implies that Ĉ equals C with high probability given sufficient interventional reruns. Under a causally local update, the population instance C is unchanged for any effect-score definition that depends only on the causal impact of masking non-selected coordinates conditional on Zt, S⋆. Therefore, provided we reuse the same SAE coordinate system (or an equivalent representation), the greedy selection step yields the same set cover solution up to tie-breaking. In practice, when the SAE is retrained after fine-tuning, the coordinate system is identifiable only up to permutation and small within-subspace mixing; thus we interpret stability ``up to feature relabeling within an SAE equivalence class’’ by aligning features via correlation of activations or by solving a minimum-cost matching between coordinates across checkpoints. Under such alignment, the discovered set Ŝ′ for πθ′ coincides with Ŝ for πθ whenever the aligned coverage instance remains margin-separated; this is the operational content of stability.
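One way to implement the cross-checkpoint alignment is a minimum-cost matching on absolute activation correlations. The sketch below uses `scipy.optimize.linear_sum_assignment` and assumes paired activation matrices collected on shared inputs; all names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_features(Z_old, Z_new):
    """Align SAE coordinates across checkpoints by minimum-cost matching
    on cost 1 - |activation correlation|.  Z_old, Z_new have shape
    (n_samples, d): coordinate activations on the same inputs."""
    Zo = Z_old - Z_old.mean(axis=0)
    Zn = Z_new - Z_new.mean(axis=0)
    Zo /= np.linalg.norm(Zo, axis=0, keepdims=True) + 1e-12
    Zn /= np.linalg.norm(Zn, axis=0, keepdims=True) + 1e-12
    corr = np.abs(Zo.T @ Zn)                 # (d, d) |correlation| matrix
    rows, cols = linear_sum_assignment(1.0 - corr)
    return dict(zip(rows.tolist(), cols.tolist()))  # old index -> new index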
Causally local fine-tuning is an idealization. We therefore record a
perturbative statement: if the update perturbs the mediator-generation
mechanism only slightly, then the interaction-basis guarantee degrades
gracefully. Let S be a
candidate basis and suppose that, for each t and each D ∈ 𝒟, the conditional laws of the
selected mediators satisfy
\begin{equation}
\Delta_{\mathrm{TV}}\!\left(P_{\theta}\bigl(Z_{t,S}\mid Z_{<t},R_{<t},O_{\le t}\bigr),\ P_{\theta'}\bigl(Z_{t,S}\mid Z_{<t},R_{<t},O_{\le t}\bigr)\right)\ \le\ \kappa
\end{equation}
in expectation over O ≤ t (or
uniformly on a high-probability event under D). Then for any admissible masking
intervention on [d] \ S, a coupling
argument plus the triangle inequality yields
ΔTV (PD(Ut, At), PD, mask([d] \ S)(Ut, At)) ≤ ε + O(κ) + ηadm,
where ηadm captures
any additional divergence introduced by using the same empirical
admissible resampling distributions after fine-tuning (e.g. if the
activation marginals used for resample ablation are slightly
mismatched). Concretely, if admissibility is implemented by resampling
from an empirical conditional Q(⋅ ∣ Zt, S)
learned under πθ, then ηadm can be bounded by
ΔTV(Qθ, Qθ′)
terms that are themselves controlled by the κ-perturbation condition under mild regularity. The
upshot is that small causal drift in mediator generation increases the
invariance error additively, and the certification step will accept the
old S whenever ε + O(κ) + ηadm
remains below the target tolerance.
This stability view motivates a simple operational protocol: after fine-tuning, we first attempt to reuse the previously certified set S and rerun certification. Failure of validation can be interpreted as evidence that either (i) the update was not causally local with respect to the chosen mediators (i.e. κ is not small), or (ii) the admissible intervention family has become mismatched (i.e. ηadm is not small). In both cases, the remedy is the same: re-estimate effect scores on a small diagnostic context suite to identify newly causal coordinates, augment S accordingly, and re-certify. When margin separation holds, this refinement typically recovers the same basis plus a small number of additional features that account for the newly introduced dependence.
We stress what the stability guarantee does not assert. If fine-tuning changes the upstream computation that produces Zt, S in a way that introduces qualitatively new latent variables relevant for tool use, then the true minimum basis may change, and no identification method can avoid re-discovery. Our claim is narrower: for the common class of updates that preserve tool-use semantics by reweighting or reshaping downstream decision boundaries while leaving the mediator computation essentially intact, the interaction basis is an invariant object, and our procedure recovers it consistently subject to the same finite-sample and margin conditions as in the base model.
We complement the approximation and stability guarantees with matching lower bounds of two kinds: (i) computational hardness of finding a minimum interaction basis even on a finite suite of tests, and (ii) information-theoretic lower bounds on the number of admissible interventional reruns required to identify a sparse basis with high confidence. These results justify both the logarithmic approximation factor of greedy selection and the essentially unavoidable dependence of sample complexity on sparsity and confidence.
We reduce Set Cover to the finite-test decision problem: given contexts {cj}j = 1m,
candidate features [d], and a
budget K, decide whether there
exists S ⊆ [d] with
|S| ≤ K satisfying
the ε-invariance constraint on
the test set under the admissible masking interventions. Consider a Set Cover
instance with universe [m] and
sets 𝒮1, …, 𝒮d ⊆ [m].
We build a one-step (T = 1)
agent-environment pair in which the context index j is revealed as an exogenous
observation O1 = j, and the
SAE coordinates Z1 ∈ {0, 1}d
are generated deterministically by
Z1, i := 1{j ∈ 𝒮i}, i ∈ [d].
We define a single binary action A1 ∈ {0, 1} (tool call
can be ignored or taken as a fixed dummy symbol) by the head
equation
$$
A_1\ :=\ \bigvee_{i=1}^d Z_{1,i},
$$
so that A1 = 1 on
every context j that is
contained in at least one set. (If an element j is uncovered by all 𝒮i, we may remove it from
the universe as usual.) Thus, under the unmasked dynamics, P(A1 = 1 ∣ O1 = j) = 1
for all j ∈ [m].
To instantiate admissible masking, we specify ℐ to include resample ablation on any subset
of coordinates: for i ∉ S, the intervention
replaces Z1, i by an
independent draw from the empirical marginal Bernoulli(p) with some fixed p ∈ (0, 1), while holding Z1, S fixed.
This is admissible in the sense that the post-intervention distribution
of Z1 remains in
the same product family and preserves coordinate-wise statistics; in
particular, for any i the
marginal under intervention equals the reference marginal by
construction. Under this masking, if S fails to cover context j (i.e. j ∉ ∪i ∈ S𝒮i),
then all coordinates with Z1, i = 1 lie in
[d] \ S and are
resampled. Consequently,
Pmask([d] \ S)(A1 = 1 ∣ O1 = j) = 1 − (1 − p)deg (j),
where deg (j) := |{i: j ∈ 𝒮i}|.
Choosing p such that (1 − p)deg (j) ≥ 2ε
for all j (e.g. take p small and note deg (j) ≤ d) forces
ΔTV (P(A1 ∣ O1 = j), Pmask([d] \ S)(A1 ∣ O1 = j)) ≥ ε
whenever S does not cover
j, since the unmasked law is a
point mass at A1 = 1 while the masked
law has at least 2ε mass on
A1 = 0. Conversely,
if S covers j, then there exists i ∈ S with Z1, i = 1 held
fixed under masking, hence A1 = 1 deterministically
both before and after masking and the TV distance is 0. Therefore, S satisfies the invariance
constraint on all m contexts
if and only if {𝒮i}i ∈ S
is a set cover. This establishes NP-hardness of the decision problem and hence of the
optimization variant.
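The reduction can be checked mechanically using the closed-form masked law from the construction; the sketch below evaluates the per-context discrepancy and accepts exactly when S is a set cover (the function name and input encoding are our own).

```python
def invariance_ok(sets, S, m, p, eps):
    """Per-context invariance check in the reduction: a covered context
    has TV gap 0; an uncovered context j has gap (1 - p)**deg(j)
    (the closed form stated in the construction), which exceeds eps
    once p is chosen small enough."""
    for j in range(m):
        if any(j in sets[i] for i in S):
            continue                       # covered: masked law unchanged
        deg = sum(j in s for s in sets)    # number of sets containing j
        if (1.0 - p) ** deg > eps:         # uncovered: detectable gap
            return False
    return True
```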
The above reduction is approximation-preserving in the standard sense: an interaction basis of size k corresponds to a set cover of size k and vice versa, without distorting the objective. Therefore, the classical inapproximability of Set Cover transfers directly: unless P = NP, no polynomial-time algorithm can guarantee an approximation factor better than (1 − o(1))ln m in the worst case. In particular, the (1 + ln m) factor achieved by greedy selection on the constructed coverage instance is essentially tight under worst-case complexity assumptions.
A related hardness perspective arises when we view basis selection as
choosing mediators that
``hit’’ all distinguishable counterfactual discrepancies. Suppose we formulate constraints not as ``each
context must be invariant under masking’’ but as a family of pairwise
requirements: for a collection of counterfactual trajectory pairs (τ, τ′) deemed
semantically distinct, masking non-selected variables must not collapse
the induced distributions of (Ut, At)
so as to confuse the pair beyond ε. One can encode each such pair as
an element of a universe and each candidate coordinate i as a test that detects (or
preserves) that distinction; selecting a mediator set that preserves all
distinctions is then a Hitting Set instance. Since Hitting Set is equivalent to Set Cover under polynomial
reductions, this yields NP-hardness for strengthened variants in which
constraints are trajectory-level or pairwise rather than per-context.
The practical interpretation is that hardness is not an artifact of our
particular invariance definition; it persists across natural
reformulations that aim to preserve behavioral distinctions under
admissible scrubbing.
Computational hardness does not by itself preclude successful identification on structured instances; we therefore also record a statistical lower bound showing that even with unbounded computation, identifying a valid k-sparse basis among d candidates requires on the order of klog (d/k) bits of information and hence a corresponding number of interventional samples. We consider a class of one-step instances indexed by an unknown support S⋆ ⊆ [d] with |S⋆| = k. For each i ∈ [d], define a context ci whose unmasked action distribution is Bernoulli(1/2), but for which masking coordinate i (while holding all other coordinates fixed) shifts the action distribution by ±ε if and only if i ∈ S⋆; if i ∉ S⋆, masking i has zero effect. Concretely, we may implement this by letting a latent bit B ∼ Bernoulli(1/2) be part of O1, setting Z1, i = B for i ∈ S⋆ and Z1, i independent noise for i ∉ S⋆, and taking A1 = B unless masking destroys the relevant coordinate, in which case A1 is replaced by an independent fair coin. Under such a construction, the total variation effect size of masking i is exactly ε for i ∈ S⋆ and 0 otherwise, and all interventions remain admissible because masked coordinates are resampled from the same marginal family.
Any algorithm that outputs a basis Ŝ with success probability at least
1 − δ must, in particular,
distinguish among the $\binom{d}{k}$
possible supports. By Fano’s inequality, if I denotes the transcript of all adaptive
interventional queries and observations, then
$$
\mathbb{P}(\widehat S = S^\star)\ \le\ 1-\frac{H(S^\star\mid
\mathsf{I})-1}{\log \binom{d}{k}},
$$
so achieving ℙ(Ŝ = S⋆) ≥ 1 − δ
requires mutual information I(S⋆; I) ≳ klog (d/k) + log (1/δ).
On the other hand, each admissible interventional query produces an
observation whose distribution differs between two supports by at most
O(ε) in TV (and hence
by at most O(ε2) in KL for
bounded binary observations), so the information gained per query is
O(ε2). It
follows that the number of interventional samples must satisfy
$$
q\ =\
\Omega\!\left(\frac{k\log(d/k)+\log(1/\delta)}{\varepsilon^2}\right),
$$
up to universal constants, even allowing adaptivity and even when the
horizon is one. This lower bound matches, in order, the dependence
appearing in standard finite-sample analyses of effect estimation and
sparse support recovery, and it clarifies that the main bottleneck is
not merely algorithmic inefficiency but the intrinsic
indistinguishability of nearby causal instances under small tolerated TV
error.
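For reference, the order-level bound is easy to evaluate numerically; the constant `c` below stands in for the unspecified universal constant, and the function name is our own.

```python
import math

def rerun_lower_bound(d, k, delta, eps, c=1.0):
    """Order-level Fano lower bound on interventional reruns:
    q = Omega((k*log(d/k) + log(1/delta)) / eps**2); c is the
    unspecified universal constant."""
    return c * (k * math.log(d / k) + math.log(1.0 / delta)) / eps ** 2
```

For d = 1024 candidate features, k = 16, δ = 0.01, and ε = 0.1, the bound is on the order of several thousand reruns, comparable to the Hoeffding-style upper-bound budget discussed earlier.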
Taken together, these lower bounds delineate the feasible guarantees: logarithmic-factor approximation is the best possible in the worst case for polynomial-time methods, and ε−2 interventional scaling with a klog (d/k) sparsity term is unavoidable for reliable identification. The role of structure (e.g. margins, modular tool-use circuits, and restricted intervention families) is therefore to move real instances away from worst-case behavior rather than to evade the fundamental barriers.
We evaluate whether the discovered coordinate set Ŝ ⊆ [d] functions as an interaction basis in the operational sense encoded by the ε-invariance constraint, and whether the greedy construction behaves as predicted on realistic tool-using agents. Our experiments therefore separate (i) environment design (contexts/tests, trajectory distributions, and admissible interventions), (ii) identification (SAE training and effect estimation), and (iii) generalization (held-out prompts, tool-output perturbations, and long-horizon execution). Throughout, we report confidence intervals at level 1 − δ using the same sampling assumptions as in the effect-estimation step (independent reruns conditional on cached prefixes when applicable).
We instantiate a suite of tool-use benchmarks spanning increasing
compositionality. Each benchmark defines (a) a tool API
with typed arguments and structured outputs, (b) a context family {cj}j = 1m
specifying prompts, initial memory, and (optionally) environment states,
and (c) a trajectory distribution family 𝒟 induced by sampling inputs and tool
responses. We include three tiers: tasks where the correct policy calls
exactly one tool or none; tasks where tool calls must occur in a fixed
partial order (e.g. search → extract
→ compute); and tasks where
intermediate tool outputs determine which subsequent tool is needed. The
last tier is designed to exercise conditional tool selection (and hence
the joint law of (Ut, At)),
not merely final-answer accuracy. For each tier we construct both
``clean’’ contexts (short prompts, canonical formatting) and ``messy’’
contexts (distractors, irrelevant entities, and partially specified
instructions) so that the agent’s behavior is not trivially determined
by surface cues.
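To make the benchmark interface concrete, the following sketch shows one way a tier-2 context and its fixed partial order on tool calls might be encoded. All names here (`ToolSpec`, `Context`, the tool names `search`/`extract`/`compute`) are illustrative assumptions; the text does not fix an implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolSpec:
    """Typed tool signature: argument names/types and an output schema."""
    name: str
    arg_types: dict[str, type]
    output_schema: dict[str, type]

@dataclass
class Context:
    """One context c_j: prompt, initial memory, optional environment state."""
    prompt: str
    memory: list[str] = field(default_factory=list)
    env_state: Optional[dict] = None
    tier: int = 1          # 1: single tool or none, 2: fixed partial order,
                           # 3: conditional tool selection
    messy: bool = False    # distractors / partially specified instructions

# Hypothetical tier-2 tools with the fixed order search -> extract -> compute.
SEARCH = ToolSpec("search", {"query": str}, {"results": list})
EXTRACT = ToolSpec("extract", {"doc": str}, {"fields": dict})
COMPUTE = ToolSpec("compute", {"expr": str}, {"value": float})

def valid_order(calls: list[str]) -> bool:
    """Check that the observed tool-call sequence respects the partial order."""
    order = {"search": 0, "extract": 1, "compute": 2}
    idx = [order[c] for c in calls if c in order]
    return idx == sorted(idx)
```

A tier-3 benchmark would additionally make the admissible next tool a function of the previous tool's structured output, which is what exercises the joint law of (Ut, At).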
We generate counterfactuals via admissible interventions I ∈ ℐ at three loci: internal
activations, tool outputs, and memory. For internal interventions we use
resample ablation/patching at chosen write locations (e.g. residual
stream at selected blocks, tool-call head pre-softmax logits),
implemented by (i) caching a reference rollout τ under context cj, (ii)
selecting a time index t (or a
set of times), and (iii) replacing ht (or a
subspace corresponding to coordinates [d] \ S) by a draw from a
matched conditional distribution estimated from a replay buffer of
on-distribution activations. Operationally, we implement matched draws
by nearest-neighbor retrieval in activation space conditioned on a small
set of summary variables (prompt embedding bucket, tool name, and
discretized memory length), and we monitor admissibility by bounding a
divergence proxy (e.g. Mahalanobis distance under an empirical Gaussian
fit, or a classifier two-sample test) so that interventions remain
near-distribution.
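The matched-conditional resampling step can be sketched as follows, assuming cached activations are stored as NumPy arrays. The conditioning key, the neighbor pool size, and the Mahalanobis threshold are illustrative choices, not the exact settings used in our runs.

```python
import numpy as np

def matched_resample(h_ref, buffer_h, buffer_keys, key, mask_idx, rng,
                     maha_threshold=3.0):
    """Resample-ablate coordinates `mask_idx` of an activation vector h_t.

    buffer_h:    (N, d) replay buffer of on-distribution activations
    buffer_keys: (N,) discrete summary variable per buffered activation
                 (e.g. prompt-embedding bucket x tool name x memory-length bin)
    key:         summary variable of the reference rollout at time t
    mask_idx:    coordinates [d] \\ S to replace by a matched draw
    """
    # 1) Restrict the buffer to the matching conditioning cell.
    cell = buffer_h[buffer_keys == key]
    if len(cell) == 0:
        raise ValueError("no on-distribution matches for this conditioning cell")
    # 2) Nearest-neighbor retrieval on the *kept* coordinates, so the donor
    #    agrees with the reference on the unmasked subspace.
    keep = np.setdiff1d(np.arange(h_ref.shape[0]), mask_idx)
    d2 = ((cell[:, keep] - h_ref[keep]) ** 2).sum(axis=1)
    donor = cell[rng.choice(np.argsort(d2)[:10])]  # sample among 10 nearest
    # 3) Splice: masked coords from the donor, kept coords from the reference.
    h_new = h_ref.copy()
    h_new[mask_idx] = donor[mask_idx]
    # 4) Admissibility proxy: Mahalanobis distance under an empirical
    #    Gaussian fit to the buffer (ridge term for numerical stability).
    mu = buffer_h.mean(axis=0)
    cov = np.cov(buffer_h, rowvar=False) + 1e-4 * np.eye(buffer_h.shape[1])
    delta = h_new - mu
    maha = float(np.sqrt(delta @ np.linalg.solve(cov, delta)))
    return h_new, maha, maha <= maha_threshold
```

Interventions whose Mahalanobis proxy exceeds the threshold are discarded, which is how the near-distribution requirement is enforced operationally.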
For tool-output interventions, we define families of type-correct
perturbations: (a) formatting rewrites (equivalent JSON key ordering,
whitespace, and paraphrases of textual snippets), (b) representation
variants (scientific notation vs. decimal, reordered search results),
and (c) semantic perturbations within a declared tolerance (e.g. adding
irrelevant distractor documents while keeping at
least one relevant document present). For memory interventions, we apply
admissible edits such as re-serialization (equivalent formatting),
insertion of irrelevant but plausible past turns, or truncation to a
fixed window with compensating summary tokens drawn from a summarizer
distribution. These tool and memory interventions allow us to test
whether Ŝ mediates robust
tool-use decisions rather than brittle formatting dependencies.
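A minimal instance of the formatting-rewrite family (a), assuming JSON tool outputs: key reordering and whitespace changes preserve the parsed value, and value equality of the parsed objects serves as the semantic-equivalence check. (Representation variants and tolerance-based perturbations would need format- and task-specific logic beyond this sketch.)

```python
import json
import random

def perturb_tool_output(raw_json: str, rng: random.Random) -> str:
    """Type-correct, semantics-preserving rewrite of a JSON tool output:
    shuffles dict key order and toggles whitespace, leaving values intact."""
    def rewrite(x):
        if isinstance(x, dict):
            items = list(x.items())
            rng.shuffle(items)                 # equivalent key ordering
            return {k: rewrite(v) for k, v in items}
        if isinstance(x, list):
            return [rewrite(v) for v in x]     # preserve list order
        return x                               # scalars untouched

    compact = rng.random() < 0.5               # whitespace variant
    seps = (",", ":") if compact else (", ", ": ")
    return json.dumps(rewrite(json.loads(raw_json)), separators=seps)

def semantically_equal(a: str, b: str) -> bool:
    """Equality of parsed values; dict comparison ignores key order."""
    return json.loads(a) == json.loads(b)
```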
Our primary metric is the empirical invariance gap for the discovered
set Ŝ:
$$
\widehat{\mathrm{Gap}}(\hat S)\ :=\ \max_{j\in[m]}\ \max_{t\le T}\
\Delta_{\mathrm{TV}}\!\Bigl(\widehat P(U_t,A_t\mid c_j),\ \widehat
P^{\mathrm{mask}([d]\setminus \hat S)}(U_t,A_t\mid c_j)\Bigr),
$$
estimated by rerunning the agent with masking interventions on
non-selected coordinates while holding zt, Ŝ
fixed (via subspace patching in ht). We also
report a variant using a product metric over time (e.g. average TV
across t) to separate
``single catastrophic step'' failures from diffuse drift. As a complementary notion, we compute the tool-call invariance restricted to Ut and the action invariance restricted to At, since some contexts are tool-sensitive but answer-insensitive (and vice versa). For minimality, we report |Ŝ| and the greedy coverage curve (fraction of contexts covered vs. selected features), and we compare to a lower bound obtained by solving an integer program on small instances or by a dual-fitting bound from the set-cover relaxation. When we train an optional monitor MŜ(τ), we evaluate its predictive performance for downstream safety-relevant labels (e.g. ``uses external tool'', ``queries private memory'', ``executes code'') using AUROC and calibration error, but we treat this as secondary: the principal claim is invariance, not classification.
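Once the per-context, per-step empirical distributions have been tabulated, the gap estimator reduces to elementary array operations. The (m, T, K) layout below is an assumed encoding of the K discrete joint (tool-call, action) outcomes, not something prescribed above.

```python
import numpy as np

def invariance_gap(p_hat: np.ndarray, p_mask: np.ndarray,
                   aggregate: str = "max") -> float:
    """Empirical invariance gap for a candidate set S-hat.

    p_hat, p_mask: shape (m, T, K) -- per-context, per-step empirical
    distributions over K joint (tool-call, action) outcomes, estimated
    from reruns without and with masking of the non-selected coordinates.
    """
    # (m, T) table of per-step total-variation distances
    per_step = 0.5 * np.abs(p_hat - p_mask).sum(axis=-1)
    if aggregate == "max":        # Gap-hat: max over contexts j and steps t
        return float(per_step.max())
    if aggregate == "mean_t":     # diffuse-drift variant: average TV over t
        return float(per_step.mean(axis=1).max())
    raise ValueError(f"unknown aggregate: {aggregate}")
```

The tool-call and action invariances are obtained the same way after marginalizing the K outcomes onto Ut or At alone.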
We compare IB-Discover to baselines that select internal variables without the admissible interventional semantics: (i) magnitude baselines, which choose coordinates with the largest mean activation or variance; (ii) attribution baselines, which select features or heads with high gradient-based attribution to tool-call logits; (iii) random k-subsets matched in size to Ŝ; (iv) lasso feature selection, which regresses tool-call decisions on SAE coordinates with an ℓ1 penalty and selects the support; and (v) raw-component patching, which patches raw residual-stream components (or MLP neurons) chosen by their causal-effect estimates, thereby testing whether the SAE coordinate system provides a strictly better mediator basis. All baselines are evaluated with the same admissible masking protocol to avoid conflating selection with evaluation.
We ablate each design choice that can affect soundness of the constructed coverage instance. For the representation, we vary SAE dictionary size d, sparsity/regularization strength, and the choice of activation stream ht (early vs late blocks; tool-call head vs general residual). For effect estimation, we vary the number of interventional queries q, the time aggregation rule (max vs weighted sum over t), and the admissibility matching granularity (marginal resampling vs conditional resampling on prompt/tool state). For selection, we compare greedy set cover to alternatives (submodular maximization heuristics, stochastic greedy, and a simple ``top-k by effect’’ rule) to test whether coverage structure matters beyond effect magnitude. Finally, we ablate tool-output interventions: invariance may hold under internal masking but fail once tool outputs are perturbed, which indicates that the chosen mediators do not fully capture the policy’s dependence on the tool channel.
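The greedy set-cover selection rule compared above can be sketched as follows; the coverage dictionary, mapping each candidate feature to the set of (context, step) tests it mediates above threshold, is an assumed encoding of the constructed instance C.

```python
def greedy_set_cover(coverage: dict[int, set[int]], universe: set[int]):
    """Greedy surrogate for a minimal S-hat: repeatedly pick the feature
    covering the most still-uncovered tests, until all tests are covered
    or no feature makes progress (instance not coverable)."""
    uncovered = set(universe)
    selected, curve = [], []
    while uncovered:
        best = max(coverage, key=lambda f: len(coverage[f] & uncovered))
        gain = coverage[best] & uncovered
        if not gain:
            break                      # report a partial cover
        selected.append(best)
        uncovered -= gain
        curve.append(1 - len(uncovered) / len(universe))  # coverage fraction
    return selected, curve
```

The returned `curve` is the greedy coverage curve reported alongside |Ŝ|; a ``top-k by effect'' rule corresponds to ignoring `uncovered` and ranking features by |coverage[f]| alone.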
We include two classes of stress tests aimed at breaking hidden
assumptions behind 𝒟 and ℐ. The first is a multi-threaded benchmark in
which a single prompt multiplexes several independent subtasks, each
requiring a different tool pathway, and the agent must interleave tool
calls while maintaining separate scratchpad/memory threads. Concretely,
a context specifies r subtasks
with disjoint identifiers, and tool outputs are interleaved; failure
modes include tool-call leakage across threads and premature
termination. We test whether Ŝ
discovered on single-thread distributions continues to satisfy
invariance when the number of concurrent threads increases and when
distractor tool outputs are inserted.
For OOD, we evaluate on (a) prompt paraphrases and instruction
inversions, (b) distribution shifts in tool output style (newline
conventions, reordered fields), (c) longer horizons T with delayed tool-use decisions,
and (d) adversarially constructed trajectories that preserve
admissibility diagnostics (activation-distance bounds) but maximize
disagreement between masked and unmasked behavior. We report invariance
gaps separately for each OOD slice and track which additional features
are selected during refinement (Step 6) when validation fails, thereby
empirically characterizing how basis size must grow to maintain ε-invariance under increasingly
hostile but still type-correct perturbations.
The interaction-basis viewpoint reframes tool-use interpretability as a constrained mediator identification problem rather than as post hoc attribution. Once a set S satisfies the ε-invariance constraint against admissible masking of [d] \ S, we may treat zt, S as an operational bottleneck: the joint law of (Ut, At) is approximately preserved when all other internal degrees of freedom are randomized in a distribution-preserving way. This has two immediate safety implications. First, monitoring may be performed on a low-dimensional channel without requiring direct access to the full activation state. A monitor MS(τ) may be designed either as a simple statistic (e.g. thresholding a small subset of coordinates corresponding to ``calls ’’-like features) or as a learned classifier on the time series {zt, S}t ≤ T; the invariance guarantee justifies that such a monitor is not merely predictive on a fixed distribution, but is aligned with the causal interface the policy uses for tool decisions within the tested admissibility envelope. Second, editing becomes conceptually cleaner: interventions targeted to zt, S (or to the corresponding subspace in ht) can be interpreted as interventions on variables that are near-necessary for tool choice in the tested contexts. In particular, when S is small, a policy-level safety patch (e.g. suppressing a subset of coordinates associated with unsafe tool invocations) admits a causal interpretation that is typically absent from gradient-based steering on high-dimensional representations.
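The two monitor designs mentioned above can be illustrated as follows; the pooling choices and the threshold form are hypothetical, and any learned classifier would consume the pooled features.

```python
import numpy as np

def monitor_features(z_series: np.ndarray) -> np.ndarray:
    """Pool the mediator time series {z_{t,S}} of shape (T, |S|) into a
    fixed-length feature vector (per-coordinate max, mean, final value)
    for a learned monitor M_S."""
    return np.concatenate([z_series.max(axis=0),
                           z_series.mean(axis=0),
                           z_series[-1]])

def threshold_monitor(z_series: np.ndarray, coord: int, tau: float) -> bool:
    """Simple-statistic monitor M_S: flag the trajectory if the selected
    coordinate ever exceeds the threshold tau at any step."""
    return bool((z_series[:, coord] > tau).any())
```

Either monitor reads only the |S|-dimensional channel, which is the point: no access to the full activation state is required once the invariance claim holds.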
A further consequence is that stability under fine-tuning can be stated and tested at the level of the mediator representation. Theorem 3 isolates a class of parameter updates that preserve the structural equations generating Zt, S*, while permitting changes downstream. Under this causal locality condition, the interaction basis persists, and monitoring/editing rules defined on zt, S* are expected to transfer. Practically, this suggests an engineering discipline for alignment updates: if one can constrain fine-tuning procedures so that they predominantly reweight the mapping from a stable mediator interface to logits (for tool selection and action selection), then the discovered basis may remain valid across model versions. Conversely, when an update changes the upstream feature formation (e.g. by shifting attention patterns that determine whether a tool schema is recognized), we should expect basis drift; our methodology provides a diagnostic via failed invariance under masking validation.
The chief limitation is that our guarantees are only as strong as the
admissible intervention family ℐ and
the representational choice embodied by the SAE/transcoder f. An SAE provides a linear
coordinate system with sparsity; it does not in general supply a minimal
sufficient statistic for tool choice, nor does it guarantee
disentanglement of nonlinear interactions. If the true mediators are
fundamentally nonlinear (e.g. parity-like combinations, or
multiplicative gating between ``tool warranted'' and ``tool
arguments'' features), then any coordinate-wise masking test may
mischaracterize necessity/sufficiency. In such cases, a set S may pass an ε-invariance test on a finite suite
while failing in untested combinations because the relevant mediator is
a function ϕ(zt, S)
not representable as a sparse subset. One remedy is to enlarge the
candidate mediator family beyond coordinates to include low-dimensional
subspaces or learned features with explicit compositional closure
(e.g. allowing pairwise products or small neural ``readout’’ networks
whose internal units become the candidates). The cost is that the
set-cover surrogate becomes less direct: the monotonicity and
thresholding assumptions used to construct C become harder to justify when
mediator candidates overlap nonlinearly.
A second limitation concerns adversarial or strategically robust agents. Our framework presumes that admissible interventions keep the computation near-distribution and that the agent does not explicitly condition its behavior on subtle indicators of being intervened upon. If the policy can detect intervention artifacts—even those that preserve low-order activation statistics—it may route around the mediator channel being tested, causing spurious acceptance of a basis on benign distributions while retaining hidden degrees of freedom for tool misuse. This failure mode is an instance of distributional contamination: the empirical indistinguishability that motivates admissibility is not equivalent to causal indistinguishability for a sufficiently capable model. One response is to strengthen admissibility diagnostics (e.g. two-sample tests on richer features, or training a discriminator to detect intervention vs control prefixes and enforcing near-chance performance), and to include adversarially chosen I ∈ ℐ that maximize disagreement subject to the admissibility constraints. Another response is to expand 𝒟 to include deliberate red-team trajectories; the interaction basis is, by definition, distribution-relative, and safety claims must be read as ``within (𝒟, ℐ)’’ rather than as global properties.
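The discriminator-based admissibility diagnostic suggested above can be approximated with a classifier two-sample test: held-out accuracy near chance is evidence (not proof) of near-distribution interventions. The sketch below uses a least-squares linear probe rather than a trained neural discriminator, which is an assumed simplification.

```python
import numpy as np

def c2st_accuracy(x_control, x_interv, n_train_frac=0.7, seed=0):
    """Classifier two-sample test proxy: fit a linear probe to distinguish
    control vs. intervened activation prefixes; return held-out accuracy.
    Accuracy near 0.5 is consistent with admissibility."""
    rng = np.random.default_rng(seed)
    X = np.vstack([x_control, x_interv])
    y = np.r_[np.zeros(len(x_control)), np.ones(len(x_interv))]
    perm = rng.permutation(len(X))           # shuffle before splitting
    X, y = X[perm], y[perm]
    n_train = int(n_train_frac * len(X))
    Xtr = np.c_[X[:n_train], np.ones(n_train)]          # bias column
    w, *_ = np.linalg.lstsq(Xtr, y[:n_train], rcond=None)
    Xte = np.c_[X[n_train:], np.ones(len(X) - n_train)]
    pred = (Xte @ w > 0.5).astype(float)
    return float((pred == y[n_train:]).mean())
```

Enforcing near-chance accuracy for such probes (ideally with richer feature maps) operationalizes the ``near-chance discriminator'' requirement; a capable policy could in principle still condition on statistics the probe does not capture.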
The same distribution-relativity appears in long-horizon settings.
Even if maxt ΔTV(⋅, ⋅) ≤ ε
holds per step, trajectory-level divergence can accumulate with horizon
T, particularly when tool
calls influence future observations (via environment/tool outputs) and
when memory updates create feedback. For sequential composition, one may
prefer a bound on the trajectory law ΔTV(P(τ), Pmask(τ)),
or a coupling argument showing that per-step invariance implies bounded
regret in tool-use events. Achieving such statements likely requires
additional mixing or contraction assumptions on the induced interaction
dynamics, or a strengthened basis criterion that controls not only
immediate (Ut, At)
but also the distribution of memory states mt + 1 under
masking. Empirically, we expect the basis size required for ε-invariance to grow with horizon
unless the mediator coordinates encode a stable plan representation;
this motivates hierarchical variants in which one discovers bases at
multiple timescales (e.g. a
planner'' basis that governs coarse tool sequences and anexecutor’’
basis that governs argument formatting).
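The per-step-to-trajectory accumulation can be made explicit by the standard hybrid (telescoping) argument; the following is a sketch, under the assumption that per-step ε-invariance holds along prefixes drawn from the unmasked trajectory law:

$$
\Delta_{\mathrm{TV}}\bigl(P(\tau),\,P^{\mathrm{mask}}(\tau)\bigr)
\;\le\; \sum_{t=1}^{T} \mathbb{E}_{\tau_{<t}\sim P}
\Bigl[\Delta_{\mathrm{TV}}\bigl(P(\,\cdot\mid\tau_{<t}),\,P^{\mathrm{mask}}(\,\cdot\mid\tau_{<t})\bigr)\Bigr]
\;\le\; T\varepsilon.
$$

The first inequality is the telescoping step; tightening the linear Tε growth to O(ε) is precisely where mixing or contraction assumptions on the induced interaction dynamics would have to enter.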
Extensions to multimodal tools are conceptually straightforward but technically delicate. If the tool API accepts or returns images, audio, or other structured modalities, then both ht and the admissible tool-output interventions must be modality-aware. On the representation side, one may train separate SAEs on modality-specific streams (vision encoder residuals, audio features) and allow the candidate mediator set to include coordinates across modalities, yielding zt = (zttext, ztvision, ztaudio). On the intervention side, ``type-correct’’ perturbations must preserve semantics in the relevant modality (e.g. image augmentations that preserve object identity, or audio time-shifts within tolerance). The main mathematical change is that admissibility should be defined with respect to the joint distribution of multimodal activations and tool I/O; naive marginal resampling risks violating cross-modal dependencies, thereby creating artifacts detectable by the policy. Conditional resampling keyed on multimodal summaries, or generative models for on-distribution activation patching, may be required.
Finally, long-horizon memory presents a special case of tool use where the ``tool’’ is a read/write channel into a persistent state. Here it is natural to treat memory operations as part of (Ut, At) and to include the memory state in the invariance target, but this increases the dimensionality of the object being preserved. A practical compromise is to define a memory-restricted basis guaranteeing invariance of (i) whether a memory read/write occurs and (ii) a coarse descriptor of the write (topic, keys accessed), while allowing low-level formatting to vary. Formally, we replace (Ut, At) by a projection ψ(Ut, At, mt + 1) and seek invariance for ψ; the choice of ψ is a policy/safety decision. Developing principled projections and corresponding admissible interventions for memory (e.g. summary-consistent truncations) is an open direction, and it is likely necessary for deploying interaction-basis monitoring in agents whose unsafe behavior is mediated primarily through long-term state rather than immediate tool calls.