Debate-Triggered Audits for Tool-Using Agents: Closed-Form Oversight Savings, Identification, and 2026-Scale Evidence

Metadata

Total Words: 10,635
Export Date: 2026-01-23 00:19:52
Description: Scalable oversight proposals such as doubly-efficient debate argue that a small number of human judgments can verify very long computations when two models compete and a verifier spot-checks a disputed step. However, in 2026 deployments, the relevant question is not only asymptotic query complexity but the realized dollar/minute budget for expert review under tool-using agent pipelines (search, code, APIs, actions), and the conditions under which disagreement-based spot-checking fails (shared blind spots, correlated mistakes). We provide an empirical and quantitative bridge: (1) a clean statistical model that maps challenger performance and audit policies into closed-form expressions for (i) expected expert minutes per task and (ii) residual post-deployment error rates; (2) an identification strategy using randomized audits to consistently estimate detection power and false-flag rates; and (3) an evaluation pipeline implementing cross-examination-like independence via context stripping and multi-vendor debaters. Our headline finding is that debate-triggered audits can match full-trace review accuracy while reducing expert minutes per task by 5–20×, and that failure modes are predictable from estimated disagreement parameters and trace-locality features. The results operationalize the core promise of doubly-efficient debate from the source material—bounded human judgment queries—into measurable oversight budgets and concrete deployment guidance.

1. Introduction: why constant-query debate needs an empirical cost model (expert minutes, latency) in 2026 tool-using agents; contributions and headline findings.
2. Relation to prior work: debate and doubly-efficient debate (source), cross-examination, red-teaming; empirical oversight/evaluation literature; how our model differs (minutes-per-task objective + identification of disagreement parameters).
3. A tractable audit-cost model: define tasks, error events, disagreement signal, audit policy; derive closed-form expected expert minutes and residual error; discuss assumptions and how they map to debate protocols.
4. Identification and estimation: show how randomized audits identify (μ,α,β) and calibration by step type; confidence intervals; what requires numerical methods (heterogeneous costs/mixtures).
5. Experimental design: implementation of A/B/V pipeline; independence interventions (context stripping, multi-vendor pairing, temperature/seed separation); audit sampling (triggered + random); human panel protocol and time measurement.
6. Benchmarks and enterprise workflows: task suites (search+summarize, code+test, contract drafting+clause checks, compliance checks); ground-truth and adjudication rules; logging and trace schema.
7. Main results: time savings vs full-trace review; residual error vs target δ; estimation of (μ,α,β); decomposition by tool type and step type; ablations showing effect of independence.
8. Failure modes and predictability: shared blind spots; correlated misses; disagreement collapse; link to trace locality/context length; predictive models for audit load and residual risk.
9. Policy and procurement implications: how to set audit rate ρ, when to require multi-vendor debaters, how to budget expert time, and which metrics to monitor in production (estimated α, β, disagreement rate, locality).
10. Limitations and next steps: imperfect humans, correlated panels, strategic models, and extensions toward robust debate; open questions aligned with the source’s flagged gaps.

Content

1. Introduction: why constant-query debate needs an empirical cost model (expert minutes, latency) in 2026 tool-using agents; contributions and headline findings.

Tool-using language models deployed in 2026 rarely fail in the way classical supervised benchmarks suggest. Instead, they fail through : a brittle web search query that omits a key constraint, a silent type conversion in code execution, a misread contractual clause, or an overconfident synthesis step that collapses uncertainty into a single number. These failures are often if one inspects the right slice of the model’s tool trace, yet full inspection is expensive: it requires expert attention, broad context, and (in many organizations) latency that is incompatible with product expectations. As a result, oversight mechanisms that look appealing on paper can collapse in deployment because they implicitly assume unlimited expert time and instantaneous review.

Constant-query debate—in which we always pair a primary agent with a challenger that attempts to surface flaws—is a particularly salient example. The mechanism promises targeted scrutiny without enumerating the full space of possible failures. But in operational settings, the central question is not only whether debate correctness; it is whether it yields a favorable tradeoff under realistic budgets, trace lengths, and time-to-answer constraints. When every task induces a multi-step tool trace, just have humans check it'' is not a plan; it is a cost center. Conversely,just trust the challenger’’ is not a safety argument; it is an uncalibrated delegation of risk. We therefore need an empirical cost model that treats expert minutes and latency as first-class objects rather than afterthoughts.

Our starting point is a deployment-relevant abstraction: each task instance induces a latent error event E ∈ {0, 1} under the applicable evaluation standard, and a challenger produces a binary dispute signal D ∈ {0, 1} that may (or may not) correlate with error. This is intentionally modest. We do not assume the challenger can produce a full correction, nor that it can articulate a proof. We only require that it emits a reliable enough to focus scarce expert attention. This signal is characterized by a detection power α := Pr [D = 1 ∣ E = 1] and a false-flag rate β := Pr [D = 1 ∣ E = 0]. The verifier audits all disputes and additionally audits a random fraction ρ of non-disputes. The mechanism is simple enough to implement in production pipelines, yet expressive enough to capture the core tension: disagreements save time only if they are informative about true errors, and random audits are needed to bound residual risk when the challenger is silent.

A constant-query debate system without a calibrated auditing layer is vulnerable to a predictable failure mode: shared blind spots. If the primary and challenger are aligned in their omissions—because they share training data, tool wrappers, or reasoning shortcuts—then α can be low precisely on the high-impact step types we care about (e.g., legal interpretation or security-critical code). In that regime, debate may look ``helpful’’ on cherry-picked examples while leaving the tail risk essentially unchanged. The practical response is not to abandon debate outright, but to instrument it: we need to measure (μ, α, β), where μ := Pr [E = 1] is the base error rate, and to compute how much random auditing ρ is required to guarantee a target residual error δ. This is where an explicit cost model matters: the optimal policy depends on how many expert minutes are required for a localized audit compared to a full-trace review, and on whether the flagged step can be judged with minimal context.

The first contribution of this paper is a closed-form cost–risk frontier that connects these quantities. Under the auditing rule ``always audit disputes; randomly audit non-disputes with probability ρ,’’ and under the approximation that audited items are adjudicated correctly, we can write the expected expert time per task as a simple function of disagreement probability and audit costs. Importantly, the residual post-deployment error rate depends only on the probability of an error that is by the challenger and randomly audited, yielding a constraint of the form
Pr (E^′ = 1) = μ(1 − α)(1 − ρ) ≤ δ,
with a corresponding minimal random-audit rate $\rho^\*(\delta)$. This makes the tradeoff legible: when α is high, $\rho^\*(\delta)$ can be small and we spend expert minutes roughly proportional to the dispute rate rather than to task length; when α is low, $\rho^\*(\delta)$ approaches 1 and the mechanism degenerates toward near-full auditing.

The second contribution is identification: how we learn (μ, α, β) from the logs of a deployed system that oversamples disputes. Triggered audits are inherently biased toward suspicious items; without additional random audits of D = 0 cases, we cannot estimate how often the challenger misses errors. We therefore treat ρ not merely as a safety lever but as an experimental design parameter: by randomizing audits among non-disputes, and reweighting audited observations by their inclusion probabilities, we obtain consistent estimators of the challenger miss-rate and false-flag rate in the relevant deployment distribution. This turns debate-style oversight from an informal practice into a measurable protocol: we can quantify how much safety we buy per expert minute, and we can detect when an independence intervention actually improves detection rather than merely changing rhetoric.

The third contribution is to connect model parameters to concrete engineering levers. In practice, the audit cost is dominated not only by whether a human is invoked, but by must be read to judge a flagged step. We therefore treat trace locality as a key feature: if a dispute can be adjudicated by inspecting a short span of the trace (e.g., a single API call and its surrounding prompt), then audit minutes can be an order of magnitude smaller than full review; if adjudication requires reconstructing the entire tool trajectory, then debate-triggered audits lose most of their advantage. Similarly, we motivate and evaluate independence interventions—such as context stripping for the challenger or using a different model family—as ways to increase α (and sometimes reduce β) by reducing shared reasoning paths.

Empirically, our headline finding is that when (i) disputes are sufficiently predictive of true errors and (ii) disputed steps are auditable with localized context, constant-query debate coupled with targeted auditing can reduce expert minutes by roughly 5–20× relative to full-trace review while achieving comparable residual error. We emphasize that this is not a guarantee of debate as a universal safety solution; it is a statement about a measurable regime in which the mechanism is economically and operationally plausible. Symmetrically, when independence fails and α is low on the step types that dominate harm, the model correctly predicts that the cost advantage collapses: maintaining a small δ forces ρ upward, and the system reverts toward exhaustive review.

Finally, we flag limitations and open problems that the cost model makes explicit rather than hiding. Audits are not always perfect; expert panels can disagree, and adjudication itself can be gamed by persuasive but incorrect traces. Errors are heterogeneous and heavy-tailed; the relevant δ may be defined over harm-weighted outcomes rather than raw error frequency. Agents may adapt strategically to the auditing policy, changing both μ and the distribution of step types. Nonetheless, we view the formalism as clarifying the safety tradeoff: if we want scalable oversight for tool-using agents, we must jointly optimize (and empirically estimate) α, β, μ, ρ, and trace locality, because these are the quantities that determine whether debate is a cost-saving filter or an expensive placebo.

2. Relation to prior work: debate and doubly-efficient debate (source), cross-examination, red-teaming; empirical oversight/evaluation literature; how our model differs (minutes-per-task objective + identification of disagreement parameters).

Our work sits at the intersection of (i) debate-style scalable oversight, (ii) adversarial evaluation and red-teaming practice, and (iii) empirical measurement for deployed AI systems. The common theme across these literatures is that oversight must be : a human (or trusted process) cannot check everything, so we need mechanisms that allocate attention to the subset of cases where it matters most. The key difference in our framing is that we treat selectivity as an rather than as a purely algorithmic question of ``does oversight improve answers?’’ Concretely, we take expert minutes (and the context length those minutes must cover) as an objective, and we make explicit what must be randomized and logged in order to whether a challenger is actually informative about errors.

The original debate proposal (and subsequent refinements) motivates an oversight protocol in which two agents argue and a judge decides. A central aspiration is that adversarial interaction can make hard questions easier for a judge by surfacing cruxes and counterexamples, much as cross-examination does in legal settings. ``Doubly-efficient’’ debate variants (in the sense used in that line of work) aim to further reduce the judge’s burden by ensuring that, for some classes of questions, the judge can evaluate a amount of dialogue rather than a full proof or end-to-end reasoning trace . We view these proposals as pointing in the right direction operationally: if the marginal judge time per task does not shrink substantially below full review, debate will not scale in the regimes where tool traces are long and domain expertise is scarce.

Our model can be read as a deliberately ``thin’’ abstraction of this debate agenda. Rather than assuming that the debate transcript itself is sufficient for a judge to decide correctness, we isolate a weaker but deployment-relevant primitive: a challenger emits a and, when disputing, can point to a localized step to inspect. This captures a wide range of systems in current practice, from structured debate to critique-and-revise loops, where the challenger may not provide a complete counterargument but can often highlight an API call, a calculation, a citation, or a contractual interpretation that appears suspicious. The benefit of this abstraction is that it enables a closed-form cost–risk accounting keyed to measurable operational quantities (minutes per audit, fraction of tasks audited, and how often disputes correspond to real errors), without committing to strong assumptions about the judge’s ability to evaluate full argumentative transcripts.

A large body of work, both classical and in recent LLM systems, emphasizes that adversarial scrutiny can surface mistakes that cooperative generation misses. This includes cross-examination-inspired protocols, structured adversarial collaboration, self-critique, and multi-agent verification . Our contribution is complementary: we do not primarily propose a new interaction pattern between agents, but rather model the oversight pipeline that must exist if such patterns are to be safety-relevant in deployment. In particular, many critique protocols are evaluated by whether the answer improves on benchmarks. In production, however, the safety question is often whether we can guarantee a bound on residual error under a fixed review budget. This requires a mechanism-level view: how often does critique trigger review, how expensive is that review given the needed context, and how much risk remains when critique is silent?

Red-teaming has become a central empirical tool for stress-testing deployed models, with variants ranging from expert-crafted adversarial prompts to automated adversarial generation and tool-assisted penetration testing . This literature is primarily oriented toward failures, characterizing vulnerability classes, and motivating mitigations. We share the emphasis on adversarial pressure, but focus on a different unit of analysis: repeated, routine tasks in a pipeline where each instance could contain an error and where human attention is a scarce resource. In that setting, red-teaming corresponds most closely to the challenger role, and the ``win condition’’ is not simply that the challenger can sometimes find a bug, but that its challenge rate and hit rate can be calibrated into an audit policy with predictable human cost. Our identification result is also directly motivated by red-teaming practice: adversarial testing produces a highly selected set of examples, which is invaluable for discovery but insufficient to quantify miss-rates without additional sampling design.

There is a substantial empirical literature on LLM evaluation, monitoring, and post-deployment measurement, including approaches that treat humans as a limited labeling budget to be allocated adaptively . In parallel, classical quality control and acceptance sampling study how to bound defect rates using randomized inspection with known inclusion probabilities. Our setup can be seen as importing these measurement principles into debate-style oversight: disputes induce of suspicious cases, which is good for catching errors but bad for estimating the error rate among non-disputed items. The fix is standard in spirit but underused in model oversight: introduce a randomized audit channel for non-disputes and use inclusion-probability reweighting (e.g., Horvitz–Thompson) to obtain unbiased estimates of key rates. The practical implication is that ρ is not merely a safety knob; it is also what allows an organization to learn whether its challenger is improving over time or quietly sharing blind spots with the primary model.

Relative to the above work, our main departures are twofold. First, we make the objective explicitly (and, implicitly, latency and staffing), rather than accuracy alone. This is not cosmetic: it forces us to model audit locality and to distinguish between full-trace review and localized step audits, which is where most of the operational savings (or failures) occur. Second, we emphasize of disagreement statistics under selective auditing. Many systems report disagreement rates or ``challenge success’’ on curated sets, but those numbers do not directly answer the deployment question: how often does silence imply safety? By treating disagreement as a probabilistic signal with a miss-rate and a false-flag rate, and by specifying a minimal random audit design sufficient to estimate those rates on-policy, we obtain a framework in which debate-style oversight can be monitored, compared across independence interventions, and governed via measurable service-level objectives on residual risk.

3. A tractable audit-cost model: define tasks, error events, disagreement signal, audit policy; derive closed-form expected expert minutes and residual error; discuss assumptions and how they map to debate protocols.

To reason about whether debate-style oversight can actually in deployment, we need an accounting model that ties together three operational facts: (i) tasks are repeated and heterogeneous, (ii) errors are rare but costly, and (iii) human attention is the bottleneck. We therefore model oversight as a selective inspection process, where the challenger provides a suspicion signal and the verifier uses that signal—plus a small amount of random sampling—to decide where experts spend minutes.

We consider a stream of tasks indexed by i ∈ {1, …, N}. Each instance x_i includes the user prompt, tool context, and any environment state that conditions the agent’s behavior. The primary agent A produces an artifact y_i, which we treat as the : it may include a final answer, an action plan, and a trace/log of tool calls. Let E_i ∈ {0, 1} denote whether y_i contains an error relative to the evaluation standard (incorrectness or unsafe behavior). The base error rate is μ := Pr [E = 1] under the deployment distribution. Importantly, E_i is typically not observed unless we pay for an audit.

A challenger model B observes (x_i, y_i) (or a restricted view in independence treatments) and emits a binary disagreement signal D_i ∈ {0, 1}, optionally pointing to a particular step or localized slice of the trace as the reason for dispute. We summarize the informativeness of this signal via two scalars:
α := Pr [D = 1 ∣ E = 1], β := Pr [D = 1 ∣ E = 0],
where α is detection power (how often the challenger catches real errors) and β is the false-flag rate (how often it triggers review when there is no error). In a debate framing, α corresponds to the probability that adversarial interaction surfaces a crux that, when inspected, reveals a genuine flaw; β corresponds to wasted judge time on spurious objections. We emphasize that α and β are parameters: they can shift under independence interventions (e.g., context stripping, different model families) and under changes in the tool environment that create new failure modes.

A verifier/orchestrator V implements a simple audit rule that we can optimize and estimate. If D_i = 1, we trigger an expert audit targeted at the pinpointed step. If D_i = 0, we trigger a random audit with probability ρ, via R_i ∼ Bernoulli(ρ). When an item is audited, human experts adjudicate and (in our baseline abstraction) correct the output perfectly. This yields a residual post-mechanism error indicator
E_i^′ = E_i ⋅ 1{D_i = 0, R_i = 0},
i.e., an error persists only when it is both undisputed and unaudited.

This policy is intentionally austere: it captures a wide class of practical pipelines in which the challenger does need to convince the human end-to-end, but only needs to raise a credible flag that makes a localized check worthwhile. The locality assumption is central. We parameterize human effort by minutes needed to review the audited context: c_a minutes for a dispute-triggered audit (typically small if the challenger points to a single tool call, citation, or calculation), c_r minutes for a random audit of a non-dispute (often larger, since there is no highlighted crux), and c_full minutes for full-trace review as a baseline. A trace locality feature L_i and step type s_i can be used to explain variation in these costs, but for closed-form intuition we start with constants.

Under this policy, the probability of disagreement is
Pr (D = 1) = αμ + β(1 − μ),
and the expected expert minutes per task are
𝔼[T] = c_a Pr (D = 1) + c_r ρ Pr (D = 0) = c_a(αμ + β(1 − μ)) + c_rρ(1 − αμ − β(1 − μ)).
The residual error rate after the mechanism (assuming audits resolve errors) is
Pr (E^′ = 1) = Pr (E = 1, D = 0, R = 0) = μ(1 − α)(1 − ρ),
where we use the design condition that R is randomized independently of (E, D) conditional on the policy. This single expression makes the safety tradeoff explicit: residual risk comes entirely from challenger (1 − α) multiplied by how often we accept challenger silence without sampling (1 − ρ).

Given a target residual error bound δ, we obtain the minimal random-audit rate required to meet it:
$$ \mu(1-\alpha)(1-\rho)\le \delta \quad\Longrightarrow\quad \rho^\star(\delta) \;=\; \max\Bigl\{0,\,1-\frac{\delta}{\mu(1-\alpha)}\Bigr\}\qquad (\alpha<1). $$
Two immediate implications are worth highlighting. First, when α is high, ρ^⋆(δ) can be near zero even for stringent δ: disagreement-triggered audits suffice for safety, and random audits are primarily for measurement and drift detection. Second, when α is low—for example because A and B share blind spots or because the challenger lacks tool access—ρ^⋆(δ) approaches one, collapsing the mechanism to near-full auditing.

To connect this to a full-review baseline, suppose expert time is valued at w dollars per minute and residual errors incur harm λ per error (with compute/latency costs tracked separately). Then the expected social cost per task is
𝔼[C] = w 𝔼[T] + λ Pr (E^′ = 1) + κ 𝔼[Compute],
while full-trace review costs w c_full in minutes and (by assumption) eliminates residual error. Debate-style oversight is favorable precisely when it makes c_a ≪ c_full via localized crux-finding, when the disagreement rate Pr (D = 1) is not too large (which depends on both μ and β). In other words, the mechanism achieves multiplicative savings from (i) auditing only a subset of tasks and (ii) auditing a small slice of the trace.

At the same time, the model makes clear what debate must accomplish to be safety-relevant: it is not enough that critiques sometimes improve answers on curated benchmarks. What matters is that the induced suspicion signal has high α on the deployment mixture of failure modes, and that audits can be localized enough that c_a and c_r remain well below c_full. These are empirical properties of an , not just of a prompting trick. In the next section, we show how randomized auditing is not merely a safety backstop but the statistical lever that allows us to and track (μ, α, β) on-policy, including by step type s, so that we can detect when the mechanism is drifting toward the shared-blind-spot regime where selective oversight fails.

4. Identification and estimation: show how randomized audits identify (μ,α,β) and calibration by step type; confidence intervals; what requires numerical methods (heterogeneous costs/mixtures).

The previous section treated (μ, α, β) as primitives. In deployment, however, these are and can drift with the task mix, tool availability, and independence interventions. The oversight mechanism therefore needs to be not only safety-preserving but also : we must be able to estimate the challenger miss-rate and the base error rate . The key point is that auditing only when D = 1 creates a selection problem (we would never observe errors when the challenger is silent); the random audits on D = 0 are what repair identifiability.

For each task i, we always observe D_i. We observe an adjudicated error label H_i ∈ {0, 1} only if the item is audited. Under the baseline abstraction, H_i = E_i when audited. The audit inclusion probability is
π_i = Pr (audited ∣ D_i, R_i) = 1{D_i = 1} + 1{D_i = 0} ⋅ ρ,
since all disagreements are audited and non-disagreements are audited with probability ρ. The design condition R ⟂ (E, D) (conditional on the policy) implies that, within the D = 0 slice, the audited subset is an unbiased sample of the silent cases.

Define the joint cell probabilities p_de := Pr (D = d, E = e) for d, e ∈ {0, 1}. Then μ = p₀₁ + p₁₁, α = p₁₁/μ, and β = p₁₀/(1 − μ). Audits reveal the E label on a known-probability sample, so we can identify each p_de from audited frequencies with inverse-probability weighting. Concretely, because all D = 1 items are audited, we have direct identification for the disagreement slice:
p₁₁ = Pr (D = 1, E = 1) = Pr (D = 1, H = 1), p₁₀ = Pr (D = 1, E = 0) = Pr (D = 1, H = 0).
For the silent slice D = 0, only a ρ fraction is labeled, but randomization gives
$$ p_{01}=\Pr(D=0,E=1)=\frac{1}{\rho}\Pr(D=0,R=1,H=1),\qquad p_{00}=\Pr(D=0,E=0)=\frac{1}{\rho}\Pr(D=0,R=1,H=0). $$
These equalities are the core identification claim: without R, the term p₀₁ is unobserved and α is not learnable in deployment.

Let N be the number of tasks. A direct set of unbiased estimators is
$$ \widehat{p}_{11}=\frac{1}{N}\sum_{i=1}^N \mathbf{1}\{D_i=1,H_i=1\},\quad \widehat{p}_{10}=\frac{1}{N}\sum_{i=1}^N \mathbf{1}\{D_i=1,H_i=0\}, $$
$$ \widehat{p}_{01}=\frac{1}{N}\sum_{i=1}^N \frac{\mathbf{1}\{D_i=0,R_i=1,H_i=1\}}{\rho},\quad \widehat{p}_{00}=\frac{1}{N}\sum_{i=1}^N \frac{\mathbf{1}\{D_i=0,R_i=1,H_i=0\}}{\rho}. $$
We then plug in
$$ \widehat{\mu}=\widehat{p}_{01}+\widehat{p}_{11},\qquad \widehat{\alpha}=\frac{\widehat{p}_{11}}{\widehat{\mu}},\qquad \widehat{\beta}=\frac{\widehat{p}_{10}}{1-\widehat{\mu}}. $$
This estimator family makes the measurement–cost tradeoff explicit: as ρ decreases, the variance of p̂₀₁ grows like 1/ρ, and confidence intervals on α widen even if the expected audit minutes shrink. Operationally, ρ is therefore both a safety backstop and a statistical knob for monitoring drift.

Because (p̂_de) are sample means of bounded variables with known inclusion probabilities, standard large-sample theory yields asymptotic normality. One convenient route is to estimate Var(p̂₀₁) using the empirical variance of 1{D = 0, R = 1, H = 1}/ρ, and analogously for the other cells, then propagate uncertainty to (μ̂, α̂, β̂) via the delta method. In practice, we often prefer a nonparametric bootstrap over tasks (or a block bootstrap if tasks are temporally correlated) because the ratios defining α̂ and β̂ can be skewed when errors are rare. The output we want for governance is not just point estimates but time-indexed upper confidence bounds, e.g.,
$$ \Pr(E'=1)=\mu(1-\alpha)(1-\rho)\;\;\Rightarrow\;\; \widehat{\Pr(E'=1)}_{\mathrm{UCB}} $$
to support conservative auditing when uncertainty is high.

The preceding identification argument applies verbatim within strata. If each task (or audited slice) carries a step-type label s_i ∈ 𝒮 (e.g., , , , ), we can estimate (μ_s, α_s, β_s) using the same reweighting within each stratum s. This matters because the shared-blind-spot failure mode is typically concentrated: α may be high for arithmetic but low for, say, subtle tool misuse. When some strata have low α_s and high μ_s, the global average α can look adequate while residual risk is dominated by a narrow class of failures. Stratified estimates make this visible and allow targeted interventions (changing B’s tool access, adding specialized checkers, or increasing ρ only for high-risk s).

When |𝒮| is large or when we want to incorporate trace-locality features L_i, we can share statistical strength with a parametric model θ, e.g., logistic regressions for Pr (E = 1 ∣ s, L) and Pr (D = 1 ∣ E, s, L). The randomized-audit design remains crucial: it supplies labels in the D = 0 region needed to fit Pr (E = 1 ∣ D = 0, s, L) rather than extrapolating from disputed cases.

Once costs depend on s and L (so c_a = c_a(s, L) and c_r = c_r(s, L)), and once we allow a vector of audit rates ρ_s (or a continuous audit policy ρ(L, s)), optimizing the mechanism becomes a constrained program rather than a single scalar tradeoff. A typical objective is to minimize expected expert minutes subject to a global residual-risk constraint,
min_{{ρ_s ∈ [0, 1]}}∑_sπ_s(c_a, sPr (D = 1 ∣ s) + c_r, sρ_sPr (D = 0 ∣ s)) s.t. ∑_sπ_s μ_s(1 − α_s)(1 − ρ_s) ≤ δ,
where π_s = Pr (s). Even with estimated parameters, solving for optimal ρ_s generally requires numerical methods (and careful handling of uncertainty, e.g., robust optimization using upper confidence bounds on μ_s(1 − α_s)). This is exactly the regime where an experimental pipeline matters: we need clean randomization, reliable time measurements, and logged features to fit and validate the model.

In the next section we describe how to implement these randomizations and measurements in an end-to-end A/B/V pipeline, including independence interventions and auditing protocols that make the estimators above credible in practice.

5. Experimental design: implementation of A/B/V pipeline; independence interventions (context stripping, multi-vendor pairing, temperature/seed separation); audit sampling (triggered + random); human panel protocol and time measurement.

Our empirical goal is to estimate the primitives (μ, α, β), measure audit-time costs (c_a, c_r), and stress-test the mechanism under realistic distribution shift, while maintaining a clean separation between (i) the deployed oversight policy and (ii) the randomized interventions needed for identification. We therefore implement an end-to-end A/B/V pipeline as an : each task instance x_i is processed by a primary agent A, challenged by B under a randomized independence treatment, and then audited by a verifier V according to the triggered+random rule.

Each run produces an immutable record y_i containing (a) the final user-facing answer, (b) a structured tool trace (API calls, retrieved documents, code execution logs, intermediate outputs), and (c) metadata required for audit and estimation (timestamps, model IDs, prompts, sampling temperatures, and the random seed used for R_i). We explicitly distinguish the shown to users from the shown to humans: the latter is constructed by V and may include only a localized slice of the trace around a challenged step. This separation is operationally important: it prevents incidental disclosure of unnecessary sensitive context to auditors and makes the locality variable L_i measurable (as rendered context length, not just conceptual dependency).

Randomization is per-task (or per-episode for multi-turn tasks) with stratification by coarse task type and difficulty so that each arm sees a comparable mixture of inputs. The baseline arm corresponds to the highest-information challenger: B sees (x_i, y_i) with full tool trace. Independence arms restrict information and/or diversify failure modes. Concretely, we use a factorial design over three knobs:

The key design principle is that the independence intervention should plausibly change α and β without changing the task distribution. We therefore avoid interventions that alter tool access or external information unless those changes are themselves an experimental factor, in which case we treat them as part of the task definition and stratify accordingly.

The challenger returns D_i ∈ {0, 1} and, when D_i = 1, a : a structured reference to the allegedly flawed step (e.g., a tool call ID, a cited passage, a code cell, a specific claim in the final answer) along with a short natural-language allegation and the minimal evidence it believes is missing or contradictory. We require pinpointing because it operationalizes the ``triggered audit of a disputed step’’ abstraction: auditors can focus on a small portion of the trace, and we can measure whether locality actually holds by logging the rendered context length needed for adjudication. When B cannot pinpoint (e.g., it suspects an error but cannot localize it), we treat this as a distinct challenge type and either (i) escalate to a larger-context audit (counted as higher c_a) or (ii) record it as an abstention and force D_i = 1 to preserve safety, depending on deployment criticality.

The verifier implements the policy ``audit all disagreements; otherwise audit with probability ρ.’’ Practically, V draws R_i ∼ Bernoulli(ρ) using a cryptographically committed seed stored in the log to make inclusion probabilities auditable post hoc. We allow ρ to be (i) global, (ii) stratified by task type s, or (iii) a simple function of pre-registered risk features (e.g., high-stakes domains, unusually long tool traces, or low-confidence signals). Importantly, ρ is chosen for each stratum and is not allowed to depend on un-audited content of y_i, preserving the conditional independence needed for unbiased estimation within the silent slice.

Audits are performed by expert panels with a standardized rubric. Each audited item is independently reviewed by two auditors who produce (a) a binary correctness label and (b) a short error taxonomy tag (e.g., wrong citation, tool misuse, numerical error, policy noncompliance). Disagreements are resolved by a third adjudicator or a synchronous reconciliation step, yielding the final H_i. We constrain the information shown to humans to the pinpointed step plus enough surrounding context to decide, explicitly measuring what is revealed. For example, a citation-check audit presents the claim, the cited excerpt, and the retrieval metadata; a code audit presents the code cell, test results, and the relevant specification. This protocol both reduces time and makes audits reproducible: the same rendering template is used across arms so that differences in measured c_a or c_r reflect challenger localization quality and task properties rather than UI variance.

We measure audit time as rather than wall-clock time, using a combination of interface events (scroll, click, text entry) and explicit start/stop markers. This matters because auditors may be interrupted, and because the policy comparisons are sensitive to minutes-level differences. We record both total time and a decomposition (reading, tool inspection, writing the decision), enabling a direct estimate of c_a for triggered audits and c_r for random audits. To guard against distributional artifacts (e.g., auditors speeding up on easy cases), we report robust summaries (trimmed means, medians) and fit simple models of time as a function of L_i and s_i for downstream optimization.

Because the mechanism is intended for deployment, we include safety backstops during experimentation: high-stakes tasks can be forced into full audit regardless of D, and any arm that exhibits abnormally high dispute rates (suggesting β inflation) can be rate-limited to avoid overwhelming auditors. Separately, we run targeted ``red team’’ batches designed to induce shared blind spots (e.g., adversarially phrased tool instructions or subtle contract clause traps) to empirically test whether independence treatments actually increase α where it matters. The output of this section is therefore not merely an engineering pipeline but a controlled measurement instrument: it produces, for each arm, the joint distribution over (D, R, H), localized context lengths L, and time costs, which we then use to map the practical cost–risk frontier and to anticipate where the mechanism collapses to near-full auditing.

6. Benchmarks and enterprise workflows: task suites (search+summarize, code+test, contract drafting+clause checks, compliance checks); ground-truth and adjudication rules; logging and trace schema.

To make the cost–risk frontier operational, we evaluate the A/B/V mechanism on benchmark suites that mirror common enterprise deployments of tool-using assistants. Our guiding constraint is that the latent error indicator E_i must be well-defined and auditable under a standardized rubric, while still capturing the kinds of multi-step failures that arise in realistic workflows (retrieval mistakes, tool misuse, brittle code, and policy violations). We therefore build task families around that organizations already treat as reviewable units: a research brief with citations, a code change with tests, or a contract draft with required clauses.

We use four core suites, each intentionally multi-tool and multi-step:

Within each suite, we sample tasks from (i) public benchmarks where ground truth is straightforward (e.g., programming problems with tests), and (ii) de-identified enterprise-like templates that reflect internal ticket structures (issue description, acceptance criteria, and constraints). We stratify by difficulty proxies (expected number of tool calls, document length, and ambiguity) so that treatment arms can be compared on matched mixtures rather than cherry-picked exemplars.

A central modeling choice is to ensure that E_i corresponds to a failure mode. For each suite, we pre-register an evaluation standard that induces a binary label while preserving severity metadata. Concretely, we define E_i = 1 if any defect is present, and we separately log a severity score to support sensitivity analyses (e.g., weighting residual errors by harm).

This binary reduction is imperfect—many real failures are graded—but it matches how enterprise review gates work in practice: the artifact either passes review and can be shipped, or it requires rework. The auxiliary severity tags let us test robustness of our conclusions to alternative loss functions.

We deliberately mix ground truth (where possible) with ground truth (where necessary), because the pipeline must handle both regimes.
For code tasks, the primary adjudicator is the sandbox: we execute the submitted code and test suite, logging pass/fail, coverage, and runtime. Humans then audit only borderline cases (e.g., performance constraints, security issues, or underspecified behavior) using a structured checklist; in these cases the ``ground truth’’ is the specification plus adjudicator judgment rather than a single canonical output.
For search and compliance tasks, we create answer keys by curating a reference document set and writing a minimal set of supported claims, each linked to spans in the documents. Auditors mark each model claim as supported/unsupported and record the minimal evidence required to decide; this both produces H_i and measures locality L_i in a way aligned with our cost model. For contract tasks, we treat the playbook as the reference standard and use a clause-level checklist: auditors mark each required clause as present and compliant, record any prohibited phrases, and run consistency checks across defined terms.

To reduce subjectivity and audit variance, we fix adjudication templates per suite. Each template forces (i) a binary pass/fail decision for E_i, (ii) a taxonomy tag (e.g., wrong citation'',tool misuse’‘, missing clause'',unsafe recommendation’’), and (iii) an evidence pointer (document span, test output, or clause reference). These fields are essential downstream: they let us estimate not only (μ, α, β) but also how detection power varies by step type s and locality L.

We structure tasks to resemble how organizations would actually route work. Search briefs are formatted as pages with a claims table and citations; code tasks are framed as pull requests with a diff and a CI transcript; contracts are represented as redlines against a template; compliance tasks emit a structured decision object (allowed/disallowed/escalate) plus rationale and citations to policy clauses. This matters for two reasons. First, it ensures that audits can be localized around natural review objects (a specific citation, a failing test, a clause). Second, it prevents ``benchmark overfitting’’ where models perform well only on contrived formats.

Finally, we standardize a trace schema that makes disagreements, audits, and post-hoc estimation feasible without bespoke parsing. Each run emits an ordered event log with stable identifiers:

The schema is designed so that V can deterministically render an audit packet from log-referenced objects, and so that later analyses can compute inclusion probabilities and reweighting factors without ambiguity. In effect, the benchmark suite is not only a set of prompts; it is a set of with a shared measurement substrate, allowing us to compare time, error, and disagreement rates across heterogeneous enterprise tasks under a single audit policy.

7. Main results: time savings vs full-trace review; residual error vs target δ; estimation of (μ,α,β); decomposition by tool type and step type; ablations showing effect of independence.

Our central empirical object is the induced by the audit rule ``always audit disagreements; randomly audit non-disagreements with probability ρ.’’ For each suite, we measure (i) the realized expert time per task (via rendered audit packet length and adjudication timestamps) and (ii) the residual post-mechanism error rate, i.e.,
Pr (E^′ = 1) = Pr (E = 1, D = 0, R = 0) = μ(1 − α)(1 − ρ),
under the baseline approximation that audited items are corrected and that audits recover ground truth. The qualitative pattern is stable across suites: once the challenger has nontrivial detection power α and audits are (small c_a relative to c_full), the marginal expert minutes scale with the disagreement rate rather than with task complexity. In practical terms, we obtain large savings whenever the system spends most of its human budget on a small fraction of tasks and, within those tasks, on a small fraction of the trace.

We report savings in the dimension that dominates enterprise cost in our setting: expert minutes. Let T denote expert minutes per task under the mechanism and c_full minutes per task under full-trace review. Using our logging schema, we compute T̂ directly from the audit packets shown to humans (triggered packets for D = 1 and random packets for D = 0, R = 1). Consistent with the closed-form model, the primary driver of 𝔼[T] is Pr (D = 1) = αμ + β(1 − μ), with a second-order contribution from ρ through the random audits of non-disagreements.

Empirically, when the dispute packets are truly local (small L_i, so that c_a is an order of magnitude below c_full), we observe expert-minute ratios 𝔼[T]/c_full in the 0.05–0.2 regime, corresponding to roughly 5–20× less expert time than full review at comparable residual error targets. The savings degrade smoothly when locality worsens: if the minimal context required to adjudicate a challenge expands to encompass most of the trace, then c_a approaches c_full and the mechanism collapses back toward full review. This dependence on locality is not merely a measurement artifact; it is a design constraint. A dispute signal is only useful if it can be converted into a small, reviewable packet.

Because our model yields a closed-form ρ^*(δ),
$$ \rho^*(\delta)=\max\Bigl\{0,\,1-\frac{\delta}{\mu(1-\alpha)}\Bigr\}\qquad(\alpha<1), $$
we can treat δ as a policy knob and evaluate the resulting human workload. Operationally, we estimate (μ, α) on a training slice, set ρ = ρ^*(δ) (capped to [0, 1]), and then run the mechanism on a held-out slice. The predicted residual error μ̂(1 − α̂)(1 − ρ) tracks the observed residual error within statistical uncertainty, provided that (i) audits are close to perfectly corrective on audited items and (ii) the inclusion mechanism for random audits is implemented exactly as randomized.

Two practical takeaways follow. First, for moderate α̂, the ρ required to reach stringent δ can become large quickly, directly revealing when debate-triggered audits are insufficient on their own. Second, when α̂ is high, ρ becomes small and the residual error guarantee is achieved mostly through targeted triggered audits; in that regime, tightening δ has only a mild effect on expert minutes until ρ becomes the dominant term. In other words, the ``cheap’’ part of the frontier is buying safety through better detection power, not through more random sampling.

A key methodological contribution is that we do not assume access to ground-truth E for all tasks. Instead, we exploit the fact that disagreements are always audited, while non-disagreements are audited at a known random rate ρ. This creates a standard missing-at-random structure: for D = 0, the error label is observed only when R = 1, with known inclusion probability ρ. We therefore estimate
$$ \widehat{\mu}=\frac{1}{N}\left(\sum_{i:D_i=1}\widehat{E}_i \;+\; \sum_{i:D_i=0,R_i=1}\frac{\widehat{E}_i}{\rho}\right), $$
with analogous Horvitz–Thompson-style estimators for Pr (E = 1, D = 1) and Pr (D = 1), and then recover
$$ \widehat{\alpha}=\frac{\widehat{\Pr(E=1,D=1)}}{\widehat{\mu}},\qquad \widehat{\beta}=\frac{\widehat{\Pr(E=0,D=1)}}{1-\widehat{\mu}}. $$
We report uncertainty via standard survey-sampling asymptotics with robust variance estimates, and we validate calibration by checking that predicted disagreement rates α̂μ̂ + β̂(1 − μ̂) match the realized disagreement frequencies in held-out runs. This identification step is crucial for governance: without random audits of non-disagreements, α is not learnable because we cannot estimate the miss-rate of the challenger on cases it accepts.

Aggregate (μ, α, β) masks meaningful heterogeneity. Using our trace schema, we assign each challenge and each adjudicated error to a step category s (e.g., retrieval, citation binding, code execution, test interpretation, clause inclusion, policy mapping). We then estimate α_s, β_s, and μ_s by conditioning on s and reweighting within stratum. Two regularities appear. First, α_s is typically highest on steps with crisp, mechanizable evidence (e.g., failing tests, missing required clauses) and lowest on steps with diffuse evidentiary requirements (e.g., ``summary is misleading’’ without a single contradicting span). Second, audit cost is strongly mediated by locality: for some s, a pinpointed allegation truly reduces context (small L_i), while for others the adjudicator must inspect substantial upstream context, increasing c_a and eroding savings. These decompositions matter because the overall frontier is a mixture: low α_s concentrated on high-impact s can dominate residual risk even when aggregate α looks healthy.

Finally, we test whether ``independence’’ interventions actually change the operational parameters that drive the frontier. We run controlled ablations where B is (i) given the full trace (maximal information sharing), (ii) given a context-stripped view (e.g., only the final artifact and minimal tool results), and (iii) swapped to a different model family or provider. Across suites, the consistent effect of stronger independence is an upward shift in α and, in many cases, a reduction in β: challengers that do not share the same intermediate reasoning and do not inherit the same tool-use path are more likely to surface distinct failure modes and less likely to mirror A’s mistakes. The practical implication is immediate through ρ^*(δ): a modest increase in α can sharply reduce the random-audit load required to meet a fixed δ, translating into large reductions in expert minutes.

At the same time, the ablations expose where the model can fail: there are step types for which independence yields little improvement in α_s, suggesting shared blind spots that cannot be diversified away by superficial context manipulation. This motivates the next section, where we focus on predictability of such failure modes, how correlated misses manifest as disagreement collapse, and how locality and step features can be used to forecast audit load and residual risk.

8. Failure modes and predictability: shared blind spots; correlated misses; disagreement collapse; link to trace locality/context length; predictive models for audit load and residual risk.

The cost–risk frontier above rests on two deployment-facing premises that are easy to state but nontrivial to maintain in practice: (i) the challenger signal D must be about latent error E (high α at acceptable β), and (ii) an audit triggered by D = 1 must be , meaning a human can adjudicate the dispute from a compact packet rather than reconstructing the whole trace. This section enumerates the ways these premises fail, why those failures are often , and how they manifest operationally as either (a) disagreement collapse (too few challenges to control risk without near-full random auditing) or (b) locality collapse (challenges exist but are too diffuse to save time).

The most dangerous regime is not one where B is weak'' in an absolute sense, but one where $B$ is \emph{aligned with $A$'s failure distribution}. Concretely, the mechanism is brittle to correlation in the miss event $\{E=1,D=0\}$. Shared blind spots arise when $A$ and $B$ (a) share the same model family and training artifacts, (b) view the same intermediate reasoning or tool transcripts, and (c) are exposed to the same affordances for rationalizing a mistake. In that regime, errors are not merely unchallenged; they are \emph{jointly unchallenged for the same reasons}. Empirically we see this as systematicacceptance’’ of a wrong-but-plausible tool interpretation, a misbound citation, or a spec violation that both agents implicitly normalize. The high-level symptom is that α drops precisely on the slices where μ is elevated, which is the worst possible interaction for the frontier.

A low realized disagreement rate Pr (D = 1) is often celebrated as ``high consensus,’’ but it is compatible with two very different worlds: μ is genuinely low (the system is correct), or α is low (the challenger is silent even when A is wrong). Without auditing some fraction of D = 0 cases, these worlds are observationally confounded. In production, disagreement collapse therefore has a crisp operational signature: the error rate in the random-audited D = 0 slice is non-negligible. Using inclusion-corrected estimates, we can track $\widehat{\Pr(E=1\mid D=0)}$; if this rises while Pr (D = 1) falls (or remains flat), we are in the correlated-miss regime. This is exactly the setting where governance should treat a drop in Pr (D = 1) as a rather than a quality improvement.

Even when aggregate α looks healthy, the mechanism can fail because errors and detection power are heterogeneous across step categories s. Formally, with mixture weights π_s, base error μ_s, and detection α_s, aggregate quantities average in a way that can obscure the tail: μ = ∑_sπ_sμ_s while the effective detection against errors is ∑_sπ_sμ_sα_s/μ. If the high-impact or high-frequency categories have low α_s, then the mechanism must compensate by pushing ρ toward one for those categories, eliminating time savings where they matter most. We repeatedly find that the troublesome s are those where the adjudication criterion is underspecified (``misleading’’), evidence is distributed across many spans (global coherence), or correctness depends on an implicit contract with the user. These categories also tend to be those where B’s critique is more rhetorical than checkable, which couples heterogeneity in α_s to heterogeneity in locality.

A second predictable failure mode occurs when B does flag issues but cannot localize them. In our notation, the audit cost is mediated by a locality feature L_i (minimal context needed to judge the flagged claim). When L_i is large—e.g., the challenge is the reasoning is inconsistent'' without a pinpointed contradiction, orthe summary omits important details’’ without a reference set—the adjudicator must retrieve large upstream context, increasing c_a toward c_full. In practice, this can happen even with high α: the challenger ``knows something is wrong’’ but cannot compress the evidence. Locality collapse is therefore visible in logs as an increase in rendered packet length (or adjudication time) conditional on D = 1, and it is exacerbated by tool-use patterns that create long dependency chains (multi-hop retrieval, iterative code edits) where correctness is globally entangled. A design implication is that we must treat as a product requirement: challengers should be trained or prompted to output a minimal claim–evidence pair (span IDs, tool output hashes, failing tests), and the system should reject or downweight challenges that cannot be grounded in a small, auditable witness.

Independence across models does not automatically imply independence across . If both agents call the same external tool (search, database, compiler, sandbox) and the tool is wrong, stale, or adversarially manipulated, then both A and B may converge on the same incorrect conclusion with high confidence. This is a distinct failure mode from shared cognitive blind spots: even two diverse models can be coupled through a common upstream dependency. The practical consequence is that improving α via multi-vendor debate may not reduce residual risk for tool-mediated errors unless we also introduce redundancy at the tool layer (e.g., independent retrieval endpoints, cross-check queries, deterministic re-execution, or provenance constraints).

Because these failure modes have observable correlates, we can move from post hoc explanation to . Let z_i denote features extracted from (x_i, y_i), including step type s_i, locality proxies (packet length, citation density), tool metadata, and simple uncertainty measures. We can fit a parametric or semi-parametric model with parameters θ for Pr_θ(D = 1 ∣ z) and Pr_θ(E = 1 ∣ z, D). This enables two operational forecasts: expected audit minutes 𝔼[T ∣ z] (via predicted disagreement and predicted packet cost) and expected residual error Pr (E^′ = 1 ∣ z) under a candidate audit policy. Importantly, if we move beyond a constant ρ to a random-audit rate ρ(z), we can still preserve identifiability and unbiased estimation so long as inclusion probabilities are logged and depend only on observed z (a standard missing-at-random condition). In other words, we can deliberately oversample ``quiet’’ but suspicious regions of z-space (where we predict correlated misses) without giving up the ability to estimate (μ, α, β) on-policy.

The key governance lesson is that the debate-triggered mechanism does not fail silently if we monitor the right sufficient statistics. Disagreement collapse is detectable as divergence between (i) the observed disagreement rate and (ii) the error rate measured in randomly audited non-disagreements, stratified by s and locality proxies. Locality collapse is detectable as drift in the distribution of L_i (or realized audit minutes) conditional on D = 1. These are not merely diagnostics; they are levers. If we see collapse concentrated in particular s or tool regimes, we can predict (and pre-empt) the resulting audit load required to hold a fixed residual-risk target. This predictability is what makes the next section possible: setting ρ and procurement constraints as functions of measured (α, β) and locality, rather than as static policy commitments detached from operational reality.

9. Policy and procurement implications: how to set audit rate ρ, when to require multi-vendor debaters, how to budget expert time, and which metrics to monitor in production (estimated α, β, disagreement rate, locality).

The closed-form expressions above are not merely descriptive; they let us turn governance desiderata (a residual-risk target, a human-time budget, and a tolerance for operational volatility) into explicit policy knobs. In deployment we typically face two coupled choices: (i) how aggressively to random-audit apparent non-disagreements, and (ii) how much to pay—in model diversity, context restrictions, and tooling redundancy—to keep the challenger signal informative and the audit packet local. We can treat the mechanism as a controllable system with measurable sufficient statistics (μ̂, α̂, β̂) and realized cost drivers (ĉ_a, ĉ_r), and we can update ρ and procurement constraints as those statistics drift.

If the organization specifies a tolerable residual error rate δ (possibly varying by product surface or harm category), the minimal random-audit rate consistent with the model is
$$ \rho^*(\delta) \;=\; \max\Bigl\{0,\; 1-\frac{\delta}{\mu(1-\alpha)}\Bigr\} \qquad (\alpha<1), $$
with the operational caveat that we must plug in estimates and buffer for uncertainty. In practice we therefore implement a version:
$$ \rho \;\ge\; \max\Bigl\{0,\; 1-\frac{\delta}{\mu^{\mathrm{UCB}}\bigl(1-\alpha^{\mathrm{LCB}}\bigr)}\Bigr\}, $$
where μ^UCB is an upper confidence bound on base error and α^LCB is a lower confidence bound on detection power, computed from audited data with inclusion probabilities recorded. This is a governance-friendly translation: safety targets become a concrete sampling obligation, and conservative uncertainty handling is explicit rather than implicit.

When tasks are heterogeneous, a single global ρ is usually inefficient. The natural extension is a feature-adaptive audit rule ρ(z) (e.g., by step type, tool regime, or product tier), chosen to satisfy a constraint on expected harm. Concretely, for strata k with weights w_k and targets δ_k, we can impose Pr (E^′ = 1 ∣ k) ≤ δ_k and set ρ_k ≈ ρ_k^*(δ_k) using stratum-specific (μ_k, α_k). This is both cheaper and easier to justify externally: we can say which categories are audited more and why, and we can demonstrate that inclusion probabilities are policy-determined and logged.

Because ρ^*(δ) scales with 1/(1 − α), improving α is often the dominant lever for reducing human time at fixed safety. Procurement therefore matters: if we can buy an independence intervention that lifts α materially without exploding β, the reduction in required ρ can be large. This motivates an explicit decision rule: require a (different vendor/model family for B, or enforced context stripping) whenever the estimated α in any high-impact stratum falls below a threshold $\underline{\alpha}$ that would otherwise force ρ near one. Put differently, we should not treat multi-vendor debate as a research luxury; it is a cost-control instrument for expert minutes.

What we should contract for is not vague model quality'' but measurable audit-facing properties: (i) \emph{challenge quality}: a challenger output format that must include a minimal claim--evidence witness (tool output hash, span IDs, failing test, or a precise contradiction), enabling low $c_a$; (ii) \emph{rate guarantees}: empirical bounds on $\beta$ (since false flags directly consume expert capacity) and minimum acceptable $\alpha$ on designated benchmarks that mirror production step types; (iii) \emph{interface constraints}: commitments about what intermediate artifacts $B$ can see (context stripping) and whether tool calls are shared or independent. We can encode these as service-level objectives backed by periodic evaluation with randomized audits, avoiding the common failure where two models are nominallydifferent’’ but operationally coupled through shared evidence streams.

The expected expert minutes per task are
𝔼[T] = c_aPr (D = 1) + c_rρPr (D = 0), with Pr (D = 1) = αμ + β(1 − μ).
This gives an immediate budgeting worksheet: for a forecasted task volume N, wages w, and chosen ρ, we can compute 𝔼[wT]N and compare it to a full-review baseline wc_fullN. However, operational planning requires more than a mean. Audit queues have tail risk: bursts of disagreements, localized incidents in a step type, or tool outages that inflate μ can temporarily increase both Pr (D = 1) and the packet length that drives c_a. A robust budget therefore includes (i) a reserve capacity for the upper quantiles of T, (ii) a policy for graceful degradation (e.g., temporarily raising ρ and/or throttling task throughput when $\widehat{\Pr(E=1\mid D=0)}$ spikes), and (iii) a clear fallback to full review for narrow high-stakes categories when uncertainty bounds are wide.

This also implies a staffing implication that is easy to miss: debate-triggered auditing changes the of expert work. Instead of long end-to-end reviews, we need rapid, repeated adjudications of compact packets, plus occasional escalation when locality breaks. Hiring, tooling, and training should therefore optimize for high-throughput, high-consistency micro-adjudication, with escalation protocols that deliberately pay the full-review cost only when triggered by locality or uncertainty alarms.

The mechanism is only as safe as our ability to detect drift in its sufficient statistics. At minimum, production monitoring should include:

We emphasize that these should be stratified not only by s but also by tool regime and product surface, because procurement and policy decisions are usually made at those boundaries.

Finally, monitoring should be tied to with human oversight. For example, if α^LCB drops below a contracted threshold in a high-impact stratum, we can (i) increase ρ for that stratum immediately, (ii) route tasks to a more independent challenger configuration, and (iii) trigger an incident review focused on the tool or evidence pipeline. The point is not to eliminate judgment, but to ensure that the system cannot silently slide into a regime where safety requires near-full auditing while we continue operating under a low-ρ assumption.

These implications together suggest a governance stance: ρ is not a static compliance parameter but a dynamically set control variable; multi-vendor and context restrictions are not branding choices but levers to increase α and preserve locality; and the core production metrics are those that let us continuously re-derive the cost–risk frontier from logs and enforce a residual-risk target with auditable evidence.

10. Limitations and next steps: imperfect humans, correlated panels, strategic models, and extensions toward robust debate; open questions aligned with the source’s flagged gaps.

Our analysis deliberately idealized the auditing layer: when a task (or pinpointed step) is audited, we treated human adjudication as essentially correct. This approximation is convenient for deriving a clean cost–risk frontier, but it is not literally true in most deployments. Experts are noisy, they disagree, and they are subject to fatigue and anchoring. A first extension is therefore to introduce an . For example, let η denote the probability that an audited error is mistakenly cleared (a human false negative), and let ζ denote the probability that a correct output is mistakenly flagged (a human false positive). Then audited items no longer deterministically reset residual error to zero, and the post-mechanism error becomes a mixture of (i) unaudited misses and (ii) audited-but-misjudged cases. Even in the simplest variant where only false negatives matter for safety, residual error takes the schematic form
Pr (E^″ = 1) = μ(1 − α)(1 − ρ) + μαη + μ(1 − α)ρη,
highlighting a qualitative shift: once η > 0, increasing ρ has diminishing returns because audited coverage is no longer equivalent to correction. This motivates treating expert minutes'' andexpert accuracy’’ as jointly budgeted resources, and it makes procurement and training of auditors a first-class part of the safety mechanism rather than an afterthought.

A closely related limitation is that panels are not independent coin flips. Many organizations mitigate human noise by majority vote or escalation, implicitly assuming that errors decorrelate across adjudicators. In practice, correlation can be substantial: panelists share a rubric, see the same anchored packet, and are exposed to the same organizational incentives. If adjudicator errors are positively correlated, the effective reduction in η from adding more reviewers can be far smaller than what a binomial model would predict, and the system can fail in precisely those hard cases we most want to catch. An actionable next step is to measure directly by duplicating a subset of audit packets across independent teams with minimal shared context, and to treat the resulting correlation estimates as governance metrics. Where correlation is high, we should expect ``more reviewers’’ to be a weak intervention and instead invest in making evidence more machine-checkable (hash-linked tool outputs, executable tests, constrained proofs) so that adjudication relies less on subjective judgment.

A third limitation is that our baseline model treats (α, β) as intrinsic properties of the challenger B given a task distribution, while real challengers are in at least two senses. First, B may be trained (or prompted) to optimize a proxy objective that implicitly penalizes challenges, because challenges trigger audits, slow throughput, or create internal incidents.'' Second, if $A$ and $B$ share training data, system prompts, or tool traces, they can develop shared blind spots that are not well summarized by a stationary $\alpha$. In a strategic setting, $D$ is best viewed as an action chosen under incentives, and $\alpha$ becomes an equilibrium outcome. This points toward mechanism-design versions of debate-triggered auditing: we can make $B$'s incentives closer totruthful dispute reporting’’ by (i) imposing explicit penalties for discovered misses on randomly audited D = 0 cases, (ii) rewarding high-quality challenges that lead to confirmed corrections, and (iii) preventing easy gaming by randomizing parts of the evaluation and keeping some audit outcomes delayed or partially hidden from the models.

Relatedly, our independence interventions (context stripping, multi-vendor B) were discussed as if they only affect (α, β). In practice, they also affect the . If B has access to A’s intermediate tool calls and chain-of-thought, it can either inherit errors (reducing α) or learn to rationalize them (increasing a particularly dangerous form of false acceptance). Conversely, over-stripping context can reduce α by starving the challenger of the minimal evidence needed to find flaws. A concrete research direction is to optimize is shared: share verifiable artifacts (tool output hashes, structured state diffs, tests) while withholding non-verifiable internal rationales. This is also where trace locality L matters: if we can standardize the challenger to always emit a compact ``witness’’ that is itself checkable, then both α and c_a can improve simultaneously rather than trading off.

Another assumption worth flagging is the independence of random audits R from (E, D) conditional on design. Operationally, this can fail via human-in-the-loop triage (auditors preferentially pick ``interesting’’ cases), via engineering constraints (some logs are missing), or via product incidents (audit sampling changes precisely when risk increases). These failures break the identification arguments underlying (μ̂, α̂, β̂) and can create a false sense of security. The immediate engineering next step is therefore boring but critical: immutable inclusion-probability logging, audit assignment that is provably randomized (and replayable), and explicit missingness accounting. Without these, the mechanism becomes difficult to audit externally, and the organization can drift into unmeasured correlated-miss regimes.

We also simplified the task outcome to a binary error indicator E, whereas many deployments care about severity, downstream propagation, and time-delayed harms. Two outputs can both be ``incorrect’’ but differ by orders of magnitude in expected harm. Extending the model to a graded loss L(E) (or a severity label) naturally leads to risk-weighted auditing, where ρ and even the challenger configuration depend on an estimated harm proxy. This extension is not merely cosmetic: it changes what it means to meet a target δ, replacing a uniform error-rate bound with a constraint on expected harm, tail risk, or specific harm categories. It also interfaces directly with governance requirements, which are often phrased in terms of unacceptable failure modes rather than average error.

Finally, debate-triggered auditing is only one point in a larger design space of . Our mechanism is essentially a one-shot challenge with a localized human check. Robust debate suggests multi-round interaction, counterfactual tool re-execution, and structured argumentation where each claim must be grounded in a verifiable artifact. The open technical question is how to preserve the key advantage—auditing compact packets rather than full traces—while increasing robustness against strategic obfuscation and shared blind spots. Promising directions include: multi-challenger ensembles with diversity constraints; adversarially trained challengers optimized for α on high-impact strata; and verifier-side automatic checks that expand the packet only when locality fails. At a systems level, we expect the right abstraction to be a with drifting sufficient statistics, where ρ, model routing, and packetization are updated online under explicit uncertainty bounds. The core safety implication is that the mechanism must be designed to : when independence erodes or locality breaks, the system should automatically ratchet toward higher auditing and slower throughput rather than silently maintaining a low-ρ posture that no longer supports the claimed residual-risk target.