
Influence Auctions for Agentic LLM Policies: Tool Calls, Citations, and Universal Stable Sampling

Table of Contents

  1. Introduction: sponsored actions in agentic assistants (citations/tools/UI), why token auctions are insufficient, and why audit-consistent counterfactuals matter in 2026
  2. Related work: token auctions for LLM aggregation, sponsored summaries/QA, public projects, robust mechanism design, proper scoring rules (discussion only)
  3. Model: agentic action set A, state s, advertiser-induced distributions p_i(s), bid-only indirect mechanism, robust preference order over Δ(A), and stage-separable utilities
  4. Monotone policy aggregation: definitions (robust monotonicity, consistent aggregation), and a coordinatewise characterization extending Lemma 3.10
  5. Stable sampling and critical-bid pricing for actions: existence of stable implementations for any monotone q; second-price-style payment rule; envelope/payment identity; counterfactual interpretation for tool calls and citations
  6. Universal stability and auditability: definition of universal stable sampling; impossibility examples in the policy setting; why multi-dimensional bid spaces create path dependence
  7. Characterization theorem: universal stability via single-index (potential) aggregators; constructive sampler and proof sketch; discussion of what is lost (expressiveness) and what is gained (auditability)
  8. Applications: (i) sponsored citations (choose among sources); (ii) sponsored tool calls (choose among tools/providers); (iii) sponsored UI actions; mapping each to the model primitives
  9. Simulation plan: agentic LLM prototype (retrieval + tool router), implement monotone vs non-monotone aggregators, evaluate counterfactual consistency and outcome quality; note where numerical methods are required
  10. Extensions and limitations: cross-step externalities, budgets/pacing, constraints (safety/factuality), strategic model manipulation; open questions
  11. Appendices: proofs, additional counterexamples, and implementation details

Content

1. Introduction: sponsored actions in agentic assistants (citations/tools/UI), why token auctions are insufficient, and why audit-consistent counterfactuals matter in 2026

By 2026, the economically salient ``surface area'' of general-purpose assistants is no longer limited to the next token they emit. Deployed systems routinely take actions: they retrieve documents, choose whether to cite sources, call external tools (travel booking, payments, code execution), open partner applications, render interface elements, and decide when to stop. These actions are often sequential and state-dependent: a retrieval choice changes what evidence is available; a tool call changes the user-visible outcome; a citation decision changes perceived credibility and downstream user behavior. In this environment, sponsorship and monetization pressures naturally migrate from raw text generation to sponsored actions—paid influence over which tools, sources, and UI affordances are exercised in a given context.

This shift exposes a mismatch with the dominant mental model inherited from web advertising and earlier language-model deployments: the ``token auction.'' Token-level pricing (whether framed as auctions over positions in generated text, or as payments per character/word) treats the assistant as a channel whose output is a string. But agentic assistants are closer to a policy over heterogeneous actions, where text is only one component and often not the component advertisers care about. A restaurant may not value being mentioned in prose if the assistant never opens the booking widget; a publisher may prefer being cited (with a link) rather than paraphrased; a software vendor may want the assistant to call their API rather than an open-source alternative. In short, the object being purchased is not ``tokens'' but influence on consequential actions, conditional on the current conversational state.

Token auctions are also technically brittle in agentic settings for three reasons. First, the mapping from tokens to economic outcomes is indirect and model-dependent: a tool invocation can compress into a single token (e.g., a structured call) yet dominate user value, while a long textual explanation may have little marginal impact. Second, token allocation ignores the platform’s need to maintain coherent tool-use policies: when external actions carry safety, privacy, latency, or regulatory constraints, the platform cannot treat each token as an independent impression without risking policy violations. Third, token-level mechanisms interact poorly with the randomness inherent in modern inference pipelines (sampling, speculative decoding, retrieval stochasticity): any attempt to charge per token while preserving meaningful counterfactual statements about ``what would have happened otherwise’’ becomes murky once a single token changes the subsequent state trajectory.

These issues motivate a different starting point. We model the platform at each state (e.g., a dialogue prefix together with retrieval and tool context) as choosing a distribution over heterogeneous actions. Advertisers, in turn, are not assumed to have quasilinear values for a single deterministic outcome; rather, each advertiser induces a preferred distribution over actions, reflecting how they would like the assistant to behave in that state if unconstrained. Sponsorship then becomes a problem of aggregating competing preferred distributions into a single distribution that the platform samples from, with payments that (i) induce sensible bidding incentives and (ii) support credible auditing.

Auditability is not an afterthought. In practice, sponsored-action systems face scrutiny from at least three directions. Regulators increasingly treat AI-driven recommendations and UI interventions as subject to disclosure and fairness requirements, especially when commercial influence is present. Business customers demand invoicing and attribution that can be reconciled with logs: did a tool call occur, and what bid would have been just enough for it to occur? And platforms themselves need internal governance to prevent hidden degrees of freedom—particularly randomness and stateful decision-making—from becoming a channel for favoritism, retaliation, or inadvertent bias. A credible mechanism must therefore support audit-consistent counterfactuals: given the realized trajectory and a fixed randomness seed, one should be able to answer counterfactual questions of the form ``if advertiser i had bid less (holding others fixed), would the same action have been taken?'' in a way that is consistent across advertisers and across different ways of varying bids.

The naive approach—re-simulate the assistant under a different bid profile using fresh randomness—does not provide such consistency. If the counterfactual uses independent randomness, then deviations in outcomes can be attributed either to the bid change or to the random draw, undermining both incentive arguments and audit narratives. Moreover, when multiple advertisers are present, independent resampling creates pathologies: one can obtain incompatible explanations depending on whether we vary advertiser i’s bid first or advertiser j’s bid first. In an agentic system where action choices shape subsequent state, these inconsistencies are amplified: a small difference early in a trajectory can cascade into qualitatively different future tool calls, citations, or stopping times.

To address this, we treat the sampling procedure as a first-class object of mechanism design, not merely an implementation detail. Conceptually, the platform outputs an action distribution, but operationally it must instantiate a deterministic action given internal randomness. The key requirement is a form of stability: as a single advertiser increases her bid (holding others fixed), the realized action under a fixed random seed should change in a structured, monotone way that matches the direction of influence implied by the advertiser's preference. Stability turns randomness from an obstacle into a coupling device: it pins down counterfactuals in a way that can be logged, reproduced, and audited.

A second modeling choice is how to represent advertiser objectives over distributions. In sponsored-action settings, we rarely have a defensible cardinal utility scale for each advertiser over all possible mixed policies, and we do not want the platform's incentive guarantees to depend on fragile assumptions about such scales. We therefore begin with an ordinal notion of improvement: an outcome is better for an advertiser if it moves the platform's action distribution toward the advertiser's preferred distribution, coordinate by coordinate, without overshooting. This robust order matches an operational intuition: a bidder is paying to reduce misalignment between the platform's behavior and the bidder's desired behavior, not to maximize an arbitrary expected payoff functional. It also aligns with how sponsorship is often negotiated in practice (``show this citation more often,'' ``use that tool less often'') and with how policy teams reason about influence (shifts in propensity rather than pointwise utility).

Within this framing, the central economic question becomes: when can scalar bids purchase monotone, auditable influence over a stochastic action policy, and how should realized actions be priced? Our answer decomposes into two parts. First, for a fixed set of opponents' bids, we ask when an advertiser's bid induces monotone movement of the action distribution toward that advertiser's preference, in the robust ordinal sense. Under such monotonicity, one can implement payments using ``critical bid'' logic familiar from single-parameter auction design, but adapted to stochastic actions: intuitively, each random seed induces a threshold such that above the threshold the realized action is one that is relatively under-supplied from the advertiser's perspective, and below the threshold the realized action is one that is relatively over-supplied. This supports per-step pricing that is interpretable to advertisers (a second-price-style ``you pay the minimum you needed to cause this favorable action under this seed'') and can be reconciled with logged randomness.

Second, we ask when a single sampling procedure can be stable for all advertisers simultaneously, across all profiles of opponent bids. This requirement is stronger than per-advertiser stability because it demands path independence in bid space: a single realized action must be compatible with varying any bidder's bid up or down while holding the random seed fixed. In multi-advertiser settings with three or more actions, this can fail in a fundamental way: the directions of monotone mass transfer demanded by different advertisers can be incompatible unless the aggregation rule has additional structure. The practical significance is immediate. Without a universally stable sampler, the platform can still implement stable counterfactuals for one advertiser at a time, but it cannot provide a coherent global audit story that simultaneously answers counterfactuals for many bidders using the same logged randomness. For systems that must produce regulator-facing reports or advertiser-facing invoices that are mutually consistent, this distinction matters.

We view the technical results of this paper as formalizing a design tradeoff that practitioners already feel. On the one hand, allowing the action distribution to depend on bids in a richly multidimensional way can improve expressiveness and, potentially, revenue or allocative fit. On the other hand, such richness makes it difficult to commit to a single randomness-coupled counterfactual semantics that is consistent across bidders. Our characterization isolates when this tradeoff disappears: when bid dependence collapses to a single increasing index (a potential function of bids), the platform can route all bid changes along a one-dimensional path and inherit the stable sampling and pricing logic of single-parameter environments. This is not merely a mathematical convenience; it is a concrete prescription for building auditable sponsored-action layers on top of stochastic agents.

Our contributions are therefore threefold, aimed at bridging economic mechanism design with the engineering realities of agentic assistants: (i) we define robust, coordinatewise monotone aggregation of advertiser-preferred action distributions and characterize it; (ii) we show that any robustly monotone aggregator admits a stable sampling implementation supporting second-price-style critical-bid pricing, with an envelope identity for expected payments; and (iii) we characterize when a universally stable sampler exists for all advertisers simultaneously, namely when bids enter through a single increasing index.

The proposed perspective has immediate implications for policy and platform practice. A regulator or auditor does not need to inspect the assistant’s full model weights to evaluate whether sponsorship is being handled consistently; instead, they can test monotonicity and stability properties of the aggregation-and-sampling layer, and verify that realized payments correspond to critical bids under logged seeds. Likewise, advertisers can be offered contracts that are legible: a bid buys a predictable shift of probability mass toward a declared target behavior, and invoices can be accompanied by counterfactual statements that are reproducible. Finally, the platform can integrate user-facing disclosure: when an action is sponsored, the system can indicate sponsorship while still maintaining coherent tool-use policies, since sponsorship operates by controlled, monotone shifts rather than arbitrary overrides.

Several limitations are worth acknowledging upfront. We focus on a stage-separable baseline in which bids are fixed over an episode (or at least treated as exogenous to the within-episode state evolution) and advertiser values scale linearly with per-step improvements. This abstraction is attractive for clarity and for aligning with common billing practices, but it omits dynamic strategic considerations (bid shading in response to observed earlier actions, budget constraints across sessions, and learning about the platform’s policy). We also take advertisers’ preferred distributions as observable to the platform, which is natural when preferences are induced by advertiser-conditioned models or declared policy templates, but it suppresses a separate informational problem of eliciting those preferences. Finally, we treat the user as non-strategic and represent user welfare only indirectly via platform constraints and objective choices; extending the analysis to settings where users respond strategically to sponsored UI actions is an important direction but not necessary to isolate the core tension between expressiveness and audit-consistency.

Despite these simplifications, the model captures a central fact about agentic assistants: sponsorship operates through a stochastic policy over heterogeneous actions, and credible monetization requires a coupling between bids, outcomes, and randomness that supports stable counterfactual reasoning. The remainder of the paper develops this logic formally and connects it to implementable sampling and pricing rules that can be deployed in modern assistant stacks without pretending that token-level auctions are the right primitive. In the next section, we position our approach relative to token-auction proposals for language models, sponsored summarization and question answering, and adjacent literatures in robust mechanism design and proper scoring rules.


2. Related work: token auctions for LLM aggregation, sponsored summaries/QA, public projects, robust mechanism design, proper scoring rules (discussion only)

Our focus—paid influence over an assistant's action policy together with auditable, randomness-coupled counterfactuals—touches several literatures that have developed largely in parallel. We organize this section around five threads: token-level monetization proposals for language models, sponsored summarization and question answering (especially via retrieval and citations), public-facing transparency and governance projects for AI advertising, robust mechanism design with weak preference information, and proper scoring rules for eliciting distributions. Across these threads, the recurring gap is that most existing approaches treat randomness and sequential state evolution as engineering details, whereas our central object is precisely the joint design of an aggregation rule and a sampling implementation that supports coherent counterfactual explanations.

A natural first attempt to import familiar ad-auction logic into language models is to treat generated tokens as ``slots'' and run some variant of a position auction, either explicitly (bidding for insertion opportunities, sponsored phrases, or ranked candidates) or implicitly (charging per token for model-conditioned influence). Variants of this idea appear in industry discussions and early academic proposals for ``auctioning attention'' inside generated text, and they mirror classical sponsored-search models in which an auction determines which advertiser message is shown and in what position.

Our setting differs in two respects that make token auctions a brittle primitive. First, the economically relevant unit is often an action rather than a string: retrieval choices, citation decisions, tool calls, UI affordances, and stopping behavior are not well captured by per-token exposure. Second, even when the object of influence is textual, modern inference pipelines create nontrivial dependencies across tokens (sampling, speculative decoding, constrained decoding, and retrieval-conditioned generation). In such systems, it is difficult to define a token-by-token ``allocation rule'' that is monotone in bids in a way compatible with standard single-parameter incentive arguments, because a small perturbation early in a sequence can change the future state and thereby change which tokens are feasible or likely. Token auctions, in effect, attempt to decompose a sequential stochastic control problem into separable impressions; this can be a useful approximation for billing, but it is poorly aligned with the assistant's policy structure and with the auditing questions that arise when outcomes are stochastic and stateful.

A further distinction concerns counterfactual semantics. In sponsored search, the counterfactual ``would advertiser i have won position k at a lower bid?’’ is naturally evaluated holding fixed the randomization used for tie-breaking (or ignoring randomness when the allocation is deterministic). In stochastic generation, however, counterfactual statements are only meaningful once one fixes a coupling of randomness across bid profiles; resampling destroys interpretability. The mechanisms we study treat this coupling as a design constraint, not merely a computational convenience, and our characterization results can be read as delineating when token-style bid dependence is compatible with a single, globally consistent coupling.

A second, closer thread concerns sponsored content in retrieval-augmented generation and question answering. In information retrieval, ``paid inclusion,'' sponsored ranking, and the blending of organic with sponsored results are mature topics \cite{agrawal2006adwords,broder2008sponsored}. Recent work has begun to examine analogous issues for LLM-based summarization and QA: which sources are retrieved, which are cited, and whether the assistant should paraphrase, quote, or link. In these systems, advertisers plausibly care not about raw mention counts, but about probability mass on actions like ``retrieve from domain d,'' ``cite publisher j,'' or ``open partner widget w.'' This is precisely the type of action space we model.

The key difference is that most sponsored-retrieval and sponsored-citation proposals implicitly assume that once the platform decides a ranking or a set of sources, the rest of the pipeline is either deterministic or can be treated as noise that washes out in expectation. Yet practical deployment forces the opposite stance: the stochasticity is an input to governance. Auditors and advertisers typically want episode-level logs that explain why a particular source was cited or why a tool call occurred, and they want counterfactual ``minimum bid'' explanations that do not depend on rerunning the model with fresh randomness. This pushes the design problem toward stable couplings (so that a fixed random seed defines a coherent family of counterfactuals) and toward payment rules that can be computed from such couplings. The literature on sponsored ranking supplies many useful design patterns (separation of organic and sponsored constraints, disclosure norms, and allocation/payment templates), but it does not by itself resolve the randomness-coupling problem that emerges once the ``ranking'' is a distribution over heterogeneous agentic actions.

Our framing also complements recent work emphasizing that citations and retrieval are not merely UX features but strategic levers affecting trust, user reliance, and downstream welfare. From this perspective, ``sponsoring’’ a citation decision is closer to influencing a policy under safety and credibility constraints than to buying an impression. The robust, coordinatewise preference model we adopt is meant to capture the kind of contractual language that arises in practice (increase the frequency of a certain action without overshooting, decrease another) while avoiding fragile assumptions about advertisers’ cardinal utilities over all mixed policies.

A third thread is the emerging ecosystem of public commitments, regulatory initiatives, and measurement projects around transparency in AI-mediated advertising and recommendations. While this work is not always cast in mechanism-design terms, it shapes the constraints under which any sponsored-action system must operate: disclosures for sponsored outputs, record-keeping requirements, reproducibility standards, and the ability to provide user- or regulator-facing explanations. The EU AI Act and related digital-services regulation, for example, have spurred renewed attention to logging and traceability for automated decision systems, including those that personalize or recommend content. Industry bodies have also developed norms for ad transparency and labeling, though these norms are less settled for agentic assistants than for search and social feeds.

Our contribution sits at the intersection of these governance concerns and classical incentive design. The practical question ``can the platform justify a sponsored action ex post?'' translates in our model into a requirement that, given a realized randomness seed and a bid profile, one can compute consistent counterfactual outcomes and critical bids. This is stronger than typical disclosure (which only requires telling the user that sponsorship occurred) and stronger than aggregate reporting (which might only require ex post frequency statistics). It is a joint requirement: the same logged randomness must support counterfactual queries across many bidders, not just for one bidder at a time. The universal-stability condition we study can therefore be viewed as a structural constraint that makes certain governance desiderata technically feasible.

At the same time, we view this as an illuminating limitation. Many platforms may prefer rich, multidimensional bid dependence because it allows finer-grained control and potentially higher revenue. Our impossibility and characterization results clarify that such richness can conflict with globally consistent auditing. This helps separate what is a policy choice (how much expressiveness to allow in the sponsored-action layer) from what is a mathematical constraint (when a single counterfactual semantics can exist).

A fourth thread is robust mechanism design, broadly construed: settings where the designer lacks a credible cardinal utility model, where types are partially ordered, or where incentive guarantees are sought uniformly over a class of utility representations. Classic robust-design work studies worst-case revenue, ambiguity about priors, or mechanisms that are optimal across a set of environments. Another relevant set of ideas concerns mechanism design with ordinal or otherwise coarse preference information, with monotonicity as the key implementability condition. We borrow the central lesson of this literature: incentive properties are fundamentally linked to monotone comparative statics of the allocation rule, and payments can often be recovered (up to constants) by envelope arguments once monotonicity holds.

What is distinct in our environment is the object over which monotonicity is defined: not a scalar allocation or a discrete winner, but a distribution over actions. Moreover, the preference relation is not derived from a full expected-utility model; it is a robust partial order capturing ``move toward my preferred distribution without overshooting, coordinate by coordinate.’’ This robust order is intentionally conservative, and it leads to a corresponding notion of monotone aggregation: increasing a bid should move the platform’s mixed action choice toward the bidder’s target in every coordinate, with the direction determined relative to a baseline. The benefit is that our incentive conclusions do not depend on choosing a particular divergence or proper scoring rule as the advertiser’s utility; any cardinalization consistent with the robust order can be accommodated.

There is also a conceptual connection to the literature on randomized social choice and implementation with lotteries, where agents may have preferences over distributions and one seeks incentive properties under weak assumptions. In such settings, the choice of coupling (how randomness is shared across counterfactuals) is often suppressed because the mechanism is defined directly as a lottery. In our agentic context, the coupling re-enters because auditing requires a realized action together with a reproducible counterfactual mapping. Our ``stable sampling'' requirement can be interpreted as selecting, among all implementations of a given lottery, one that preserves monotone structure pathwise.

Finally, our model is closely related in spirit to proper scoring rules and mechanisms that elicit probabilistic beliefs. Proper scoring rules provide payments that make truthful reporting of a probability distribution optimal, and they are closely connected to Bregman divergences. This connection is relevant for two reasons. First, the primitive in our environment is also a distribution (the advertiser-induced preference distribution pi(s)), and a natural design route would be to elicit pi(s) via scoring-rule payments and then aggregate reports. Second, the canonical cardinalization we use for illustration, $U(p,q)=-\tfrac12\|q-p\|_1$, is a divergence-like measure of misalignment between the platform's behavior and the advertiser's target.

Despite this proximity, our mechanism problem is different from belief elicitation. We do not treat pi(s) as privately reported information that must be truthfully elicited; rather, it is best interpreted as an externally specified target distribution (e.g., induced by an advertiser-conditioned model, or a contractually declared policy template) that is observable to the platform. The private information is instead a parameter θi controlling willingness to pay for movement toward pi(s). This is why the relevant incentive constraint is single-parameter truthfulness in bids, not truthful reporting of a full distribution. In this sense, our payments resemble Myerson pricing more than scoring-rule transfers: the payment is tied to a critical threshold under a stable coupling, and the expected payment admits an envelope representation in terms of how much the outcome distribution moves toward the bidder’s target as the bid increases.

There is nonetheless a useful methodological lesson from scoring rules: once one commits to a divergence-based objective (or, more generally, to a convex potential), the geometry of distributions can provide structure that makes aggregation and pricing more tractable. Our results can be read as identifying a different structural requirement—single-index dependence of the aggregator—that plays an analogous simplifying role for audit-consistent sampling. In both cases, additional structure is what converts an underdetermined mapping (many implementations of a lottery; many ways to pay for influence) into something that is both interpretable and incentive-compatible.

Taken together, the related literatures provide building blocks but leave open the core design problem that arises once we treat an assistant as a stochastic policy over heterogeneous actions: how to couple randomness across counterfactual bid profiles in a way that is simultaneously (i) monotone in each bidder’s influence, (ii) compatible with transparent, per-realization ``critical bid’’ explanations, and (iii) globally consistent across many advertisers. Token auctions and sponsored retrieval motivate the economic demand for influence; governance projects motivate the audit requirement; robust mechanism design motivates monotonicity and envelope logic under weak preference assumptions; and scoring rules motivate thinking geometrically about distributions. Our next step is to formalize a model that makes these connections precise and isolates the key structural constraint—single-index bid dependence—under which universal, audit-consistent stability is achievable.


3. Model: agentic action set A, state s, advertiser-induced distributions p_i(s), bid-only indirect mechanism, robust preference order over Δ(A), and stage-separable utilities

We formalize ``paid influence'' as influence over an assistant's action distribution at each point in an interaction. The central modeling choice is to treat the assistant not as a text generator with monetized tokens, but as an agent that repeatedly selects among heterogeneous actions—including but not limited to emitting text. This is the natural unit at which retrieval, citations, tool use, UI affordances, and stopping decisions are made, and it is also the natural unit at which one can ask auditable counterfactual questions such as: ``if advertiser i had bid less (holding others fixed and the randomness fixed), would the same action have been taken?''

Fix a (possibly history-dependent) state space, and let s denote the current state. In an LLM pipeline, s can be interpreted broadly: the dialogue prefix, the retrieved-document cache, tool outputs already observed, user metadata relevant for policy constraints, and any other latent control state the platform maintains. At each s, the platform selects a distribution over a finite action set A; the set of such distributions is denoted Δ(A). The finiteness of A is not restrictive for our purposes: one may take A to be a discretization of a richer control space, or a high-level abstraction (e.g., ``retrieve from corpus Cj'' rather than ``retrieve a specific URL'').

We emphasize that A is intentionally heterogeneous. A representative (non-exhaustive) list includes
A = {emit-token, retrieve-doc(j), cite-source(j), call-tool(t), UI-action(w), stop},
where the indices j, t, w can themselves encode families of actions. The designer is free to refine or coarsen A depending on product requirements and the desired granularity of auditing.
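To fix ideas, the following sketch shows one hypothetical encoding of such an action set in code; the names and granularity are illustrative, not prescribed by the model.

```python
# Hypothetical encoding of the heterogeneous action set A; the designer may
# refine or coarsen this freely (see the discussion above).
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    kind: str          # e.g., "emit-token", "cite-source", "call-tool", "stop"
    index: str = ""    # optional family index: source j, tool t, widget w

# A small, finite A reused in later sketches.
ACTIONS = [
    Action("emit-token"),
    Action("retrieve-doc", "j1"),
    Action("cite-source", "j1"),
    Action("call-tool", "t1"),
    Action("UI-action", "w1"),
    Action("stop"),
]
```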

There are n advertisers indexed by i ∈ [n]. Advertiser i is associated with a preferred distribution pi(s) ∈ Δ(A) at each state s. Conceptually, pi(s) describes what the assistant would do at s if it were fully aligned with advertiser i's objective within the platform's feasible action space. For example, an advertiser might prefer a higher chance of taking an action such as retrieve-doc(i) when the user asks about a product category, or a higher chance of cite-source(i) when the assistant makes a claim in a domain where the advertiser is an authorized publisher.

A key modeling stance is that pi(s) is not a strategic message in the baseline mechanism; rather, it is an observable object that the platform can query. Operationally, pi(s) could be generated by an advertiser-conditioned model, a policy template defined in a contract, or a learned mapping from state features to action propensity. This reflects a practical separation: the platform can often verify or control the mapping from state to the space of permissible sponsored behaviors, while the advertiser's private information is better modeled as the intensity with which they value moving the platform policy toward that mapping.

Formally, write p(s) = (p1(s), …, pn(s)). We will often suppress the argument s when it is fixed.

Advertisers do not directly choose actions. Instead, they submit nonnegative scalar bids bi ∈ ℝ+, collected in the bid vector b = (b1, …, bn). We focus on indirect mechanisms because this is the natural interface for sponsored systems: a bidder supplies a budget and a bid, while the platform retains control over how influence is implemented subject to user-facing constraints and policy.

At each state s, the platform computes an aggregate distribution
q(b, p; s) ∈ Δ(A),
which we interpret as the assistant’s mixed action policy at s given bids and preferred distributions. An action a ∈ A is then realized by sampling
a ∼ q(b, p; s).
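As a concrete illustration (an assumption for exposition, not the paper's prescription), one simple aggregator blends a baseline policy q0(s) with the advertisers' preferred distributions in proportion to their bids. Each coordinate of the resulting q travels monotonically from its bid-zero value toward pi(a) without ever crossing it as bi grows, so this rule turns out to be robustly monotone in each bid in the sense formalized in Section 4.

```python
import numpy as np

def aggregate(q0, bids, prefs):
    """Illustrative bid-weighted mixture aggregator (our assumption, not the
    paper's prescription): q = (q0 + sum_i b_i * p_i) / (1 + sum_i b_i).
    q0: baseline distribution over A, shape [|A|];
    bids: nonnegative bids, shape [n]; prefs: rows p_i(s), shape [n, |A|]."""
    q = q0 + sum(b * p for b, p in zip(bids, prefs))
    return q / (1.0 + np.sum(bids))   # nonnegative and sums to one

# Example: two advertisers pulling toward citations and tool calls.
q0 = np.array([0.40, 0.20, 0.20, 0.10, 0.05, 0.05])
prefs = np.array([[0.10, 0.10, 0.50, 0.20, 0.05, 0.05],   # p_1: wants cite-source
                  [0.10, 0.10, 0.10, 0.60, 0.05, 0.05]])  # p_2: wants call-tool
print(aggregate(q0, np.array([0.5, 1.0]), prefs))
```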
We separate the distributional rule q from its sampling implementation. Let r denote exogenous randomness used by the sampler (e.g., a uniform seed), and let
σ(b, p; s, r) ∈ A
be a deterministic function such that, when r is drawn from the platform's internal randomness source, σ(b, p; s, r) has distribution q(b, p; s). This distinction is economically consequential: many different σ implement the same q, but they induce different counterfactuals when one varies bids while holding fixed the realized random seed. Since auditability is fundamentally a pathwise requirement (``what would have happened under the same randomness?''), the implementation σ is part of the mechanism.

The mechanism charges advertisers per step. Let ζi(b, p; s, a) denote advertiser i’s per-step payment when action a is sampled at state s under bid profile b and preference profile p. The expected per-step payment is
zi(b, p; s) = 𝔼[ζi(b, p; s, a) | a ∼ q(b, p; s)].
In applications where reproducibility is required, the platform can log (s, b, p, r, a) (or an appropriate hash) so that third parties can verify that a = σ(b, p; s, r) and so that counterfactual actions can be computed by evaluating σ(⋅, p; s, r) at alternative bid profiles. Our results do not require that r be publicly revealed in real time, only that it be well-defined and available for ex post auditing when needed.
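The following sketch shows what this logging-and-replay discipline can look like, reusing the illustrative `aggregate` rule above together with a generic inverse-CDF sampler (the helper names are ours). Note that this generic σ pins down counterfactuals under a fixed seed, but it is not necessarily stable in the sense of Section 5; the stable construction appears there.

```python
import numpy as np

def sigma(q, r):
    """Generic inverse-CDF sampler: maps a seed r in [0,1) to an action index.
    Deterministic given (q, r); its marginal over uniform r is exactly q."""
    return int(np.searchsorted(np.cumsum(q), r, side="right"))

def replay(entry):
    """Verify a logged tuple {q0, prefs, bids, r, action} against the policy."""
    q = aggregate(entry["q0"], entry["bids"], entry["prefs"])
    return sigma(q, entry["r"]) == entry["action"]

def counterfactual_action(entry, i, alt_bid):
    """Action that would have been realized had bidder i bid alt_bid,
    holding the logged seed r and all other bids fixed."""
    bids = entry["bids"].copy()
    bids[i] = alt_bid
    return sigma(aggregate(entry["q0"], bids, entry["prefs"]), entry["r"])
```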

The interaction proceeds over steps k = 1, 2, …, T, where T may be fixed or random. At step k, the state is sk, the platform computes q(b, p; sk), samples ak ∼ q(b, p; sk) via ak = σ(b, p; sk, rk), and the state updates to sk + 1 according to the assistant’s environment dynamics (which can include the user’s response). Importantly, even though the mechanism is applied step-by-step, the state evolution can create rich dependencies: early retrieval or tool-use decisions alter what actions are feasible or salient later.

To keep the incentive problem tractable while retaining this sequential realism, we adopt a stage-separable utility specification: advertisers care about the per-step alignment of the platform's action distribution with their preferred distribution at that step, and their episode utility is the sum of these per-step utilities. This assumption is standard in sponsored-search and many auctioned-service settings where billing and value accrue per impression or per decision, and it yields a clean single-parameter environment even when the state is endogenous.

A distinctive feature of our setting is that the primitive outcome at each step is a distribution q(⋅) ∈ Δ(A). Rather than impose a fully specified expected-utility model over all lotteries, we assume each advertiser has a robust ordinal preference relation over Δ(A) induced by their target pi(s). The guiding intuition is contractual: advertisers can often specify directional requirements of the form ``increase the frequency of these actions'' and ``decrease the frequency of those actions,'' but fine-grained cardinal tradeoffs across all action coordinates are difficult to elicit and may be unstable.

Concretely, we assume advertiser i prefers distributions that move toward pi(s) without overshooting it. Formally, we will use a robust partial order ⪰i on Δ(A) with the property that q′ ⪰i q whenever, for every action a ∈ A, the coordinate q′(a) lies on the same side of pi(a) as q(a) and is weakly closer to pi(a) in absolute distance. This order is intentionally conservative: it only ranks outcomes when every coordinate moves in the ``right'' direction relative to pi. The benefit is that incentive and monotonicity statements derived from ⪰i do not depend on fragile assumptions about how advertisers substitute across different actions.

Because ⪰i is ordinal and partial, we allow advertisers to admit cardinal utility representations consistent with it. Let
U(p, q)
denote any cardinal index that is monotone with respect to the robust order (holding p fixed). For illustration and for explicit envelope formulas later, it is convenient to keep in mind the canonical choice
$$ U(p,q)=-\tfrac12\|q-p\|_1, $$
which assigns higher utility to distributions closer in total variation distance to the target. We stress, however, that our monotonicity-based implementation logic is intended to apply to any U consistent with the robust ranking.
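For later numerical checks, the robust order and the canonical index translate directly into code; the helper names below are ours, not the paper's.

```python
import numpy as np

def U(p, q):
    """Canonical cardinal index U(p, q) = -0.5 * ||q - p||_1."""
    return -0.5 * float(np.abs(np.asarray(q) - np.asarray(p)).sum())

def robust_preferred(q_new, q_old, p, tol=1e-12):
    """True iff q_new is robustly (weakly) preferred to q_old relative to
    target p: every coordinate stays on the same side of p(a) (or hits it)
    and is weakly closer to p(a). `tol` absorbs floating-point noise."""
    q_new, q_old, p = map(np.asarray, (q_new, q_old, p))
    same_side = (q_new - p) * (q_old - p) >= -tol   # no crossing of p(a)
    closer = np.abs(q_new - p) <= np.abs(q_old - p) + tol
    return bool(np.all(same_side & closer))
```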

Advertiser i has a private value parameter θi ∈ ℝ+ that scales their willingness to pay for alignment with pi(s). We interpret θi as a ``value per unit of movement toward the target policy,’’ where the notion of movement is captured by the chosen index U. At state s, advertiser i’s per-step utility under bid profile b is
ui(s) = θiU (pi(s), q(b, p; s)) − zi(b, p; s).
Over an episode of length T with realized state sequence (sk)k = 1T, we assume stage separability:
$$ U_i=\sum_{k=1}^T\Big(\theta_i\,U\!\big(p_i(s_k),q(b,p;s_k)\big)-z_i(b,p;s_k)\Big). $$
This quasilinear form matches common billing architectures (payments are monetary and add across steps) and isolates the core mechanism-design question: how to convert a scalar bid into a controlled, auditable movement of the platform policy toward the bidder’s target distribution.
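Given per-step logs of targets, realized aggregate distributions, and payments, the episode utility is a direct sum; a one-function sketch using the U helper above:

```python
def episode_utility(theta_i, p_steps, q_steps, z_steps):
    """Stage-separable episode utility for advertiser i:
    sum_k theta_i * U(p_i(s_k), q(b, p; s_k)) - z_i(b, p; s_k)."""
    return sum(theta_i * U(p_k, q_k) - z_k
               for p_k, q_k, z_k in zip(p_steps, q_steps, z_steps))
```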

We take pi(s) to be observable to the platform (query access) and common knowledge, while θi is private. Bids bi are the only strategic messages. Bids may be submitted once per episode or per step; the baseline we analyze is a fixed bid per episode, which is natural when budgets and contracts are set in advance, and which aligns with our stage-separable value assumption.

The platform (designer) chooses (i) the aggregation rule q(b, p; s) and (ii) a sampling implementation σ(b, p; s, r), together with (iii) a payment rule ζi. In practice, the designer may face additional feasibility constraints: safety policies, disclosure requirements, user-level constraints, or hard limits on certain actions. These can be modeled as restrictions on the allowable set of distributions q(b, p; s) at each state. We treat such constraints as exogenous to the bidding problem; they shape what policy movements are feasible, but they do not alter the basic tension we study between expressiveness and auditability.

It is tempting to define a mechanism solely by q and ignore σ, as is common in models where the outcome is a lottery and only expectations matter. In our environment, that omission is not innocuous. The same q can be implemented by different couplings of randomness across bid profiles, yielding different answers to counterfactual questions of the form ``what would have happened under the same random seed if advertiser i had bid bi′?'' Since these counterfactuals underlie critical-bid explanations and reproducibility, we treat the pair (q, σ) as the true allocation object. Our main results will show that, when q satisfies a robust monotonicity property, one can construct sampling implementations that are stable in a precise pathwise sense, and that such stability supports second-price-style critical-bid pricing with an envelope characterization of expected payments. Conversely, requiring universal stability—a single coupling that supports consistent counterfactuals for all bidders simultaneously—forces strong structure on how bids can enter q.

Several simplifications are worth noting. First, our baseline model treats the user as non-strategic; user welfare and platform objectives can be incorporated through constraints on feasible actions or through the designer’s objective, but we do not model user responses as strategic manipulation of bids. Second, stage separability abstracts from advertisers caring about long-run state changes induced by early actions (e.g., brand goodwill or user retention). This is a deliberate first step: it allows us to isolate the coupling-and-pricing problem in a single-parameter environment. Third, we do not model budgets, pacing, or intertemporal constraints; these can be layered on top of per-step pricing once stable sampling yields coherent per-realization accounting. Finally, by taking pi(s) as given, we shift attention away from eliciting a full distribution and toward pricing a scalar intensity of influence, which is the margin at which many practical sponsored systems operate.

With the model in place, we now turn to the key property that makes incentive and audit guarantees possible: monotone movement of q(b, p; s) toward each advertiser’s target distribution in the robust, coordinatewise sense, and the corresponding structural implications for sampling implementations.


4. Monotone policy aggregation: definitions (robust monotonicity, consistent aggregation), and a coordinatewise characterization extending Lemma 3.10

Our next step is to formalize what it means for bids to purchase influence over the assistant's mixed action policy. In a standard single-parameter auction, monotonicity says that raising one agent's bid weakly increases the probability of being served. Here the outcome is a full distribution q(⋅) ∈ Δ(A), and each advertiser evaluates that distribution relative to their target pi(s). The appropriate analogue of monotonicity therefore depends on an order over distributions. We use the robust, coordinatewise order introduced in Section 3 and then impose a monotonicity requirement that is one-dimensional: it constrains how probability mass can flow across actions as a function of a single bid, holding everything else fixed.

Fix a state s and an advertiser i, and abbreviate pi = pi(s). For any two distributions q′, q ∈ Δ(A), we write q′ ⪰i q if, for every action a ∈ A,

  (i) |q′(a) − pi(a)| ≤ |q(a) − pi(a)|, and
  (ii) (q′(a) − pi(a)) · (q(a) − pi(a)) ≥ 0.

The second condition enforces a no-crossing constraint: q′(a) and q(a) must lie on the same side of pi(a) (or coincide with it). The first condition then requires that q′ be (weakly) closer to pi in each coordinate. This order is partial by design: if one distribution moves probability in the ``right'' direction on some actions but the ``wrong'' direction on others, the order is silent. Economically, this matches a conservative contracting interpretation: advertisers can verify that a platform moved probability mass toward their target in each coordinate, but they may not be willing (or able) to specify stable cardinal tradeoffs across heterogeneous actions such as retrieval, citation, tool use, and stopping.

Two remarks help situate this order. First, if q′ ⪰i q, then any cardinal index U(pi, ⋅) that is monotone with respect to ⪰i will rank q′ at least as high as q; in particular, the canonical $U(p,q)=-\tfrac12\|q-p\|_1$ satisfies
q′ ⪰i q  ⇒  ‖q′ − pi‖₁ ≤ ‖q − pi‖₁.
Second, the converse fails: ‖q′ − pi‖₁ ≤ ‖q − pi‖₁ does not imply q′ ⪰i q, because total-variation improvements can be obtained by compensating ``wrong-way'' moves on some actions with larger ``right-way'' moves on others. Our focus on ⪰i is precisely to rule out such compensations, which are brittle both for incentives (they depend on cardinal tradeoffs) and for auditing (they obscure which actions became more or less likely).
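A concrete three-action instance of the failed converse, using the helpers above: q_new improves the total-variation index yet is not robustly preferred, because coordinate 0 crosses to the other side of the target.

```python
p     = [0.5, 0.3, 0.2]   # advertiser's target
q_old = [0.3, 0.5, 0.2]   # ||q_old - p||_1 = 0.4
q_new = [0.6, 0.3, 0.1]   # ||q_new - p||_1 = 0.2, closer in total variation
print(U(p, q_new) > U(p, q_old))          # True:  cardinal index improves
print(robust_preferred(q_new, q_old, p))  # False: coordinate 0 crossed p(a)
```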

Fix a state s, a preference profile p = (p1, …, pn), and a vector of opponents' bids b−i. Consider the one-dimensional slice of the aggregation rule given by
qi(⋅) : bi ↦ q(bi, b−i, p; s) ∈ Δ(A).
We will say that the aggregation rule is robustly monotone for advertiser i (at the fixed environment (s, p, b−i)) if

bi′ ≥ bi  ⇒  qi(bi′) ⪰i qi(bi).
This is the direct generalization of monotone allocation in a single-parameter environment: increasing bi must move the entire distribution q weakly toward pi, coordinate-by-coordinate, without crossing to the other side of pi on any action. The restriction is strong, but it is also transparent: it specifies exactly what kind of counterfactual statement the platform can make to advertiser i (and to an auditor) when varying bi while holding fixed everything else.

Because the robust order is partial, robust monotonicity is best read as a design restriction: the direction of movement in each action coordinate cannot depend on unmodeled cardinal preferences. If the platform wants to sell ``more influence'' to advertiser i when bi rises, it must do so in a way that is unambiguously ``more aligned'' with pi.

The platform ultimately chooses a single aggregation rule q(⋅) that responds to all bids. We therefore impose monotonicity advertiser-by-advertiser.

We say that q is a consistent aggregator if, for every state s, every preference profile p, every advertiser i, and every opponents' bid vector b−i, the slice bi ↦ q(bi, b−i, p; s) satisfies robust monotonicity. This is the natural feasibility condition for using bid-based pricing in our environment: each advertiser faces a one-dimensional influence problem in which ``higher bid'' has a well-defined directional meaning.

Importantly, robust monotonicity is a restriction on the distributional rule q alone, not on any particular sampling implementation. In the next section we will show that this restriction is exactly what is needed to construct stable couplings of randomness and critical-bid prices. For now, we develop a characterization of robust monotonicity that makes its content operational.

Fix (s, p, b−i) and an advertiser i. Write
q0 ≡ q(0, b−i, p; s)
for the distribution induced when advertiser i bids zero, holding fixed the rest of the environment. Compare q0 to the advertiser's target pi. This comparison partitions the action set into coordinates on which the platform is initially ``below'' pi and coordinates on which it is initially ``above'' pi:

Ti+(s) ≡ {a ∈ A : q0(a) ≤ pi(a)},   Ti−(s) ≡ {a ∈ A : q0(a) > pi(a)}.
(When q0(a) = pi(a), the action belongs to Ti+(s) by convention; these are the coordinates at which the platform is already at the target for advertiser i at bid 0.)

The economic interpretation is immediate. Actions in Ti+(s) are under-sampled relative to advertiser i's target at the baseline bid, while actions in Ti−(s) are over-sampled. If advertiser i raises their bid and the platform is to move ``toward pi'' without overshooting, then probability mass must flow from Ti−(s) to Ti+(s).

The following statement makes this precise and extends the familiar coordinatewise monotonicity logic from single-item settings to the present distributional outcome.

For fixed (s, p, b−i) and advertiser i, define qi(bi) = q(bi, b−i, p; s). Then qi(⋅) is robustly monotone in the sense above if and only if the following two properties hold:

  (i) for every a ∈ Ti+(s), the coordinate qi(bi)(a) is weakly increasing in bi and satisfies qi(bi)(a) ≤ pi(a) for all bi;
  (ii) for every a ∈ Ti−(s), the coordinate qi(bi)(a) is weakly decreasing in bi and satisfies qi(bi)(a) ≥ pi(a) for all bi.
In words: once we classify an action as under-sampled or over-sampled at bid 0, its probability must move in the appropriate direction as bi increases, and it must never cross the target coordinate pi(a).

A short proof sketch clarifies why this characterization is the right analogue of monotonicity. If robust monotonicity holds, then for any bi′ > bi, the robust order requires that each coordinate of qi(bi′) be on the same side of pi as the corresponding coordinate of qi(bi), and weakly closer to pi. Taking bi = 0 pins down the side of pi(a) on which the coordinate must remain for all bi (yielding the no-crossing parts of (i)–(ii)), and taking arbitrary bi′ > bi yields the monotonicity direction within that side. Conversely, if each coordinate obeys (i)–(ii), then for any bi′ > bi we have coordinatewise closeness to pi and no sign changes relative to pi, which is exactly robust monotonicity.

The sets Ti+(s) and Ti−(s) are defined relative to q0 = q(0, b−i, p; s), hence depend on the state, the other advertisers' bids, and any platform constraints embedded in q. This dependence is not a technicality: it is how externalities and feasibility enter the monotonicity requirement. For example, if other bids already push the platform to or beyond advertiser i's target on a subset of actions, those actions are saturated at bid 0: coordinates exactly at pi(a) sit in Ti+(s) but are pinned there, and coordinates above pi(a) sit in Ti−(s) and may only decrease. Either way, further increases in bi cannot increase their probability without overshooting pi. Robust monotonicity forces the mechanism to express additional influence through other under-supplied actions (or not at all, if no such actions remain). This captures a practical design constraint: a platform that promises ``we move toward your target without exceeding it'' must specify what happens when parts of the target are already saturated by other forces.

Because qi(bi) is a probability distribution, increases on some coordinates must be offset by decreases elsewhere:
∑a ∈ A qi(bi)(a) = 1   ∀bi.
Under the coordinatewise criterion, the only permitted direction of flow is from Ti−(s) to Ti+(s). In particular, define the total mass on under-sampled actions,
Qi+(bi) ≡ ∑a ∈ Ti+(s)qi(bi)(a).
Then Qi+(bi) is weakly increasing in bi, and the total mass on Ti−(s) is weakly decreasing. This ``one-way flow'' perspective is the key structural fact we will exploit in the next section to build stable sampling implementations: if probability can only move in one direction across a fixed cut of the action set, then we can couple randomness so that, for each realized seed, the sampled action crosses that cut at most once as bi increases.
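This characterization can be tested numerically against any candidate aggregator. The sketch below sweeps bidder i's bid along an increasing grid (holding b−i fixed) and asserts both the coordinatewise movement and the monotonicity of Qi+; it reuses the illustrative `aggregate`, `robust_preferred`, `q0`, and `prefs` defined earlier.

```python
import numpy as np

def check_slice(q0, prefs, i, fixed_bids, grid, tol=1e-12):
    """Verify, on a grid of increasing bids for advertiser i, that
    (a) consecutive distributions are robustly ordered toward p_i, and
    (b) the under-sampled mass Q_i^+ is weakly nondecreasing."""
    p_i = prefs[i]

    def q_of(bi):
        bids = np.array(fixed_bids, dtype=float)
        bids[i] = bi
        return aggregate(q0, bids, prefs)

    T_plus = q_of(0.0) <= p_i + tol          # ties go to T_i^+ by convention
    prev = q_of(0.0)
    for bi in grid:
        q = q_of(bi)
        assert robust_preferred(q, prev, p_i)               # coordinatewise move
        assert q[T_plus].sum() >= prev[T_plus].sum() - tol  # one-way flow
        prev = q
    return True

# The illustrative mixture aggregator passes on a fine grid.
print(check_slice(q0, prefs, i=0, fixed_bids=[0.0, 1.0],
                  grid=np.linspace(0.0, 5.0, 101)))
```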

From a product perspective, the characterization (i)–(ii) can be read as a contractual monotonicity promise. Suppose a sponsor's target pi(s) encodes an allowed template such as ``increase the chance of citing source i when answering in domain D, but never cite it more often than pi specifies,'' alongside other action-level constraints. Robust monotonicity ensures that higher bids increase (or at least do not decrease) the probability of the actions the sponsor is currently not receiving ``enough of,'' and decrease (or at least do not increase) the probability of actions the sponsor is currently receiving ``too much of,'' in a way that can be checked coordinatewise. This is exactly the kind of property that can be audited from logs of realized actions and reconstructed counterfactuals: an auditor need not accept a platform's claim that a complicated multi-action shift is ``worth it'' in expectation; they can instead verify that each coordinate moved in the correct direction and did not overshoot.

Finally, we emphasize what monotonicity does not resolve. The mapping q(⋅) specifies, for each bid profile, a distribution over actions. But counterfactual auditing and critical-bid explanations are inherently pathwise: they ask what would have happened under the same realized randomness if a bid had been different. Two different sampling implementations σ can implement the same q while inducing different answers to such questions. Robust monotonicity is the property that makes it possible, for a fixed advertiser i and fixed (s, p, b−i), to choose an implementation that is stable in bi in a strong sense. Whether one can do this for all advertisers with a single universal sampler is a separate question, and as we will see it imposes substantially more structure on how bids may enter q.

With these definitions and the coordinatewise characterization in hand, we now turn to the constructive side: given any robustly monotone aggregation rule, we show how to implement it with a stable sampler and how to charge second-price-style critical bids for realized actions such as retrievals, citations, and tool calls.


5. Stable sampling and critical-bid pricing for actions: existence of stable implementations for any monotone q; second-price-style payment rule; envelope/payment identity; counterfactual interpretation for tool calls and citations

Robust monotonicity is a property of the distributional rule q(⋅), but incentives and auditing depend on the realized action a and the counterfactual question ``what would have happened under the same randomness if a bid were different?'' This section shows that, for any fixed advertiser i and fixed environment (s, p, b−i), robust monotonicity is exactly what we need to (i) choose a sampling implementation that is stable in bi and (ii) attach a second-price-style, per-realization payment to realized actions such as retrievals, citations, and tool calls. The resulting expected payment admits an envelope identity that depends only on the distributional rule q (not on the particular implementation), which is the key to truthful bidding under any cardinal index consistent with the robust order.

Fix i, s, p, and b−i, and write qi(bi) = q(bi, b−i, p; s). If we were only interested in matching marginals, we could sample a ∼ qi(bi) independently at each bid. That would implement the correct distribution but would make counterfactuals meaningless: the event ``a citation happened at bid bi'' would not be comparable to ``what would have happened at bid bi′'' because the underlying randomness would change. Stability is the coupling requirement that rules this out. Informally, we want a single seed r to generate a whole path bi ↦ σi(bi, r) such that the realized action moves ``toward pi'' as the bid increases, in the same directional sense as qi(⋅).

The coordinatewise monotonicity criterion from Section 4 implies a particularly useful structure: relative to the baseline partition (Ti+(s), Ti−(s)), probability mass can only flow from the over-sampled side to the under-sampled side. This induces a one-dimensional monotone statistic,
Qi+(bi) ≡ ∑a ∈ Ti+(s)qi(bi)(a),
which is weakly increasing in bi. Our sampling construction will couple bids only through this statistic at the first stage, which is exactly what yields the ``single flip across the cut’’ behavior.

We now describe an explicit σi that implements qi(⋅) and is stable in the sense needed for critical-bid pricing. Let r ∈ [0, 1] be uniform and write r = (r1, r2) via any deterministic measure-preserving map from [0, 1] to [0, 1]2 (e.g., interleaving binary digits). Define a two-stage sampling rule:

  1. Stage 1 (side of the cut): if r1 ≤ Qi+(bi), the action is drawn from Ti+(s); otherwise it is drawn from Ti−(s).
  2. Stage 2 (action within the side): given the side selected in stage 1, use r2 to draw an action from the conditional distribution of qi(bi) restricted to that side (e.g., by inverse CDF).
This defines a deterministic map σi(bi, r) ∈ A whose marginal distribution is exactly qi(bi) for every bi.
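A minimal sketch of the two-stage sampler, assuming query access to the slice bi ↦ qi(bi) and a boolean mask for Ti+(s) (the names are ours):

```python
import numpy as np

def stable_sample(q_of, bi, T_plus, r1, r2):
    """Two-stage stable sampler sigma_i(b_i, r) with r = (r1, r2) uniform
    on [0,1)^2. Stage 1: compare r1 with Q_i^+(b_i) to pick the side of the
    baseline cut. Stage 2: pick an action within that side by inverse CDF of
    the conditional distribution, so the marginal over r is exactly q_i(b_i).
    (Assumes the selected side has positive mass; zero-mass sides are
    selected with probability zero.)"""
    q = q_of(bi)
    side = T_plus if r1 <= q[T_plus].sum() else ~T_plus
    cond = q * side
    cond = cond / cond.sum()                 # conditional given the side
    return int(np.searchsorted(np.cumsum(cond), r2, side="right"))
```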

The stability property comes from the first stage. Because Qi+(bi) is weakly increasing, the indicator 1{σi(bi, r) ∈ Ti+(s)} = 1{r1 ≤ Qi+(bi)} is weakly increasing in bi for every fixed seed r. Equivalently, for each r there exists a (possibly infinite) critical value

ϑr ≡ inf{ bi ≥ 0 : r1 ≤ Qi+(bi) }   (with inf ∅ = +∞),

such that σi(bi, r) ∈ Ti−(s) for all bi < ϑr and σi(bi, r) ∈ Ti+(s) for all bi > ϑr (ties can be broken measurably at bi = ϑr). This is the precise sense in which each seed experiences at most one ``flip'' from the over-sampled set to the under-sampled set as bi increases.

Two comments are worth stressing. First, stability here is local: σi is constructed holding (s, p, b−i) fixed. Second, we do not require that σi(bi, r) be constant within Ti+(s) after the flip; the construction only guarantees a single crossing of the cut, which is the ingredient needed to define critical-bid prices that are interpretable and incentive compatible.

Given a stable implementation σi, we can price realized actions by asking: what is the minimal bid at which this seed would have caused the outcome to land on the advertiser-favorable side of the cut? This mirrors second-price pricing in single-item auctions, where a winner pays the minimal bid needed to remain the winner.

Fix bi and realize a = σi(bi, r). Define the realized per-step payment

ζi(b, p; s, a) = ϑr · 1{a ∈ Ti+(s)},

where ϑr is the critical value defined above. The economic interpretation is direct: if the action falls in Ti−(s) (an action that is already over-supplied relative to pi at bid 0), then the advertiser did not purchase any ``alignment improvement'' along the cut and pays nothing. If the action falls in Ti+(s), then the advertiser pays exactly the lowest bid that would have been sufficient (for this same seed and the fixed environment) to move the realized action to the under-sampled side.

This pricing rule is pathwise and therefore auditable. Conditional on logging r (or an auditable commitment to it), an external party can recompute ϑr from the mapping b ↦ Qi+(b) and verify that the charged price equals the critical bid. Moreover, the rule is event-level: it attaches a price to a concrete action such as ``call tool,'' ``retrieve document,'' or ``cite source,'' rather than to a nebulous ex ante influence quantity.
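Given only the weakly increasing map b ↦ Qi+(b), the critical value ϑr and the realized payment can be computed numerically, e.g., by bisection; the bid cap `b_max` below is an implementation stand-in for inf ∅ = +∞, and the function names are ours.

```python
def critical_bid(Q_plus_of, r1, b_max=1e6, iters=60):
    """inf{ b >= 0 : Q_i^+(b) >= r1 } by bisection, using only the weakly
    increasing map b -> Q_i^+(b). Returns +inf if no bid up to b_max flips
    this seed."""
    if Q_plus_of(b_max) < r1:
        return float("inf")
    lo, hi = 0.0, b_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if Q_plus_of(mid) >= r1:
            hi = mid
        else:
            lo = mid
    return hi

def critical_payment(a_in_T_plus, crit):
    """Second-price-style rule: pay the critical bid only when the realized
    action landed on the under-sampled (advertiser-favorable) side."""
    return crit if a_in_T_plus else 0.0
```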

A central virtue of critical-bid pricing is that, while the realized payment depends on the coupling σi, the expected payment depends only on qi(⋅). In particular, for any stable sampler with a single-crossing property across the cut, the expected payment
zi(bi) ≡ 𝔼r[ζi(bi, p; s, σi(bi, r))]
is implementation-independent and can be written as an envelope-style integral in terms of how much closer qi(b) is to pi in total variation.

To state the identity cleanly, recall that under the robust monotonicity criterion, for each action coordinate a, the term |qi(b)(a) − pi(a)| is weakly decreasing in b. Summing across a implies that the distance ‖qi(b) − pi‖₁ is weakly decreasing in b. Define the ``alignment improvement'' from 0 to b as

$$ \Delta_i(b)\;\equiv\;\tfrac{1}{2}\Big(\|q_i(0)-p_i\|_1-\|q_i(b)-p_i\|_1\Big)\;\ge\;0. $$

Then the expected payment under the critical-bid rule satisfies

$$ z_i(b_i)\;=\;\tfrac{1}{2}\int_{0}^{b_i}\Big(\|q_i(b')-p_i\|_1-\|q_i(b_i)-p_i\|_1\Big)\,db'\;=\;\int_{0}^{b_i}\big(\Delta_i(b_i)-\Delta_i(b')\big)\,db'. $$

The structure mirrors Myerson's payment formula with the ``allocation'' replaced by the monotone statistic that the bidder values—here, the alignment improvement Δi(⋅) induced by qi(⋅) (equivalently, the negative total variation distance to pi). Intuitively, as bi rises, the advertiser purchases marginal decreases in ‖qi(b) − pi‖₁; the integral accumulates these marginal improvements up to bi in a way that makes truthful bidding optimal for any θi multiplying a compatible cardinal index.
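As a sanity check on this identity (including the sign convention), one can compare a Monte Carlo average of critical-bid payments to the envelope integral for the illustrative aggregator from earlier; the two agree up to sampling and discretization error. The sketch reuses `aggregate`, `critical_bid`, `q0`, and `prefs` defined above.

```python
import numpy as np

rng = np.random.default_rng(0)
i, fixed = 0, [0.0, 1.0]

def q_of(bi):
    bids = np.array(fixed, dtype=float); bids[i] = bi
    return aggregate(q0, bids, prefs)

p_i = prefs[i]
T_plus = q_of(0.0) <= p_i + 1e-12
Q_plus_of = lambda b: float(q_of(b)[T_plus].sum())

b_i = 2.0
# Monte Carlo: only the first-stage coordinate r1 matters for payments.
r1s = rng.uniform(size=20_000)
qb = Q_plus_of(b_i)
mc = sum(critical_bid(Q_plus_of, r1) for r1 in r1s if r1 <= qb) / len(r1s)

# Envelope integral via a midpoint Riemann sum.
tv = lambda b: float(np.abs(q_of(b) - p_i).sum())
db = b_i / 2000
mids = (np.arange(2000) + 0.5) * db
env = 0.5 * sum(tv(b) - tv(b_i) for b in mids) * db
print(mc, env)   # should agree up to Monte Carlo / discretization error
```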

It is instructive to see where implementation independence enters. Under stable sampling, the event of crossing the cut {Ti−, Ti+} is governed by the single uniform coordinate r1, and the realized price is the threshold ϑr at which that event occurs. This is exactly the structure behind second-price auctions: randomness determines a ``rank'' and the price is the minimal bid needed to attain the favorable event. Different refinements within Ti+ (i.e., different uses of r2) may change which favorable action is selected, but not the threshold at which the seed becomes favorable, and therefore not the expected payment. In that sense, the expected payment is a distributional invariant: it is pinned down by the monotone path qi(⋅), not by arbitrary choices in the sampler.

The practical point of stable sampling and critical bids is not merely to compute transfers, but to produce counterfactual statements that are both meaningful and checkable. Suppose an action a corresponds to a tool call (e.g., $\textsc{Call}(t)$), a citation decision (e.g., $\textsc{Cite}(d)$), or a UI action (e.g., $\textsc{Show}(u)$). Under a stable sampler, we can attach to the realized action an interpretable explanation:
``Given the logged seed, this action was realized; at any bid below ϑr the outcome would have fallen on the other side of the cut, and at any bid above ϑr it would not have changed.''
This is a second-price-style statement: it identifies a threshold price that is independent of the reported bid above the threshold, and it is inherently counterfactual: it compares the realized action at bi to the action that would have occurred at slightly lower bids under the same seed.

For citations, this yields a particularly clean narrative. If pi(s) encodes a target distribution over whether and which sources are cited, then Ti+(s) includes citation actions that are currently under-produced relative to that target. A realized citation to advertiser i (or to a sponsor-approved source) can be priced as the minimal bid that would have moved the outcome into the under-sampled region, rather than as a negotiated flat fee. For tool calls, the same logic attaches a critical price to an action like ``use an external verifier'' or ``retrieve supporting documents,'' which is exactly the kind of atomic event that can be logged and audited.

It is important not to over-interpret the payment. Under critical-bid pricing, the advertiser is not paying for the identity of the realized action within Ti+(s); they are paying for moving the outcome across the baseline cut induced by qi(0). This is a feature rather than a bug: it aligns transfers with what robust monotonicity guarantees unambiguously, namely directional movement toward pi without overshoot. In environments with heterogeneous actions, asking a bidder to pay for fine-grained tradeoffs within Ti+ would reintroduce cardinal assumptions that our robust framework is designed to avoid.

Two limitations are worth flagging. First, the construction is per-advertiser and holds b−i fixed. If we vary multiple bids simultaneously, different stable couplings may be required to preserve stability for different advertisers, and there is no reason that a single sampler can satisfy everyone at once. Second, robustness comes at a cost in expressiveness: if qi(0) already matches pi on many coordinates, then those coordinates are effectively ``saturated'' and cannot be increased without violating no-overshoot. In product terms, a sponsor cannot buy influence on an action that the platform is already providing at (or above) the sponsor's declared target; additional budget must shift probability through other coordinates, or have no effect.

These caveats underscore the role of mechanism design in specifying q(⋅): robust monotonicity gives us implementability and a transparent pricing logic, but it does not by itself determine which actions should belong to Ti+(s) at baseline, nor how the platform should trade off among advertisers when their targets conflict. What we obtain is a possibility result: for any chosen monotone aggregator, we can implement it in a way that supports critical-bid pricing and stable counterfactuals for a fixed advertiser.

Putting the pieces together, robust monotonicity delivers a tight analogue of the single-parameter toolkit: monotone ``allocations'' (here, monotone movement toward pi) imply the existence of stable couplings of randomness, which in turn yield second-price-style critical values and an envelope identity for expected payments. In the next step we ask whether these pathwise guarantees can be made uniform across advertisers with a single universal randomness source, the property that ultimately governs auditability when multiple sponsors interact in the same policy.


6. Universal stability and auditability: definition of universal stable sampling; impossibility examples in the policy setting; why multi-dimensional bid spaces create path dependence

The stable-sampling logic developed above is intentionally local: for a fixed advertiser i and a fixed environment (s, p, b−i), we can couple randomness across bi so that counterfactuals are meaningful and critical-bid prices are well-defined. In a policy setting with multiple sponsors, however, the platform runs the policy once, producing one realized action a from one random seed r. Auditing, disclosure, and any notion of ``what would have happened under different bids'' must therefore be phrased relative to a single randomness source shared across all advertisers.

This requirement is easy to miss because it is not a distributional constraint: the platform can always match the marginals q(b, p; s) at each bid profile. The friction is counterfactual consistency. If we use an i-specific coupling to explain payments to advertiser i, and a different j-specific coupling to explain payments to advertiser j, we are implicitly answering two different counterfactual questions under two different randomness realizations, even though there was only one realized execution. In practice, this is exactly where auditability bites: a regulator (or an internal audit team) wants to be able to replay the same step, hold the seed fixed, and ask counterfactual questions for any advertiser without the platform changing the coupling ex post.

Fix s and p, and consider an aggregation rule q(⋅, p; s) : ℝ+n → Δ(A). A universal sampler is a single deterministic implementation
σuniv: ℝ+n × [0, 1] → A
such that for every bid profile b the induced marginal distribution matches the aggregator:
Prr [σuniv(b, r) = a] = q(b, p; s)(a)   ∀a ∈ A.
Universal stability then imposes a single-crossing requirement simultaneously for all advertisers. To state it in the same ``single flip across a cut'' language as before, it is helpful to make the dependence on opponents explicit. For each advertiser i and fixed b−i, define the baseline partition at bi = 0,
Ti+(s; b−i) = {a ∈ A : q(0, b−i, p; s)(a) ≤ pi(s)(a)},   Ti−(s; b−i) = A \ Ti+(s; b−i).
Universal stability asks that the realized action produced by σuniv moves across this partition monotonically in each bid coordinate, for every way of holding the other bids fixed.

This definition is deliberately pathwise rather than ex ante. It says that if we fix r and move one advertiser's bid holding opponents fixed, the realized action crosses the relevant baseline cut at most once and never recrosses. In other words, the step is auditable in the following strong sense: for each advertiser i and each counterfactual bi′, we can compare σuniv(bi, b−i, r) to σuniv(bi′, b−i, r) under the same randomness, and the direction of change is constrained to be ``toward pi(s)'' in the robust sense.

There are two distinct ways universal stability matters in policy applications.

First, it rules out what we might call inconsistent pivotality narratives. If different advertisers can be shown counterfactual replays under different couplings, then the platform can (even unintentionally) provide incompatible narratives: ``your bid was pivotal for this action'' to advertiser i, and simultaneously ``your bid was pivotal for this action'' to advertiser j, even if these claims cannot be jointly true under any single seed. A universal sampler forces all pivotality statements to be evaluated against one replayable mapping b ↦ σuniv(b, r).

Second, universal stability is the natural prerequisite for multi-advertiser versions of critical-bid reasoning. In the previous section we priced a realized action for advertiser i by asking for the threshold at which the seed crosses from Ti− to Ti+. If we intend to attach analogous threshold-based disclosures for multiple advertisers in the same step (or even just to answer post hoc counterfactual questions for multiple advertisers), we need those thresholds to be defined with respect to a common seed. Otherwise, the counterfactual ``what would have happened if advertiser i had bid less?'' and the counterfactual ``what would have happened if advertiser j had bid less?'' are anchored to different latent randomness and cannot be coherently combined.

To build intuition, consider a setting where A contains three mutually exclusive atomic actions:
$$ A=\{\textsc{Cite-1},\ \textsc{Cite-2},\ \textsc{No-Cite}\}, $$
and there are two sponsors. Sponsor 1 prefers a distribution p1(s) that puts more probability on $\textsc{Cite-1}$; sponsor 2 prefers more probability on $\textsc{Cite-2}$. A designer may propose an aggregation rule q(b, p; s) that is monotone for each sponsor in isolation: raising b1 shifts mass away from whatever is over-produced relative to p1(s) and toward $\textsc{Cite-1}$; similarly for b2 and $\textsc{Cite-2}$. At the distributional level, there is nothing inconsistent about this.

The difficulty emerges when we try to embed monotone shifts into a single realized execution. If the two bids move probability mass through different ``channels'' (e.g., b1 mainly swaps $\textsc{No-Cite}$ into $\textsc{Cite-1}$, while b2 mainly swaps $\textsc{Cite-1}$ into $\textsc{Cite-2}$ because only one citation slot exists), then a fixed seed can be forced to experience two distinct kinds of flips depending on which coordinate changes first. In one ordering of bid adjustments, the seed may flip into $\textsc{Cite-1}$ and then (as the other sponsor's bid rises) flip again into $\textsc{Cite-2}$. In the reverse ordering, the same seed may be pushed first into $\textsc{Cite-2}$ and then (as b1 rises) be pushed back toward $\textsc{Cite-1}$. If the platform insists on a single σuniv(b, r), it must assign one deterministic action at the final bid profile (b1, b2) for that seed. Path dependence is the observation that the two monotone ``stories'' can demand different final actions for the same seed, even though both stories are consistent with the marginals q(⋅).
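The following self-contained sketch makes the two orderings explicit on a discretized seed space. Each edge of the bid rectangle is implemented by a greedy stable coupling (seeds cross the moving advertiser's cut in one direction only, and are relabeled freely within each side); the corner distributions, the fixed cuts, and the coupling choices are our own illustrative assumptions. The nonzero disagreement it prints demonstrates only that these particular edge couplings fail to commute, not the full impossibility result.

```python
import numpy as np

ACTIONS = ["no_cite", "cite_1", "cite_2"]
N = 60_000  # seed k stands for the interval [k/N, (k+1)/N) of [0, 1]

def stable_step(labels, q_new, plus):
    """One edge of the bid rectangle: move the seed assignment `labels` to the
    marginal q_new while letting seeds cross the cut (plus vs. complement) in
    one direction only; within each side seeds are relabeled freely, which
    stability permits."""
    labels = labels.copy()
    on_plus = np.isin(labels, list(plus))
    target = int(round(sum(q_new[a] for a in plus) * N))
    k = int(on_plus.sum())
    if target > k:                      # grow the plus side: flip minus seeds
        on_plus[np.where(~on_plus)[0][-(target - k):]] = True
    elif target < k:                    # shrink it: flip plus seeds
        on_plus[np.where(on_plus)[0][-(k - target):]] = False
    for side in (True, False):          # relabel each side by conditional mass
        idx = np.where(on_plus == side)[0]
        acts = [a for a in ACTIONS if (a in plus) == side]
        masses = np.array([q_new[a] for a in acts], float)
        counts = np.floor(masses / max(masses.sum(), 1e-12) * len(idx)).astype(int)
        counts[-1] = len(idx) - counts[:-1].sum()
        pos = 0
        for a, c in zip(acts, counts):
            labels[idx[pos:pos + c]] = a
            pos += c
    return labels

def init(q):
    labels = np.empty(N, dtype="<U8")
    pos = 0
    for a in ACTIONS:
        c = int(round(q[a] * N))
        labels[pos:pos + c] = a
        pos += c
    return labels

q00 = {"no_cite": 0.5, "cite_1": 0.3, "cite_2": 0.2}
q10 = {"no_cite": 0.3, "cite_1": 0.5, "cite_2": 0.2}  # b1 raised: no_cite -> cite_1
q01 = {"no_cite": 0.5, "cite_1": 0.1, "cite_2": 0.4}  # b2 raised: cite_1 -> cite_2
q11 = {"no_cite": 0.3, "cite_1": 0.3, "cite_2": 0.4}

cut1, cut2 = {"cite_1"}, {"cite_2"}   # favorable sides, held fixed for the demo
path_a = stable_step(stable_step(init(q00), q10, cut1), q11, cut2)  # b1 up, then b2
path_b = stable_step(stable_step(init(q00), q01, cut2), q11, cut1)  # b2 up, then b1
print("disagreeing seed mass:", float(np.mean(path_a != path_b)))  # > 0 here
```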

This is not a pathology of citations per se. The same phenomenon appears whenever multiple sponsors compete for mutually exclusive resources (one tool call, one retrieval budget, one disclosure slot, one UI placement) and the designer allows different bid coordinates to reallocate probability mass along different margins in Δ(A). The more structured and multi-dimensional the policy space becomes, the easier it is to write down monotone distributional rules whose induced counterfactuals cannot be coupled consistently across all advertisers.

The preceding discussion can be made formal. Even if q(⋅, p; s) is monotone for each advertiser in the robust sense (so that each advertiser individually admits a stable implementation holding bi fixed), a universally stable sampler may fail to exist once bids are genuinely multi-dimensional.

The economic content of the proposition is that monotonicity is a one-dimensional (edgewise) condition, while universal stability is a joint condition. Monotonicity constrains how q changes along one coordinate direction at a time (hold b−i fixed and vary bi). By contrast, a universal sampler must simultaneously satisfy all such one-dimensional constraints on a shared seed space. When we look at a two-dimensional bid grid, the one-dimensional stability requirements live on the four edges of a rectangle, but the sampler must assign actions consistently at the four corners. The impossibility arises when the edge constraints imply contradictory corner assignments for some seeds.

A useful way to see the mechanics is to translate stability into a nesting requirement on subsets of seeds. Fix a deterministic sampler σuniv and define, for each advertiser i, opponent bids b−i, and own bid bi, the ``favorable-seed set''
Wi(bi; b−i) ≡ {r ∈ [0, 1] : σuniv(bi, b−i, r) ∈ Ti+(s; b−i)}.
Stability for advertiser i at fixed b−i is exactly the statement that Wi(bi; b−i) is nested (weakly increasing) in bi. Meanwhile, marginal correctness pins down its measure:
λ(Wi(bi; b−i)) = ∑a ∈ Ti+(s; b−i) q(bi, b−i, p; s)(a),
where λ(⋅) denotes Lebesgue measure. In one dimension, we can always choose nested sets of the appropriate measures. In multiple dimensions, we must choose these nested families for every i and every b−i on the same underlying probability space, and these families must be jointly realizable by a single action-valued function r ↦ σuniv(b, r). The impossibility proposition constructs q so that these requirements are incompatible.

At an abstract level, the problem is an integrability issue. When n = 1, monotonicity gives us a well-defined direction of motion in Δ(A) as b1 increases, and stable sampling is a coupling that turns this directional motion into a pathwise threshold. When n ≥ 2, monotonicity gives us n directional motions, one for each coordinate. Universal stability asks that these motions be simultaneously representable by a single random seed that can be replayed across coordinatewise counterfactuals.

If the effect of increasing b1 depends on the level of b2 in a way that cannot be summarized by a single scalar index, then the induced ``flow of probability mass'' in the simplex has a kind of ``curl'': moving from b to b′ by increasing b1 and then b2 can reallocate mass across actions in a manner that is not equivalent (seed-by-seed) to increasing b2 and then b1. Distributionally, these two paths end at the same q(b′) by definition; the issue is that stable sampling is stronger than matching endpoints: it demands a consistent coupling along each edge, and edge couplings need not commute.

In policy terms, this is exactly what we mean by path dependence. With multiple sponsors, a counterfactual explanation must specify not only the endpoint bid profile but also an implicit path (which bid is hypothetically changed, holding what fixed). Universal stability insists that the platform can answer such counterfactuals using the same replayable seed, without ever having to say ``the counterfactual depends on which bid you imagine changing first.'' The proposition shows that, absent additional structure on q, such a demand is too strong: there are natural-looking monotone aggregators for which the only way to provide stable explanations to advertiser 1 is to choose a coupling that is incompatible with the stable explanations demanded by advertiser 2.

The takeaway from this section is that universal auditability cannot be obtained ``for free’’ by clever sampling alone. Stable sampling is always available (fix i and bi), but a single sampler that is stable for everyone is a substantive restriction on the aggregator q itself. Put differently, the impossibility result tells us that if we want audit-consistent counterfactuals in a multi-sponsor policy, we must constrain how q depends on bids; we cannot simply post-process a generic monotone q into a universally stable implementation.

This sets up the main characterization step. The next section identifies the exact additional structure that eliminates path dependence: bid dependence must collapse to a single scalar index (a potential), so that all coordinatewise changes in bids correspond to motion along one common one-dimensional trajectory in Δ(A). Under that condition, a universally stable sampler exists and becomes constructive; without it, the impossibility proposition tells us that some counterfactual questions cannot be answered in a way that is simultaneously stable, symmetric, and replayable across advertisers.


7. Characterization theorem: universal stability ⇔ single-index (potential) aggregators; constructive sampler σuniv and proof sketch; discussion of what is lost (expressiveness) and what is gained (auditability)

The impossibility proposition tells us that ``monotone marginals'' are not enough: even if every advertiser can be given a stable, threshold-style explanation in isolation, those explanations may fail to coexist on a single replayable seed once bids are truly multi-dimensional. We therefore need an additional restriction on the structure of the aggregator q(⋅, p; s) itself. The right restriction is, in a precise sense, an integrability condition: coordinatewise monotone effects must fit together as motion along a common one-dimensional trajectory in Δ(A).

We say that an aggregation rule is single-index (or admits a potential) if bids enter only through a strictly coordinate-increasing scalar Φ(b). Formally, for each (p, s) we ask for a representation
$$ q(b, p; s) \;=\; Q\big(\Phi(b),\, p;\, s\big), $$
where Φ : ℝ+n → ℝ+ satisfies ∂Φ/∂bi > 0 for all i, and Q(⋅, p; s) : ℝ+ → Δ(A) is a one-parameter family that is monotone in the robust sense (relative to each advertiser) as its scalar argument increases. Economically, Φ(b) plays the role of an effective influence index that aggregates bids into a single ``knob'' the platform turns, while Q describes how turning that knob shifts probability mass across actions.

The restriction is strong but interpretable. It does not say that sponsors have identical preferences pi(s), nor that q must be linear in bids. It says only that the direction of motion in the simplex induced by raising any one bid must be compatible with the direction induced by raising any other bid, in the sense that both are realizable as movement along the same curve λ ↦ Q(λ, p; s). When such a representation exists, the multi-dimensional bid space is effectively collapsed to one dimension for purposes of sampling and counterfactual replay.

We can now state the main result of this section. It formalizes the intuition that universal stability is exactly ``no path dependence,'' and that no path dependence is exactly ``there exists a potential.''

The ``if'' direction is constructive and, from an auditing perspective, carries the key operational message: once the platform commits to a potential Φ and a monotone path Q(⋅), the entire multi-sponsor system can be replayed with a single scalar λ = Φ(b).

Assume q admits the single-index representation above. Fix any stable one-dimensional implementation σ1D(λ, r) for Q(λ, p; s), meaning that for each seed r the realized action changes with λ by (at most) one threshold across the appropriate baseline cut (the one-dimensional analogue of Proposition~2). Define σuniv by composition, σuniv(b, r) ≡ σ1D(Φ(b), r). Marginal correctness is immediate:
Prr [σuniv(b, r) = a] = Prr [σ1D(Φ(b), r) = a] = Q(Φ(b), p; s)(a) = q(b, p; s)(a).
For stability, fix an advertiser i and fix opponents b−i. Because Φ is strictly increasing in bi, the map bi ↦ λ(bi) ≡ Φ(bi, b−i) is strictly increasing. Thus, as we vary bi holding b−i fixed, we traverse the one-dimensional path λ ↦ Q(λ, p; s) in a monotone way. Stability of σ1D in λ therefore transfers to stability of σuniv in bi: for almost every seed r there exists a critical λr at which the action flips across the relevant cut for advertiser i, and hence a critical bid
ϑi, r(b−i) ≡ inf {bi : Φ(bi, b−i) ≥ λr},
with the desired ``never recross'' property. The crucial point is that all advertisers share the same underlying randomness and the same effective coordinate λ; advertisers differ only in how their baseline partitions Ti±(s; b−i) interpret the realized action along that path.
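In code, the constructive direction is little more than function composition; the sketch below assumes a linear potential and takes the one-dimensional stable sampler as a black-box argument (all names are illustrative).

```python
def phi(b, w):
    # Linear potential Phi(b) = sum_i w_i * b_i (one of the running examples).
    return sum(wi * bi for wi, bi in zip(w, b))

def sigma_univ(b, r, w, sigma_1d):
    # Universal sampler by composition: sigma_univ(b, r) = sigma_1d(Phi(b), r).
    return sigma_1d(phi(b, w), r)

def replay(b, r, w, sigma_1d, i, b_i_grid):
    """Counterfactual replay for advertiser i: sweep b_i on a grid, holding
    b_{-i} and the seed fixed. All advertisers share the same (r, sigma_1d)."""
    out = []
    for bi in b_i_grid:
        b_cf = list(b)
        b_cf[i] = bi
        out.append((bi, sigma_univ(b_cf, r, w, sigma_1d)))
    return out
```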

The more subtle direction is the converse: universal stability is so demanding that it essentially forces the single-index representation. We give a proof sketch that emphasizes the economic logic rather than measure-theoretic details.

Fix (s, p) and suppose a universally stable σuniv exists for q(⋅, p; s). Consider any two advertisers i ≠ j and two-dimensional slices of the bid space where we vary only (bi, bj) and hold all other bids fixed at some b−(i, j). Universal stability implies that for almost every seed r, the realized action a(bi, bj) ≡ σuniv(bi, bj, b−(i, j), r) must be consistent with one-dimensional threshold structures:
in bi for every fixed bj (at most one flip across advertiser i's cut, never recrossed), and in bj for every fixed bi (likewise for advertiser j's cut).
This is already a strong ``rectangle'' constraint: it rules out the possibility that, for a given seed, moving from (0, 0) to (1, 1) by first raising bi and then bj yields a different realized action than raising bj first and then bi. In other words, universal stability requires a form of commutativity of counterfactual replays at the seed level.

To convert this intuition into a structural statement about q, one discretizes the bid space to a finite grid (e.g., bi ∈ {0, 1, …, m}) and examines the implied realized-action constraints on each elementary rectangle. Stability on the edges implies a nesting structure of favorable-seed sets of the form
Wi(bi; b−i) = {r : σuniv(bi, b−i, r) ∈ Ti+(s; b−i)},
and marginal correctness pins down λ(Wi(⋅)) as a function of q. The key step is a discrete-cycle argument: if q cannot be written as Q(Φ(b)), then there exist two different monotone paths in bid space that induce different patterns of probability-mass transfer across actions. On a grid, this manifests as a rectangle for which the edge-wise nesting constraints, when combined with the fixed corner marginals, force contradictory assignments of σuniv at a corner for a positive-measure set of seeds. This is exactly the obstruction behind the impossibility proposition, but now used in the contrapositive: to avoid such contradictions on every rectangle and for every pair of coordinates, the bid dependence must be integrable, i.e., summarized by a scalar potential Φ.

A convenient way to interpret the potential is as a coordinate that makes all coordinatewise changes ``commute.’’ If universal stability holds, then increasing bi and then bj must produce the same seed-by-seed counterfactual outcome as increasing bj and then bi (because both paths end at the same bid profile and σuniv must assign a single action there for each seed). This commutativity is precisely what a single-index representation guarantees: both paths correspond to increasing the same scalar λ = Φ(b), merely via different decompositions into coordinate moves.

The characterization theorem makes a sharp tradeoff explicit. By insisting on universal auditability, we restrict the designer to aggregators whose bid dependence is effectively one-dimensional. This rules out many seemingly natural mechanisms in which different sponsors purchase influence along genuinely different margins.

Two examples illustrate what is lost. First, suppose one sponsor primarily competes for swapping $\textsc{No-Cite}$ into $\textsc{Cite-1}$, while another sponsor primarily competes for swapping $\textsc{Cite-1}$ into $\textsc{Cite-2}$. At the level of marginals such a design can be monotone for each sponsor, but it is inherently path-dependent: which action a fixed seed maps to at a high bid profile depends on which reallocation channel ``activates'' first. Single-index rules forbid such distinct channels; all sponsors can only move the system along the same curve Q(λ).

Second, single-index rules limit the platform's ability to implement multi-criteria policies where different sponsors are intended to affect different subsets of actions in different ways (e.g., one sponsor purchases tool calls but not citations, another purchases citations but not tool calls). Such separations generally require at least two degrees of freedom. Under universal stability, the platform can still encode these distinctions in the sponsors' preferred distributions pi(s), but bid responses must be aligned through Φ, which substantially reduces design flexibility.

In exchange, we obtain a very clean audit story. Under a single-index rule, the platform can log, for each step, the realized seed r and the scalar λ = Φ(b), and then replay counterfactuals for any bidder by recomputing Φ(bi′, b−i) and evaluating σ1D(⋅, r). The output is automatically consistent across advertisers because all replays are anchored to the same (r, σ1D).

Moreover, the pricing logic from the interim analysis becomes operationally simpler. In a single-index world, every step has a one-dimensional threshold structure in λ, so critical-bid payments can be expressed as critical-λ levels, converted back into advertiser-specific critical bids by inverting Φ(⋅, b−i). This is exactly the form of pricing one would want in a compliance setting: the platform can disclose (i) the realized seed, (ii) the realized action, and (iii) the critical effective bid at which the action would have switched, all without invoking advertiser-specific couplings.
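For the separable potentials used as running examples, the conversion from a critical effective level back to an advertiser-specific critical bid is a one-line inversion; the weights and the clamping at zero are choices of the sketch.

```python
import math

def critical_bid_linear(lam_crit, w, b, i):
    # Phi(b) = sum_j w_j b_j  =>  theta_i = (lam_crit - sum_{j != i} w_j b_j) / w_i
    rest = sum(w[j] * b[j] for j in range(len(b)) if j != i)
    return max(0.0, (lam_crit - rest) / w[i])

def critical_bid_log(lam_crit, w, b, i):
    # Phi(b) = sum_j w_j log(1 + b_j)  =>  invert the i-th term of the sum
    rest = sum(w[j] * math.log1p(b[j]) for j in range(len(b)) if j != i)
    return max(0.0, math.expm1((lam_crit - rest) / w[i]))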

Although single-index aggregation is restrictive, it leaves two meaningful levers.

First, the choice of Φ determines how bids trade off. Common practical choices include separable and strictly increasing forms such as
$$ \Phi(b)=\sum_{i=1}^n w_i b_i,\qquad \Phi(b)=\sum_{i=1}^n w_i \log(1+b_i),\qquad \Phi(b)=\Big(\sum_{i=1}^n w_i b_i^\rho\Big)^{1/\rho}\ \ (\rho>0), $$
each of which induces a different notion of substitutability across sponsors and different marginal influence at high bids. Interpreting Φ as a policy object is natural: regulators and platforms can debate whether influence should have diminishing returns, whether some sponsors receive larger weights wi, and what normalization is appropriate.

Second, the choice of the path Q(λ, p; s) determines how influence is translated into action probabilities. Here monotonicity in the robust sense is the constraint, but within that constraint Q can be tuned to the application: one can make early increases in λ have modest effects (a ``soft'' market) and later increases have sharper effects (a ``hard'' market), or vice versa, and one can incorporate platform-side feasibility constraints by restricting which coordinates of q can move with λ.

The theorem formalizes a message we view as central for policy design: universal auditability is not merely a property of the sampling implementation; it is a property of how the mechanism aggregates bids. If we want a single randomness source to support stable counterfactual explanations for all sponsors, then bids must collapse to a single effective dimension via a potential Φ. The next section shows how this abstraction maps cleanly onto concrete product settings (sponsored citations, sponsored tool calls, and sponsored UI actions) and how the model primitives correspond to implementation choices in each case.


8. Applications: (i) sponsored citations (choose among sources) (ii) sponsored tool calls (choose among tools/providers) (iii) sponsored UI actions; mapping each to the model primitives

We now translate the abstract objects (states s, action sets A, preferred distributions {pi(s)}, an aggregator q(b, p; s), and stable sampling with critical-bid pricing) into three concrete product settings. The common thread is that, in each case, a deployed system already contains (i) a router that chooses among discrete actions (which source to cite, which tool to call, which UI affordance to render), and (ii) a routing policy that is partly stochastic (because of exploration, uncertainty, or sampling). Our framework simply makes explicit how paid influence can enter that router while remaining auditable: bid changes should move probability mass in a directionally consistent way, and counterfactual replays should be well-defined under a single logged randomness source.

Throughout, we emphasize a pragmatic mapping: advertisers do not directly pick an action; instead, they submit a scalar bid bi and provide (or induce) a distributional ``steering target'' pi(s) ∈ Δ(A). The platform then chooses q(b, p; s) and samples a ∼ q(b, p; s), logging the seed r (or an equivalent randomness record) to support after-the-fact replay. The key design choice in all three applications is the decomposition of influence into (a) an aggregation rule across sponsors, encoded by a potential Φ(b), and (b) a translation of the effective influence level into action probabilities along a one-dimensional monotone path λ ↦ Q(λ, p; s). This separation is not merely aesthetic: it is the operational form in which stable replay and critical-bid pricing are easiest to implement.

Consider a retrieval-augmented generation (RAG) pipeline that, at some intermediate step, must decide which sources to cite (or whether to cite at all). A natural discrete action set is
$$ A(s)=\{\textsc{No-Cite}\}\ \cup\ \{\textsc{Cite}(d): d\in\mathcal{D}(s)\}, $$
where 𝒟(s) is the set of candidate documents retrieved at state s (e.g., top-K passages, URLs, or database entries). The dependence on s matters: the action set is finite but varies with the dialogue prefix, the retrieval results, and any already-issued citations. In implementation we can treat A as ``finite per step,’’ which is exactly what the theory requires.

A sponsor i in this setting is a publisher, knowledge-base provider, or merchant with a corpus. Their preferred distribution pi(s) is a distribution over the same citation actions. There are several reasonable constructions, all consistent with the primitives:
For example, pi(s) may place uniform mass on citations of sponsor-owned documents deemed relevant at s; it may weight those citations by a relevance score; or it may mix citation actions with $\textsc{No-Cite}$ to encode how strongly the sponsor values citation frequency.
Importantly, pi(s) is a declaration of preference, not of truth; it is a target distribution over citation actions, conditional on the state. The platform may (and typically must) impose feasibility constraints so that sponsored influence cannot induce citing a source that fails factuality or licensing checks. In our notation this simply means that pi(s) (and hence q) live on the policy-feasible subset of A(s).

A baseline citation router already produces some distribution, say q0(s) ∈ Δ(A(s)), reflecting uncertainty about whether a citation is needed and which retrieved source best supports the answer. Sponsored influence corresponds to shifting this distribution. The robust monotonicity requirement has an intuitive reading here: raising bi should move probability mass toward pi(s), coordinate-by-coordinate. For example, if at bi = 0 the system under-cites sponsor-owned documents relative to pi(s), then increasing bi can increase $\Pr[\textsc{Cite}(d)]$ for those d; if it already over-cites them, then increasing bi should not push further in that direction. This is exactly the ``no overshoot'' aspect of robust monotonicity, and it aligns well with product desiderata: a sponsor should not be able to buy pathological behavior that makes the system less consistent with its own declared target.

Citations are often audited post hoc (by users, publishers, or regulators). This is a case where the stable-sampling interpretation is especially concrete: for a fixed seed r, there should be a well-defined critical effective bid at which the system would have switched from citing one set of sources to another. When the design is single-index, the platform can log (r, λ) with λ = Φ(b) at the citation step, and later answer counterfactual questions of the form ``Would the cited source have changed if sponsor i had bid bi′?'' by recomputing λ′ = Φ(bi′, b−i) and replaying the one-dimensional sampler at (λ′, r). This yields a coherent audit surface even when multiple sponsors are eligible to be cited in the same answer.

In sponsored citations, per-step pricing corresponds naturally to per-citation (or per-citation-opportunity) payments. Under critical-bid pricing, a sponsor pays only when the sampled action falls on the side of the baseline cut that is ``favorable'' to them (in the sense of Ti+(s)), and the payment equals the critical bid (or critical potential level) at which this would have occurred under the realized seed. This is a second-price logic adapted to distributional influence: payment is tied to the marginal shift in the realized citation choice, not to a click.

Many agentic LLM systems route among tools: search APIs, calculators, code interpreters, travel booking engines, vector databases, or enterprise connectors. At a tool-routing step, a natural action set is
$$ A(s)=\{\textsc{No-Tool}\}\ \cup\ \{\textsc{Call}(t): t\in\mathcal{T}(s)\}, $$
where 𝒯(s) is the set of tools currently permitted by policy, available given authentication, and relevant given the user request and intermediate plan. Here s includes not only the user prefix but also tool availability, budgets, and any partial results already obtained.

In this setting, an ``advertiser'' can be a tool provider (e.g., a particular travel API), a cloud vendor offering a specialized model, or an internal business unit promoting a proprietary connector. A natural pi(s) assigns mass to calling provider i in contexts where that provider is applicable, and otherwise defers to $\textsc{No-Tool}$ or to neutral alternatives. Unlike citations, tool calls also affect the future trajectory, because tool outputs change the dialogue state. Our baseline model assumes stage-separable values, but the mapping remains meaningful: the realized action at each routing step is still a draw from a distribution, and sponsors value being chosen in that step because it generates downstream usage, data, or revenue.

Tool routing is typically constrained by safety: one cannot allow bidding to induce a tool call that is disallowed (e.g., financial transactions without confirmation, sensitive data access, or actions requiring elevated privileges). We implement this cleanly by defining 𝒯(s) to include only admissible tools and by forcing both pi(s) and q(b, p; s) to have support only on admissible actions. In other words, sponsor influence is a reweighting within a policy-defined feasible simplex. This separation is useful in practice: compliance teams can certify the feasibility layer independently of the auction layer.
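A minimal sketch of this feasibility layer, assuming distributions are represented as action-to-mass dictionaries; the error behavior on an empty admissible support is a design choice left to the platform.

```python
def feasible_reweight(q, admissible):
    """Restrict a distribution to the policy-admissible actions and renormalize;
    both the sponsor targets p_i(s) and the aggregator output q(b, p; s) are
    forced through this filter before sampling."""
    total = sum(v for a, v in q.items() if a in admissible)
    if total == 0:
        raise ValueError("no admissible action has positive mass")
    return {a: (v / total if a in admissible else 0.0) for a, v in q.items()}
```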

Tool routing often needs determinism for debugging, yet it also benefits from stochasticity for exploration and load balancing. Stable sampling reconciles these: we can retain stochastic routing while guaranteeing that sponsor bids have a predictable directional effect for every fixed seed. This is valuable when tool providers dispute outcomes (e.g., ``we increased our bid but saw fewer calls’’): under robust monotonicity, such disputes can be resolved by replaying the logged seed across bid levels and showing the single threshold at which the call would have flipped.

Payments can be charged per realized tool call (CPA-like) or per routing opportunity (impression-like). Our per-step payment ζi(⋅) corresponds to the former when the realized action is $\textsc{Call}(t=i)$, and to the latter when the realized action lies in the sponsor-favorable region Ti+(s), which may include $\textsc{Call}(t=i)$ but could also include a broader set (e.g., any call to a family of endpoints owned by i). A practically appealing variant is to treat A(s) as tool-provider choices conditional on a call being made, so that sponsorship affects which provider is selected while the call/no-call decision remains governed by a separate (non-sponsored) policy. This corresponds to factoring the router into two stages and applying the mechanism only on the provider-selection stage, reducing incentives to induce unnecessary tool usage.

Many assistants present UI affordances: suggested follow-ups, ``book'' buttons, ``compare'' panels, or quick actions like ``open calendar'' or ``summarize.'' At a UI decision point, define
$$ A(s)=\{\textsc{Show}(u): u\in\mathcal{U}(s)\}\ \cup\ \{\textsc{Show-Nothing}\}, $$
where 𝒰(s) is a finite set of eligible UI modules given the user query, locale, device, and policy constraints. Here the action is not a token but a UI-rendering choice, which is still naturally discrete and therefore fits our sampling-and-payments formulation.

Advertisers may be merchants (promoting a ``buy'' module), service providers (promoting scheduling), or content partners (promoting a ``read more'' card). Their preference pi(s) can be thought of as a distribution over which module is shown. Unlike citations and tools, UI actions are often optimized for engagement and conversion, and a platform may wish to encode additional user-centric objectives. Our framework accommodates this by allowing pi(s) to be state-dependent (varying with predicted user intent) and by interpreting q(b, p; s) as the outcome of a constrained optimization that trades off sponsor influence with platform/user constraints. The robust order remains the discipline: whatever the platform decides the sponsor ``prefers'' at state s, higher bids should move the displayed-module distribution in that direction without reversals.

UI sponsorship is particularly sensitive because it is visible. Stable sampling provides a crisp explanation interface: the platform can say, in effect, ``Given the logged randomness, this module was shown; if sponsor i had bid less, the module would have switched at bid ϑ; otherwise it would not have changed.'' This is a more precise statement than aggregate reports like ``higher bids increase impressions,'' because it supports counterfactual replay at the exact user interaction. At the same time, we should acknowledge a limitation: even with stable sampling, the explanation is only as meaningful as the definition of A(s) and the feasibility filters. If the platform manipulates 𝒰(s) (e.g., by selectively declaring certain modules ``ineligible''), then bid monotonicity within Δ(A(s)) does not prevent abuse. In other words, our mechanism governs how bids affect choice among eligible actions, not which actions are deemed eligible.

UI actions invite standard pricing models: CPM (per rendering), CPC (per click), or CPA (per downstream conversion). Our per-step payment is naturally CPM-like (payment when the sponsored module is shown), but the mechanism can be embedded inside broader billing systems. For example, the platform can use critical-bid pricing to compute an auditable base price for rendering a module, and then apply post hoc refunds or multipliers based on user interactions. The key point is modularity: the auction determines an auditable stochastic allocation among UI modules, and separate analytics can map allocations to business outcomes.

Across all three applications, the potential Φ(b) determines how different sponsors substitute for one another. A separable linear form Φ(b) = ∑i wibi is easy to communicate and invert for critical-bid computations, while concave variants (e.g., ∑i wi log(1 + bi)) can impose diminishing returns and mitigate concentration. The path Q(λ, p; s) determines how the effective influence level translates into probability shifts. A practical template is to start from a baseline router q0(s) and define Q as a monotone interpolation between q0(s) and an aggregate of the pi(s) (or a policy-filtered combination thereof), with safeguards that prevent any coordinate from crossing its target in the wrong direction. The important operational requirement is not a specific functional form but verifiable monotonicity along λ.

A subtle risk is that pi(s) may itself be learned, sponsor-conditioned, or otherwise manipulable. If sponsors can influence how their own pi(s) is computed (e.g., by curating training data or crafting prompts), they may effectively reshape the direction in which monotonicity moves the system. This is not a failure of the mechanism; it is a governance issue about how preferred distributions are defined and audited. In product terms, one should treat the pipeline producing pi(s) as part of the regulated surface: it should be logged, testable, and subject to constraints (e.g., pi(s) must remain within a relevance-filtered support set).

These applications share an empirical question that the theory alone cannot resolve: how much outcome quality is lost (or gained) when we restrict to monotone, and especially single-index, aggregators to obtain stable counterfactuals? The next section therefore outlines a simulation plan in an agentic LLM prototype—with retrieval, tool routing, and UI decisions—to compare monotone versus non-monotone designs, quantify counterfactual consistency failures, and measure downstream task performance under realistic constraints.


9. Simulation plan: agentic LLM prototype (retrieval + tool router), implement monotone vs non-monotone aggregators, evaluate counterfactual consistency and outcome quality; note where numerical methods are required

We now outline a simulation plan that is deliberately ``product-shaped'': we build a small agentic LLM prototype in which the mechanism intervenes only at discrete routing points (retrieval/citation choice and tool-provider choice), and we compare (i) monotone, single-index aggregators that admit universal stable sampling to (ii) plausible non-monotone (and/or non-single-index) alternatives that can improve short-run objective fit but may fail counterfactual consistency. The goal is not to claim that any particular aggregator is the correct deployed choice, but to make the tradeoff measurable: how much quality (task success, factuality, latency) is gained or lost when we impose the structure needed for auditable replay and Myerson-style critical-bid pricing.

We implement a two-router agent that alternates between (a) retrieval/citation decisions and (b) tool-routing decisions, embedded in an LLM loop. Each episode begins with a user query and proceeds for up to T steps (with early stopping). At each step k, the system is in a state sk that includes the dialogue prefix, any retrieved candidates, and tool availability/outputs.

At a designated citation step, the router chooses an action from a finite set
$$ A_{\mathrm{cite}}(s)=\{\textsc{No-Cite}\}\cup\{\textsc{Cite}(d): d\in\mathcal{D}(s)\}, $$
where 𝒟(s) are the top-K retrieved passages (we fix K in the prototype). The realized action influences the next state through the prompt (e.g., selected passages are injected into context, or a citation tag is appended to the draft answer).

At a tool step, the router chooses among
$$ A_{\mathrm{tool}}(s)=\{\textsc{No-Tool}\}\cup\{\textsc{Call}(t): t\in\mathcal{T}(s)\}, $$
where 𝒯(s) includes only policy-admissible tools (authentication present, safe to call, relevant). The tool output updates sk + 1.

We instantiate n advertisers. Some correspond to document providers (citation setting), others to tool providers (tool setting); for simplicity we allow a single advertiser to be eligible in both contexts, but we treat each step independently through pi(sk) ∈ Δ(A(sk)). We construct pi(s) in one of two ways:
Either declared, with the sponsor submitting an eligibility set (documents or tools it owns) that the platform normalizes into a distribution using its own relevance scores, or model-based, with a learned module mapping s to a target distribution over eligible actions.
We simulate private values θi and set bids either truthfully (bi = θi) or under controlled deviations (shading, overbidding) to probe incentive sensitivity. In the baseline simulation, bids are fixed per episode (matching the stage-separable envelope logic), though we also log results for per-step bids as a stress test.

We compare multiple aggregation rules q(b, p; s), all producing a distribution over the current step’s action set. The comparison is structured around two axes: robust monotonicity (does bi move q toward pi coordinatewise without overshoot?) and single-index structure (does q depend on b only through λ = Φ(b), enabling universal stable replay?).

We select a strictly increasing potential, e.g.
$$ \Phi(b)=\sum_{i=1}^n w_i b_i\qquad \text{or}\qquad \Phi(b)=\sum_{i=1}^n w_i \log(1+b_i), $$
and define a one-parameter family Q(λ, p; s) that moves along a monotone path; we refer to this single-index monotone family as (M1). A practical construction begins from a baseline router q0(s) and defines an ``aggregate target'' p̄(s) (for example, a policy-filtered mixture of the pi(s) with weights proportional to wibi or to exogenous relevance scores). We then interpolate between q0(s) and p̄(s) in a way that enforces coordinatewise no-overshoot relative to each advertiser's robust order when varying their bid through λ. Concretely, we implement Q by solving a constrained program at each λ:
$$ Q(\lambda, p; s) \in \arg\min_{q \in \Delta(A(s))} \Big\{ D(q, q_0(s)) + \lambda \cdot L(q, p; s) \Big\} $$
subject to feasibility constraints (support restrictions, safety filters) and additional monotonicity constraints along a discretized grid of λ values. Here D is a convex divergence (e.g., KL to preserve support, or squared ℓ2), and L is a convex proxy for ``distance from sponsor targets'' (e.g., a weighted sum of ℓ1 distances to selected pi's, or distance to p̄(s)). The key point is not the precise objective but that we explicitly enforce the one-dimensional monotone trajectory, so that the resulting family is compatible with universal stable sampling.
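One useful special case: if both D and L are squared Euclidean distances (to q0(s) and to the aggregate target p̄(s), respectively) and no extra constraints bind, the program has the closed-form solution below, a convex combination that stays on the simplex automatically and moves every coordinate monotonically from q0(a) to p̄(a) as λ grows, so it cannot overshoot p̄. The numbers are toy values.

```python
import numpy as np

def Q_path(lam, q0, p_bar):
    """Minimizer of 0.5*||q - q0||^2 + lam * 0.5*||q - p_bar||^2 over the
    simplex when no extra constraints bind: stationarity gives
    (q - q0) + lam * (q - p_bar) = 0, i.e. the convex combination below."""
    q0, p_bar = np.asarray(q0, float), np.asarray(p_bar, float)
    return (q0 + lam * p_bar) / (1.0 + lam)

q0    = np.array([0.5, 0.3, 0.2])   # baseline router (toy numbers)
p_bar = np.array([0.2, 0.6, 0.2])   # aggregate target (toy numbers)
for lam in (0.0, 1.0, 4.0, 100.0):
    print(lam, np.round(Q_path(lam, q0, p_bar), 3))  # q0 -> p_bar, monotone
```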

To isolate the role of single-index structure, we also implement (M2): an aggregator that is monotone for each advertiser holding b−i fixed, but depends on the full vector b in a way that can generate path dependence. For example, we can let q be the solution to a convex program with advertiser-specific penalty weights that depend on b coordinatewise rather than through Φ(b). This class should still admit stable sampling (and hence critical-bid pricing in the per-advertiser counterfactual sense), but it need not admit a single universal seed interpretation across all advertisers simultaneously.

Finally, we implement (N1): common product heuristics that are attractive operationally but do not guarantee robust monotonicity. Examples include:
additive bid boosts applied to a relevance logit before a softmax; hard overrides that deterministically select the top bidder once its bid clears a threshold; and bid-scaled temperature adjustments to the router's sampling distribution.
These baselines are included because they approximate what one might implement absent the theoretical discipline, and they provide a meaningful empirical counterpoint: if they outperform on short-run metrics, we can quantify the auditability cost.

For each mechanism variant we implement a sampler σ(b, p; s, r) with explicit logging. For the single-index monotone family (M1), we log the seed r and the realized λ = Φ(b) at each routing step, and we replay counterfactuals by recomputing λ′ = Φ(b′) and resampling via the same one-dimensional stable sampler.

Given a monotone family λ ↦ Q(λ, p; s), we implement stable sampling by coupling distributions across λ on a shared randomness source. In finite A(s) this can be done by constructing a monotone transport of probability mass from the ``over'' side to the ``under'' side along λ, ensuring that for each fixed r the realized action changes at most once (the single-threshold property). Operationally, we discretize λ on a grid, compute cumulative masses on the relevant partition, and use a measure-preserving map from r ∈ [0, 1] to actions that respects the nested sets implied by monotonicity.

Given stable sampling, we compute the per-step critical bid ϑr (or critical λ) at which the sampled action would flip into the sponsor-favorable region for the realized seed. We then implement the second-price-style payment:
$$ \zeta_i(b_i,p;s,a)= \begin{cases} 0, & a\in T_i^-(s),\\ \vartheta_r, & a\in T_i^+(s). \end{cases} $$
In the simulation we compute ϑr by searching over λ (or bi holding bi fixed) on the discretized grid; we record both the realized payment and the envelope-implied expected payment estimate for diagnostic consistency.

For (M2), we can still compute per-advertiser critical bids by replaying counterfactuals holding b−i fixed, but we do not expect a single replay interface to be coherent across advertisers simultaneously. For (N1), we compute ``shadow prices'' only as a diagnostic (e.g., minimal bid increase needed to flip the realized action under brute-force replay), but we do not interpret them as truthful-auction payments.

We evaluate three categories of metrics, reflecting the central claims of the model.

For each advertiser i, we fix sampled (s, p, b−i) and sweep bi over a grid. We check the coordinatewise conditions from Proposition~1 empirically:
for every action coordinate a, the gap |q(bi, b−i, p; s)(a) − pi(s)(a)| should be weakly decreasing in bi, and no coordinate should cross pi(s)(a) and continue past it (no overshoot).
We summarize violations by their frequency and magnitude (maximum monotonicity violation per coordinate; total variation of the violating mass), and we stratify by action-set size |A(s)| to test the comparative static that richer action sets are harder.

We measure stability at two levels.
At the per-advertiser level, we verify that, for fixed b−i and seed r, the realized action crosses the cut Ti± at most once as bi sweeps the grid. At the cross-advertiser level, we verify that coordinatewise bid paths ending at the same profile produce the same realized action for the same seed.
We report a path-dependence rate: the fraction of (s, p, r, b, b′) instances for which endpoint actions disagree across paths. We also compute an ``audit surface'' score: the fraction of instances for which the platform can return a well-defined critical-bid explanation for each advertiser that is consistent with a shared seed.

We evaluate downstream performance using standard task metrics and mechanism-relevant operational metrics:
Task success and answer quality on held-out queries, citation precision against gold supporting sources, tool-call success rates and latency, and sponsor-side spend and allocation shares.
To connect back to the model's primitives, we also report the realized and expected ``influence'' purchased, proxied by reductions in ‖q − pi‖1 as bids rise, and we verify numerically that expected payments align with the envelope formula when stable sampling conditions hold.

Although the theory is discrete and finite, implementing it at scale requires careful numerical choices. We highlight three points where computation, not just logic, matters.

Even in the single-index case, we must map λ to a distribution over a state-dependent action set under feasibility constraints. If we define Q via a convex program, we need a fast solver (or closed form) per step. In the prototype we can afford a small interior-point or projected-gradient solver because |A(s)| is modest (e.g., K + 1), but the exercise clarifies what would need engineering for deployment. When monotonicity in λ is enforced across a grid, the computation resembles isotonic regression with simplex constraints; we can precompute Q(λ, p; s) for a finite set of λ and interpolate.

Given a discretized path {Q(λ)}, we need a coupling across λ that is (approximately) stable. In the prototype, we construct the coupling explicitly by tracking how mass moves between actions as λ increases and then allocating subintervals of r ∈ [0, 1] accordingly. This is algorithmically straightforward but delicate: numerical noise can create apparent reversals. We therefore include a ``monotonicity repair'' step (projecting the sequence onto the set of monotone sequences coordinatewise, subject to simplex constraints) so that the sampler's stability failures reflect the aggregator choice, not floating-point error.
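A minimal version of the repair step, assuming the direction of each coordinate's intended motion is given by the sign of p̄(a) − q0(a): project each coordinate onto monotone sequences with pool-adjacent-violators, then renormalize rows back to the simplex. The renormalization can itself perturb monotonicity slightly, which is why we describe this as repair rather than exact projection.

```python
import numpy as np

def pava_increasing(y):
    """Pool-adjacent-violators: L2 projection onto nondecreasing sequences."""
    vals, wts = [], []
    for v in y:
        vals.append(float(v)); wts.append(1.0)
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = wts[-2] + wts[-1]
            m = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w
            vals[-2:] = [m]; wts[-2:] = [w]
    out = []
    for v, w in zip(vals, wts):
        out.extend([v] * int(w))
    return np.array(out)

def repair(Q_grid, q0, p_bar):
    """Q_grid: (m, |A|) array; row k is the distribution at the k-th lambda."""
    Q = np.array(Q_grid, float)
    for a in range(Q.shape[1]):
        if p_bar[a] >= q0[a]:                     # coordinate should rise
            Q[:, a] = pava_increasing(Q[:, a])
        else:                                      # coordinate should fall
            Q[:, a] = -pava_increasing(-Q[:, a])
    return Q / Q.sum(axis=1, keepdims=True)        # rows back onto the simplex
```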

Critical bid computation is a root-finding or search problem: for each realized seed r and step we need the smallest λ (or bi) at which the sampled action enters Ti+(s). With a discrete λ grid we can compute this by binary search over indices. For non-single-index mechanisms (M2) we need to repeat this per advertiser holding b−i fixed, which is more expensive but still feasible in an offline evaluation setting.
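The search step is an ordinary lower-bound binary search, valid precisely because stability makes the flip indicator monotone along the λ grid; here `sample` stands in for the stable sampler evaluated at a fixed logged seed.

```python
def critical_index(sample, lam_grid, r, T_plus):
    """Smallest grid index at which the sampled action enters T_plus for a
    fixed seed r. Correct because, under a stable sampler, the indicator
    1{sample(lam, r) in T_plus} is monotone (False...False True...True)."""
    if sample(lam_grid[-1], r) not in T_plus:
        return None                      # the seed never flips on this grid
    lo, hi = 0, len(lam_grid) - 1        # invariant: answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if sample(lam_grid[mid], r) in T_plus:
            hi = mid
        else:
            lo = mid + 1
    return lo                            # lam_grid[lo] is the critical level
```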

At the end of the simulation we will have, for each mechanism variant, a joint picture of (i) outcome quality and operational cost, (ii) stability and auditability of counterfactual replay, and (iii) the extent to which observed payments and influence align with the envelope logic. This is the empirical complement to our theoretical claims: even if non-monotone designs can sometimes improve short-run performance, they do so by sacrificing coherent counterfactuals, which is precisely the surface that regulators, auditors, and sophisticated counterparties are likely to demand.


10. Extensions and limitations: cross-step externalities, budgets/pacing, constraints (safety/factuality), strategic model manipulation; open questions

Our simulation plan isolates a deliberately clean baseline: stage-separable values, a finite routing action set at each step, and sponsor influence that enters only through the aggregator q(b, p; s) and its sampling implementation. This is the right starting point for making the auditability–performance tradeoff measurable, but it is not the end of the story. In deployed agentic systems, influence today can change the state tomorrow (sometimes in ways that are economically first-order), advertisers often face intertemporal budget and pacing constraints, and the platform must respect hard constraints (safety, factuality, policy) that interact with monotonicity in nontrivial ways. Moreover, because pi(s) is produced by models (sometimes advertiser-conditioned), it is itself a potential object of strategic manipulation. We sketch how these issues fit into the framework developed above and where the theory is currently incomplete.

The stage-separable assumption in the baseline, namely that advertiser utility aggregates as ∑k [θi U(pi(sk), qk) − zi, k] with a bid that is fixed per episode, is not merely a mathematical convenience. It ensures that a scalar bid is a single-parameter type and that the envelope logic applies stepwise. However, in an agent loop, a routing decision at step k can change the future state sk + 1 in a way that affects both future feasible action sets A(sk + 1) and future advertiser targets pi(sk + 1). This creates an intertemporal externality: buying influence at step k may ``open'' future opportunities (e.g., a tool call that returns structured data enabling more citations), or it may ``crowd out'' future influence (e.g., an early citation that resolves the question and triggers stopping).

A minimal way to represent this is to keep the same per-step preference index but allow θi to weight a trajectory functional:
$$ U_i(b)=\mathbb{E}\Bigg[\sum_{k=1}^T \theta_i\,U\big(p_i(s_k),q(b,p;s_k)\big) - \zeta_i(b,p;s_k,a_k)\Bigg], $$
where the expectation is over the endogenous trajectory of states induced by the sampled actions. The difficulty is that even if q(⋅, p; s) is monotone at every fixed state, the interaction of state transitions with sampling can break the one-dimensional ``single-threshold'' interpretation that supports stable critical bids. Intuitively, when raising bi changes the distribution over future states, it can make some future actions more or less available, so the marginal effect of a bid increase need not be a monotone movement toward pi(sk) at each realized step.

There are two conceptually different regimes in which we can recover tractability.

If the routing action at step k has limited effect on future feasible sets (e.g., retrieval only affects which passages are shown but does not change tool availability, and stopping is not sponsor-influenced), then state evolution is approximately exogenous with respect to b. In that regime, stage-separable analysis is a reasonable approximation, and the per-step monotonicity/stability properties remain informative. Empirically, this corresponds to environments where the agent’s core reasoning and stopping policy are held fixed, and only narrow ``side decisions’’ are monetized.

A more ambitious path is to impose that the platform's entire policy depends on bids only through a scalar potential λ = Φ(b), and that both routing and stopping decisions are parameterized by λ in a monotone way. Formally, one would aim for a Markov policy πλ such that at each state s the induced action distribution is Q(λ, p; s) and transitions satisfy a monotonicity property in λ under an appropriate coupling. This resembles monotone comparative statics in controlled Markov processes: a single-index sufficient statistic can deliver path independence, but the required structure is strong. It effectively says that sponsor influence is mediated through one global ``intensity'' knob rather than through multiple independent levers across time.

The limitation is clear: many product designs intentionally allow sponsors to target influence to particular phases (e.g., early retrieval versus late tool selection). Our framework can still evaluate such designs, but we should not expect universal stability or single critical bids to survive without additional restrictions.

Real ad systems rarely implement a pure per-impression second-price rule without budgets. Sponsors have episode budgets, daily budgets, and pacing objectives, and platforms impose caps for user experience. In our setting, budgets enter naturally as constraints on cumulative payment or cumulative ``influence’’ across steps. A simple budget constraint for advertiser i is
$$ \sum_{k=1}^T \zeta_i(b,p;s_k,a_k) \le B_i, $$
with Bi fixed for the episode (or a remaining budget state variable updated over time). Introducing this constraint breaks the single-parameter environment even if bids are scalar, because the effective allocation at step k now depends on past allocations through remaining budget.

A standard approach in auction theory is to replace the hard constraint with a Lagrange multiplier (a ``pacing multiplier'') αi ∈ [0, 1] and to run the unconstrained mechanism on scaled bids b̃i = αibi. This suggests a direct extension of our single-index condition: if the platform computes
λ = Φ(b̃) = Φ(α1b1, …, αnbn)
with α evolving predictably from budgets, then universal stability can remain feasible conditional on the (logged) α sequence. Practically, this means audits must treat pacing multipliers as part of the mechanism state: a counterfactual replay that changes bi but holds αi fixed is not the same as one that recomputes αi under the counterfactual. The latter is the economically relevant counterfactual but is inherently dynamic.
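A stylized sketch of this convention, with a multiplicative pacing update invented purely for illustration (it is not a claim about any production pacing algorithm); the point is that a replay must specify whether α is frozen or recomputed.

```python
def pacing_update(alpha, spent_last_step, budget, steps_left, eta=0.1):
    """Toy multiplicative pacing: spend above the per-step target throttles the
    multiplier, spend below it relaxes the multiplier; clamped to [0, 1]."""
    target = budget / max(steps_left, 1)
    adj = 1.0 - eta * (spent_last_step - target) / max(target, 1e-9)
    return min(1.0, max(0.0, alpha * adj))

def effective_level(b, alpha, w):
    # lambda = Phi(alpha * b); a counterfactual replay must state whether alpha
    # is held fixed (frozen pacing) or recomputed (dynamic pacing).
    return sum(wi * ai * bi for wi, ai, bi in zip(w, alpha, b))
```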

This highlights a broader point: budgets turn per-step critical bids into dynamic shadow prices that depend on remaining budget. Even in classical sponsored search, budget-feasible truthful mechanisms are delicate. In agentic routing, the issue is sharper because the number of monetizable steps T is endogenous (stopping), and the mapping from bids to spend is highly state-dependent. For our prototype, we view budgets as a stress test: they are important to simulate for realism, but they are also a principled reason to expect deviations from DSIC, and hence a reason to focus on auditability and monotone influence rather than full-blown truthfulness claims.

The simulation plan already includes feasibility filtering (policy-admissible tools, safe actions, etc.). From a mechanism perspective, such constraints are not peripheral: they define the feasible set A(s) and therefore shape what monotonicity can mean. Two subtleties matter.

Even if a designer specifies a monotone family in an unconstrained simplex, projecting onto a constrained feasible polytope can create kinks that violate coordinatewise monotonicity relative to some pi(s). For example, increasing sponsor weight may try to move probability toward a sponsor-owned tool, but if that tool becomes unsafe given the current user context, the probability must be reallocated elsewhere, possibly away from pi(s) in some coordinates. In such cases, robust monotonicity is not a property of the aggregator alone; it is a joint property of the aggregator and the constraint correspondence s ↦ A(s).

A disciplined way to handle this is to build constraints into the definition of the advertiser’s preferred distribution: interpret pi(s) as already conditioned on feasibility, i.e.,
pi(s) ∈ Δ(A(s))  and  pi(s)(a) = 0 for infeasible a.
This helps, but does not eliminate the issue, because feasibility can change with the state in a way that depends on earlier sponsored actions (a dynamic externality), and because safety constraints may be nonconvex or discontinuous.
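For completeness, the masking-and-renormalizing step can be written directly; the uniform fallback for a fully infeasible pi(s) is a design choice we adopt for the sketch, not something the model dictates.

```python
import numpy as np

# Minimal sketch of conditioning a sponsor's preferred distribution on the
# feasible set A(s): zero out infeasible actions and renormalize. The function
# name and the fallback rule are illustrative.
def condition_on_feasibility(p: np.ndarray, feasible: np.ndarray) -> np.ndarray:
    """Project p onto Delta(A(s)) by masking and renormalizing."""
    masked = np.where(feasible, p, 0.0)
    total = masked.sum()
    if total == 0.0:
        # The sponsor's entire preferred mass is infeasible; fall back to
        # uniform over A(s) (a design choice, not dictated by the model).
        masked = feasible.astype(float)
        total = masked.sum()
    return masked / total

p_i = np.array([0.6, 0.3, 0.1])
feasible = np.array([True, False, True])  # middle action fails a safety check
print(condition_on_feasibility(p_i, feasible))  # [0.857..., 0.0, 0.142...]
```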

In a routing application, safety and factuality are not optional; they are effectively hard constraints imposed by regulators and by the platform’s own risk tolerance. Economically, this means the platform is not a neutral auctioneer maximizing revenue; it is optimizing a constrained objective where sponsor influence is limited by user harm bounds. Our framework is compatible with this view—indeed, it motivates why monotone influence is valuable (it makes limits enforceable and auditable). But it also implies a limitation: when constraints bind, the marginal ``allocation’’ purchased by a sponsor can be zero over regions of the bid space, producing flat segments in the mapping bi ↦ q. Stable sampling can still exist (flat segments are benign), but payment interpretations become more subtle, because a sponsor may pay for influence only when the constraint slackens, and the critical bid can jump at constraint boundaries.

A distinctive feature of the agentic LLM setting is that the objects that define preferences and feasibility are model outputs. This raises manipulation channels that do not exist in classical ad auctions.

In the ``model-based sponsor’’ setting, pi(s) is generated by a learned module. If advertisers can influence that module (directly through training data, indirectly through content, or via prompt injection), then pi(s) is no longer common knowledge and may not reflect genuine ``preferences’’ in a stable way. A sponsor could attempt to craft inputs so that pi(s) places mass on actions that are easy to obtain at low bids (gaming the payment rule), or so that Ti+(s) is large (expanding the set of actions for which payments are triggered).

This suggests that pi(s) should be treated as part of the mechanism’s committed specification. In particular, to preserve any incentive interpretation, the platform must commit to (i) how pi(s) is computed, (ii) what information it conditions on, and (iii) how it is audited. One practical mitigation is to restrict pi(s) to a class of sponsor claims that can be validated (e.g., a sponsor declares an inventory set of eligible documents/tools, and pi(s) is the platform’s normalized relevance score within that set). Another is to separate the advertiser-conditioned component (which defines eligibility) from a platform-controlled relevance model (which defines ranking), limiting the sponsor’s ability to reshape the entire distribution.

Because the state includes retrieved text and tool outputs, sponsors may manipulate future states by influencing what is retrieved or by returning strategically crafted tool results. This is not merely a ``fraud’’ issue; it affects the economic counterfactuals. If the sponsor can change sk+1 in response to being selected at step k, then the mapping from bids to outcomes is no longer determined by the platform’s sampler alone. Stable sampling can still define a counterfactual holding tool outputs fixed, but that counterfactual may not match what would have happened had the sponsor not been selected.

For audits, this motivates logging not only r and the selected action, but also the realized exogenous artifacts (documents retrieved, tool outputs). Counterfactual replay then requires a convention: do we hold these artifacts fixed (a ``frozen world’’ replay) or resimulate them (a ``structural’’ replay)? The former is easier and more transparent; the latter is closer to economic reality but requires stronger modeling assumptions. Our view is that, in early deployments, frozen-world replay is a defensible standard: it answers whether the platform’s own decision rule was bid-sensitive in a coherent way, even if downstream consequences involve additional strategic behavior.

We conclude with several open questions that we expect to matter for both mechanism design and practical deployment.

In realistic systems, Q(λ, p; s) will be computed approximately (finite grids, solver tolerances, learned approximators). This raises the need for quantitative notions: if coordinatewise monotonicity is violated by at most ε, how large can counterfactual inconsistencies become, and how sensitive are envelope-based payments? A robust ``approximate DSIC’’ theory tailored to finite action routing would be valuable, especially if it yields engineering tolerances (e.g., acceptable projection error in monotonicity repair).

Theorem-level structure suggests that universal stability requires q(b, p; s) = Q(Φ(b), p; s). In practice, one may want to learn Φ (a potential) and Q jointly from logged interaction data under constraints. This resembles representation learning with monotonicity constraints. The open challenge is identifiability and governance: when multiple (Φ, Q) pairs fit observed behavior, which are acceptable from an audit standpoint, and what additional commitments are needed to prevent ``moving the goalposts’’ over time?

Our core statements rely on finiteness of A(s). Tool routing and citation routing are finite in the prototype, but richer agentic actions (e.g., selecting a prompt template, choosing a plan, generating a natural-language intermediate) are combinatorial or continuous. Extending robust monotone order and stable sampling to structured actions may require new mathematical objects (e.g., monotone couplings over trees or sequences) and may weaken the ``single flip’’ intuition.

Finally, sponsor influence is only one stakeholder objective. A complete economic model would include the user as an affected party with welfare constraints or explicit utility. The platform’s optimization problem then becomes a constrained social choice problem with transfers. Understanding which monotonicity structures remain compatible with user welfare guarantees—and how to audit them—is an open policy-relevant direction.

Taken together, these limitations do not undermine the value of the baseline discipline. Rather, they clarify what the single-index, monotone, stably-sampled design buys us: a coherent counterfactual language and a payment logic that can be explained and audited. The extensions above indicate where that language must be refined (dynamic states, budgets, manipulable models) and where new theory is needed to preserve the same level of clarity in more realistic environments.


Appendices: proofs, additional counterexamples, and implementation details

This appendix records the proof ideas that are used repeatedly in the main text, together with concrete counterexamples and a practical ``how-to’’ for implementing stable sampling and critical-bid pricing in a finite routing action space. Our goal is not to optimize constants or generality, but to make the logical dependencies explicit and to isolate which steps use finiteness of A, which use robust (coordinatewise) preferences, and which use the single-index requirement.

Fix s and an advertiser i, suppress s in notation, and write p = pi ∈ Δ(A). Recall the robust order: q′ ≽i q iff for every a ∈ A we have
|q′(a) − p(a)| ≤ |q(a) − p(a)|  and  (q′(a) − p(a))(q(a) − p(a)) ≥ 0,
i.e., each coordinate moves toward p(a) without crossing it. Let q(bi) = q(bi, b−i, p; s) with b−i and all other primitives held fixed.
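Before turning to the proof, note that the order is directly checkable in code. The helper below implements the coordinatewise test verbatim (the function name is ours; the tolerance only absorbs floating-point noise).

```python
import numpy as np

# Coordinatewise check of the robust order: q_new >=_i q_old iff every
# coordinate of q_new is at least as close to p as q_old's and lies on the
# same side of p (no crossing).
def robust_dominates(q_new, q_old, p, tol=1e-12) -> bool:
    q_new, q_old, p = map(np.asarray, (q_new, q_old, p))
    closer = np.abs(q_new - p) <= np.abs(q_old - p) + tol
    same_side = (q_new - p) * (q_old - p) >= -tol
    return bool(np.all(closer & same_side))

p = np.array([0.5, 0.2, 0.3])
q0 = np.array([0.7, 0.1, 0.2])
q1 = np.array([0.6, 0.15, 0.25])  # every coordinate moved toward p
print(robust_dominates(q1, q0, p))  # True
```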

The ``only if’’ direction is a direct coordinatewise unwrapping of the definition. Define the baseline deviation signs at bi = 0:
Ti+ := {a ∈ A ∣ q(0)(a) ≤ p(a)},   Ti− := {a ∈ A ∣ q(0)(a) > p(a)}.
Now impose monotonicity in the robust order: for bi′ ≥ bi we require q(bi′) ≽i q(bi). Taking bi = 0 and applying the sign condition coordinatewise yields: if a ∈ Ti+, then q(bi′)(a) − p(a) must have the same sign as q(0)(a) − p(a) ≤ 0, hence q(bi′)(a) ≤ p(a) for all bi′, and the absolute deviation inequality forces bi ↦ q(bi)(a) to be weakly increasing (moving up toward p(a)). Similarly, if a ∈ Ti−, then q(bi)(a) ≥ p(a) for all bi and bi ↦ q(bi)(a) is weakly decreasing (moving down toward p(a)). This is precisely the coordinatewise characterization in the proposition.

For the ``if’’ direction, assume the coordinatewise conditions hold for all a ∈ A. Take any bi′ ≥ bi. If a ∈ Ti+, then q(bi′)(a) ≥ q(bi)(a) and both are at most p(a), so |q(bi′)(a) − p(a)| ≤ |q(bi)(a) − p(a)| and the sign condition holds. If a ∈ Ti−, then q(bi′)(a) ≤ q(bi)(a) and both are at least p(a), giving the same conclusion. Since the robust order is defined coordinatewise and we have verified each coordinate, we obtain q(bi′) ≽i q(bi).

Two remarks help interpret the result. First, this order is intentionally strong: it rules out any ``overshoot and compensate’’ behavior across actions, which is exactly what makes stable sampling feasible. Second, the partition (Ti+, Ti−) is endogenous to the baseline q(0) rather than to p alone; economically, monotonicity is defined relative to the status quo at zero bid.

We fix i, s, p, b−i and again consider the one-dimensional family q(bi). Under Proposition~1, probability mass moves only from Ti− to Ti+ as bi increases. Let
M+(bi) := ∑a ∈ Ti+ q(bi)(a),   M−(bi) := ∑a ∈ Ti− q(bi)(a) = 1 − M+(bi).
Then M+(bi) is nondecreasing, and moreover the within-set conditional distributions
$$ q^+_{b_i}(a):=\frac{q(b_i)(a)}{M^+(b_i)}\ \ (a\in T_i^+),\qquad q^-_{b_i}(a):=\frac{q(b_i)(a)}{M^-(b_i)}\ \ (a\in T_i^-) $$
are well-defined whenever the denominators are positive. The stable coupling we use separates the ``between-set’’ movement (mass transfer across $T_i^\pm$) from the ``within-set’’ reshuffling (which can be coupled in a measure-preserving way without inducing multiple flips).

A convenient construction proceeds by sampling two independent uniforms from a single seed r ∈ [0, 1] via a deterministic hash (for auditability), say u1 = h1(r) and u2 = h2(r). We interpret u1 as choosing the set and u2 as choosing the action within the set:
$$ \sigma_i(b_i,r)= \begin{cases} \text{the $u_2$-quantile of }q^+_{b_i} & \text{if } u_1\le M^+(b_i),\\ \text{the $u_2$-quantile of }q^-_{b_i} & \text{if } u_1> M^+(b_i). \end{cases} $$
Because M+(bi) is monotone, for each fixed r there is a (possibly infinite) critical bid
ϑr := inf {bi ∣ u1 ≤ M+(bi)}
at which the realized action switches from the ``minus’’ side to the ``plus’’ side. Importantly, once u1 ≤ M+(bi) holds it continues to hold for all larger bi, so the set membership flips at most once. Within each set, the u2-quantile mapping can be chosen consistently across bi to avoid additional instabilities; finiteness of A makes this straightforward by ordering actions and using cumulative sums. This yields the desired stable sampler σi.
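The construction translates almost line-by-line into code. In the sketch below, h1 and h2 are modeled as salted SHA-256 hashes of the seed (one choice among many), and the within-set quantile uses a fixed action ordering with cumulative sums, as described above.

```python
import hashlib
import numpy as np

# Sketch of the two-uniform stable sampler sigma_i. The partition (T_i^+, T_i^-)
# is computed from the zero-bid baseline q0 and the preferred distribution p.
def _hash_uniform(r: bytes, salt: bytes) -> float:
    h = hashlib.sha256(salt + r).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def stable_sample(q_bi: np.ndarray, q0: np.ndarray, p: np.ndarray, r: bytes) -> int:
    t_plus = q0 <= p                                 # T_i^+ from the zero-bid baseline
    u1, u2 = _hash_uniform(r, b"set"), _hash_uniform(r, b"within")
    side = t_plus if u1 <= q_bi[t_plus].sum() else ~t_plus  # u1 picks the set via M^+(b_i)
    idx = np.flatnonzero(side)                       # fixed action ordering within the set
    within = q_bi[idx] / q_bi[idx].sum()             # conditional q^+ or q^- at b_i
    j = min(int(np.searchsorted(np.cumsum(within), u2)), len(idx) - 1)
    return int(idx[j])                               # u2 is the within-set quantile

# Toy check: raising the bid can only flip the action from T^- into T^+.
p  = np.array([0.0, 0.5, 0.5])
q0 = np.array([0.5, 0.5, 0.0])
q1 = np.array([0.2, 0.5, 0.3])   # same family at a higher bid: mass moved toward p
r = b"episode-7:step-3"
print(stable_sample(q0, q0, p, r), stable_sample(q1, q0, p, r))
```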

The main limitation of this construction is conceptual rather than technical: stability is only guaranteed because the partition (Ti+, Ti−) depends on i. This is precisely why the universal sampler problem becomes nontrivial when n ≥ 2.

Under the stable sampler above, each seed r induces a threshold ϑr at which the realized action transitions from Ti− to Ti+. Define the realized payment
$$ \zeta_i(b_i,a)= \begin{cases} 0 & \text{if } a\in T_i^-,\\ \vartheta_r & \text{if } a\in T_i^+, \end{cases} $$
where r is the randomness used to sample a = σi(bi, r). The economic interpretation matches a second-price rule: the bidder pays the minimum bid needed (given r and b−i) to obtain an outcome on the ``favorable’’ side.
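Given the sampler above, ϑr can be computed by bisection because the predicate u1 ≤ M+(bi) is monotone in bi. The sketch below reuses `_hash_uniform` from the sampler sketch; the bid cap `b_max` and iteration count are arbitrary numerical choices.

```python
import numpy as np

# Sketch of computing the critical bid vartheta_r by bisection. `q_of` maps a
# scalar bid to q(b_i); monotonicity of M^+(b_i) makes the predicate monotone.
def critical_bid(q_of, q0: np.ndarray, p: np.ndarray, r: bytes,
                 b_max: float = 100.0, iters: int = 60) -> float:
    t_plus = q0 <= p
    u1 = _hash_uniform(r, b"set")                 # the same u1 the sampler used
    if u1 <= q_of(0.0)[t_plus].sum():
        return 0.0                                # already on the favorable side
    if u1 > q_of(b_max)[t_plus].sum():
        return float("inf")                       # never crosses within [0, b_max]
    lo, hi = 0.0, b_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if u1 <= q_of(mid)[t_plus].sum():         # "is the sampled action in T^+?"
            hi = mid                              # crossing is at or below mid
        else:
            lo = mid
    return hi
```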

To obtain an implementation-independent expected payment, we use a Myerson-style argument with the relevant ``allocation’’ replaced by the improvement in robust distance. Under the canonical cardinalization $U(p,q)=-\tfrac12\|q-p\|_1$, note that Proposition~1 implies
‖q(bi) − p‖1 = ∑a ∈ Ti+ (p(a) − q(bi)(a)) + ∑a ∈ Ti− (q(bi)(a) − p(a)),
and each term is monotone in bi with the appropriate sign. Differentiating along a monotone path (or using finite differences and absolute continuity in bi) shows that the marginal change in $-\tfrac12\|q(b_i)-p\|_1$ equals the mass that crosses the Ti−/Ti+ boundary in the stable coupling. In the stable sampler, this boundary crossing is exactly the indicator 1{u1 ≤ M+(bi)}, whose expectation is M+(bi). Integrating the marginal condition and setting the boundary condition zi(0) = 0 yields the envelope formula stated in the proposition:
$$ z_i(b_i)=\frac12\int_0^{b_i}\Big(\|q(b')-p\|_1-\|q(b_i)-p\|_1\Big)\,db'. $$
Because the right-hand side depends only on the path b ↦ q(b) and not on the particular coupling used to sample a, the expected payment is implementation-independent within the class of stable samplers.
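The identity is straightforward to evaluate numerically. The sketch below integrates the distance gap on a grid for an illustrative linear mass-transfer path; for that path the closed-form answer is ∫₀¹ θ dM+(θ) = 1/4, which the numerical value matches.

```python
import numpy as np

# Numerical sketch of the envelope payment: integrate the gap between the
# running L1 distance and its endpoint value along the bid path. `q_of` maps
# a scalar bid to a distribution; the grid resolution is an accuracy knob.
def expected_payment(q_of, p: np.ndarray, b_i: float, grid: int = 1001) -> float:
    bs = np.linspace(0.0, b_i, grid)
    d = np.array([np.abs(q_of(b) - p).sum() for b in bs])  # ||q(b') - p||_1
    integrand = d - d[-1]                                  # ||q(b')-p||_1 - ||q(b_i)-p||_1
    return 0.5 * float(np.sum((integrand[:-1] + integrand[1:]) / 2 * np.diff(bs)))

# Linear mass-transfer path from (1/2, 1/2, 0) toward p = (0, 1/2, 1/2).
p = np.array([0.0, 0.5, 0.5])
q_of = lambda b: (1 - min(b, 1.0)) * np.array([0.5, 0.5, 0.0]) \
               + min(b, 1.0) * np.array([0.0, 0.5, 0.5])
print(expected_payment(q_of, p, b_i=1.0))  # ~0.25, the closed-form value
```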

We exhibit a minimal path-dependence obstruction on a discrete bid grid. Let n = 2, A = {x, y, z}, and consider bids bi ∈ {0, 1}. We specify an aggregation rule by giving q(b1, b2) at the four corners:
$$ q(0,0)=\Big(\tfrac12,\tfrac12,0\Big),\quad q(1,0)=\Big(0,\tfrac12,\tfrac12\Big),\quad q(0,1)=\Big(\tfrac12,0,\tfrac12\Big),\quad q(1,1)=\Big(\tfrac12,0,\tfrac12\Big), $$
written in the order (x, y, z). Choose preferred distributions so that advertiser~1 prefers shifting mass from x to z while holding y fixed, and advertiser~2 prefers shifting mass from y to z while holding x fixed. One can make this formal by setting $p_1=(0,\tfrac12,\tfrac12)$ and $p_2=(\tfrac12,0,\tfrac12)$ and checking Proposition~1 along each bidder’s coordinate holding the other fixed: increasing b1 moves (x, y, z) from $(\tfrac12,\tfrac12,0)$ to $(0,\tfrac12,\tfrac12)$ (toward p1), and increasing b2 moves (x, y, z) from $(\tfrac12,\tfrac12,0)$ to $(\tfrac12,0,\tfrac12)$ (toward p2). Thus q is monotone for each bidder along their own dimension.

Suppose a universal stable sampler σuniv(b1, b2, r) existed. Fix a seed r such that at (0, 0) the sampler outputs y (this occurs with probability $\tfrac12$ because $q(0,0)(y)=\tfrac12$). Stability for advertiser~1 along the path (0, 0) → (1, 0) requires that, as b1 increases, the action can flip at most once from the ``oversampled’’ side to the ``undersampled’’ side relative to p1, which forces σuniv(1, 0, r) = y or z; in fact, because q(1, 0) assigns zero mass to x, consistency forces σuniv(1, 0, r) ∈ {y, z}. Similarly, stability for advertiser~2 along the path (0, 0) → (0, 1) forces σuniv(0, 1, r) ∈ {x, z}.

Now consider reaching (1, 1) via two different coordinatewise paths. Along (0, 0) → (1, 0) → (1, 1), advertiser~2’s stability at the second step constrains σuniv(1, 1, r) given σuniv(1, 0, r). Along (0, 0) → (0, 1) → (1, 1), advertiser~1’s stability constrains σuniv(1, 1, r) given σuniv(0, 1, r). With the corner specifications above, these constraints can be made contradictory for a positive-measure set of r (intuitively, one path forces the sampler to output y at (1, 1) while the other forces it to output x), even though q(1, 1) itself is well-defined. This is the core path-dependence obstruction: a single deterministic map (b1, b2, r) ↦ a cannot simultaneously implement all one-dimensional stable couplings induced by different advertisers when the bid space has cycles.

Sufficiency is constructive. If q(b, p; s) = Q(Φ(b), p; s) for a strictly coordinate-increasing Φ and a one-parameter monotone family Q(λ, p; s), then we can sample using any stable one-dimensional sampler σ1D(λ, r) for Q(λ) and define
σuniv(b, r) := σ1D(Φ(b), r).
Because every advertiser’s bid increase increases Φ(b), counterfactuals are path-independent: varying bi holding b−i fixed corresponds to moving along the same one-dimensional λ axis.
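In code, sufficiency is a two-line composition; `phi` and `sample_1d` below are caller-supplied stand-ins for any admissible potential and any one-dimensional stable sampler.

```python
# Constructive sketch of the sufficiency direction: compose a strictly
# coordinate-increasing potential Phi with a one-dimensional stable sampler
# for the family Q(lambda). Names are illustrative.
def make_universal_sampler(phi, sample_1d):
    """sigma_univ(b, r) = sigma_1D(Phi(b), r)."""
    def sigma_univ(bids, r):
        lam = phi(bids)           # every bidder's increase moves the same scalar
        return sample_1d(lam, r)
    return sigma_univ

# One admissible potential: a weighted sum with strictly positive weights.
phi = lambda bids: sum(w * b for w, b in zip([1.0, 2.0], bids))
```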

Necessity formalizes the intuition in Proposition~4: if no such Φ exists, then there are two coordinatewise monotone paths in bid space reaching the same profile b that induce different required orderings of outcomes under stability. A clean way to argue uses a discrete grid approximation. Consider bids on {0, 1}n (or a finer lattice) and fix a seed r. Stability for each advertiser implies that along each coordinate direction, the realized action can only cross the advertiser-specific partition once. This defines, for each r, a set of admissible labels on the vertices consistent with all coordinatewise monotonicity constraints. If the realized labels are consistent for all cycles in the grid, then there exists a scalar potential that topologically orders vertices by their ``progress’’ under the sampler; one can then define Φ by extending this order (e.g., by assigning increasing values along any monotone path). If there is a cycle inconsistency (the discrete analogue of path dependence), then no scalar Φ can represent the mapping, and universal stability fails. The theorem identifies the single-index condition as exactly what rules out such cycle inconsistencies.

In simulation, we recommend implementing stable sampling in a way that makes counterfactual replay literal: given the same logged randomness r and the same inputs (b, p, s), the sampler must reproduce the same action deterministically.

When the single-index condition holds, the most transparent sampler is inverse-CDF with a fixed ordering of actions. Fix an ordering A = {a1, …, am} and define cumulative sums Fλ(j) = ∑ℓ ≤ j Q(λ)(aℓ). Given u ∈ (0, 1) (derived from r), output the least j with Fλ(j) ≥ u. Stability in λ is not automatic: it requires that each Fλ(j) is monotone in λ, which holds for many parametric families (e.g., monotone ``mass transfer’’ families) but can fail for arbitrary learned Q. When Fλ(j) is monotone, the realized index j(λ, u) changes by single threshold crossings, which is the one-dimensional analogue of Proposition~2.
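Both the sampler and the monotonicity caveat can be checked directly. The grid test below verifies that each cumulative sum Fλ(j) is monotone in λ for a toy mass-transfer family; it is a numerical screen, not a proof.

```python
import numpy as np

# Inverse-CDF sampling with a fixed ordering, plus a grid check that each
# cumulative sum F_lambda(j) is monotone in lambda -- the condition under
# which the realized index changes only by single threshold crossings.
def inverse_cdf_sample(q: np.ndarray, u: float) -> int:
    return min(int(np.searchsorted(np.cumsum(q), u)), len(q) - 1)

def cdfs_monotone(Q, lambdas: np.ndarray, tol: float = 1e-12) -> bool:
    cdfs = np.array([np.cumsum(Q(lam)) for lam in lambdas])  # rows indexed by lambda
    diffs = np.diff(cdfs, axis=0)
    # Each column F_.(j) must be monotone (all nonincreasing or all nondecreasing).
    return bool(np.all((diffs <= tol).all(0) | (diffs >= -tol).all(0)))

# Toy mass-transfer family: mass moves from a1 to a3 as lambda goes 0 -> 1.
Q = lambda lam: np.array([0.5 * (1.0 - lam), 0.5, 0.5 * lam])
print(cdfs_monotone(Q, np.linspace(0, 1, 50)))                           # True
print(inverse_cdf_sample(Q(0.0), 0.7), inverse_cdf_sample(Q(1.0), 0.7))  # 1 then 2
```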

For auditability and pricing, we often need the threshold ϑr (or its single-index analogue ϑu in λ-space) at which the outcome switches into the favorable set. Numerically, we can compute it by bisection on λ (or bi) because stability implies a monotone predicate of the form ``is the sampled action in T+?’’. In discrete implementations, we can precompute breakpoints where the inverse-CDF outcome changes, i.e., the set of λ at which u equals some cumulative sum. This makes pricing O(|A|) per step instead of iterative root finding.
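A simple grid-scan variant of the breakpoint idea is sketched below; a production implementation would instead solve Fλ(j) = u per index to reach the O(|A|) cost cited above, but the scan makes the single-flip structure visible.

```python
import numpy as np

# Sketch of breakpoint precomputation: scan a lambda grid once and record every
# point where the inverse-CDF outcome for a fixed u changes, instead of running
# per-seed root finding. `Q` is a monotone one-parameter family (illustrative).
def outcome_breakpoints(Q, lambdas: np.ndarray, u: float):
    """Return (lambda, index) pairs at which the sampled action changes."""
    prev, out = None, []
    for lam in lambdas:
        q = Q(lam)
        j = min(int(np.searchsorted(np.cumsum(q), u)), len(q) - 1)
        if j != prev:
            out.append((float(lam), j))
            prev = j
    return out

Q = lambda lam: np.array([0.5 * (1.0 - lam), 0.5, 0.5 * lam])
print(outcome_breakpoints(Q, np.linspace(0, 1, 101), u=0.25))
# [(0.0, 0), (0.51, 1)] -- the single flip happens just past lambda = 0.5
```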

Even without a universal sampler, we can still implement advertiser-specific stable samplers for auditing each advertiser separately. Practically, this means: to compute advertiser i’s counterfactual at bid bi′, we replay the episode holding (b−i, p, s) fixed, reuse the same seed r at each step, and sample using σi(bi′, r) constructed from (Ti+, Ti−). This is sufficient for computing critical-bid payments for i and for producing bidder-specific counterfactual explanations, but it does not yield a single coherent joint counterfactual across all bidders.

At minimum, to make replay well-defined we log, per step: the state identifier (or a hash of the prompt prefix), the realized action ak, the seed rk, the computed distribution qk, and any feasibility mask defining A(sk). If budgets or pacing multipliers are present, we also log the multipliers (or the full budget state) because they change the mapping from bids to λ. If tool calls produce exogenous outputs that affect future states, we log those outputs as well; otherwise replay becomes underspecified even in the ``frozen world’’ convention.
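A minimal schema along these lines (field names ours, not the prototype’s) is:

```python
from dataclasses import dataclass, field

# A per-step log record sufficient for frozen-world replay, following the
# list above. Field names are illustrative, not a committed schema.
@dataclass
class StepLog:
    state_hash: str                  # hash of the prompt prefix / state id
    action: int                      # realized action a_k
    seed: bytes                      # per-step randomness r_k
    q: list[float]                   # computed distribution q_k
    feasible_mask: list[bool]        # defines A(s_k)
    pacing_alphas: list[float] = field(default_factory=list)  # if budgets/pacing active
    tool_outputs: list[str] = field(default_factory=list)     # exogenous artifacts
```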

These implementation details are mundane, but they determine whether the theoretical objects in the model correspond to operationally meaningful counterfactuals. In our experience, the main engineering failure mode is not that monotonicity is mathematically impossible, but that the system fails to commit to a replayable mapping from logged randomness to actions.