Digital labor markets and consumer platforms increasingly interact with learning agents rather than one-shot, fully rational decision makers. In 2026 this is no longer a speculative premise: customer-support copilots route tickets and decide escalation levels; autonomous sales agents choose outreach intensity and channel mix; content-moderation models select review depth; and ``AI employees'' in enterprise workflows decide whether to run costly checks, request clarifications, or take shortcuts. In each case, a designer (an employer, platform, or regulator) specifies a mapping from observable outcomes to payments, credits, access, or future opportunities, while the agent chooses among actions that differ in cost and in their induced outcome distributions. What makes these environments distinct is not the presence of hidden action alone—moral hazard is old—but the fact that the agent is typically implemented as an algorithm that updates behavior from experience. The designer is therefore not only facing a best response to the current incentives, but also shaping the agent's learning trajectory.
This paper studies a simple but, we argue, foundational question for such settings: when does the ability to vary incentives over time allow the principal to outperform the best static contract, and when does the agent's learning rule neutralize this dynamic advantage? In traditional contract theory, a principal who can commit to a contract often focuses on a static optimum, while dynamics enter through additional informational frictions or intertemporal constraints. In contrast, modern deployments routinely give the principal fine-grained ability to change incentives round by round (A/B tests, personalized bonuses, shifting KPIs, adaptive reward models), while the agent may be running standard online learning or reinforcement learning updates. These dynamics create a new channel: even if each round resembles a static moral-hazard problem, the sequence of incentives can interact with the agent's update rule to create persistent ``inertia'' or predictable transitions across actions. The principal may be able to exploit this inertia, but only insofar as the agent's learning rule fails to defend against certain path-dependent deviations.
Our starting point is a repeated contract environment in which the principal posts a bounded, limited-liability payment scheme each round and the agent selects an action that stochastically determines an observable outcome. This abstraction captures a range of practical situations. In data-labeling pipelines, the outcome can be label quality categories, while the action is the degree of diligence. In online marketplaces, the outcome can be delivery timeliness or customer ratings, while the action is effort or resource allocation. In safety-critical AI workflows, the outcome can be the frequency of detected errors, while the action is whether to run a costly verification tool. In all of these examples, the principal does not directly observe the action but observes outcomes and can condition transfers on them. Crucially, the agent is not modeled as choosing an optimal action each round given the posted contract, but as running a learning algorithm that reacts to experienced payoffs.
The key conceptual distinction we draw is between two broad types of
learning behavior that are both commonly described as
``no-regret,'' yet have sharply different strategic implications for a principal. On one side are learners that guarantee \emph{internal} consistency, often formalized as \emph{swap regret} (or equivalently, vanishing internal regret). These learners not only compete with the best fixed action in hindsight, but are robust to deviations that systematically remap one played action into another. Many algorithms used in game-theoretic learning and equilibrium computation satisfy such guarantees. On the other side are learners that are only protected against deviations to fixed actions---external regret---and whose choice probabilities are driven by accumulated average payoffs in a ``mean-based''
way (including common multiplicative-weights and entropy-regularized
softmax variants). These are prominent in practice because of their
simplicity, stability, and compatibility with partial feedback.
Our main message is a dichotomy. If the agent is protected by swap-regret guarantees, then dynamic incentive schemes do not create additional extractable value: the principal cannot guarantee an average payoff above what is achievable with a single, optimally chosen static contract. In that regime, adaptive tinkering with incentives is essentially neutralized; whatever transient advantages might appear can be undone by the agent's internal-consistency checks, which rule out precisely the kind of path-dependent exploitation that dynamic policies attempt to leverage. This result provides a sharp negative statement for dynamic mechanism design against sufficiently ``defensive'' learning.
By contrast, if the agent is governed by a broad class of mean-based no-regret dynamics, then dynamic incentives can strictly improve the principal's long-run payoff relative to the best static contract. The reason is not that the agent fails to learn; rather, the learning rule may be insufficiently defensive, comparing only to fixed-action alternatives and failing to account for systematic deviations that re-interpret earlier choices. This gap allows the principal to strategically create and then harvest cumulative advantage. Intuitively, the principal can temporarily over-incentivize a costly, high-reward action so that it builds a large lead in the agent's cumulative payoff estimates. Once the learner is ``locked in'' by this lead, the principal can reduce incentives (sometimes dramatically), and the learner will not immediately switch away, because doing so would require overcoming the accumulated advantage. Over time the lead decays—hence a finite harvest window—but during this decay the principal receives the benefit of the high-reward action while paying less than would be required to induce it in a static one-shot sense. Importantly, this is not a knife-edge phenomenon: we identify environments where the improvement is a constant factor, and families where the multiplicative gap grows with the number of available actions under natural boundedness normalizations.
This perspective reframes several empirical observations about incentive systems in algorithmic workplaces and platforms. Designers often report that short-lived bonuses or temporary KPI shifts have surprisingly persistent behavioral effects, even after the incentives are withdrawn. In our framework, such persistence is not merely a behavioral anomaly; it can arise endogenously from standard learning updates that aggregate past payoffs. Conversely, from the agent’s perspective, the relevant defense is not simply ``learning faster’’ in the external-regret sense, but adopting a learning rule that is robust to a richer class of counterfactual deviations. In other words, the agent may need to invest in a more sophisticated form of regret minimization to avoid being steered by the principal’s dynamic policy.
We make four contributions. First, we formalize a repeated principal–agent contracting problem against a class of learning algorithms and define the principal's robust long-run value as the best guarantee achievable by any (possibly adaptive) contract policy. This worst-case lens is motivated by deployment: a principal may not know the exact learning rule embedded in an AI agent (or in a human–AI team), but may know it belongs to a family (e.g., entropy-regularized bandit learners, or internal-regret minimizers used in equilibrium solvers). Second, we show that against swap-regret learners the principal's robust value coincides with a static benchmark: the optimal static contract under best responses. This collapses the dynamic design problem to a one-shot Stackelberg-style contract choice, providing a clean boundary condition for when dynamic contracting is a source of extra power.
Third, we construct environments in which mean-based no-regret learners are systematically exploitable by dynamic policies, yielding principal payoffs that strictly exceed the static benchmark. Our constructions isolate the mechanism behind this effect—the creation and controlled dissipation of cumulative advantage—and show that it can be made quantitatively large. While the underlying mathematics is stylized, the economic takeaway is robust: when the agent’s learning rule aggregates payoffs in a way that creates inertia, a principal can trade early transfers for later rents.
Fourth, we connect these results to an increasingly salient design choice: the agent may be able to select (or configure) its learning algorithm, trading off performance and computational or implementation cost. We therefore study a simple stage-0 meta-game in which the agent chooses between a weaker mean-based learner and a stronger swap-regret learner, paying a complexity cost for the latter. The induced equilibrium has a threshold structure: the agent invests in the defensive, internally consistent algorithm exactly when the expected exploitation loss from using the weaker learner exceeds the cost differential. This captures, in a minimalist way, a broader phenomenon: as principals become more sophisticated in designing adaptive incentives, agents (including AI developers acting on their behalf) have greater incentive to deploy more defensive learning architectures.
From a policy and practice standpoint, our results speak to three audiences. For platform designers and employers, the analysis clarifies when adaptive incentive schemes are merely optimizing within a static frontier versus when they are effectively exploiting the agent's learning process to extract additional surplus. For regulators and auditors, it highlights that ``no-regret'' claims are not interchangeable: an AI agent that is externally no-regret may still be vulnerable to exploitation by an adaptive principal, raising concerns analogous to dark patterns but mediated through learning dynamics. For agent designers, it suggests that internal-regret minimization can be interpreted as a form of defensive investment, potentially worth its computational or engineering costs when interacting with strategic or adaptive counterparties.
We also emphasize limitations. Our baseline model is intentionally spare: the action and outcome sets are finite; contracts are bounded and satisfy limited liability; and the principal’s objective is evaluated in long-run average terms. We abstract away from richer informational structures (e.g., contextual features, state dynamics, multi-agent interactions, or noisy observation of outcomes), and we do not claim that every practical learning system fits neatly into either the swap-regret or mean-based bucket. Rather, our goal is to provide a tractable map from learning guarantees to the principal’s dynamic power, and to identify the precise sense in which stronger regret notions shut down dynamic manipulation channels. Extensions to stateful settings and partial feedback introduce additional technicalities and may require stronger assumptions; we view these as promising directions rather than settled conclusions.
The remainder of the paper proceeds as follows. Section~2 introduces the repeated contract environment, defines the relevant learner classes (including mean-based and swap-regret notions), and formalizes the principal’s robust value and static benchmark. Subsequent sections establish the static optimality result against swap regret, develop the dynamic steerability constructions against mean-based learners, and analyze the algorithm-choice threshold when the agent can endogenously select its learning sophistication. Throughout, we aim to keep the economic logic in view: dynamic contracts matter not because the stage game changes, but because learning turns the history of incentives into a state variable that a principal may, or may not, be able to control.
We study a repeated principal–agent interaction with hidden action and observable outcomes. Time is discrete, indexed by t ∈ [T] := {1, …, T}, and we ultimately evaluate long-run average payoffs as T → ∞.
The agent has a finite action set A = [n]. Action a ∈ A incurs a (known) cost ca ≥ 0 and induces a distribution over observable outcomes O = [m], denoted Fa ∈ Δ(O). In round t, after the agent chooses at, an outcome ot ∈ O is drawn according to ot ∼ Fat.
The principal derives a nonnegative reward from the realized outcome.
We represent rewards by a vector r ∈ ℝ+m,
where r(o) denotes
the reward from outcome o ∈ O. For each action
a, define the principal’s
expected reward
Ra := 𝔼o ∼ Fa[r(o)].
We impose boundedness throughout: there exists r̄ < ∞ such that 0 ≤ r(o) ≤ r̄ for
all o ∈ O. Similarly,
we assume costs are bounded, 0 ≤ ca ≤ c̄
for all a ∈ A. These
bounds ensure per-period utilities are uniformly bounded, which allows
us to use standard o(T) regret notions and
interchange expectations and averages without technical
distractions.
A (one-period) contract is a mapping from outcomes to nonnegative
transfers, i.e.,
p : O → ℝ+, p(o) ≥ 0 ∀o ∈ O.
We identify a contract with its vector p ∈ ℝ+m,
where the oth coordinate is
p(o). The principal’s
feasible contract space is a fixed set 𝒫 ⊆ ℝ+m
satisfying two restrictions that reflect common implementation
constraints.
Limited liability: transfers are nonnegative, so the agent cannot be forced to make payments to the principal. This is built into 𝒫 ⊆ ℝ+m.
Bounded payments: 𝒫 is bounded, so there is an exogenous cap on payments. Concretely, we may assume there exists p̄ < ∞ such that 0 ≤ p(o) ≤ p̄ for all p ∈ 𝒫 and all outcomes o ∈ O. This captures budget or policy constraints (e.g., bonus pools, rate limits on credits, or institutional limits on penalties). It also prevents degenerate constructions in which the principal uses arbitrarily large temporary payments to force behavior.
Given a contract p and
action a, the agent’s
one-period expected transfer is 𝔼o ∼ Fa[p(o)].
We write the agent and principal expected utilities in the stage
interaction as
uA(p, a) = 𝔼o ∼ Fa[p(o)] − ca, uP(p, a) = 𝔼o ∼ Fa[r(o) − p(o)] = Ra − 𝔼o ∼ Fa[p(o)].
It is often helpful to emphasize that uP(p, a) + uA(p, a) = Ra − ca,
so transfers only redistribute surplus between principal and agent,
conditional on the induced action.
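To fix ideas, the following minimal sketch (in Python; the numbers are purely illustrative and not tied to any result in the paper) encodes the stage objects just defined and checks that transfers cancel from total surplus.
\begin{verbatim}
import numpy as np

# Hypothetical two-action, two-outcome instance (numbers are illustrative only).
F = np.array([[0.5, 0.5],    # F_L: outcome distribution of the low action
              [0.1, 0.9]])   # F_H: outcome distribution of the high action
c = np.array([0.0, 0.2])     # action costs c_a
r = np.array([0.0, 1.0])     # principal reward r(o) for each outcome

def u_A(p, a):
    # Agent's expected stage utility: E_{o ~ F_a}[p(o)] - c_a.
    return F[a] @ p - c[a]

def u_P(p, a):
    # Principal's expected stage utility: R_a - E_{o ~ F_a}[p(o)].
    return F[a] @ (r - p)

p = np.array([0.0, 0.6])           # limited-liability contract: pay 0.6 on success only
print(u_A(p, 1), u_P(p, 1))        # agent and principal payoffs if the high action is taken
print(u_A(p, 1) + u_P(p, 1))       # equals R_H - c_H: transfers only redistribute surplus
\end{verbatim}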
The realized contract pt and outcome ot are observed by both parties. The agent’s action at may be unobserved by the principal (the standard moral-hazard formulation), though our benchmark and regret definitions do not require the principal to observe at directly.
We intentionally allow flexibility in what feedback the agent uses to update. In a full-feedback variant, the agent can evaluate the expected utility it would have obtained from every action under the posted contract pt (e.g., because it knows (Fa)a ∈ A and (ca)a ∈ A). In a bandit-feedback variant, the agent only observes the realized utility from the chosen action. Our results are stated in terms of abstract regret properties (external, internal/swap, mean-based), which can be achieved under either feedback model with appropriate algorithms; the substantive distinction for us is not the feedback per se, but the deviation class the learner protects against.
A principal policy π maps
histories to contracts. Let ht denote the
public history prior to round t,
ht := (p1, o1, …, pt − 1, ot − 1).
An adaptive (history-dependent) policy is any mapping π such that pt = π(ht), possibly randomized. An oblivious policy is one that does not depend on realized outcomes, for instance a predetermined sequence (pt)t ≥ 1 or a stationary randomized rule that draws pt i.i.d. from some distribution over 𝒫. Adaptive
policies capture common practices such as A/B testing with iterative
updates, bonus schedules that respond to performance metrics, or dynamic
KPI reweighting.
We take the principal to commit to a policy at the outset (at least conceptually), and we evaluate what payoff this policy guarantees against a specified class of agents. This worst-case posture is motivated by environments where the principal may not know the agent’s exact update rule, but can plausibly restrict it to a class (e.g., entropy-regularized online learners, or internal-regret minimizers used for strategic robustness).
The agent is represented by a learning algorithm 𝒜 that, given the observed history and the current contract, selects actions (possibly at random). We do not impose expected-utility maximization period by period; instead, we assume the agent satisfies a regret guarantee relative to an appropriate benchmark class of deviations.
For each round t and action
a, define the agent’s utility
from playing a against the
posted contract pt:
UA(t, a) := 𝔼o ∼ Fa[pt(o)] − ca.
Let at be
the realized action chosen by the algorithm in round t. The agent’s realized utility is
pt(ot) − cat,
while UA(t, at)
is the expectation conditional on (pt, at).
We will refer to cumulative expected utilities
$$
\sigma_t(a)\;:=\;\sum_{s=1}^{t-1} U_A(s,a),
$$
which play a central role in mean-based dynamics. Intuitively, σt(a)
is the score assigned to action a by a learner that aggregates past
expected payoffs.
The standard no-regret condition requires that, for every action
a ∈ A, the learner’s
cumulative utility is asymptotically at least that of always playing
a. Formally, the external
regret after T rounds is
$$
\mathrm{Reg}(T)
\;:=\;
\max_{a\in A}\sum_{t=1}^T U_A(t,a)\;-\;\sum_{t=1}^T U_A(t,a_t),
$$
and a no-regret learner satisfies Reg(T) = o(T).
External regret captures robustness to deviations that switch the entire action sequence to a single fixed action. It does not protect against deviations that systematically revise actions in a history-dependent way (for instance, ``whenever I played action i, I should instead have played j'').
To formalize stronger, internally consistent learning, we use swap
regret. For any mapping (``swap'') $\phi:A\to A$, consider the deviation that replaces each played action $a_t$ with $\phi(a_t)$. The swap regret is
\[
\mathrm{SwapReg}(T)
\;:=\;
\max_{\phi:A\to A}\sum_{t=1}^T U_A\big(t,\phi(a_t)\big)\;-\;\sum_{t=1}^T U_A(t,a_t).
\]
A no-swap-regret learner satisfies $\mathrm{SwapReg}(T)=o(T)$. This notion is strictly stronger than external regret and is closely tied to correlated-equilibrium obedience constraints in repeated play. Economically, it represents an agent that can defend against \emph{path-dependent} exploitation: if a principal's policy benefits from inducing the agent to play different actions at different times, swap regret asks whether the agent could have systematically ``re-labeled'' those choices to do better.
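To make the two deviation classes concrete, the following unoptimized sketch (the array layout and function names are our own convention) computes both regret notions from a history of per-round expected utilities UA(t, ·) and played actions; it uses the fact that the maximizing swap map can be chosen action by action.
\begin{verbatim}
import numpy as np

def external_regret(U, actions):
    # Illustrative helper. U: T x n array with U[t, a] = U_A(t, a);
    # actions: length-T sequence of played actions.
    T = len(actions)
    realized = U[np.arange(T), actions].sum()
    return U.sum(axis=0).max() - realized

def swap_regret(U, actions):
    # Best gain over all swap maps phi: A -> A applied to the played sequence.
    # The maximization decomposes action by action, so no enumeration of n^n maps is needed.
    T, n = U.shape
    actions = np.asarray(actions)
    realized = U[np.arange(T), actions].sum()
    best_swapped = sum(U[actions == i].sum(axis=0).max() for i in range(n))
    return best_swapped - realized
\end{verbatim}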
Many widely used online learning rules—including multiplicative weights and softmax choice with entropy regularization—make action probabilities an explicit function of cumulative scores such as σt(⋅). For our purposes, we isolate a behavioral implication rather than a specific algorithm.
We say a learner is mean-based if there exists a function γ(T) = o(1) such that, for all times t ≤ T and actions i, j ∈ A,
σt(i) ≤ σt(j) − γ(T)T  ⇒  Pr [at = i] ≤ γ(T).
That is, if an action i is
behind another action j by a
sufficiently large cumulative margin, then the learner assigns i only vanishing probability. This
condition captures the inertia central to our dynamic-contract
constructions: once an action has accumulated a large advantage in
cumulative utility, it remains likely to be played even if it is no
longer optimal in the instantaneous sense. At the same time, the
condition is weak enough to include a broad family of algorithms and to
permit persistent (but diminishing) exploration.
A canonical example is a softmax learner that selects actions
according to
$$
\Pr[a_t=a]\;\propto\;\exp\!\left(\frac{\sigma_t(a)}{\tau}\right),
$$
where τ > 0 is a
temperature (equivalently, an entropy-regularization parameter). Smaller
τ makes the learner more
greedy with respect to cumulative scores, while larger τ increases exploration. Such
learners are mean-based in the above sense (with γ(T) depending on τ and the bounded payoff range), and
can satisfy external no-regret under mild conditions. However, unless
explicitly designed for internal consistency, they need not achieve
no-swap-regret, and thus may remain vulnerable to dynamic steering.
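A minimal implementation sketch of such a softmax learner (full-feedback form; the class name and interface are our own illustration, not a prescribed algorithm):
\begin{verbatim}
import numpy as np

class SoftmaxLearner:
    # Illustrative mean-based learner: Pr[a_t = a] proportional to exp(sigma_t(a) / tau).
    def __init__(self, n_actions, tau=1.0, seed=0):
        self.sigma = np.zeros(n_actions)       # cumulative expected utilities sigma_t(a)
        self.tau = tau
        self.rng = np.random.default_rng(seed)

    def act(self):
        z = self.sigma / self.tau
        probs = np.exp(z - z.max())            # subtract the max for numerical stability
        probs /= probs.sum()
        return self.rng.choice(len(self.sigma), p=probs)

    def update(self, utilities):
        # utilities[a] = U_A(t, a): expected utility of each action under the posted contract.
        self.sigma += utilities
\end{verbatim}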
Because our benchmark comparisons will involve static contracting, we
define the agent’s best-response correspondence to a contract p ∈ 𝒫:
BR(p) := arg maxa ∈ A{𝔼o ∼ Fa[p(o)] − ca}.
The set-valued nature of BR(p) matters: if
multiple actions tie for highest expected utility, then a learning agent
might converge to any of them (or cycle), and a robust principal should
evaluate the worst case. This motivates the ``mina ∈ BR(p)’’
convention in our static benchmark.
The principal’s realized cumulative payoff under policy π and agent algorithm 𝒜 is $\sum_{t=1}^T
u_P(p_t,a_t)$, where pt = π(ht).
We evaluate long-run performance in expected average terms, taking
expectations over outcome randomness and any randomization in π or 𝒜. For a learner class 𝔏, define the principal’s robust long-run
value as
$$
V(\mathfrak L)
\;:=\;
\sup_{\pi}\ \inf_{\mathcal A\in\mathfrak L}\
\liminf_{T\to\infty}\frac{1}{T}\,
\mathbb E\!\left[\sum_{t=1}^T u_P(p_t,a_t)\right].
$$
Two aspects of this definition are deliberate. First, we take a supremum
over policies rather than static contracts, so V(𝔏) measures the maximal benefit of
dynamic contracting when facing learner class 𝔏. Second, we take an infimum over algorithms
in the class, which encodes robustness: the principal can rely only on
the properties shared by all learners in 𝔏 (e.g., no-swap-regret or mean-based
behavior), not on detailed implementation.
To anchor the analysis, we also define an optimal static benchmark. If the
principal commits to a single contract p ∈ 𝒫 in every round, then any
sufficiently patient learner with vanishing regret should asymptotically
concentrate on best responses to p. The worst-case long-run payoff
from static commitment is therefore captured by
Vstatic := maxp ∈ 𝒫 mina ∈ BR(p)uP(p, a).
This benchmark is the natural analogue of a Stackelberg value in a
one-shot principal–agent problem with possibly non-unique best
responses: the principal chooses p, and Nature selects the least
favorable best response for the principal. In Section~3 we relate this
object to standard one-shot contracting logic and show why, against
sufficiently defensive learning (notably, no-swap-regret), dynamic
policies cannot systematically outperform it.
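When 𝒫 is finite, or is discretized to a grid, the benchmark Vstatic can be computed by direct enumeration. A sketch (illustrative only; the tie-detection tolerance and the environment numbers are hypothetical choices):
\begin{verbatim}
import numpy as np

def static_benchmark(F, c, r, contracts, tol=1e-9):
    # V_static = max over contracts p of min over a in BR(p) of u_P(p, a).
    best = -np.inf
    for p in contracts:
        u_a = F @ p - c                               # agent's expected utility of each action
        u_p = F @ (r - p)                             # principal's expected utility of each action
        br = np.flatnonzero(u_a >= u_a.max() - tol)   # (approximate) best-response set BR(p)
        best = max(best, u_p[br].min())               # pessimistic tie-breaking within BR(p)
    return best

# Hypothetical two-action, two-outcome environment with success bonuses on a grid.
F = np.array([[0.5, 0.5], [0.1, 0.9]])
c = np.array([0.0, 0.2])
r = np.array([0.0, 1.0])
grid = [np.array([0.0, x]) for x in np.linspace(0.0, 1.0, 101)]
print(static_benchmark(F, c, r, grid))   # 0.5 here: the low action is the static optimum
\end{verbatim}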
Our notion of ``dynamic advantage'' is only meaningful relative to a clean static baseline. In this section we therefore unpack the benchmark
\[
V_{\mathrm{static}}
\;:=\;
\max_{p\in\mathcal P}\ \min_{a\in BR(p)} u_P(p,a),
\]
and explain why it is the appropriate one-shot Stackelberg value in our contracting environment. We also clarify when, and in what sense, repeatedly posting a single contract replicates the one-shot outcome against a learning agent. These observations will be the hinge for the impossibility result in the next section: once the agent's learning rule is sufficiently ``defensive''
(in particular, no-swap-regret), the principal is effectively pushed
back to the static benchmark.
Consider the underlying stage interaction in isolation. The principal first commits to a contract p ∈ 𝒫, after which the agent chooses an action a ∈ A that maximizes its expected utility uA(p, a). This is the standard Stackelberg timing used in moral-hazard models: incentives are set ex ante, actions follow.
Two modeling choices matter for the value the principal can guarantee. The first is that best responses may not be unique. The second is that, in our robust perspective, the principal does not control the agent’s tie-breaking (nor do we assume the principal can predict it). We therefore adopt the Stackelberg convention: when multiple actions maximize uA(p, ⋅), Nature selects the one that is worst for the principal. Formally, the principal’s guaranteed payoff from contract p is mina ∈ BR(p)uP(p, a), and optimizing over p yields Vstatic.
This pessimistic tie-breaking is conservative but economically natural in settings where (i) the principal does not observe actions and cannot condition on them, and (ii) the agent is represented by an algorithm whose fine-grained selection among near-ties may be opaque. In applications, a principal may hope for favorable tie-breaking (e.g., through communication, norms, or default choices), which corresponds to maxa ∈ BR(p)uP(p, a). Our subsequent results are stated for the pessimistic benchmark because it is the correct comparator for worst-case guarantees and because it interacts cleanly with regret-based learning: learning rules generally guarantee performance relative to deviation classes, not favorable tie-breaking for the principal.
It is helpful to relate Vstatic to the familiar
one-shot incentive-design program. Fix a target action a. In a textbook formulation, the
principal would like to minimize the expected transfer subject to
incentive constraints ensuring a is optimal for the agent:
uA(p, a) ≥ uA(p, b) ∀b ∈ A.
If a is strictly optimal under
the chosen p, then the agent’s
best response is unique and the principal’s payoff is simply uP(p, a) = Ra − 𝔼o ∼ Fa[p(o)].
When strict optimality is not guaranteed (because 𝒫 is coarse, or because the optimal p lies on an indifference boundary),
then the principal must confront the possibility that the induced
best-response set contains multiple actions with different implications
for uP.
The object Vstatic can be viewed as ``one-shot contracting with adverse tie-breaking.'' It asks the principal to choose p not merely to make some action optimal, but to ensure that every action that is optimal for the agent under p yields the principal at least the guaranteed value. In other words, a contract is only as good as its worst best response.
This distinction matters even in classical environments. For example, limited liability and bounded transfers can force indifferences: the principal may be unable to separate two actions that generate similar outcome distributions or differ in costs by less than the available incentive power. In such cases, the principal may be able to make a desirable action a optimal but may be unable to prevent an alternative action b (with worse uP) from also being optimal. The pessimistic Stackelberg benchmark correctly treats this as a real implementation constraint.
We now connect the one-shot benchmark to the repeated interaction.
Suppose the principal posts a contract p in every round: pt ≡ p.
Then the agent faces a stationary payoff environment in which, for each
action a, the per-period
expected utility is constant:
UA(t, a) = uA(p, a) ∀t.
In such a stationary environment, standard regret guarantees imply that
the agent’s long-run behavior concentrates on (approximate) best
responses to p, so long-run
payoffs coincide with the one-shot logic up to vanishing errors.
To make this precise, let ϵT := Reg(T)/T
denote average external regret. Since UA(t, a)
does not vary with t under a
fixed contract, the external no-regret condition implies
$$
\frac{1}{T}\sum_{t=1}^T U_A(t,a_t)
\;\ge\;
\max_{a\in A} u_A(p,a)\ -\ \epsilon_T.
$$
Let a⋆ ∈ arg maxauA(p, a)
be an optimal action, and define the empirical distribution of play
qT ∈ Δ(A)
by
$$
q_T(a)\;:=\;\frac{1}{T}\sum_{t=1}^T \Pr[a_t=a].
$$
Taking expectations over the agent’s randomization and using
stationarity yields
∑a ∈ AqT(a) uA(p, a) ≥ uA(p, a⋆) − ϵT.
This inequality immediately controls the mass placed on suboptimal
actions. If an action i is
Δ-suboptimal, in the sense
that
uA(p, a⋆) − uA(p, i) ≥ Δ > 0,
then rearranging gives
$$
q_T(i)\ \le\ \frac{\epsilon_T}{\Delta}.
$$
Thus, whenever the best response is unique (or, more generally,
separated by a positive gap from the rest), external no-regret forces
the long-run frequency of non-best-response actions to vanish. In that
case, repeatedly posting p
replicates the one-shot prediction: asymptotically the agent plays the
unique best response, and the principal’s average payoff converges to
uP(p, a⋆).
The remaining subtlety is precisely the one encoded by mina ∈ BR(p).
When BR(p)
contains multiple actions, the above argument only implies that qT concentrates
on that set, not which element of the set is selected. Consequently, the
principal’s long-run payoff under a fixed contract p converges (along subsequences)
to
∑a ∈ BR(p)q(a) uP(p, a)
for some limiting distribution q supported on BR(p). Without
additional structure on the learning algorithm’s tie-breaking, the
principal cannot rule out convergence to the worst element of BR(p), which
motivates the pessimistic evaluation mina ∈ BR(p)uP(p, a).
A useful genericity remark is that indifferences are ``knife-edge’’ when 𝒫 is sufficiently rich. Indeed, for fixed i ≠ j, the indifference condition uA(p, i) = uA(p, j) defines an affine hyperplane in ℝm (because uA(p, a) is linear in p through 𝔼o ∼ Fa[p(o)]). If 𝒫 has nontrivial dimension, one can often perturb p slightly (within feasibility) to break ties while changing the principal’s payoff only slightly. This heuristic helps reconcile the pessimistic definition with the intuition that, in many continuous contract families, optimal contracts will have unique best responses. We nevertheless keep the worst-case tie-breaking explicitly because it is the correct benchmark under bounded, discrete, or otherwise coarse contract spaces, and because our robust value comparisons are stated uniformly over environments.
The preceding discussion establishes that Vstatic is not an arbitrary comparator: it is the value of a genuine one-shot Stackelberg problem and is also the long-run value delivered by repeatedly posting a single contract, up to vanishing regret errors, against a broad set of learning agents. This gives Vstatic two distinct interpretations: as the value of a one-shot contracting problem with adverse tie-breaking, and as the long-run guarantee of stationary commitment against no-regret learners. Both interpretations are important for what follows. The first ties our repeated-game benchmark to standard contracting theory. The second clarifies why the repeated game is not automatically more powerful for the principal: if the agent's learning is sufficiently robust, then the principal cannot leverage nonstationarity to extract additional surplus beyond what is available under static commitment.
At the same time, viewing Vstatic as a baseline also highlights the only possible channel for dynamic gains. Any improvement over Vstatic must come from exploiting dynamics of learning: by varying pt over time, the principal may be able to shape the agent’s cumulative scores or beliefs so that the agent plays actions that would not be chosen under the ultimately intended stationary contract. Whether this is feasible depends entirely on what deviation class the learner defends against. External no-regret alone leaves considerable room for path-dependent manipulation, while internal (swap) regret closes much of it by enforcing a form of dynamic consistency. The next section formalizes this distinction: against no-swap-regret learners, dynamic policies collapse back to the static benchmark, whereas against mean-based learners, carefully designed nonstationarities can strictly improve the principal’s long-run payoff.
The previous section isolated the static pessimistic Stackelberg benchmark Vstatic as the natural comparator for repeated contracting. We now show that, once the agent's learning rule is sufficiently defensive—in the sense of guaranteeing vanishing swap (internal) regret—the principal cannot systematically profit from nonstationarity. In short, dynamic contracts do not buy the principal additional long-run value against an agent that is robust to action-remapping deviations.
Fix any class 𝔏swap of learning algorithms that guarantee SwapReg(T) = o(T) with respect to the agent’s expected utilities UA(t, ⋅). Recall that the principal evaluates a policy π pessimistically, taking an infimum over algorithms in the class. Our first main observation is that, under this evaluation, dynamic policies collapse to the static benchmark.
The economic logic is straightforward. Dynamic advantage requires
exploiting path dependence: the principal varies pt to shape the
agent’s internal state (cumulative utilities, beliefs, scores), thereby
inducing actions that would not be chosen under the ``eventual'' contract. Swap regret is precisely a defense against such path dependence. It allows the agent to compare its realized trajectory not only to \emph{fixed} alternative actions (external regret), but to \emph{systematic relabelings} of its own behavior (internal regret). If the principal's manipulation hinges on getting the agent to ``stick''
with a dominated label (e.g., continuing to play a costly action because
it once built up score advantage), then a swap deviation that replaces
that label by a cheaper alternative exposes the manipulation.
Anticipating this, a no-swap-regret learner will not provide the
principal with the sustained slack needed to extract rents
dynamically.
One can also view the result through an extreme (but instructive)
special case: a fully informed agent can play a myopic best response each period, selecting some at ∈ BR(pt). Such behavior has nonpositive swap regret, because for every mapping ϕ : A → A and
every t,
$$
U_A(t,\phi(a_t))\ \le\ U_A(t,a_t)
\qquad\Rightarrow\qquad
\sum_{t=1}^T U_A(t,\phi(a_t))-\sum_{t=1}^T U_A(t,a_t)\ \le\ 0.
$$
Against such an agent, the principal’s per-round payoff is always
bounded above by the pessimistic Stackelberg payoff of the posted
contract:
uP(pt, at) ≤ mina ∈ BR(pt)uP(pt, a),
so averaging over t and
maximizing over the choice of pt cannot exceed
maxpmina ∈ BR(p)uP(p, a) = Vstatic.
This already shows that, robustly (i.e., under an infimum over all swap-regret learners), the principal cannot guarantee more than Vstatic. The more
substantive content of the proposition is that the same upper bound is
enforced even when we do not assume myopic best responses, but only the
weaker asymptotic property SwapReg(T) = o(T).
To connect swap regret to equilibrium constraints, let us write the
internal-regret inequality in a form that exposes its ``obedience’’
content. Fix a horizon T and consider the agent's realized action sequence a1, …, aT against the principal's posted contracts p1, …, pT. The swap regret bound says that for every mapping ϕ : A → A,
$$
\sum_{t=1}^T U_A\bigl(t,\phi(a_t)\bigr)\;-\;\sum_{t=1}^T U_A(t,a_t)\ \le\ o(T).
$$
It suffices to consider deviations that change one action into another: for each pair i, j ∈ A, let ϕi → j map i to j and fix all other actions. Substituting ϕi → j into the bound above yields, for all i, j,
$$
\sum_{t:\,a_t=i}\bigl(U_A(t,j)-U_A(t,i)\bigr)\ \le\ o(T).
$$
In words: on the rounds in which the agent played i, systematically replacing i by any fixed alternative j would not have raised the agent's cumulative utility by more than o(T).
This is exactly the form of the obedience constraints, but with a
one-sided focus on the agent. To make the correspondence explicit,
define the empirical distribution μT over pairs
(p, a) ∈ 𝒫 × A
induced by the repeated play:
$$
\mu_T(B\times\{i\})\ :=\ \frac{1}{T}\sum_{t=1}^T \mathbf{1}\{p_t\in B,\
a_t=i\},
\qquad B\subseteq\mathcal P.
$$
Then the pairwise inequality can be rewritten as
$$
\mathbb E_{(p,a)\sim\mu_T}\Bigl[\mathbf 1\{a=i\}\bigl(u_A(p,j)-u_A(p,i)\bigr)\Bigr]\ \le\ o(1)
\qquad\text{for all } i,j\in A,
$$
where we used that UA(t, a) = uA(pt, a) in expectation. Interpreting a as a ``recommendation'' (the action the agent ends up taking), this inequality states that the agent has (asymptotically) no incentive to deviate from the recommendation to any fixed alternative action j. Thus, any limit point μ of {μT} satisfies
the correlated-equilibrium-type obedience constraints for the
agent:
𝔼μ[1{a = i}(uA(p, j) − uA(p, i))] ≤ 0, ∀i, j ∈ A.
A convenient way to read these constraints is through conditional
expectations. If we condition on a = i and define the
conditional distribution over contracts μ(⋅ ∣ a = i), then
obedience implies that i
maximizes the agent’s conditional expected utility:
𝔼p ∼ μ(⋅ ∣ a = i)[uA(p, i)] ≥ 𝔼p ∼ μ(⋅ ∣ a = i)[uA(p, j)] ∀j ∈ A.
Because uA(p, a)
is linear in p (through 𝔼o ∼ Fa[p(o)]),
we may equivalently define the conditional average contract,
p̄i := 𝔼μ[ p ∣ a = i ] ∈ co(𝒫),
and conclude that i ∈ BR(p̄i)
for every i played with
positive probability. In other words, swap regret forces the play to
look like a correlation device that draws a contract from a distribution
tailored to the recommended action, but still respects the agent’s
incentive compatibility constraints. This is the formal sense in which internal
regret eliminates ``behavioral mistakes’’ that a principal could
otherwise amplify dynamically.
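These constraints can be audited directly from a record of posted contracts and realized actions: by linearity of uA in p, the per-action obedience condition is equivalent to a best-response check at the conditional average contract p̄i. A minimal sketch (full-feedback form; the interface is our own convention):
\begin{verbatim}
import numpy as np

def obedience_gaps(F, c, contracts, actions):
    # Illustrative check. For each played action i, return max_j u_A(pbar_i, j) - u_A(pbar_i, i),
    # where pbar_i averages the posted contracts over rounds with a_t = i.
    # Swap-regret learners keep these gaps o(1) as the horizon grows.
    contracts = np.asarray(contracts)    # T x m array of posted contracts
    actions = np.asarray(actions)        # length-T vector of played actions
    gaps = {}
    for i in np.unique(actions):
        p_bar = contracts[actions == i].mean(axis=0)   # conditional average contract
        u = F @ p_bar - c                              # u_A(pbar_i, .) by linearity in p
        gaps[int(i)] = float(u.max() - u[i])
    return gaps
\end{verbatim}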
We now explain why these agent-side correlated-equilibrium constraints are enough to kill dynamic advantage for the principal in our contracting game. The high-level reason is that the principal cannot condition transfers on the agent’s hidden action. Consequently, any long-run outcome must be supportable by contracts for which the realized actions are (approximately) best responses; but when the agent is allowed to resolve indifferences adversarially, the principal is driven to the same worst-best-response evaluation that defines Vstatic.
A compact way to formalize this is to view the repeated interaction
as inducing, in the limit, a joint distribution μ over (p, a) satisfying the
obedience constraints. The principal’s long-run average payoff under
μ is
𝔼μ[uP(p, a)].
Since the principal takes an infimum over learners in 𝔏swap, we should treat μ pessimistically: whenever the
agent is (approximately) indifferent among multiple actions, the learner
may select those actions so as to minimize the principal’s payoff while
preserving obedience.
This pessimism is not an additional modeling choice; it is precisely what internal regret enables. If the principal tries to exploit an indifference region by ``nudging’’ the learner into a principal-favorable best response, a swap-regret learner can instead implement a different best response on that region without sacrificing its own utility (or its regret guarantees). From the principal’s perspective, the only contracts whose induced behavior is robust to such relabelings are those that maximize the principal payoff under worst-case best-response selection—exactly the definition of Vstatic.
Technically, one can convert this intuition into an upper bound via a
reduction to the one-shot pessimistic Stackelberg problem. Fix any
(possibly adaptive) principal policy π. Consider the subclass of
no-swap-regret learners that, whenever there are multiple approximate
best responses, choose among them so as to minimize uP(pt, a)
(subject to maintaining the no-swap-regret property). Such tie-breaking
is compatible with swap regret because internal regret constrains only
the agent’s utilities, and within an indifference set the agent can move
probability mass without affecting its cumulative utility up to o(T). It follows that,
under this worst-case learner, the principal’s per-period payoff is
asymptotically bounded by the pessimistic Stackelberg payoff of the
posted contract:
uP(pt, at) ≤ mina ∈ BR(pt)uP(pt, a) + o(1).
Averaging over t gives
$$
\frac{1}{T}\sum_{t=1}^T u_P(p_t,a_t)
\ \le\
\frac{1}{T}\sum_{t=1}^T \min_{a\in BR(p_t)}u_P(p_t,a)\ +\ o(1)
\ \le\
\max_{p\in\mathcal P}\min_{a\in BR(p)}u_P(p,a)\ +\ o(1),
$$
which yields the desired upper bound V(𝔏swap) ≤ Vstatic.
The reverse inequality V(𝔏swap) ≥ Vstatic
is achieved by the stationary policy that repeats an optimal static
contract.
This proof route emphasizes the game-theoretic meaning of internal regret: it gives the agent enough discipline to implement (approximate) obedience constraints, yet enough freedom to select the principal-worst obedient behavior whenever the contract does not pin down a unique response. The repeated game therefore does not enlarge the principal's guaranteed value set beyond what is already feasible under static commitment with pessimistic tie-breaking.
The preceding argument is clearest when 𝒫 is finite, but the conclusion does not rely on finiteness. The key requirements are (i) bounded payments (so that regret bounds apply uniformly), and (ii) linearity of payoffs in the contract vector p ∈ ℝ+m. When 𝒫 is compact (as under limited liability with an upper bound), one can discretize 𝒫 by an ε-net in ℓ∞ and observe that both uA(p, a) and uP(p, a) are Lipschitz in p with constants controlled by the outcome probabilities. The discretized game yields the same conclusion up to an O(ε) error, and letting ε → 0 recovers Vstatic.
Because both players’ expected utilities depend on p only through expectations 𝔼o ∼ Fa[p(o)], randomizing over contracts is equivalent to posting the expected contract whenever 𝒫 is convex. If 𝒫 is not convex, then allowing the principal to mix effectively enlarges the feasible set to co(𝒫). The same impossibility logic goes through with Vstatic interpreted over the relevant feasible set (either 𝒫 if mixing is disallowed, or co(𝒫) if mixing is allowed). In either case, internal regret prevents improvement beyond the appropriate static commitment benchmark.
The knife-edge nature of the impossibility result is economically
informative: dynamic advantage reappears only to the extent that
internal-regret protection is imperfect. If the agent guarantees SwapReg(T) ≤ ηT
for some small η > 0, then
the obedience constraints are violated by at most η in aggregate. In finite games
these become a system of linear inequalities, and standard duality
arguments imply a Lipschitz-type bound: the principal’s best dynamic
advantage is at most additive O(η) above Vstatic (with constants
depending on payoff bounds). Thus, ``slightly'' defensive learners admit only ``slightly'' exploitable dynamics.
Finally, we stress what the result does and does not say. It is not that a swap-regret learner must literally best respond every period; rather, any learner that ensures low internal regret is protected from systematic dynamic exploitation in the long run. The next section shows that this protection is special to internal regret: mean-based or entropy-regularized external-regret learners, while rational in the static sense, can be predictably steered by nonstationary incentives. This contrast is precisely what makes the static benchmark a sharp dividing line: it is simultaneously achievable by a stationary principal and unavoidable against sufficiently sophisticated (swap-regret) agents, yet not descriptive of what happens when learning dynamics are weaker.
The impossibility result in the previous section hinges on a strong form of ``defensive'' behavior: internal-regret protection prevents the principal from leveraging the agent's path-dependent state. When the agent instead uses a broad family of mean-based (and, in particular, entropy-regularized) no-regret methods, the picture changes sharply. In these dynamics, actions are not chosen as exact best responses to the current contract; rather, they are chosen as a smooth (or inertia-laden) function of cumulative scores. This opens a channel for the principal to invest in moving the learner to a desirable region of its state space and then to harvest by reducing incentives while the induced behavior decays only gradually.
We will not tie ourselves to a single algorithm. Instead, we use a
property that is shared by many externally no-regret procedures used in
practice—multiplicative weights, regularized follow-the-leader,
logit/softmax choice with annealed step-sizes, and related variants.
Recall the cumulative (expected) utility score
$$
\sigma_t(a)\ :=\ \sum_{s=1}^{t-1}U_A(s,a)
\qquad\text{where}\qquad
U_A(t,a)=\mathbb E_{o\sim F_a}[p_t(o)]-c_a.
$$
A learner is mean-based (in the sense we require) if there exists γ(T) = o(1) such that for all rounds t ≤ T and all actions i, j ∈ A,
σt(i) ≤ σt(j) − γ(T)T  ⇒  Pr [at = i] ≤ γ(T).
This condition says that once an action falls behind in cumulative utility by a linear-in-T margin, it becomes vanishingly unlikely to be played. Importantly, the condition does not impose the action-remapping stability that internal regret delivers. It only rules out persistently playing dominated actions in the long run. As a result,
the principal can profit from dominance: by creating a large early lead
for a costly, high-reward action, the principal can keep the learner on
that action even after the contract is modified, until the score
advantage is ``worked off’’.
Entropy regularization makes this intuition particularly transparent.
A canonical model is the logit rule
Pr [at = i] ∝ exp (σt(i)/τ),
where τ > 0 is a
temperature parameter. Small τ
yields sharp best-response-like behavior, while large τ induces exploration and faster
mixing. Under decreasing step-sizes (so that σt accumulates
rather than averaging), the principal can create large score gaps that
persist for many periods. Thus, even when the agent is asymptotically
no-regret in the external sense, the trajectory of play can be highly
steerable.
We now sketch a simple construction that already yields a strict gap
over the static pessimistic benchmark. The environment uses two
outcomes, O = {0, 1}, where
o = 1 is ``success.’’ Let the
principal’s reward be r(1) = 1
and r(0) = 0. Consider a
one-parameter (linear) limited-liability contract family
px(1) = x, px(0) = 0, x ∈ [0, x̄],
so that the only lever is the success bonus x.
Take two actions, a low action a = L and a high action
a = H. Let success
probabilities be qL < qH,
costs be cL = 0 and cH = c > 0.
Then the agent’s expected utility difference under contract x is
uA(px, H) − uA(px, L) = x(qH − qL) − c,
so there is a breakpoint x⋆ := c/(qH − qL):
for x > x⋆ the
high action is uniquely optimal, while for x < x⋆ the
low action is uniquely optimal. In a static contract, pushing the agent
to H requires paying at least
x⋆ per success,
which costs the principal x⋆qH
in expectation. The static pessimistic value is therefore
$$
V_{\mathrm{static}}
=\max\Bigl\{q_L,\ q_H-x^\star q_H\Bigr\}
=\max\Bigl\{q_L,\ q_H\Bigl(1-\frac{c}{q_H-q_L}\Bigr)\Bigr\},
$$
with pessimistic tie-breaking at x = x⋆.
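For a concrete, purely illustrative parameterization (with the cap x̄ large enough to be non-binding), take qL = 0.5, qH = 0.9, and c = 0.2. Then
$$
x^\star=\frac{c}{q_H-q_L}=\frac{0.2}{0.4}=0.5,
\qquad
V_{\mathrm{static}}=\max\{0.5,\ 0.9\times 0.5\}=0.5,
$$
so the static optimum in this instance is simply to pay nothing and accept the low action, while the per-round harvest payoff qH(1 − xlo) in the dynamic scheme described next can be close to 0.9.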
A dynamic policy can do better against a mean-based learner by separating time into two phases:
Phase I (invest): post a bonus xhi > x⋆ for T1 rounds. This makes H strictly better each period, so σT1 + 1(H) − σT1 + 1(L) grows linearly in T1. Under the mean-based condition (or under a softmax with small τ), after sufficiently many rounds the learner places almost all probability on H.
Phase II (harvest): drop the bonus to xlo < x⋆ for T2 rounds. Period-by-period, the low action is now better for the agent, so the score advantage of H shrinks by roughly
Δ := uA(pxlo, L) − uA(pxlo, H) = c − xlo(qH − qL) > 0
each round. However, if the principal chose T1 large enough, the
cumulative lead built in Phase I takes on the order of (σ(H) − σ(L))/Δ
rounds to dissipate. During this dissipation window, the mean-based
learner continues to play H
with high probability, even though xlo is too small to make
H optimal myopically.
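Under the mean-based condition (and ignoring the o(T) slack and the brief initial mixing), the length of this lock-in window can be estimated from the per-round score increments: the lead built in Phase I is roughly T1(xhi(qH − qL) − c), so the number of Phase II rounds before the learner abandons H is approximately
$$
\frac{\sigma_{T_1+1}(H)-\sigma_{T_1+1}(L)}{\Delta}
\;\approx\;
T_1\cdot\frac{x_{\mathrm{hi}}(q_H-q_L)-c}{\,c-x_{\mathrm{lo}}(q_H-q_L)\,}.
$$
This rough estimate makes explicit that the relative length of the harvest window is governed by how far xhi sits above the breakpoint compared with how far xlo sits below it.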
From the principal’s perspective, the payoff in Phase II is close
to
uP(pxlo, H) = qH − xloqH,
which can be much larger than the static pessimistic value if xlo is chosen small. The
principal ``pays’’ for this by over-incentivizing in Phase I, but the
key is that the investment cost is incurred for T1 rounds while the
harvest benefit can be collected for T2 ≫ T1
rounds, by choosing T1 just large enough to
create a score buffer of order T2.
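The following self-contained simulation sketch (hypothetical parameters; the learner is the softmax rule described earlier) isolates the lock-in mechanism: after the bonus is dropped, the learner keeps playing H for roughly the predicted number of rounds, during which the principal's per-round payoff is far above the static benchmark of this instance. Whether the full time-average also beats the benchmark depends on how the phases are arranged; the cascade constructions below are designed so that it does.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.5, 0.9])      # success probabilities q_L, q_H (illustrative)
c = np.array([0.0, 0.2])      # costs c_L, c_H; breakpoint x_star = 0.5
tau = 1.0                     # softmax temperature
T1, T2 = 2_000, 4_000         # invest phase and harvest phase lengths
x_hi, x_lo = 0.9, 0.05        # Phase I and Phase II success bonuses

sigma = np.zeros(2)           # cumulative expected utilities sigma_t(a)
locked_in = 0                 # Phase II rounds in which the learner still plays H
for t in range(T1 + T2):
    x = x_hi if t < T1 else x_lo
    z = sigma / tau
    probs = np.exp(z - z.max()); probs /= probs.sum()
    a = rng.choice(2, p=probs)              # softmax (mean-based) action choice
    if t >= T1 and a == 1:
        locked_in += 1
    sigma += q * x - c                      # full-feedback update: U_A(t, a) = q_a * x - c_a

predicted = T1 * (x_hi * (q[1] - q[0]) - c[1]) / (c[1] - x_lo * (q[1] - q[0]))
print(f"rounds locked on H after the bonus drop: {locked_in} (predicted ~ {predicted:.0f})")
print(f"per-round principal payoff while locked in: {q[1] * (1 - x_lo):.3f} vs V_static = 0.5")
\end{verbatim}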
The construction is deliberately stylized, but it captures a robust mechanism: external no-regret allows the agent to be asymptotically correct while remaining exploitable along the transient path. In particular, the principal can ensure that the agent's realized play spends a long fraction of time on an action that is not a best response to the current contract, without forcing the agent into external regret, because the learner's benchmark does not permit action-contingent remappings of its own behavior.
The two-action example yields a constant-factor improvement by making Phase II long relative to Phase I. Much larger gaps arise once we have a ladder of actions that differ slightly in incentives but substantially in principal value. The canonical ``free-fall cascade’’ construction uses n actions a ∈ {1, 2, …, n} with increasing principal rewards R1 < ⋯ < Rn and increasing costs c1 < ⋯ < cn, together with a contract family rich enough to create a sequence of adjacent breakpoints: for each k, there is a region of contracts where k is slightly preferable to k − 1 for the agent.
A convenient way to think about the principal's policy is as a controlled walk on the learner's score vector. In an ascent (investment) phase, we post a contract that makes action k + 1 strictly better than k by a small margin, long enough to build a cumulative lead for k + 1. Repeating this step-by-step moves probability mass up the ladder toward n. Once the learner is concentrated on high actions, we switch to a free-fall (harvest) phase in which the posted contract makes lower actions myopically preferable, but only slightly so. The learner then ``falls'' from n to n − 1 to ⋯ as cumulative leads are depleted, and each step of the fall takes many rounds because the score gaps are large.
The principal benefits because the high actions have large Ra while the harvest contracts keep expected payments low. Under bounded rewards and payments, one can choose parameters so that the time spent building each cumulative lead is short relative to the time spent harvesting it, and the principal's average payoff over the full cascade substantially exceeds the static pessimistic benchmark. In such families, the ratio V(𝔏mb)/Vstatic can grow with n; informally, larger action spaces permit longer cascades (more ``stored'' score slack) and therefore larger extraction windows.
This scaling perspective also clarifies comparative statics with respect to entropy. With a higher softmax temperature τ, the agent randomizes more and the cascade is blurred: the learner begins to mix into lower actions sooner, shortening the harvest window. Conversely, when τ is small (or when step-sizes place heavy weight on accumulated past advantages), the walk becomes sticky and the principal can implement long, predictable phases. This gives a crisp interpretation of why exploration can be protective: it reduces the principal’s ability to create long-lived score imbalances that keep the agent on a dominated label.
Transfers net out of total surplus, so the welfare effects of dynamic manipulation come entirely from the induced action path: welfare in round t under action at is W(at) = Rat − cat. Dynamic policies that keep the agent on high actions for long periods can therefore raise welfare relative to a static contract that settles on a lower action. But this same manipulation typically reduces the agent's long-run share of the surplus. Indeed, since
uP(pt, at) = W(at) − uA(pt, at),
the principal can increase profits either by increasing welfare W(at)
(more productive actions) or by decreasing the agent’s utility uA (rent
extraction), and the free-fall mechanism often does both: Phase I makes
the agent whole (or better) briefly, while Phase II compresses the
agent’s payoff as incentives are withdrawn but behavior adjusts
slowly.
This observation has two practical implications. First, when we interpret the agent as an organization (or an automated system) choosing effort/quality levels, the principal's ability to ``train'' the system with early subsidies and later reduce compensation resembles familiar concerns about hold-up and dynamic monopsony power. Second, from a design standpoint, these dynamics make the learning algorithm itself economically salient: two agents with the same objectives but different algorithms can generate very different long-run surplus splits under the same contracting institution.
We also emphasize a limitation. The constructions exploit long
horizons and relatively stable environments; if contracts, outcomes, or
costs drift exogenously, then the score-based inertia that enables
harvest can be diluted. Moreover, mean-based behavior is an assumption
about the agent’s internal update rule; it is not a claim about full
rationality. Our point is narrower: among widely used no-regret
procedures, there is a systematic gap between ``being hard to beat by a fixed action'' and ``being hard to steer by a strategic principal.'' The next section takes this seriously
by endogenizing the agent’s choice of sophistication: if dynamic
extraction is large enough, investing in a stronger (internal-regret)
defense becomes privately valuable, and equilibrium selection hinges on
the cost of that investment.
Thus far we have taken the learner class 𝔏 as exogenous. In many applications, however, the agent (or the agent's developer) has discretion over the learning rule: one can run a lightweight, myopic, mean-based update that is fast and easy to implement, or a more ``defensive'' procedure that controls internal regret but requires additional computation, memory, or engineering effort. This motivates a simple meta-game in which the learning algorithm is itself an endogenous choice, purchased at a cost.
We augment the repeated interaction with a stage 0 in which the agent selects an algorithm
from a menu. For expositional clarity, consider two options:
𝒜 ∈ {𝒜mb, 𝒜swap},
where 𝒜mb is drawn from a
mean-based externally no-regret class 𝔏mb, and 𝒜swap is drawn from an
internal-regret (no-swap-regret) class 𝔏swap. The agent incurs a
(possibly one-time) complexity cost κ(𝒜) ≥ 0, with κ(𝒜swap) ≥ κ(𝒜mb)
capturing the idea that stronger defensive learning is more
expensive.
After observing (or being able to infer) the algorithm choice, the principal commits to a policy π mapping histories to contracts, and the repeated interaction proceeds as before. We analyze the natural Stackelberg timing: the principal chooses an optimal policy anticipating the agent’s stage-0 choice, while the agent chooses an algorithm anticipating the principal’s best response.
Given a learner class 𝔏, recall the
principal’s robust long-run value
$$
V(\mathfrak L)
=\sup_{\pi}\inf_{\mathcal A\in\mathfrak
L}\liminf_{T\to\infty}\frac{1}{T}\mathbb E\Bigl[\sum_{t=1}^T
u_P(p_t,a_t)\Bigr].
$$
In particular, Propositions~1–2 identify two relevant quantities:
V(𝔏swap) = Vstatic, V(𝔏mb) ≥ Vstatic + ΔP,
for some environments where the principal’s dynamic advantage ΔP > 0 can be
constant-factor or even grow with n under bounded normalizations.
To close the stage-0 problem, we
must also track the agent’s induced long-run payoff. For a fixed
policy–algorithm pair (π, 𝒜),
define the agent’s long-run average utility
$$
\bar u_A(\pi,\mathcal A)
:=\liminf_{T\to\infty}\frac{1}{T}\mathbb E\Bigl[\sum_{t=1}^T
u_A(p_t,a_t)\Bigr].
$$
The agent chooses 𝒜 to maximize ūA(π⋆(𝒜), 𝒜) − κ(𝒜),
where π⋆(𝒜) denotes
the principal’s best response to the selected algorithm (or more
generally to the class it belongs to).
A useful accounting identity links the two players through per-round
welfare. Let W(a) := Ra − ca
denote total surplus under action a. Because transfers net out,
uP(pt, at) + uA(pt, at) = W(at).
Therefore, for any (π, 𝒜),
$$
\bar u_P(\pi,\mathcal A)\;+\;\bar u_A(\pi,\mathcal A)\;=\;\bar W(\pi,\mathcal A),
$$
where W̄(π, 𝒜) is the long-run average welfare induced by the action path. This identity makes explicit what the meta-game is really about: defensive learning matters
explicit what the meta-game is really about: defensive learning matters
to the agent to the extent it (i) prevents the principal from increasing
ūP at the
agent’s expense, and/or (ii) changes the induced welfare trajectory
W̄.
In the simplest and most relevant case, adopting a swap-regret defense forecloses the principal’s dynamic extraction channel without drastically changing efficient behavior. Then the dominant effect is distributional: relative to mean-based learning, swap-regret learning improves the agent’s long-run utility by approximately the amount of principal value that can no longer be extracted dynamically.
To express this cleanly, define the principal’s optimal values
against each class:
Vmb := V(𝔏mb), Vswap := V(𝔏swap) = Vstatic.
Define also the corresponding induced welfare levels under
principal-optimal policies,
$$
W_{\mathrm{mb}}
:=\liminf_{T\to\infty}\frac{1}{T}\mathbb E\Bigl[\sum_{t=1}^T
W(a_t)\Bigr]\ \ \text{under an optimal policy against }\mathfrak
L_{\mathrm{mb}},
$$
and similarly Wswap
under an optimal (static) policy against 𝔏swap. While Wmb and Wswap can differ in
general (dynamic steering may raise welfare by keeping the agent on high
actions longer), the decomposition implies that the agent’s long-run
payoff under principal-optimal play can be written as
Umb := Wmb − Vmb, Uswap := Wswap − Vswap.
Hence the agent prefers the defensive algorithm 𝒜swap whenever
\begin{equation}
U_{\mathrm{swap}}-\kappa(\mathcal A_{\mathrm{swap}})\ \ge\ U_{\mathrm{mb}}-\kappa(\mathcal A_{\mathrm{mb}}),
\end{equation}
or equivalently
\begin{equation}
V_{\mathrm{mb}}-V_{\mathrm{swap}}\ \ge\ \bigl(W_{\mathrm{mb}}-W_{\mathrm{swap}}\bigr)\;+\;\bigl(\kappa(\mathcal A_{\mathrm{swap}})-\kappa(\mathcal A_{\mathrm{mb}})\bigr).
\end{equation}
The left-hand side is the principal's dynamic extraction gain from facing a mean-based learner
rather than a swap-regret learner. The right-hand side adds two forces
that push in the opposite direction from the agent’s perspective: (i)
any welfare improvement created by dynamic steering, which the agent may
partially internalize, and (ii) the direct complexity cost of adopting
the defense. In environments where welfare changes are second-order—or
where the principal can reallocate essentially all welfare gains away
from the agent—the condition above reduces to a clean cutoff:
\begin{equation}
V_{\mathrm{mb}}-V_{\mathrm{static}}\ \ge\ \kappa(\mathcal A_{\mathrm{swap}})-\kappa(\mathcal A_{\mathrm{mb}}).
\end{equation}
This yields a concrete economic reading. Internal regret is a purchasable defense. The agent ``buys'' it if and only if the expected surplus protected (the dynamic gap the principal could otherwise appropriate) exceeds the defense's cost. Moreover, because Proposition~2 allows the gap Vmb − Vstatic to scale with problem size (e.g., with n in cascade constructions), the model predicts a stark nonlinearity: as environments become more complex, or as the principal gains more degrees of freedom in contracting, the private value of defensive learning can jump discontinuously from negligible to decisive.
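A minimal decision sketch implementing this cutoff logic (all quantities are hypothetical placeholders for the long-run values defined above; the function name is ours):
\begin{verbatim}
def prefers_swap_regret(V_mb, V_swap, W_mb, W_swap, kappa_mb, kappa_swap):
    # Stage-0 comparison: U_swap - kappa_swap >= U_mb - kappa_mb, with U = W - V
    # (the agent's long-run surplus share under principal-optimal play against each class).
    U_mb, U_swap = W_mb - V_mb, W_swap - V_swap
    return U_swap - kappa_swap >= U_mb - kappa_mb

# Equivalent form: V_mb - V_swap >= (W_mb - W_swap) + (kappa_swap - kappa_mb).
# Hypothetical numbers: a large dynamic extraction gap and a modest complexity premium.
print(prefers_swap_regret(V_mb=0.8, V_swap=0.5, W_mb=0.9, W_swap=0.85,
                          kappa_mb=0.0, kappa_swap=0.1))   # True: the defense is worth buying
\end{verbatim}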
From the principal’s perspective, the stage-0 meta-game endogenizes the relevant learner class. If the complexity premium κ(𝒜swap) − κ(𝒜mb) is high, the principal can rationally invest in dynamic policies tailored to mean-based inertia, expecting the agent to remain ``cheaply steerable.’’ Conversely, if defensive learning is inexpensive, then even a small dynamic advantage triggers algorithm upgrading, collapsing the principal’s achievable value back to Vstatic.
This logic suggests an additional comparative static that is absent in the exogenous-class analysis: strengthening the principal’s dynamic capabilities (by enriching 𝒫, increasing horizon stability, or improving state inference) can be self-defeating once agents can upgrade. The principal may therefore optimally commit to refraining from exploitative dynamic policies if such behavior would otherwise induce widespread adoption of defensive algorithms. In procurement language, this is a form of dynamic ``discipline’’: opportunistic contracting practices can shift the agent population toward more robust learning, reducing the principal’s future rents.
One can formalize this by allowing the principal to commit to a
restricted class of dynamic policies indexed by an observably certified constraint (e.g.,
``static-only contracts,’’ or ``contracts that are Lipschitz in
time’’), and then analyzing how these commitments change the agent’s
stage-0 incentives. In such variants, the principal faces an explicit
tradeoff between higher short-run extraction and the longer-run induced
shift in agent sophistication.
The stage-0 view connects directly to practical questions about the deployment of AI assistants and other automated agents. In many settings, a ``principal’’ (a firm, a platform, or an end user) interacts repeatedly with an assistant whose behavior adapts over time, while the principal controls rewards, pricing, or evaluation metrics. Our analysis suggests that verifiable learning guarantees, such as internal-regret bounds or stronger notions of stability, should be treated as economically meaningful attributes, akin to warranties or safety certifications.
Two implications stand out.
An agent (or vendor) can use certification of defensive learning to credibly commit that it will not be steered by transient incentives. In our model, such certification effectively moves the interaction from 𝔏mb to 𝔏swap, collapsing the principal’s dynamic advantage to the static benchmark. This is attractive to the agent because it protects long-run surplus, but it may also be attractive to the principal in environments where exploitation risks create reputational or regulatory costs. Importantly, certification is only valuable if it is verifiable; otherwise, the principal cannot condition its policy on the claimed algorithm class, and the meta-game unravels.
In practice, verifiability could take the form of audited training procedures, reproducible evaluation suites that test for internal-regret-like behavior under adversarial reward shaping, or cryptographic attestations of deployed code. Our analysis does not prescribe a particular mechanism, but it clarifies what must be certified: not merely that the agent ``performs well,’’ but that it satisfies a stability notion that blocks dynamic steering.
When a principal procures an AI assistant, it typically specifies performance metrics and payment terms. The model highlights that these terms implicitly shape the assistant’s learning dynamics: a payment scheme that is innocuous against a swap-regret learner may be highly distortive against a mean-based learner, and vice versa. Therefore procurement may need to include explicit learning-rule requirements (e.g., internal regret bounds, exploration floors, or update-rate caps) as part of the contract, much like specifying security standards.
A simple reading of the algorithm-choice cutoff is that procurement can shift the equilibrium by subsidizing sophistication: the principal (or a regulator) can reduce κ(𝒜swap) through tooling, shared infrastructure, or mandated defaults. Doing so may reduce the incidence of exploitative dynamics, albeit potentially at the cost of slower adaptation or higher compute. This frames a concrete policy tradeoff: lowering the cost of defensive learning improves robustness to manipulation but may reduce the feasibility of lightweight deployments.
We stress two limitations of the stylized stage-0 formulation. First, the agent’s choice set is richer than {𝒜mb, 𝒜swap}; real systems interpolate continuously between them (e.g., approximate internal regret, partial monitoring, bounded memory). A more realistic model would let the agent pick a parameter η governing an approximate swap-regret guarantee SwapReg(T) ≤ ηT, with a cost κ(η) that rises as the guarantee tightens, and then use continuity results (as in Proposition~3) to obtain a smooth version of the cutoff. Second, we have treated the agent as the sole decision-maker over its algorithm, whereas in many markets the algorithm is selected by a developer while the ``agent’’ experiencing payoffs is a downstream user. This wedge can generate underinvestment in defensive learning and thus amplify the principal’s dynamic power.
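A minimal sketch of the smoothed choice problem follows; both functions below are hypothetical placeholders (the residual extraction profile under slack η and the cost schedule κ(η)), not objects derived in the paper.

```python
import numpy as np

def residual_extraction(eta: float) -> float:
    # Hypothetical: surplus still extractable under SwapReg(T) <= eta * T.
    return 0.5 * eta

def kappa(eta: float) -> float:
    # Hypothetical: implementation cost, rising as the guarantee tightens.
    return 0.02 / (eta + 0.01)

etas = np.linspace(0.0, 1.0, 101)
losses = np.array([residual_extraction(e) + kappa(e) for e in etas])
eta_star = etas[np.argmin(losses)]
print(f"eta* ~ {eta_star:.2f}, total agent loss ~ {losses.min():.3f}")
```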
Despite these caveats, the core lesson is robust: when learning rules are endogenous, the principal’s ability to exploit behavioral inertia is not merely a property of the environment; it is an equilibrium outcome shaped by the relative costs of sophistication and the availability of credible guarantees. This motivates the extensions in the next section, where we ask how these conclusions change once the environment is contextual or stateful and the relevant defensive notions must be strengthened beyond external regret.
The baseline model treats each round as a fresh moral-hazard instance, with the principal choosing pt and the agent choosing at absent any persistent state or exogenous covariates. This abstraction is useful for isolating the economic role of learning guarantees, but many applications are explicitly contextual (the mapping from actions to outcomes depends on observable features), stateful (today’s action changes tomorrow’s opportunity set), or partially observed (the agent only observes bandit feedback). We briefly sketch how the main logic extends, and where genuinely new phenomena can arise.
Suppose that before contracting in round t a publicly observed context xt ∈ 𝒳 is
realized. The principal posts a context-contingent contract pt(⋅ ; xt) ∈ 𝒫(xt) ⊆ ℝ+m,
the agent chooses at ∈ A,
and then ot ∼ Fat(⋅ |xt).
Utilities remain
uA(pt(⋅; xt), at; xt) = 𝔼[pt(o; xt) ∣ xt, at] − cat, uP(pt(⋅; xt), at; xt) = Rat(xt) − 𝔼[pt(o; xt) ∣ xt, at],
where Ra(x) := 𝔼[r(o) ∣ x, a].
A natural benchmark is now a stationary mapping from contexts to contracts, p : 𝒳 → 𝒫(⋅), chosen once and applied
each round:
$$
V_{\mathrm{static}}^{\mathrm{ctx}}
:=\sup_{p(\cdot)}\ \liminf_{T\to\infty}\ \frac{1}{T}\sum_{t=1}^T\
\min_{a\in BR(p(\cdot;x_t);x_t)} u_P\bigl(p(\cdot;x_t),a;x_t\bigr),
$$
with BR(p(⋅; x); x)
defined in the obvious way. Under i.i.d. contexts xt ∼ 𝒟, this
reduces to maximizing an expectation over x ∼ 𝒟; under adversarial contexts it
becomes a worst-case time average.
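Under i.i.d. contexts and finite grids, this benchmark decomposes context by context and can be enumerated directly. The sketch below uses toy, hypothetical primitives (contexts, outcome distributions, costs, rewards, and a payment grid) and adversarial tie-breaking within the best-response set, mirroring the inner min in the definition above.

```python
import itertools
import numpy as np

# Toy, hypothetical primitives.
contexts = {"easy": 0.6, "hard": 0.4}                 # i.i.d. context distribution
actions, outcomes = [0, 1], [0, 1]                    # 0 = low effort, 1 = high effort
c = {0: 0.0, 1: 0.3}                                  # effort costs c_a
F = {"easy": {0: [0.7, 0.3], 1: [0.2, 0.8]},          # F_a(o | x)
     "hard": {0: [0.9, 0.1], 1: [0.5, 0.5]}}
R = {"easy": {0: 0.3, 1: 1.0}, "hard": {0: 0.1, 1: 1.0}}   # R_a(x)
payment_grid = np.linspace(0.0, 1.0, 21)              # limited-liability payments

def principal_value(x, p):
    """Principal value of contract p in context x, with adversarial tie-breaking
    inside the agent's best-response set."""
    uA = {a: float(np.dot(F[x][a], p)) - c[a] for a in actions}
    best = max(uA.values())
    br = [a for a in actions if uA[a] >= best - 1e-9]
    return min(R[x][a] - float(np.dot(F[x][a], p)) for a in br)

v_static_ctx = sum(
    w * max(principal_value(x, p)
            for p in itertools.product(payment_grid, repeat=len(outcomes)))
    for x, w in contexts.items())
print(f"V_static^ctx ~ {v_static_ctx:.3f}")
```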
The key question is whether dynamic, history-dependent policies can outperform this benchmark against sophisticated learners. The same mechanism behind Proposition~1 suggests a qualified ``no’’ once the agent controls internal regret in the stage game whose action set is still A but whose payoff depends on (xt, pt). Intuitively, if the agent’s learning rule enforces approximate obedience constraints conditional on the context, then the repeated play is pinned to a correlated-equilibrium-like set in each information slice, and the principal cannot systematically extract more than the best stationary mapping from contexts to contracts. What changes relative to the non-contextual case is not the logic but the object: static optimality becomes optimality over stationary context-to-contract mappings.
At the same time, contextuality expands the principal’s steering
tools against weaker learners. Even when 𝒫(x) is simple for each x, the principal can interleave
contexts so as to create persistent cumulative-utility gaps (the
analogue of ``free-fall’’) that are invisible to an external-regret criterion aggregated across heterogeneous rounds. This is one reason contextual bandit settings are a natural laboratory for manipulation: the principal can use ``easy’’
contexts to subsidize a costly action and ``hard’’ contexts to harvest
rents, while mean-based learners may not correctly normalize across
these regimes unless their guarantee is explicitly contextual.
A more substantive extension introduces an
unobserved-to-the-principal action state or a publicly observed system
state st ∈ 𝒮 with
Markovian dynamics. One convenient formulation is an MDP-like
interaction in which, after contract choice pt(⋅ ; st)
and action at, an outcome
ot is
realized and the next state satisfies
(st+1, ot) ∼ P(⋅, ⋅ ∣ st, at),
with r(ot)
and pt(ot; st)
paid as before. (The principal may also condition on st if it is
observed.) The agent’s strategic problem is now inherently dynamic: choosing
at trades
off current transfers against future state visitation.
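For concreteness, the following is a minimal interface sketch of this stateful interaction; the transition kernel, reward function, and cost schedule are hypothetical toy inputs rather than objects specified by the model.

```python
import random

class StatefulContracting:
    """Minimal sketch of the stateful interaction. P[s][a] is a list of
    ((next_state, outcome), probability) pairs; r maps outcomes to principal
    value; c maps actions to effort costs. All primitives are hypothetical."""

    def __init__(self, P, r, c, s0):
        self.P, self.r, self.c, self.s = P, r, c, s0

    def step(self, payment, action):
        # payment: dict outcome -> transfer, i.e. the posted contract p_t(.; s_t)
        pairs = self.P[self.s][action]
        (s_next, o) = random.choices([so for so, _ in pairs],
                                     weights=[w for _, w in pairs])[0]
        u_agent = payment[o] - self.c[action]
        u_principal = self.r[o] - payment[o]
        self.s = s_next
        return o, u_agent, u_principal

# Toy example: two states, two actions, two outcomes.
P = {0: {0: [((0, 0), 0.9), ((1, 1), 0.1)], 1: [((1, 1), 0.7), ((0, 0), 0.3)]},
     1: {0: [((0, 0), 0.5), ((1, 1), 0.5)], 1: [((1, 1), 0.9), ((0, 0), 0.1)]}}
env = StatefulContracting(P, r={0: 0.0, 1: 1.0}, c={0: 0.0, 1: 0.3}, s0=0)
print(env.step(payment={0: 0.0, 1: 0.5}, action=1))
```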
In such environments, standard external regret is often the wrong
yardstick because it compares the realized trajectory to a fixed action
that would have been played against the sequence of contracts, even
though changing actions would also change future states and thus future
contract opportunities. This is precisely where stronger notions such as policy regret
(or more generally counterfactual regret defined on policies) become
economically meaningful. One stylized target is
$$
\mathrm{PolReg}(T)
:=\max_{\pi_A\in\Pi}\ \mathbb E\Bigl[\sum_{t=1}^T
u_A\bigl(p_t(\cdot;s_t),a_t^{\pi_A};s_t\bigr)\Bigr]
-\mathbb E\Bigl[\sum_{t=1}^T
u_A\bigl(p_t(\cdot;s_t),a_t;s_t\bigr)\Bigr],
$$
where Π is a class of
stationary (or slowly varying) Markov policies for the agent and atπA ∼ πA(⋅ ∣ st)
is the counterfactual action under the state path induced by πA.
The economic content of Proposition~5 (stated earlier as optional) is that if the agent can guarantee PolReg(T) = o(T) against adaptive adversaries, then the principal’s dynamic advantage again collapses to a benchmark defined over stationary contract policies and stationary agent (Markov) policies. The underlying intuition mirrors the swap-regret case: policy-regret control prevents the principal from benefiting from nonstationarities that only exist because the learner is being ``walked’’ through transient states and transient beliefs. When the agent can compare its realized performance to coherent counterfactual policies that account for state evolution, the principal loses the ability to profit from such path dependence except through genuine improvements in long-run welfare.
We view this extension as conceptually important but technically delicate. Unlike the stage game with finite (p, a), MDP interactions require additional regularity—mixing conditions, bounded influence of contract perturbations on state occupancy, or restrictions on how finely the principal can condition pt on history—to avoid degenerate pathologies. The main lesson, however, is robust: in stateful settings, the relevant defensive notion must be strengthened from internal regret to policy regret (or an equivalent stability notion), and the same distribution-versus-efficiency decomposition reappears once we work at the right level of abstraction.
Our baseline discussion implicitly allows full-information feedback to the agent, in the sense that the agent can evaluate UA(t, a) for all a ∈ A given (pt, ot) and knowledge of Fa. In many contracting environments this is unrealistic: the agent may observe only realized payoff pt(ot) − cat and perhaps the contract pt, but not the counterfactual outcome distributions under actions it did not take. This pushes the agent into a bandit-like learning problem.
Partial feedback matters for two reasons.
First, it changes which guarantees are feasible. Achieving no-swap-regret under bandit feedback is substantially harder than achieving external no-regret, and often requires explicit exploration. If the agent is unwilling to explore (because exploration is costly in utility terms), then internal-regret-type protection may be infeasible, and the principal can exploit informational fragility even when the agent intends to be defensive.
Second, partial feedback can amplify dynamic manipulation. Mean-based dynamics rely on the accumulation of utilities σ̂t(a); a principal who can influence the variance and bias of these estimates (by shaping outcome risk through contract convexity, or by inducing rare but salient payments) may create persistent miscalibration. In this sense, informational constraints create a second manipulation channel beyond pure path dependence: even if the agent’s update rule is well-intentioned, the principal may profit by making some actions hard to evaluate and others easy to learn.
A practical implication is that ``certification of defensive learning’’ in bandit settings cannot merely assert an abstract regret bound; it must also specify the required information structure and exploration regime. For instance, an exploration floor (a lower bound on Pr [at = a]) may protect against certain exploitation patterns, but it also changes welfare by forcing inefficient sampling. This creates an explicit three-way tradeoff between (i) robustness to dynamic incentives, (ii) efficiency, and (iii) informational feasibility.
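Operationally, an exploration floor can be imposed by mixing the learner’s proposed distribution with the uniform distribution so that every action keeps at least a minimal probability; a minimal sketch (the helper name is ours):

```python
import numpy as np

def apply_exploration_floor(probs: np.ndarray, floor: float) -> np.ndarray:
    """Mix a proposed action distribution with the uniform distribution so that
    every action keeps probability at least `floor` (requires floor <= 1/n)."""
    n = len(probs)
    assert 0.0 <= floor <= 1.0 / n
    mix = floor * n                       # weight placed on the uniform component
    return (1.0 - mix) * probs + mix / n

print(apply_exploration_floor(np.array([0.98, 0.02]), floor=0.05))  # [0.932 0.068]
```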
Finally, we can stress-test dynamic contracting schemes by allowing the interaction to end at an exogenous stopping time. Concretely, let τ be a random horizon (possibly geometric, yielding discounting), observed only when it occurs. The principal’s objective becomes an expected average or discounted payoff, and the agent’s learning problem must perform well under an uncertain horizon.
Stopping-time uncertainty interacts sharply with the dynamic extraction mechanisms in Proposition~2. Many steering policies rely on a front-loaded subsidy phase followed by a harvesting phase. If the game might stop before harvesting, then the principal bears additional risk; conversely, if the agent discounts the future or anticipates termination, then costly actions become harder to sustain, and ``free-fall’’ may accelerate. This makes τ a simple robustness parameter: dynamic manipulation that requires long horizons is fragile to termination risk, while manipulation that operates quickly is more robust but may require larger transfers (and hence may be constrained by limited liability).
From the agent’s perspective, uncertain horizons also affect which defenses are valuable. Swap-regret control is asymptotic, but in stopping-time environments the relevant comparison is finite-sample: how quickly does the defense prevent exploitation relative to expected remaining time? This suggests that algorithm-choice cutoffs (like the one derived in the stage-0 meta-game) should depend not only on long-run value gaps but also on convergence rates and on the distribution of τ. Put differently, when relationships are short-lived, ``sophistication’’ may not pay even if it is valuable asymptotically; when relationships are persistent, the same investment becomes compelling.
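To illustrate how horizon risk and convergence rates enter such a cutoff, the following back-of-the-envelope sketch assumes a hypothetical per-round exploitation bound decaying like C/√t and a geometric horizon; none of these functional forms is derived in the paper.

```python
import numpy as np

def expected_protected_surplus(gap, C, stop_prob, t_max=10_000):
    """E[sum_{t <= tau} max(0, gap - C/sqrt(t))] under a geometric horizon tau.
    `gap` is the per-round surplus at stake; C/sqrt(t) is an assumed
    finite-sample exploitation bound for the defense."""
    t = np.arange(1, t_max + 1)
    survival = (1.0 - stop_prob) ** (t - 1)          # P(tau >= t)
    protected = np.maximum(0.0, gap - C / np.sqrt(t))
    return float(np.sum(survival * protected))

# Short-lived relationships protect far less surplus than persistent ones.
print(expected_protected_surplus(gap=0.3, C=2.0, stop_prob=0.1),
      expected_protected_surplus(gap=0.3, C=2.0, stop_prob=0.001))
```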
We view stopping-time variants as analytically useful even when one ultimately cares about long horizons. They force us to separate two distinct claims that can otherwise be conflated: (i) whether a defense eliminates dynamic advantage asymptotically, and (ii) whether it does so quickly enough to matter over the relationship’s likely lifetime. In applications such as platforms iterating rapidly on pricing rules or reward metrics, the second question is often the binding one.
Taken together, these extensions reinforce the core message while sharpening it. Dynamic exploitation is not merely a feature of repeated moral hazard; it is a feature of repeated moral hazard played against imperfectly defended learning agents. Once we move to contextual, stateful, or partially observed environments, the appropriate benchmark and the appropriate defense both change, but the same economic logic continues to organize the analysis: stronger stability notions collapse the principal’s dynamic advantage toward a static (or stationary) benchmark, while weaker notions leave transient dynamics available for extraction.
Our results have a simple organizing message for applied contracting:
whether dynamic incentive schemes are ``value creating’’ or ``extractive’’ depends less on the
sophistication of the principal and more on the stability guarantees embedded in the agent’s
adaptation rule and the information structure of the resulting interaction. When the agent
effectively enforces internal-consistency constraints (swap regret,
policy regret, or related notions), dynamic contracting largely
collapses to a static or stationary benchmark; when the agent only
guarantees weaker external-regret or mean-based properties, a principal
can sometimes profit from steering the agent through transient regions
of behavior that would not arise under a fully ``defended’’ response. In
this section we translate that logic into implications for (i) platform
and policy design, (ii) the design of agent-side learning systems, and
(iii) a set of open questions that we view as central for making the
theory operational.
A practical obstacle to deploying ``defensive learning’’ is not conceptual but verifiability: many stability properties are defined counterfactually (e.g., deviations that remap actions, or policy comparisons under counterfactual state paths). This makes it hard for regulators, users, or third-party auditors to certify that an agent is protected against dynamic exploitation. We therefore view verifiability of learning guarantees as a first-class design primitive, alongside limited liability and informational constraints.
Two complementary approaches are plausible.
In repeated contracting on platforms, the principal (platform) can often log the full sequence {(pt, ot)}t ≤ T, while the agent may log additional internal signals. An audit can then check whether observed play is consistent with a class of ``stable’’ agents, in the sense that there exists a sequence of mixed actions consistent with approximate obedience constraints. In finite environments, these constraints are linear in the empirical distribution of (pt, at) and can be expressed in the same spirit as approximate correlated equilibrium. While such audits do not recover at when actions are hidden, they can still be informative in settings where outcomes are sufficiently diagnostic of actions, or where the principal is required to report additional statistics that allow inference. The policy implication is immediate: transparency requirements that force richer reporting (e.g., calibrated outcome summaries by contract) can make manipulation harder by enabling sharper audits.
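When action logs (or reliable proxies inferred from diagnostic outcomes) are available to the auditor, the obedience check reduces to verifying that no fixed remapping a → a′ would have raised the agent’s average payoff by more than a tolerance. A minimal sketch, with a hypothetical utility oracle u_agent(p, a) supplied by the auditor:

```python
from collections import defaultdict

def max_swap_gain(log, u_agent, actions):
    """Largest average per-round gain from any fixed remapping a -> a' over
    logged play; observed behavior is consistent with eps-swap-regret iff the
    returned value is at most eps.

    log:     iterable of (contract, action) pairs available to the auditor
    u_agent: oracle or estimate of the agent's expected utility u_A(contract, a)
    """
    T = 0
    gains = defaultdict(float)
    for p, a in log:
        T += 1
        for a_prime in actions:
            gains[(a, a_prime)] += u_agent(p, a_prime) - u_agent(p, a)
    return max(gains.values()) / T if T else 0.0
```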
Many defenses are feasible only if the agent can compute counterfactual utilities UA(t, a), or at least form unbiased estimates of them. This suggests that ``transparency’’ should not be understood as revealing the principal’s internal objectives, but rather as revealing the primitives needed for stable learning: the mapping from outcomes to payments, the relevant outcome taxonomy, and (when feasible) enough statistical structure to estimate how alternative actions would have performed. In other words, the transparency target is the information required for the agent to implement a stronger regret notion, not the information required for the principal to optimize extraction.
A concrete policy lever is to mandate standardized reporting of outcome models (or validated simulators) in domains such as ad auctions, gig platforms, and content moderation, where the agent’s action space is large and the consequences of actions are noisy. Such requirements are not costless—they may reveal proprietary information or enable gaming by third parties—but our framework clarifies the tradeoff: restricting counterfactual evaluation pushes agents toward weaker, more exploitable learning rules, enlarging the principal’s dynamic advantage set.
A platform designer often plays a dual role: it is a principal vis-à-vis participants, but it may also be a social planner subject to constraints (fairness, contestability, consumer protection). The theory suggests three actionable guidelines.
When participants plausibly implement sophisticated defenses (swap-regret-like learning, or algorithmically chosen defenses with low complexity cost), dynamic schemes tend not to improve on the best static benchmark in a robust sense. In such environments, frequent contract changes are more likely to increase variance, compliance costs, and perceived unfairness than to increase long-run efficiency. A policy of ``metric stability’’—limiting the rate at which scoring rules or payment weights can change—can therefore be justified not only on administrative grounds but also on incentive grounds: it removes the very degrees of freedom that enable extraction against weaker learners, without sacrificing much against stronger ones.
There are domains (fraud detection, content ranking, safety) where nonstationarity is intrinsic. In those cases, the platform can still reduce manipulation risk by constraining contract updates to be interpretable and monotone in the sense that participants can understand how actions map into outcomes and payments over time. Our model highlights why: mean-based or entropy-regularized agents can be ``walked’’ through transient phases when incentives move in ways that are hard to compare across time. Constraining updates makes it easier for agents to normalize utilities across regimes and makes it easier for auditors to detect systematic steering.
A recurring theme in dynamic extraction constructions is the use of front-loaded subsidies followed by harvesting. Limited liability bounds the available subsidy and therefore limits the speed and magnitude of steering. From a design standpoint, payment caps and escrow requirements can thus play a protective role for participants, beyond preventing insolvency: they restrict the principal’s ability to create large cumulative-utility gaps that weaker learners cannot immediately undo. Put differently, limited liability can substitute (imperfectly) for sophisticated defenses when those defenses are infeasible.
From the agent side, the main design choice is not merely which learning algorithm minimizes regret, but which guarantees an algorithm continues to provide under an adaptive principal. This reframes familiar engineering decisions—exploration schedules, regularization, and feedback assumptions—as economic design choices.
If the interaction is well approximated by a repeated stage game with payoffs that can be evaluated (or consistently estimated), then internal-regret minimization is a natural defense: it limits the principal’s ability to benefit from history dependence beyond the static benchmark. Practically, this points toward algorithms that control swap regret or implement calibrated learning dynamics, even if they are computationally heavier than standard no-regret methods. The meta-game logic (algorithm-choice thresholds) suggests that such upgrades should be targeted to environments where the expected exploitation gap is large relative to computational and implementation costs.
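As one concrete (if standard) construction, swap regret can be controlled by running one external-regret learner per action and playing the stationary distribution of their combined proposals, in the spirit of the Blum–Mansour reduction. The sketch below is a compact full-information version offered for illustration, not a prescription from the paper.

```python
import numpy as np

class SwapRegretLearner:
    """Compact full-information sketch of the external-to-swap-regret reduction:
    one multiplicative-weights learner per action, combined through the
    stationary distribution of their proposals."""

    def __init__(self, n_actions: int, lr: float = 0.1):
        self.n, self.lr = n_actions, lr
        self.weights = np.ones((n_actions, n_actions))   # row i: learner "for" action i

    def act(self) -> np.ndarray:
        Q = self.weights / self.weights.sum(axis=1, keepdims=True)  # row-stochastic
        p = np.ones(self.n) / self.n
        for _ in range(200):                              # power iteration: p = pQ
            p = p @ Q
        self._p = p / p.sum()
        return self._p

    def update(self, payoffs: np.ndarray) -> None:
        # Payoffs in [0, 1] for every action; learner i is credited with the
        # payoff vector scaled by the probability p_i it was "responsible" for.
        for i in range(self.n):
            self.weights[i] *= np.exp(self.lr * self._p[i] * payoffs)

learner = SwapRegretLearner(3)
dist = learner.act()                       # distribution over the 3 actions
learner.update(np.array([0.2, 0.9, 0.1]))  # announced utilities for this round
```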
In bandit-like environments, an agent that cannot estimate counterfactual payoffs is structurally vulnerable. Agent designers can sometimes mitigate this by investing in instrumentation: logging richer features, building causal models that predict outcomes under alternative actions, or negotiating for additional signals from the platform. Even imperfect counterfactual models can help if they reduce bias and variance enough to approximate the obedience constraints that internal-regret learning requires.
Standard offline evaluation of learning systems often tests performance against fixed environments. Our results suggest adding adversarially adaptive principals (or platform simulators that update pt based on observed behavior) as a stress test. A simple version is to measure the worst-case gap between realized payoff and a static best response under the empirical distribution of contracts. More ambitious versions attempt to certify approximate swap-regret or policy-regret bounds under a family of plausible principal policies.
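A minimal version of such a stress test, assuming logged contracts from the simulated adaptive principal, logged realized payoffs, and a model or estimate u_agent(p, a) of counterfactual utilities (all names ours):

```python
import numpy as np

def static_best_response_gap(contracts, realized_payoffs, u_agent, actions):
    """Gap between the best fixed action under the empirical distribution of
    posted contracts and the deployed agent's realized average payoff.

    contracts:        contracts posted by the (simulated) adaptive principal
    realized_payoffs: the deployed agent's realized per-round utilities
    u_agent(p, a):    model/estimate of expected agent utility under (p, a)
    """
    T = len(contracts)
    best_fixed = max(sum(u_agent(p, a) for p in contracts) / T for a in actions)
    return best_fixed - float(np.mean(realized_payoffs))
```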
Even when asymptotic guarantees are strong, finite-horizon behavior can be exploitable. Agent designers should therefore treat convergence speed as economically meaningful: if a defense requires long horizons to ``kick in,’’ it may fail in relationships with churn, termination risk, or regime changes. This suggests using anytime algorithms, explicit learning-rate schedules tied to horizon estimates, and conservative initialization procedures that reduce the profitability of front-loaded steering.
Regulatory debates around platforms frequently emphasize transparency, but the relevant notion is subtle. Revealing more about the platform’s optimization may or may not help participants. By contrast, revealing (or standardizing) the payment mapping and the outcome taxonomy directly affects the feasibility of defensive learning.
A regulatory framework aligned with our model would focus on three objects: the mapping from outcomes to payments, the feedback available to agents for counterfactual evaluation, and the verifiability of the learning guarantees agents claim to satisfy.
Notably, these recommendations do not require regulating the principal’s objective r(⋅) directly. They instead regulate the interface conditions that determine whether the principal can profit from adaptivity against weaker learners.
We conclude with open questions that, in our view, determine how far this line of work can be pushed in theory and practice.
Mean-based and entropy-regularized learners are stylized. Real systems mix heuristics, constraints, and partial observability. A central challenge is to define learner classes 𝔏 that are both behaviorally realistic and analytically tractable, and then to characterize V(𝔏) beyond existence of gaps. Even in finite settings, the principal’s optimal policy may resemble an optimal control problem over the agent’s score vector σt(⋅); turning this into usable prescriptions remains largely open.
Bandit feedback makes strong defenses expensive because exploration reduces utility. How should an agent optimally trade off immediate payoff against robustness to dynamic extraction? Conversely, how should a platform be allowed to shape information disclosure without creating perverse incentives to obscure feedback? We expect the right model to treat exploration as a choice variable with an explicit economic cost, yielding equilibrium predictions for when ``defensive exploration’’ emerges.
Platforms typically contract with many agents simultaneously. Competition can either discipline the principal (by making extraction harder) or intensify it (by creating relative-performance schemes that magnify steering). Extending robust-value notions to many agents raises new questions: do correlated-equilibrium-like constraints across agents limit dynamic extraction, or can a principal use cross-agent coupling to reintroduce exploitation even when each agent individually controls internal regret?
In practice, platforms can commit imperfectly to update rules (e.g., published policies, versioned APIs, or verifiable smart contracts). Formalizing partial commitment could sharpen the policy relevance of ``static benchmark’’ results: if a platform can credibly commit to a stationary mapping within a version, then participants can rationally invest in defenses tailored to that version. The design of verifiable commitment mechanisms that preserve flexibility while limiting opportunistic adaptivity is an open design problem.
Our analysis emphasizes the principal’s robust payoff. In many domains the policy question is distributive: how much surplus is shifted from agents to platforms by dynamic manipulation, and how does this interact with fairness constraints? Developing welfare decompositions that remain valid under learning dynamics (especially with state and context) would connect the theory more directly to antitrust and labor-market policy.
The broader lesson is that ``learning’’ is not merely a friction that disappears in the limit; it is a strategic interface that can be engineered on both sides. Platforms and regulators can influence which guarantees are feasible through transparency and feedback design, while agent designers can choose defenses that convert repeated interaction from a manipulable process into one that approximates a static contracting problem. Our model does not claim that dynamic incentives are always harmful or always beneficial; rather, it illuminates when dynamic adaptation is likely to improve efficiency and when it is likely to function primarily as an extraction technology.