Digital labor markets and consumer platforms increasingly interact with learning agents rather than one-shot, fully rational decision makers. In 2026 this is no longer a speculative premise: customer-support copilots route tickets and decide escalation levels; autonomous sales agents choose outreach intensity and channel mix; content-moderation models select review depth; and ``AI employees'' in enterprise workflows decide whether to run costly checks, request clarifications, or take shortcuts. In each case, a designer (an employer, platform, or regulator) specifies a mapping from observable outcomes to payments, credits, access, or future opportunities, while the agent chooses among actions that differ in cost and in their induced outcome distributions. What makes these environments distinct is not the presence of hidden action alone—moral hazard is old—but the fact that the agent is typically implemented as an algorithm that updates behavior from experience. The designer is therefore not only facing a best response to the current incentives, but also shaping the agent's learning trajectory.
This paper studies a simple but, we argue, foundational question for such settings: when does the ability to vary incentives over time allow the principal to outperform the best static contract, and when does the agent's learning rule neutralize this dynamic advantage? In traditional contract theory, a principal who can commit to a contract often focuses on a static optimum, while dynamics enter through additional informational frictions or intertemporal constraints. In contrast, modern deployments routinely give the principal fine-grained ability to change incentives round by round (A/B tests, personalized bonuses, shifting KPIs, adaptive reward models), while the agent may be running standard online learning or reinforcement learning updates. These dynamics create a new channel: even if each round resembles a static moral-hazard problem, the sequence of incentives can interact with the agent's update rule to create persistent ``inertia'' or predictable transitions across actions. The principal may be able to exploit this inertia, but only insofar as the agent's learning rule fails to defend against certain path-dependent deviations.
Our starting point is a repeated contract environment in which the principal posts a bounded, limited-liability payment scheme each round and the agent selects an action that stochastically determines an observable outcome. This abstraction captures a range of practical situations. In data-labeling pipelines, the outcome can be label quality categories, while the action is the degree of diligence. In online marketplaces, the outcome can be delivery timeliness or customer ratings, while the action is effort or resource allocation. In safety-critical AI workflows, the outcome can be the frequency of detected errors, while the action is whether to run a costly verification tool. In all of these examples, the principal does not directly observe the action but observes outcomes and can condition transfers on them. Crucially, the agent is not modeled as choosing an optimal action each round given the posted contract, but as running a learning algorithm that reacts to experienced payoffs.
The key conceptual distinction we draw is between two broad types of
learning behavior that are both commonly described as
``no-regret,'' yet have sharply different strategic implications for a principal. On one side are learners that guarantee \emph{internal} consistency, often formalized as \emph{swap regret} (or equivalently, vanishing internal regret). These learners not only compete with the best fixed action in hindsight, but are robust to deviations that systematically remap one played action into another. Many algorithms used in game-theoretic learning and equilibrium computation satisfy such guarantees. On the other side are learners that are only protected against deviations to fixed actions---external regret---and whose choice probabilities are driven by accumulated average payoffs in a ``mean-based''
way (including common multiplicative-weights and entropy-regularized
softmax variants). These are prominent in practice because of their
simplicity, stability, and compatibility with partial feedback.
Our main message is a dichotomy. If the agent is protected by swap-regret guarantees, then dynamic incentive schemes do not create additional extractable value: the principal cannot guarantee an average payoff above what is achievable with a single, optimally chosen static contract. In that regime, adaptive tinkering with incentives is essentially neutralized; whatever transient advantages might appear can be undone by the agent's internal-consistency checks, which rule out precisely the kind of path-dependent exploitation that dynamic policies attempt to leverage. This result provides a sharp negative statement for dynamic mechanism design against sufficiently ``defensive'' learning.
By contrast, if the agent is governed by a broad class of mean-based no-regret dynamics, then dynamic incentives can strictly improve the principal's long-run payoff relative to the best static contract. The reason is not that the agent fails to learn; rather, the learning rule may be insufficiently defensive, comparing only to fixed-action alternatives and failing to account for systematic deviations that re-interpret earlier choices. This gap allows the principal to strategically create and then harvest cumulative advantage. Intuitively, the principal can temporarily over-incentivize a costly, high-reward action so that it builds a large lead in the agent's cumulative payoff estimates. Once the learner is ``locked in'' by this lead, the principal can reduce incentives (sometimes dramatically), and the learner will not immediately switch away, because doing so would require overcoming the accumulated advantage. Over time the lead decays—hence a finite harvest window—but during this decay the principal receives the benefit of the high-reward action while paying less than would be required to induce it in a static one-shot sense. Importantly, this is not a knife-edge phenomenon: we identify environments where the improvement is a constant factor, and families where the multiplicative gap grows with the number of available actions under natural boundedness normalizations.
This perspective reframes several empirical observations about incentive systems in algorithmic workplaces and platforms. Designers often report that short-lived bonuses or temporary KPI shifts have surprisingly persistent behavioral effects, even after the incentives are withdrawn. In our framework, such persistence is not merely a behavioral anomaly; it can arise endogenously from standard learning updates that aggregate past payoffs. Conversely, from the agent’s perspective, the relevant defense is not simply ``learning faster’’ in the external-regret sense, but adopting a learning rule that is robust to a richer class of counterfactual deviations. In other words, the agent may need to invest in a more sophisticated form of regret minimization to avoid being steered by the principal’s dynamic policy.
We make four contributions. First, we formalize a repeated principal–agent contracting problem against a class of learning algorithms and define the principal's robust long-run value as the best guarantee achievable by any (possibly adaptive) contract policy. This worst-case lens is motivated by deployment: a principal may not know the exact learning rule embedded in an AI agent (or in a human–AI team), but may know it belongs to a family (e.g., entropy-regularized bandit learners, or internal-regret minimizers used in equilibrium solvers). Second, we show that against swap-regret learners the principal's robust value coincides with a static benchmark: the optimal static contract under best responses. This collapses the dynamic design problem to a one-shot Stackelberg-style contract choice, providing a clean boundary condition for when dynamic contracting is a source of extra power.
Third, we construct environments in which mean-based no-regret learners are systematically exploitable by dynamic policies, yielding principal payoffs that strictly exceed the static benchmark. Our constructions isolate the mechanism behind this effect—the creation and controlled dissipation of cumulative advantage—and show that it can be made quantitatively large. While the underlying mathematics is stylized, the economic takeaway is robust: when the agent’s learning rule aggregates payoffs in a way that creates inertia, a principal can trade early transfers for later rents.
Fourth, we connect these results to an increasingly salient design choice: the agent may be able to select (or configure) its learning algorithm, trading off performance and computational or implementation cost. We therefore study a simple stage-0 meta-game in which the agent chooses between a weaker mean-based learner and a stronger swap-regret learner, paying a complexity cost for the latter. The induced equilibrium has a threshold structure: the agent invests in the defensive, internally consistent algorithm exactly when the expected exploitation loss from using the weaker learner exceeds the cost differential. This captures, in a minimalist way, a broader phenomenon: as principals become more sophisticated in designing adaptive incentives, agents (including AI developers acting on their behalf) have greater incentive to deploy more defensive learning architectures.
From a policy and practice standpoint, our results speak to three audiences. For platform designers and employers, the analysis clarifies when adaptive incentive schemes are merely optimizing within a static frontier versus when they are effectively exploiting the agent's learning process to extract additional surplus. For regulators and auditors, it highlights that ``no-regret'' claims are not interchangeable: an AI agent that is externally no-regret may still be vulnerable to exploitation by an adaptive principal, raising concerns analogous to dark patterns but mediated through learning dynamics. For agent designers, it suggests that internal-regret minimization can be interpreted as a form of defensive investment, potentially worth its computational or engineering costs when interacting with strategic or adaptive counterparties.
We also emphasize limitations. Our baseline model is intentionally spare: the action and outcome sets are finite; contracts are bounded and satisfy limited liability; and the principal’s objective is evaluated in long-run average terms. We abstract away from richer informational structures (e.g., contextual features, state dynamics, multi-agent interactions, or noisy observation of outcomes), and we do not claim that every practical learning system fits neatly into either the swap-regret or mean-based bucket. Rather, our goal is to provide a tractable map from learning guarantees to the principal’s dynamic power, and to identify the precise sense in which stronger regret notions shut down dynamic manipulation channels. Extensions to stateful settings and partial feedback introduce additional technicalities and may require stronger assumptions; we view these as promising directions rather than settled conclusions.
The remainder of the paper proceeds as follows. Section~2 introduces the repeated contract environment, defines the relevant learner classes (including mean-based and swap-regret notions), and formalizes the principal’s robust value and static benchmark. Subsequent sections establish the static optimality result against swap regret, develop the dynamic steerability constructions against mean-based learners, and analyze the algorithm-choice threshold when the agent can endogenously select its learning sophistication. Throughout, we aim to keep the economic logic in view: dynamic contracts matter not because the stage game changes, but because learning turns the history of incentives into a state variable that a principal may, or may not, be able to control.
We study a repeated principal–agent interaction with hidden action and observable outcomes. Time is discrete, indexed by t ∈ [T] := {1, …, T}, and we ultimately evaluate long-run average payoffs as T → ∞.
The agent has a finite action set A = [n]. Action a ∈ A incurs a (known) cost ca ≥ 0 and induces a distribution over observable outcomes O = [m], denoted Fa ∈ Δ(O). In round t, after the agent chooses at, an outcome ot ∈ O is drawn according to ot ∼ Fat.
The principal derives a nonnegative reward from the realized outcome.
We represent rewards by a vector r ∈ ℝ+m,
where r(o) denotes
the reward from outcome o ∈ O. For each action
a, define the principal’s
expected reward
Ra := 𝔼o ∼ Fa[r(o)].
We impose boundedness throughout: there exists r̄ < ∞ such that 0 ≤ r(o) ≤ r̄ for
all o ∈ O. Similarly,
we assume costs are bounded, 0 ≤ ca ≤ c̄
for all a ∈ A. These
bounds ensure per-period utilities are uniformly bounded, which allows
us to use standard o(T) regret notions and
interchange expectations and averages without technical
distractions.
A (one-period) contract is a mapping from outcomes to nonnegative
transfers, i.e.,
p : O → ℝ+, p(o) ≥ 0 ∀o ∈ O.
We identify a contract with its vector p ∈ ℝ+m,
where the oth coordinate is
p(o). The principal’s
feasible contract space is a fixed set 𝒫 ⊆ ℝ+m
satisfying two restrictions that reflect common implementation
constraints.
Limited liability: transfers are nonnegative, so the agent cannot be forced to make payments to the principal. This is built into 𝒫 ⊆ ℝ+m.
Bounded payments: 𝒫 is bounded, so there is an exogenous cap on payments. Concretely, we may assume there exists p̄ < ∞ such that 0 ≤ p(o) ≤ p̄ for all p ∈ 𝒫 and all outcomes o ∈ O. This captures budget or policy constraints (e.g., bonus pools, rate limits on credits, or institutional limits on penalties). It also prevents degenerate constructions in which the principal uses arbitrarily large temporary payments to force behavior.
Given a contract p and
action a, the agent’s
one-period expected transfer is 𝔼o ∼ Fa[p(o)].
We write the agent and principal expected utilities in the stage
interaction as
uA(p, a) = 𝔼o ∼ Fa[p(o)] − ca, uP(p, a) = 𝔼o ∼ Fa[r(o) − p(o)] = Ra − 𝔼o ∼ Fa[p(o)].
It is often helpful to emphasize that uP(p, a) + uA(p, a) = Ra − ca,
so transfers only redistribute surplus between principal and agent,
conditional on the induced action.
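To fix ideas, the following minimal sketch (in Python; the numbers are purely illustrative and not tied to any result in the paper) encodes the stage objects just defined and checks that transfers cancel from total surplus.
\begin{verbatim}
import numpy as np

# Hypothetical two-action, two-outcome instance (numbers are illustrative only).
F = np.array([[0.5, 0.5],    # F_L: outcome distribution of the low action
              [0.1, 0.9]])   # F_H: outcome distribution of the high action
c = np.array([0.0, 0.2])     # action costs c_a
r = np.array([0.0, 1.0])     # principal reward r(o) for each outcome

def u_A(p, a):
    # Agent's expected stage utility: E_{o ~ F_a}[p(o)] - c_a.
    return F[a] @ p - c[a]

def u_P(p, a):
    # Principal's expected stage utility: R_a - E_{o ~ F_a}[p(o)].
    return F[a] @ (r - p)

p = np.array([0.0, 0.6])           # limited-liability contract: pay 0.6 on success only
print(u_A(p, 1), u_P(p, 1))        # agent and principal payoffs if the high action is taken
print(u_A(p, 1) + u_P(p, 1))       # equals R_H - c_H: transfers only redistribute surplus
\end{verbatim}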
The realized contract pt and outcome ot are observed by both parties. The agent’s action at may be unobserved by the principal (the standard moral-hazard formulation), though our benchmark and regret definitions do not require the principal to observe at directly.
We intentionally allow flexibility in what feedback the agent uses to update. In a full-feedback variant, the agent can evaluate the expected utility it would have obtained from every action under the posted contract pt (e.g., because it knows (Fa)a ∈ A and (ca)a ∈ A). In a bandit-feedback variant, the agent only observes the realized utility from the chosen action. Our results are stated in terms of abstract regret properties (external, internal/swap, mean-based), which can be achieved under either feedback model with appropriate algorithms; the substantive distinction for us is not the feedback per se, but the deviation class the learner protects against.
A principal policy π maps
histories to contracts. Let ht denote the
public history prior to round t,
ht := (p1, o1, …, pt − 1, ot − 1).
An adaptive (history-dependent) policy is any mapping π such that pt = π(ht), possibly randomized. An oblivious policy is one that does not depend on realized outcomes, for instance a predetermined sequence (pt)t ≥ 1 or a stationary randomized rule that draws pt i.i.d. from some distribution over 𝒫. Adaptive
policies capture common practices such as A/B testing with iterative
updates, bonus schedules that respond to performance metrics, or dynamic
KPI reweighting.
We take the principal to commit to a policy at the outset (at least conceptually), and we evaluate what payoff this policy guarantees against a specified class of agents. This worst-case posture is motivated by environments where the principal may not know the agent’s exact update rule, but can plausibly restrict it to a class (e.g., entropy-regularized online learners, or internal-regret minimizers used for strategic robustness).
The agent is represented by a learning algorithm 𝒜 that, given the observed history and the current contract, selects actions (possibly at random). We do not impose expected-utility maximization period by period; instead, we assume the agent satisfies a regret guarantee relative to an appropriate benchmark class of deviations.
For each round t and action
a, define the agent’s utility
from playing a against the
posted contract pt:
UA(t, a) := 𝔼o ∼ Fa[pt(o)] − ca.
Let at be
the realized action chosen by the algorithm in round t. The agent’s realized utility is
pt(ot) − cat,
while UA(t, at)
is the expectation conditional on (pt, at).
We will refer to cumulative expected utilities
$$
\sigma_t(a)\;:=\;\sum_{s=1}^{t-1} U_A(s,a),
$$
which play a central role in mean-based dynamics. Intuitively, σt(a)
is the score assigned to action a by a learner that aggregates past
expected payoffs.
The standard no-regret condition requires that, for every action
a ∈ A, the learner’s
cumulative utility is asymptotically at least that of always playing
a. Formally, the external
regret after T rounds is
$$
\mathrm{Reg}(T)
\;:=\;
\max_{a\in A}\sum_{t=1}^T U_A(t,a)\;-\;\sum_{t=1}^T U_A(t,a_t),
$$
and a no-regret learner satisfies Reg(T) = o(T).
External regret captures robustness to deviations that switch the entire action sequence to a single fixed action. It does not protect against deviations that systematically revise actions in a history-dependent way (for instance, ``whenever I played action i, I should instead have played j'').
To formalize stronger, internally consistent learning, we use swap
regret. For any mapping (``swap'') $\phi:A\to A$, consider the deviation that replaces each played action $a_t$ with $\phi(a_t)$. The swap regret is
\[
\mathrm{SwapReg}(T)
\;:=\;
\max_{\phi:A\to A}\sum_{t=1}^T U_A\big(t,\phi(a_t)\big)\;-\;\sum_{t=1}^T U_A(t,a_t).
\]
A no-swap-regret learner satisfies $\mathrm{SwapReg}(T)=o(T)$. This notion is strictly stronger than external regret and is closely tied to correlated-equilibrium obedience constraints in repeated play. Economically, it represents an agent that can defend against \emph{path-dependent} exploitation: if a principal's policy benefits from inducing the agent to play different actions at different times, swap regret asks whether the agent could have systematically ``re-labeled'' those choices to do better.
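To make the two deviation classes concrete, the following unoptimized sketch (the array layout and function names are our own convention) computes both regret notions from a history of per-round expected utilities UA(t, ·) and played actions; it uses the fact that the maximizing swap map can be chosen action by action.
\begin{verbatim}
import numpy as np

def external_regret(U, actions):
    # Illustrative helper. U: T x n array with U[t, a] = U_A(t, a);
    # actions: length-T sequence of played actions.
    T = len(actions)
    realized = U[np.arange(T), actions].sum()
    return U.sum(axis=0).max() - realized

def swap_regret(U, actions):
    # Best gain over all swap maps phi: A -> A applied to the played sequence.
    # The maximization decomposes action by action, so no enumeration of n^n maps is needed.
    T, n = U.shape
    actions = np.asarray(actions)
    realized = U[np.arange(T), actions].sum()
    best_swapped = sum(U[actions == i].sum(axis=0).max() for i in range(n))
    return best_swapped - realized
\end{verbatim}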
Many widely used online learning rules—including multiplicative weights and softmax choice with entropy regularization—make action probabilities an explicit function of cumulative scores such as σt(⋅). For our purposes, we isolate a behavioral implication rather than a specific algorithm.
We say a learner is mean-based if there exists a function γ(T) = o(1) such that, for all times t ≤ T and actions i, j ∈ A,
σt(i) ≤ σt(j) − γ(T)T  ⇒  Pr [at = i] ≤ γ(T).
That is, if an action i is
behind another action j by a
sufficiently large cumulative margin, then the learner assigns i only vanishing probability. This
condition captures the inertia central to our dynamic-contract
constructions: once an action has accumulated a large advantage in
cumulative utility, it remains likely to be played even if it is no
longer optimal in the instantaneous sense. At the same time, the
condition is weak enough to include a broad family of algorithms and to
permit persistent (but diminishing) exploration.
A canonical example is a softmax learner that selects actions
according to
$$
\Pr[a_t=a]\;\propto\;\exp\!\left(\frac{\sigma_t(a)}{\tau}\right),
$$
where τ > 0 is a
temperature (equivalently, an entropy-regularization parameter). Smaller
τ makes the learner more
greedy with respect to cumulative scores, while larger τ increases exploration. Such
learners are mean-based in the above sense (with γ(T) depending on τ and the bounded payoff range), and
can satisfy external no-regret under mild conditions. However, unless
explicitly designed for internal consistency, they need not achieve
no-swap-regret, and thus may remain vulnerable to dynamic steering.
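A minimal implementation sketch of such a softmax learner (full-feedback form; the class name and interface are our own illustration, not a prescribed algorithm):
\begin{verbatim}
import numpy as np

class SoftmaxLearner:
    # Illustrative mean-based learner: Pr[a_t = a] proportional to exp(sigma_t(a) / tau).
    def __init__(self, n_actions, tau=1.0, seed=0):
        self.sigma = np.zeros(n_actions)       # cumulative expected utilities sigma_t(a)
        self.tau = tau
        self.rng = np.random.default_rng(seed)

    def act(self):
        z = self.sigma / self.tau
        probs = np.exp(z - z.max())            # subtract the max for numerical stability
        probs /= probs.sum()
        return self.rng.choice(len(self.sigma), p=probs)

    def update(self, utilities):
        # utilities[a] = U_A(t, a): expected utility of each action under the posted contract.
        self.sigma += utilities
\end{verbatim}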
Because our benchmark comparisons will involve static contracting, we
define the agent’s best-response correspondence to a contract p ∈ 𝒫:
BR(p) := arg maxa ∈ A{𝔼o ∼ Fa[p(o)] − ca}.
The set-valued nature of BR(p) matters: if
multiple actions tie for highest expected utility, then a learning agent
might converge to any of them (or cycle), and a robust principal should
evaluate the worst case. This motivates the ``mina ∈ BR(p)’’
convention in our static benchmark.
The principal’s realized cumulative payoff under policy π and agent algorithm 𝒜 is $\sum_{t=1}^T
u_P(p_t,a_t)$, where pt = π(ht).
We evaluate long-run performance in expected average terms, taking
expectations over outcome randomness and any randomization in π or 𝒜. For a learner class 𝔏, define the principal’s robust long-run
value as
$$
V(\mathfrak L)
\;:=\;
\sup_{\pi}\ \inf_{\mathcal A\in\mathfrak L}\
\liminf_{T\to\infty}\frac{1}{T}\,
\mathbb E\!\left[\sum_{t=1}^T u_P(p_t,a_t)\right].
$$
Two aspects of this definition are deliberate. First, we take a supremum
over policies rather than static contracts, so V(𝔏) measures the maximal benefit of
dynamic contracting when facing learner class 𝔏. Second, we take an infimum over algorithms
in the class, which encodes robustness: the principal can rely only on
the properties shared by all learners in 𝔏 (e.g., no-swap-regret or mean-based
behavior), not on detailed implementation.
To anchor the analysis, we also define an optimal static benchmark. If the
principal commits to a single contract p ∈ 𝒫 in every round, then any
sufficiently patient learner with vanishing regret should asymptotically
concentrate on best responses to p. The worst-case long-run payoff
from static commitment is therefore captured by
Vstatic := maxp ∈ 𝒫 mina ∈ BR(p)uP(p, a).
This benchmark is the natural analogue of a Stackelberg value in a
one-shot principal–agent problem with possibly non-unique best
responses: the principal chooses p, and Nature selects the least
favorable best response for the principal. In Section~3 we relate this
object to standard one-shot contracting logic and show why, against
sufficiently defensive learning (notably, no-swap-regret), dynamic
policies cannot systematically outperform it.
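When 𝒫 is finite, or is discretized to a grid, the benchmark Vstatic can be computed by direct enumeration. A sketch (illustrative only; the tie-detection tolerance and the environment numbers are hypothetical choices):
\begin{verbatim}
import numpy as np

def static_benchmark(F, c, r, contracts, tol=1e-9):
    # V_static = max over contracts p of min over a in BR(p) of u_P(p, a).
    best = -np.inf
    for p in contracts:
        u_a = F @ p - c                               # agent's expected utility of each action
        u_p = F @ (r - p)                             # principal's expected utility of each action
        br = np.flatnonzero(u_a >= u_a.max() - tol)   # (approximate) best-response set BR(p)
        best = max(best, u_p[br].min())               # pessimistic tie-breaking within BR(p)
    return best

# Hypothetical two-action, two-outcome environment with success bonuses on a grid.
F = np.array([[0.5, 0.5], [0.1, 0.9]])
c = np.array([0.0, 0.2])
r = np.array([0.0, 1.0])
grid = [np.array([0.0, x]) for x in np.linspace(0.0, 1.0, 101)]
print(static_benchmark(F, c, r, grid))   # 0.5 here: the low action is the static optimum
\end{verbatim}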
Our notion of ``dynamic advantage'' is only meaningful relative to a clean static baseline. In this section we therefore unpack the benchmark
\[
V_{\mathrm{static}}
\;:=\;
\max_{p\in\mathcal P}\ \min_{a\in BR(p)} u_P(p,a),
\]
and explain why it is the appropriate one-shot Stackelberg value in our contracting environment. We also clarify when, and in what sense, repeatedly posting a single contract replicates the one-shot outcome against a learning agent. These observations will be the hinge for the impossibility result in the next section: once the agent's learning rule is sufficiently ``defensive''
(in particular, no-swap-regret), the principal is effectively pushed
back to the static benchmark.
Consider the underlying stage interaction in isolation. The principal first commits to a contract p ∈ 𝒫, after which the agent chooses an action a ∈ A that maximizes its expected utility uA(p, a). This is the standard Stackelberg timing used in moral-hazard models: incentives are set ex ante, actions follow.
Two modeling choices matter for the value the principal can guarantee. The first is that best responses may not be unique. The second is that, in our robust perspective, the principal does not control the agent’s tie-breaking (nor do we assume the principal can predict it). We therefore adopt the Stackelberg convention: when multiple actions maximize uA(p, ⋅), Nature selects the one that is worst for the principal. Formally, the principal’s guaranteed payoff from contract p is mina ∈ BR(p)uP(p, a), and optimizing over p yields Vstatic.
This pessimistic tie-breaking is conservative but economically natural in settings where (i) the principal does not observe actions and cannot condition on them, and (ii) the agent is represented by an algorithm whose fine-grained selection among near-ties may be opaque. In applications, a principal may hope for favorable tie-breaking (e.g., through communication, norms, or default choices), which corresponds to maxa ∈ BR(p)uP(p, a). Our subsequent results are stated for the pessimistic benchmark because it is the correct comparator for worst-case guarantees and because it interacts cleanly with regret-based learning: learning rules generally guarantee performance relative to deviation classes, not favorable tie-breaking for the principal.
It is helpful to relate Vstatic to the familiar
one-shot incentive-design program. Fix a target action a. In a textbook formulation, the
principal would like to minimize the expected transfer subject to
incentive constraints ensuring a is optimal for the agent:
uA(p, a) ≥ uA(p, b) ∀b ∈ A.
If a is strictly optimal under
the chosen p, then the agent’s
best response is unique and the principal’s payoff is simply uP(p, a) = Ra − 𝔼o ∼ Fa[p(o)].
When strict optimality is not guaranteed (because 𝒫 is coarse, or because the optimal p lies on an indifference boundary),
then the principal must confront the possibility that the induced
best-response set contains multiple actions with different implications
for uP.
The object Vstatic can be viewed as ``one-shot contracting with adverse tie-breaking.'' It asks the principal to choose p not merely to make some action optimal, but to ensure that every action that is optimal for the agent under p yields the principal at least the guaranteed value. In other words, a contract is only as good as its worst best response.
This distinction matters even in classical environments. For example, limited liability and bounded transfers can force indifferences: the principal may be unable to separate two actions that generate similar outcome distributions or differ in costs by less than the available incentive power. In such cases, the principal may be able to make a desirable action a optimal but may be unable to prevent an alternative action b (with worse uP) from also being optimal. The pessimistic Stackelberg benchmark correctly treats this as a real implementation constraint.
We now connect the one-shot benchmark to the repeated interaction.
Suppose the principal posts a contract p in every round: pt ≡ p.
Then the agent faces a stationary payoff environment in which, for each
action a, the per-period
expected utility is constant:
UA(t, a) = uA(p, a) ∀t.
In such a stationary environment, standard regret guarantees imply that
the agent’s long-run behavior concentrates on (approximate) best
responses to p, so long-run
payoffs coincide with the one-shot logic up to vanishing errors.
To make this precise, let ϵT := Reg(T)/T
denote average external regret. Since UA(t, a)
does not vary with t under a
fixed contract, the external no-regret condition implies
$$
\frac{1}{T}\sum_{t=1}^T U_A(t,a_t)
\;\ge\;
\max_{a\in A} u_A(p,a)\ -\ \epsilon_T.
$$
Let a⋆ ∈ arg maxauA(p, a)
be an optimal action, and define the empirical distribution of play
qT ∈ Δ(A)
by
$$
q_T(a)\;:=\;\frac{1}{T}\sum_{t=1}^T \Pr[a_t=a].
$$
Taking expectations over the agent’s randomization and using
stationarity yields
∑a ∈ AqT(a) uA(p, a) ≥ uA(p, a⋆) − ϵT.
This inequality immediately controls the mass placed on suboptimal
actions. If an action i is
Δ-suboptimal, in the sense
that
uA(p, a⋆) − uA(p, i) ≥ Δ > 0,
then rearranging gives
$$
q_T(i)\ \le\ \frac{\epsilon_T}{\Delta}.
$$
Thus, whenever the best response is unique (or, more generally,
separated by a positive gap from the rest), external no-regret forces
the long-run frequency of non-best-response actions to vanish. In that
case, repeatedly posting p
replicates the one-shot prediction: asymptotically the agent plays the
unique best response, and the principal’s average payoff converges to
uP(p, a⋆).
The remaining subtlety is precisely the one encoded by mina ∈ BR(p).
When BR(p)
contains multiple actions, the above argument only implies that qT concentrates
on that set, not which element of the set is selected. Consequently, the
principal’s long-run payoff under a fixed contract p converges (along subsequences)
to
∑a ∈ BR(p)q(a) uP(p, a)
for some limiting distribution q supported on BR(p). Without
additional structure on the learning algorithm’s tie-breaking, the
principal cannot rule out convergence to the worst element of BR(p), which
motivates the pessimistic evaluation mina ∈ BR(p)uP(p, a).
A useful genericity remark is that indifferences are ``knife-edge’’ when 𝒫 is sufficiently rich. Indeed, for fixed i ≠ j, the indifference condition uA(p, i) = uA(p, j) defines an affine hyperplane in ℝm (because uA(p, a) is linear in p through 𝔼o ∼ Fa[p(o)]). If 𝒫 has nontrivial dimension, one can often perturb p slightly (within feasibility) to break ties while changing the principal’s payoff only slightly. This heuristic helps reconcile the pessimistic definition with the intuition that, in many continuous contract families, optimal contracts will have unique best responses. We nevertheless keep the worst-case tie-breaking explicitly because it is the correct benchmark under bounded, discrete, or otherwise coarse contract spaces, and because our robust value comparisons are stated uniformly over environments.
The preceding discussion establishes that Vstatic is not an arbitrary comparator: it is the value of a genuine one-shot Stackelberg problem and is also the long-run value delivered by repeatedly posting a single contract, up to vanishing regret errors, against a broad set of learning agents. This gives Vstatic two distinct interpretations: as the value of a one-shot contracting problem with adverse tie-breaking, and as the long-run guarantee of stationary commitment against no-regret learners. Both interpretations are important for what follows. The first ties our repeated-game benchmark to standard contracting theory. The second clarifies why the repeated game is not automatically more powerful for the principal: if the agent's learning is sufficiently robust, then the principal cannot leverage nonstationarity to extract additional surplus beyond what is available under static commitment.
At the same time, viewing Vstatic as a baseline also highlights the only possible channel for dynamic gains. Any improvement over Vstatic must come from exploiting dynamics of learning: by varying pt over time, the principal may be able to shape the agent’s cumulative scores or beliefs so that the agent plays actions that would not be chosen under the ultimately intended stationary contract. Whether this is feasible depends entirely on what deviation class the learner defends against. External no-regret alone leaves considerable room for path-dependent manipulation, while internal (swap) regret closes much of it by enforcing a form of dynamic consistency. The next section formalizes this distinction: against no-swap-regret learners, dynamic policies collapse back to the static benchmark, whereas against mean-based learners, carefully designed nonstationarities can strictly improve the principal’s long-run payoff.
The previous section isolated the static pessimistic Stackelberg benchmark Vstatic as the natural comparator for repeated contracting. We now show that, once the agent's learning rule is sufficiently defensive—in the sense of guaranteeing vanishing swap (internal) regret—the principal cannot systematically profit from nonstationarity. In short, dynamic contracts do not buy the principal additional long-run value against an agent that is robust to action-remapping deviations.
Fix any class 𝔏swap of learning algorithms that guarantee SwapReg(T) = o(T) with respect to the agent’s expected utilities UA(t, ⋅). Recall that the principal evaluates a policy π pessimistically, taking an infimum over algorithms in the class. Our first main observation is that, under this evaluation, dynamic policies collapse to the static benchmark.
The economic logic is straightforward. Dynamic advantage requires
exploiting path dependence: the principal varies pt to shape the
agent’s internal state (cumulative utilities, beliefs, scores), thereby
inducing actions that would not be chosen under the ``eventual'' contract. Swap regret is precisely a defense against such path dependence. It allows the agent to compare its realized trajectory not only to \emph{fixed} alternative actions (external regret), but to \emph{systematic relabelings} of its own behavior (internal regret). If the principal's manipulation hinges on getting the agent to ``stick''
with a dominated label (e.g., continuing to play a costly action because
it once built up score advantage), then a swap deviation that replaces
that label by a cheaper alternative exposes the manipulation.
Anticipating this, a no-swap-regret learner will not provide the
principal with the sustained slack needed to extract rents
dynamically.
One can also view the result through an extreme (but instructive)
special case: a fully informed agent can play a myopic best response each period, selecting some at ∈ BR(pt). Such behavior has nonpositive swap regret, because for every mapping ϕ : A → A and
every t,
$$
U_A(t,\phi(a_t))\ \le\ U_A(t,a_t)
\qquad\Rightarrow\qquad
\sum_{t=1}^T U_A(t,\phi(a_t))-\sum_{t=1}^T U_A(t,a_t)\ \le\ 0.
$$
Against such an agent, the principal’s per-round payoff is always
bounded above by the pessimistic Stackelberg payoff of the posted
contract:
uP(pt, at) ≤ mina ∈ BR(pt)uP(pt, a),
so averaging over t and
maximizing over the choice of pt cannot exceed
maxpmina ∈ BR(p)uP(p, a) = Vstatic.
This already shows that, robustly (i.e., under an infimum over all swap-regret learners), the principal cannot guarantee more than Vstatic. The more
substantive content of the proposition is that the same upper bound is
enforced even when we do not assume myopic best responses, but only the
weaker asymptotic property SwapReg(T) = o(T).
To connect swap regret to equilibrium constraints, let us write the
internal-regret inequality in a form that exposes its ``obedience’’
content. Fix a horizon T and consider the agent's realized action sequence a1, …, aT against the principal's posted contracts p1, …, pT. The swap regret bound says that for every mapping ϕ : A → A,
$$
\sum_{t=1}^T U_A\bigl(t,\phi(a_t)\bigr)\;-\;\sum_{t=1}^T U_A(t,a_t)\ \le\ o(T).
$$
It suffices to consider deviations that change one action into another: for each pair i, j ∈ A, let ϕi → j map i to j and fix all other actions. Substituting ϕi → j into the bound above yields, for all i, j,
$$
\sum_{t:\,a_t=i}\bigl(U_A(t,j)-U_A(t,i)\bigr)\ \le\ o(T).
$$
In words: on the rounds in which the agent played i, systematically replacing i by any fixed alternative j would not have raised the agent's cumulative utility by more than o(T).
This is exactly the form of the obedience constraints, but with a
one-sided focus on the agent. To make the correspondence explicit,
define the empirical distribution μT over pairs
(p, a) ∈ 𝒫 × A
induced by the repeated play:
$$
\mu_T(B\times\{i\})\ :=\ \frac{1}{T}\sum_{t=1}^T \mathbf{1}\{p_t\in B,\
a_t=i\},
\qquad B\subseteq\mathcal P.
$$
Then the pairwise inequality can be rewritten as
$$
\mathbb E_{(p,a)\sim\mu_T}\Bigl[\mathbf 1\{a=i\}\bigl(u_A(p,j)-u_A(p,i)\bigr)\Bigr]\ \le\ o(1)
\qquad\text{for all } i,j\in A,
$$
where we used that UA(t, a) = uA(pt, a) in expectation. Interpreting a as a ``recommendation'' (the action the agent ends up taking), this inequality states that the agent has (asymptotically) no incentive to deviate from the recommendation to any fixed alternative action j. Thus, any limit point μ of {μT} satisfies
the correlated-equilibrium-type obedience constraints for the
agent:
𝔼μ[1{a = i}(uA(p, j) − uA(p, i))] ≤ 0, ∀i, j ∈ A.
A convenient way to read these constraints is through conditional
expectations. If we condition on a = i and define the
conditional distribution over contracts μ(⋅ ∣ a = i), then
obedience implies that i
maximizes the agent’s conditional expected utility:
𝔼p ∼ μ(⋅ ∣ a = i)[uA(p, i)] ≥ 𝔼p ∼ μ(⋅ ∣ a = i)[uA(p, j)] ∀j ∈ A.
Because uA(p, a)
is linear in p (through 𝔼o ∼ Fa[p(o)]),
we may equivalently define the conditional average contract,
p̄i := 𝔼μ[ p ∣ a = i ] ∈ co(𝒫),
and conclude that i ∈ BR(p̄i)
for every i played with
positive probability. In other words, swap regret forces the play to
look like a correlation device that draws a contract from a distribution
tailored to the recommended action, but still respects the agent’s
incentive compatibility constraints. This is the formal sense in which internal
regret eliminates ``behavioral mistakes’’ that a principal could
otherwise amplify dynamically.
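These constraints can be audited directly from a record of posted contracts and realized actions: by linearity of uA in p, the per-action obedience condition is equivalent to a best-response check at the conditional average contract p̄i. A minimal sketch (full-feedback form; the interface is our own convention):
\begin{verbatim}
import numpy as np

def obedience_gaps(F, c, contracts, actions):
    # Illustrative check. For each played action i, return max_j u_A(pbar_i, j) - u_A(pbar_i, i),
    # where pbar_i averages the posted contracts over rounds with a_t = i.
    # Swap-regret learners keep these gaps o(1) as the horizon grows.
    contracts = np.asarray(contracts)    # T x m array of posted contracts
    actions = np.asarray(actions)        # length-T vector of played actions
    gaps = {}
    for i in np.unique(actions):
        p_bar = contracts[actions == i].mean(axis=0)   # conditional average contract
        u = F @ p_bar - c                              # u_A(pbar_i, .) by linearity in p
        gaps[int(i)] = float(u.max() - u[i])
    return gaps
\end{verbatim}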
We now explain why these agent-side correlated-equilibrium constraints are enough to kill dynamic advantage for the principal in our contracting game. The high-level reason is that the principal cannot condition transfers on the agent’s hidden action. Consequently, any long-run outcome must be supportable by contracts for which the realized actions are (approximately) best responses; but when the agent is allowed to resolve indifferences adversarially, the principal is driven to the same worst-best-response evaluation that defines Vstatic.
A compact way to formalize this is to view the repeated interaction
as inducing, in the limit, a joint distribution μ over (p, a) satisfying the
obedience constraints. The principal’s long-run average payoff under
μ is
𝔼μ[uP(p, a)].
Since the principal takes an infimum over learners in 𝔏swap, we should treat μ pessimistically: whenever the
agent is (approximately) indifferent among multiple actions, the learner
may select those actions so as to minimize the principal’s payoff while
preserving obedience.
This pessimism is not an additional modeling choice; it is precisely what internal regret enables. If the principal tries to exploit an indifference region by ``nudging’’ the learner into a principal-favorable best response, a swap-regret learner can instead implement a different best response on that region without sacrificing its own utility (or its regret guarantees). From the principal’s perspective, the only contracts whose induced behavior is robust to such relabelings are those that maximize the principal payoff under worst-case best-response selection—exactly the definition of Vstatic.
Technically, one can convert this intuition into an upper bound via a
reduction to the one-shot pessimistic Stackelberg problem. Fix any
(possibly adaptive) principal policy π. Consider the subclass of
no-swap-regret learners that, whenever there are multiple approximate
best responses, choose among them so as to minimize uP(pt, a)
(subject to maintaining the no-swap-regret property). Such tie-breaking
is compatible with swap regret because internal regret constrains only
the agent’s utilities, and within an indifference set the agent can move
probability mass without affecting its cumulative utility up to o(T). It follows that,
under this worst-case learner, the principal’s per-period payoff is
asymptotically bounded by the pessimistic Stackelberg payoff of the
posted contract:
uP(pt, at) ≤ mina ∈ BR(pt)uP(pt, a) + o(1).
Averaging over t gives
$$
\frac{1}{T}\sum_{t=1}^T u_P(p_t,a_t)
\ \le\
\frac{1}{T}\sum_{t=1}^T \min_{a\in BR(p_t)}u_P(p_t,a)\ +\ o(1)
\ \le\
\max_{p\in\mathcal P}\min_{a\in BR(p)}u_P(p,a)\ +\ o(1),
$$
which yields the desired upper bound V(𝔏swap) ≤ Vstatic.
The reverse inequality V(𝔏swap) ≥ Vstatic
is achieved by the stationary policy that repeats an optimal static
contract.
This proof route emphasizes the game-theoretic meaning of internal regret: it gives the agent enough discipline to implement (approximate) obedience constraints, yet enough freedom to select the principal-worst obedient behavior whenever the contract does not pin down a unique response. The repeated game therefore does not enlarge the principal's guaranteed value set beyond what is already feasible under static commitment with pessimistic tie-breaking.
The preceding argument is clearest when 𝒫 is finite, but the conclusion does not rely on finiteness. The key requirements are (i) bounded payments (so that regret bounds apply uniformly), and (ii) linearity of payoffs in the contract vector p ∈ ℝ+m. When 𝒫 is compact (as under limited liability with an upper bound), one can discretize 𝒫 by an ε-net in ℓ∞ and observe that both uA(p, a) and uP(p, a) are Lipschitz in p with constants controlled by the outcome probabilities. The discretized game yields the same conclusion up to an O(ε) error, and letting ε → 0 recovers Vstatic.
Because both players’ expected utilities depend on p only through expectations 𝔼o ∼ Fa[p(o)], randomizing over contracts is equivalent to posting the expected contract whenever 𝒫 is convex. If 𝒫 is not convex, then allowing the principal to mix effectively enlarges the feasible set to co(𝒫). The same impossibility logic goes through with Vstatic interpreted over the relevant feasible set (either 𝒫 if mixing is disallowed, or co(𝒫) if mixing is allowed). In either case, internal regret prevents improvement beyond the appropriate static commitment benchmark.
The knife-edge nature of the impossibility result is economically
informative: dynamic advantage reappears only to the extent that
internal-regret protection is imperfect. If the agent guarantees SwapReg(T) ≤ ηT
for some small η > 0, then
the obedience constraints are violated by at most η in aggregate. In finite games
these become a system of linear inequalities, and standard duality
arguments imply a Lipschitz-type bound: the principal’s best dynamic
advantage is at most additive O(η) above Vstatic (with constants
depending on payoff bounds). Thus, ``slightly'' defensive learners admit only ``slightly'' exploitable dynamics.
Finally, we stress what the result does and does not say. It is not that a swap-regret learner must literally best respond every period; rather, any learner that ensures low internal regret is protected from systematic dynamic exploitation in the long run. The next section shows that this protection is special to internal regret: mean-based or entropy-regularized external-regret learners, while rational in the static sense, can be predictably steered by nonstationary incentives. This contrast is precisely what makes the static benchmark a sharp dividing line: it is simultaneously achievable by a stationary principal and unavoidable against sufficiently sophisticated (swap-regret) agents, yet not descriptive of what happens when learning dynamics are weaker.
The impossibility result in the previous section hinges on a strong form of ``defensive'' behavior: internal-regret protection prevents the principal from leveraging the agent's path-dependent state. When the agent instead uses a broad family of mean-based (and, in particular, entropy-regularized) no-regret methods, the picture changes sharply. In these dynamics, actions are not chosen as exact best responses to the current contract; rather, they are chosen as a smooth (or inertia-laden) function of cumulative scores. This opens a channel for the principal to invest in moving the learner to a desirable region of its state space and then to harvest by reducing incentives while the induced behavior decays only gradually.
We will not tie ourselves to a single algorithm. Instead, we use a
property that is shared by many externally no-regret procedures used in
practice—multiplicative weights, regularized follow-the-leader,
logit/softmax choice with annealed step-sizes, and related variants.
Recall the cumulative (expected) utility score
$$
\sigma_t(a)\ :=\ \sum_{s=1}^{t-1}U_A(s,a)
\qquad\text{where}\qquad
U_A(t,a)=\mathbb E_{o\sim F_a}[p_t(o)]-c_a.
$$
A learner is mean-based (in the sense we require) if there exists γ(T) = o(1) such that for all rounds t ≤ T and all actions i, j ∈ A,
σt(i) ≤ σt(j) − γ(T)T  ⇒  Pr [at = i] ≤ γ(T).
This condition says that once an action falls behind in cumulative utility by a linear-in-T margin, it becomes vanishingly unlikely to be played. Importantly, the condition does not impose the action-remapping stability that internal regret delivers. It only rules out persistently playing dominated actions in the long run. As a result,
the principal can profit from dominance: by creating a large early lead
for a costly, high-reward action, the principal can keep the learner on
that action even after the contract is modified, until the score
advantage is ``worked off’’.
Entropy regularization makes this intuition particularly transparent.
A canonical model is the logit rule
Pr [at = i] ∝ exp (σt(i)/τ),
where τ > 0 is a
temperature parameter. Small τ
yields sharp best-response-like behavior, while large τ induces exploration and faster
mixing. Under decreasing step-sizes (so that σt accumulates
rather than averaging), the principal can create large score gaps that
persist for many periods. Thus, even when the agent is asymptotically
no-regret in the external sense, the trajectory of play can be highly
steerable.
We now sketch a simple construction that already yields a strict gap
over the static pessimistic benchmark. The environment uses two
outcomes, O = {0, 1}, where
o = 1 is ``success.’’ Let the
principal’s reward be r(1) = 1
and r(0) = 0. Consider a
one-parameter (linear) limited-liability contract family
px(1) = x, px(0) = 0, x ∈ [0, x̄],
so that the only lever is the success bonus x.
Take two actions, a low action a = L and a high action
a = H. Let success
probabilities be qL < qH,
costs be cL = 0 and cH = c > 0.
Then the agent’s expected utility difference under contract x is
uA(px, H) − uA(px, L) = x(qH − qL) − c,
so there is a breakpoint x⋆ := c/(qH − qL):
for x > x⋆ the
high action is uniquely optimal, while for x < x⋆ the
low action is uniquely optimal. In a static contract, pushing the agent
to H requires paying at least
x⋆ per success,
which costs the principal x⋆qH
in expectation. The static pessimistic value is therefore
$$
V_{\mathrm{static}}
=\max\Bigl\{q_L,\ q_H-x^\star q_H\Bigr\}
=\max\Bigl\{q_L,\ q_H\Bigl(1-\frac{c}{q_H-q_L}\Bigr)\Bigr\},
$$
with pessimistic tie-breaking at x = x⋆.
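For a concrete, purely illustrative parameterization (with the cap x̄ large enough to be non-binding), take qL = 0.5, qH = 0.9, and c = 0.2. Then
$$
x^\star=\frac{c}{q_H-q_L}=\frac{0.2}{0.4}=0.5,
\qquad
V_{\mathrm{static}}=\max\{0.5,\ 0.9\times 0.5\}=0.5,
$$
so the static optimum in this instance is simply to pay nothing and accept the low action, while the per-round harvest payoff qH(1 − xlo) in the dynamic scheme described next can be close to 0.9.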
A dynamic policy can do better against a mean-based learner by separating time into two phases:
Phase I (invest): post a bonus xhi > x⋆ for T1 rounds. This makes H strictly better each period, so σT1 + 1(H) − σT1 + 1(L) grows linearly in T1. Under the mean-based condition (or under a softmax with small τ), after sufficiently many rounds the learner places almost all probability on H.
Phase II (harvest): drop the bonus to xlo < x⋆ for T2 rounds. Period-by-period, the low action is now better for the agent, so the score advantage of H shrinks by roughly
Δ := uA(pxlo, L) − uA(pxlo, H) = c − xlo(qH − qL) > 0
each round. However, if the principal chose T1 large enough, the
cumulative lead built in Phase I takes on the order of (σ(H) − σ(L))/Δ
rounds to dissipate. During this dissipation window, the mean-based
learner continues to play H
with high probability, even though xlo is too small to make
H optimal myopically.
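Under the mean-based condition (and ignoring the o(T) slack and the brief initial mixing), the length of this lock-in window can be estimated from the per-round score increments: the lead built in Phase I is roughly T1(xhi(qH − qL) − c), so the number of Phase II rounds before the learner abandons H is approximately
$$
\frac{\sigma_{T_1+1}(H)-\sigma_{T_1+1}(L)}{\Delta}
\;\approx\;
T_1\cdot\frac{x_{\mathrm{hi}}(q_H-q_L)-c}{\,c-x_{\mathrm{lo}}(q_H-q_L)\,}.
$$
This rough estimate makes explicit that the relative length of the harvest window is governed by how far xhi sits above the breakpoint compared with how far xlo sits below it.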
From the principal’s perspective, the payoff in Phase II is close
to
uP(pxlo, H) = qH − xloqH,
which can be much larger than the static pessimistic value if xlo is chosen small. The
principal ``pays’’ for this by over-incentivizing in Phase I, but the
key is that the investment cost is incurred for T1 rounds while the
harvest benefit can be collected for T2 ≫ T1
rounds, by choosing T1 just large enough to
create a score buffer of order T2.
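The following self-contained simulation sketch (hypothetical parameters; the learner is the softmax rule described earlier) isolates the lock-in mechanism: after the bonus is dropped, the learner keeps playing H for roughly the predicted number of rounds, during which the principal's per-round payoff is far above the static benchmark of this instance. Whether the full time-average also beats the benchmark depends on how the phases are arranged; the cascade constructions below are designed so that it does.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.5, 0.9])      # success probabilities q_L, q_H (illustrative)
c = np.array([0.0, 0.2])      # costs c_L, c_H; breakpoint x_star = 0.5
tau = 1.0                     # softmax temperature
T1, T2 = 2_000, 4_000         # invest phase and harvest phase lengths
x_hi, x_lo = 0.9, 0.05        # Phase I and Phase II success bonuses

sigma = np.zeros(2)           # cumulative expected utilities sigma_t(a)
locked_in = 0                 # Phase II rounds in which the learner still plays H
for t in range(T1 + T2):
    x = x_hi if t < T1 else x_lo
    z = sigma / tau
    probs = np.exp(z - z.max()); probs /= probs.sum()
    a = rng.choice(2, p=probs)              # softmax (mean-based) action choice
    if t >= T1 and a == 1:
        locked_in += 1
    sigma += q * x - c                      # full-feedback update: U_A(t, a) = q_a * x - c_a

predicted = T1 * (x_hi * (q[1] - q[0]) - c[1]) / (c[1] - x_lo * (q[1] - q[0]))
print(f"rounds locked on H after the bonus drop: {locked_in} (predicted ~ {predicted:.0f})")
print(f"per-round principal payoff while locked in: {q[1] * (1 - x_lo):.3f} vs V_static = 0.5")
\end{verbatim}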
The construction is deliberately stylized, but it captures a robust mechanism: external no-regret allows the agent to be asymptotically correct while remaining exploitable along the transient path. In particular, the principal can ensure that the agent's realized play spends a long fraction of time on an action that is not a best response to the current contract, without forcing the agent into external regret, because the learner's benchmark does not permit action-contingent remappings of its own behavior.
The two-action example yields a constant-factor improvement by making Phase II long relative to Phase I. Much larger gaps arise once we have a ladder of actions that differ slightly in incentives but substantially in principal value. The canonical ``free-fall cascade’’ construction uses n actions a ∈ {1, 2, …, n} with increasing principal rewards R1 < ⋯ < Rn and increasing costs c1 < ⋯ < cn, together with a contract family rich enough to create a sequence of adjacent breakpoints: for each k, there is a region of contracts where k is slightly preferable to k − 1 for the agent.
A convenient way to think about the principal's policy is as a controlled walk on the learner's score vector. In an ascent (investment) phase, we post a contract that makes action k + 1 strictly better than k by a small margin, long enough to build a cumulative lead for k + 1. Repeating this step-by-step moves probability mass up the ladder toward n. Once the learner is concentrated on high actions, we switch to a free-fall (harvest) phase in which the posted contract makes lower actions myopically preferable, but only slightly so. The learner then ``falls'' from n to n − 1 to ⋯ as cumulative leads are depleted, and each step of the fall takes many rounds because the score gaps are large.
The principal benefits because the high actions have large Ra while the harvest contracts keep expected payments low. Under bounded rewards and payments, one can choose parameters so that the time spent building each cumulative lead is short relative to the time spent harvesting it, and the principal's average payoff over the full cascade substantially exceeds the static pessimistic benchmark. In such families, the ratio V(𝔏mb)/Vstatic can grow with n; informally, larger action spaces permit longer cascades (more ``stored'' score slack) and therefore larger extraction windows.
This scaling perspective also clarifies comparative statics with respect to entropy. With a higher softmax temperature τ, the agent randomizes more and the cascade is blurred: the learner begins to mix into lower actions sooner, shortening the harvest window. Conversely, when τ is small (or when step-sizes place heavy weight on accumulated past advantages), the walk becomes sticky and the principal can implement long, predictable phases. This gives a crisp interpretation of why exploration can be protective: it reduces the principal’s ability to create long-lived score imbalances that keep the agent on a dominated label.
Transfers net out of total surplus, so the welfare effects of dynamic manipulation come entirely from the induced action path: welfare in round t under action at is W(at) = Rat − cat. Dynamic policies that keep the agent on high actions for long periods can therefore raise welfare relative to a static contract that settles on a lower action. But this same manipulation typically reduces the agent's long-run share of the surplus. Indeed, since
uP(pt, at) = W(at) − uA(pt, at),
the principal can increase profits either by increasing welfare W(at)
(more productive actions) or by decreasing the agent’s utility uA (rent
extraction), and the free-fall mechanism often does both: Phase I makes
the agent whole (or better) briefly, while Phase II compresses the
agent’s payoff as incentives are withdrawn but behavior adjusts
slowly.
This observation has two practical implications. First, when we interpret the agent as an organization (or an automated system) choosing effort/quality levels, the principal's ability to ``train'' the system with early subsidies and later reduce compensation resembles familiar concerns about hold-up and dynamic monopsony power. Second, from a design standpoint, these dynamics make the learning algorithm itself economically salient: two agents with the same objectives but different algorithms can generate very different long-run surplus splits under the same contracting institution.
We also emphasize a limitation. The constructions exploit long
horizons and relatively stable environments; if contracts, outcomes, or
costs drift exogenously, then the score-based inertia that enables
harvest can be diluted. Moreover, mean-based behavior is an assumption
about the agent’s internal update rule; it is not a claim about full
rationality. Our point is narrower: among widely used no-regret
procedures, there is a systematic gap between ``being hard to beat by a fixed action'' and ``being hard to steer by a strategic principal.'' The next section takes this seriously
by endogenizing the agent’s choice of sophistication: if dynamic
extraction is large enough, investing in a stronger (internal-regret)
defense becomes privately valuable, and equilibrium selection hinges on
the cost of that investment.
Thus far we have taken the learner class 𝔏 as exogenous. In many applications, however, the agent (or the agent's developer) has discretion over the learning rule: one can run a lightweight, myopic, mean-based update that is fast and easy to implement, or a more ``defensive'' procedure that controls internal regret but requires additional computation, memory, or engineering effort. This motivates a simple meta-game in which the learning algorithm is itself an endogenous choice, purchased at a cost.
We augment the repeated interaction with a stage 0 in which the agent selects an algorithm
from a menu. For expositional clarity, consider two options:
𝒜 ∈ {𝒜mb, 𝒜swap},
where 𝒜mb is drawn from a
mean-based externally no-regret class 𝔏mb, and 𝒜swap is drawn from an
internal-regret (no-swap-regret) class 𝔏swap. The agent incurs a
(possibly one-time) complexity cost κ(𝒜) ≥ 0, with κ(𝒜swap) ≥ κ(𝒜mb)
capturing the idea that stronger defensive learning is more
expensive.
After observing (or being able to infer) the algorithm choice, the principal commits to a policy π mapping histories to contracts, and the repeated interaction proceeds as before. We analyze the natural Stackelberg timing: the principal chooses an optimal policy anticipating the agent’s stage-0 choice, while the agent chooses an algorithm anticipating the principal’s best response.
Given a learner class 𝔏, recall the
principal’s robust long-run value
$$
V(\mathfrak L)
=\sup_{\pi}\inf_{\mathcal A\in\mathfrak
L}\liminf_{T\to\infty}\frac{1}{T}\mathbb E\Bigl[\sum_{t=1}^T
u_P(p_t,a_t)\Bigr].
$$
In particular, Propositions~1–2 identify two relevant quantities:
V(𝔏swap) = Vstatic, V(𝔏mb) ≥ Vstatic + ΔP,
for some environments where the principal’s dynamic advantage ΔP > 0 can be
constant-factor or even grow with n under bounded normalizations.
To close the stage-0 problem, we
must also track the agent’s induced long-run payoff. For a fixed
policy–algorithm pair (π, 𝒜),
define the agent’s long-run average utility
$$
\bar u_A(\pi,\mathcal A)
:=\liminf_{T\to\infty}\frac{1}{T}\mathbb E\Bigl[\sum_{t=1}^T
u_A(p_t,a_t)\Bigr].
$$
The agent chooses 𝒜 to maximize ūA(π⋆(𝒜), 𝒜) − κ(𝒜),
where π⋆(𝒜) denotes
the principal’s best response to the selected algorithm (or more
generally to the class it belongs to).
A useful accounting identity links the two players through per-round
welfare. Let W(a) := Ra − ca
denote total surplus under action a. Because transfers net out,
uP(pt, at) + uA(pt, at) = W(at).
Therefore, for any (π, 𝒜),
$$
\bar u_P(\pi,\mathcal A)\;+\;\bar u_A(\pi,\mathcal A)\;=\;\bar W(\pi,\mathcal A),
$$
where W̄(π, 𝒜) is the long-run average welfare induced by the action path. This identity makes explicit what the meta-game is really about: defensive learning matters
explicit what the meta-game is really about: defensive learning matters
to the agent to the extent it (i) prevents the principal from increasing
ūP at the
agent’s expense, and/or (ii) changes the induced welfare trajectory
W̄.
In the simplest and most relevant case, adopting a swap-regret defense forecloses the principal’s dynamic extraction channel without drastically changing efficient behavior. Then the dominant effect is distributional: relative to mean-based learning, swap-regret learning improves the agent’s long-run utility by approximately the amount of principal value that can no longer be extracted dynamically.
To express this cleanly, define the principal’s optimal values
against each class:
Vmb := V(𝔏mb), Vswap := V(𝔏swap) = Vstatic.
Define also the corresponding induced welfare levels under
principal-optimal policies,
$$
W_{\mathrm{mb}}
:=\liminf_{T\to\infty}\frac{1}{T}\mathbb E\Bigl[\sum_{t=1}^T
W(a_t)\Bigr]\ \ \text{under an optimal policy against }\mathfrak
L_{\mathrm{mb}},
$$
and similarly Wswap
under an optimal (static) policy against 𝔏swap. While Wmb and Wswap can differ in
general (dynamic steering may raise welfare by keeping the agent on high
actions longer), the decomposition implies that the agent’s long-run
payoff under principal-optimal play can be written as
Umb := Wmb − Vmb, Uswap := Wswap − Vswap.
Hence the agent prefers the defensive algorithm 𝒜swap whenever
\begin{equation}
U_{\mathrm{swap}}-\kappa(\mathcal A_{\mathrm{swap}})\ \ge\ U_{\mathrm{mb}}-\kappa(\mathcal A_{\mathrm{mb}}),
\end{equation}
or equivalently
\begin{equation}
V_{\mathrm{mb}}-V_{\mathrm{swap}}\ \ge\ \bigl(W_{\mathrm{mb}}-W_{\mathrm{swap}}\bigr)\;+\;\bigl(\kappa(\mathcal A_{\mathrm{swap}})-\kappa(\mathcal A_{\mathrm{mb}})\bigr).
\end{equation}
The left-hand side is the principal's dynamic extraction gain from facing a mean-based learner
rather than a swap-regret learner. The right-hand side adds two forces
that push in the opposite direction from the agent’s perspective: (i)
any welfare improvement created by dynamic steering, which the agent may
partially internalize, and (ii) the direct complexity cost of adopting
the defense. In environments where welfare changes are second-order—or
where the principal can reallocate essentially all welfare gains away
from the agent—the condition above reduces to a clean cutoff:
\begin{equation}
V_{\mathrm{mb}}-V_{\mathrm{static}}\ \ge\ \kappa(\mathcal A_{\mathrm{swap}})-\kappa(\mathcal A_{\mathrm{mb}}).
\end{equation}
This yields a concrete economic reading. Internal regret is a purchasable defense. The agent ``buys'' it if and only if the expected surplus protected (the dynamic gap the principal could otherwise appropriate) exceeds the defense's cost. Moreover, because Proposition~2 allows the gap Vmb − Vstatic to scale with problem size (e.g., with n in cascade constructions), the model predicts a stark nonlinearity: as environments become more complex, or as the principal gains more degrees of freedom in contracting, the private value of defensive learning can jump discontinuously from negligible to decisive.
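A minimal decision sketch implementing this cutoff logic (all quantities are hypothetical placeholders for the long-run values defined above; the function name is ours):
\begin{verbatim}
def prefers_swap_regret(V_mb, V_swap, W_mb, W_swap, kappa_mb, kappa_swap):
    # Stage-0 comparison: U_swap - kappa_swap >= U_mb - kappa_mb, with U = W - V
    # (the agent's long-run surplus share under principal-optimal play against each class).
    U_mb, U_swap = W_mb - V_mb, W_swap - V_swap
    return U_swap - kappa_swap >= U_mb - kappa_mb

# Equivalent form: V_mb - V_swap >= (W_mb - W_swap) + (kappa_swap - kappa_mb).
# Hypothetical numbers: a large dynamic extraction gap and a modest complexity premium.
print(prefers_swap_regret(V_mb=0.8, V_swap=0.5, W_mb=0.9, W_swap=0.85,
                          kappa_mb=0.0, kappa_swap=0.1))   # True: the defense is worth buying
\end{verbatim}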
From the principal’s perspective, the stage-0 meta-game endogenizes the relevant learner class. If the complexity premium κ(𝒜swap) − κ(𝒜mb) is high, the principal can rationally invest in dynamic policies tailored to mean-based inertia, expecting the agent to remain ``cheaply steerable.’’ Conversely, if defensive learning is inexpensive, then even a small dynamic advantage triggers algorithm upgrading, collapsing the principal’s achievable value back to Vstatic.
This logic suggests an additional comparative static that is absent in the exogenous-class analysis: strengthening the principal’s dynamic capabilities (by enriching 𝒫, increasing horizon stability, or improving state inference) can be self-defeating once agents can upgrade. The principal may therefore optimally commit to refraining from exploitative dynamic policies if such behavior would otherwise induce widespread adoption of defensive algorithms. In procurement language, this is a form of dynamic ``discipline’’: opportunistic contracting practices can shift the agent population toward more robust learning, reducing the principal’s future rents.
One can formalize this by allowing the principal to commit to a
restricted class of dynamic policies indexed by an observably certified constraint (e.g.,
``static-only contracts,’’ or ``contracts that are Lipschitz in
time’’), and then analyzing how these commitments change the agent’s
stage-0 incentives. In such variants, the principal faces an explicit
tradeoff between higher short-run extraction and the longer-run induced
shift in agent sophistication.
The stage-0 view connects directly to practical questions about the deployment of AI assistants and other automated agents. In many settings, a ``principal’’ (a firm, a platform, or an end user) interacts repeatedly with an assistant whose behavior adapts over time, while the principal controls rewards, pricing, or evaluation metrics. Our analysis suggests that verifiable learning guarantees, such as internal-regret bounds or stronger notions of stability, should be treated as economically meaningful attributes, akin to warranties or safety certifications.
Two implications stand out.
An agent (or vendor) can use certification of defensive learning to credibly commit that it will not be steered by transient incentives. In our model, such certification effectively moves the interaction from 𝔏mb to 𝔏swap, collapsing the principal’s dynamic advantage to the static benchmark. This is attractive to the agent because it protects long-run surplus, but it may also be attractive to the principal in environments where exploitation risks create reputational or regulatory costs. Importantly, certification is only valuable if it is verifiable; otherwise, the principal cannot condition its policy on the claimed algorithm class, and the meta-game unravels.
In practice, verifiability could take the form of audited training procedures, reproducible evaluation suites that test for internal-regret-like behavior under adversarial reward shaping, or cryptographic attestations of deployed code. Our analysis does not prescribe a particular mechanism, but it clarifies what must be certified: not merely that the agent ``performs well,’’ but that it satisfies a stability notion that blocks dynamic steering.
When a principal procures an AI assistant, it typically specifies performance metrics and payment terms. The model highlights that these terms implicitly shape the assistant’s learning dynamics: a payment scheme that is innocuous against a swap-regret learner may be highly distortive against a mean-based learner, and vice versa. Therefore procurement may need to include explicit learning-rule requirements (e.g., internal regret bounds, exploration floors, or update-rate caps) as part of the contract, much like specifying security standards.
A simple reading of the algorithm-choice cutoff is that procurement can shift the equilibrium by subsidizing sophistication: the principal (or a regulator) can reduce κ(𝒜swap) through tooling, shared infrastructure, or mandated defaults. Doing so may reduce the incidence of exploitative dynamics, albeit potentially at the cost of slower adaptation or higher compute. This frames a concrete policy tradeoff: lowering the cost of defensive learning improves robustness to manipulation but may reduce the feasibility of lightweight deployments.
We stress two limitations of the stylized stage-0 formulation. First, the agent’s choice set is richer than {𝒜mb, 𝒜swap}; real systems interpolate continuously between them (e.g., approximate internal regret, partial monitoring, bounded memory). A more realistic model would let the agent pick a parameter η governing an approximate swap-regret guarantee SwapReg(T) ≤ ηT, with a cost κ(η) that rises as the guarantee tightens, and then use continuity results (as in Proposition~3) to obtain a smooth version of the cutoff. Second, we have treated the agent as the sole decision-maker over its algorithm, whereas in many markets the algorithm is selected by a developer while the ``agent’’ experiencing payoffs is a downstream user. This wedge can generate underinvestment in defensive learning and thus amplify the principal’s dynamic power.
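A minimal sketch of the smoothed choice problem follows; both functions below are hypothetical placeholders (the residual extraction profile under slack η and the cost schedule κ(η)), not objects derived in the paper.

```python
import numpy as np

def residual_extraction(eta: float) -> float:
    # Hypothetical: surplus still extractable under SwapReg(T) <= eta * T.
    return 0.5 * eta

def kappa(eta: float) -> float:
    # Hypothetical: implementation cost, rising as the guarantee tightens.
    return 0.02 / (eta + 0.01)

etas = np.linspace(0.0, 1.0, 101)
losses = np.array([residual_extraction(e) + kappa(e) for e in etas])
eta_star = etas[np.argmin(losses)]
print(f"eta* ~ {eta_star:.2f}, total agent loss ~ {losses.min():.3f}")
```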
Despite these caveats, the core lesson is robust: when learning rules are endogenous, the principal’s ability to exploit behavioral inertia is not merely a property of the environment; it is an equilibrium outcome shaped by the relative costs of sophistication and the availability of credible guarantees. This motivates the extensions in the next section, where we ask how these conclusions change once the environment is contextual or stateful and the relevant defensive notions must be strengthened beyond external regret.
The baseline model treats each round as a fresh moral-hazard instance, with the principal choosing pt and the agent choosing at absent any persistent state or exogenous covariates. This abstraction is useful for isolating the economic role of learning guarantees, but many applications are explicitly contextual (the mapping from actions to outcomes depends on observable features), stateful (today’s action changes tomorrow’s opportunity set), or partially observed (the agent only observes bandit feedback). We briefly sketch how the main logic extends, and where genuinely new phenomena can arise.
Suppose that before contracting in round t a publicly observed context xt ∈ 𝒳 is
realized. The principal posts a context-contingent contract pt(⋅ ; xt) ∈ 𝒫(xt) ⊆ ℝ+m,
the agent chooses at ∈ A,
and then ot ∼ Fat(⋅ |xt).
Utilities remain
uA(pt(⋅; xt), at; xt) = 𝔼[pt(o; xt) ∣ xt, at] − cat, uP(pt(⋅; xt), at; xt) = Rat(xt) − 𝔼[pt(o; xt) ∣ xt, at],
where Ra(x) := 𝔼[r(o) ∣ x, a].
A natural benchmark is now a stationary mapping from contexts to contracts, p : 𝒳 → 𝒫(⋅), chosen once and applied
each round:
$$
V_{\mathrm{static}}^{\mathrm{ctx}}
:=\sup_{p(\cdot)}\ \liminf_{T\to\infty}\ \frac{1}{T}\sum_{t=1}^T\
\min_{a\in BR(p(\cdot;x_t);x_t)} u_P\bigl(p(\cdot;x_t),a;x_t\bigr),
$$
with BR(p(⋅; x); x)
defined in the obvious way. Under i.i.d. contexts xt ∼ 𝒟, this
reduces to maximizing an expectation over x ∼ 𝒟; under adversarial contexts it
becomes a worst-case time average.
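Under i.i.d. contexts and finite grids, this benchmark decomposes context by context and can be enumerated directly. The sketch below uses toy, hypothetical primitives (contexts, outcome distributions, costs, rewards, and a payment grid) and adversarial tie-breaking within the best-response set, mirroring the inner min in the definition above.

```python
import itertools
import numpy as np

# Toy, hypothetical primitives.
contexts = {"easy": 0.6, "hard": 0.4}                 # i.i.d. context distribution
actions, outcomes = [0, 1], [0, 1]                    # 0 = low effort, 1 = high effort
c = {0: 0.0, 1: 0.3}                                  # effort costs c_a
F = {"easy": {0: [0.7, 0.3], 1: [0.2, 0.8]},          # F_a(o | x)
     "hard": {0: [0.9, 0.1], 1: [0.5, 0.5]}}
R = {"easy": {0: 0.3, 1: 1.0}, "hard": {0: 0.1, 1: 1.0}}   # R_a(x)
payment_grid = np.linspace(0.0, 1.0, 21)              # limited-liability payments

def principal_value(x, p):
    """Principal value of contract p in context x, with adversarial tie-breaking
    inside the agent's best-response set."""
    uA = {a: float(np.dot(F[x][a], p)) - c[a] for a in actions}
    best = max(uA.values())
    br = [a for a in actions if uA[a] >= best - 1e-9]
    return min(R[x][a] - float(np.dot(F[x][a], p)) for a in br)

v_static_ctx = sum(
    w * max(principal_value(x, p)
            for p in itertools.product(payment_grid, repeat=len(outcomes)))
    for x, w in contexts.items())
print(f"V_static^ctx ~ {v_static_ctx:.3f}")
```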
The key question is whether dynamic, history-dependent policies can outperform this benchmark against sophisticated learners. The same mechanism behind Proposition~1 suggests a qualified ``no’’ once the agent controls internal regret in the stage game whose action set is still A but whose payoff depends on (xt, pt). Intuitively, if the agent’s learning rule enforces approximate obedience constraints conditional on the context, then the repeated play is pinned to a correlated-equilibrium-like set in each information slice, and the principal cannot systematically extract more than the best stationary mapping from contexts to contracts. What changes relative to the non-contextual case is not the logic but the object: static optimality becomes optimality over stationary context-to-contract mappings.
At the same time, contextuality expands the principal’s steering
tools against weaker learners. Even when 𝒫(x) is simple for each x, the principal can interleave
contexts so as to create persistent cumulative-utility gaps (the
analogue of ``free-fall’’) that are invisible to an external-regret criterion aggregated across heterogeneous rounds. This is one reason contextual bandit settings are a natural laboratory for manipulation: the principal can use ``easy’’
contexts to subsidize a costly action and ``hard’’ contexts to harvest
rents, while mean-based learners may not correctly normalize across
these regimes unless their guarantee is explicitly contextual.
A more substantive extension introduces an
unobserved-to-the-principal action state or a publicly observed system
state st ∈ 𝒮 with
Markovian dynamics. One convenient formulation is an MDP-like
interaction in which, after contract choice pt(⋅ ; st)
and action at, an outcome
ot is
realized and the next state satisfies
(st+1, ot) ∼ P(⋅, ⋅ ∣ st, at),
with r(ot)
and pt(ot; st)
paid as before. (The principal may also condition on st if it is
observed.) The agent’s strategic problem is now inherently dynamic: choosing
at trades
off current transfers against future state visitation.
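For concreteness, the following is a minimal interface sketch of this stateful interaction; the transition kernel, reward function, and cost schedule are hypothetical toy inputs rather than objects specified by the model.

```python
import random

class StatefulContracting:
    """Minimal sketch of the stateful interaction. P[s][a] is a list of
    ((next_state, outcome), probability) pairs; r maps outcomes to principal
    value; c maps actions to effort costs. All primitives are hypothetical."""

    def __init__(self, P, r, c, s0):
        self.P, self.r, self.c, self.s = P, r, c, s0

    def step(self, payment, action):
        # payment: dict outcome -> transfer, i.e. the posted contract p_t(.; s_t)
        pairs = self.P[self.s][action]
        (s_next, o) = random.choices([so for so, _ in pairs],
                                     weights=[w for _, w in pairs])[0]
        u_agent = payment[o] - self.c[action]
        u_principal = self.r[o] - payment[o]
        self.s = s_next
        return o, u_agent, u_principal

# Toy example: two states, two actions, two outcomes.
P = {0: {0: [((0, 0), 0.9), ((1, 1), 0.1)], 1: [((1, 1), 0.7), ((0, 0), 0.3)]},
     1: {0: [((0, 0), 0.5), ((1, 1), 0.5)], 1: [((1, 1), 0.9), ((0, 0), 0.1)]}}
env = StatefulContracting(P, r={0: 0.0, 1: 1.0}, c={0: 0.0, 1: 0.3}, s0=0)
print(env.step(payment={0: 0.0, 1: 0.5}, action=1))
```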
In such environments, standard external regret is often the wrong
yardstick because it compares the realized trajectory to a fixed action
that would have been played against the sequence of contracts, even
though changing actions would also change future states and thus future
contract opportunities. This is precisely where stronger notions such as policy regret
(or more generally counterfactual regret defined on policies) become
economically meaningful. One stylized target is
$$
\mathrm{PolReg}(T)
:=\max_{\pi_A\in\Pi}\ \mathbb E\Bigl[\sum_{t=1}^T
u_A\bigl(p_t(\cdot;s_t),a_t^{\pi_A};s_t\bigr)\Bigr]
-\mathbb E\Bigl[\sum_{t=1}^T
u_A\bigl(p_t(\cdot;s_t),a_t;s_t\bigr)\Bigr],
$$
where Π is a class of
stationary (or slowly varying) Markov policies for the agent and atπA ∼ πA(⋅ ∣ st)
is the counterfactual action under the state path induced by πA.
The economic content of Proposition~5 (stated earlier as optional) is that if the agent can guarantee PolReg(T) = o(T) against adaptive adversaries, then the principal’s dynamic advantage again collapses to a benchmark defined over stationary contract policies and stationary agent (Markov) policies. The underlying intuition mirrors the swap-regret case: policy-regret control prevents the principal from benefiting from nonstationarities that only exist because the learner is being ``walked’’ through transient states and transient beliefs. When the agent can compare its realized performance to coherent counterfactual policies that account for state evolution, the principal loses the ability to profit from such path dependence except through genuine improvements in long-run welfare.
We view this extension as conceptually important but technically delicate. Unlike the stage game with finite (p, a), MDP interactions require additional regularity—mixing conditions, bounded influence of contract perturbations on state occupancy, or restrictions on how finely the principal can condition pt on history—to avoid degenerate pathologies. The main lesson, however, is robust: in stateful settings, the relevant defensive notion must be strengthened from internal regret to policy regret (or an equivalent stability notion), and the same distribution-versus-efficiency decomposition reappears once we work at the right level of abstraction.
Our baseline discussion implicitly allows full-information feedback to the agent, in the sense that the agent can evaluate UA(t, a) for all a ∈ A given (pt, ot) and knowledge of Fa. In many contracting environments this is unrealistic: the agent may observe only realized payoff pt(ot) − cat and perhaps the contract pt, but not the counterfactual outcome distributions under actions it did not take. This pushes the agent into a bandit-like learning problem.
Partial feedback matters for two reasons.
First, it changes which guarantees are feasible. Achieving no-swap-regret under bandit feedback is substantially harder than achieving external no-regret, and often requires explicit exploration. If the agent is unwilling to explore (because exploration is costly in utility terms), then internal-regret-type protection may be infeasible, and the principal can exploit informational fragility even when the agent intends to be defensive.
Second, partial feedback can amplify dynamic manipulation. Mean-based dynamics rely on the accumulation of utilities σ̂t(a); a principal who can influence the variance and bias of these estimates (by shaping outcome risk through contract convexity, or by inducing rare but salient payments) may create persistent miscalibration. In this sense, informational constraints create a second manipulation channel beyond pure path dependence: even if the agent’s update rule is well-intentioned, the principal may profit by making some actions hard to evaluate and others easy to learn.
A practical implication is that ``certification of defensive learning’’ in bandit settings cannot merely assert an abstract regret bound; it must also specify the required information structure and exploration regime. For instance, an exploration floor (a lower bound on Pr [at = a]) may protect against certain exploitation patterns, but it also changes welfare by forcing inefficient sampling. This creates an explicit three-way tradeoff between (i) robustness to dynamic incentives, (ii) efficiency, and (iii) informational feasibility.
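Operationally, an exploration floor can be imposed by mixing the learner’s proposed distribution with the uniform distribution so that every action keeps at least a minimal probability; a minimal sketch (the helper name is ours):

```python
import numpy as np

def apply_exploration_floor(probs: np.ndarray, floor: float) -> np.ndarray:
    """Mix a proposed action distribution with the uniform distribution so that
    every action keeps probability at least `floor` (requires floor <= 1/n)."""
    n = len(probs)
    assert 0.0 <= floor <= 1.0 / n
    mix = floor * n                       # weight placed on the uniform component
    return (1.0 - mix) * probs + mix / n

print(apply_exploration_floor(np.array([0.98, 0.02]), floor=0.05))  # [0.932 0.068]
```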
Finally, we can stress-test dynamic contracting schemes by allowing the interaction to end at an exogenous stopping time. Concretely, let τ be a random horizon (possibly geometric, yielding discounting), observed only when it occurs. The principal’s objective becomes an expected average or discounted payoff, and the agent’s learning problem must perform well under an uncertain horizon.
Stopping-time uncertainty interacts sharply with the dynamic extraction mechanisms in Proposition~2. Many steering policies rely on a front-loaded subsidy phase followed by a harvesting phase. If the game might stop before harvesting, then the principal bears additional risk; conversely, if the agent discounts the future or anticipates termination, then costly actions become harder to sustain, and ``free-fall’’ may accelerate. This makes τ a simple robustness parameter: dynamic manipulation that requires long horizons is fragile to termination risk, while manipulation that operates quickly is more robust but may require larger transfers (and hence may be constrained by limited liability).
From the agent’s perspective, uncertain horizons also affect which defenses are valuable. Swap-regret control is asymptotic, but in stopping-time environments the relevant comparison is finite-sample: how quickly does the defense prevent exploitation relative to expected remaining time? This suggests that algorithm-choice cutoffs (like the one derived in the stage-0 meta-game) should depend not only on long-run value gaps but also on convergence rates and on the distribution of τ. Put differently, when relationships are short-lived, ``sophistication’’ may not pay even if it is valuable asymptotically; when relationships are persistent, the same investment becomes compelling.
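To illustrate how horizon risk and convergence rates enter such a cutoff, the following back-of-the-envelope sketch assumes a hypothetical per-round exploitation bound decaying like C/√t and a geometric horizon; none of these functional forms is derived in the paper.

```python
import numpy as np

def expected_protected_surplus(gap, C, stop_prob, t_max=10_000):
    """E[sum_{t <= tau} max(0, gap - C/sqrt(t))] under a geometric horizon tau.
    `gap` is the per-round surplus at stake; C/sqrt(t) is an assumed
    finite-sample exploitation bound for the defense."""
    t = np.arange(1, t_max + 1)
    survival = (1.0 - stop_prob) ** (t - 1)          # P(tau >= t)
    protected = np.maximum(0.0, gap - C / np.sqrt(t))
    return float(np.sum(survival * protected))

# Short-lived relationships protect far less surplus than persistent ones.
print(expected_protected_surplus(gap=0.3, C=2.0, stop_prob=0.1),
      expected_protected_surplus(gap=0.3, C=2.0, stop_prob=0.001))
```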
We view stopping-time variants as analytically useful even when one ultimately cares about long horizons. They force us to separate two distinct claims that can otherwise be conflated: (i) whether a defense eliminates dynamic advantage asymptotically, and (ii) whether it does so quickly enough to matter over the relationship’s likely lifetime. In applications such as platforms iterating rapidly on pricing rules or reward metrics, the second question is often the binding one.
Taken together, these extensions reinforce the core message while sharpening it. Dynamic exploitation is not merely a feature of repeated moral hazard; it is a feature of repeated moral hazard played against imperfectly defended learning agents. Once we move to contextual, stateful, or partially observed environments, the appropriate benchmark and the appropriate defense both change, but the same economic logic continues to organize the analysis: stronger stability notions collapse the principal’s dynamic advantage toward a static (or stationary) benchmark, while weaker notions leave transient dynamics available for extraction.
Our results have a simple organizing message for applied contracting:
whether dynamic incentive schemes are ``value creating’’ or ``extractive’’ depends less on the
sophistication of the principal and more on the stability guarantees embedded in the agent’s
adaptation rule and the information structure of the resulting interaction. When the agent
effectively enforces internal-consistency constraints (swap regret,
policy regret, or related notions), dynamic contracting largely
collapses to a static or stationary benchmark; when the agent only
guarantees weaker external-regret or mean-based properties, a principal
can sometimes profit from steering the agent through transient regions
of behavior that would not arise under a fully ``defended’’ response. In
this section we translate that logic into implications for (i) platform
and policy design, (ii) the design of agent-side learning systems, and
(iii) a set of open questions that we view as central for making the
theory operational.
A practical obstacle to deploying ``defensive learning’’ is not conceptual but verifiability: many stability properties are defined counterfactually (e.g., deviations that remap actions, or policy comparisons under counterfactual state paths). This makes it hard for regulators, users, or third-party auditors to certify that an agent is protected against dynamic exploitation. We therefore view verifiability of learning guarantees as a first-class design primitive, alongside limited liability and informational constraints.
Two complementary approaches are plausible.
In repeated contracting on platforms, the principal (platform) can often log the full sequence {(pt, ot)}t ≤ T, while the agent may log additional internal signals. An audit can then check whether observed play is consistent with a class of ``stable’’ agents, in the sense that there exists a sequence of mixed actions consistent with approximate obedience constraints. In finite environments, these constraints are linear in the empirical distribution of (pt, at) and can be expressed in the same spirit as approximate correlated equilibrium. While such audits do not recover at when actions are hidden, they can still be informative in settings where outcomes are sufficiently diagnostic of actions, or where the principal is required to report additional statistics that allow inference. The policy implication is immediate: transparency requirements that force richer reporting (e.g., calibrated outcome summaries by contract) can make manipulation harder by enabling sharper audits.
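When action logs (or reliable proxies inferred from diagnostic outcomes) are available to the auditor, the obedience check reduces to verifying that no fixed remapping a → a′ would have raised the agent’s average payoff by more than a tolerance. A minimal sketch, with a hypothetical utility oracle u_agent(p, a) supplied by the auditor:

```python
from collections import defaultdict

def max_swap_gain(log, u_agent, actions):
    """Largest average per-round gain from any fixed remapping a -> a' over
    logged play; observed behavior is consistent with eps-swap-regret iff the
    returned value is at most eps.

    log:     iterable of (contract, action) pairs available to the auditor
    u_agent: oracle or estimate of the agent's expected utility u_A(contract, a)
    """
    T = 0
    gains = defaultdict(float)
    for p, a in log:
        T += 1
        for a_prime in actions:
            gains[(a, a_prime)] += u_agent(p, a_prime) - u_agent(p, a)
    return max(gains.values()) / T if T else 0.0
```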
Many defenses are feasible only if the agent can compute counterfactual utilities UA(t, a), or at least form unbiased estimates of them. This suggests that ``transparency’’ should not be understood as revealing the principal’s internal objectives, but rather as revealing the primitives needed for stable learning: the mapping from outcomes to payments, the relevant outcome taxonomy, and (when feasible) enough statistical structure to estimate how alternative actions would have performed. In other words, the transparency target is the information required for the agent to implement a stronger regret notion, not the information required for the principal to optimize extraction.
A concrete policy lever is to mandate standardized reporting of outcome models (or validated simulators) in domains such as ad auctions, gig platforms, and content moderation, where the agent’s action space is large and the consequences of actions are noisy. Such requirements are not costless—they may reveal proprietary information or enable gaming by third parties—but our framework clarifies the tradeoff: restricting counterfactual evaluation pushes agents toward weaker, more exploitable learning rules, enlarging the principal’s dynamic advantage set.
A platform designer often plays a dual role: it is a principal vis-à-vis participants, but it may also be a social planner subject to constraints (fairness, contestability, consumer protection). The theory suggests three actionable guidelines.
When participants plausibly implement sophisticated defenses (swap-regret-like learning, or algorithmically chosen defenses with low complexity cost), dynamic schemes tend not to improve on the best static benchmark in a robust sense. In such environments, frequent contract changes are more likely to increase variance, compliance costs, and perceived unfairness than to increase long-run efficiency. A policy of ``metric stability’’—limiting the rate at which scoring rules or payment weights can change—can therefore be justified not only on administrative grounds but also on incentive grounds: it removes the very degrees of freedom that enable extraction against weaker learners, without sacrificing much against stronger ones.
There are domains (fraud detection, content ranking, safety) where nonstationarity is intrinsic. In those cases, the platform can still reduce manipulation risk by constraining contract updates to be interpretable and monotone in the sense that participants can understand how actions map into outcomes and payments over time. Our model highlights why: mean-based or entropy-regularized agents can be ``walked’’ through transient phases when incentives move in ways that are hard to compare across time. Constraining updates makes it easier for agents to normalize utilities across regimes and makes it easier for auditors to detect systematic steering.
A recurring theme in dynamic extraction constructions is the use of front-loaded subsidies followed by harvesting. Limited liability bounds the available subsidy and therefore limits the speed and magnitude of steering. From a design standpoint, payment caps and escrow requirements can thus play a protective role for participants, beyond preventing insolvency: they restrict the principal’s ability to create large cumulative-utility gaps that weaker learners cannot immediately undo. Put differently, limited liability can substitute (imperfectly) for sophisticated defenses when those defenses are infeasible.
From the agent side, the main design choice is not merely which learning algorithm minimizes regret, but which guarantees an algorithm continues to provide under an adaptive principal. This reframes familiar engineering decisions—exploration schedules, regularization, and feedback assumptions—as economic design choices.
If the interaction is well approximated by a repeated stage game with payoffs that can be evaluated (or consistently estimated), then internal-regret minimization is a natural defense: it limits the principal’s ability to benefit from history dependence beyond the static benchmark. Practically, this points toward algorithms that control swap regret or implement calibrated learning dynamics, even if they are computationally heavier than standard no-regret methods. The meta-game logic (algorithm-choice thresholds) suggests that such upgrades should be targeted to environments where the expected exploitation gap is large relative to computational and implementation costs.
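As one concrete (if standard) construction, swap regret can be controlled by running one external-regret learner per action and playing the stationary distribution of their combined proposals, in the spirit of the Blum–Mansour reduction. The sketch below is a compact full-information version offered for illustration, not a prescription from the paper.

```python
import numpy as np

class SwapRegretLearner:
    """Compact full-information sketch of the external-to-swap-regret reduction:
    one multiplicative-weights learner per action, combined through the
    stationary distribution of their proposals."""

    def __init__(self, n_actions: int, lr: float = 0.1):
        self.n, self.lr = n_actions, lr
        self.weights = np.ones((n_actions, n_actions))   # row i: learner "for" action i

    def act(self) -> np.ndarray:
        Q = self.weights / self.weights.sum(axis=1, keepdims=True)  # row-stochastic
        p = np.ones(self.n) / self.n
        for _ in range(200):                              # power iteration: p = pQ
            p = p @ Q
        self._p = p / p.sum()
        return self._p

    def update(self, payoffs: np.ndarray) -> None:
        # Payoffs in [0, 1] for every action; learner i is credited with the
        # payoff vector scaled by the probability p_i it was "responsible" for.
        for i in range(self.n):
            self.weights[i] *= np.exp(self.lr * self._p[i] * payoffs)

learner = SwapRegretLearner(3)
dist = learner.act()                       # distribution over the 3 actions
learner.update(np.array([0.2, 0.9, 0.1]))  # announced utilities for this round
```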
In bandit-like environments, an agent that cannot estimate counterfactual payoffs is structurally vulnerable. Agent designers can sometimes mitigate this by investing in instrumentation: logging richer features, building causal models that predict outcomes under alternative actions, or negotiating for additional signals from the platform. Even imperfect counterfactual models can help if they reduce bias and variance enough to approximate the obedience constraints that internal-regret learning requires.
Standard offline evaluation of learning systems often tests performance against fixed environments. Our results suggest adding adversarially adaptive principals (or platform simulators that update pt based on observed behavior) as a stress test. A simple version is to measure the worst-case gap between realized payoff and a static best response under the empirical distribution of contracts. More ambitious versions attempt to certify approximate swap-regret or policy-regret bounds under a family of plausible principal policies.
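A minimal version of such a stress test, assuming logged contracts from the simulated adaptive principal, logged realized payoffs, and a model or estimate u_agent(p, a) of counterfactual utilities (all names ours):

```python
import numpy as np

def static_best_response_gap(contracts, realized_payoffs, u_agent, actions):
    """Gap between the best fixed action under the empirical distribution of
    posted contracts and the deployed agent's realized average payoff.

    contracts:        contracts posted by the (simulated) adaptive principal
    realized_payoffs: the deployed agent's realized per-round utilities
    u_agent(p, a):    model/estimate of expected agent utility under (p, a)
    """
    T = len(contracts)
    best_fixed = max(sum(u_agent(p, a) for p in contracts) / T for a in actions)
    return best_fixed - float(np.mean(realized_payoffs))
```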
Even when asymptotic guarantees are strong, finite-horizon behavior can be exploitable. Agent designers should therefore treat convergence speed as economically meaningful: if a defense requires long horizons to ``kick in,’’ it may fail in relationships with churn, termination risk, or regime changes. This suggests using anytime algorithms, explicit learning-rate schedules tied to horizon estimates, and conservative initialization procedures that reduce the profitability of front-loaded steering.
Regulatory debates around platforms frequently emphasize transparency, but the relevant notion is subtle. Revealing more about the platform’s optimization may or may not help participants. By contrast, revealing (or standardizing) the payment mapping and the outcome taxonomy directly affects the feasibility of defensive learning.
A regulatory framework aligned with our model would focus on three objects: the mapping from outcomes to payments, the feedback available to agents for counterfactual evaluation, and the verifiability of the learning guarantees agents claim to satisfy.
Notably, these recommendations do not require regulating the principal’s objective r(⋅) directly. They instead regulate the interface conditions that determine whether the principal can profit from adaptivity against weaker learners.
We conclude with open questions that, in our view, determine how far this line of work can be pushed in theory and practice.
Mean-based and entropy-regularized learners are stylized. Real systems mix heuristics, constraints, and partial observability. A central challenge is to define learner classes 𝔏 that are both behaviorally realistic and analytically tractable, and then to characterize V(𝔏) beyond existence of gaps. Even in finite settings, the principal’s optimal policy may resemble an optimal control problem over the agent’s score vector σt(⋅); turning this into usable prescriptions remains largely open.
Bandit feedback makes strong defenses expensive because exploration reduces utility. How should an agent optimally trade off immediate payoff against robustness to dynamic extraction? Conversely, how should a platform be allowed to shape information disclosure without creating perverse incentives to obscure feedback? We expect the right model to treat exploration as a choice variable with an explicit economic cost, yielding equilibrium predictions for when ``defensive exploration’’ emerges.
Platforms typically contract with many agents simultaneously. Competition can either discipline the principal (by making extraction harder) or intensify it (by creating relative-performance schemes that magnify steering). Extending robust-value notions to many agents raises new questions: do correlated-equilibrium-like constraints across agents limit dynamic extraction, or can a principal use cross-agent coupling to reintroduce exploitation even when each agent individually controls internal regret?
In practice, platforms can commit imperfectly to update rules (e.g., published policies, versioned APIs, or verifiable smart contracts). Formalizing partial commitment could sharpen the policy relevance of ``static benchmark’’ results: if a platform can credibly commit to a stationary mapping within a version, then participants can rationally invest in defenses tailored to that version. The design of verifiable commitment mechanisms that preserve flexibility while limiting opportunistic adaptivity is an open design problem.
Our analysis emphasizes the principal’s robust payoff. In many domains the policy question is distributive: how much surplus is shifted from agents to platforms by dynamic manipulation, and how does this interact with fairness constraints? Developing welfare decompositions that remain valid under learning dynamics (especially with state and context) would connect the theory more directly to antitrust and labor-market policy.
The broader lesson is that ``learning’’ is not merely a friction that disappears in the limit; it is a strategic interface that can be engineered on both sides. Platforms and regulators can influence which guarantees are feasible through transparency and feedback design, while agent designers can choose defenses that convert repeated interaction from a manipulable process into one that approximates a static contracting problem. Our model does not claim that dynamic incentives are always harmful or always beneficial; rather, it illuminates when dynamic adaptation is likely to improve efficiency and when it is likely to function primarily as an extraction technology.