Many contracting environments of current interest are neither one-shot nor truly long-lived. Digital labor platforms, creator marketplaces, and enterprise procurement systems all exhibit substantial : principals (platforms or buyers) and agents (workers, sellers, service providers) match, interact repeatedly for some time, and then separate for reasons that are often orthogonal to performance—demand shocks, reallocations of attention, policy changes, or idiosyncratic departure. Similar dynamics arise in AI-mediated work. A user may repeatedly delegate tasks to an AI tool, a firm may route requests through an AI proxy, or a platform may deploy automated agents that respond to incentives embedded in APIs and billing rules; in each case the ``relationship length’’ is uncertain because the task stream ends, the user switches tools, or policies change. These settings motivate a repeated principal–agent model in which the interaction ends at a random time and where the agent is not assumed to solve a full dynamic program, but instead updates behavior using a learning algorithm.
Our starting point is the observation that changes the strategic landscape of dynamic contracting in a qualitatively different way than classical discounting or finite-horizon rationality. When an agent best-responds myopically to posted contracts, the principal can optimize within familiar incentive-compatibility constraints. When the agent instead runs a no-regret procedure, the constraints the principal faces are not simply static IC constraints period by period; rather, they arise from the algorithm’s guarantee that actions with persistently lower empirical utility are played rarely. In the ``mean-based’’ class of no-regret learners, this guarantee depends on payoffs. As a consequence, a principal can sometimes profit by shaping the agent’s empirical utility landscape over time: early payments can steer the learner toward a high-cost, high-output action, after which the principal can reduce payments and still retain the desired behavior for a while because the learner’s running averages adjust slowly.
This path dependence under mean-based learning is the engine behind dynamic advantages identified in recent work on algorithmic agents in repeated contracting. In deterministic-horizon models, one can formalize this advantage via continuous-time trajectories that track the evolution of the and the induced best responses. The principal can implement ``free-fall’’ policies: offer a relatively generous linear contract for an initial phase to push the learner toward a desirable action, then abruptly cut incentives (potentially to zero) and harvest profits while the learner continues to play the previously reinforced action until the empirical averages cross critical thresholds. These effects are not artifacts of sophisticated forward-looking reasoning by the agent; they arise precisely because the learner is forward looking, but is instead governed by a regret guarantee that is inherently backward looking.
Yet the practical relevance of such dynamic manipulation hinges on a basic question: In platforms and AI tool deployments, the answer is: not always long. A principal may not know whether the relationship will last for 10 rounds or 106 rounds; and crucially, the termination event is typically exogenous. This motivates the main modeling departure in this paper: we treat the interaction length as a random stopping time S, independent of the realized outcomes and history. Rather than optimizing a fixed-horizon sum, the principal maximizes expected stopped profit, equivalently a survival-weighted integral in continuous time with survival function F̄(t) = Pr [S ≥ t] and hazard $h(t)=-\frac{d}{dt}\log \bar F(t)$.
Introducing stochastic termination does more than add a technical discount factor. It forces a different economic tradeoff. Dynamic contracts against mean-based learners can be viewed as ``front-loading’’ incentives to move the learner’s state (its empirical averages) into a region that yields high principal surplus later. Random termination penalizes such strategies precisely because it truncates the future. When the hazard is high, the principal is unlikely to enjoy the late-stage harvest phase; when the hazard is low and the survival distribution has heavy tails, dynamic steering becomes more valuable. Thus churn acts as a natural force that can dynamic exploitation of learning dynamics.
We develop this intuition through a survival-weighted analogue of the trajectory formulation for mean-based learning dynamics. A key simplification is that exogenous stopping changes , not the feasibility constraints induced by learning: the agent’s best-response-to-average condition still governs which action sequences can arise as the average contract evolves. Consequently, we can express the principal’s problem as a continuous-time control problem over valid trajectories π, but with payoffs weighted by F̄(t). This reduction plays two roles. First, it provides a clean characterization of the best achievable payoff UF̄⋆ against worst-case mean-based learning under a given survival curve. Second, it lets us import structural insights from the deterministic-horizon setting—especially for linear contracts—while making transparent how hazard reshapes what is achievable.
Our main qualitative message is a of dynamic advantage. In linear-contract environments, the deterministic-horizon analysis can be summarized by a finite set of in the linear share parameter α at which the agent’s preferred action changes. The principal’s ability to obtain profits above the best static contract is tied to how far, and how quickly, the principal can move the historical average ᾱ(t) across these breakpoints. This can be captured by a piecewise-linear potential function ψ(α) whose total height Ψ measures the ``budget’’ of manipulation available in the instance. Under stochastic termination, this budget does not disappear, but its value depends on it is spent: spending potential late is less valuable because the relationship is less likely to survive. The hazard therefore translates into an upper bound on the multiplicative improvement over the static benchmark that decreases as churn increases. In high-hazard environments, the principal cannot credibly amortize an initial subsidy over a long harvest period, and dynamic contracts become nearly indistinguishable (in value) from the optimal static contract.
At the same time, we emphasize that churn does not automatically eliminate all dynamic effects. When the survival distribution has sufficient mass on long durations, dynamic policies can still outperform static ones, and one can design policies that are to the hazard. A deterministic ``switch at time τ’’ free-fall policy is brittle: if the relationship ends just before the harvest phase, the principal pays the cost without reaping the benefit. Our constructive results therefore advocate randomized phase policies that spread the switch time across the survival curve. Economically, these policies hedge termination risk by mixing over trajectories with different recoup horizons; algorithmically, they form a one-dimensional family that is tractable to optimize once F̄ is known (and can be numerically optimized in more general hazard models). In the important special case of constant hazard (exponential survival), the survival-weighted objective yields closed-form expressions for the value of a free-fall policy, enabling simple search over an initial incentive level and a switch-time parameter.
This perspective helps connect the theory to practice. In settings
like gig work or API marketplaces, a platform frequently faces uncertain
user lifetimes and must decide whether to offer
boosts,'' bonuses, or temporarily generous terms to induce higher effort or quality. Our results suggest that the profitability of such front-loaded incentives depends sharply on retention: where hazard is high, simple static terms can be near-optimal even when sophisticated dynamic schemes exist in a deterministic-horizon abstraction; where hazard is low, hazard-matched dynamic incentives can be justified and can be tuned using an explicit survival curve. In AI-mediated work, where theagent’’
may literally be a learning system, the relevance is twofold: (i)
dynamic contracts can unintentionally exploit learning dynamics,
producing transient over-performance followed by collapse when
incentives change; (ii) conversely, a principal may deliberately use
phased incentives to rapidly train or steer a deployed model, but churn
(task cessation, tool switching, policy changes) limits the attainable
gains.
We also view stochastic termination as a disciplined way to discuss . Deterministic-horizon dynamic advantages often rely on long recoup periods. In real deployments, relationship length is a moving target: it varies across users, is sensitive to macro conditions, and can be disrupted by exogenous events. Modeling S explicitly allows us to ask how sensitive dynamic contracting benefits are to misspecification of F̄, and it suggests a natural operational statistic—an ``effective hazard’’—that governs whether dynamic schemes are worthwhile. This framing also clarifies an important limitation: our analysis assumes termination is exogenous and independent of the principal’s actions. In many platforms, incentives can affect retention (e.g., an agent stays longer if paid more). Endogenizing churn would introduce an additional channel and could either amplify or dampen dynamic effects. We treat exogenous stopping as a first step that isolates the interaction between learning dynamics and time uncertainty.
Finally, we stress that our model does not claim that real agents are literally mean-based learners, nor that principals can commit to arbitrary dynamic policies without frictions. Rather, the model illuminates a tradeoff: dynamic contracts can leverage the inertia of no-regret learning, but doing so typically requires an upfront investment whose payoff is back-loaded; stochastic termination downweights the back end and thus compresses the scope for gains. This provides a unifying explanation for when algorithmic-contracting pathologies should be expected to matter, and when they should wash out in the face of churn. The remainder of the paper formalizes this logic, derives hazard-dependent bounds, and constructs simple hazard-matched policies that attain these bounds up to instance-dependent constants.
Our setting combines three classical themes—hidden-action contracting, repeated interaction, and learning dynamics—with an explicit representation of churn via an exogenous stopping time. Each ingredient has a substantial literature, and our contribution is best understood as importing a particular viewpoint from recent work on contracting against no-regret learners into a stochastic-termination environment, where the key object is the survival curve (equivalently, the hazard).
The foundational principal–agent literature studies moral hazard under hidden action and stochastic outcomes, typically under one-shot or discounted infinite-horizon formulations (see, e.g., ). In repeated interactions, the theory of emphasizes how dynamic incentives can be sustained by continuation values when parties are long-lived and can condition on histories . That literature features fully rational agents and incentives enforced by equilibrium threats rather than by a learning rule. Our environment differs in two ways. First, the principal here is constrained to simple payment schemes (linear or, more generally, p-scaled contracts), motivated by practical platform rules and the tractability of the breakpoint structure in such classes. Second, the agent is not assumed to compute an equilibrium of the repeated game; instead, behavior is governed by a no-regret algorithm whose guarantees are . This shift changes what ``dynamic incentives’’ mean: rather than manipulating continuation utilities through equilibrium punishments, the principal shapes the agent’s payoff landscape so that certain actions remain attractive relative to the agent’s running averages.
At a high level, one can view our results as complementary to relational contracting: when incentives are implemented via learning dynamics rather than equilibrium reasoning, a principal may obtain short- to medium-run benefits from in the learner’s state (running averages). However, these benefits must be amortized over time, which is precisely where stochastic termination becomes economically first-order.
A growing literature studies economic design problems when one side is algorithmic or boundedly rational, including learning buyers/sellers in auctions, pricing against learning demand, and platform design for learning participants . In many of these models, the designer optimizes against a learning rule rather than a fully strategic opponent, and dynamic policies can exploit predictable features of learning algorithms. Closest in spirit are models where the principal controls an incentive signal (price, wage, ranking weight) and induces a learner to adapt, creating a feedback loop that resembles a control problem.
Within contracting, recent work (including our source) formalizes how a principal can exploit mean-based no-regret learning to obtain dynamic advantages even with very restricted contract classes. The key technical idea is that, for mean-based learners, feasibility of action sequences can be characterized by how the agent best-responds to the contract. This leads to a continuous-time limit in which the state variable is an average contract (or average linear share), and the principal’s policy corresponds to a trajectory through a polyhedral partition of contract space. Our paper builds directly on that perspective, but changes the evaluation criterion: instead of a deterministic horizon (or a worst-case time criterion), we weight payoffs by survival probabilities induced by exogenous churn.
From the online learning side, our agent model belongs to the broad
family of no-regret dynamics . A key conceptual point is that no-regret
is a guarantee; it does not generally preclude substantial transient
behavior. This gap between asymptotic optimality and finite-time path
dependence underlies a number of
regret manipulation'' andlearning in games’’ phenomena .
The mean-based condition used in the source (and here) is deliberately
permissive: it captures many standard algorithms and provides a clean
sufficient condition for the trajectory reduction, while still allowing
the principal to steer behavior through the evolution of empirical
utilities. By contrast, stronger learning notions (e.g., swap regret or
internal regret) typically eliminate some path-dependent exploitation
channels; this mirrors known results that richer deviation constraints
drive play toward correlated equilibrium sets and can reduce the
designer’s ability to ``fool’’ the learner . We highlight this
comparison because it clarifies the role of the learning assumption:
dynamic contracting gains in our setting arise not from sophisticated
intertemporal commitment, but from the coarse way in which mean-based
learning aggregates the past.
Uncertain interaction length has been modeled in several equivalent ways across fields. In economics, geometric discounting and random termination are closely related (a constant hazard corresponds to exponential survival and can be reinterpreted as discounting), and both serve as parsimonious reduced forms for impatience, turnover, or limited commitment . In online learning, random stopping times arise in analyses that require algorithmic guarantees uniform over time, or in settings where the evaluation is a random prefix of play . The source paper studies deterministic-horizon performance and also develops unknown-horizon results by mixing over horizon-dependent policies (or by designing policies that are robust to the realized horizon). Our model takes a different route: rather than treating the horizon as adversarially unknown, we assume a known exogenous survival curve and incorporate it directly into the objective via survival weighting.
This change is not merely cosmetic. A worst-case or minimax
unknown-horizon criterion pushes toward policies that perform reasonably
at all times, while survival-weighting permits deliberately policies
when the tail of the survival distribution is heavy, and penalizes them
when hazard is high. Put differently, the survival curve provides a
disciplined way to interpolate between
must do well immediately'' andcan invest for later,’’ and
it allows us to state comparative statics in terms of hazard. This also
yields a clearer connection to practice in platform settings where churn
is measurable and can be estimated from retention data.
Churn is central in empirical and theoretical work on two-sided
platforms, labor marketplaces, and subscription businesses, where
retention determines the profitability of front-loaded subsidies,
bonuses, or promotions . Our model is intentionally stylized relative to
those environments: we take termination as exogenous and independent of
the contracting path, whereas in many applications incentives affect
retention and selection. Nonetheless, explicitly modeling a stopping
time is useful even as a first pass because it isolates a basic
mechanism: any dynamic incentive scheme that resembles an
investment'' followed by aharvest’’ phase becomes less
attractive as churn increases. In this sense, our hazard-sensitive
bounds provide a theoretical analogue of a common operational heuristic
in platforms: incentives that require long payback periods are hard to
justify when user lifetime is short or volatile.
Finally, our restriction to linear (or p-scaled) contracts connects to a large body of work on simple mechanisms and robust contracting . Linear sharing rules are canonical in environments with risk neutrality and limited observability, and more broadly they provide a tractable design space when the outcome is multi-dimensional but can be summarized by a scalar reward ro. In the deterministic-horizon mean-based learning model, linearity is also what produces the breakpoint structure in the share parameter α, enabling potential-function arguments and explicit ``free-fall’’ trajectories. We keep this structure because it lets us ask a crisp question: Our answer is that the same geometric objects (breakpoints, potentials, action regions) govern feasibility, but the survival curve reweights when traversing those objects is valuable.
Relative to the contracting and learning literatures, our main conceptual move is to treat churn as a primitive that alters the objective but not the learning-induced feasibility constraints. This yields a survival-weighted analogue of the trajectory formulation from the source, and it supports two complementary messages: (i) an upper bound showing that the multiplicative advantage of dynamic contracts over the best static contract decreases with an effective hazard; and (ii) constructive phase-based policies that are tuned to the survival curve and recover much of the attainable advantage when the tail is sufficiently heavy. By placing these statements in a stopping-time model, we aim to bridge the gap between deterministic-horizon abstractions of dynamic manipulation and the operational reality that many principal–agent relationships end abruptly and for reasons unrelated to performance.
We study a repeated hidden-action principal–agent interaction with stochastic outcomes and an exogenous random relationship length. The principal (designer) chooses a payment rule each period; the agent (worker, proxy, or algorithm) chooses an unobserved costly action; an outcome is realized and publicly observed; and then the relationship may end for reasons independent of play (``churn’’). Our goal is to understand how much value a principal can extract from contracting when the agent adapts via a permissive no-regret learning rule, and how that value changes with the survival (hazard) profile of the relationship.
There is a finite action set [n] = {1, …, n}. Action
i has (known) cost ci, with c1 = 0 and ci weakly
increasing in i. Outcomes lie
in a finite set [m] = {1, …, m}. If the
agent takes action i, the
realized outcome o ∈ [m] is drawn from a
known distribution Fi. The
principal’s gross value from outcome o is a known number ro, with r1 = 0 and ro weakly
increasing in o. We write the
principal’s expected gross value under action i as
Ri := 𝔼o ∼ Fi[ro].
Throughout, both parties know (ci)i ∈ [n],
(Fi)i ∈ [n],
and (ro)o ∈ [m].
We abstract from risk aversion by taking both sides to be risk-neutral;
this keeps attention on the dynamic incentive effects generated by
learning and by churn rather than on insurance considerations.
Time is discrete, indexed by t = 1, 2, …. A (possibly history-dependent) contract in period t is a nonnegative payment vector $p_t\in\mathbb{R}^m_{\ge 0$, where pt, o denotes the transfer from principal to agent if outcome o occurs in period t. We impose nonnegativity to reflect limited liability and the practical reality that many platforms cannot levy negative transfers.
The sequence of events within a period is:The principal never observes at directly and can condition future contracts only on publicly observed history (past outcomes and past posted contracts). The agent observes the realized outcome and payment and, of course, internal information about its own action and cost.
Given contract p and action
i, we define per-round
expected utilities
uP(p, i) := Ri − 𝔼o ∼ Fi[po], uA(p, i) := 𝔼o ∼ Fi[po] − ci.
We allow the principal to choose contracts adaptively. This is the
economically relevant class in applications (bonuses, multipliers, and
promotions often respond to realized performance), and it is also the
appropriate benchmark when the agent’s learning rule is treated as part
of the environment rather than as an equilibrium object.
A central restriction in our analysis is that the principal uses a
family of contracts. The leading case is sharing: the principal selects
αt ∈ [0, 1] and
sets
pt, o = αtro for
each o ∈ [m].
Under linear contracts, the per-round utilities simplify to
uP(α, i) = (1 − α)Ri, uA(α, i) = αRi − ci,
so the contract affects only the division of expected surplus, not the
mapping from actions to outcomes. This restriction is motivated by two
considerations. First, linear or proportional rules are common in
practice (revenue shares, commission rates, performance multipliers).
Second, and crucial for our results, linear contracts induce an
analytically tractable structure in α that governs which action is
optimal for the agent.
We also allow a mild generalization, which we refer to as contracts: fix a baseline nonnegative vector p̄ ∈ ℝ ≥ 0m and restrict the principal to contracts of the form p = αp̄ with α ∈ [0, 1]. Linear contracts correspond to the choice p̄ = r. In the main text we present statements for linear contracts for clarity; the p-scaled extension typically requires only notational changes (replacing Ri by the expected baseline payment under Fi where appropriate).
If the agent were myopic and fully optimizing in each period, then
given a contract p it would
choose an action in the best-response correspondence
BR(p) := arg maxi ∈ [n]uA(p, i).
However, the core friction in our setting is that the agent is rather
than solving a full intertemporal optimization problem. We model the
agent as running an arbitrary mean-based no-regret algorithm, in the
sense formalized in the source (Definition 2.2 / B.4). Intuitively, such
algorithms concentrate probability on actions whose payoff is
near-optimal, and they rarely play actions that are empirically
dominated by a large margin.
Concretely, we assume that the learner maintains internal
scores'' $(\sigma_i^t)_{i\in[n]}$ at each time $t$ (these may be realized cumulative payoffs under full information, or unbiased estimates under bandit feedback). The mean-based condition states that there exists a parameter $\gamma(T)=o(1)$ such that, over any horizon $T$, whenever an action $i$ is behind some other action $i'$ by more than $\gamma(T)T$ in score, then action $i$ is played with probability at most $\gamma(T)$ at that time: \[ \sigma_i^t < \sigma_{i'}^t - \gamma(T)T \ \ \Rightarrow\ \ \Pr[a_t=i] \le \gamma(T). \] We emphasize two modeling choices embedded here. First, this condition is deliberately permissive: it captures many standard no-regret procedures while allowing substantial transient dependence on the path of realized payoffs. Second, the condition is \emph{scale-sensitive} in the natural way: what matters is whether an action is worse by an amount that is large relative to the horizon. This is exactly the regime in which a principal might hope tosteer’’
behavior by shaping empirical averages.
A distinctive feature of our model is that the interaction does not
have a fixed deterministic horizon. Instead, the relationship ends at a
random stopping time S ∈ {1, 2, …} that is exogenous and
independent of play. In discrete time we parameterize S by a (possibly time-varying)
hazard sequence (ht)t ≥ 1,
where
ht := Pr [S = t ∣ S ≥ t], F̄t := Pr [S ≥ t]
denotes the corresponding survival probabilities. The independence
assumption is economically restrictive—in many labor and platform
settings incentives can affect retention—but it is analytically useful
because it isolates a basic tradeoff: dynamic incentives may require an
``investment’’ phase that only pays off if the relationship survives
long enough.
We will frequently use the continuous-time representation of the same
survival information. Let F̄(t) = Pr [S ≥ t]
for t ≥ 0 denote the survival
function and let
$$
h(t) := -\frac{d}{dt}\log \bar F(t)
$$
be the (instantaneous) hazard rate when F̄ is differentiable. The
constant-hazard case F̄(t) = e−ht
plays a special role both because it corresponds to geometric stopping
in discrete time and because it yields closed-form expressions for
several quantities of interest.
Given a contract policy {pt}t ≥ 1
and the induced sequence of actions and outcomes, the principal’s
realized total profit up to termination is
$$
\text{Profit}_P(\{p_t\},S) := \sum_{t=1}^{S}\big(r_{o_t}-p_{t,o_t}\big).
$$
The principal evaluates a policy by expected stopped profit,
UtilP({pt}) := 𝔼 [ProfitP({pt}, S)],
where the expectation is taken over outcome draws, the agent’s
randomization (from learning), and the stopping time S. Because S is independent of play, UtilP admits a useful
survival-weighted form:
$$
\text{Util}_P(\{p_t\})
=
\sum_{t=1}^{\infty}\bar F_t \cdot \mathbb{E}\!\left[u_P(p_t,a_t)\right],
$$
which makes transparent how churn reweights the importance of early
versus late periods. We will exploit an analogous integral
representation in continuous time.
To interpret dynamic policies, we fix a static benchmark. A contract posts the same payment rule each period (equivalently, the same α under linear contracts). Under myopic best responses, a static contract induces a single action (or a mixture over best responses) each period. Under a no-regret learner, the relevant notion is that, over long play, empirical frequencies concentrate on actions that are approximately optimal for that fixed contract.
Let R⋆ denote
the principal’s optimal single-round profit against a best-responding
agent within the admissible contract class (linear or p-scaled). We view R⋆ as the correct
per-round baseline because it is achievable by a time-invariant policy
and, under learning, is the natural limit point of what can be
guaranteed without exploiting transient path dependence. Under stopping,
the corresponding static value is simply scaled by expected relationship
length:
UtilF̄static := ∫0∞F̄(t) R⋆ dt = R⋆ 𝔼[S],
with the discrete-time analogue ∑t ≥ 1F̄tR⋆.
Dynamic contracting can do better than this benchmark by leveraging the dependence of a mean-based learner on historical averages. At a high level, the principal may temporarily offer generous incentives to move the learner’s internal state into a region where it continues to select high-reward actions even after incentives are reduced. This logic is inherently intertemporal, and it is precisely here that survival matters: if the relationship ends too quickly, the ``harvest’’ phase may never arrive. Our main object of interest is therefore the optimal expected stopped profit over all admissible dynamic policies against worst-case mean-based learning, and the induced advantage relative to UtilF̄static.
Before proceeding, we record two limitations that will be important when interpreting our conclusions. First, we treat churn as exogenous and independent of play, which rules out screening and retention effects that are central in many empirical environments. Second, our restriction to linear (or p-scaled) contracts abstracts from richer nonlinear incentives that a sophisticated principal might deploy if unconstrained. We adopt these simplifications because they deliver a clean geometric structure (breakpoints and action regions) and because they let us isolate a specific economic mechanism: dynamic advantage is feasible only to the extent that one can profitably trade off early incentives against later rents, and the survival curve determines how that tradeoff is priced.
In the next section we formalize this tradeoff by moving to a continuous-time representation in which the state variable is an average contract parameter and the objective is survival-weighted flow profit. We then show that, against mean-based learners, the discrete-time stopped problem reduces (up to lower-order terms) to an optimization over feasible continuous trajectories.
Our analysis proceeds by replacing the original discrete-time game with a continuous-time control problem whose feasible set captures what a mean-based learner can be induced to do, and whose objective captures how churn discounts late profits. The benefit of this reduction is conceptual as much as technical: it separates (which come from learning dynamics and are essentially unchanged by stopping) from (which is where the survival function enters). Once we have this separation, comparative statics in the hazard profile become transparent, and the later potential-function bounds can be stated in a clean integral form.
Fix any (possibly history-dependent) discrete-time policy {pt}t ≥ 1
and let Xt := uP(pt, at)
denote the principal’s per-round expected utility conditional on the
contract in round t and the
agent’s (possibly randomized) action. Because the stopping time S is independent of play, we can
rewrite expected stopped profit as a survival-weighted sum:
This identity is the discrete-time analogue of the familiar
continuous-time formula 𝔼 [∫0Sg(t) dt] = ∫0∞F̄(t)g(t) dt,
and it is the sole point at which exogenous stopping enters the
objective. Economically, says that churn does not change what happens ;
it changes only the price the principal pays for waiting.
To avoid inessential measurability issues, it is convenient to work
in continuous time and treat discrete periods as unit-length intervals.
Given any piecewise-constant control p(t) (or α(t) under linear
contracts), define the survival-weighted continuous-time objective
where (p(t), a(t))
is induced by a trajectory π
defined below. When F̄(t) = e−ht
(constant hazard), is an exponential discounting of flow profit. When
F̄ has heavier tails, later
profit retains more weight, making intertemporal ``investment–harvest’’
strategies more valuable.
The key feature of mean-based learning in the source is that play is
governed by payoff comparisons, which in turn depend on of contracts.
This suggests the appropriate continuous-time state variable: the
historical average contract up to time t,
$$
\bar p(t) := \frac{1}{t}\int_{0}^{t} p(s)\,ds
\qquad
(\text{or }\ \bar\alpha(t):=\tfrac{1}{t}\int_0^t \alpha(s)\,ds\
\text{under linear contracts}).
$$
Intuitively, if the principal holds p(t) fixed for a while,
then p̄(t) drifts
slowly toward that value. A mean-based learner does not optimize with
respect to the instantaneous contract; rather, it concentrates on
actions that have done well on average, which is why p̄(t) plays the central
role.
We formalize this using the trajectory representation from the
source. A (continuous-time) trajectory is a finite or countable
sequence
π = {(pk, τk, ak)}k = 1K,
interpreted as: for τk units of time
the principal posts contract pk and the agent
plays action ak. Let Tk := ∑ℓ ≤ kτℓ
denote the cumulative time up to segment k, and let
$$
\bar p^{\,k}:=\frac{1}{T^k}\sum_{\ell\le k}\tau_\ell p_\ell
$$
denote the historical average contract at the end of segment k.
The reduction in the source replaces the discrete-time mean-based
condition with a set of deterministic constraints on which action can be
sustained in each segment of a trajectory. The same constraints apply
here because stopping is independent of play: conditional on survival,
the agent’s score updates and payoff comparisons evolve exactly as in
the fixed-horizon model. Concretely, a trajectory π is if for every k ≥ 2,
Condition has a simple economic meaning. Within segment k, the agent is supposed to keep
playing ak
while the historical average moves from p̄ k − 1 to p̄ k. For a
permissive mean-based learner, the principal can sustain this only if
ak is
(approximately) optimal both at the start and at the end of the segment;
otherwise some alternative action accumulates a decisive score lead and
the learner would switch with high probability.
Crucially, does involve F̄ or h. Churn therefore acts like an objective-side discounting of a fixed feasible set: it changes which valid trajectories are desirable, not which trajectories are feasible.
Given a survival function F̄, define the value of the
survival-weighted control problem as
where UtilF̄(π) is
computed by interpreting π as
piecewise-constant functions p(t) and a(t) and applying . The
benchmark corresponding to the optimal static contract is
UtilF̄static = R⋆∫0∞F̄(t) dt = R⋆ 𝔼[S].
Our objective in this section is to justify as the correct
characterization of the principal’s maximal expected stopped profit
against worst-case mean-based learning.
The reduction has two directions, mirroring the fixed-horizon results.
First, given any discrete-time principal policy (even one that adapts to realized outcomes), we can extract a valid continuous-time trajectory whose survival-weighted value upper bounds the policy’s performance (up to lower-order terms). The key step is to note that the mean-based condition implies that the agent’s action can change only when some action’s cumulative advantage becomes large, which occurs only when averages cross boundaries of best-response regions. By grouping time into blocks on which the principal’s contract is (approximately) constant and the agent’s realized play is (approximately) constant, we obtain segments (pk, τk, ak). The historical averages at block boundaries become the p̄ k’s. The mean-based property then enforces in the limit: if ak were not a best response to p̄ k − 1 or p̄ k, then some competing action would have a linear-in-time score lead, contradicting that ak is played for τk time with non-negligible frequency.
The role of stopping is entirely captured by how we evaluate the resulting blocks. Using and passing to the block representation, the principal’s expected stopped profit becomes a Riemann-sum approximation to , with weights F̄(t) (or F̄t) multiplying flow utilities. Because F̄ is exogenous, this approximation is purely analytic: we do not need new incentive arguments beyond those in the source.
Second, given any valid trajectory π, we can construct a discrete-time policy that approximately implements it against any mean-based learner and achieves expected stopped profit close to UtilF̄(π). The construction follows the ``oblivious simulation’’ idea from the source. For each segment (pk, τk, ak), we play the contract pk for a block of ⌈τk/Δ⌉ discrete periods (for a small discretization step Δ), regardless of outcomes. Validity ensures that throughout the block the intended action is not severely dominated in cumulative score by any alternative, so a mean-based learner continues to place nearly all probability on ak. Importantly, because stopping is independent, posting contracts obliviously is without loss for our worst-case guarantee: conditioning on outcomes cannot improve the principal’s ability to force action changes when the agent is only constrained by mean-based regret.
Combining these two directions yields the survival-weighted analogue of the source’s trajectory characterization.
Proposition~ tells us that, once we restrict to linear (or more generally p-scaled) contracts and mean-based learning, churn affects the principal only through the weights F̄(t) in . This is exactly the economic tradeoff we want to isolate. Dynamic contracting typically requires paying ``too much’’ early in order to reshape the agent’s empirical comparisons, and then recouping later by cutting incentives while the agent continues to play a high-reward action. A higher hazard makes the recoup phase less likely, so in it downweights precisely those portions of the trajectory in which the principal hopes to earn rents.
We emphasize a limitation: this clean separation between feasibility and evaluation relies on stopping being exogenous and independent of play. If contracts or outcomes affected retention, then F̄(t) would become an endogenous object and the principal would face an additional intertemporal incentive problem (trading off current profit against future survival). Our framework deliberately abstracts from that channel in order to obtain sharp characterizations of the learning-based channel.
With the survival-weighted control problem in hand, we can now specialize to linear contracts and exploit the breakpoint geometry. In the next section we introduce a potential function that upper bounds how much ``intertemporal slack’’ the principal can extract, and we show how the hazard profile governs the maximal multiplicative advantage over the static benchmark.
We now specialize the survival-weighted control problem to linear
contracts and derive an upper bound on how much a dynamic policy can
outperform the best static contract. The key idea, inherited from the
fixed-horizon analysis in the source, is that dynamic advantage is not
free'': it is paid for by moving the historical average contract across finitely many best-response boundaries. A potential function quantifies this finiteintertemporal
slack,’’ and stochastic stopping enters only through how much of that
slack can be converted into survival-weighted profit.
Under a linear contract po = αro
with α ∈ [0, 1], the agent’s
expected utility from action i
is
uA(α, i) = αRi − ci,
so the best-response correspondence BR(α) is
piecewise-constant in α. For
consecutive actions (i − 1, i), define the
breakpoint
$$
\alpha_{i-1,i}:=\frac{c_i-c_{i-1}}{R_i-R_{i-1}},
$$
interpreting αi − 1, i = +∞
if Ri = Ri − 1.
We adopt the standard genericity convention that breakpoints lie in
[0, 1] and are strictly increasing in
i after removing dominated
actions. Then as α rises, the
agent moves monotonically to higher-cost, higher-reward actions.
Economically, breakpoints represent the incentive intensity α required for the agent to prefer upgrading from i − 1 to i. Dynamic contracting exploits the fact that a mean-based learner compares payoffs: after a long enough period of high α, the historical average ᾱ(t) can remain above key breakpoints even if the principal subsequently cuts incentives, causing the agent to keep choosing a high-reward action for some time. The question is how large a survival-weighted benefit the principal can extract from this mechanism.
We encode the breakpoint structure into a scalar potential function
ψ(α) that is
piecewise-linear in α and
increases only when α crosses
a breakpoint. One convenient normalization (equivalent to the source up
to affine transformations) is
This ψ has a direct
interpretation: each term (Ri − Ri − 1)(α − αi − 1, i)+
measures how far the incentive intensity α sits above the threshold needed to
make action i competitive
against i − 1, scaled by the
incremental principal reward of upgrading from i − 1 to i. The potential height
is an instance-dependent constant determined entirely by (c, R). It is finite (and
typically O(maxiRi)
under bounded rewards), and it is the maximal amount of potential the
principal can ever ``store’’ by pushing incentives as high as
possible.
Two qualitative features of Ψ are worth keeping in mind. First, Ψ is larger when adjacent actions are separated by small breakpoints (so that modest incentives can induce upgrades) and when reward increments Ri − Ri − 1 are large. Second, Ψ is of the stopping distribution: it is a property of the static environment, while churn determines how much of this stored slack can be monetized before termination.
Consider any valid trajectory π under linear contracts, and let ᾱ(t) denote the historical average incentive parameter along the trajectory. The central technical statement is that the principal’s flow profit above the static benchmark is controlled by the rate at which the trajectory spends potential.
Formally, let R⋆
denote the principal’s optimal static per-round profit (under the best
linear contract, anticipating BR(α)), and write
the principal’s instantaneous flow profit as uP(α(t), a(t)) = (1 − α(t))Ra(t).
Then one can adapt the source’s breakpoint-based argument to show an
inequality of the following form: for almost every t,
(t((t))),
\end{equation}
where the derivative is understood in the sense of absolutely continuous
trajectories (equivalently, segment-by-segment for piecewise-constant
controls). Intuitively, when the principal earns unusually high profit
at time t (typically by
offering a low α(t)
while the agent continues to play a high a(t)), the historical
average ᾱ(t) must be
drifting downward, and this drift reduces tψ(ᾱ(t)).
Thus excess profit is ``paid for’’ by potential expenditure.
Multiplying by the survival weight F̄(t) and integrating
yields
_0^F(t), d!(t((t))).
\end{equation}
The first term is precisely the static benchmark R⋆𝔼[S]. The
second term is the dynamic ``bonus’’ term, and it is here that the
hazard profile matters.
To make more interpretable, we integrate by parts. Using dF̄(t) = −h(t)F̄(t) dt
and assuming limt → ∞F̄(t) tψ(ᾱ(t)) = 0
(which holds under mild boundedness conditions, since ψ(ᾱ(t)) ≤ Ψ
and F̄(t) → 0), we
obtain
h(t)F(t), t,dt.
\end{align}
The functional ∫0∞h(t)F̄(t) t dt
is an ``effective recoup factor’’: it measures how much
survival-weighted time mass lies at larger t, where a free-fall phase (or any
delayed harvesting phase) can operate. Heavy-tailed survival curves make
this factor large; front-loaded termination makes it small.
Combining and yields an explicit upper bound:
_0^h(t)F(t), t,dt.
\end{equation}
Since the right-hand side depends on π only through the universal bound
ψ(ᾱ(t)) ≤ Ψ,
it applies uniformly to dynamic linear-contract policies against
mean-based learners (via the reduction in the previous section).
When S is exponentially
distributed with constant hazard h (so F̄(t) = e−ht),
the integral in evaluates in closed form:
$$
\int_0^\infty h e^{-ht}\, t\,dt
=
\frac{1}{h}.
$$
Therefore any valid trajectory π satisfies
il}_{F}()}{R^[S]}
1+.
\end{equation}
This constant-hazard expression is particularly useful for two reasons.
First, it makes clear that (in this normalization) the maximum
improvement over the static benchmark scales at most on the order of
Ψ/h, reflecting that
any attempt to harvest rents late is exponentially discounted. Second,
it isolates all instance dependence in Ψ and R⋆: once those are
computed from (c, R),
the survival effect under exponential churn is immediate.
The potential bound should be read as a sharp statement about the channel, rather than as a complete theory of retention. Because stopping is exogenous here, the principal cannot influence F̄ or h(t) via wages, working conditions, or product quality. In practice, many environments feature endogenous churn: low incentives may directly increase exit, and high incentives may extend the relationship. Incorporating such feedback would couple feasibility and evaluation, and the simple survival-weighted integral calculus above would no longer suffice.
A second limitation is that our bound leverages the path dependence of mean-based learning. If the agent satisfies stronger deviation constraints (e.g., swap regret), the feasible set of trajectories shrinks dramatically, and the dynamic advantage can collapse even without churn. Thus, empirically, the magnitude of Ψ is informative only to the extent that the deployed learning rule is permissive enough to be approximated by the mean-based model.
The structure of also suggests how to design near-optimal policies. The bound is tight only if the principal can convert a large portion of the available potential into early (high-survival) profit mass. This motivates hazard-matched phase policies that randomize the timing of incentive cuts in a way that aligns breakpoint crossings with the survival curve. In the next section we formalize this idea and show that suitably randomized two-phase (and phase-mixture) policies achieve survival-weighted performance that matches the hazard-sensitive upper bounds up to constant and, in some regimes, logarithmic factors.
We now complement the potential-based upper bound with constructive policies that are tailored to the survival profile. The high-level message is that the bound from is not merely a limitation: the same mechanism that makes excess profit possible (temporarily storing slack in the historical average contract) can be converted into a simple, robust whose switching time is chosen to align with the distribution of the stopping time.
We focus on a two-phase family parameterized by an ``investment’’
intensity α ∈ (0, 1] and a
(possibly randomized) switching time τ ≥ 0. In continuous time, the
policy posts the linear contract
i.e., pay αro up
to time τ and then switch to
α = 0 forever. In discrete
time, the analogue is: sample a random round τ ∈ {1, 2, …} at t = 1, play α for t ≤ τ, and then pay 0 for t > τ. Because τ is sampled ex ante and independent
of realized outcomes, this policy is and hence compatible with the
worst-case learning benchmark.
The economic logic is standard: the first phase deliberately
sacrifices flow profit in order to push the learner’s historical average
ᾱ(t) above key
breakpoints; the second phase harvests by cutting incentives while the
learner continues to play a higher action due to path dependence. Under
, the historical average takes the simple form
so after the switch it decays deterministically as 1/t. Consequently, the agent’s
action can only change at the times when ᾱ(t) crosses a breakpoint
αi − 1, i,
i.e., at
Thus, conditional on (α, τ), the induced action
path is piecewise-constant and (up to tie-breaking at breakpoints)
essentially deterministic under the trajectory validity constraints.
If the horizon were deterministic, the source shows that a carefully chosen (often deterministic) switch time can be near-optimal within broad classes of feasible trajectories. With stochastic stopping, a fixed τ becomes fragile: if τ is too large, the relationship often ends before harvesting begins; if τ is too small, the policy fails to move ᾱ(t) into a profitable region. Randomizing τ is a direct way to spread the policy’s ``mass’’ across likely termination times while retaining the same simple structure .
Formally, let FF(α, τ) denote the
free-fall trajectory induced by . For any mixing distribution μ over τ (and, if desired, a finite mixture
over α values), the
survival-weighted objective is linear:
This observation is important: it means we can optimize over of switch
times using convex-analytic tools, and it also implies that sampling
τ at time 0 is without loss relative to any more
elaborate randomization scheme (since the agent only responds to the
realized contract path).
A particularly interpretable hazard-matched rule is to draw τ from a distribution whose density is proportional to the survival-weighted hazard mass h(t)F̄(t) (the termination density in continuous time). Intuitively, this places greater probability on switching at times when termination is likely to occur, ensuring that a nontrivial fraction of policy realizations enters the harvesting phase exit.
One convenient parametrization is to choose a nonnegative weighting
function w(t) with
∫0∞w(t) dt = 1
and set τ ∼ w. The
expected value of the phase policy can then be written as
where α(t) is given
by . Because α(t) is
a threshold function of τ, the
expectation over τ induces a
smooth (and designable) time profile for Pr [α(t) = α] = Pr [τ ≥ t].
In other words, randomizing the switch time is equivalent to choosing a
curve for the incentive intensity itself.
The upper bound in suggests that the relevant scale for the total dynamic bonus is governed by the survival-weighted ``recoup factor’’ ∫0∞h(t)F̄(t) t dt multiplied by an instance-dependent potential height. Our phase mixtures can recover this scale whenever the instance admits a free-fall improvement in the known-horizon model and the survival curve places sufficient weight on horizons where that improvement materializes.
To state this cleanly, let Δ(T) denote the best
dynamic advantage achievable by a free-fall policy up to time T:
Δ(T) := supα, τ ≤ T{∫0TuP(α(t), a(t)) dt − R⋆T},
where the induced (α(t), a(t))
are consistent with trajectory validity and α(t) has the two-phase form
. Then, for any stopping time S independent of play, we obtain the
lower bound
_{T} F(T)(T).
\end{equation}
The inequality follows by considering the deterministic switch time that
is optimal for a given T, and
observing that the incremental profit accumulated up to T is realized whenever S ≥ T (while termination
before T can only reduce
harvesting, not negate already-earned profit). Thus, stochastic stopping
converts a fixed-horizon advantage Δ(T) into a
survival-discounted advantage F̄(T)Δ(T).
Equation highlights when dynamics help: if there exists some horizon T for which Δ(T) > 0 in the underlying instance (equivalently, the known-horizon dynamic optimum strictly exceeds R⋆T within the free-fall family), and the relationship is sufficiently long-lived in the sense that F̄(T) is bounded away from 0, then the stochastic-horizon problem also admits a strict improvement over the static benchmark. Conversely, if the survival curve is so front-loaded that F̄(T) is tiny for all horizons T at which Δ(T) becomes positive, then dynamic contracting cannot reliably reach the harvesting regime, and the best achievable value collapses back toward R⋆𝔼[S].
While already gives a clean sufficient condition for strict improvement, it is generally conservative because it commits to a single horizon T. A phase-mixture policy replaces supTF̄(T)Δ(T) by an over T values, which can be strictly larger when Δ(⋅) is spread over a range of horizons (as is typical when multiple breakpoints are relevant).
Concretely, one can pick a distribution μ over switching times τ and then analyze the realized
action path after switching via –. The expected dynamic bonus becomes an
integral of the form
for an explicitly defined (instance-dependent) kernel Φα(t)
that captures the marginal value of maintaining the high-incentive phase
until time t. Maximizing a
linear functional of the tail μ([ [t, ∞) ]) is a
one-dimensional convex program, and (by standard extreme-point
arguments) admits near-optimal solutions supported on few points. This
is the sense in which phase mixtures remain practically simple: despite
optimizing over distributions, the optimal (or approximately optimal)
policy typically randomizes among a small number of switch times.
Hazard-matched phase policies should be viewed as . They never attempt to infer the agent’s action directly, and they do not require observing ᾱ(t) beyond what the principal herself has posted. All the sophistication is in choosing when to stop paying for incentives, given that (i) after stopping, the induced decay ᾱ(t) = ατ/t deterministically walks the learner back down the breakpoint ladder, and (ii) survival weights F̄(t) determine which segments of that walk are likely to be realized.
At the same time, two caveats are worth emphasizing. First, the guarantee is inherently instance-dependent: if the static optimum already induces the top action (or if breakpoints are such that free-fall cannot create a profitable wedge between the action and the contemporaneous contract), then Δ(T) = 0 for all T and there is nothing to gain. Second, the construction relies on the permissiveness of mean-based learning; under stronger deviation constraints the free-fall path may cease to be feasible, and the entire phase mechanism can disappear.
The remaining task is computational: to deploy hazard-matched phase mixtures, we need to evaluate UtilF̄(FF(α, τ)) efficiently and understand how it depends on (α, τ) and F̄. In the next section we specialize to constant hazard (geometric/exponential survival) and to success/failure environments, where the breakpoint crossing times yield explicit finite-sum formulas and enable direct optimization over (α, τ) (and small mixtures), while more general hazards reduce to numerical integration of survival-weighted segment contributions.
We now specialize the survival profile to the constant-hazard case, both because it is economically canonical (memoryless churn) and because it turns the survival-weighted objective into an analytically tractable transform of the underlying free-fall path. We then further specialize to success/failure environments, where linear contracts coincide with simple ``bonus-on-success’’ schemes and all quantities admit a particularly transparent interpretation.
In continuous time, constant hazard h > 0 corresponds to exponential
survival
$$
\bar F(t)=e^{-ht},\qquad \mathbb{E}[S]=\int_0^\infty
e^{-ht}\,dt=\frac{1}{h}.
$$
In discrete time, the analogue is geometric stopping with parameter
h ∈ (0, 1],
$$
\bar F_t=\Pr[S\ge t]=(1-h)^{t-1},\qquad \mathbb{E}[S]=\sum_{t\ge
1}(1-h)^{t-1}=\frac{1}{h}.
$$
The memoryless property is not merely a modeling convenience: it
captures the operational reality of many principal–agent settings in
which the relationship ends due to exogenous turnover, product cycles,
or organizational reallocation, and it ensures that the marginal value
of delaying a switch can be summarized by a single scalar h.
In the success/failure specialization, there are two outcomes, with
rewards normalized as r1 = 0 (failure) and
r2 = 1 (success).
Each action i induces a
success probability Ri ∈ [0, 1], so
Ri = 𝔼o ∼ Fi[ro]
is literally the success rate. Under a linear contract po = αro,
the agent receives payment α
if and only if success occurs, so
uP(α, i) = (1 − α)Ri, uA(α, i) = αRi − ci.
In this environment, the breakpoints αi − 1, i = (ci − ci − 1)/(Ri − Ri − 1)
have an especially clean meaning: they are the bonus rates at which the
agent is indifferent between neighboring effort levels (when Ri > Ri − 1).
As in the source, we focus on instances satisfying the natural
monotonicity structure (increasing costs and rewards), which implies
that the best response to α
moves ``up the action ladder’’ as α increases.
Fix a two-phase free-fall policy FF(α, τ) as in . The
induced historical average after the switch is ᾱ(t) = ατ/t
for t > τ by . In
a success/failure environment with linear contracts, and away from
knife-edge ties, the agent’s best response depends on the scalar ᾱ(t) through the breakpoint
order: for each t > τ, the action is the
unique i such that
αi − 1, i ≤ ᾱ(t) < αi, i + 1.
Hence the only times at which the action can change are exactly the
breakpoint crossing times ,
$$
t_{i-1,i}(\alpha,\tau)=\tau\cdot \frac{\alpha}{\alpha_{i-1,i}}.
$$
Because t ↦ ᾱ(t) decreases
smoothly for t > τ, the post-switch
dynamics follow a deterministic ``walk down’’ the ladder of actions: the
agent begins at the best response to α (since ᾱ(τ) = α), then
drops to lower actions as ᾱ(t) falls below successive
breakpoints. This determinism is the key to closed-form evaluation under
exponential survival: we can write the total value as a finite sum of
exponential integrals over these breakpoint-delineated intervals.
Let i0 ∈ BR(α) denote the (tie-broken) best response when the incentive is held fixed at α. Under the free-fall policy, action is i0 throughout the investment phase t ∈ [0, τ]. After switching to α(t) = 0, the principal’s flow utility equals Ra(t) (since payment is 0), while the action a(t) is determined by ᾱ(t).
To describe the post-switch intervals, define (for each i ≥ 2) the time at which ᾱ(t) hits the breakpoint
into action i − 1:
$$
T_i(\alpha,\tau):=\tau\cdot \frac{\alpha}{\alpha_{i-1,i}}.
$$
These times satisfy $T_i(\alpha, \tau) \ge \tau$ whenever $\alpha \ge \alpha_{i-1,i}$, and they are increasing in $\tau$ and in $\alpha$. If $i_0$ is the action played at $\alpha$, then only the breakpoints below $i_0$ are relevant; accordingly, the post-switch path consists of a finite sequence of actions $i_0, i_0 - 1, \ldots, 1$ over the intervals
$$
[\tau,\, T_{i_0}(\alpha,\tau)),\ \ [T_{i_0}(\alpha,\tau),\, T_{i_0-1}(\alpha,\tau)),\ \ \ldots,\ \ [T_2(\alpha,\tau),\, \infty),
$$
where by convention $T_1(\alpha, \tau) = \infty$ (action $1$ persists forever once reached). On each such interval the flow payoff is constant, so under exponential survival each segment can be integrated in closed form.
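Concretely, writing $T_{i_0+1}(\alpha,\tau):=\tau$ and recalling $T_1(\alpha,\tau)=\infty$, the survival-weighted value of the two-phase policy can be assembled segment by segment (a reconstruction from the decomposition above, assuming the monotone breakpoint structure):
$$
\mathrm{Util}_{\bar F}\big(\mathrm{FF}(\alpha,\tau)\big)
=(1-\alpha)\,R_{i_0}\,\frac{1-e^{-h\tau}}{h}
\;+\;\sum_{j=1}^{i_0} R_j\,\frac{e^{-h\,T_{j+1}(\alpha,\tau)}-e^{-h\,T_j(\alpha,\tau)}}{h}.
$$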
The display above is already a closed form: it is a finite sum of exponential terms whose exponents are affine in $\tau$ (because $T_i(\alpha, \tau)$ is linear in $\tau$). In particular, for any fixed $\alpha$ and any fixed region in which $i_0$ is constant (i.e., $\alpha$ lies strictly between two breakpoints), the dependence on $\tau$ is smooth and, in many instances, unimodal, making one-dimensional optimization over $\tau$ numerically straightforward.
Two practical points are worth flagging. First, the formula highlights the economic tradeoff in a way that is hard to see from the trajectory definition alone: increasing $\tau$ lengthens the harvesting intervals (during which the posted incentive is $\alpha = 0$) but simultaneously pushes those intervals later in time, where the survival weight $e^{-ht}$ is smaller. Second, the only instance-specific objects entering the formula are the success rates $\{R_i\}$ and the breakpoints $\{\alpha_{i-1,i}\}$; thus, once the breakpoint structure is computed, evaluating $(\alpha, \tau) \mapsto \mathrm{Util}_{\bar F}(\mathrm{FF}(\alpha,\tau))$ reduces to a small number of elementary operations.
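To make the last point concrete, here is a minimal sketch of such an evaluator in Python, assuming the monotone success/failure structure above and constant hazard; the function name `free_fall_value` and the toy numbers are illustrative, not from the source.

```python
import numpy as np

def free_fall_value(alpha, tau, h, R, breakpoints):
    """Survival-weighted value of the two-phase policy FF(alpha, tau) under
    exponential survival with hazard h (a sketch of the closed form above).

    R[j]           : success rate of action j (0-indexed, increasing in j)
    breakpoints[j] : indifference rate alpha_{j,j+1} between actions j and j+1,
                     assumed increasing (the monotone structure in the text).
    """
    R = np.asarray(R, dtype=float)
    bp = np.asarray(breakpoints, dtype=float)
    # Best response to a constant incentive alpha (ties broken upward).
    i0 = int(np.searchsorted(bp, alpha, side="right"))
    # Investment phase [0, tau]: flow payoff (1 - alpha) * R[i0].
    value = (1.0 - alpha) * R[i0] * (1.0 - np.exp(-h * tau)) / h
    # Harvesting phase: walk down the ladder; action j is played on
    # [seg_start, seg_end) with seg_end = tau * alpha / alpha_{j-1,j}.
    seg_start = tau
    for j in range(i0, -1, -1):
        seg_end = tau * alpha / bp[j - 1] if j > 0 else np.inf
        end_weight = np.exp(-h * seg_end) if j > 0 else 0.0
        value += R[j] * (np.exp(-h * seg_start) - end_weight) / h
        seg_start = seg_end
    return value

# Toy 3-action instance (illustrative numbers only).
R = [0.2, 0.6, 0.9]
breakpoints = [0.25, 0.6]   # alpha_{0,1}, alpha_{1,2}
print(free_fall_value(alpha=0.65, tau=10.0, h=0.01, R=R, breakpoints=breakpoints))
```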
When time is discrete and $\bar F_t = (1-h)^{t-1}$,
the same decomposition applies with integrals replaced by sums and with
breakpoint times rounded to integers. If the post-switch action remains
constant over rounds t ∈ {L, L + 1, …, U},
its contribution is
$$
\sum_{t=L}^{U} (1-h)^{t-1}\cdot \text{(flow payoff)}=\text{(flow
payoff)}\cdot (1-h)^{L-1}\cdot \frac{1-(1-h)^{U-L+1}}{h}.
$$
Thus, in discrete time the value is again a finite sum of
geometric-series terms. The main additional bookkeeping is handling the
integer rounding of the breakpoint-crossing rounds $\lceil T_i(\alpha, \tau)\rceil$,
which creates small discontinuities in τ; in practice this is benign for
optimization because the discontinuities vanish under mild randomization
of τ (or can be handled by
evaluating neighboring integer candidates).
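As a quick numerical sanity check of the geometric-series identity above (a self-contained snippet; the hazard, segment endpoints, and flow payoff are placeholder values):

```python
# Check the geometric-series segment formula against direct summation.
h, L, U, flow = 0.05, 4, 20, 0.7          # placeholder hazard, segment bounds, payoff
direct = sum(flow * (1 - h) ** (t - 1) for t in range(L, U + 1))
closed = flow * (1 - h) ** (L - 1) * (1 - (1 - h) ** (U - L + 1)) / h
assert abs(direct - closed) < 1e-12
```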
Closed-form evaluation reduces the design problem for free-fall policies under constant hazard to searching over $(\alpha, \tau)$. We emphasize that the only source of non-smoothness is the identity of the induced action ladder, i.e., which action is optimal at $\alpha$ and which breakpoints are crossed after the switch. This suggests a natural computational strategy: enumerate candidate top actions $i_0$, restrict $\alpha$ to the interval $(\alpha_{i_0-1,i_0}, \alpha_{i_0,i_0+1})$, and within that region optimize the smooth value function over $\tau \ge 0$ (and over $\alpha$ via a one-dimensional line search or a coarse grid). Because the number of actions $n$ is finite and typically small in stylized models, this yields a simple and robust routine.
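A minimal sketch of this enumerate-and-search routine, assuming access to an evaluator such as the `free_fall_value` helper sketched above; the interface, names, and the assumption $\alpha \in [0,1]$ are ours, not the source’s.

```python
import numpy as np

def optimize_free_fall(value_fn, breakpoints, tau_grid, alphas_per_region=15):
    """Coarse enumerate-and-search over (alpha, tau), one region per top action.

    value_fn(alpha, tau) -> survival-weighted value of FF(alpha, tau)
    breakpoints          -> increasing indifference rates alpha_{i,i+1}
    """
    # Pad with the feasible range of alpha (here assumed to be [0, 1]).
    edges = [0.0] + list(breakpoints) + [1.0]
    best = (-np.inf, None, None)
    for i0 in range(len(edges) - 1):                 # candidate top action i0
        lo, hi = edges[i0], edges[i0 + 1]
        for alpha in np.linspace(lo + 1e-6, hi - 1e-6, alphas_per_region):
            for tau in tau_grid:
                v = value_fn(alpha, tau)
                if v > best[0]:
                    best = (v, float(alpha), float(tau))
    return best                                      # (value, alpha*, tau*)

# Example usage, reusing the free_fall_value evaluator sketched above:
# best = optimize_free_fall(
#     lambda a, t: free_fall_value(a, t, h=0.01, R=R, breakpoints=breakpoints),
#     breakpoints=[0.25, 0.6],
#     tau_grid=np.linspace(0.0, 200.0, 401),
# )
```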
Constant hazard and success/failure together form the most algebraically friendly case, but two departures quickly reintroduce numerical integration.
First, for general survival curves $\bar F(t)$, the same interval decomposition holds—the action changes only at the deterministic times $T_i(\alpha, \tau)$—but the segment contributions become integrals $\int_a^b \bar F(t)\,dt$, which are rarely available in closed form. In such cases, evaluating $\mathrm{Util}_{\bar F}(\mathrm{FF}(\alpha, \tau))$ reduces to computing a handful of one-dimensional integrals, which can be done accurately via standard quadrature. The resulting outer optimization over $(\alpha, \tau)$ is still low-dimensional, but it is no longer an ``elementary-function'' problem.
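A brief sketch of this segment-by-segment quadrature route, assuming the same toy ladder structure as before; the survival curve `Fbar`, the function name, and the numbers are illustrative choices.

```python
import numpy as np
from scipy.integrate import quad

def free_fall_value_general(alpha, tau, Fbar, R, breakpoints):
    """Value of FF(alpha, tau) under an arbitrary survival curve Fbar(t),
    computed by quadrature over the breakpoint-delineated segments (a sketch)."""
    bp = np.asarray(breakpoints, dtype=float)
    i0 = int(np.searchsorted(bp, alpha, side="right"))
    # Investment phase: flow payoff (1 - alpha) * R[i0] on [0, tau].
    value = (1.0 - alpha) * R[i0] * quad(Fbar, 0.0, tau)[0]
    # Harvesting segments: action j on [seg_start, seg_end), payment zero.
    seg_start = tau
    for j in range(i0, -1, -1):
        seg_end = tau * alpha / bp[j - 1] if j > 0 else np.inf
        value += R[j] * quad(Fbar, seg_start, seg_end)[0]
        seg_start = seg_end
    return value

# Example: a stretched-exponential survival curve (illustrative), heavier-tailed
# than the exponential benchmark.
Fbar = lambda t: np.exp(-(0.02 * t) ** 0.8)
print(free_fall_value_general(0.65, 10.0, Fbar, [0.2, 0.6, 0.9], [0.25, 0.6]))
```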
Second, even under exponential survival, if we expand the policy class to randomize over switching times (or to mix over multiple $\alpha$ values), then the expected value involves averaging over the mixing distribution. This remains easy when the distribution has a tractable Laplace transform (since the building blocks are exponentials), but for arbitrary mixing distributions one again falls back on numerical integration. Importantly, this is not a conceptual obstacle: it simply reflects that we are optimizing a linear functional over a continuous design space, and numerical methods are the natural tool once we leave the memoryless/finite-support comfort zone.
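To see why the Laplace transform is the natural object here, consider randomizing the switching time $\tau \sim G$ while keeping $\alpha$ fixed; this is a sketch under the same conventions as the closed form above. Every exponential in that closed form has exponent proportional to $\tau$, so
$$
\mathbb{E}_{\tau\sim G}\!\left[e^{-h\,T_i(\alpha,\tau)}\right]
=\mathbb{E}_{\tau\sim G}\!\left[e^{-h\tau\,\alpha/\alpha_{i-1,i}}\right]
=\mathcal{L}_G\!\left(\tfrac{h\alpha}{\alpha_{i-1,i}}\right),
\qquad
\mathcal{L}_G(s):=\mathbb{E}_{\tau\sim G}\big[e^{-s\tau}\big],
$$
and the expected value of the mixed policy is the same finite sum with each exponential (including the $e^{-h\tau}$ terms, which become $\mathcal{L}_G(h)$) replaced by the corresponding Laplace-transform evaluation.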
Taken together, these closed forms explain why the constant-hazard model is a useful workhorse. It allows us to (i) compute the value of a candidate free-fall policy essentially exactly, (ii) optimize it with minimal numerical overhead, and (iii) directly compare the achieved value to the hazard-sensitive upper bounds from the potential method. In the next section, we step beyond this workhorse case and discuss extensions in which the hazard itself may depend on history, or the contract space is restricted by fairness and minimum-pay constraints, or the principal faces limited feedback—each of which alters either the feasibility of free-fall trajectories or the computational tractability of evaluating them.
This section sketches four extensions that matter for applications and that also clarify which parts of our analysis are structural (coming from mean-based learning and the evolution of historical averages) versus which are artifacts of exogenous, memoryless churn and unconstrained transfers. Throughout, we keep the core friction: the principal observes outcomes but cannot directly condition on the agent’s action, while the agent’s behavior is governed by a mean-based no-regret dynamic. The central message is that most extensions preserve the trajectory viewpoint, but they either (i) enlarge the state variables needed to describe a valid trajectory, or (ii) shrink the feasible set of trajectories in ways that can sharply reduce (and sometimes eliminate) free-fall gains.
In many employment and platform settings, the probability of
relationship termination is not exogenous. Agents may quit after low
pay; principals may terminate after poor outcomes; regulators may impose
review events contingent on performance. A reduced-form way to capture
this is to allow the hazard to depend on the realized history Ht:
$h(t) = h(H_t)$, $\bar F(t) = \Pr[S \ge t \mid H_t]$,
in continuous time (with the obvious discrete-time analogue). Two
special cases are particularly natural.
Suppose the agent exits when their realized utility is persistently
low, producing a hazard that increases with the agent’s cumulative (or
discounted) utility shortfall. For example, one may posit
$$
h(t)=h_0+\kappa\cdot \Big(\max\{0,\,\underline u_A - \bar u_A(t)\}\Big),
$$
where $\bar u_A(t)$ is the agent’s running average utility and $\underline u_A$ is a reservation threshold. In such models, an aggressive free-fall
phase with α(t) = 0
can raise the hazard precisely when the principal is trying to harvest
high effort at low pay. This creates a new tradeoff absent under
exogenous churn: dynamic pay can hasten its own early termination. From the
trajectory standpoint, validity constraints remain driven by best
responses to historical average contracts, but the objective becomes
path-dependent because the survival weight F̄(t) is no longer a fixed
function of calendar time. Formally, the principal faces a control
problem in which the state must include both the historical average
contract (the usual state) and whatever statistic drives hazard (e.g.,
ūA(t)).
Even in the linear-contract specialization, the optimal policy need not
be a simple two-phase free-fall because the principal may prefer to
``smooth’’ incentive reductions to avoid triggering quits.
Conversely, suppose the principal can terminate at will (or at some cost), or that poor outcomes mechanically increase churn (e.g., project cancellation). Then the principal’s effective objective resembles an optimal stopping problem intertwined with dynamic contracts. The key conceptual point is that endogenous termination can make dynamic contracts more powerful in one direction (the principal can stop right after extracting value) but less powerful in another (the agent anticipates termination and discounts late rewards even without exogenous hazard). Under mean-based learning, the agent responds to realized payoffs and thus to any systematic pattern of early termination that correlates with actions; hence, termination policies can act as an additional implicit instrument, but one that is constrained by observability and commitment.
Our main technical limitation here is that the clean reduction to an
exogenously survival-weighted integral,
$\int_0^\infty \bar F(t)\,u_P(\cdot)\,dt$,
relies on S being independent
of play. Once S depends on
history, the analogue is still expressible as an expectation over paths,
but the principal’s problem is no longer a linear functional of the
trajectory. We view this as a promising direction: the same state
compression that makes free-fall analyzable (historical averages) may
still render the enlarged problem tractable, but one should expect
qualitatively new phenomena such as ``retention incentives’’ that cap
the depth of free-fall.
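For concreteness, a sketch of the path-dependent analogue (assuming the hazard $h(s, H_s)$ is well defined along each realized history): the survival weight becomes
$$
\bar F(t \mid H) = \exp\!\Big(-\int_0^t h(s, H_s)\,ds\Big),
\qquad\text{so the objective is}\qquad
\mathbb{E}_{H}\!\left[\int_0^\infty \bar F(t \mid H)\,u_P\big(p(t), a(t)\big)\,dt\right],
$$
which depends on the trajectory both through the flow payoff and through the survival weight, and is therefore no longer linear in the trajectory.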
A second set of extensions imposes constraints on transfers. In practice, contracts are often bounded below by minimum wage, budget feasibility, non-negativity and limited liability, internal pay equity rules, or external fairness constraints across demographic groups or tasks. We consider three stylized constraint families.
Suppose $p_o\ge \underline p_o$ for all outcomes $o$. For linear contracts $p_o = \alpha r_o$ with $r_1 = 0$, the failure payment is identically zero, so any floor on failure pay must satisfy $\underline p_1 = 0$ (a strictly positive floor is infeasible within the linear class), while a floor on success pay forces $\alpha\ge \underline \alpha>0$. This immediately rules out the extreme free-fall step $\alpha(t) = 0$ and replaces it with a free-fall to $\underline \alpha$. The deterministic breakpoint-crossing picture remains, but the post-switch average becomes $\bar\alpha(t)=\frac{\alpha\tau+\underline\alpha (t-\tau)}{t}$ rather than $\alpha\tau/t$, slowing or even preventing the descent through breakpoints. In the potential-based upper bound, such constraints effectively reduce the potential range $\Psi$ that can be ``spent'' by moving $\bar\alpha(t)$ downward, and therefore reduce the maximal dynamic advantage even when hazard is low.
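As an illustration of how the floor slows the descent (a short derivation from the modified average above; the superscripted symbol is ours), the crossing time into action $i-1$ solves $\bar\alpha(t) = \alpha_{i-1,i}$:
$$
T_i^{\underline\alpha}(\alpha,\tau)
=\tau\cdot\frac{\alpha-\underline\alpha}{\alpha_{i-1,i}-\underline\alpha}
\qquad\text{whenever }\alpha_{i-1,i}>\underline\alpha,
$$
and since $\bar\alpha(t)\to\underline\alpha$ as $t\to\infty$, breakpoints at or below $\underline\alpha$ are never crossed: the agent is permanently retained at the corresponding rung of the ladder.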
A common alternative is an ex ante individual-rationality constraint
(possibly at each time or in expectation over survival):
$$
\mathbb{E}\Big[\sum_{t=1}^{S} u_A(p_t,a_t)\Big]\ \ge\ 0,
\qquad\text{or}\qquad
\int_0^\infty \bar F(t)\,u_A\big(p(t),a(t)\big)\,dt\ \ge\ 0.
$$
Under exogenous survival, this constraint is linear in the trajectory
and thus fits naturally into our continuous-time formulation.
Economically, it converts some of the principal’s early ``investment’’
payments from a purely strategic instrument into a required transfer to
satisfy participation. This tends to compress the set of profitable
free-fall policies: the principal may still front-load pay to pull the
agent to a high action, but must now compensate (in a survival-weighted
sense) for the low-pay harvesting phase.
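For instance, specializing the survival-weighted participation constraint to $\mathrm{FF}(\alpha,\tau)$ under exponential survival (a sketch reusing the interval decomposition and conventions from the closed form above; in the harvesting phase the agent’s flow utility is $-c_{a(t)}$ since pay is zero):
$$
\big(\alpha R_{i_0}-c_{i_0}\big)\,\frac{1-e^{-h\tau}}{h}
\;-\;\sum_{j=1}^{i_0} c_j\,\frac{e^{-h\,T_{j+1}(\alpha,\tau)}-e^{-h\,T_j(\alpha,\tau)}}{h}
\;\ge\;0 .
$$
This makes precise the sense in which front-loaded pay must cover, in survival-weighted terms, the costs incurred during the low-pay harvesting phase.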
In multi-group settings (e.g., different tasks or worker types) one
may require that contracts do not differ too much across groups, or that
expected utility satisfies parity constraints. Even in a single-agent
model, one can interpret such rules as restrictions on the admissible
α path, e.g.,
$$
|\alpha(t)-\alpha(t')|\le L|t-t'|\quad\text{(smoothness)},\qquad
\alpha(t)\in[\underline\alpha,\bar\alpha]\quad\text{(caps)}.
$$
These constraints again shrink the trajectory set. A useful practical
insight is that the shape of the optimal dynamic policy changes: rather than a
sharp switch, one obtains ramp-downs (when smoothness is enforced) or
bang-bang-with-floor behavior (when only a minimum is enforced). From a
computational viewpoint, closed-form evaluation under exponential
survival may survive for piecewise-constant policies but will generally
be replaced by numerical integration once smoothness is imposed.
Our baseline assumes the principal knows the primitives $(c, F, r)$ and thus can compute breakpoints and expected rewards. In many applications the principal does not know the $R_i$ (or the number of outcomes $m$ is large and outcomes are sparse), and must learn from observed outcomes while simultaneously steering an agent who is also learning. This creates a two-sided learning problem with distinct informational constraints: the principal observes outcomes but not actions; the agent observes its own realized payoffs. Two implications stand out.
First, identification is limited. Because the action is unobserved, the mapping from a posted $\alpha$ to observed outcomes depends on the agent’s (learning-driven) response, which depends on the historical average $\bar\alpha(t)$. Thus, naive estimation of $R_i$ from outcomes is confounded by endogenous action choice. A conservative route is to target robust policy classes whose performance can be certified without precise identification—for instance, restricting to a small menu of $\alpha$ values and using survival-weighted bandit algorithms over that menu. In the constant-hazard case, the objective is naturally discounted, and standard discounted bandit techniques can be adapted, though the non-stationarity induced by mean-based learning still requires care.
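A minimal sketch of the menu-restricted route under geometric churn; the discounted ε-greedy rule, the environment stub `pull`, and all names are our illustrative choices rather than a method from the source.

```python
import random

def run_menu_bandit(menu, pull, h, horizon, eps=0.1, discount=None):
    """Discounted epsilon-greedy over a fixed menu of incentive rates (a sketch).

    menu    : list of candidate alpha values
    pull    : pull(alpha) -> realized per-round principal profit (environment stub)
    h       : per-round stopping probability (geometric churn)
    horizon : maximum number of rounds to simulate
    """
    gamma = discount if discount is not None else 1.0 - h   # survival-matched discount
    est = {a: 0.0 for a in menu}      # discounted cumulative profit per arm
    wts = {a: 0.0 for a in menu}      # discounted pull counts per arm
    total = 0.0
    for _ in range(horizon):
        if random.random() < h:       # exogenous churn ends the relationship
            break
        # Decay old statistics so estimates track the (non-stationary) response.
        for a in menu:
            est[a] *= gamma
            wts[a] *= gamma
        if random.random() < eps:
            a = random.choice(menu)
        else:
            a = max(menu, key=lambda x: est[x] / wts[x] if wts[x] > 0 else float("inf"))
        r = pull(a)
        est[a] += r
        wts[a] += 1.0
        total += r
    return total

# Example usage with a stubbed, stationary environment (purely illustrative):
# profit = run_menu_bandit(menu=[0.0, 0.3, 0.6],
#                          pull=lambda a: (1 - a) * 0.6, h=0.01, horizon=5000)
```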
Second, even when identification is possible, one should expect an exploration/exploitation tradeoff. The principal may need to vary incentives to learn which effort levels are achievable and profitable, but such variation itself moves the historical average and can trigger (or undo) free-fall dynamics. A practical design principle is to separate time scales: use short, randomized exploration bursts that minimally perturb $\bar\alpha(t)$ (hence minimally perturb the agent’s learning state), interleaved with longer exploitation phases. Technically, this suggests extending the reduction-consistency step to allow principal policies that are not fully oblivious but are ``slow'' relative to the agent’s averaging dynamics.
Finally, many principals contract with multiple agents simultaneously (teams, marketplaces, model ensembles). Multi-agent structure can either amplify or attenuate the dynamic advantage channel.
If the principal can post individualized contracts $p_t^{(\ell)}$ to each agent $\ell$, and each agent runs an independent mean-based learner, then the problem largely decomposes: the principal’s value is the sum of per-agent values, each governed by its own historical average. The main coupling comes from shared constraints (budget, fairness) or from aggregate performance objectives that are nonlinear (e.g., a minimum of outputs). In the decomposable case, constant hazard implies $\mathbb{E}[S] = 1/h$ per relationship; with many agents, the principal can diversify churn risk, which makes dynamic policies more attractive in aggregate even if per-agent advantage is modest.
If the principal must post a single contract to all agents (a common bonus rate, a platform-wide revenue share), then the relevant state is no longer one-dimensional. Agents at different learning states imply that the same $\alpha_t$ induces heterogeneous responses, and the ``best response to the historical average'' condition becomes a distributional statement across agents. In large markets, a law-of-large-numbers approximation can make the aggregate response deterministic, but the principal now controls a population dynamical system. The potential method may extend by aggregating potentials across agents, yet free-fall manipulations can be blunted because the principal cannot tailor the investment phase to each individual’s breakpoint structure.
If outcomes depend on multiple agents’ actions (team production, congestion), then an agent’s reward-relevant signal reflects others’ behavior. Mean-based learning in such environments can converge to correlated outcomes that are not easily summarized by a single ᾱ(t). Nevertheless, the central insight remains relevant: any dynamic advantage that relies on steering through historical averages will be fragile to additional strategic externalities, and stronger learning notions (e.g., swap regret) become even more natural on the agent side.
Taken together, these extensions suggest that the constant-hazard, single-agent, known-primitive model is best interpreted as a sharp lens on one mechanism—history dependence induced by mean-based learning—rather than as a complete description of practice. The good news is that the lens is portable: once we identify how an extension changes the trajectory constraints or the survival-weighted objective, we can often recover the same qualitative comparative statics (dynamic gains shrink when effective hazard rises, and when constraints limit downward movement of the average contract), while also generating concrete, testable predictions about when free-fall-like policies should disappear in the presence of retention concerns, wage floors, or platform-wide uniform pricing.
Our analysis is best read as identifying a mechanism by which dynamic pay can systematically outperform static pay when the agent adapts via mean-based no-regret learning: the principal can move the agent across action regions by shaping the historical average contract, and then ``harvest'' during periods in which current incentives are weak but the learner remains pinned (temporarily) to a high-reward action. The mechanism is neither classical screening nor standard moral hazard with full rationality; it is a form of intertemporal manipulation that exploits path dependence in learning dynamics. This section distills when this mechanism is likely to matter, what it suggests for auditing and governance of incentive schemes, and what it implies for the choice of agent-side algorithms. We close with open questions that, in our view, separate modeling convenience from the core economics.
A first practical message is that dynamic advantage is not generic: it requires a conjunction of (i) exploitable path dependence (mean-based learning with best-response-to-average constraints), (ii) enough ``room'' in the incentive parameter space to traverse multiple breakpoints (captured by the potential height $\Psi$), and (iii) sufficient expected relationship length to recoup the initial ``investment'' phase before exogenous termination downweights later harvesting. The
last requirement can be summarized by a timescale comparison. Let $T_{\mathrm{mix}}$ denote an instance-dependent recoup or mixing time—informally, the time needed for a front-loaded incentive to translate into a sufficiently high historical average (and hence sustained high effort) after incentives are reduced. Then the condition
$$
h \cdot T_{\mathrm{mix}} \ll 1
$$
is a useful rule of thumb under constant hazard, and more generally one
expects dynamic gains to be governed by an ``effective hazard’’
functional Λ(F̄) that
places weight on early termination events. When h is high (or F̄ is thin-tailed), free-fall-style
policies become dominated by static or near-static contracts because the
principal cannot reliably reach the profitable region of the trajectory
before stopping occurs.
A second message is that the implementation of the advantage is inherently non-stationary. Even if the principal posts a one-dimensional linear contract $\alpha(t)$, profitable policies typically feature distinct phases (or mixtures over phases) rather than a single constant $\alpha$. This has two implications. First, in environments where institutional constraints force near-stationarity (e.g., regulated revenue shares that cannot change frequently), our results predict that the dynamic advantage should largely disappear. Second, where rapid adjustment is feasible (platform bonuses, short-term commissions, dynamic ``quests''), one should expect large cross-sectional variation in outcomes even with similar long-run average pay: what matters is not only the average level of incentives, but also when incentives are delivered relative to the agent’s learning state and the anticipated relationship duration.
Third, the model clarifies when dynamic contracts lose their advantage for structural reasons. Strengthening the agent’s learning notion (e.g., toward swap regret) removes the intertemporal wedge the principal exploits, collapsing the additional value from steering through historical averages. Likewise, any constraint that prevents the principal from reducing pay sufficiently far (minimum-pay rules, fairness floors, limited liability coupled with bounded bonuses) truncates the reachable potential and therefore caps the benefit of dynamic manipulation. These are not merely technicalities: they suggest that policy restrictions and algorithmic safeguards can substitute for one another in limiting exploitative dynamics.
A natural governance concern is that dynamic pay can be used to create high incentives early (to attract or condition behavior) followed by systematically low incentives once the agent is ``locked in’’ by its own adaptation. Because the principal does not need to condition on actions, such schemes can be difficult to detect using standard contract review that focuses on per-period expected payments. Our framework suggests focusing instead on diagnostics that are sensitive to history dependence.
One approach is to audit incentive trajectories rather than incentive levels. In linear settings, this corresponds to identifying paths $\alpha(t)$ with large early mass followed by sharp reductions. Another approach is to compute a static benchmark and compare realized performance to what would be expected under the best static linear contract (or, more generally, the best static contract in the permissible class). In our notation this benchmark is $R^{\star}\,\mathbb{E}[S]$, and the policy question becomes whether observed profits materially exceed (or rely on) the predicted dynamic surplus bounds governed by $(\Psi, \bar F)$. While a regulator typically does not observe $(c, F)$, many environments admit proxies: one can estimate a platform’s effective hazard from retention data and bound the feasible $\Psi$ from the observed menu of incentive rates and outcome rewards. The comparative statics then yield a falsifiable implication: dynamic advantage should be concentrated in low-hazard segments (long-tenure cohorts) and should attenuate sharply as churn rises.
A complementary auditing lens is agent welfare. Even in our exogenous-hazard benchmark, dynamic policies that harvest during low-pay phases tend to reduce the agent’s realized utility conditional on surviving to late stages. In practice, when quitting is endogenous, such policies may raise churn among disadvantaged or liquidity-constrained agents. Thus, monitoring should include retention responses to incentive reductions, not merely output. In settings where fairness constraints are salient, one can also check whether dynamic incentives create disparate impact by interacting with heterogeneous learning rates or outside options: two groups facing identical posted incentives may nonetheless experience different effective trajectories because their hazard or adaptation differs.
Finally, auditing should explicitly recognize the possibility of obfuscation. A principal can randomize phase lengths or use complex bonus schedules that are hard to summarize. Our results suggest that such randomization is not an innocuous modeling flourish: under uncertain stopping times, it can be close to optimal. This motivates transparency requirements that mandate disclosure of incentive evolution rules (or at least bounds on their rate of change), akin to rules in consumer finance that restrict teaser-rate designs.
From the agent’s perspective, the central vulnerability is that mean-based learning responds to historical average payoffs, which allows the principal to manipulate behavior by manipulating the agent’s running averages. A direct implication is that the choice of learning algorithm is not merely a performance detail; it is a strategic choice that shapes the feasible set of principal pay trajectories.
Our results therefore support a design principle for algorithmic agents (including AI proxies acting on behalf of workers or users): prefer learning dynamics that satisfy richer no-regret guarantees—notably, variants that eliminate path-dependent exploitation (e.g., swap-regret-like notions)—even if they are computationally heavier. Put differently, robustness to adversarially chosen incentives should be treated as a first-class objective alongside sample efficiency. In institutional settings, this is analogous to recommending that workers (or their representative tools) use decision rules that remain responsive to incentives rather than being overly anchored by historical averages.
A second implication concerns transparency about the agent’s learning rule. If an agent reveals that it uses a mean-based learner with a known averaging window, it effectively reveals its $T_{\mathrm{mix}}$, enabling a principal to calibrate phase lengths and extract more surplus. This creates a strategic tradeoff: transparency can facilitate beneficial coordination, but it can also facilitate exploitation. One practical compromise is to disclose coarse behavioral guarantees (e.g., minimum responsiveness to incentive changes, or a form of policy stability) rather than the precise update rule.
A third implication is the role of constraints internal to the agent. If the
agent can impose a participation constraint on itself (e.g., refuse to
continue when realized utility falls below a threshold), then the
principal’s ability to free-fall is curtailed even absent external
regulation. More broadly, algorithmic agents can incorporate
retention'' orsafety’’ objectives that mimic endogenous
quitting, thereby converting the principal’s dynamic instrument into a
costly lever.
Several directions appear especially important. First, endogenizing churn (history-dependent hazard) is likely to change both optimal policies and welfare conclusions: retention incentives may become a binding constraint that limits free-fall and introduces new equilibria in which principals smooth pay to maintain continuation. Second, principal-side learning remains underexplored. When the principal must infer $R_i$ and the breakpoint structure from outcomes confounded by the agent’s learning response, the optimal policy may involve carefully designed experiments that themselves interact with the agent’s state. Third, richer contract spaces (nonlinear bonuses, caps, penalties, or multidimensional outcome signals) raise the question of whether the historical-average state compression survives, and if so, which low-dimensional summaries replace $\bar\alpha(t)$.
Finally, we see an empirical agenda. The theory predicts that dynamic incentive schemes should be most effective in low-hazard environments, should exhibit front-loading followed by incentive reductions, and should lose effectiveness as agents adopt more sophisticated adaptive rules or as institutions impose floors and smoothness constraints. Testing these predictions—especially disentangling learning dynamics from selection and unobserved heterogeneity—would sharpen the policy relevance of dynamic contracting models and help distinguish benign intertemporal incentives from exploitative manipulation.
Taken together, the broader lesson is that dynamic pay is neither universally harmful nor universally beneficial. It is powerful precisely when adaptation is predictable and path dependent, and when relationships last long enough for investment-and-harvest strategies to pay off. That lens suggests concrete levers for practice: regulate or audit incentive trajectories rather than levels, strengthen agent-side learning guarantees when agents are algorithmic, and pay particular attention to environments with low churn and high discretion over time-varying incentives, where the scope for systematic dynamic outperformance is greatest.