Contracting with Churn: Dynamic Incentives Against Learning Agents Under Stochastic Stopping

Table of Contents

  1. Introduction and Motivation: churn/stochastic termination in platforms and AI-mediated work; why learning agents change contracting; summary of hazard-sensitive collapse and constructive policies.
  2. Related Work: repeated contracting and algorithmic contracts; optimizing against learners; unknown-horizon results in the source; churn/discounting models in economics and online learning.
  3. Model: principal–agent primitives; linear/p-scaled contracts; mean-based learning; exogenous stopping time (survival/hazard); objective and benchmarks (static vs dynamic).
  4. Survival-Weighted Continuous-Time Reduction: define survival-weighted trajectory utility; prove reduction from discrete time with stopping to continuous trajectories (analogue of Theorems 2.4 and 5.4).
  5. Potential Function and Upper Bounds: define breakpoint potential height Ψ; derive hazard-sensitive bounds on achievable multiplicative advantage; provide closed-form for constant hazard.
  6. Achievability via Hazard-Matched Phase Policies: construct randomized two-phase/phase-mixture policies; show performance guarantees and identify when dynamic strictly helps.
  7. Specialization and Closed Forms: constant-hazard (geometric/exponential) case; success/failure (linear) case; explicit formulas for expected payoff of free-fall under hazard; when numerical optimization is needed.
  8. Extensions: history-dependent hazard (endogenous churn), minimum-pay/fairness constraints, partial feedback, and multi-agent variants (brief).
  9. Discussion and Policy Implications: when dynamic pay can systematically outperform static; auditing implications; implications for agent-side algorithm choices; open questions.

Content

1. Introduction and Motivation: churn/stochastic termination in platforms and AI-mediated work; why learning agents change contracting; summary of hazard-sensitive collapse and constructive policies.

Many contracting environments of current interest are neither one-shot nor truly long-lived. Digital labor platforms, creator marketplaces, and enterprise procurement systems all exhibit substantial churn: principals (platforms or buyers) and agents (workers, sellers, service providers) match, interact repeatedly for some time, and then separate for reasons that are often orthogonal to performance—demand shocks, reallocations of attention, policy changes, or idiosyncratic departure. Similar dynamics arise in AI-mediated work. A user may repeatedly delegate tasks to an AI tool, a firm may route requests through an AI proxy, or a platform may deploy automated agents that respond to incentives embedded in APIs and billing rules; in each case the ``relationship length'' is uncertain because the task stream ends, the user switches tools, or policies change. These settings motivate a repeated principal–agent model in which the interaction ends at a random time and where the agent is not assumed to solve a full dynamic program, but instead updates behavior using a learning algorithm.

Our starting point is the observation that agent-side learning changes the strategic landscape of dynamic contracting in a qualitatively different way than classical discounting or finite-horizon rationality. When an agent best-responds myopically to posted contracts, the principal can optimize within familiar incentive-compatibility constraints. When the agent instead runs a no-regret procedure, the constraints the principal faces are not simply static IC constraints period by period; rather, they arise from the algorithm's guarantee that actions with persistently lower empirical utility are played rarely. In the ``mean-based'' class of no-regret learners, this guarantee depends on cumulative historical payoffs. As a consequence, a principal can sometimes profit by shaping the agent's empirical utility landscape over time: early payments can steer the learner toward a high-cost, high-output action, after which the principal can reduce payments and still retain the desired behavior for a while because the learner's running averages adjust slowly.

This path dependence under mean-based learning is the engine behind dynamic advantages identified in recent work on algorithmic agents in repeated contracting. In deterministic-horizon models, one can formalize this advantage via continuous-time trajectories that track the evolution of the time-averaged contract and the induced best responses. The principal can implement ``free-fall'' policies: offer a relatively generous linear contract for an initial phase to push the learner toward a desirable action, then abruptly cut incentives (potentially to zero) and harvest profits while the learner continues to play the previously reinforced action until the empirical averages cross critical thresholds. These effects are not artifacts of sophisticated forward-looking reasoning by the agent; they arise precisely because the learner is not forward looking, but is instead governed by a regret guarantee that is inherently backward looking.

Yet the practical relevance of such dynamic manipulation hinges on a basic question: how long does the relationship actually last? In platforms and AI tool deployments, the answer is: not always long. A principal may not know whether the relationship will last for 10 rounds or 10⁶ rounds; and crucially, the termination event is typically exogenous. This motivates the main modeling departure in this paper: we treat the interaction length as a random stopping time S, independent of the realized outcomes and history. Rather than optimizing a fixed-horizon sum, the principal maximizes expected stopped profit, equivalently a survival-weighted integral in continuous time with survival function F̄(t) = Pr [S ≥ t] and hazard $h(t)=-\frac{d}{dt}\log \bar F(t)$.

Introducing stochastic termination does more than add a technical discount factor. It forces a different economic tradeoff. Dynamic contracts against mean-based learners can be viewed as ``front-loading'' incentives to move the learner's state (its empirical averages) into a region that yields high principal surplus later. Random termination penalizes such strategies precisely because it truncates the future. When the hazard is high, the principal is unlikely to enjoy the late-stage harvest phase; when the hazard is low and the survival distribution has heavy tails, dynamic steering becomes more valuable. Thus churn acts as a natural force that can dampen dynamic exploitation of learning dynamics.

We develop this intuition through a survival-weighted analogue of the trajectory formulation for mean-based learning dynamics. A key simplification is that exogenous stopping changes the objective, not the feasibility constraints induced by learning: the agent's best-response-to-average condition still governs which action sequences can arise as the average contract evolves. Consequently, we can express the principal's problem as a continuous-time control problem over valid trajectories π, but with payoffs weighted by F̄(t). This reduction plays two roles. First, it provides a clean characterization of the best achievable payoff against worst-case mean-based learning under a given survival curve. Second, it lets us import structural insights from the deterministic-horizon setting—especially for linear contracts—while making transparent how hazard reshapes what is achievable.

Our main qualitative message is a hazard-sensitive collapse of dynamic advantage. In linear-contract environments, the deterministic-horizon analysis can be summarized by a finite set of breakpoints in the linear share parameter α at which the agent's preferred action changes. The principal's ability to obtain profits above the best static contract is tied to how far, and how quickly, the principal can move the historical average ᾱ(t) across these breakpoints. This can be captured by a piecewise-linear potential function ψ(α) whose total height Ψ measures the ``budget'' of manipulation available in the instance. Under stochastic termination, this budget does not disappear, but its value depends on when it is spent: spending potential late is less valuable because the relationship is less likely to survive. The hazard therefore translates into an upper bound on the multiplicative improvement over the static benchmark that decreases as churn increases. In high-hazard environments, the principal cannot credibly amortize an initial subsidy over a long harvest period, and dynamic contracts become nearly indistinguishable (in value) from the optimal static contract.

At the same time, we emphasize that churn does not automatically eliminate all dynamic effects. When the survival distribution has sufficient mass on long durations, dynamic policies can still outperform static ones, and one can design policies that are matched to the hazard. A deterministic ``switch at time τ'' free-fall policy is brittle: if the relationship ends just before the harvest phase, the principal pays the cost without reaping the benefit. Our constructive results therefore advocate randomized phase policies that spread the switch time across the survival curve. Economically, these policies hedge termination risk by mixing over trajectories with different recoup horizons; algorithmically, they form a one-dimensional family that is tractable to optimize once the survival curve is known (and can be numerically optimized in more general hazard models). In the important special case of constant hazard (exponential survival), the survival-weighted objective yields closed-form expressions for the value of a free-fall policy, enabling simple search over an initial incentive level and a switch-time parameter.

This perspective helps connect the theory to practice. In settings like gig work or API marketplaces, a platform frequently faces uncertain user lifetimes and must decide whether to offer ``boosts,'' bonuses, or temporarily generous terms to induce higher effort or quality. Our results suggest that the profitability of such front-loaded incentives depends sharply on retention: where hazard is high, simple static terms can be near-optimal even when sophisticated dynamic schemes exist in a deterministic-horizon abstraction; where hazard is low, hazard-matched dynamic incentives can be justified and can be tuned using an explicit survival curve. In AI-mediated work, where the ``agent'' may literally be a learning system, the relevance is twofold: (i) dynamic contracts can unintentionally exploit learning dynamics, producing transient over-performance followed by collapse when incentives change; (ii) conversely, a principal may deliberately use phased incentives to rapidly train or steer a deployed model, but churn (task cessation, tool switching, policy changes) limits the attainable gains.

We also view stochastic termination as a disciplined way to discuss robustness to horizon uncertainty. Deterministic-horizon dynamic advantages often rely on long recoup periods. In real deployments, relationship length is a moving target: it varies across users, is sensitive to macro conditions, and can be disrupted by exogenous events. Modeling S explicitly allows us to ask how sensitive dynamic contracting benefits are to misspecification of the survival curve, and it suggests a natural operational statistic—an ``effective hazard''—that governs whether dynamic schemes are worthwhile. This framing also clarifies an important limitation: our analysis assumes termination is exogenous and independent of the principal's actions. In many platforms, incentives can affect retention (e.g., an agent stays longer if paid more). Endogenizing churn would introduce an additional channel and could either amplify or dampen dynamic effects. We treat exogenous stopping as a first step that isolates the interaction between learning dynamics and time uncertainty.

Finally, we stress that our model does not claim that real agents are literally mean-based learners, nor that principals can commit to arbitrary dynamic policies without frictions. Rather, the model illuminates a tradeoff: dynamic contracts can leverage the inertia of no-regret learning, but doing so typically requires an upfront investment whose payoff is back-loaded; stochastic termination downweights the back end and thus compresses the scope for gains. This provides a unifying explanation for when algorithmic-contracting pathologies should be expected to matter, and when they should wash out in the face of churn. The remainder of the paper formalizes this logic, derives hazard-dependent bounds, and constructs simple hazard-matched policies that attain these bounds up to instance-dependent constants.

2. Related Work: repeated contracting and algorithmic contracts; optimizing against learners; unknown-horizon results in the source; churn/discounting models in economics and online learning.


Our setting combines three classical themes—hidden-action contracting, repeated interaction, and learning dynamics—with an explicit representation of churn via an exogenous stopping time. Each ingredient has a substantial literature, and our contribution is best understood as importing a particular viewpoint from recent work on contracting against no-regret learners into a stochastic-termination environment, where the key object is the survival curve (equivalently, the hazard).

The foundational principal–agent literature studies moral hazard under hidden action and stochastic outcomes, typically under one-shot or discounted infinite-horizon formulations (see, e.g., ). In repeated interactions, the theory of relational contracts emphasizes how dynamic incentives can be sustained by continuation values when parties are long-lived and can condition on histories . That literature features fully rational agents and incentives enforced by equilibrium threats rather than by a learning rule. Our environment differs in two ways. First, the principal here is constrained to simple payment schemes (linear or, more generally, p-scaled contracts), motivated by practical platform rules and the tractability of the breakpoint structure in such classes. Second, the agent is not assumed to compute an equilibrium of the repeated game; instead, behavior is governed by a no-regret algorithm whose guarantees are retrospective, stated relative to realized play. This shift changes what ``dynamic incentives'' mean: rather than manipulating continuation utilities through equilibrium punishments, the principal shapes the agent's payoff landscape so that certain actions remain attractive relative to the agent's running averages.

At a high level, one can view our results as complementary to relational contracting: when incentives are implemented via learning dynamics rather than equilibrium reasoning, a principal may obtain short- to medium-run benefits from inertia in the learner's state (running averages). However, these benefits must be amortized over time, which is precisely where stochastic termination becomes economically first-order.

A growing literature studies economic design problems when one side is algorithmic or boundedly rational, including learning buyers/sellers in auctions, pricing against learning demand, and platform design for learning participants . In many of these models, the designer optimizes against a learning rule rather than a fully strategic opponent, and dynamic policies can exploit predictable features of learning algorithms. Closest in spirit are models where the principal controls an incentive signal (price, wage, ranking weight) and induces a learner to adapt, creating a feedback loop that resembles a control problem.

Within contracting, recent work (including our source) formalizes how a principal can exploit mean-based no-regret learning to obtain dynamic advantages even with very restricted contract classes. The key technical idea is that, for mean-based learners, feasibility of action sequences can be characterized by how the agent best-responds to the historical average contract. This leads to a continuous-time limit in which the state variable is an average contract (or average linear share), and the principal's policy corresponds to a trajectory through a polyhedral partition of contract space. Our paper builds directly on that perspective, but changes the evaluation criterion: instead of a deterministic horizon (or a worst-case time criterion), we weight payoffs by survival probabilities induced by exogenous churn.

From the online learning side, our agent model belongs to the broad family of no-regret dynamics . A key conceptual point is that no-regret is an asymptotic guarantee; it does not generally preclude substantial transient behavior. This gap between asymptotic optimality and finite-time path dependence underlies a number of ``regret manipulation'' and ``learning in games'' phenomena . The mean-based condition used in the source (and here) is deliberately permissive: it captures many standard algorithms and provides a clean sufficient condition for the trajectory reduction, while still allowing the principal to steer behavior through the evolution of empirical utilities. By contrast, stronger learning notions (e.g., swap regret or internal regret) typically eliminate some path-dependent exploitation channels; this mirrors known results that richer deviation constraints drive play toward correlated equilibrium sets and can reduce the designer's ability to ``fool'' the learner . We highlight this comparison because it clarifies the role of the learning assumption: dynamic contracting gains in our setting arise not from sophisticated intertemporal commitment, but from the coarse way in which mean-based learning aggregates the past.

Uncertain interaction length has been modeled in several equivalent ways across fields. In economics, geometric discounting and random termination are closely related (a constant hazard corresponds to exponential survival and can be reinterpreted as discounting), and both serve as parsimonious reduced forms for impatience, turnover, or limited commitment . In online learning, random stopping times arise in analyses that require algorithmic guarantees uniform over time, or in settings where the evaluation is a random prefix of play . The source paper studies deterministic-horizon performance and also develops unknown-horizon results by mixing over horizon-dependent policies (or by designing policies that are robust to the realized horizon). Our model takes a different route: rather than treating the horizon as adversarially unknown, we assume a known exogenous survival curve and incorporate it directly into the objective via survival weighting.

This change is not merely cosmetic. A worst-case or minimax unknown-horizon criterion pushes toward policies that perform reasonably at all times, while survival-weighting permits deliberately back-loaded policies when the tail of the survival distribution is heavy, and penalizes them when hazard is high. Put differently, the survival curve provides a disciplined way to interpolate between ``must do well immediately'' and ``can invest for later,'' and it allows us to state comparative statics in terms of hazard. This also yields a clearer connection to practice in platform settings where churn is measurable and can be estimated from retention data.

Churn is central in empirical and theoretical work on two-sided platforms, labor marketplaces, and subscription businesses, where retention determines the profitability of front-loaded subsidies, bonuses, or promotions . Our model is intentionally stylized relative to those environments: we take termination as exogenous and independent of the contracting path, whereas in many applications incentives affect retention and selection. Nonetheless, explicitly modeling a stopping time is useful even as a first pass because it isolates a basic mechanism: any dynamic incentive scheme that resembles an ``investment'' followed by a ``harvest'' phase becomes less attractive as churn increases. In this sense, our hazard-sensitive bounds provide a theoretical analogue of a common operational heuristic in platforms: incentives that require long payback periods are hard to justify when user lifetime is short or volatile.

Finally, our restriction to linear (or p-scaled) contracts connects to a large body of work on simple mechanisms and robust contracting . Linear sharing rules are canonical in environments with risk neutrality and limited observability, and more broadly they provide a tractable design space when the outcome is multi-dimensional but can be summarized by a scalar reward ro. In the deterministic-horizon mean-based learning model, linearity is also what produces the breakpoint structure in the share parameter α, enabling potential-function arguments and explicit ``free-fall'' trajectories. We keep this structure because it lets us ask a crisp question: what does exogenous churn change when the contract class and the learning model are held fixed? Our answer is that the same geometric objects (breakpoints, potentials, action regions) govern feasibility, but the survival curve reweights when traversing those objects is valuable.

Relative to the contracting and learning literatures, our main conceptual move is to treat churn as a primitive that alters the objective but not the learning-induced feasibility constraints. This yields a survival-weighted analogue of the trajectory formulation from the source, and it supports two complementary messages: (i) an upper bound showing that the multiplicative advantage of dynamic contracts over the best static contract decreases with an effective hazard; and (ii) constructive phase-based policies that are tuned to the survival curve and recover much of the attainable advantage when the tail is sufficiently heavy. By placing these statements in a stopping-time model, we aim to bridge the gap between deterministic-horizon abstractions of dynamic manipulation and the operational reality that many principal–agent relationships end abruptly and for reasons unrelated to performance.


3. Model: principal–agent primitives; linear/p-scaled contracts; mean-based learning; exogenous stopping time (survival/hazard); objective and benchmarks (static vs dynamic).

We study a repeated hidden-action principal–agent interaction with stochastic outcomes and an exogenous random relationship length. The principal (designer) chooses a payment rule each period; the agent (worker, proxy, or algorithm) chooses an unobserved costly action; an outcome is realized and publicly observed; and then the relationship may end for reasons independent of play (``churn’’). Our goal is to understand how much value a principal can extract from contracting when the agent adapts via a permissive no-regret learning rule, and how that value changes with the survival (hazard) profile of the relationship.

There is a finite action set [n] = {1, …, n}. Action i has (known) cost ci, with c1 = 0 and ci weakly increasing in i. Outcomes lie in a finite set [m] = {1, …, m}. If the agent takes action i, the realized outcome o ∈ [m] is drawn from a known distribution Fi. The principal’s gross value from outcome o is a known number ro, with r1 = 0 and ro weakly increasing in o. We write the principal’s expected gross value under action i as
Ri := 𝔼o ∼ Fi[ro].
Throughout, both parties know (ci)i ∈ [n], (Fi)i ∈ [n], and (ro)o ∈ [m]. We abstract from risk aversion by taking both sides to be risk-neutral; this keeps attention on the dynamic incentive effects generated by learning and by churn rather than on insurance considerations.

Time is discrete, indexed by t = 1, 2, …. A (possibly history-dependent) contract in period t is a nonnegative payment vector $p_t\in\mathbb{R}^m_{\ge 0}$, where pt, o denotes the transfer from principal to agent if outcome o occurs in period t. We impose nonnegativity to reflect limited liability and the practical reality that many platforms cannot levy negative transfers.

The sequence of events within a period is:

  1. The principal posts a contract pt (publicly observed).
  2. The agent privately chooses an action at ∈ [n] and incurs the associated cost.
  3. An outcome ot is drawn from the distribution associated with at and publicly observed; the principal pays the contracted amount pt, ot.
  4. The relationship ends at random, independently of play (the stopping time S defined below); otherwise play proceeds to period t + 1.

The principal never observes the agent's action directly and can condition future contracts only on publicly observed history (past outcomes and past posted contracts). The agent observes the realized outcome and payment and, of course, internal information about its own action and cost.

Given contract p and action i, we define per-round expected utilities
uP(p, i) := Ri − 𝔼o ∼ Fi[po],   uA(p, i) := 𝔼o ∼ Fi[po] − ci.
We allow the principal to choose contracts adaptively. This is the economically relevant class in applications (bonuses, multipliers, and promotions often respond to realized performance), and it is also the appropriate benchmark when the agent’s learning rule is treated as part of the environment rather than as an equilibrium object.

A central restriction in our analysis is that the principal uses a simple one-parameter family of contracts. The leading case is linear sharing: the principal selects αt ∈ [0, 1] and sets
pt, o = αtro   for each o ∈ [m].
Under linear contracts, the per-round utilities simplify to
uP(α, i) = (1 − α)Ri,   uA(α, i) = αRi − ci,
so the contract affects only the division of expected surplus, not the mapping from actions to outcomes. This restriction is motivated by two considerations. First, linear or proportional rules are common in practice (revenue shares, commission rates, performance multipliers). Second, and crucial for our results, linear contracts induce an analytically tractable structure in α that governs which action is optimal for the agent.

We also allow a mild generalization, which we refer to as p-scaled contracts: fix a baseline nonnegative vector p̂ ∈ ℝ ≥ 0m and restrict the principal to contracts of the form p = αp̂ with α ∈ [0, 1]. Linear contracts correspond to the choice p̂ = r. In the main text we present statements for linear contracts for clarity; the p-scaled extension typically requires only notational changes (replacing Ri by the expected baseline payment under Fi where appropriate).

If the agent were myopic and fully optimizing in each period, then given a contract p it would choose an action in the best-response correspondence
BR(p) := arg maxi ∈ [n]uA(p, i).
However, the core friction in our setting is that the agent is learning rather than solving a full intertemporal optimization problem. We model the agent as running an arbitrary mean-based no-regret algorithm, in the sense formalized in the source (Definition 2.2 / B.4). Intuitively, such algorithms concentrate probability on actions whose payoff is near-optimal, and they rarely play actions that are empirically dominated by a large margin.

Concretely, we assume that the learner maintains internal ``scores'' $(\sigma_i^t)_{i\in[n]}$ at each time $t$ (these may be realized cumulative payoffs under full information, or unbiased estimates under bandit feedback). The mean-based condition states that there exists a parameter $\gamma(T)=o(1)$ such that, over any horizon $T$, whenever an action $i$ is behind some other action $i'$ by more than $\gamma(T)T$ in score, then action $i$ is played with probability at most $\gamma(T)$ at that time:
$$ \sigma_i^t < \sigma_{i'}^t - \gamma(T)\,T \ \ \Rightarrow\ \ \Pr[a_t=i] \le \gamma(T). $$
We emphasize two modeling choices embedded here. First, this condition is deliberately permissive: it captures many standard no-regret procedures while allowing substantial transient dependence on the path of realized payoffs. Second, the condition is scale-sensitive in the natural way: what matters is whether an action is worse by an amount that is large relative to the horizon. This is exactly the regime in which a principal might hope to ``steer'' behavior by shaping empirical averages.
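
To make the condition concrete, the following minimal Python sketch simulates a Hedge-style (multiplicative-weights) learner, one standard member of the mean-based family, facing a crude two-phase schedule of linear shares. The costs, success rates, horizon, and schedule are hypothetical illustrations, not quantities from the text.

    # Minimal sketch: a Hedge-style learner (one standard mean-based algorithm)
    # facing a posted sequence of linear shares alpha_t.  Instance values are
    # hypothetical.
    import math, random

    c = [0.0, 0.04, 0.40]     # action costs, lowest action costs 0
    R = [0.0, 0.4, 0.9]       # success probabilities / expected rewards R_i
    T = 2000
    eta = math.sqrt(8 * math.log(len(c)) / T)      # standard Hedge step size

    scores = [0.0] * len(c)   # cumulative expected utilities sigma_i^t
    for t in range(1, T + 1):
        alpha_t = 0.8 if t <= T // 2 else 0.0      # generous phase, then "free fall"
        weights = [math.exp(eta * s) for s in scores]
        probs = [w / sum(weights) for w in weights]  # far-behind actions get tiny mass
        a_t = random.choices(range(len(c)), probs)[0]
        for i in range(len(c)):
            scores[i] += alpha_t * R[i] - c[i]       # full-information score update

    print("play probabilities at T:", [round(p, 3) for p in probs])

In this configuration the learner ends the run still concentrated on an action that the earlier generous phase reinforced, even though incentives were cut long before: exactly the inertia that the trajectory formulation exploits.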

A distinctive feature of our model is that the interaction does not have a fixed deterministic horizon. Instead, the relationship ends at a random stopping time S ∈ {1, 2, …} that is exogenous and independent of play. In discrete time we parameterize S by a (possibly time-varying) hazard sequence (ht)t ≥ 1, where
ht := Pr [S = t ∣ S ≥ t],   F̄t := Pr [S ≥ t]
denotes the corresponding survival probabilities. The independence assumption is economically restrictive—in many labor and platform settings incentives can affect retention—but it is analytically useful because it isolates a basic tradeoff: dynamic incentives may require an ``investment’’ phase that only pays off if the relationship survives long enough.

We will frequently use the continuous-time representation of the same survival information. Let F̄(t) = Pr [S ≥ t] for t ≥ 0 denote the survival function and let
$$ h(t) := -\frac{d}{dt}\log \bar F(t) $$
be the (instantaneous) hazard rate when F̄ is differentiable. The constant-hazard case F̄(t) = e^{−ht} plays a special role both because it corresponds to geometric stopping in discrete time and because it yields closed-form expressions for several quantities of interest.

Given a contract policy {pt}t ≥ 1 and the induced sequence of actions and outcomes, the principal’s realized total profit up to termination is
$$ \text{Profit}_P(\{p_t\},S) := \sum_{t=1}^{S}\big(r_{o_t}-p_{t,o_t}\big). $$
The principal evaluates a policy by expected stopped profit,
UtilP({pt}) := 𝔼 [ProfitP({pt}, S)],
where the expectation is taken over outcome draws, the agent’s randomization (from learning), and the stopping time S. Because S is independent of play, UtilP admits a useful survival-weighted form:
$$ \text{Util}_P(\{p_t\}) = \sum_{t=1}^{\infty}\bar F_t \cdot \mathbb{E}\!\left[u_P(p_t,a_t)\right], $$
which makes transparent how churn reweights the importance of early versus late periods. We will exploit an analogous integral representation in continuous time.
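
As a quick sanity check on this survival-weighted form, the short Python sketch below compares a Monte Carlo estimate of expected stopped profit with the weighted sum for a deterministic profit stream under geometric stopping; the stream and the hazard value are hypothetical.

    # Check E[sum_{t<=S} x_t] = sum_t Fbar_t * x_t under geometric stopping.
    import random

    h = 0.05                                     # per-round stopping probability
    x = [-0.5 if t < 20 else 1.0 for t in range(1, 201)]   # invest-then-harvest stream

    def stopped_sum():
        total = 0.0
        for xt in x:
            total += xt
            if random.random() < h:              # relationship ends after this round
                break
        return total

    mc = sum(stopped_sum() for _ in range(200_000)) / 200_000
    exact = sum((1 - h) ** t * xt for t, xt in enumerate(x))   # Fbar_t = (1-h)^(t-1)
    print(f"Monte Carlo: {mc:.3f}   survival-weighted sum: {exact:.3f}")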

To interpret dynamic policies, we fix a static benchmark. A static contract posts the same payment rule each period (equivalently, the same α under linear contracts). Under myopic best responses, a static contract induces a single action (or a mixture over best responses) each period. Under a no-regret learner, the relevant notion is that, over long play, empirical frequencies concentrate on actions that are approximately optimal for that fixed contract.

Let R denote the principal’s optimal single-round profit against a best-responding agent within the admissible contract class (linear or p-scaled). We view R as the correct per-round baseline because it is achievable by a time-invariant policy and, under learning, is the natural limit point of what can be guaranteed without exploiting transient path dependence. Under stopping, the corresponding static value is simply scaled by expected relationship length:
Utilstatic := ∫0(t) Rdt = R 𝔼[S],
with the discrete-time analogue t ≥ 1tR.
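
For linear contracts the static benchmark is easy to compute directly: within each best-response region the profit (1 − α)Ri is decreasing in α, so it suffices to scan α = 0 and the breakpoints. The Python sketch below does this for a hypothetical instance, breaking ties at a breakpoint in the principal's favor.

    # Sketch: optimal static linear contract.  Candidate shares are alpha = 0 and
    # the breakpoints; ties in the agent's utility break toward the higher action.
    # Instance values are hypothetical.
    c = [0.0, 0.04, 0.40]
    R = [0.0, 0.4, 0.9]

    def best_response(alpha):
        utils = [alpha * R[i] - c[i] for i in range(len(c))]
        top = max(utils)
        return max(i for i, u in enumerate(utils) if u >= top - 1e-12)

    candidates = [0.0] + [(c[i] - c[i - 1]) / (R[i] - R[i - 1]) for i in range(1, len(c))]
    profit, alpha_star = max(((1 - a) * R[best_response(a)], a) for a in candidates)
    print(f"alpha* = {alpha_star:.3f}   per-round static profit R* = {profit:.3f}")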

Dynamic contracting can do better than this benchmark by leveraging the dependence of a mean-based learner on historical averages. At a high level, the principal may temporarily offer generous incentives to move the learner’s internal state into a region where it continues to select high-reward actions even after incentives are reduced. This logic is inherently intertemporal, and it is precisely here that survival matters: if the relationship ends too quickly, the ``harvest’’ phase may never arrive. Our main object of interest is therefore the optimal expected stopped profit over all admissible dynamic policies against worst-case mean-based learning, and the induced advantage relative to Utilstatic.

Before proceeding, we record two limitations that will be important when interpreting our conclusions. First, we treat churn as exogenous and independent of play, which rules out screening and retention effects that are central in many empirical environments. Second, our restriction to linear (or p-scaled) contracts abstracts from richer nonlinear incentives that a sophisticated principal might deploy if unconstrained. We adopt these simplifications because they deliver a clean geometric structure (breakpoints and action regions) and because they let us isolate a specific economic mechanism: dynamic advantage is feasible only to the extent that one can profitably trade off early incentives against later rents, and the survival curve determines how that tradeoff is priced.

In the next section we formalize this tradeoff by moving to a continuous-time representation in which the state variable is an average contract parameter and the objective is survival-weighted flow profit. We then show that, against mean-based learners, the discrete-time stopped problem reduces (up to lower-order terms) to an optimization over feasible continuous trajectories.


4. Survival-Weighted Continuous-Time Reduction: define survival-weighted trajectory utility; prove reduction from discrete time with stopping to continuous trajectories (analogue of Theorems 2.4 and 5.4).

Our analysis proceeds by replacing the original discrete-time game with a continuous-time control problem whose feasible set captures what a mean-based learner can be induced to do, and whose objective captures how churn discounts late profits. The benefit of this reduction is conceptual as much as technical: it separates feasibility constraints (which come from learning dynamics and are essentially unchanged by stopping) from the evaluation criterion (which is where the survival function enters). Once we have this separation, comparative statics in the hazard profile become transparent, and the later potential-function bounds can be stated in a clean integral form.

Fix any (possibly history-dependent) discrete-time policy {pt}t ≥ 1 and let Xt := uP(pt, at) denote the principal's per-round expected utility conditional on the contract in round t and the agent's (possibly randomized) action. Because the stopping time S is independent of play, we can rewrite expected stopped profit as a survival-weighted sum:

$$ \mathbb{E}\!\left[\sum_{t=1}^{S} X_t\right] \;=\; \sum_{t=1}^{\infty}\bar F_t\,\mathbb{E}\!\left[X_t\right]. $$

This identity is the discrete-time analogue of the familiar continuous-time formula 𝔼 [∫0Sg(t) dt] = ∫0∞F̄(t)g(t) dt, and it is the sole point at which exogenous stopping enters the objective. Economically, it says that churn does not change what can happen along the path of play; it changes only the price the principal pays for waiting.

To avoid inessential measurability issues, it is convenient to work in continuous time and treat discrete periods as unit-length intervals. Given any piecewise-constant control p(t) (or α(t) under linear contracts), define the survival-weighted continuous-time objective

$$ \text{Util}_{\bar F}(\pi) \;:=\; \int_0^\infty \bar F(t)\, u_P\big(p(t), a(t)\big)\, dt, $$

where (p(t), a(t)) is induced by a trajectory π defined below. When F̄(t) = e^{−ht} (constant hazard), this objective is an exponential discounting of flow profit. When F̄ has heavier tails, later profit retains more weight, making intertemporal ``investment–harvest'' strategies more valuable.

The key feature of mean-based learning in the source is that play is governed by payoff comparisons, which in turn depend on historical averages of contracts. This suggests the appropriate continuous-time state variable: the historical average contract up to time t,
$$ \bar p(t) := \frac{1}{t}\int_{0}^{t} p(s)\,ds \qquad (\text{or }\ \bar\alpha(t):=\tfrac{1}{t}\int_0^t \alpha(s)\,ds\ \text{under linear contracts}). $$
Intuitively, if the principal holds p(t) fixed for a while, then p̄(t) drifts slowly toward that value. A mean-based learner does not optimize with respect to the instantaneous contract; rather, it concentrates on actions that have done well on average, which is why p̄(t) plays the central role.

We formalize this using the trajectory representation from the source. A (continuous-time) trajectory is a finite or countable sequence
π = {(pk, τk, ak)}k = 1K,
interpreted as: for τk units of time the principal posts contract pk and the agent plays action ak. Let Tk := ∑ℓ ≤ k τℓ denote the cumulative time up to segment k, and let
$$ \bar p^{\,k}:=\frac{1}{T^k}\sum_{\ell\le k}\tau_\ell p_\ell $$
denote the historical average contract at the end of segment k.

The reduction in the source replaces the discrete-time mean-based condition with a set of deterministic constraints on which action can be sustained in each segment of a trajectory. The same constraints apply here because stopping is independent of play: conditional on survival, the agent's score updates and payoff comparisons evolve exactly as in the fixed-horizon model. Concretely, a trajectory π is valid if, for every k ≥ 2,

$$ a_k \;\in\; \text{BR}\big(\bar p^{\,k-1}\big) \,\cap\, \text{BR}\big(\bar p^{\,k}\big). $$

This validity condition has a simple economic meaning. Within segment k, the agent is supposed to keep playing ak while the historical average moves from p̄k − 1 to p̄k. For a permissive mean-based learner, the principal can sustain this only if ak is (approximately) optimal both at the start and at the end of the segment; otherwise some alternative action accumulates a decisive score lead and the learner would switch with high probability.

Crucially, validity does not involve F̄ or h. Churn therefore acts like an objective-side discounting of a fixed feasible set: it changes which valid trajectories are desirable, not which trajectories are feasible.

Given a survival function F̄, define the value of the survival-weighted control problem as

$$ \text{OPT}_{\bar F} \;:=\; \sup_{\pi\ \text{valid}}\ \text{Util}_{\bar F}(\pi), $$

where UtilF̄(π) is computed by interpreting π as piecewise-constant functions p(t) and a(t) and applying the survival-weighted objective above. The benchmark corresponding to the optimal static contract is
$$ \text{Util}_{\text{static}} \;=\; R^\star \int_0^\infty \bar F(t)\, dt \;=\; R^\star\, \mathbb{E}[S]. $$
Our objective in this section is to justify OPTF̄ as the correct characterization of the principal's maximal expected stopped profit against worst-case mean-based learning.

The reduction has two directions, mirroring the fixed-horizon results.

First, given any discrete-time principal policy (even one that adapts to realized outcomes), we can extract a valid continuous-time trajectory whose survival-weighted value upper bounds the policy's performance (up to lower-order terms). The key step is to note that the mean-based condition implies that the agent's action can change only when some action's cumulative advantage becomes large, which occurs only when averages cross boundaries of best-response regions. By grouping time into blocks on which the principal's contract is (approximately) constant and the agent's realized play is (approximately) constant, we obtain segments (pk, τk, ak). The historical averages at block boundaries become the p̄k's. The mean-based property then enforces validity in the limit: if ak were not a best response to p̄k − 1 or p̄k, then some competing action would have a linear-in-time score lead, contradicting that ak is played for τk time with non-negligible frequency.

The role of stopping is entirely captured by how we evaluate the resulting blocks. Using the survival-weighted identity and passing to the block representation, the principal's expected stopped profit becomes a Riemann-sum approximation to the continuous-time objective, with weights F̄(t) (or F̄t) multiplying flow utilities. Because F̄ is exogenous, this approximation is purely analytic: we do not need new incentive arguments beyond those in the source.

Second, given any valid trajectory π, we can construct a discrete-time policy that approximately implements it against any mean-based learner and achieves expected stopped profit close to Util(π). The construction follows the ``oblivious simulation’’ idea from the source. For each segment (pk, τk, ak), we play the contract pk for a block of τk/Δ discrete periods (for a small discretization step Δ), regardless of outcomes. Validity ensures that throughout the block the intended action is not severely dominated in cumulative score by any alternative, so a mean-based learner continues to place nearly all probability on ak. Importantly, because stopping is independent, posting contracts obliviously is without loss for our worst-case guarantee: conditioning on outcomes cannot improve the principal’s ability to force action changes when the agent is only constrained by mean-based regret.
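
A minimal Python sketch of this oblivious implementation step, under the assumption that the trajectory is given as segments of (share, duration): the segments are flattened into a fixed discrete schedule that is posted regardless of realized outcomes. The segment values and step size are hypothetical.

    # Sketch: turn trajectory segments (alpha_k, tau_k) into an oblivious
    # discrete-time schedule with one period per Delta units of time.
    segments = [(0.8, 3.0), (0.0, 7.0)]   # hypothetical (share, duration) pairs
    Delta = 0.01

    schedule = []
    for alpha_k, tau_k in segments:
        schedule.extend([alpha_k] * round(tau_k / Delta))

    print(len(schedule), "rounds; first share:", schedule[0], "last share:", schedule[-1])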

Combining these two directions yields the survival-weighted analogue of the source's trajectory characterization.

Proposition (informal analogue of Theorems 2.4 and 5.4). Fix a survival function F̄ with 𝔼[S] < ∞. Under linear (or p-scaled) contracts, the principal's optimal expected stopped profit against worst-case mean-based learners equals OPTF̄, up to terms that are lower order in 𝔼[S].

This proposition tells us that, once we restrict to linear (or more generally p-scaled) contracts and mean-based learning, churn affects the principal only through the weights F̄(t) in the objective. This is exactly the economic tradeoff we want to isolate. Dynamic contracting typically requires paying ``too much'' early in order to reshape the agent's empirical comparisons, and then recouping later by cutting incentives while the agent continues to play a high-reward action. A higher hazard makes the recoup phase less likely, so the survival weighting downweights precisely those portions of the trajectory in which the principal hopes to earn rents.

We emphasize a limitation: this clean separation between feasibility and evaluation relies on stopping being exogenous and independent of play. If contracts or outcomes affected retention, then F̄(t) would become an endogenous object and the principal would face an additional intertemporal incentive problem (trading off current profit against future survival). Our framework deliberately abstracts from that channel in order to obtain sharp characterizations of the learning-based channel.

With the survival-weighted control problem in hand, we can now specialize to linear contracts and exploit the breakpoint geometry. In the next section we introduce a potential function that upper bounds how much ``intertemporal slack’’ the principal can extract, and we show how the hazard profile governs the maximal multiplicative advantage over the static benchmark.


5. Potential Function and Upper Bounds: define breakpoint potential height Ψ; derive hazard-sensitive bounds on achievable multiplicative advantage; provide closed-form for constant hazard.

We now specialize the survival-weighted control problem to linear contracts and derive an upper bound on how much a dynamic policy can outperform the best static contract. The key idea, inherited from the fixed-horizon analysis in the source, is that dynamic advantage is not ``free'': it is paid for by moving the historical average contract across finitely many best-response boundaries. A potential function quantifies this finite ``intertemporal slack,'' and stochastic stopping enters only through how much of that slack can be converted into survival-weighted profit.

Under a linear contract po = αro with α ∈ [0, 1], the agent’s expected utility from action i is
uA(α, i) = αRi − ci,
so the best-response correspondence BR(α) is piecewise-constant in α. For consecutive actions (i − 1, i), define the breakpoint
$$ \alpha_{i-1,i}:=\frac{c_i-c_{i-1}}{R_i-R_{i-1}}, $$
interpreting αi − 1, i = +∞ if Ri = Ri − 1. We adopt the standard genericity convention that breakpoints lie in [0, 1] and are strictly increasing in i after removing dominated actions. Then as α rises, the agent moves monotonically to higher-cost, higher-reward actions.
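
The breakpoints are immediate to compute from (c, R). A small Python sketch follows, assuming dominated actions have already been removed so that both sequences are strictly increasing; the numbers are a hypothetical instance reused throughout these sketches.

    # Sketch: breakpoints alpha_{i-1,i} = (c_i - c_{i-1}) / (R_i - R_{i-1}).
    # Actions are 0-indexed here; the instance is hypothetical.
    c = [0.0, 0.04, 0.40]
    R = [0.0, 0.4, 0.9]

    breakpoints = [(c[i] - c[i - 1]) / (R[i] - R[i - 1]) for i in range(1, len(c))]
    print(breakpoints)   # approximately [0.1, 0.72]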

Economically, breakpoints represent the incentive intensity α required for the agent to prefer upgrading from i − 1 to i. Dynamic contracting exploits the fact that a mean-based learner compares payoffs: after a long enough period of high α, the historical average ᾱ(t) can remain above key breakpoints even if the principal subsequently cuts incentives, causing the agent to keep choosing a high-reward action for some time. The question is how large a survival-weighted benefit the principal can extract from this mechanism.

We encode the breakpoint structure into a scalar potential function ψ(α) that is piecewise-linear in α and increases only when α crosses a breakpoint. One convenient normalization (equivalent to the source up to affine transformations) is

$$ \psi(\alpha) \;:=\; \sum_{i=2}^{n} \big(R_i - R_{i-1}\big)\,\big(\alpha - \alpha_{i-1,i}\big)_+ . $$

This ψ has a direct interpretation: each term (Ri − Ri − 1)(α − αi − 1, i)+ measures how far the incentive intensity α sits above the threshold needed to make action i competitive against i − 1, scaled by the incremental principal reward of upgrading from i − 1 to i. The potential height

$$ \Psi \;:=\; \sup_{\alpha\in[0,1]}\psi(\alpha) \;=\; \psi(1) \;=\; \sum_{i=2}^{n}\big(R_i - R_{i-1}\big)\,\big(1 - \alpha_{i-1,i}\big)_+ $$

is an instance-dependent constant determined entirely by (c, R). It is finite (and typically O(maxiRi) under bounded rewards), and it is the maximal amount of potential the principal can ever ``store'' by pushing incentives as high as possible.

Two qualitative features of Ψ are worth keeping in mind. First, Ψ is larger when adjacent actions are separated by small breakpoints (so that modest incentives can induce upgrades) and when reward increments Ri − Ri − 1 are large. Second, Ψ is independent of the stopping distribution: it is a property of the static environment, while churn determines how much of this stored slack can be monetized before termination.
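
The potential and its height are equally direct to evaluate; the Python sketch below uses the normalization displayed above and the same hypothetical instance as the earlier sketches.

    # Sketch: potential psi(alpha) and its height Psi for the hypothetical instance.
    c = [0.0, 0.04, 0.40]
    R = [0.0, 0.4, 0.9]

    def psi(alpha):
        total = 0.0
        for i in range(1, len(c)):
            bp = (c[i] - c[i - 1]) / (R[i] - R[i - 1])
            total += (R[i] - R[i - 1]) * max(alpha - bp, 0.0)
        return total

    Psi = psi(1.0)   # psi is nondecreasing on [0, 1], so its height is psi(1)
    print(f"psi(0.6) = {psi(0.6):.3f}   Psi = {Psi:.3f}")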

Consider any valid trajectory π under linear contracts, and let ᾱ(t) denote the historical average incentive parameter along the trajectory. The central technical statement is that the principal’s flow profit above the static benchmark is controlled by the rate at which the trajectory spends potential.

Formally, let R⋆ denote the principal's optimal static per-round profit (under the best linear contract, anticipating BR(α)), and write the principal's instantaneous flow profit as uP(α(t), a(t)) = (1 − α(t))Ra(t). Then one can adapt the source's breakpoint-based argument to show an inequality of the following form: for almost every t,
$$ u_P\big(\alpha(t),a(t)\big) \;\le\; R^\star \;+\; \frac{d}{dt}\Big(t\,\psi\big(\bar\alpha(t)\big)\Big), $$
where the derivative is understood in the sense of absolutely continuous trajectories (equivalently, segment-by-segment for piecewise-constant controls). Intuitively, the principal can earn unusually high profit at time t (typically by offering a low α(t) while the agent continues to play a high a(t)) only because earlier, more generous contracts pushed the historical average ᾱ(t) above the relevant breakpoints; the quantity tψ(ᾱ(t)) tracks that stored slack, so excess profit is ``paid for'' out of the potential budget.

Multiplying by the survival weight F̄(t) and integrating yields
$$ \text{Util}_{\bar F}(\pi) \;\le\; R^\star\,\mathbb{E}[S] \;+\; \int_0^\infty \bar F(t)\, d\Big(t\,\psi\big(\bar\alpha(t)\big)\Big). $$
The first term is precisely the static benchmark R⋆𝔼[S]. The second term is the dynamic ``bonus'' term, and it is here that the hazard profile matters.

To make this bonus term more interpretable, we integrate by parts. Using dF̄(t) = −h(t)F̄(t) dt and assuming limt → ∞F̄(t) tψ(ᾱ(t)) = 0 (which holds under mild boundedness conditions, since ψ(ᾱ(t)) ≤ Ψ and tF̄(t) → 0 whenever 𝔼[S] < ∞), we obtain
$$ \int_0^\infty \bar F(t)\, d\Big(t\,\psi\big(\bar\alpha(t)\big)\Big) \;=\; \int_0^\infty h(t)\,\bar F(t)\,\psi\big(\bar\alpha(t)\big)\, t\, dt. $$
The functional ∫0∞h(t)F̄(t) t dt is an ``effective recoup factor'': it measures how much survival-weighted time mass lies at larger t, where a free-fall phase (or any delayed harvesting phase) can operate. Heavy-tailed survival curves make this factor large; front-loaded termination makes it small.

Combining the two displays yields an explicit upper bound:
$$ \text{Util}_{\bar F}(\pi) \;\le\; R^\star\,\mathbb{E}[S] \;+\; \Psi \int_0^\infty h(t)\,\bar F(t)\, t\, dt. $$
Since the right-hand side depends on π only through the universal bound ψ(ᾱ(t)) ≤ Ψ, it applies uniformly to all dynamic linear-contract policies against mean-based learners (via the reduction in the previous section).

When S is exponentially distributed with constant hazard h (so F̄(t) = e^{−ht}), the integral in the bound evaluates in closed form:
$$ \int_0^\infty h e^{-ht}\, t\,dt = \frac{1}{h}. $$
Therefore any valid trajectory π satisfies
$$ \frac{\text{Util}_{\bar F}(\pi)}{R^\star\,\mathbb{E}[S]} \;\le\; 1 + \frac{\Psi}{R^\star}, \qquad\text{equivalently}\qquad \text{Util}_{\bar F}(\pi) \;\le\; R^\star\,\mathbb{E}[S] + \frac{\Psi}{h}. $$
This constant-hazard expression is particularly useful for two reasons. First, it makes clear that (in this normalization) the maximum improvement over the static benchmark scales at most on the order of Ψ/h, reflecting that any attempt to harvest rents late is exponentially discounted. Second, it isolates all instance dependence in Ψ and R⋆: once those are computed from (c, R), the survival effect under exponential churn is immediate.
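
To give a feel for the numbers, the short Python sketch below plugs the R⋆ and Ψ values from the earlier hypothetical instance into the constant-hazard bound R⋆𝔼[S] + Ψ/h for a few hazard levels; these are illustrative figures, not results from the text.

    # Sketch: evaluate the constant-hazard bound R* E[S] + Psi / h for a few h.
    R_star, Psi = 0.36, 0.5          # values computed in the earlier sketches
    for h in (0.5, 0.1, 0.01):
        E_S = 1.0 / h
        print(f"h = {h:<5}  static = {R_star * E_S:7.2f}   upper bound = {R_star * E_S + Psi / h:7.2f}")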

The potential bound should be read as a sharp statement about the learning channel, rather than as a complete theory of retention. Because stopping is exogenous here, the principal cannot influence F̄ or h(t) via wages, working conditions, or product quality. In practice, many environments feature endogenous churn: low incentives may directly increase exit, and high incentives may extend the relationship. Incorporating such feedback would couple feasibility and evaluation, and the simple survival-weighted integral calculus above would no longer suffice.

A second limitation is that our bound leverages the path dependence of mean-based learning. If the agent satisfies stronger deviation constraints (e.g., swap regret), the feasible set of trajectories shrinks dramatically, and the dynamic advantage can collapse even without churn. Thus, empirically, the magnitude of Ψ is informative only to the extent that the deployed learning rule is permissive enough to be approximated by the mean-based model.

The structure of the bound also suggests how to design near-optimal policies. The bound is tight only if the principal can convert a large portion of the available potential into early (high-survival) profit mass. This motivates hazard-matched phase policies that randomize the timing of incentive cuts in a way that aligns breakpoint crossings with the survival curve. In the next section we formalize this idea and show that suitably randomized two-phase (and phase-mixture) policies achieve survival-weighted performance that matches the hazard-sensitive upper bounds up to constant and, in some regimes, logarithmic factors.


6. Achievability via Hazard-Matched Phase Policies: construct randomized two-phase/phase-mixture policies; show performance guarantees and identify when dynamic strictly helps.

We now complement the potential-based upper bound with constructive policies that are tailored to the survival profile. The high-level message is that the bound of the previous section is not merely a limitation: the same mechanism that makes excess profit possible (temporarily storing slack in the historical average contract) can be converted into a simple, robust two-phase policy whose switching time is chosen to align with the distribution of the stopping time.

We focus on a two-phase family parameterized by an ``investment'' intensity α ∈ (0, 1] and a (possibly randomized) switching time τ ≥ 0. In continuous time, the policy posts the linear contract
$$ \alpha(t) \;=\; \begin{cases} \alpha, & 0 \le t \le \tau,\\ 0, & t > \tau, \end{cases} $$
i.e., pay αro up to time τ and then switch to α = 0 forever. In discrete time, the analogue is: sample a random round τ ∈ {1, 2, …} at t = 1, play α for t ≤ τ, and then pay 0 for t > τ. Because τ is sampled ex ante and independent of realized outcomes, this policy is oblivious and hence compatible with the worst-case learning benchmark.

The economic logic is standard: the first phase deliberately sacrifices flow profit in order to push the learner's historical average ᾱ(t) above key breakpoints; the second phase harvests by cutting incentives while the learner continues to play a higher action due to path dependence. Under this two-phase policy, the historical average takes the simple form
$$ \bar\alpha(t) \;=\; \begin{cases} \alpha, & t \le \tau,\\ \alpha\tau/t, & t > \tau, \end{cases} $$
so after the switch it decays deterministically as 1/t. Consequently, the agent's action can only change at the times when ᾱ(t) crosses a breakpoint αi − 1, i, i.e., at
$$ t_{i-1,i}(\alpha,\tau) \;=\; \tau\cdot\frac{\alpha}{\alpha_{i-1,i}}. $$
Thus, conditional on (α, τ), the induced action path is piecewise-constant and (up to tie-breaking at breakpoints) essentially deterministic under the trajectory validity constraints.
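
The post-switch walk down the breakpoint ladder is easy to tabulate; the Python sketch below lists the deterministic crossing times for a hypothetical (α, τ) and the instance used in the earlier sketches.

    # Sketch: post-switch crossing times of a free-fall policy FF(alpha, tau).
    c = [0.0, 0.04, 0.40]
    R = [0.0, 0.4, 0.9]
    alpha, tau = 0.8, 10.0

    bps = [(c[i] - c[i - 1]) / (R[i] - R[i - 1]) for i in range(1, len(c))]  # increasing
    i0 = sum(alpha >= bp for bp in bps)        # 0-based best response to alpha
    for i in range(i0, 0, -1):
        t_cross = tau * alpha / bps[i - 1]     # time at which the action drops i -> i-1
        print(f"action {i + 1} -> {i} at t = {t_cross:.2f}")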

If the horizon were deterministic, the source shows that a carefully chosen (often deterministic) switch time can be near-optimal within broad classes of feasible trajectories. With stochastic stopping, a fixed τ becomes fragile: if τ is too large, the relationship often ends before harvesting begins; if τ is too small, the policy fails to move ᾱ(t) into a profitable region. Randomizing τ is a direct way to spread the policy’s ``mass’’ across likely termination times while retaining the same simple structure .

Formally, let FF(α, τ) denote the free-fall trajectory induced by this two-phase policy. For any mixing distribution μ over τ (and, if desired, a finite mixture over α values), the survival-weighted objective is linear:
$$ \text{Util}_{\bar F}(\mu) \;=\; \mathbb{E}_{\tau\sim\mu}\Big[\text{Util}_{\bar F}\big(\mathrm{FF}(\alpha,\tau)\big)\Big] \;=\; \int_0^\infty \text{Util}_{\bar F}\big(\mathrm{FF}(\alpha,\tau)\big)\, d\mu(\tau). $$
This observation is important: it means we can optimize over distributions of switch times using convex-analytic tools, and it also implies that sampling τ at time 0 is without loss relative to any more elaborate randomization scheme (since the agent only responds to the realized contract path).

A particularly interpretable hazard-matched rule is to draw τ from a distribution whose density is proportional to the survival-weighted hazard mass h(t)F̄(t) (the termination density in continuous time). Intuitively, this places greater probability on switching at times when termination is likely to occur, ensuring that a nontrivial fraction of policy realizations enters the harvesting phase before the relationship ends.

One convenient parametrization is to choose a nonnegative weighting function w(t) with ∫0∞w(t) dt = 1 and set τ ∼ w. The expected value of the phase policy can then be written as
$$ \mathbb{E}_{\tau\sim w}\Big[\text{Util}_{\bar F}\big(\mathrm{FF}(\alpha,\tau)\big)\Big] \;=\; \int_0^\infty \bar F(t)\,\mathbb{E}_{\tau\sim w}\Big[u_P\big(\alpha(t),a(t)\big)\Big]\,dt, $$
where α(t) is given by the two-phase form above. Because α(t) is a threshold function of τ, the expectation over τ induces a smooth (and designable) time profile for Pr [α(t) = α] = Pr [τ ≥ t]. In other words, randomizing the switch time is equivalent to choosing a curve for the incentive intensity itself.

The upper bound of the previous section suggests that the relevant scale for the total dynamic bonus is governed by the survival-weighted ``recoup factor'' ∫0∞h(t)F̄(t) t dt multiplied by an instance-dependent potential height. Our phase mixtures can recover this scale whenever the instance admits a free-fall improvement in the known-horizon model and the survival curve places sufficient weight on horizons where that improvement materializes.

To state this cleanly, let Δ(T) denote the best dynamic advantage achievable by a free-fall policy up to time T:
Δ(T) := supα, τ ≤ T{∫0TuP(α(t), a(t)) dt − R⋆T},
where the induced (α(t), a(t)) are consistent with trajectory validity and α(t) has the two-phase form above. Then, for any stopping time S independent of play, we obtain the lower bound
$$ \sup_{\alpha,\ \tau}\ \text{Util}_{\bar F}\big(\mathrm{FF}(\alpha,\tau)\big) \;\ge\; R^\star\,\mathbb{E}[S] \;+\; \sup_{T\ge 0}\ \bar F(T)\,\Delta(T). $$
The inequality follows by considering the deterministic switch time that is optimal for a given T, and observing that the incremental profit accumulated up to T is realized whenever S ≥ T (while termination before T can only reduce harvesting, not negate already-earned profit). Thus, stochastic stopping converts a fixed-horizon advantage Δ(T) into a survival-discounted advantage F̄(T)Δ(T).

This lower bound highlights when dynamics help: if there exists some horizon T for which Δ(T) > 0 in the underlying instance (equivalently, the known-horizon dynamic optimum strictly exceeds R⋆T within the free-fall family), and the relationship is sufficiently long-lived in the sense that F̄(T) is bounded away from 0, then the stochastic-horizon problem also admits a strict improvement over the static benchmark. Conversely, if the survival curve is so front-loaded that F̄(T) is tiny for all horizons T at which Δ(T) becomes positive, then dynamic contracting cannot reliably reach the harvesting regime, and the best achievable value collapses back toward R⋆𝔼[S].

While this lower bound already gives a clean sufficient condition for strict improvement, it is generally conservative because it commits to a single horizon T. A phase-mixture policy replaces supTF̄(T)Δ(T) by an average over T values, which can be strictly larger when Δ(⋅) is spread over a range of horizons (as is typical when multiple breakpoints are relevant).

Concretely, one can pick a distribution μ over switching times τ and then analyze the realized action path after switching via the breakpoint-crossing times above. The expected dynamic bonus becomes an integral of the form
$$ \int_0^\infty \mu\big([t,\infty)\big)\,\Phi_\alpha(t)\,dt $$
for an explicitly defined (instance-dependent) kernel Φα(t) that captures the marginal value of maintaining the high-incentive phase until time t. Maximizing a linear functional of the tail μ([t, ∞)) is a one-dimensional convex program, and (by standard extreme-point arguments) admits near-optimal solutions supported on few points. This is the sense in which phase mixtures remain practically simple: despite optimizing over distributions, the optimal (or approximately optimal) policy typically randomizes among a small number of switch times.
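
A small Python sketch of this reduction, under the (hypothetical) assumption that the kernel Φα has already been tabulated on a grid: because the bonus is linear in μ, it equals Eτ ∼ μ[G(τ)] with G(τ) = ∫0τΦα(t) dt, so an optimal unconstrained mixture may concentrate on a single maximizer of G.

    # Sketch: maximizing the linear functional \int mu([t, inf)) * Phi(t) dt.
    # Rewriting it as E_tau[ G(tau) ] with G(tau) = \int_0^tau Phi(t) dt shows that
    # a point mass at an argmax of G is optimal.  Phi is a hypothetical kernel.
    import math

    dt = 0.1
    grid = [k * dt for k in range(1, 400)]
    Phi = [math.exp(-0.05 * t) * (1.0 if 5 < t < 25 else -0.6) for t in grid]

    G, running = [], 0.0
    for phi in Phi:
        running += phi * dt
        G.append(running)

    k_star = max(range(len(G)), key=G.__getitem__)
    print(f"best single switch time tau* ~ {grid[k_star]:.1f}   G(tau*) = {G[k_star]:.3f}")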

Hazard-matched phase policies should be viewed as deliberately simple, low-information policies. They never attempt to infer the agent's action directly, and they do not require observing ᾱ(t) beyond what the principal herself has posted. All the sophistication is in choosing when to stop paying for incentives, given that (i) after stopping, the induced decay ᾱ(t) = ατ/t deterministically walks the learner back down the breakpoint ladder, and (ii) survival weights F̄(t) determine which segments of that walk are likely to be realized.

At the same time, two caveats are worth emphasizing. First, the guarantee is inherently instance-dependent: if the static optimum already induces the top action (or if breakpoints are such that free-fall cannot create a profitable wedge between the action and the contemporaneous contract), then Δ(T) = 0 for all T and there is nothing to gain. Second, the construction relies on the permissiveness of mean-based learning; under stronger deviation constraints the free-fall path may cease to be feasible, and the entire phase mechanism can disappear.

The remaining task is computational: to deploy hazard-matched phase mixtures, we need to evaluate UtilF̄(FF(α, τ)) efficiently and understand how it depends on (α, τ) and F̄. In the next section we specialize to constant hazard (geometric/exponential survival) and to success/failure environments, where the breakpoint crossing times yield explicit finite-sum formulas and enable direct optimization over (α, τ) (and small mixtures), while more general hazards reduce to numerical integration of survival-weighted segment contributions.


7. Specialization and Closed Forms: constant-hazard (geometric/exponential) case; success/failure (linear) case; explicit formulas for expected payoff of free-fall under hazard; when numerical optimization is needed.

We now specialize the survival profile to the constant-hazard case, both because it is economically canonical (memoryless churn) and because it turns the survival-weighted objective into an analytically tractable transform of the underlying free-fall path. We then further specialize to success/failure environments, where linear contracts coincide with simple ``bonus-on-success’’ schemes and all quantities admit a particularly transparent interpretation.

In continuous time, constant hazard h > 0 corresponds to exponential survival
$$ \bar F(t)=e^{-ht},\qquad \mathbb{E}[S]=\int_0^\infty e^{-ht}\,dt=\frac{1}{h}. $$
In discrete time, the analogue is geometric stopping with parameter h ∈ (0, 1],
$$ \bar F_t=\Pr[S\ge t]=(1-h)^{t-1},\qquad \mathbb{E}[S]=\sum_{t\ge 1}(1-h)^{t-1}=\frac{1}{h}. $$
The memoryless property is not merely a modeling convenience: it captures the operational reality of many principal–agent settings in which the relationship ends due to exogenous turnover, product cycles, or organizational reallocation, and it ensures that the marginal value of delaying a switch can be summarized by a single scalar h.
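As a quick numerical sanity check of these formulas (purely illustrative; not part of the formal argument), one can simulate geometric stopping and verify both the mean and the memoryless property:

```python
import numpy as np

rng = np.random.default_rng(0)
h = 0.05
S = rng.geometric(h, size=200_000)   # Pr[S = t] = (1 - h)^(t - 1) * h for t = 1, 2, ...
print(S.mean(), 1 / h)               # both close to 20
print((S > 30).mean() / (S > 20).mean(), (1 - h) ** 10)  # memorylessness: both approx. 0.60
```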

In the success/failure specialization, there are two outcomes, with rewards normalized as $r_1 = 0$ (failure) and $r_2 = 1$ (success). Each action i induces a success probability $R_i \in [0, 1]$, so $R_i = \mathbb{E}_{o \sim F_i}[r_o]$ is literally the success rate. Under a linear contract $p_o = \alpha r_o$, the agent receives payment α if and only if success occurs, so
$$ u_P(\alpha, i) = (1-\alpha)R_i, \qquad u_A(\alpha, i) = \alpha R_i - c_i. $$
In this environment, the breakpoints $\alpha_{i-1,i} = (c_i - c_{i-1})/(R_i - R_{i-1})$ have an especially clean meaning: they are the bonus rates at which the agent is indifferent between neighboring effort levels (when $R_i > R_{i-1}$). As in the source, we focus on instances satisfying the natural monotonicity structure (increasing costs and rewards), which implies that the best response to α moves ``up the action ladder'' as α increases.
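To make the breakpoint structure concrete, the following sketch computes the breakpoints and the myopic best response to a posted bonus rate in a toy instance (the function names and numbers are illustrative; ties are broken toward the higher action, consistent with the monotonicity convention above):

```python
import numpy as np

def breakpoints(c, R):
    """alpha_{i-1,i} = (c_i - c_{i-1}) / (R_i - R_{i-1}) for consecutive actions."""
    c, R = np.asarray(c, float), np.asarray(R, float)
    return (c[1:] - c[:-1]) / (R[1:] - R[:-1])

def best_response(alpha, c, R):
    """0-based index of the agent's utility-maximizing action at bonus rate alpha,
    breaking ties toward the higher action."""
    u = alpha * np.asarray(R, float) - np.asarray(c, float)
    return int(np.max(np.flatnonzero(u == u.max())))

c = [0.0, 0.10, 0.25]      # increasing costs
R = [0.2, 0.50, 0.70]      # increasing success rates
print(breakpoints(c, R))           # [0.333..., 0.75]
print(best_response(0.8, c, R))    # 2: the top action once alpha exceeds 0.75
```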

Fix a two-phase free-fall policy FF(α, τ) as defined earlier: the principal posts α throughout the investment phase [0, τ] and 0 thereafter. The induced historical average after the switch is ᾱ(t) = ατ/t for t > τ, by averaging the posted contracts. In a success/failure environment with linear contracts, and away from knife-edge ties, the agent's best response depends on the scalar ᾱ(t) through the breakpoint order: for each t > τ, the action is the unique i such that
$$ \alpha_{i-1,i} \;\le\; \bar\alpha(t) \;<\; \alpha_{i,i+1}. $$
Hence the only times at which the action can change are exactly the breakpoint crossing times,
$$ t_{i-1,i}(\alpha,\tau)=\tau\cdot \frac{\alpha}{\alpha_{i-1,i}}. $$
Because t ↦ ᾱ(t) decreases smoothly for t > τ, the post-switch dynamics follow a deterministic ``walk down’’ the ladder of actions: the agent begins at the best response to α (since ᾱ(τ) = α), then drops to lower actions as ᾱ(t) falls below successive breakpoints. This determinism is the key to closed-form evaluation under exponential survival: we can write the total value as a finite sum of exponential integrals over these breakpoint-delineated intervals.

Let $i_0 \in \mathrm{BR}(\alpha)$ denote the (tie-broken) best response when the incentive is held fixed at α. Under the free-fall policy, the action is $i_0$ throughout the investment phase t ∈ [0, τ]. After switching to α(t) = 0, the principal's flow utility equals $R_{a(t)}$ (since payment is 0), while the action a(t) is determined by ᾱ(t).

To describe the post-switch intervals, define (for each i ≥ 2) the time at which ᾱ(t) hits the breakpoint into action i − 1:
$$ T_i(\alpha,\tau):=\tau\cdot \frac{\alpha}{\alpha_{i-1,i}}. $$
These times satisfy $T_i(\alpha, \tau) \ge \tau$ whenever $\alpha \ge \alpha_{i-1,i}$, and they are increasing in τ and in α. If $i_0$ is the action played at α, then only the breakpoints below $i_0$ are relevant; accordingly, the post-switch path consists of a finite sequence of actions $i_0, i_0 - 1, \dots, 1$ over the intervals
$$ [\tau,\, T_{i_0}(\alpha,\tau)),\;\; [T_{i_0}(\alpha,\tau),\, T_{i_0-1}(\alpha,\tau)),\;\; \dots,\;\; [T_2(\alpha,\tau),\, \infty), $$
where by convention $T_1(\alpha, \tau) = \infty$ (action 1 persists forever once reached). On each such interval the flow payoff is constant, so under exponential survival we can integrate explicitly:
$$ \mathrm{Util}\big(\mathrm{FF}(\alpha,\tau)\big) \;=\; (1-\alpha)\,R_{i_0}\,\frac{1-e^{-h\tau}}{h} \;+\; R_{i_0}\,\frac{e^{-h\tau}-e^{-h T_{i_0}(\alpha,\tau)}}{h} \;+\; \sum_{j=1}^{i_0-1} R_j\,\frac{e^{-h T_{j+1}(\alpha,\tau)}-e^{-h T_j(\alpha,\tau)}}{h}, $$
with the convention $e^{-h T_1(\alpha,\tau)} = 0$.
This expression is already a closed form: it is a finite sum of exponential terms whose exponents are affine in τ (because $T_i(\alpha, \tau)$ is linear in τ). In particular, for any fixed α and any fixed region in which $i_0$ is constant (i.e., α lies strictly between two breakpoints), the dependence on τ is smooth and unimodal in many instances, making one-dimensional optimization over τ numerically straightforward.
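A minimal implementation of this closed form is below (0-based action indices; util_freefall and the toy numbers are illustrative, and the monotone breakpoint structure assumed in the text is taken for granted):

```python
import numpy as np

def util_freefall(alpha, tau, c, R, h):
    """Survival-weighted principal value of FF(alpha, tau) under constant hazard h
    in the success/failure case (increasing costs c and success rates R; ties at
    breakpoints broken toward the higher action)."""
    c, R = np.asarray(c, float), np.asarray(R, float)
    bp = (c[1:] - c[:-1]) / (R[1:] - R[:-1])        # bp[i] = alpha_{i, i+1}
    u = alpha * R - c
    i0 = int(np.max(np.flatnonzero(u == u.max())))  # best response to alpha

    # Investment phase [0, tau]: flow payoff (1 - alpha) * R[i0].
    val = (1 - alpha) * R[i0] * (1 - np.exp(-h * tau)) / h

    # Harvest phase: action j is held until the average crosses bp[j-1].
    start = tau
    for j in range(i0, 0, -1):
        end = tau * alpha / bp[j - 1]               # crossing time into action j - 1
        val += R[j] * (np.exp(-h * start) - np.exp(-h * end)) / h
        start = end
    val += R[0] * np.exp(-h * start) / h            # lowest action persists forever
    return val

print(util_freefall(alpha=0.8, tau=5.0, c=[0.0, 0.10, 0.25], R=[0.2, 0.5, 0.7], h=0.05))
```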

Two practical points are worth flagging. First, the formula highlights the economic tradeoff in a way that is hard to see from the trajectory definition alone: increasing τ increases the weight on the harvesting integrals (which accrue at α = 0) but simultaneously pushes those integrals later in time, where the survival weight $e^{-ht}$ is smaller. Second, the only instance-specific objects entering the formula are $\{R_i\}$ and the breakpoints $\{\alpha_{i-1,i}\}$; thus, once the breakpoint structure is computed, evaluating (α, τ) ↦ Util(FF(α, τ)) reduces to a small number of elementary operations.

When time is discrete and $\bar F_t = (1 - h)^{t-1}$, the same decomposition applies with integrals replaced by sums and with breakpoint times rounded to integers. If the post-switch action remains constant over rounds t ∈ {L, L + 1, …, U}, its contribution is
$$ \sum_{t=L}^{U} (1-h)^{t-1}\cdot \text{(flow payoff)}=\text{(flow payoff)}\cdot (1-h)^{L-1}\cdot \frac{1-(1-h)^{U-L+1}}{h}. $$
Thus, in discrete time the value is again a finite sum of geometric-series terms. The main additional bookkeeping is handling the integer rounding of the breakpoint-crossing rounds $\lceil T_i(\alpha, \tau)\rceil$, which creates small discontinuities in τ; in practice this is benign for optimization because the discontinuities vanish under mild randomization of τ (or can be handled by evaluating neighboring integer candidates).
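A discrete-time sketch along the same lines (again with illustrative names and a crude floor-based rounding of the crossing rounds) is:

```python
import numpy as np

def util_freefall_discrete(alpha, tau, c, R, h):
    """Value of FF(alpha, tau) under geometric stopping Pr[S >= t] = (1 - h)^(t - 1):
    pay alpha in rounds 1..tau and 0 afterwards; in round t > tau the historical
    average is alpha * tau / t."""
    c, R = np.asarray(c, float), np.asarray(R, float)
    bp = (c[1:] - c[:-1]) / (R[1:] - R[:-1])
    u = alpha * R - c
    i0 = int(np.max(np.flatnonzero(u == u.max())))

    def geo_block(flow, L, U=None):
        # Sum of flow * (1 - h)^(t - 1) over rounds t = L..U (U = None means infinity).
        if U is not None and U < L:
            return 0.0
        tail = 0.0 if U is None else (1 - h) ** (U - L + 1)
        return flow * (1 - h) ** (L - 1) * (1 - tail) / h

    val = geo_block((1 - alpha) * R[i0], 1, tau)          # investment rounds
    start = tau + 1
    for j in range(i0, 0, -1):
        last = int(np.floor(tau * alpha / bp[j - 1]))     # last round at action j
        val += geo_block(R[j], start, last)
        start = max(start, last + 1)
    return val + geo_block(R[0], start)                   # lowest action ever after

print(util_freefall_discrete(0.8, 5, [0.0, 0.10, 0.25], [0.2, 0.5, 0.7], 0.05))
```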

Closed-form evaluation reduces the design problem for free-fall policies under constant hazard to searching over (α, τ). We emphasize that the only source of non-smoothness is the identity of the induced action ladder, i.e., which action is optimal at α and which breakpoints are crossed after the switch. This suggests a natural computational strategy: enumerate candidate top actions $i_0$, restrict α to the interval $(\alpha_{i_0-1,i_0}, \alpha_{i_0,i_0+1})$, and within that region optimize the smooth function over τ ≥ 0 (and over α via a one-dimensional line search or a coarse grid). Because n is finite and typically small in stylized models, this yields a simple and robust routine.
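A minimal version of this outer search, reusing util_freefall from the sketch above (grid sizes, the τ range, and the static-benchmark comparison are illustrative choices, with ties at breakpoints assumed to break in the principal's favor), looks as follows:

```python
import numpy as np

def optimize_freefall(c, R, h, n_alpha=25, n_tau=200):
    c, R = np.asarray(c, float), np.asarray(R, float)
    bp = (c[1:] - c[:-1]) / (R[1:] - R[:-1])
    lo = np.concatenate(([0.0], bp))    # alpha range for top action i0: (lo[i0], hi[i0])
    hi = np.concatenate((bp, [1.0]))
    best = (-np.inf, None, None)
    for i0 in range(len(R)):
        if lo[i0] >= hi[i0]:
            continue                    # empty interval: action i0 is never a best response
        for alpha in np.linspace(lo[i0] + 1e-6, hi[i0] - 1e-6, n_alpha):
            for tau in np.linspace(0.0, 10.0 / h, n_tau):
                v = util_freefall(alpha, tau, c, R, h)
                if v > best[0]:
                    best = (v, float(alpha), float(tau))
    # Best static linear contract: pay the breakpoint that just induces action i, forever.
    static = max((1 - lo[i]) * R[i] for i in range(len(R))) / h
    return best, static

print(optimize_freefall([0.0, 0.10, 0.25], [0.2, 0.5, 0.7], h=0.02))
```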

Constant hazard and success/failure together form the most algebraically friendly case, but two departures quickly reintroduce numerical integration.

First, for general survival curves $\bar F(t)$, the same interval decomposition holds (the action changes only at the deterministic times $T_i(\alpha, \tau)$), but each segment contribution becomes proportional to
$$ \int_a^b \bar F(t)\,dt, $$
which is rarely available in closed form. In such cases, evaluating Util(FF(α, τ)) reduces to computing a handful of one-dimensional integrals, which can be done accurately via standard quadrature. The resulting outer optimization over (α, τ) is still low-dimensional, but it is no longer an ``elementary-function’’ problem.
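A sketch of this quadrature route (the Weibull-style survival curve and the function names are illustrative):

```python
import numpy as np
from scipy.integrate import quad

def util_freefall_general(alpha, tau, c, R, Fbar):
    """Value of FF(alpha, tau) for a general survival curve Fbar(t): a handful of
    one-dimensional integrals over the breakpoint-delineated segments."""
    c, R = np.asarray(c, float), np.asarray(R, float)
    bp = (c[1:] - c[:-1]) / (R[1:] - R[:-1])
    u = alpha * R - c
    i0 = int(np.max(np.flatnonzero(u == u.max())))

    seg, _ = quad(Fbar, 0.0, tau)
    val = (1 - alpha) * R[i0] * seg                 # investment phase
    start = tau
    for j in range(i0, 0, -1):
        end = tau * alpha / bp[j - 1]
        seg, _ = quad(Fbar, start, end)
        val += R[j] * seg
        start = end
    tail, _ = quad(Fbar, start, np.inf)             # lowest action on [start, infinity)
    return val + R[0] * tail

Fbar = lambda t: np.exp(-(0.05 * t) ** 1.5)         # increasing-hazard (Weibull-type) survival
print(util_freefall_general(0.8, 5.0, [0.0, 0.10, 0.25], [0.2, 0.5, 0.7], Fbar))
```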

Second, even under exponential survival, if we expand the policy class to distributions over switching times (or mixtures over multiple α values), then the expected value involves averaging over the mixing distribution. This remains easy when the distribution has a tractable Laplace transform (since the building blocks are exponentials), but for arbitrary mixing distributions one again falls back on numerical integration. Importantly, this is not a conceptual obstacle: it simply reflects that we are optimizing a linear functional over a continuous design space, and numerical methods are the natural tool once we leave the memoryless/finite-support comfort zone.

Taken together, these closed forms explain why the constant-hazard model is a useful workhorse. It allows us to (i) compute the value of a candidate free-fall policy essentially exactly, (ii) optimize it with minimal numerical overhead, and (iii) directly compare the achieved value to the hazard-sensitive upper bounds from the potential method. In the next section, we step beyond this workhorse case and discuss extensions in which the hazard itself may depend on history, or the contract space is restricted by fairness and minimum-pay constraints, or the principal faces limited feedback—each of which alters either the feasibility of free-fall trajectories or the computational tractability of evaluating them.


8. Extensions: history-dependent hazard (endogenous churn), minimum-pay/fairness constraints, partial feedback, and multi-agent variants (brief).

This section sketches four extensions that matter for applications and that also clarify which parts of our analysis are structural (coming from mean-based learning and the evolution of historical averages) versus which are artefacts of exogenous, memoryless churn and unconstrained transfers. Throughout, we keep the core friction: the principal observes outcomes but cannot directly condition on the agent's action, while the agent's behavior is governed by a mean-based no-regret dynamic. The central message is that most extensions preserve the trajectory viewpoint, but they either (i) enlarge the state variables needed to describe a valid trajectory, or (ii) shrink the feasible set of trajectories in ways that can sharply reduce (and sometimes eliminate) free-fall gains.

In many employment and platform settings, the probability of relationship termination is not exogenous. Agents may quit after low pay; principals may terminate after poor outcomes; regulators may impose review events contingent on performance. A reduced-form way to capture this is to allow the hazard to depend on the realized history $H_t$:
$$ h(t) = h(H_t), \qquad \bar F(t) = \Pr[S \ge t \mid H_t], $$
in continuous time (with the obvious discrete-time analogue). Two special cases are particularly natural.

Suppose the agent exits when their realized utility is persistently low, producing an increasing hazard in the agent’s cumulative (or discounted) utility shortfall. For example, one may posit
$$ h(t)=h_0+\kappa\cdot \Big(\max\{0,\,\underline u_A - \bar u_A(t)\}\Big), $$
where $\bar u_A(t)$ is the running average utility. In such models, an aggressive free-fall phase with α(t) = 0 can raise the hazard precisely when the principal is trying to harvest high effort at low pay. This creates a new tradeoff absent under exogenous churn: dynamic pay can hasten its own early termination. From the trajectory standpoint, validity constraints remain driven by best responses to historical average contracts, but the objective becomes path-dependent because the survival weight $\bar F(t)$ is no longer a fixed function of calendar time. Formally, the principal faces a control problem in which the state must include both the historical average contract (the usual state) and whatever statistic drives hazard (e.g., $\bar u_A(t)$). Even in the linear-contract specialization, the optimal policy need not be a simple two-phase free-fall because the principal may prefer to ``smooth'' incentive reductions to avoid triggering quits.
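A rough discrete-time sketch of this channel is given below. It replaces the learner with its best-response-to-the-historical-average proxy and uses the utility-shortfall hazard posited above; all names and numbers are illustrative assumptions:

```python
import numpy as np

def value_with_endogenous_hazard(alpha, tau, c, R, h0, kappa, u_floor, horizon=400):
    """Survival-weighted principal value when the hazard grows with the agent's
    running-average utility shortfall, so deep free fall erodes its own survival."""
    c, R = np.asarray(c, float), np.asarray(R, float)
    surv, cum_pay, cum_uA, value = 1.0, 0.0, 0.0, 0.0
    for t in range(1, horizon + 1):
        pay = alpha if t <= tau else 0.0
        cum_pay += pay
        abar = cum_pay / t                                # historical average contract
        u = abar * R - c
        i = int(np.max(np.flatnonzero(u == u.max())))     # action pinned by the average
        value += surv * (1 - pay) * R[i]                  # survival-weighted principal flow
        cum_uA += pay * R[i] - c[i]
        shortfall = max(0.0, u_floor - cum_uA / t)        # running-average utility shortfall
        surv *= 1.0 - min(1.0, h0 + kappa * shortfall)    # hazard rises with the shortfall
    return value

# Compare an early switch with a later one under the same instance.
inst = dict(c=[0.0, 0.10, 0.25], R=[0.2, 0.5, 0.7], h0=0.02, kappa=0.5, u_floor=0.05)
print(value_with_endogenous_hazard(0.8, 20, **inst),
      value_with_endogenous_hazard(0.8, 60, **inst))
```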

Conversely, suppose the principal can terminate at will (or with some cost), or that poor outcomes mechanically increase churn (e.g., project cancellation). Then the principal's effective objective resembles an optimal stopping problem intertwined with dynamic contracts. The key conceptual point is that endogenous termination can make dynamic contracts powerful in one direction (the principal can stop right after extracting value) but less powerful in another (the agent anticipates termination and discounts late rewards even without exogenous hazard). Under mean-based learning, the agent responds to realized payoffs and thus to any systematic pattern of early termination that correlates with actions; hence, termination policies can act as an additional implicit instrument, but one that is constrained by observability and commitment.

Our main technical limitation here is that the clean reduction to an exogenously survival-weighted integral,
$$ \int_0^\infty \bar F(t)\,u_P\big(p(t),a(t)\big)\,dt, $$
relies on S being independent of play. Once S depends on history, the analogue is still expressible as an expectation over paths, but the principal’s problem is no longer a linear functional of the trajectory. We view this as a promising direction: the same state compression that makes free-fall analyzable (historical averages) may still render the enlarged problem tractable, but one should expect qualitatively new phenomena such as ``retention incentives’’ that cap the depth of free-fall.

A second set of extensions imposes constraints on transfers. In practice, contracts are often bounded below by minimum wage, budget feasibility, non-negativity and limited liability, internal pay equity rules, or external fairness constraints across demographic groups or tasks. We consider three stylized constraint families.

Suppose $p_o\ge \underline p_o$ for all outcomes o. For linear contracts $p_o = \alpha r_o$ with $r_1 = 0$, the failure payment is identically zero, so any lower bound must satisfy $\underline p_1=0$ (else the constraint is infeasible), while a lower bound on success pay forces $\alpha\ge \underline \alpha>0$. This immediately rules out the extreme free-fall step α(t) = 0 and replaces it with a free-fall to $\underline \alpha$. The deterministic breakpoint-crossing picture remains, but the post-switch average becomes $\bar\alpha(t)=\frac{\alpha\tau+\underline\alpha (t-\tau)}{t}$ rather than ατ/t, slowing or even preventing the descent through breakpoints. In the potential-based upper bound, such constraints effectively reduce the potential range Ψ that can be ``spent'' by moving ᾱ(t) downward, and therefore reduce the maximal dynamic advantage even when hazard is low.
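For concreteness, setting the constrained average $\bar\alpha(t)$ equal to a breakpoint gives the stretched crossing times (a direct calculation from the displayed average, reducing to the unconstrained formula when $\underline\alpha = 0$):
$$ T_i(\alpha,\tau;\underline\alpha) \;=\; \tau\cdot\frac{\alpha-\underline\alpha}{\alpha_{i-1,i}-\underline\alpha} \qquad\text{whenever } \alpha_{i-1,i}>\underline\alpha, $$
while any breakpoint with $\alpha_{i-1,i}\le\underline\alpha$ is never crossed, so the corresponding lower actions are never reached after the switch.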

A common alternative is an ex ante individual-rationality constraint (possibly at each time or in expectation over survival):
$$ \mathbb{E}\Big[\sum_{t=1}^{S} u_A(p_t,a_t)\Big]\ \ge\ 0, \qquad\text{or}\qquad \int_0^\infty \bar F(t)\,u_A\big(p(t),a(t)\big)\,dt\ \ge\ 0. $$
Under exogenous survival, this constraint is linear in the trajectory and thus fits naturally into our continuous-time formulation. Economically, it converts some of the principal’s early ``investment’’ payments from a purely strategic instrument into a required transfer to satisfy participation. This tends to compress the set of profitable free-fall policies: the principal may still front-load pay to pull the agent to a high action, but must now compensate (in a survival-weighted sense) for the low-pay harvesting phase.
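As an illustration, under the constant-hazard, success/failure specialization and the interval decomposition of the previous section (a sketch using the crossing times $T_j(\alpha,\tau)$ defined there, with the conventions $T_{i_0+1}(\alpha,\tau):=\tau$ and $e^{-hT_1(\alpha,\tau)}:=0$), the survival-weighted participation constraint for a two-phase free-fall policy FF(α, τ) reads
$$ \big(\alpha R_{i_0}-c_{i_0}\big)\,\frac{1-e^{-h\tau}}{h} \;-\; \sum_{j=1}^{i_0} c_j\,\frac{e^{-hT_{j+1}(\alpha,\tau)}-e^{-hT_j(\alpha,\tau)}}{h} \;\ge\; 0, $$
where the first term is the agent's survival-weighted surplus during the investment phase and the sum is the survival-weighted effort cost borne during the walk down the ladder.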

In multi-group settings (e.g., different tasks or worker types) one may require that contracts do not differ too much across groups, or that expected utility satisfies parity constraints. Even in a single-agent model, one can interpret such rules as restrictions on the admissible α path, e.g.,
$$ |\alpha(t)-\alpha(t')|\le L|t-t'|\quad\text{(smoothness)},\qquad \alpha(t)\in[\underline\alpha,\bar\alpha]\quad\text{(caps)}. $$
These constraints again shrink the trajectory set. A useful practical insight is that the shape of the optimal dynamic policy changes: rather than a sharp switch, one obtains ramp-downs (when smoothness is enforced) or bang-bang-with-floor behavior (when only a minimum is enforced). From a computational viewpoint, closed-form evaluation under exponential survival may survive for piecewise-constant policies but will generally be replaced by numerical integration once smoothness is imposed.

Our baseline assumes the principal knows the primitives (c, F, r) and thus can compute breakpoints and expected rewards. In many applications the principal does not know $R_i$ (or the number of outcomes m is large and outcomes are sparse), and must learn from observed outcomes while simultaneously steering an agent who is also learning. This creates a two-sided learning problem with distinct informational constraints: the principal observes outcomes but not actions; the agent observes its own realized payoffs. Two implications stand out.

First, identification is limited. Because the action is unobserved, the mapping from a posted α to observed outcomes depends on the agent's (learning-driven) response, which depends on the historical average ᾱ(t). Thus, naive estimation of $R_i$ from outcomes is confounded by endogenous action choice. A conservative route is to target robust policy classes whose performance can be certified without precise identification: for instance, restricting to a small menu of α values and using survival-weighted bandit algorithms over that menu. In the constant-hazard case, the objective is naturally discounted, and standard discounted bandit techniques can be adapted, though the non-stationarity induced by mean-based learning still requires care.
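As one concrete instantiation of the menu-restriction idea (purely a sketch: windowed_eps_greedy, the window length, and the environment stub pull are hypothetical, and the sliding window stands in for the survival-weighted or discounted scoring mentioned above as a crude guard against non-stationarity):

```python
import numpy as np
from collections import deque

def windowed_eps_greedy(menu, pull, rounds=2000, eps=0.1, window=200, seed=0):
    """Epsilon-greedy over a small menu of bonus rates, scoring each rate by a
    sliding-window average of the principal's realized flow payoff (1 - alpha) * outcome."""
    rng = np.random.default_rng(seed)
    hist = [deque(maxlen=window) for _ in menu]
    for _ in range(rounds):
        if rng.random() < eps or any(len(d) == 0 for d in hist):
            k = int(rng.integers(len(menu)))                 # explore / initialize
        else:
            k = int(np.argmax([np.mean(d) for d in hist]))   # exploit the windowed scores
        outcome = pull(menu[k])                              # observed success in {0, 1}
        hist[k].append((1 - menu[k]) * outcome)
    return menu[int(np.argmax([np.mean(d) if d else -np.inf for d in hist]))]

# Hypothetical stationary stub for the outcome process; a real environment would
# drift with the agent's own learning state.
def pull(alpha, _rng=np.random.default_rng(1), _p={0.2: 0.2, 0.5: 0.5, 0.8: 0.7}):
    return int(_rng.random() < _p[alpha])

print(windowed_eps_greedy([0.2, 0.5, 0.8], pull))
```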

Second, even when identification is possible, one should expect an exploration-exploitation tradeoff. The principal may need to vary incentives to learn which effort levels are achievable and profitable, but such variation itself moves the historical average and can trigger (or undo) free-fall dynamics. A practical design principle is to separate time scales: use short, randomized exploration bursts that minimally perturb ᾱ(t) (hence minimally perturb the agent's learning state), interleaved with longer exploitation phases. Technically, this suggests extending the reduction-consistency step to allow principal policies that are not fully oblivious but are ``slow'' relative to the agent's averaging dynamics.

Finally, many principals contract with multiple agents simultaneously (teams, marketplaces, model ensembles). Multi-agent structure can either amplify or attenuate the dynamic advantage channel.

If the principal can post an individualized contract to each agent, and each agent runs an independent mean-based learner, then the problem largely decomposes: the principal's value is the sum of per-agent values, each governed by its own historical average. The main coupling comes from shared constraints (budget, fairness) or from aggregate performance objectives that are nonlinear (e.g., a minimum of outputs). In the decomposable case, constant hazard implies 𝔼[S] = 1/h per relationship; with many agents, the principal can diversify churn risk, which makes dynamic policies more attractive in aggregate even if per-agent advantage is modest.

If the principal must post a single contract to all agents (a common bonus rate, a platform-wide revenue share), then the relevant state is no longer one-dimensional. Agents at different learning states imply that the same $\alpha_t$ induces heterogeneous responses, and the ``best response to the historical average'' condition becomes a distributional statement across agents. In large markets, a law-of-large-numbers approximation can make the aggregate response deterministic, but the principal now controls a population dynamical system. The potential method may extend by aggregating potentials across agents, yet free-fall manipulations can be blunted because the principal cannot tailor the investment phase to each individual's breakpoint structure.

If outcomes depend on multiple agents’ actions (team production, congestion), then an agent’s reward-relevant signal reflects others’ behavior. Mean-based learning in such environments can converge to correlated outcomes that are not easily summarized by a single ᾱ(t). Nevertheless, the central insight remains relevant: any dynamic advantage that relies on steering through historical averages will be fragile to additional strategic externalities, and stronger learning notions (e.g., swap regret) become even more natural on the agent side.

Taken together, these extensions suggest that the constant-hazard, single-agent, known-primitive model is best interpreted as a sharp lens on one mechanism—history dependence induced by mean-based learning—rather than as a complete description of practice. The good news is that the lens is portable: once we identify how an extension changes the trajectory constraints or the survival-weighted objective, we can often recover the same qualitative comparative statics (dynamic gains shrink when effective hazard rises, and when constraints limit downward movement of the average contract), while also generating concrete, testable predictions about when free-fall-like policies should disappear in the presence of retention concerns, wage floors, or platform-wide uniform pricing.


9. Discussion and Policy Implications: when dynamic pay can systematically outperform static; auditing implications; implications for agent-side algorithm choices; open questions.

Our analysis is best read as identifying a mechanism by which dynamic pay can systematically outperform static pay when the agent adapts via mean-based no-regret learning: the principal can move the agent across action regions by shaping the historical average contract, and then ``harvest'' during periods in which current incentives are weak but the learner remains pinned (temporarily) to a high-reward action. The mechanism is neither classical screening nor standard moral hazard with full rationality; it is a form of intertemporal manipulation that exploits path dependence in learning dynamics. This section distills when this mechanism is likely to matter, what it suggests for auditing and governance of incentive schemes, and what it implies for the choice of agent-side algorithms. We close with open questions that, in our view, separate modeling convenience from the core economics.

A first practical message is that dynamic advantage is not generic: it requires a conjunction of (i) exploitable path dependence (mean-based learning with best-response-to-average constraints), (ii) enough ``room'' in the incentive parameter space to traverse multiple breakpoints (captured by the potential height $\Psi$), and (iii) sufficient expected relationship length to recoup the initial ``investment'' phase before exogenous termination downweights later harvesting. The last requirement can be summarized by a timescale comparison. Let Tmix denote an instance-dependent recoup or mixing time: informally, the time needed for a front-loaded incentive to translate into a sufficiently high historical average (and hence sustained high effort) after incentives are reduced. Then the condition
h ⋅ Tmix ≪ 1
is a useful rule of thumb under constant hazard, and more generally one expects dynamic gains to be governed by an ``effective hazard'' functional $\Lambda(\bar F)$ that places weight on early termination events. When h is high (or $\bar F$ is thin-tailed), free-fall-style policies become dominated by static or near-static contracts because the principal cannot reliably reach the profitable region of the trajectory before stopping occurs.

A second message is that the form of the advantage is inherently non-stationary. Even if the principal posts a one-dimensional linear contract α(t), profitable policies typically feature phases (or mixtures over phases) rather than a single constant α. This has two implications. First, in environments where institutional constraints force near-stationarity (e.g., regulated revenue shares that cannot change frequently), our results predict that the dynamic advantage should largely disappear. Second, where rapid adjustment is feasible (platform bonuses, short-term commissions, dynamic ``quests''), one should expect large cross-sectional variation in outcomes even with similar long-run average pay: what matters is not only the average level of incentives, but also when incentives are delivered relative to the agent's learning state and the anticipated relationship duration.

Third, the model clarifies when dynamic contracts lose their advantage for structural reasons. Strengthening the agent's learning notion (e.g., toward swap regret) removes the intertemporal wedge the principal exploits, collapsing the additional value from steering through historical averages. Likewise, any constraint that prevents the principal from reducing pay sufficiently far (minimum-pay rules, fairness floors, limited liability coupled with bounded bonuses) truncates the reachable potential and therefore caps the benefit of dynamic manipulation. These are not merely technicalities: they suggest that policy restrictions and algorithmic safeguards can substitute for one another in limiting exploitative dynamics.

A natural governance concern is that dynamic pay can be used to create high incentives early (to attract or condition behavior) followed by systematically low incentives once the agent is ``locked in’’ by its own adaptation. Because the principal does not need to condition on actions, such schemes can be difficult to detect using standard contract review that focuses on per-period expected payments. Our framework suggests focusing instead on diagnostics that are sensitive to history dependence.

One approach is to audit incentive trajectories. In linear settings, this corresponds to identifying paths α(t) with large early mass followed by sharp reductions. Another approach is to compute a static benchmark and compare realized performance to what would be expected under the best static linear contract (or, more generally, the best static contract in the permissible class). In our notation this benchmark is R𝔼[S], and the policy question becomes whether observed profits materially exceed (or rely on) the predicted dynamic surplus bounds governed by (Ψ, $\bar F$). While a regulator typically does not observe (c, F), many environments admit proxies: one can estimate a platform's effective hazard from retention data and bound the feasible Ψ from the observed menu of incentive rates and outcome rewards. The comparative statics then yield a falsifiable implication: dynamic advantage should be concentrated in low-hazard segments (long-tenure cohorts) and should attenuate sharply as churn rises.

A complementary auditing lens is agent welfare and retention. Even in our exogenous-hazard benchmark, dynamic policies that harvest during low-pay phases tend to reduce the agent's realized utility conditional on surviving to late stages. In practice, when quitting is endogenous, such policies may raise churn among disadvantaged or liquidity-constrained agents. Thus, monitoring should include retention responses to incentive reductions, not merely output. In settings where fairness constraints are salient, one can also check whether dynamic incentives create disparate impact by interacting with heterogeneous learning rates or outside options: two groups facing identical posted incentives may nonetheless experience different effective trajectories because their hazard or adaptation differs.

Finally, auditing should explicitly recognize the possibility of obfuscation. A principal can randomize phase lengths or use complex bonus schedules that are hard to summarize. Our results suggest that such randomization is not an innocuous modeling flourish: under uncertain stopping times, it can be close to optimal. This motivates transparency requirements that mandate disclosure of incentive evolution rules (or at least bounds on rate-of-change), akin to rules in consumer finance that restrict teaser-rate designs.

From the agent's perspective, the central vulnerability is that mean-based learning responds to historical average payoffs, which allows the principal to manipulate behavior by manipulating the agent's running averages. A direct implication is that the choice of learning algorithm is not merely a performance detail; it is a strategic commitment that shapes the feasible set of principal pay trajectories.

Our results therefore support a design principle for algorithmic agents (including AI proxies acting on behalf of workers or users): prefer learning dynamics that satisfy richer no-regret guarantees—notably, variants that eliminate path-dependent exploitation (e.g., swap-regret-like notions)—even if they are computationally heavier. Put differently, robustness to adversarially chosen incentives should be treated as a first-class objective alongside sample efficiency. In institutional settings, this is analogous to recommending that workers (or their representative tools) use decision rules that remain responsive to incentives rather than being overly anchored by historical averages.

A second implication concerns transparency. If an agent reveals that it uses a mean-based learner with a known averaging window, it effectively reveals its Tmix, enabling a principal to calibrate phase lengths and extract more surplus. This creates a strategic tradeoff: transparency can facilitate beneficial coordination, but it can also facilitate exploitation. One practical compromise is to disclose coarse behavioral guarantees (e.g., minimum responsiveness to incentive changes, or a form of policy stability) rather than the precise update rule.

A third implication is the role of commitment devices internal to the agent. If the agent can impose a participation constraint on itself (e.g., refuse to continue when realized utility falls below a threshold), then the principal's ability to free-fall is curtailed even absent external regulation. More broadly, algorithmic agents can incorporate ``retention'' or ``safety'' objectives that mimic endogenous quitting, thereby converting the principal's dynamic instrument into a costly lever.

Several directions appear especially important. First, endogenizing churn (history-dependent hazard) is likely to change both optimal policies and welfare conclusions: retention incentives may become a binding constraint that limits free-fall and introduces new equilibria in which principals smooth pay to maintain continuation. Second, principal-side learning remains underexplored. When the principal must infer Ri and breakpoint structure from outcomes confounded by the agent’s learning response, the optimal policy may involve carefully designed experiments that themselves interact with the agent’s state. Third, richer contract spaces (nonlinear bonuses, caps, penalties, or multidimensional outcome signals) raise the question of whether the historical-average state compression survives, and if so, which low-dimensional summaries replace ᾱ(t).

Finally, we see an empirical agenda. The theory predicts that dynamic incentive schemes should be most effective in low-hazard environments, should exhibit front-loading followed by incentive reductions, and should lose effectiveness as agents adopt more sophisticated adaptive rules or as institutions impose floors and smoothness constraints. Testing these predictions—especially disentangling learning dynamics from selection and unobserved heterogeneity—would sharpen the policy relevance of dynamic contracting models and help distinguish benign intertemporal incentives from exploitative manipulation.

Taken together, the broader lesson is that dynamic pay is neither universally harmful nor universally beneficial. It is powerful precisely when adaptation is predictable and path dependent, and when relationships last long enough for investment-and-harvest strategies to pay off. That lens suggests concrete levers for practice: regulate or audit incentive trajectories rather than levels, strengthen agent-side learning guarantees when agents are algorithmic, and pay particular attention to environments with low churn and high discretion over time-varying incentives, where the scope for systematic dynamic outperformance is greatest.