In 2026, it has become increasingly clear that the ambient outcome dimension m—the number of observable telemetry bins, evaluation categories, or performance states—is usually the wrong complexity parameter for repeated contracting problems. Modern principals routinely observe and log thousands (or millions) of outcome features: multi-metric scorecards for AI systems, granular event streams in gig platforms, compliance signals in regulated lending, and fine-grained quality measures in procurement. Yet the contract itself is almost never allowed to depend freely on all of these coordinates. For reasons of auditability, legal compliance, operational simplicity, and governance, real-world payment rules are typically constrained to a low-dimensional template: a handful of weights on a handful of metrics, perhaps with caps, floors, or simple affine transformations. The core methodological point of this paper is that learnability should be indexed by the effective complexity of the incentive response induced by the template, not by the raw number of outcome bins recorded by the principal.
To see why m is a misleading yardstick, it helps to separate “measurement richness” from “contractual degrees of freedom.” A principal may observe a very high-dimensional outcome j (or a discretization thereof) and may even have a reward model that maps that outcome to value. But if the principal is only permitted to choose a contract from a restricted family, then the agent’s incentives depend on that family through a small set of parameters. In other words, there is an asymmetry: the outcome space may be huge, but the policy lever is small. Treating m as the complexity parameter implicitly assumes that the principal can set an arbitrary payment tj for each outcome j, i.e., can choose from the full simplex of outcome-contingent transfers. That assumption is both unrealistic and analytically pessimistic: it pushes us into a regime where learning is dominated by a curse of dimensionality that has little to do with the actual institutional constraints faced by principals.
This pessimism is not merely aesthetic; it is formalized by negative learnability results for general contracts. When the principal can post an arbitrary outcome-contingent payment vector, the learning problem resembles nonparametric optimization of an unknown function observed through extremely sparse feedback: each round reveals only one realized outcome and one realized payoff. With a hidden action and an unobserved mapping from actions to outcome distributions, the principal faces an identification bottleneck. Intuitively, changing one coordinate of the payment vector affects incentives only through the probability that the induced action places on that outcome—probabilities which themselves are unknown and entangled with the agent’s best-response behavior. In the worst case, these entanglements allow many distinct actions to be separated only by tiny differences in high-dimensional outcome probabilities, so that distinguishing “good” from “bad” contracts requires exploration that scales with m (or worse). Put differently, if we allow contracts to vary freely across m outcomes, then we must pay an information-theoretic price for learning a high-dimensional object under bandit feedback. The resulting regret rates and sample complexities scale poorly and, more importantly, predict failure precisely in settings where practice has nonetheless converged on workable low-dimensional contracting schemes.
Our starting point is therefore a reframing: rather than parameterizing difficulty by the number of outcomes, we parameterize it by the number of distinct implementable incentive regimes that can arise within the chosen template. The key object is the principal’s induced expected utility surface over the parameter space. Even when outcomes are high-dimensional and the agent’s action set is large, the principal does not need to learn the full mapping from actions to outcome distributions. To optimize within a restricted family, the principal only needs to learn enough to compare the finitely many payoff “pieces” that arise as the agent switches best responses. This perspective leads us to what we call implementability complexity, denoted K: the number of regions in parameter space on which the agent’s best response is constant (equivalently, the number of affine pieces of the induced utility function). In our view, K is the correct analogue of the number of “effective arms” in the problem: it counts how many qualitatively different behaviors the template can elicit, not how many outcomes the principal can record.
This lens also helps reconcile two facts that otherwise appear in tension. First, outcome spaces in modern applications are exploding, and richer telemetry is generally viewed as beneficial for monitoring and evaluation. Second, contractual governance is becoming more constrained, not less: institutions increasingly require that payment rules be interpretable, stable, and auditable. The implementability complexity framework says that these trends are not contradictory. Rich telemetry can improve the reward model (how the principal values outcomes) and can support ex post auditing, while the contract can remain low-dimensional. Learnability then depends on how many distinct best-response regimes that low-dimensional contract can induce. In favorable cases, K can remain modest even as m grows, so the principal’s learning problem remains tractable.
We build intuition for K through two complementary channels. The first is geometric. Under an affine contract template, each action’s expected utility to the agent is affine in the contract parameters, and the agent’s best response is the upper envelope of these affine functions. The parameter space is therefore partitioned by the hyperplanes where the agent is indifferent between two actions. Each cell of this partition corresponds to a region where a particular action is uniquely optimal (given tie-breaking), and within that region the principal’s expected utility is itself affine. Thus the induced objective is piecewise-affine, with kinks only at action-switching boundaries. The number of cells in this arrangement—our K—captures how many times the identity of the induced action can change as the principal varies the contract parameters. This is a much finer measure than the cardinality of the action set: many nominal actions may never be optimal under the template, and many may be indistinguishable because they generate the same projected statistics relevant for incentive comparisons.
The second channel is institutional. In practice, contract templates are chosen precisely to limit the number of distinct incentive regimes the agent can select among. Consider multi-metric AI evaluation, where a deployment decision may track hundreds of safety and performance metrics, but a compensation rule might be a weighted combination of a small subset (e.g., accuracy, latency, and a safety score), possibly with thresholds. Or consider pay-for-performance platforms, where a worker’s compensation depends on a few standardized performance indicators rather than the full telemetry of their work process. In these settings, it is not m that governs the number of distinct behaviors the agent can profitably adopt; it is the expressivity of the payment template and the heterogeneity of actions as seen through that template. Our measure K is designed to quantify exactly this expressivity/heterogeneity interaction.
Adopting K as the complexity parameter has three implications that guide the rest of the paper. First, it suggests a positive learnability story that is independent of the outcome dimension: if the induced utility surface has only K affine pieces and bounded slopes, then one can design exploration strategies whose sample complexity scales with K (and the parameter dimension d), not with m. This aligns with the empirical observation that principals can often learn effective incentive weights despite enormous telemetry spaces. Second, it clarifies why unrestricted contracts are hard: allowing arbitrary outcome-contingent transfers makes K potentially enormous, because the principal can create many finely separated incentive regimes, and learning must then pay for that expressivity. Third, it highlights a trade-off that is central for policy and design: restricting contract form can improve learnability and stability (lower K), but may reduce attainable performance by ruling out high-powered or finely targeted incentives. Our model is meant to illuminate this trade-off rather than to declare a universally optimal level of contract complexity.
We also acknowledge limitations up front. Our focus on implementability complexity abstracts from several issues that matter in applications: multi-agent interactions, dynamic moral hazard, and endogenous participation constraints, among others. Moreover, K is a property of the induced best-response geometry, which depends on how agents map contracts to behavior; in practice, that mapping may be nonstationary, strategically manipulated, or influenced by unmodeled constraints. Nonetheless, we view K as a useful organizing concept precisely because it remains meaningful across many such extensions: whenever the agent’s response partitions the principal’s feasible set into a finite number of regimes, the number and geometry of those regimes should govern learning difficulty.
The remainder of the paper formalizes this perspective. We begin by laying out the repeated contracting model with affine templates and the feedback structures we consider, including the baseline outcome-only payoff feedback and a stronger evaluation-feedback variant motivated by batch testing in AI deployments. We then develop the geometric characterization of best-response regions and introduce K as the central complexity proxy, before turning to algorithms and regret bounds that scale with K rather than m. This sequence is deliberate: we want the reader to see, early and clearly, why “more telemetry” does not automatically make contracting harder—and why the right complexity notion is the number of implementable incentive regimes, not the number of logged outcomes.
We study a principal who repeatedly posts a contract from a restricted (and known) affine template, facing an agent who privately chooses among finitely many costly actions. The purpose of the model is to isolate what the principal can realistically control—low-dimensional contract parameters—while allowing the outcome space to be arbitrarily rich.
Outcomes, rewards, and contracts. In each round an observable outcome j ∈ [m] := {1, …, m} is realized. We interpret j broadly: it can be a discrete performance category, a telemetry bin obtained by discretizing a high-dimensional event stream, or an index into a scorecard of evaluation results. The principal derives a known reward rj ∈ [0, 1] from outcome j. (In applications, this corresponds to a valuation model the principal has committed to ex ante—e.g., a weighted safety/performance score for an AI system, or an internal quality score in procurement.)
Payments are restricted to an affine template parameterized by a
vector x ∈ 𝒳 ⊂ ℝd,
where 𝒳 is compact and convex. The
induced outcome-contingent transfer is
t(x) = Ax + b ∈ ℝ^m_+,
where A ∈ ℝ^{m × d}
and b ∈ ℝ^m
are known to the principal. We impose boundedness 0 ≤ tj(x) ≤ 1
for all j and x ∈ 𝒳, which serves two roles: it
captures institutional constraints (budgets, caps, and legal limits on
“high-powered” incentives), and it ensures payoffs are uniformly bounded
for learning.
Hidden actions and outcome distributions. The agent has a finite action set 𝒜 (potentially large). Each action i ∈ 𝒜 is associated with (i) an unknown cost ci ∈ [0, 1] and (ii) an unknown outcome distribution qi ∈ Δ([m]). We write qij for the probability of outcome j under action i. The principal does not observe the chosen action and does not know {qi, ci}i ∈ 𝒜; this is the informational friction that makes contracting nontrivial.
Given a posted parameter x,
action i induces an expected
payment
$$
T_i(x) \;:=\; \sum_{j=1}^m q_{ij}\, t_j(x) \;=\; q_i \cdot t(x),
$$
and an agent expected utility
UA(i ∣ x) := Ti(x) − ci = qi ⋅ (Ax + b) − ci.
The agent best-responds:
i*(x) ∈ arg maxi ∈ 𝒜{UA(i ∣ x)},
with ties broken by a fixed rule (so i*(x) is
single-valued). This tie-breaking is innocuous economically—any
selection can be justified by perturbations or conventions—but it is
convenient analytically because it makes the induced objective
well-defined everywhere.
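To fix ideas, the following minimal sketch instantiates the template and the agent's best response numerically. All primitives (A, b, the q_i's, the c_i's) are small synthetic values chosen only so that 0 ≤ t_j(x) ≤ 1 on 𝒳 = [0, 1]^2; nothing here is calibrated to any application.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n_actions = 20, 2, 5                      # outcome bins, template dimension, actions
A = rng.uniform(0.0, 0.4, size=(m, d))          # template coefficients (scaled so payments stay in [0, 1])
b = rng.uniform(0.0, 0.2, size=m)
q = rng.dirichlet(np.ones(m), size=n_actions)   # hidden outcome distributions q_i
c = rng.uniform(0.0, 0.05, size=n_actions)      # hidden costs c_i

def payments(x):
    """Outcome-contingent transfers t(x) = A x + b under the affine template."""
    return A @ x + b

def best_response(x):
    """Agent maximizes q_i . t(x) - c_i; ties broken by lowest index (a fixed rule)."""
    return int(np.argmax(q @ payments(x) - c))

for x in (np.array([0.1, 0.1]), np.array([0.9, 0.9])):
    t = payments(x)
    print(x, "->", best_response(x), f"(payments in [{t.min():.2f}, {t.max():.2f}])")
```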
Timing and feedback. For rounds t = 1, …, T:
1. The principal chooses xt ∈ 𝒳 and
thereby posts payments t(xt).
2. The agent chooses it = i*(xt).
3. An outcome jt ∼ qit
is realized and observed by the principal.
4. The principal’s realized payoff is
ut = rjt − tjt(xt) ∈ [−1, 1].
Because r and the template are
known, observing jt is equivalent
to observing ut; however, it
is useful conceptually to remember that the data are “one realized
outcome per round,” not the full distribution qit itself.
We emphasize that the principal never sees it, so even
repeated observations at the same x are confounded by the fact that
the agent’s action is an endogenous function of the contract.
The principal’s expected utility from posting x is
$$
U(x) \;:=\; \mathbb{E}[u_t \mid x_t=x] \;=\; R_{i^*(x)} - T_{i^*(x)}(x),
\quad\text{where}\quad
R_i := \sum_{j=1}^m q_{ij} r_j.
$$
Equivalently, U(x) = qi*(x) ⋅ r − qi*(x) ⋅ t(x).
The learning problem is to choose {xt} to compete
with the best fixed parameter in hindsight, despite observing only the
induced outcomes and payoffs.
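The round structure above can be simulated directly. The sketch below (with the same kind of synthetic primitives as before, which are assumptions for illustration only) contrasts the single noisy payoff u_t the principal actually observes with the induced expectation U(x) = R_{i*(x)} − T_{i*(x)}(x), which is computable here only because the simulator knows q and c.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n_actions = 20, 2, 5
A = rng.uniform(0.0, 0.4, size=(m, d)); b = rng.uniform(0.0, 0.2, size=m)
r = rng.uniform(0.0, 1.0, size=m)               # known rewards r_j
q = rng.dirichlet(np.ones(m), size=n_actions)   # hidden from the principal
c = rng.uniform(0.0, 0.05, size=n_actions)      # hidden from the principal

def t(x): return A @ x + b
def i_star(x): return int(np.argmax(q @ t(x) - c))

def simulate_round(x):
    """One round of bandit feedback: a single realized outcome j_t and payoff u_t."""
    i = i_star(x)                               # the principal never observes i
    j = rng.choice(m, p=q[i])
    return r[j] - t(x)[j]                       # u_t = r_{j_t} - t_{j_t}(x_t)

def U(x):
    """Induced expected utility R_{i*(x)} - T_{i*(x)}(x); available only to the simulator."""
    i = i_star(x)
    return q[i] @ (r - t(x))

x = np.array([0.5, 0.5])
print(f"U(x) = {U(x):.4f},  empirical mean of u_t = {np.mean([simulate_round(x) for _ in range(5000)]):.4f}")
```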
Regret objectives. In the stationary case (fixed
{qi, ci}),
we evaluate an online algorithm by static regret relative to the best
single contract parameter:
$$
\mathrm{Reg}(T) \;=\; \max_{x \in \mathcal{X}} \sum_{t=1}^T \Big(U(x) -
\mathbb{E}[U(x_t)]\Big).
$$
This notion formalizes a principal who wants to learn a stable contract
that performs well over the whole horizon.
We also consider nonstationarity, motivated by settings where the
mapping from contracts to behavior changes over time: agents adapt,
technologies shift, platforms change monitoring, or the population
composition drifts. Formally, we allow the induced expected utility to
change with t, writing Ut(x)
for the principal’s expected utility in round t if x were posted. We then measure
performance by dynamic regret,
$$
\mathrm{DReg}(T) \;=\; \sum_{t=1}^T \Big(\max_{x\in\mathcal{X}} U_t(x) -
\mathbb{E}[U_t(x_t)]\Big),
$$
and control the extent of nonstationarity via a variation budget $V_T = \sum_{t=2}^T \sup_x
|U_t(x)-U_{t-1}(x)|$. This is deliberately agnostic about
why drift occurs; it captures the operational reality that
“what works” may gradually change even when the contract template itself
is fixed.
Two feedback regimes: outcome-only (bandit) versus evaluative payoff queries. The baseline regime is what many contracting environments naturally provide: each round yields a single realized outcome jt (hence a single payoff ut). This is bandit feedback with endogenous confounding through i*(xt). In particular, the principal cannot directly estimate qi, cannot directly compare actions, and cannot decompose realized payoffs into “reward” and “payment effects” beyond the identity ut = rjt − tjt(xt). Economically, this corresponds to paying and observing performance in the field: each period produces one draw from whatever behavior the agent found optimal.
In some applications, however, the principal can obtain stronger information about U(x) by running repeated evaluations at a chosen parameter x. For example, before deploying an AI system with a given incentive scheme, a firm may run a large offline test suite; a platform may pilot a compensation rule on a randomized cohort; a regulator may require stress tests under standardized scenarios. In these cases, the principal can (at some cost) obtain an approximately unbiased estimate of U(x) with controllable variance—operationally, by averaging many i.i.d. instances of the outcome/payoff process under the same x. We treat this as an “evaluation-feedback” variant: the principal can spend additional measurement effort to reduce noise in utility estimation, and the learning guarantees can improve accordingly. Importantly, even this stronger feedback does not reveal the hidden action; it simply provides cleaner access to the induced expected payoff surface.
Examples and interpretation. In multi-metric AI evaluation, the “outcome” j can encode a discretized vector of audit results (e.g., pass/fail across a battery of safety checks, latency buckets, calibration categories). The reward rj captures the principal’s downstream value from deploying a model with that evaluation profile, while the payment tj(x) could be a bonus schedule for a model developer or an internal team, constrained to be an affine function of a small set of headline metrics (dimension d). The agent’s actions correspond to development choices—training recipes, data filtering, alignment interventions—each with unknown cost and unknown distribution over evaluation outcomes. Repeated contracting arises when the principal iteratively updates incentive weights as more deployments and evaluations occur.
In pay-for-performance platforms, outcomes j can represent realized service quality categories derived from telemetry (on-time rates, complaint flags, customer ratings, compliance checks). The template x represents the platform’s adjustable levers (a few bonus multipliers, thresholds, or linear weights), while the agent’s actions capture effort and strategic responses (e.g., route choice, time allocation, selective acceptance), which shift the distribution of observable outcomes in ways that are not known ex ante. Again, the platform observes outcomes but not the worker’s internal action, and learning is about tuning a low-dimensional payment rule to induce high-value behavior.
Modeling choices and limitations. We deliberately take 𝒜 to be finite and outcomes to be discretized into m bins. This is not because we believe behavior or measurement is truly discrete, but because the central difficulty here is endogenous response under a restricted payment template, not statistical estimation of a smooth response function. Discretization lets us represent rich telemetry while keeping payoffs bounded and the incentive problem well-defined. Similarly, we assume the principal’s reward vector r is known and stable; in many domains r is itself estimated and contested. Our results should therefore be read as describing the learning problem conditional on a chosen evaluation/valuation scheme, rather than solving the full problem of learning values and incentives simultaneously.
With these primitives in place, the next step is to describe how the affine template and best-response behavior impose geometry on the principal’s induced objective U(⋅). That geometry—rather than the raw outcome dimension—will determine what can be learned efficiently from the feedback described above.
The key structural object that makes the learning problem tractable is the implementability partition: the decomposition of the parameter space 𝒳 into regions on which the agent’s best response is constant. Economically, these regions are exactly the sets of contract parameters that satisfy a fixed collection of incentive constraints; geometrically, under an affine template they form a polyhedral subdivision of 𝒳. Our learning rates will scale with the complexity of this subdivision rather than with the raw outcome dimension m or the size of the action set |𝒜|.
Fix two actions i, i′ ∈ 𝒜. Under
parameter x, the agent
compares
UA(i ∣ x) − UA(i′ ∣ x) = (qi − qi′) ⋅ t(x) − (ci − ci′).
With an affine contract t(x) = Ax + b,
this difference is affine in x:
$$
U_A(i\mid x)-U_A(i'\mid x)
= \underbrace{\big(A^\top(q_i-q_{i'})\big)}_{\in\mathbb{R}^d}\cdot x
+ \underbrace{(q_i-q_{i'})\cdot b - (c_i-c_{i'})}_{\in\mathbb{R}} .
$$
Thus, the indifference locus between i and i′ is a hyperplane (when
the coefficient vector is nonzero), and strict preference for i over i′ is a halfspace. This
is the multivariate analogue of the familiar “upper-envelope of lines”
picture for one-dimensional linear contracts: rather than slopes and
intercepts in ℝ, we have normal vectors
and offsets in ℝd.
A convenient way to package what matters for incentives is to define,
for each action i,
wi := A⊤qi ∈ ℝd, αi := qi ⋅ b − ci ∈ ℝ,
so that
UA(i ∣ x) = wi ⋅ x + αi.
From the perspective of contract design within the template,
the potentially high-dimensional outcome distribution qi influences
incentives only through the projected statistic wi = A⊤qi
and the scalar αi. In
particular, two actions with different underlying qi’s can be
incentive-equivalent if they share the same (wi, αi),
meaning the agent is indifferent between them for all x. This observation already hints at
why m should not appear in the
learning rates: the template collapses the outcome space into a d-dimensional set of
incentive-relevant directions.
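The collapse from q_i to (w_i, α_i) can be demonstrated explicitly: perturbing a distribution inside the joint null space of the rows of A⊤, of b, and of the all-ones vector changes the outcome distribution without changing A⊤q, q ⋅ b, or total mass. The snippet below is a sketch of this construction with synthetic A and b (it assumes SciPy is available for the null-space computation); with equal costs, the two resulting actions are incentive-equivalent for every x.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
m, d = 50, 2
A = rng.uniform(0.0, 0.02, size=(m, d)); b = rng.uniform(0.0, 0.3, size=m)

q1 = rng.dirichlet(np.ones(m))
# Directions that leave A^T q, q.b, and sum(q) unchanged: null space of the (d+2) x m matrix below.
N = null_space(np.vstack([A.T, b[None, :], np.ones((1, m))]))
delta = N[:, 0]
q2 = q1 + 0.5 * delta * q1.min() / np.abs(delta).max()   # small enough to keep q2 >= 0

print("distributions differ:", not np.allclose(q1, q2))
print("same projected statistic w:", np.allclose(A.T @ q1, A.T @ q2))
print("same q.b and total mass:", np.isclose(q1 @ b, q2 @ b), np.isclose(q2.sum(), 1.0))
```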
For any action i, define
its best-response region (or implementability set)
𝒳i := {x ∈ 𝒳: UA(i ∣ x) ≥ UA(i′ ∣ x) ∀i′ ∈ 𝒜}.
Substituting the affine representation, we obtain
𝒳i = 𝒳 ∩ ⋂i′ ∈ 𝒜{x: (wi − wi′) ⋅ x ≥ αi′ − αi}.
That is, 𝒳i is the
intersection of 𝒳 with finitely many
halfspaces, hence a (possibly empty) convex polyhedron. The collection
{𝒳i}i ∈ 𝒜
covers 𝒳, and with a fixed tie-breaking
rule we may view it as a genuine partition up to shared boundaries: each
x is assigned a unique best
response i*(x), and the
set of points inducing a given best response is polyhedral.
This partition is the formal version of an implementability story: a contract parameter x “implements” action i precisely when all incentive constraints UA(i ∣ x) ≥ UA(i′ ∣ x) hold. Under affine templates these constraints are linear, so implementability is described by linear inequalities. This is not merely a mathematical convenience; it captures an important practical rigidity. When organizations restrict attention to a small family of scoring rules, bonus schemes, or linear-weight contracts, they implicitly restrict the set of incentive constraints they can move. The result is that action choice changes only when the parameter crosses a boundary hyperplane where some alternative becomes equally attractive.
We measure the effective “behavioral richness” of a template by
counting how many distinct best-response regions it induces on 𝒳. Let
K := |{ i*(x) : x ∈ 𝒳 }|
be the number of actions that are optimal somewhere in 𝒳 (equivalently, the number of nonempty cells
𝒳i). One can also
define K as the number of
cells in the polyhedral subdivision after accounting for tie-breaking on
boundaries; for our purposes these notions coincide up to irrelevant
measure-zero issues.
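A crude empirical proxy for K is to count the distinct best responses over a fine grid of 𝒳; this can only undercount (a thin region may fall between grid points), but it illustrates how K can be much smaller than |𝒜|. The primitives below are synthetic assumptions, and the printed count depends on them.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
m, d, n_actions = 20, 2, 15
A = rng.uniform(0.0, 0.4, size=(m, d)); b = rng.uniform(0.0, 0.2, size=m)
q = rng.dirichlet(np.ones(m), size=n_actions)
c = rng.uniform(0.0, 0.05, size=n_actions)

w = q @ A                    # projected statistics w_i = A^T q_i, one row per action
alpha = q @ b - c            # intercepts alpha_i = q_i . b - c_i

def i_star(x):
    return int(np.argmax(w @ x + alpha))

grid = np.linspace(0.0, 1.0, 200)
seen = {i_star(np.array(point)) for point in product(grid, repeat=d)}
print(f"|A| = {n_actions}, distinct best responses on the grid (a lower bound on K): {len(seen)}")
```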
Two features of K are worth emphasizing.
K can be far smaller than |𝒜|. Even if the agent has a huge menu of actions, only a limited number may be implementable within the template. Indeed, if many actions share the same projected vector wi = A⊤qi (and comparable intercepts), the template cannot separate them. In contracting terms: a low-dimensional scorecard cannot finely discriminate among high-dimensional behaviors.
K is the right notion of complexity for learning. The principal’s induced objective U(⋅) is nonsmooth only because the agent switches actions; every kink corresponds to crossing a boundary between best-response regions. Thus, the number of such regions governs how hard it is to locate the global maximizer from noisy payoff feedback. In our regret bounds, K plays the role that “number of arms” plays in bandits, but in a geometrically structured way: regions are not independent arms, they are adjacent polyhedra carved out by hyperplanes.
A useful auxiliary quantity is the number M of distinct projected vectors wi = A⊤qi. Since comparisons depend on differences wi − wi′, the arrangement complexity is controlled by how many distinct normals appear. Standard hyperplane arrangement bounds imply that, in d dimensions, the number of cells formed by M affine functions is at most on the order of M^d (up to constants and lower-order terms). This provides an interpretable proxy: increasing template dimension d or increasing the variety of incentive-relevant outcome profiles (captured by M) can explode the number of behavioral regimes the principal may face.
Once an action i is fixed,
the principal’s expected utility is affine:
U(x) = Ri − qi ⋅ (Ax + b) = Ri − (A⊤qi) ⋅ x − qi ⋅ b = βi − wi ⋅ x,
where βi := Ri − qi ⋅ b
is constant in x. Hence on any
region where i*(x) = i,
the gradient exists and equals
∇U(x) = −wi = −A⊤qi.
We encode uniform control of these slopes by assuming a known
bound
∥∇U(x)∥ ≤ L whenever
the gradient is defined,
equivalently maxi∥A⊤qi∥ ≤ L
in the relevant norm. This bound is natural in two ways. First, it is
implied by bounded payments together with bounded template coefficients
(e.g., via ∥A⊤qi∥ ≤ ∥A⊤∥⋅∥qi∥
and ∥qi∥ ≤ 1 under
appropriate norms). Second, it has a clear economic meaning: within a
fixed behavioral regime, how quickly does the principal’s expected
payoff change as we tweak contract parameters? Steeper slopes mean small
parameter mistakes translate into large utility losses, which forces
more precise learning near optima and near region boundaries.
One could also summarize the difficulty of learning U(⋅) using more classical statistical capacity notions, such as the pseudo-dimension of the function class. Recall that the pseudo-dimension Pdim(ℱ) of a real-valued class ℱ is the largest n such that there exist points x1, …, xn and thresholds θ1, …, θn for which the sign patterns of $\{f(x_s) - \theta_s\}_{s=1}^{n}$ can realize all $2^n$ labelings as f ranges over ℱ. For classes defined by maxima of affine functions, pseudo-dimension is typically on the order of d log(number of pieces). In our setting, if we consider the induced utilities consistent with some unknown {(wi, αi, βi)} but with at most K regions and bounded slopes, one expects Pdim scaling like O(d log K) (up to constants and technical conditions).
We nevertheless prefer K (and L) as our primary structural parameters for two reasons. First, K is directly tied to the economics of implementability: it counts how many qualitatively distinct agent behaviors the template can induce. Second, the geometry induced by the implementability partition enables algorithms that exploit adjacency and piecewise affinity, yielding regret bounds that depend polynomially on K and only mildly on d (typically through logarithms or fractional powers), without ever estimating the full m-dimensional outcome distributions.
This sets up the next step. To build intuition for how “kinks” and “regions” drive learning, we begin with the warm-up case d = 1, where the implementability partition becomes intervals, U(⋅) becomes piecewise-linear with K breakpoints, and the upper-envelope geometry can be drawn—and learned—from data.
To see most transparently how the implementability partition drives
learnability, it is useful to collapse the template to a single
parameter. Think of a one-dimensional “intensity knob” (a commission
rate, a single score weight, a bonus multiplier): 𝒳 = [0, 1] ⊂ ℝ and x ∈ [0, 1] is scalar. In this case
each projected statistic wi = A⊤qi
is a real number, and the agent’s expected utility for action i becomes a line
UA(i ∣ x) = wix + αi.
The agent best response i*(x) is
therefore determined by the upper envelope of lines
maxi{wix + αi}.
Two classical geometric facts (familiar from upper-envelope arguments in
screening and mechanism design) become immediate:
Best-response regions are intervals. With a fixed tie-breaking rule, the set of x inducing a given action is an interval (possibly empty). Intuitively, if action i beats i′ at two points on the line, it beats it everywhere between those points because the difference is linear.
There are finitely many switching points. The map x ↦ i*(x) can change only at intersection points of pairs of lines. Thus the implementability partition of [0, 1] is an ordered sequence of intervals separated by at most K − 1 breakpoints, where K counts how many actions appear on the envelope over [0, 1].
This is the one-dimensional analogue of the polyhedral partition described earlier: hyperplanes become points, and cells become intervals. Economically, we can interpret each breakpoint as a threshold at which a marginal change in the contract parameter flips which incentive constraint binds.
On any interval I where the
best response is constant, say i*(x) = i
for all x ∈ I, the
principal’s induced utility is also linear:
U(x) = Ri − qi ⋅ t(x) = βi − wix, with
|wi| ≤ L.
So in d = 1, U is a
piecewise-linear function with at most K linear pieces and slope bounded by
L, with kinks only at the
(unknown) best-response breakpoints.
A small but important simplification occurs in one dimension: on any interval where U is linear, its maximum is achieved at an endpoint of that interval (unless the slope is exactly zero, in which case every point is optimal). Consequently, if we knew the breakpoints, we could restrict attention to a finite candidate set: the domain endpoints {0, 1} together with the K − 1 internal breakpoints. The learning difficulty is precisely that we do not know these switching points ex ante and observe only noisy payoffs ut at the parameters we choose.
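If the agent-side lines w_i x + α_i and the principal-side intercepts β_i were known, the breakpoints and the finite candidate set could be computed directly, as in the sketch below; the numerical values are arbitrary assumptions, and the point is only to make the candidate-set logic concrete.

```python
import numpy as np

rng = np.random.default_rng(4)
n_actions = 8
w = rng.uniform(0.0, 1.0, size=n_actions)        # projected slopes w_i = A^T q_i (scalars when d = 1)
alpha = rng.uniform(-0.5, 0.5, size=n_actions)   # agent intercepts q_i . b - c_i
beta = rng.uniform(0.0, 1.0, size=n_actions)     # principal intercepts R_i - q_i . b

def i_star(x):
    return int(np.argmax(w * x + alpha))          # upper envelope of agent lines

def U(x):
    i = i_star(x)
    return beta[i] - w[i] * x                     # principal utility on the active piece

# Scan a fine grid for action switches, then solve each crossing w_i x + a_i = w_j x + a_j exactly.
grid = np.linspace(0.0, 1.0, 10_001)
winners = np.array([i_star(x) for x in grid])
breaks = []
for k in np.flatnonzero(np.diff(winners)):        # adjacent grid points with different winners
    i, j = winners[k], winners[k + 1]
    breaks.append((alpha[j] - alpha[i]) / (w[i] - w[j]))

candidates = [0.0, 1.0] + breaks                  # domain endpoints plus the K - 1 internal breakpoints
print(f"K = {len(set(winners.tolist()))} pieces; best candidate value = {max(U(x) for x in candidates):.3f}")
```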
This perspective already clarifies why K is the right complexity parameter. A generic Lipschitz function on [0, 1] can hide its maximizer anywhere, forcing exploration on an ε-grid. By contrast, a K-piecewise-linear function can “hide” its maximizer only in the vicinity of a small number of unknown kinks. Learning is therefore a problem of discovering and comparing a small number of regimes, rather than uniformly covering the continuum.
The most useful mental model for algorithms in this warm-up is an adaptive partition of [0, 1] into sub-intervals, combined with two statistical tasks: (i) testing whether the observed payoffs on a sub-interval are consistent with a single linear piece (so that a hidden kink inside it would be detected), and (ii) estimating payoffs accurately enough at the surviving candidate points to compare them.
This suggests a natural adaptive routine (we state it informally because in later sections we will use a higher-dimensional analogue): sample noisy payoffs at the endpoints and the midpoint of each active interval; if the midpoint estimate is inconsistent with the average of the endpoint estimates beyond the noise tolerance, split the interval; otherwise keep only its endpoints as candidates and concentrate further samples on the most promising ones.
Because there are only K − 1 true kinks, only O(K) intervals will ever need to be split “for structural reasons”; all other refinement is purely to sharpen estimates near the best candidate.
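The following is a minimal sketch of that informal routine under a synthetic noisy payoff oracle; the tolerance, sample sizes, and the particular three-line instance are illustrative assumptions rather than tuned constants from our analysis.

```python
import numpy as np

rng = np.random.default_rng(5)

# A synthetic 1D instance: agent lines w_i x + alpha_i and principal intercepts beta_i.
w = np.array([0.8, 0.2, 0.6]); alpha = np.array([-0.1, 0.25, 0.0]); beta = np.array([0.3, 0.9, 0.6])

def noisy_payoff(x):
    i = int(np.argmax(w * x + alpha))             # hidden best response
    return beta[i] - w[i] * x + rng.normal(0, 0.1)

def estimate(x, n=400):
    """Average n noisy payoffs at x (repeated play at the same parameter)."""
    return float(np.mean([noisy_payoff(x) for _ in range(n)]))

def refine(a, b, depth=0, tol=0.02, max_depth=6):
    """Split [a, b] only if the midpoint payoff is inconsistent with a single linear piece."""
    fa, fb, fm = estimate(a), estimate(b), estimate((a + b) / 2)
    if abs(fm - (fa + fb) / 2) <= tol or depth >= max_depth:
        return [(a, fa), (b, fb)]                 # looks linear (or is tiny): keep endpoints as candidates
    mid = (a + b) / 2                             # likely hides a kink: recurse on both halves
    return refine(a, mid, depth + 1, tol, max_depth) + refine(mid, b, depth + 1, tol, max_depth)

x_hat, u_hat = max(refine(0.0, 1.0), key=lambda p: p[1])
print(f"candidate maximizer x_hat = {x_hat:.3f}, estimated payoff = {u_hat:.3f}")
```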
One way to summarize what this buys us is an offline
sample complexity statement: with i.i.d. payoff queries, one can return
x̂ such that
U(x̂) ≥ maxx ∈ [0, 1]U(x) − ε
using on the order of Õ(K ε^{-2})
payoff samples (with constants depending on the noise level and on L). The key point is conceptual: the
dependence is on the number of pieces K, not on the outcome dimension
m, and not on any fine
discretization of [0, 1].
To convert the warm-up intuition into an online regret rate under payoff feedback, we can apply the standard exploration–exploitation balancing that reappears in the general d-dimensional analysis: spend S(ε) rounds locating an ε-optimal parameter (by the offline argument above, S(ε) is on the order of K/ε^2), then commit to that parameter for the remaining rounds at a per-round loss of at most ε.
This gives the familiar regret decomposition
$$
\mathrm{Reg}(T) \;\lesssim\; S(\varepsilon) + \varepsilon T
\;\approx\; \frac{K}{\varepsilon^2} + \varepsilon T .
$$
Optimizing over ε yields ε ≍ (K/T)^{1/3}
and therefore
Reg(T) = Õ(L K^{1/3} T^{2/3}),
where L enters as the natural
scale factor because it converts parameter error into utility loss along
a linear piece.
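For completeness, the ε above is just the balance point of the two terms in the decomposition (treating constants and logarithmic factors loosely):
$$
\frac{K}{\varepsilon^{2}} \;=\; \varepsilon T
\quad\Longleftrightarrow\quad
\varepsilon \;=\; \Big(\frac{K}{T}\Big)^{1/3},
\qquad\text{so that}\qquad
\frac{K}{\varepsilon^{2}} + \varepsilon T \;=\; 2\,K^{1/3}\,T^{2/3}.
$$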
This T^{2/3} behavior is exactly the hallmark of learning problems where one must both localize a small set of unknown structural features (here, breakpoints) and estimate payoffs with noise. In practice, it corresponds to the common situation where a decision-maker tunes a single contract knob over time: most of the horizon is spent deciding “which regime we are in,” not fine-tuning within a regime.
The same geometry also explains why we should not expect to do substantially better in the worst case. A lower-bound construction can embed a K-armed identification problem into [0, 1] by associating each “arm” with an interval and choosing piecewise-linear functions whose maxima occur near the boundaries between intervals. Informally, unless we sample near the right breakpoint, we cannot tell whether the best segment lies to the left or right, and the noise forces repeated sampling. This yields a minimax Ω(K^{1/3} T^{2/3}) barrier (up to logarithmic factors) for the class, even in one dimension.
It is also useful to place this warm-up next to the continuum-armed bandit literature. Because slopes are bounded, U is globally L-Lipschitz on [0, 1]. But the piecewise-linear structure is stronger than mere Lipschitz continuity: away from the K − 1 breakpoints, the function is perfectly linear, so all “curvature” is concentrated on a small set. This resembles the one-sided/weak Lipschitz conditions used to obtain dimension-dependent rates via zooming arguments: local complexity near the optimum, not global covering numbers, governs performance. Our setting differs only in that the locations of the kinks are themselves unknown and must be learned from bandit feedback, which is what restores the T^{2/3} exploration–exploitation tradeoff.
Finally, the upper-envelope picture provides a contracting interpretation for these statistical statements. When we restrict to a one-parameter template, the agent’s behavior changes only at a finite set of incentive thresholds; learning is hard precisely because the principal must discover which threshold is relevant. This is the core intuition we will carry into the general d-dimensional case, where “thresholds” become hyperplane boundaries and “interval refinement” becomes adaptive partitioning of 𝒳.
When we move from a single contract “knob” to a d-dimensional template x ∈ 𝒳 ⊂ ℝd, the geometry becomes richer but the economic source of structure is unchanged: the agent’s best response is still an upper envelope of affine utilities, and the principal’s induced payoff U(⋅) is therefore affine within regions and non-smooth only where the induced action switches. The implementability partition now consists of at most K polyhedral cells in 𝒳, separated by hyperplanes defined by pairwise incentive comparisons.
From a learning perspective, the central difficulty under payoff feedback is that we do not observe the agent’s action nor the region identity. We only observe noisy realizations ut ∈ [−1, 1] whose expectation equals U(xt). Thus the problem is a bandit optimization task over a function class that is globally Lipschitz (with constant L) but has a far stronger “low non-smoothness complexity” captured by K.
Fix any cell P of the
implementability partition and consider the set P ∩ 𝒳, which remains convex. On that
set, U(x) has the
affine form
U(x) = γ − g ⋅ x, ∥g∥ ≤ L,
for some (γ, g)
determined by the induced action. Maximizing an affine function over a
convex set is achieved at an extreme point; hence, if we knew
the cell, the principal could restrict attention to finitely
many candidate points (the vertices of P ∩ 𝒳, or more generally extreme
points). In economic terms: once the relevant incentive constraint is
known, optimizing the contract within that regime is “linear
programming,” not a nonparametric search.
Of course, we do not know the cells, and in higher dimensions we cannot hope to explicitly enumerate all vertices of all regions. The algorithmic goal is therefore more modest and more statistical: we adaptively refine 𝒳 into smaller sets until (i) each active set is likely contained in a single regime, and (ii) the remaining uncertainty in value is too small to justify further refinement.
The one-dimensional midpoint identity has a clean analogue in ℝd. Suppose we focus on a
small convex set C ⊂ 𝒳 (think
of a simplex or a box) and select d + 1 affinely independent “anchor”
points v0, …, vd ∈ C.
Any point y in the simplex
conv{v0, …, vd}
admits barycentric coordinates λ(y) ∈ Δd
such that $y=\sum_{k=0}^d \lambda_k(y)
v_k$. If U is affine on
this simplex, then it must satisfy the identity
$$
U(y) \;=\; \sum_{k=0}^d \lambda_k(y) \, U(v_k).
$$
This gives a concrete statistical “kink detector”: by sampling payoffs
at v0, …, vd
and at one or more interior test points y, we can check whether the observed
values are consistent with a single affine function up to the noise
tolerance. If not, then with high probability C intersects a region boundary (an
action switch), and we should split C into smaller sets.
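A sketch of this barycentric consistency check appears below. The two-piece test surface, the noise level, the tolerance, and the sample size are all hypothetical; the only structural fact used is the identity U(y) = Σ_k λ_k(y) U(v_k) for an affine U.

```python
import numpy as np

rng = np.random.default_rng(6)

def barycentric(y, vertices):
    """Solve for lambda with y = sum_k lambda_k v_k and sum_k lambda_k = 1 (vertices: (d+1) x d array)."""
    M = np.vstack([vertices.T, np.ones(vertices.shape[0])])
    return np.linalg.solve(M, np.append(y, 1.0))

def affine_consistent(payoff, vertices, y, n=500, tol=0.05):
    """Compare the payoff estimate at y with the barycentric combination of the anchor estimates."""
    U_hat = lambda x: float(np.mean([payoff(x) for _ in range(n)]))
    lam = barycentric(y, vertices)
    predicted = lam @ np.array([U_hat(v) for v in vertices])
    return abs(U_hat(y) - predicted) <= tol        # a failure suggests an action switch inside the simplex

# A hypothetical two-piece surface in d = 2 with additive noise.
def payoff(x):
    return min(0.9 - 0.3 * x[0], 0.7 - 0.1 * x[1]) + rng.normal(0, 0.1)

small_simplex = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2]])   # likely inside a single piece
print(affine_consistent(payoff, small_simplex, y=np.array([0.07, 0.07])))
```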
Two aspects are worth emphasizing.
The test is local and does not require estimating qi or ci. We never attempt to infer the agent’s primitives; we only test whether the induced payoff surface behaves like one affine piece on a candidate neighborhood.
The test is scale-sensitive. If C is very small (diameter diam(C)), then even if C crosses a boundary, the maximum possible payoff difference across C is at most L diam(C). At sufficiently fine scales, it becomes statistically unnecessary to resolve whether C contains a kink, because the regret impact is negligible.
These two facts make it plausible that an adaptive refinement strategy can pay only for the “important” boundaries—those near the optimum and at scales where boundary uncertainty translates into material payoff loss.
The online procedure can be viewed as maintaining a growing tree of cells that cover 𝒳. Each cell C maintains a set of d + 1 affinely independent anchor points, empirical payoff estimates Û(v) at those anchors together with confidence radii rad(v), and an optimistic value UCB(C) for the best expected payoff attainable inside the cell.
A simple and robust form of UCB(C) is the maximum over anchors
of their upper confidence plus a Lipschitz slack for moving within the
cell:
UCB(C) = maxv ∈ Anchors(C)(Û(v) + rad(v)) + L diam(C).
This bound is intentionally conservative: it does not assume affinity
inside C. When the algorithm
has statistical evidence that U is affine on C, it can tighten UCB(C) using an estimated affine
model (fit from the anchor points) plus confidence; but the
Lipschitz-based bound suffices to explain the rate and is stable to
misspecification near boundaries.
The online decision rule is then: at each round t, choose a cell Ct with maximal UCB(C), and within that cell choose a point to sample that most efficiently reduces uncertainty—typically an anchor point whose confidence radius is largest, or a designated interior test point used for affine-consistency checks. If the consistency test fails at cell C, split it into smaller cells (e.g., bisect along a coordinate axis, or refine a simplex into subsimplices) and continue.
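The bookkeeping behind “optimism + splitting” can be kept very small. The sketch below stores, for each axis-aligned box cell, its anchor estimates and confidence radii, and scores cells with the conservative Lipschitz-slack bound from above; treating unexplored cells as maximally optimistic is a simplifying assumption, as is the box (rather than simplex) geometry.

```python
import numpy as np
from dataclasses import dataclass, field

L = 1.0   # assumed bound on the per-piece gradient norm

@dataclass
class Cell:
    lo: np.ndarray                              # lower corner of the box
    hi: np.ndarray                              # upper corner of the box
    est: dict = field(default_factory=dict)     # anchor point (tuple) -> (mean payoff, confidence radius)

    def diam(self):
        return float(np.linalg.norm(self.hi - self.lo))

    def ucb(self):
        """Best anchor upper confidence plus a Lipschitz slack for moving anywhere inside the cell."""
        if not self.est:
            return float("inf")                 # never-sampled cells are treated as maximally optimistic
        return max(u + r for (u, r) in self.est.values()) + L * self.diam()

def select_cell(cells):
    """Optimism: play inside the cell with the largest upper confidence bound."""
    return max(cells, key=lambda C: C.ucb())

cells = [Cell(np.zeros(2), np.ones(2), {(0.5, 0.5): (0.40, 0.05)}),
         Cell(np.zeros(2), 0.5 * np.ones(2), {(0.25, 0.25): (0.35, 0.10)})]
print(select_cell(cells).diam())
```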
Economically, this “optimism + splitting” routine has a transparent interpretation: we keep experimenting with contract variants only in those neighborhoods that could still plausibly contain a better incentive regime, and we stop refining regions that cannot outperform the current best by more than a small margin given the slope bound.
The implementability partition has at most K regions; equivalently, there are at most K − 1 action-switching boundaries (in a suitable sense) that matter for the payoff function. An adaptive partition will repeatedly split cells that straddle these boundaries, but crucially: cells contained in a single region eventually pass the affinity test and are never split for structural reasons, and at any fixed scale the number of cells that do straddle a boundary is controlled by the geometry of the arrangement rather than by |𝒜| or m.
This is where the d-dependence enters. In one dimension, boundaries are points and the count is literal. In ℝd, boundaries are (d − 1)-dimensional faces, and the number of small cells intersecting them grows with dimension. The analysis packages this growth into a factor on the order of d log K (up to constants and polylogarithms), reflecting that a K-cell polyhedral complex cannot “shatter” space arbitrarily: the arrangement has limited combinatorial richness relative to K and d.
Putting these ingredients together yields the main performance
guarantee for the bandit-feedback regime. Under bounded noise (or
sub-Gaussian payoffs) and the assumptions that U is K-piecewise-affine with per-piece
gradient norm bounded by L,
there exists an online learning algorithm using only realized payoffs
ut such
that
Reg(T) = Õ(L (d log K)^{1/3} K^{1/3} T^{2/3}),
where Õ(⋅) hides
polylogarithmic factors in (T, K, d) (and
standard confidence parameters).
The economic content of this expression is worth separating from the algebra: the K^{1/3} factor prices the number of distinct incentive regimes that must be compared, the (d log K)^{1/3} factor prices localizing region boundaries in d dimensions, L converts parameter error into payoff loss within a regime, and the T^{2/3} dependence is the signature of one-sample-per-round payoff feedback.
The same structural viewpoint also explains why the dependence on the outcome dimension m disappears: we never estimate qi or the full vector qi ⋅ t(x); the payoff observation is already a one-dimensional unbiased signal of U(x).
Finally, the rate is essentially unimprovable for this feedback model: even in d = 1, one can construct families of K-piecewise-affine payoff functions for which any algorithm must incur regret at least Ω(L K^{1/3} T^{2/3}) up to logarithmic terms. Since the lower bound holds in the simplest geometry, it carries over to higher dimensions.
Two caveats are important if we interpret this as guidance for contract design in practice.
Statistical rate versus computational tractability. The guarantee is information-theoretic: it shows what is learnable from payoff feedback given the structural restriction. Implementing the adaptive partition efficiently in high dimension may require additional assumptions (e.g., using special families of partitions, exploiting sparsity in A, or relying on oracles for optimization over 𝒳).
Why the T^{2/3} rate persists. With only one realized outcome per round, estimating U(x) is noisy, and boundary detection is statistically expensive. This is not a mere artifact of analysis; it is a fundamental limitation of the feedback channel. In the next section we explain how the picture changes when the principal can, at chosen times, obtain a much more accurate estimate of U(x) (for example, by evaluating the contract on a large test suite), effectively reducing the problem to stochastic optimization over a finite-complexity surface and improving the regret rate accordingly.
The T^{2/3} regret rate in the pure payoff-feedback model is not driven by geometry alone; it is driven by information scarcity. When we observe only a single realized outcome jt per posted contract, the principal’s signal about U(xt) is extremely noisy, and distinguishing “this neighborhood is a single affine regime” from “this neighborhood straddles an action switch” becomes statistically expensive.
In many practical settings, however, the principal can obtain a substantially cleaner estimate of the expected payoff of a candidate contract before committing to it at scale. Conceptually, this corresponds to an evaluation suite: after choosing a parameter x, we can run the same contract on M independent instances (or M i.i.d. agents, or M independent tasks presented to the same agent when incentive constraints are stationary) and observe the average realized payoff. This moves us from a one-sample bandit signal to a controlled-precision oracle for U(x).
Fix a query point x ∈ 𝒳.
Suppose we can obtain M i.i.d.
payoff samples
u^{(1)}(x), …, u^{(M)}(x), with u^{(s)}(x) = r_{j^{(s)}} − t_{j^{(s)}}(x),
where each j^{(s)} is drawn
from the outcome distribution induced by the agent’s best response to
x. Then the empirical
mean
$$
\widehat U_M(x) \;:=\; \frac{1}{M}\sum_{s=1}^M u^{(s)}(x)
$$
satisfies 𝔼[Û_M(x)] = U(x),
and since u^{(s)}(x) ∈ [−1, 1],
standard concentration implies sub-Gaussian noise with scale $\sigma \asymp 1/\sqrt{M}$:
Pr(|Û_M(x) − U(x)| ≥ ε) ≤ 2 exp(−c M ε^2).
Equivalently, one can phrase the model as: we have access to an oracle
that returns an unbiased estimate Û(x) with variance proxy
σ2, and achieving
accuracy ±ε costs on the order
of σ2/ε2
samples. This captures not only fixed-size suites M per round, but also an adaptive
regime where the principal chooses evaluation effort.
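A sketch of the evaluation oracle is below; the stand-in payoff distribution is hypothetical, and the confidence radius uses Hoeffding's inequality for payoffs in [−1, 1] with an illustrative failure probability δ.

```python
import numpy as np

rng = np.random.default_rng(7)

def payoff_sample(x):
    """Stand-in for one realized payoff u(x) in [-1, 1] at parameter x (purely illustrative)."""
    return float(np.clip(0.6 - 0.4 * x + rng.normal(0, 0.3), -1.0, 1.0))

def evaluate(x, M, delta=0.05):
    """Evaluation suite of size M at x: unbiased mean plus a Hoeffding confidence radius."""
    mean = np.mean([payoff_sample(x) for _ in range(M)])
    radius = np.sqrt(2.0 * np.log(2.0 / delta) / M)   # payoffs have range 2, so radius ~ sqrt(log(1/delta)/M)
    return mean, radius

for M in (10, 100, 1000):
    mean, rad = evaluate(0.5, M)
    print(f"M = {M:4d}:  U_hat = {mean:+.3f} +/- {rad:.3f}")
```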
Economically, this stronger feedback corresponds to a setting where the principal can pilot a contract: A/B testing in a platform, offline policy evaluation on logged data (when assumptions justify it), or repeated deployment to a stable pool of tasks where the induced action distribution does not drift during the evaluation window.
Under our structural assumptions, U is piecewise-affine over 𝒳 with at most K regions. The key implication is
that the global maximizer of U
must lie in (at least) one of these regions, and within any fixed region
the function has the form
U(x) = γ − g ⋅ x with ∥g∥ ≤ L.
Thus the “hard part” is not estimating a complicated smooth surface; it
is identifying which affine piece is best (and where within that piece
the maximum lies over 𝒳). With only one
noisy payoff per round, we pay heavily to resolve both value
uncertainty and boundary uncertainty. With an evaluation
suite, we can reduce value uncertainty quickly, and boundary uncertainty
becomes manageable because we can afford sharper tests of whether two
nearby points belong to the same affine regime.
This is the sense in which stronger feedback turns our problem into something much closer to stochastic optimization over finitely many structured alternatives. At a high level, we can think of each implementability region as an “expert” that recommends the best contract in that region; learning then becomes selecting among ≤ K experts, rather than performing a fully nonparametric search over 𝒳.
There is an important subtlety: we do not know the regions ex ante, and we do not observe the agent’s action. So we cannot literally index experts by actions i, nor can we enumerate the K cells directly. The algorithmic idea is instead to discover candidate regions on the fly using the same geometric logic as in the payoff-feedback case (adaptive partitioning and affine-consistency checks), but to use the evaluation oracle to make these checks sharp and to enable aggressive elimination.
A representative scheme looks like this:
Maintain a partition of 𝒳 into active cells 𝒞 (e.g., a tree of simplices or boxes). For each cell C, we keep a small set of anchor points v0, …, vd ∈ C that are affinely independent.
Estimate U
accurately at anchors using evaluation suites, producing
confidence intervals
$$
U(v_k) \in \big[\widehat U(v_k) \pm \mathrm{rad}(v_k)\big],
\qquad \mathrm{rad}(v_k) \asymp
\sigma\sqrt{\frac{\log(TK/\delta)}{n_k}},
$$
where nk
is the number of suite-evaluations performed at vk.
Test affinity inside C: choose one or more interior
points y ∈ C with
known barycentric coordinates λ(y) relative to the
anchors, and check whether
$$
U(y) \approx \sum_{k=0}^d \lambda_k(y)\, U(v_k)
$$
within the combined confidence radii. If the test fails, we split C, because it likely intersects a
boundary between regions.
Compute an upper confidence bound for the best value in
C. When C passes the affinity test, we can
fit an affine model to the anchor estimates and (approximately) optimize
it over C ∩ 𝒳 to obtain a
candidate maximizer x̃(C) and an upper bound
UCB(C). When C does not pass (or has not yet been
tested), we can always fall back on a conservative bound such as
UCB(C) = maxv ∈ Anchors(C)(Û(v) + rad(v)) + L diam(C).
Eliminate suboptimal cells: if a cell’s upper bound falls below another cell’s lower bound by more than a tolerance, discard it from the active set. This is the successive-elimination step.
The economic interpretation mirrors a standard “pilot then scale” workflow: we evaluate a small menu of promising contract neighborhoods with enough statistical power to rule out those that cannot be optimal, and we progressively concentrate evaluation budget on the remaining plausible incentive regimes.
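A compressed sketch of the elimination logic is given below, with implementability regions abstracted into a finite set of candidate contracts (for example, the representative points x̃(C) of surviving cells); the true values, the suite sizes, and the confidence constants are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)

true_U = np.array([0.30, 0.42, 0.55, 0.48, 0.20])   # hidden values of the candidate contracts

def query(k, M):
    """Evaluation-suite oracle for candidate k: unbiased estimate with radius ~ 1/sqrt(M)."""
    return true_U[k] + rng.normal(0, 1.0 / np.sqrt(M)), 2.0 / np.sqrt(M)

active = set(range(len(true_U)))
for round_ in range(6):
    stats = {k: query(k, M=200 * 2**round_) for k in active}        # sharpen every surviving candidate
    best_lcb = max(est - rad for est, rad in stats.values())
    active = {k for k, (est, rad) in stats.items() if est + rad >= best_lcb}   # drop provably worse ones
    if len(active) == 1:
        break
print("surviving candidate(s):", sorted(active))
```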
With sufficiently accurate per-query estimates, the exploration–exploitation tradeoff changes. Instead of spending many rounds to average out noise at each point, we can reduce noise quickly via suites and run an elimination-style algorithm whose dominant cost is essentially the number of alternatives that remain plausible.
Formally, in the stronger feedback model where each query returns an
unbiased estimate with sub-Gaussian scale σ, we can achieve regret on the
order of
$$
\mathrm{Reg}(T) \;=\; \tilde O\!\big(\sigma \sqrt{K T}\big),
$$
where the Õ(⋅) hides
polylogarithmic factors (confidence parameters, mild dependence on d, and bookkeeping for adaptive
partitioning). The key point is the $\sqrt{T}$ dependence: once value estimates
are cheap, the problem behaves like learning the best among ≈ K structured candidates rather
than like a continuum-armed bandit.
If the stronger feedback comes from a per-round suite of size M, then $\sigma \asymp 1/\sqrt{M}$ and the bound
becomes
$$
\mathrm{Reg}(T) \;=\; \tilde O\!\Big(\sqrt{\tfrac{K T}{M}}\Big),
$$
highlighting a direct tradeoff between evaluation effort and learning
speed.
Two qualifications matter for interpreting this regime as a prescription for real contracting problems.
Suites assume i.i.d. evaluation under a stable best response. To average M outcomes at the same x, we implicitly assume the agent’s induced action distribution is stable across the M instances and that the instances are comparable (no hidden selection effects). In platform settings, this may require randomization and careful experimental design; in offline evaluation, it may require strong assumptions to justify counterfactual estimates.
We shift burden from time to measurement/computation. The $\tilde O(\sigma\sqrt{KT})$ regret rate is attractive, but it presumes the principal can (i) run large evaluations at chosen x’s and (ii) maintain and optimize over candidate cells (often requiring an optimization oracle over 𝒳). Thus, stronger feedback improves statistical efficiency, but may increase operational complexity.
Within those limits, the message is clear: when the principal can cheaply approximate U(x), the learning problem is no longer bottlenecked by bandit noise, and the relevant complexity parameter becomes the number of implementability regimes K rather than the size of the outcome space m or the sheer richness of 𝒳.
So far we have treated the induced payoff surface U(⋅) as fixed across rounds. In
practice, this is often the wrong abstraction. The mapping from
contracts to outcomes may drift because the agent’s environment changes
(new tools, shifting demand, seasonality), because the population of
agents changes (turnover, composition effects), or because the agent
learns how to better exploit the contract template. Even if the
principal posts parameters xt from a fixed
compact set 𝒳, the effective function
we are optimizing can become time-indexed:
Ut(x) := 𝔼[ut ∣ xt = x],
where the expectation is taken under the (time-t) best response and outcome
distribution induced by x.
This section formalizes a minimal stability condition under which
learning remains possible, and it clarifies the robustness–adaptation
tradeoff that emerges when contracts are re-optimized in a drifting
environment.
Dynamic regret benchmarks the learner against a moving target,
$$
\mathrm{DReg}(T) \;=\; \sum_{t=1}^T \Big( \max_{x\in\mathcal X} U_t(x)
\;-\; \mathbb{E}[U_t(x_t)] \Big).
$$
Without any constraint on how {Ut}t = 1T
evolves, this benchmark is essentially unattainable: an adversarial
environment could make today’s maximizer uninformative about tomorrow’s,
forcing linear regret even with perfect computation. Economically, this
corresponds to a world where incentive constraints and technology can
change arbitrarily fast relative to the principal’s ability to
experiment—no learning rule can keep up.
Accordingly, we adopt a drift model that is intentionally agnostic about why primitives change (costs ci, distributions qi, even rewards r), and instead constrains the observable object that governs payoffs: the induced expected utility function Ut(⋅).
We measure nonstationarity via a sup-norm variation budget
$$
V_T \;:=\; \sum_{t=2}^T \sup_{x\in\mathcal X} \big|U_t(x) -
U_{t-1}(x)\big|.
$$
This definition has two advantages. First, it is directly tied to the
learning signal (payoffs) rather than to unobserved primitives; two very
different underlying changes are treated as equivalent if they produce
the same change in induced payoffs. Second, it allows drift that is
irregular in time—quiet periods punctuated by occasional jumps—so long
as the total movement is controlled.
We maintain our structural assumption each round: Ut is piecewise-affine over 𝒳 with at most K polyhedral regions and slope bound L. Importantly, we do not assume the partition itself is fixed. Boundaries between implementability regions may move over time as incentive comparisons change. The role of K is simply to upper bound per-round geometric complexity.
Conceptually, drift turns a one-shot learning problem into a sequence of “local” learning problems. When the environment is approximately stable for a while, we would like to run our static-regret procedure and harvest its guarantees; when the environment has moved substantially, we must pay to re-identify the best region and re-optimize within it.
A standard and robust way to operationalize this idea is a restarting algorithm: partition the horizon into blocks of length H and, at the start of each block, discard all past data and rerun the static-regret procedure from scratch within that block.
This procedure does not require us to observe the drift directly. The only design choice is the block length H: short blocks adapt quickly but waste samples re-exploring; long blocks exploit more but risk chasing a stale optimum.
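A sketch of the restart wrapper follows; the learner interface (choose/update), the toy ε-greedy learner over a one-dimensional grid, and the drifting environment are all hypothetical stand-ins for whatever static procedure and application one actually uses.

```python
import numpy as np

def run_with_restarts(make_learner, environment, T, H):
    """Restart a static-regret learner every H rounds so that stale estimates are discarded.
    make_learner() returns a fresh learner exposing choose() and update(x, payoff);
    environment(t, x) returns the realized payoff in round t. Both interfaces are hypothetical."""
    total, learner = 0.0, make_learner()
    for t in range(T):
        if t % H == 0:                       # block boundary: forget everything and re-explore
            learner = make_learner()
        x = learner.choose()
        payoff = environment(t, x)
        learner.update(x, payoff)
        total += payoff
    return total

class EpsilonGreedyKnob:
    """Toy static learner over a grid of one-dimensional contract parameters."""
    def __init__(self, grid=None, eps=0.2, seed=11):
        self.grid = np.linspace(0.0, 1.0, 11) if grid is None else grid
        self.eps, self.rng = eps, np.random.default_rng(seed)
        self.sums = np.zeros(len(self.grid)); self.counts = np.full(len(self.grid), 1e-9)
    def choose(self):
        explore = self.rng.random() < self.eps
        self.k = self.rng.integers(len(self.grid)) if explore else int(np.argmax(self.sums / self.counts))
        return self.grid[self.k]
    def update(self, x, payoff):
        self.sums[self.k] += payoff; self.counts[self.k] += 1

env_rng = np.random.default_rng(12)
def drifting_env(t, x):
    peak = 0.2 if t < 2500 else 0.8          # the best contract parameter shifts mid-horizon
    return 0.8 - abs(x - peak) + env_rng.normal(0, 0.1)

print(run_with_restarts(EpsilonGreedyKnob, drifting_env, T=5000, H=500))
```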
The proof strategy mirrors the economics of “pilot then re-pilot.” Within each block we compare performance to the best fixed contract for that block (a stability benchmark), and then we account for the gap between that blockwise benchmark and the fully dynamic optimum maxxUt(x). The variation budget controls the second gap: if {Ut} does not move much over the block, then the best fixed contract in that block is close to the best per-round contract.
Combining (i) the per-block static regret guarantee for our
piecewise-affine class and (ii) a drift-to-restart reduction yields a
dynamic regret bound of the form
DReg(T) = Õ(L (K d V_T)^{1/3} T^{2/3}),
where Õ(⋅) hides
polylogarithmic factors (confidence parameters and mild bookkeeping
terms). Two comparative statics are worth emphasizing.
When VT = 0 (stationarity), we recover the static-learning regime up to logs: the restart schedule can take H = T, and the bound reduces to the familiar T^{2/3} scaling with K and d. In other words, the drift extension does not penalize us when there is no drift.
When VT increases, the factor V_T^{1/3} captures the additional price of adaptation. Intuitively, higher variation forces more frequent “recalibration” of which incentive regime is currently optimal. This is exactly the robustness–learning tension: a learner that optimizes aggressively for the current regime must be ready to pay again when the regime changes.
It is also useful to interpret the rate in terms of time scales. A small VT means that the environment only drifts materially over long horizons, so the principal can afford longer blocks (more exploitation per relearning episode). A large VT compresses the horizon over which information remains valid, shrinking the effective sample size for identifying the best region.
Because VT is defined directly on Ut, we do not need to specify whether drift comes from qi, from ci, or from changes in the reward mapping r. This abstraction is deliberate: in many applications, the principal cannot separately identify these components anyway. At the same time, it is a limitation: the model allows drift that may be economically implausible (payoffs changing in ways that could not arise from any reasonable change in agent technology), and it does not attempt to exploit additional structure that might exist in the primitives (e.g., a parametric trend in costs).
The key maintained restriction is per-round implementability complexity: each Ut is still a K-piece piecewise-affine function with slopes bounded by L. Thus the outcome alphabet size m remains irrelevant for learning rates; drift makes learning harder by changing which piece is active, not by increasing the dimensionality of outcome distributions.
From a contracting perspective, the variation-budget result formalizes an often implicit operational norm: contracts are revised periodically, not continuously. A firm might recalibrate sales commissions quarterly, a platform might update its scoring-and-payment rule monthly, and a regulator might adjust procurement scoring rules annually. The bound above gives a stylized justification: in a drifting environment, continually chasing the estimated optimum can be wasteful because the optimum itself is moving, yet never updating leaves persistent money on the table.
There is also a subtler point about robust contracts. When VT is large, the dynamic benchmark maxxUt(x) is extremely demanding—it rewards rapid adaptation even if this requires heavy exploration. In contrast, a principal may prefer a contract that is slightly suboptimal today but performs reasonably well across plausible regimes (a kind of maximin or worst-case criterion). Our framework does not hard-code such preferences; instead it quantifies the statistical cost of pursuing the moving optimum. This makes explicit a practical design lever: one can trade off responsiveness (lower bias to current conditions) against stability (lower variance and fewer costly restarts).
Finally, two implementation issues deserve acknowledgment.
Choosing H without knowing VT. The cleanest tuning uses VT (or an upper bound) to select a block length. If VT is unknown, one can use standard meta-techniques—e.g., a grid of candidate block lengths combined with expert aggregation, or a doubling-style schedule that gradually increases restart frequency when performance degrades. These add logarithmic factors but preserve the essential scaling.
Endogenous nonstationarity. Our drift model treats $\{U_t\}$ as exogenous. In many real systems, however, the principal's learning and the agent's adaptation are intertwined: agents may change behavior in response to observed experimentation, or the population may sort in or out under different contracts. The variation budget can be read as a reduced-form way to bound these effects, but it does not explain them. Where such strategic feedback loops are central, one would want a richer dynamic model; our claim here is narrower—if the induced payoff surface does not move too much in total, then the principal can track it with provable dynamic-regret guarantees.
Taken together, the variation-budget extension shows that our geometric learning picture is not confined to fully stationary settings. As long as the induced payoff function evolves with bounded total movement, we can remain near the moving optimum at a cost that scales transparently with implementability complexity K, dimension d, slope bound L, and drift magnitude $V_T$.
Our upper bounds make a strong claim: once we restrict attention to an affine template t(x) = Ax + b, the relevant statistical complexity is not the size of the outcome alphabet m, nor even the number of underlying actions |𝒜|, but the implementability complexity K—the number of best-response regions induced on 𝒳. A natural question is whether this is merely an artifact of our analysis and algorithms. In this section we explain why it is not: even in one dimension, and even for extremely benign-looking payoff functions, any learner with only noisy payoff feedback must incur regret scaling like $K^{1/3}T^{2/3}$ (up to logarithmic factors). We then show how to embed such hard instances inside our principal–agent model using only a constant number of outcomes, which cleanly separates “hardness due to incentives” (captured by K) from “hardness due to telemetry richness” (captured by m).
The core difficulty is that when U(⋅) is piecewise-affine with K pieces, it can contain K distinct regimes separated by unknown boundaries. If the principal knew the partition, learning would be much easier: one would optimize a single affine function on the relevant cell. But the partition is induced by the agent's (unobserved) best response, and therefore must be inferred from payoff data. In the bandit feedback model (one realized payoff $u_t$ per round), inferring which regime we are in is informationally similar to identifying which “arm” is best among K alternatives when the identity of the arm depends on where we query.
A useful mental picture is the following 1D special case. Let 𝒳 = [0, 1] and suppose U(x) is piecewise-constant on K disjoint intervals $I_1, \dots, I_K$ that partition [0, 1]: $U(x) = \mu_k$ for $x \in I_k$, with $\mu_k \in [0, 1]$. Choosing $x_t$ simply selects an interval $I_k$ and yields a noisy sample of $\mu_k$. The learner does not know the $\mu_k$'s nor which interval contains which mean (since the interval boundaries themselves are unknown). This is at least as hard as a K-armed bandit problem, but with the extra complication that exploration must also localize the intervals. Lower bounds for such “regime identification” problems precisely produce the $K^{1/3}T^{2/3}$ scaling: we can neither avoid exploring many candidate regimes (a K effect) nor avoid averaging enough samples within a regime to overcome noise (a T effect).
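The following minimal Python sketch instantiates this picture: a piecewise-constant payoff on K equal-length intervals, one of which is better by a gap, observed only through single noisy payoffs. The gap size, noise level, and interval layout are illustrative assumptions, not part of the formal construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hard_instance(K, gap, rng):
    """K equal-length intervals on [0,1]; one (hidden) interval has mean higher by `gap`."""
    good = rng.integers(K)
    means = np.full(K, 0.5)
    means[good] += gap
    return means

def pull(x, means, rng, noise=0.1):
    """Bandit feedback: choosing x reveals one noisy payoff U(x) + eta, nothing else."""
    K = len(means)
    k = min(int(x * K), K - 1)        # which interval x falls into
    return means[k] + noise * rng.standard_normal()

# Example: the learner only sees scalar payoffs, so telling the good interval apart
# from the others requires on the order of 1/gap^2 samples per candidate interval.
K, T = 10, 5000
gap = (K / T) ** (1 / 3)              # the scale at which instances become hard
means = make_hard_instance(K, gap, rng)
ys = [pull(rng.random(), means, rng) for _ in range(20)]
```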
We state a representative lower bound in a simplified statistical model that captures the feedback structure we face: on each round t, the learner chooses $x_t \in [0, 1]$ and observes a noisy payoff $y_t = U(x_t) + \eta_t$, where $\eta_t$ is mean-zero sub-Gaussian noise (or simply bounded noise). The benchmark is $\max_{x \in [0, 1]} U(x)$.
Consider the class ℱ(K, L) of functions U : [0, 1] → ℝ that are piecewise-affine with at most K pieces and slope bounded by L. A standard construction (which we can think of as “K disjoint wells with one slightly higher than the others”) yields, up to logarithmic factors,
$$
\inf_{\text{alg}} \; \sup_{U \in \mathcal{F}(K, L)} \; \mathbb{E}[\mathrm{Reg}(T)] \;\ge\; \Omega\!\left(L\, K^{1/3} T^{2/3}\right).
$$
The proof idea is more important than the exact constants. We choose K disjoint intervals of equal length and define U to be (nearly) flat on each interval. One interval—unknown to the learner—is “good” with mean larger by a gap Δ, and the remaining K − 1 intervals are “bad.” If the learner fails to spend enough rounds sampling from the good interval, it suffers regret roughly Δ per round. But to reliably tell which interval is good under noise, the learner needs on the order of $1/\Delta^2$ samples from each plausible interval. Balancing these two forces gives the usual bandit-style tradeoff:
$$
\text{exploration cost} \;\approx\; \frac{K}{\Delta^2},
\qquad
\text{exploitation cost} \;\approx\; \Delta T.
$$
Optimizing over Δ yields $\Delta \asymp (K/T)^{1/3}$ and regret $\asymp K^{1/3}T^{2/3}$.
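For completeness, the balancing step can be written out explicitly (constants suppressed):
$$
\frac{d}{d\Delta}\Bigl(\frac{K}{\Delta^{2}} + \Delta T\Bigr) \;=\; -\frac{2K}{\Delta^{3}} + T \;=\; 0
\;\;\Longrightarrow\;\;
\Delta \asymp \Bigl(\frac{K}{T}\Bigr)^{1/3},
\qquad
\frac{K}{\Delta^{2}} \;\asymp\; \Delta T \;\asymp\; K^{1/3} T^{2/3}.
$$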
Two remarks are worth stressing.
The lower bound already holds for piecewise-constant functions, which are a special case of piecewise-affine functions (take slope 0 on each cell). Thus, allowing nonzero slopes does not create the hardness; the hardness comes from the unknown regime boundaries.
The bound does not rely on high dimension: it holds even when d = 1. Higher-dimensional parameter spaces can only make the inference problem harder, but d is not the driver of the $T^{2/3}$ phenomenon.
This argument is also the reason we should not expect $\sqrt{T}$-type regret under single-sample payoff feedback without stronger structural assumptions (e.g., global concavity, Lipschitz continuity without kinks, or a margin condition guaranteeing that regime boundaries are rarely near-optimal). In our setting, kinks are endogenous outcomes of incentive constraints, and in the worst case they can appear anywhere.
The previous lower bound is stated directly in terms of payoff functions U(⋅). To justify our emphasis on K as an implementability parameter, we also want to know that such hard U(⋅) can actually arise from a principal–agent environment with affine contracts and best responses.
This embedding is conceptually straightforward. In one dimension (d = 1), the agent's expected utility for action i under parameter x is affine:
$$
U_A(i \mid x) \;=\; T_i(x) - c_i \;=\; (q_i \cdot A)\, x + (q_i \cdot b - c_i).
$$
Thus each action corresponds to a line in x, and the best response $i^*(x)$ is the upper envelope of these lines. By choosing a family of actions whose lines form an upper envelope with K segments, we induce exactly K best-response regions. On the interior of each region, the principal's expected utility is affine as well:
$$
U(x) \;=\; R_{i^*(x)} - q_{i^*(x)} \cdot (Ax + b).
$$
By selecting rewards and payments so that U is nearly flat on each region and differs slightly across regions, we can match the “one good interval” construction above. Importantly, none of this requires a large outcome space: we can realize distinct slopes $q_i \cdot A$ even with m = 2 outcomes by varying the Bernoulli probability $q_{i1} \in [0, 1]$, since $q_i$ is then determined by a single scalar. In other words, the lower bound can be implemented with constant m, while K grows through the geometry of best responses.
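A small Python sketch makes the envelope picture concrete: given hypothetical $(q_i, c_i)$ with m = 2 outcomes and a one-dimensional template, it computes the agent's best response on a grid and counts the induced regions. The specific numbers are illustrative.

```python
import numpy as np

def best_response_regions(q, c, A, b, xs):
    """For d = 1: each action i gives the agent an affine utility
    (q_i . A) x + (q_i . b - c_i); the best response is the upper envelope.
    Returns the number of distinct best-response regions along the grid xs."""
    slopes = q @ A                                   # shape (n_actions,)
    intercepts = q @ b - c
    utilities = np.outer(xs, slopes) + intercepts    # (n_grid, n_actions)
    br = utilities.argmax(axis=1)                    # best response at each x
    return 1 + int(np.count_nonzero(np.diff(br))), br

# Illustrative instance with m = 2 outcomes: q_i = (p_i, 1 - p_i) is Bernoulli,
# so each action's slope is determined by the single scalar p_i.
p = np.array([0.1, 0.4, 0.7, 0.95])
q = np.column_stack([p, 1 - p])
c = np.array([0.00, 0.05, 0.15, 0.30])               # costs increasing in "effort"
A = np.array([1.0, 0.0])                             # pay only on outcome 1: t(x) = A x + b
b = np.array([0.0, 0.0])

xs = np.linspace(0.0, 1.0, 2001)
K_est, br = best_response_regions(q, c, A, b, xs)
print("grid-estimated number of best-response regions K =", K_est)
```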
This is the cleanest sense in which the hardness is about incentive-induced non-smoothness rather than about estimating a high-dimensional distribution over outcomes. The agent's private information ($c_i$ and $q_i$) only matters insofar as it shapes where the envelope switches, i.e., how many distinct implementability cells exist and how hard they are to localize.
Because the hard instance can be implemented with m = 2, any regret lower bound that grows with m cannot be fundamental for our template-based problem—it would reflect a different modeling choice (for example, insisting on learning $q_i$ in full, or allowing contracts to condition arbitrarily on all m outcomes). Our results—and the lower bound above—support a sharper message:
If the template t(x) = Ax + b compresses outcomes into a low-dimensional statistic, then learning rates should depend on the induced geometric complexity K, not on the raw number of bins m.
Conversely, increasing m does not necessarily make learning harder unless it increases the number of implementability regions (for example, by allowing the contract to respond to many outcomes in ways that create many distinct incentive comparisons).
This distinction matters in practice. A platform might collect thousands of telemetry features, but if it commits to paying as a function of a low-dimensional score, then the relevant learning problem is governed by the score’s induced regime structure, not by the full feature space.
Lower bounds are ultimately design guidance: they tell us which “knobs” cannot be turned away by better algorithms. Here, the unavoidable dependence on K says that overly expressive templates can be statistically expensive, even if they look innocuous from a mechanism-design perspective. Each additional degree of freedom in x and each additional way of mapping outcomes into payments can create new best-response regions, increasing K and thereby increasing the exploration cost required to locate the optimal regime.
That does not mean “simpler is always better.” A richer template may raise $\max_{x \in \mathcal{X}} U(x)$ by enabling better incentives; the point is that richer templates also raise the price of learning this optimum from data. In operational terms, we should read K as a form of endogenous capacity control: it measures how many qualitatively distinct behavioral responses the template can induce. The next section takes this perspective seriously and discusses how one might choose templates to keep K small (or at least controlled) while retaining enough flexibility to perform well—connecting implementability complexity to more familiar notions of statistical capacity such as pseudo-dimension.
The lower bound in the previous section is not merely a pessimistic statement about bandit feedback; it is a design constraint. Once we commit to learning a contract from data—rather than computing it from a fully specified model—the contract class itself becomes an object of statistical choice. Our results suggest that the right “complexity knob” is not the outcome alphabet size m nor the number of latent actions |𝒜|, but the implementability complexity K: the number of best-response regions induced on 𝒳 by the affine template t(x) = Ax + b. Put differently, a template is not only a mapping from outcomes to payments; it is also a mapping from the agent's private model $(q_i, c_i)$ into a particular geometry of kinks in U(⋅). Learning performance is governed by that geometry.
Contract designers typically reason in terms of expressiveness: a richer template (larger d, more degrees of freedom in A, more outcome-contingencies) can implement a wider set of incentives and potentially increase $\max_{x \in \mathcal{X}} U(x)$. Our regret bounds emphasize a second, orthogonal axis: learnability. Richer templates can increase K, which raises the exploration cost required to identify the right regime. This generates an endogenous bias–variance style tradeoff: richer template ⇒ higher attainable optimum, but larger K ⇒ slower learning.
The point is not that one should minimize K in isolation. Rather, K is the “price of adaptation”: if we anticipate limited data, short horizons, or noisy outcomes, then a template that keeps K controlled may dominate a more expressive alternative even if the latter has a higher first-best value.
This perspective aligns with practical contracting constraints. Platforms and organizations rarely pay on a fully outcome-contingent vector in $\mathbb{R}^m$; they commit to a scorecard, a rubric, or a small set of performance indicators. Our framework gives a sharp reason: low-dimensional templates compress the incentive comparisons the agent can exploit, thereby compressing the number of distinct best-response regimes the principal must learn.
A helpful way to operationalize “keep K small” is to revisit what drives best-response boundaries under affine templates. Incentive comparisons depend on the projected statistics $w_i := A^{\top} q_i \in \mathbb{R}^d$ and the intercept terms $q_i \cdot b - c_i$. When many actions share the same (or similar) $w_i$, the template makes them hard to distinguish from the agent's perspective, and the upper envelope has fewer distinct linear forms competing. In the most extreme case, if A projects all $q_i$'s onto a one-dimensional score and that score takes only a few distinct values across economically relevant actions, then K is small even if |𝒜| is large.
This suggests a concrete design heuristic: choose A to emphasize outcome directions that are (i) strongly aligned with the principal's rewards r and (ii) likely to separate “good” from “bad” behavior, but avoid adding directions that merely create fine-grained distinctions among many near-equivalent behaviors. Since the principal does not know $\{q_i\}$, this cannot be done perfectly; still, domain knowledge often provides coarse priors. For example, in fraud prevention, paying on a small number of robust fraud signals may induce only a few qualitatively different response regimes, whereas paying on hundreds of correlated telemetry features can create many knife-edge switches among manipulation strategies, inflating K without commensurate gains in value.
Formally, our arrangement-style bound $K \le O(M^d)$, where M is the number of distinct $w_i$'s, should be read as guidance: it is not the raw action count that matters, but how many distinct incentive-relevant projections the template creates. A template that is expressive in the ambient outcome space can still be learnable if it induces a small M (and hence small K) over the relevant action set.
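As a rough illustration of this reading, the following sketch counts the distinct projected statistics $w_i = A^{\top} q_i$ for a hypothetical action set in which many actions differ only in outcome dimensions the score ignores. The instance, the rounding tolerance, and the sizes are assumptions.

```python
import numpy as np

def count_distinct_projections(A, Q, tol=1e-6):
    """Count M = number of distinct projected statistics w_i = A^T q_i (rows of Q are q_i).
    Only distinct projections create distinct incentive comparisons; with d template
    parameters the arrangement-style bound then reads K = O(M^d)."""
    W = Q @ A                                  # shape (n_actions, d)
    W_rounded = np.round(W / tol) * tol        # merge numerically equal projections
    return np.unique(W_rounded, axis=0).shape[0]

# Illustrative: many actions that differ only in outcome dimensions the score ignores.
rng = np.random.default_rng(1)
m, n_actions, scored = 50, 100, 10             # the score weights only the first 10 outcomes
levels = np.array([0.2, 0.5, 0.8])             # 3 possible total masses on scored outcomes
Q = np.zeros((n_actions, m))
for i in range(n_actions):
    mass = levels[i % 3]
    Q[i, :scored] = mass / scored                                  # uniform over scored outcomes
    Q[i, scored:] = rng.dirichlet(np.ones(m - scored)) * (1 - mass)
A = np.concatenate([np.ones(scored), np.zeros(m - scored)]).reshape(m, 1)  # d = 1 score
M = count_distinct_projections(A, Q)
print("distinct incentive-relevant projections M =", M)   # 3, despite 100 actions and m = 50
```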
A common intuition is that collecting more telemetry (larger m) should make contracting easier because it gives the principal more signals to condition on. Our results cut against that intuition in a precise way. Telemetry only helps if it is used to build a template that improves payoffs without exploding K. If additional features simply provide more levers for the contract to create finely separated expected payments $T_i(x) = q_i \cdot (Ax + b)$, then they can proliferate best-response regions, leading to slower learning under bandit feedback.
This is a cautionary point for practice: “high-dimensional scoring” can be counterproductive when contracts must be learned online. It may raise the theoretical optimum, but it also makes the induced payoff surface more non-smooth and thus statistically more expensive to navigate. Conversely, committing to a simple score—say, a scalar $s(j)$ and an affine payment $t(j) = \alpha s(j) + \beta$—often keeps K manageable, and may therefore deliver higher realized welfare over a finite horizon.
Although K is an implementability object (a property of the best-response partition), it is closely connected to familiar capacity measures from statistical learning. One way to see this is to view the induced payoff U(⋅) as belonging to a function class 𝒰 consisting of piecewise-affine functions over 𝒳 with at most K pieces and bounded slopes L. Such classes have pseudo-dimension growing with both the ambient parameter dimension d and the number of pieces K. Informally, each piece contributes degrees of freedom (an affine form), while the unknown placement of boundaries adds combinatorial richness. This is exactly the mixture of “parametric” and “combinatorial” complexity that shows up in our regret rates through factors like $(d \log K)^{1/3} K^{1/3}$.
The practical lesson is that implementability complexity functions like a capacity control for contracts. Template design choices that increase K expand the effective hypothesis class the principal must search over using noisy payoff samples. This perspective also suggests why naive structural risk minimization over very rich templates can fail under bandit feedback: the cost of identifying which region one is in is not captured by within-region smoothness alone.
At the same time, the link to pseudo-dimension is constructive. It implies that we can borrow tools from learning theory—regularization, nested classes, and complexity-penalized model selection—to choose between templates. For instance, one can organize templates $\{\mathcal{T}_\ell\}_{\ell=1}^{L}$ of increasing expressiveness (and plausibly increasing $K_\ell$), and allocate exploration to identify the best level ℓ given the horizon T. Even rough upper bounds on $K_\ell$ can serve as penalties that prevent overfitting through excessive regime complexity.
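As a stylized illustration (not an algorithm we analyze), one could score nested templates by a pilot value estimate minus a regret-style penalty built from crude upper bounds on $K_\ell$. The penalty form mirrors the rates above expressed per round; the constant and all inputs below are assumptions.

```python
import math

def select_template(value_estimates, K_upper, dims, T, c=1.0):
    """Pick the template level whose estimated value minus a regret-style penalty is largest.

    value_estimates[l]: pilot estimate of max_x U(x) under template l (e.g., from a short
    exploration phase); K_upper[l]: an upper bound on its implementability complexity;
    dims[l]: its parameter dimension. The per-round penalty mirrors the
    (d log K)^{1/3} K^{1/3} T^{2/3} scaling; the constant c is an assumption.
    """
    def penalty(l):
        K, d = K_upper[l], dims[l]
        return c * ((d * math.log(max(K, 2))) ** (1 / 3)) * (K ** (1 / 3)) * (T ** (-1 / 3))
    scores = [value_estimates[l] - penalty(l) for l in range(len(value_estimates))]
    return max(range(len(scores)), key=scores.__getitem__)

# Example: the richest template promises a slightly higher optimum but far more regions,
# so a middle level wins at this horizon.
best = select_template(value_estimates=[0.50, 0.63, 0.64],
                       K_upper=[4, 10, 4000], dims=[1, 2, 6], T=20_000)
print("chosen template level:", best)
```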
We distill several design principles that follow from the theory.
First, prefer low-dimensional parameterizations unless there is a clear incentive reason to expand d. Increasing d improves expressiveness, but it also increases the space in which hyperplane arrangements can create many regions. In settings where online adaptation is critical, a small number of interpretable parameters (a scalar intensity, a few tier thresholds, or a small vector of weights on stable metrics) often provides a better learnability–performance tradeoff.
Second, avoid templates that create many near-ties among actions. When the agent’s utility lines are close, small changes in x can trigger frequent action switches, effectively increasing K in the portion of 𝒳 the learner explores. Practically, this motivates “smoother” contract families—e.g., restricting weight magnitudes, limiting the number of outcome bins that receive distinct payments, or imposing monotonicity/shape constraints on t(⋅)—not because smoothness is intrinsically good, but because it reduces the proliferation of best-response boundaries.
Third, design around robust regimes, not razor-thin optima. Even if worst-case theory allows kinks anywhere, many applications admit margin-like regularities: the optimal contract may lie well inside a region where the best response is stable. Template choices that enlarge such stable interiors (for example, by compressing irrelevant telemetry dimensions) can dramatically reduce the effective difficulty, even if worst-case K remains large.
Finally, treat template choice as a joint economic–statistical decision. If a rich template is necessary to implement a materially better action (higher $R_i$ at acceptable payments), it may be worth paying the learning cost. But when incremental expressiveness merely reshuffles behavior among many similar actions, it is likely to raise K without raising value, and should be avoided.
We emphasize that K is typically unknown ex ante because it depends on the agent's latent $(q_i, c_i)$. Template design therefore cannot literally “minimize K” in a vacuum; instead, it should aim to control worst-case or plausible-case regime complexity using structural restrictions and domain knowledge. Moreover, our analysis is tailored to single-agent, single-contract interactions with hidden actions and bandit feedback. Once we allow menus of contracts, audits, multiple agents, or richer feedback channels, the mapping from template design to learnability can change qualitatively.
These considerations point directly to the broader agenda: understanding which modeling enrichments genuinely mitigate the hidden-action feedback bottleneck, and which simply move complexity elsewhere. We turn to these questions next.
Our analysis highlights a stark but useful message: when the principal observes only realized outcomes (and hence only realized payoffs), the statistical difficulty of contract learning is driven by behavioral discontinuities—the kinks induced by the agent’s best response—rather than by the raw size of the outcome space. This section discusses what we see as the main limitations of the hidden-action feedback model, and outlines several extensions that are both economically natural and technically open.
The core friction is not that the principal faces noise per se, but that the principal faces endogenous noise generated by unobserved action switching. Even if outcomes are perfectly observed, the mapping x ↦ j is mediated by $i^*(x)$, which can change abruptly. This places our setting closer to partial-monitoring and piecewise-structured bandits than to smooth stochastic optimization.
A useful way to phrase the limitation is: the principal does not get local comparative statics “for free.” In a smooth model, trying x and x′ gives information that can be interpolated across the neighborhood. Here, a small perturbation can move us to a different best-response region, so local interpolation is hazardous exactly where the optimum tends to sit (near binding incentive constraints). This is why the worst-case regret remains on the order of $T^{2/3}$: the principal must spend nontrivial exploration budget locating the relevant region before it can even exploit the affine structure inside it.
An open theoretical direction is to identify economically interpretable conditions under which the bottleneck disappears. Two candidates are particularly natural.
Margin (robustness) conditions. Suppose there exists an optimal contract $x^\star$ such that the agent's best response is separated by a uniform gap:
$$
\bigl(q_{i^\star} \cdot t(x^\star) - c_{i^\star}\bigr) \;-\; \max_{i \neq i^\star}\bigl(q_i \cdot t(x^\star) - c_i\bigr) \;\ge\; \gamma.
$$
Intuitively, if γ is large, then nearby experimentation does not change behavior, and the environment becomes locally “smooth.” Can one convert such a margin into fast rates (e.g., $\tilde O(\sqrt{T})$) while still allowing worst-case kinks elsewhere? (A small computational sketch of this margin check appears after the next item.)
Smoothed (perturbed) instances. In many practical settings, costs and outcome distributions are not adversarially aligned to create knife-edge ties across a large fraction of 𝒳. A smoothed analysis in which $(q_i, c_i)$ are perturbed may imply that “effective” K is much smaller than worst-case arrangement bounds, yielding substantially better typical performance. Formalizing this would help reconcile pessimistic minimax rates with the empirical success of simple scoring rules.
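As referenced above, here is a minimal diagnostic for the margin condition: given hypothesized primitives $(q_i, c_i)$ and a candidate $x^\star$, it computes the gap γ between the best and second-best action. It is purely illustrative, since in the learning problem these primitives are unobserved.

```python
import numpy as np

def best_response_margin(x_star, q, c, A, b):
    """Margin gamma at x_star: agent utility of the best action minus the best alternative.
    q: (n_actions, m) outcome distributions; c: (n_actions,) costs; t(x) = A x + b.
    Purely diagnostic: in the learning problem (q, c) are not observed."""
    t = A @ x_star + b                      # payment vector over outcomes
    u = q @ t - c                           # agent utility per action
    order = np.argsort(u)
    return u[order[-1]] - u[order[-2]]      # gap between best and second-best action

# Illustrative check on an m = 2, d = 1 instance like the embedding above.
p = np.array([0.1, 0.4, 0.7, 0.95])
q = np.column_stack([p, 1 - p]); c = np.array([0.0, 0.05, 0.15, 0.30])
A = np.array([[1.0], [0.0]]); b = np.zeros(2)
gamma = best_response_margin(np.array([0.8]), q, c, A, b)
print("margin at x* = 0.8:", round(float(gamma), 3))
```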
Many real principals do not post a single contract but a menu: a set $\{x^{(1)}, \dots, x^{(M)}\} \subset \mathcal{X}$, after which the agent selects a contract and then an action. Menus can strictly improve implementability: they allow screening across heterogeneous agents, and even for a single agent they can separate incentive constraints that collide under a single contract.
From a learning perspective, however, menus create a new complexity object. The agent's decision rule becomes
$$
(\ell^*, i^*) \;\in\; \arg\max_{\ell \in [M],\, i \in \mathcal{A}} \bigl(q_i \cdot t(x^{(\ell)}) - c_i\bigr),
$$
so the induced payoff is piecewise-affine over $\mathcal{X}^M$, but with a combinatorial explosion in potential regions (now indexed by contract–action pairs). Whether menus reduce or increase the relevant complexity is ambiguous: a menu could collapse many actions into a few stable regimes (helping learning), or it could introduce new boundaries due purely to contract choice (hurting learning).
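A minimal sketch of the agent's joint contract–action choice makes this combinatorial structure explicit; the menu and primitives below are illustrative.

```python
import numpy as np

def menu_best_response(menu, q, c, A, b):
    """Agent picks (contract index l, action i) maximizing q_i . t(x^(l)) - c_i.
    menu: (M, d) array of posted parameters; q: (n_actions, m); c: (n_actions,)."""
    payments = menu @ A.T + b                  # (M, m): payment vector of each menu element
    U = payments @ q.T - c                     # (M, n_actions): agent utility per (l, i) pair
    l_star, i_star = np.unravel_index(np.argmax(U), U.shape)
    return int(l_star), int(i_star)

# Illustrative m = 2, d = 1 instance with a two-element menu.
p = np.array([0.1, 0.4, 0.7, 0.95])
q = np.column_stack([p, 1 - p]); c = np.array([0.0, 0.05, 0.15, 0.30])
A = np.array([[1.0], [0.0]]); b = np.zeros(2)
menu = np.array([[0.2], [0.9]])                # two posted contracts x^(1), x^(2)
print(menu_best_response(menu, q, c, A, b))
```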
We view the main open problem as defining and bounding an appropriate “menu implementability complexity” $K_M$, and relating it to achievable regret. A promising direction is to treat each menu element as creating a “candidate regime,” and to analyze when increasing M reduces the number of regimes that compete near the optimum. Conceptually, menus may act like a discretization of 𝒳 chosen endogenously by the principal, suggesting a connection to adaptive discretization and hierarchical bandits.
Organizations often possess costly mechanisms—audits, spot checks, manual reviews, or instrumented test tasks—that partially reveal the agent’s action or intermediate signals. These mechanisms are central in practice precisely because pure outcome-based incentives are manipulable.
A natural extension is an audit-augmented model: in each round, the principal chooses $x_t$ and also whether to audit, paying cost κ, and receiving additional feedback (e.g., the realized action $i_t$, or a noisy proxy). This creates an explicit economic tradeoff between incentive payments and information acquisition.
Technically, this resembles bandits with side observations or active learning with query costs. The central question becomes: how many audits are needed to reduce regret from $T^{2/3}$ to $\tilde O(\sqrt{T})$, and how should audits be scheduled? One conjecture is that audits are most valuable near suspected boundaries between best-response regions, where payoff observations are least informative. This suggests algorithms that maintain uncertainty sets over regions and allocate audits to resolve ambiguity only when it affects the argmax of U(⋅). A clean theory would characterize regret as a function of an audit budget B (or total audit cost $B\kappa$), interpolating between pure bandit learning and full-information region identification.
Our baseline model effectively treats the agent as fixed over time. In many platforms, the principal interacts with a population of agents, with types (q, c) drawn repeatedly. This changes the object being learned: rather than a single best-response partition, the principal faces an aggregation over types, leading to an expected utility
$$
\bar U(x) \;=\; \mathbb{E}_{(q, c)}\bigl[\, R_{i^*(x;\, q, c)} - q_{i^*(x;\, q, c)} \cdot t(x) \,\bigr].
$$
Even if each individual best response is discontinuous, the population average may be smoother, potentially allowing faster optimization. At the same time, heterogeneity introduces new concerns: a contract that performs well on average may have undesirable distributional effects (e.g., pushing some agents into inefficient actions), and fairness or participation constraints may bind.
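A simple Monte Carlo sketch of this population-averaged payoff, under an assumed type distribution, illustrates how averaging over heterogeneous best responses can smooth the induced surface; all primitives here are hypothetical.

```python
import numpy as np

def population_payoff(x, type_sampler, A, b, R, n_samples=2000, seed=0):
    """Monte Carlo estimate of U_bar(x) = E_{(q,c)}[ R_{i*} - q_{i*} . t(x) ],
    where i* is the sampled agent's best response to t(x) = A x + b.
    `type_sampler(rng)` returns (q, c) for one agent; R[i] is the principal's reward
    from action i. All primitives here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    t = A @ x + b
    total = 0.0
    for _ in range(n_samples):
        q, c = type_sampler(rng)            # q: (n_actions, m), c: (n_actions,)
        i_star = np.argmax(q @ t - c)       # this agent's best response to the contract
        total += R[i_star] - q[i_star] @ t
    return total / n_samples

# Illustrative heterogeneous population: costs are scaled by a random type factor.
p = np.array([0.1, 0.4, 0.7, 0.95]); base_c = np.array([0.0, 0.05, 0.15, 0.30])
q_fixed = np.column_stack([p, 1 - p]); R = np.array([0.1, 0.5, 0.9, 1.2])
sampler = lambda rng: (q_fixed, base_c * rng.uniform(0.5, 1.5))
A = np.array([[1.0], [0.0]]); b = np.zeros(2)
print(round(population_payoff(np.array([0.6]), sampler, A, b, R), 3))
```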
Open problems here include (i) deriving regret bounds for learning under i.i.d. or drifting populations, (ii) understanding whether the relevant complexity is the expected number of active regions under the type distribution rather than worst-case K, and (iii) incorporating participation and retention dynamics, where today’s contract changes tomorrow’s pool of agents.
In many applications (gig platforms, ad exchanges, procurement), principals compete. Agents can choose among principals, making the effective environment endogenous: a principal’s contract affects not only effort but also selection. This raises questions that are simultaneously economic (equilibrium and welfare) and statistical (learning under strategic feedback).
A basic model would let principals post $x_t^p$ and agents select a principal p and an action i. Then each principal observes outcomes only for participating agents, yielding censored, selection-biased feedback. The open challenge is to characterize when no-regret dynamics converge to meaningful equilibria, and how implementability complexity interacts with market thickness and competition. In practice, these settings are where template simplicity is often most valuable—precisely because competitive pressure limits the time available for exploration.
Finally, we see empirical validation as essential, because the central complexity parameter K is defined by an unobserved best-response partition. We propose a three-part empirical program.
Synthetic ground-truth experiments. Construct environments with known $(q_i, c_i)$, vary the template dimension d and the induced number of regions K, and compare learning curves across algorithms. Because K can be computed exactly in small instances (via arrangement enumeration or Monte Carlo boundary tracing), these experiments can directly test the predicted scaling in K and T, and can evaluate whether “effective K” is smaller under random perturbations (smoothed instances).
Semi-synthetic replay on observational data. In platform settings with logged outcomes and rewards, we can define a family of affine templates t(x) and use off-policy evaluation tools to estimate U(x) for a subset of x’s, then simulate online learning via replay. While this does not solve the hidden-action identification problem (the true best response remains latent), it allows us to test whether piecewise-affine approximations and region-based exploration capture real variation in payoffs.
Field tests with richer feedback variants. Where feasible, implement controlled “many-testcase” evaluations (or audit-like checks) for a small fraction of rounds to approximate unbiased estimates of U(x). The theory predicts a sharp improvement when such feedback is available. A practical design is to run a hybrid policy: mostly standard online learning from realized payoffs, with occasional intensive evaluation at candidate x’s to accelerate region elimination.
Across all three, we would treat K not as a fixed known constant but as a latent feature to be estimated or upper bounded from observed switching behavior and sensitivity to perturbations in x. This would also inform a practical template-selection pipeline: start with simple templates, expand expressiveness only when evidence suggests that the induced payoff surface is stable enough to justify richer structure.
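One concrete way to upper bound K from synthetic ground truth is boundary tracing along a slice of the parameter space: count how often the (computable) best response switches on a fine grid. The sketch below does this for an m = 2 instance like the one used earlier; on real data, switches would instead have to be inferred from payoff jumps or behavioral changes under small perturbations of x.

```python
import numpy as np

def trace_regions_1d(best_response, lo=0.0, hi=1.0, n_grid=4001):
    """Estimate the number of best-response regions along a 1D slice of the parameter
    space by counting switches of `best_response(x)` on a fine grid. In synthetic
    experiments the best response can be computed exactly; this serves as a grid-based
    upper-bounding procedure for K along the slice."""
    xs = np.linspace(lo, hi, n_grid)
    labels = np.array([best_response(x) for x in xs])
    return 1 + int(np.count_nonzero(labels[1:] != labels[:-1]))

# Synthetic ground truth: an m = 2 embedding whose regions we recover along x in [0, 1].
p = np.array([0.1, 0.4, 0.7, 0.95]); c = np.array([0.0, 0.05, 0.15, 0.30])
q = np.column_stack([p, 1 - p]); A = np.array([1.0, 0.0]); b = np.zeros(2)
br = lambda x: int(np.argmax(q @ (A * x + b) - c))
print("estimated K along the slice:", trace_regions_1d(br))
```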
In sum, the main conceptual limitation of hidden-action learning is the discontinuity induced by incentives. The main opportunity is that contract design itself can create or remove discontinuities—through menus, audits, population structure, and market institutions. Understanding these interactions is, in our view, the most promising route toward a contract theory that is simultaneously implementable, learnable, and empirically testable.