In 2026, it has become increasingly clear that the ambient outcome dimension m—the number of observable telemetry bins, evaluation categories, or performance states—is usually the wrong complexity parameter for repeated contracting problems. Modern principals routinely observe and log thousands (or millions) of outcome features: multi-metric scorecards for AI systems, granular event streams in gig platforms, compliance signals in regulated lending, and fine-grained quality measures in procurement. Yet the contract itself is almost never allowed to depend freely on all of these coordinates. For reasons of auditability, legal compliance, operational simplicity, and governance, real-world payment rules are typically constrained to a low-dimensional template: a handful of weights on a handful of metrics, perhaps with caps, floors, or simple affine transformations. The core methodological point of this paper is that learnability should be indexed by the effective complexity of the incentive response induced by the template, not by the raw number of outcome bins recorded by the principal.
To see why m is a misleading yardstick, it helps to separate “measurement richness” from “contractual degrees of freedom.” A principal may observe a very high-dimensional outcome j (or a discretization thereof) and may even have a reward model that maps that outcome to value. But if the principal is only permitted to choose a contract from a restricted family, then the agent’s incentives depend on that family through a small set of parameters. In other words, there is an asymmetry: the outcome space may be huge, but the policy lever is small. Treating m as the complexity parameter implicitly assumes that the principal can set an arbitrary payment tj for each outcome j, i.e., can choose from the full simplex of outcome-contingent transfers. That assumption is both unrealistic and analytically pessimistic: it pushes us into a regime where learning is dominated by a curse of dimensionality that has little to do with the actual institutional constraints faced by principals.
This pessimism is not merely aesthetic; it is formalized by negative learnability results for general contracts. When the principal can post an arbitrary outcome-contingent payment vector, the learning problem resembles nonparametric optimization of an unknown function observed through extremely sparse feedback: each round reveals only one realized outcome and one realized payoff. With a hidden action and an unobserved mapping from actions to outcome distributions, the principal faces an identification bottleneck. Intuitively, changing one coordinate of the payment vector affects incentives only through the probability that the induced action places on that outcome—probabilities which themselves are unknown and entangled with the agent’s best-response behavior. In the worst case, these entanglements allow many distinct actions to be separated only by tiny differences in high-dimensional outcome probabilities, so that distinguishing “good” from “bad” contracts requires exploration that scales with m (or worse). Put differently, if we allow contracts to vary freely across m outcomes, then we must pay an information-theoretic price for learning a high-dimensional object under bandit feedback. The resulting regret rates and sample complexities scale poorly and, more importantly, predict failure precisely in settings where practice has nonetheless converged on workable low-dimensional contracting schemes.
Our starting point is therefore a reframing: rather than parameterizing difficulty by the number of outcomes, we parameterize it by the number of distinct implementable incentive regimes that can arise within the chosen template. The key object is the principal’s induced expected utility surface over the parameter space. Even when outcomes are high-dimensional and the agent’s action set is large, the principal does not need to learn the full mapping from actions to outcome distributions. To optimize within a restricted family, the principal only needs to learn enough to compare the finitely many payoff “pieces” that arise as the agent switches best responses. This perspective leads us to what we call implementability complexity, denoted K: the number of regions in parameter space on which the agent’s best response is constant (equivalently, the number of affine pieces of the induced utility function). In our view, K is the correct analogue of the number of “effective arms” in the problem: it counts how many qualitatively different behaviors the template can elicit, not how many outcomes the principal can record.
This lens also helps reconcile two facts that otherwise appear in tension. First, outcome spaces in modern applications are exploding, and richer telemetry is generally viewed as beneficial for monitoring and evaluation. Second, contractual governance is becoming more constrained, not less: institutions increasingly require that payment rules be interpretable, stable, and auditable. The implementability complexity framework says that these trends are not contradictory. Rich telemetry can improve the reward model (how the principal values outcomes) and can support ex post auditing, while the contract can remain low-dimensional. Learnability then depends on how many distinct best-response regimes that low-dimensional contract can induce. In favorable cases, K can remain modest even as m grows, so the principal’s learning problem remains tractable.
We build intuition for K through two complementary channels. The first is geometric. Under an affine contract template, each action’s expected utility to the agent is affine in the contract parameters, and the agent’s best response is the upper envelope of these affine functions. The parameter space is therefore partitioned by the hyperplanes where the agent is indifferent between two actions. Each cell of this partition corresponds to a region where a particular action is uniquely optimal (given tie-breaking), and within that region the principal’s expected utility is itself affine. Thus the induced objective is piecewise-affine, with kinks only at action-switching boundaries. The number of cells in this arrangement—our K—captures how many times the identity of the induced action can change as the principal varies the contract parameters. This is a much finer measure than the cardinality of the action set: many nominal actions may never be optimal under the template, and many may be indistinguishable because they generate the same projected statistics relevant for incentive comparisons.
The second channel is institutional. In practice, contract templates are chosen precisely to limit the number of distinct incentive regimes the agent can select among. Consider multi-metric AI evaluation, where a deployment decision may track hundreds of safety and performance metrics, but a compensation rule might be a weighted combination of a small subset (e.g., accuracy, latency, and a safety score), possibly with thresholds. Or consider pay-for-performance platforms, where a worker’s compensation depends on a few standardized performance indicators rather than the full telemetry of their work process. In these settings, it is not m that governs the number of distinct behaviors the agent can profitably adopt; it is the expressivity of the payment template and the heterogeneity of actions as seen through that template. Our measure K is designed to quantify exactly this expressivity/heterogeneity interaction.
Adopting K as the complexity parameter has three implications that guide the rest of the paper. First, it suggests a positive learnability story that is independent of the outcome dimension: if the induced utility surface has only K affine pieces and bounded slopes, then one can design exploration strategies whose sample complexity scales with K (and the parameter dimension d), not with m. This aligns with the empirical observation that principals can often learn effective incentive weights despite enormous telemetry spaces. Second, it clarifies why unrestricted contracts are hard: allowing arbitrary outcome-contingent transfers makes K potentially enormous, because the principal can create many finely separated incentive regimes, and learning must then pay for that expressivity. Third, it highlights a trade-off that is central for policy and design: restricting contract form can improve learnability and stability (lower K), but may reduce attainable performance by ruling out high-powered or finely targeted incentives. Our model is meant to illuminate this trade-off rather than to declare a universally optimal level of contract complexity.
We also acknowledge limitations up front. Our focus on implementability complexity abstracts from several issues that matter in applications: multi-agent interactions, dynamic moral hazard, and endogenous participation constraints, among others. Moreover, K is a property of the induced best-response geometry, which depends on how agents map contracts to behavior; in practice, that mapping may be nonstationary, strategically manipulated, or influenced by unmodeled constraints. Nonetheless, we view K as a useful organizing concept precisely because it remains meaningful across many such extensions: whenever the agent’s response partitions the principal’s feasible set into a finite number of regimes, the number and geometry of those regimes should govern learning difficulty.
The remainder of the paper formalizes this perspective. We begin by laying out the repeated contracting model with affine templates and the feedback structures we consider, including the baseline outcome-only payoff feedback and a stronger evaluation-feedback variant motivated by batch testing in AI deployments. We then develop the geometric characterization of best-response regions and introduce K as the central complexity proxy, before turning to algorithms and regret bounds that scale with K rather than m. This sequence is deliberate: we want the reader to see, early and clearly, why “more telemetry” does not automatically make contracting harder—and why the right complexity notion is the number of implementable incentive regimes, not the number of logged outcomes.
We study a principal who repeatedly posts a contract from a restricted (and known) affine template, facing an agent who privately chooses among finitely many costly actions. The purpose of the model is to isolate what the principal can realistically control—low-dimensional contract parameters—while allowing the outcome space to be arbitrarily rich.
Outcomes, rewards, and contracts. In each round an observable outcome j ∈ [m] := {1, …, m} is realized. We interpret j broadly: it can be a discrete performance category, a telemetry bin obtained by discretizing a high-dimensional event stream, or an index into a scorecard of evaluation results. The principal derives a known reward rj ∈ [0, 1] from outcome j. (In applications, this corresponds to a valuation model the principal has committed to ex ante—e.g., a weighted safety/performance score for an AI system, or an internal quality score in procurement.)
Payments are restricted to an affine template parameterized by a
vector x ∈ 𝒳 ⊂ ℝd,
where 𝒳 is compact and convex. The
induced outcome-contingent transfer is
t(x) = Ax + b ∈ ℝ^m_+,
where A ∈ ℝ^{m × d}
and b ∈ ℝ^m
are known to the principal. We impose boundedness 0 ≤ tj(x) ≤ 1
for all j and x ∈ 𝒳, which serves two roles: it
captures institutional constraints (budgets, caps, and legal limits on
“high-powered” incentives), and it ensures payoffs are uniformly bounded
for learning.
Hidden actions and outcome distributions. The agent has a finite action set 𝒜 (potentially large). Each action i ∈ 𝒜 is associated with (i) an unknown cost ci ∈ [0, 1] and (ii) an unknown outcome distribution qi ∈ Δ([m]). We write qij for the probability of outcome j under action i. The principal does not observe the chosen action and does not know {qi, ci}i ∈ 𝒜; this is the informational friction that makes contracting nontrivial.
Given a posted parameter x,
action i induces an expected
payment
$$
T_i(x) \;:=\; \sum_{j=1}^m q_{ij}\, t_j(x) \;=\; q_i \cdot t(x),
$$
and an agent expected utility
UA(i ∣ x) := Ti(x) − ci = qi ⋅ (Ax + b) − ci.
The agent best-responds:
i*(x) ∈ arg maxi ∈ 𝒜{UA(i ∣ x)},
with ties broken by a fixed rule (so i*(x) is
single-valued). This tie-breaking is innocuous economically—any
selection can be justified by perturbations or conventions—but it is
convenient analytically because it makes the induced objective
well-defined everywhere.
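To fix ideas, the following minimal sketch instantiates the template and the agent's best response numerically. All primitives (A, b, the q_i's, the c_i's) are small synthetic values chosen only so that 0 ≤ t_j(x) ≤ 1 on 𝒳 = [0, 1]^2; nothing here is calibrated to any application.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n_actions = 20, 2, 5                      # outcome bins, template dimension, actions
A = rng.uniform(0.0, 0.4, size=(m, d))          # template coefficients (scaled so payments stay in [0, 1])
b = rng.uniform(0.0, 0.2, size=m)
q = rng.dirichlet(np.ones(m), size=n_actions)   # hidden outcome distributions q_i
c = rng.uniform(0.0, 0.05, size=n_actions)      # hidden costs c_i

def payments(x):
    """Outcome-contingent transfers t(x) = A x + b under the affine template."""
    return A @ x + b

def best_response(x):
    """Agent maximizes q_i . t(x) - c_i; ties broken by lowest index (a fixed rule)."""
    return int(np.argmax(q @ payments(x) - c))

for x in (np.array([0.1, 0.1]), np.array([0.9, 0.9])):
    t = payments(x)
    print(x, "->", best_response(x), f"(payments in [{t.min():.2f}, {t.max():.2f}])")
```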
Timing and feedback. For rounds t = 1, …, T:
1. The principal chooses xt ∈ 𝒳 and
thereby posts payments t(xt).
2. The agent chooses it = i*(xt).
3. An outcome jt ∼ qit
is realized and observed by the principal.
4. The principal’s realized payoff is
ut = rjt − tjt(xt) ∈ [−1, 1].
Because r and the template are
known, observing jt is equivalent
to observing ut; however, it
is useful conceptually to remember that the data are “one realized
outcome per round,” not the full distribution qit itself.
We emphasize that the principal never sees it, so even
repeated observations at the same x are confounded by the fact that
the agent’s action is an endogenous function of the contract.
The principal’s expected utility from posting x is
$$
U(x) \;:=\; \mathbb{E}[u_t \mid x_t=x] \;=\; R_{i^*(x)} - T_{i^*(x)}(x),
\quad\text{where}\quad
R_i := \sum_{j=1}^m q_{ij} r_j.
$$
Equivalently, U(x) = qi*(x) ⋅ r − qi*(x) ⋅ t(x).
The learning problem is to choose {xt} to compete
with the best fixed parameter in hindsight, despite observing only the
induced outcomes and payoffs.
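The round structure above can be simulated directly. The sketch below (with the same kind of synthetic primitives as before, which are assumptions for illustration only) contrasts the single noisy payoff u_t the principal actually observes with the induced expectation U(x) = R_{i*(x)} − T_{i*(x)}(x), which is computable here only because the simulator knows q and c.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n_actions = 20, 2, 5
A = rng.uniform(0.0, 0.4, size=(m, d)); b = rng.uniform(0.0, 0.2, size=m)
r = rng.uniform(0.0, 1.0, size=m)               # known rewards r_j
q = rng.dirichlet(np.ones(m), size=n_actions)   # hidden from the principal
c = rng.uniform(0.0, 0.05, size=n_actions)      # hidden from the principal

def t(x): return A @ x + b
def i_star(x): return int(np.argmax(q @ t(x) - c))

def simulate_round(x):
    """One round of bandit feedback: a single realized outcome j_t and payoff u_t."""
    i = i_star(x)                               # the principal never observes i
    j = rng.choice(m, p=q[i])
    return r[j] - t(x)[j]                       # u_t = r_{j_t} - t_{j_t}(x_t)

def U(x):
    """Induced expected utility R_{i*(x)} - T_{i*(x)}(x); available only to the simulator."""
    i = i_star(x)
    return q[i] @ (r - t(x))

x = np.array([0.5, 0.5])
print(f"U(x) = {U(x):.4f},  empirical mean of u_t = {np.mean([simulate_round(x) for _ in range(5000)]):.4f}")
```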
Regret objectives. In the stationary case (fixed
{qi, ci}),
we evaluate an online algorithm by static regret relative to the best
single contract parameter:
$$
\mathrm{Reg}(T) \;=\; \max_{x \in \mathcal{X}} \sum_{t=1}^T \Big(U(x) -
\mathbb{E}[U(x_t)]\Big).
$$
This notion formalizes a principal who wants to learn a stable contract
that performs well over the whole horizon.
We also consider nonstationarity, motivated by settings where the
mapping from contracts to behavior changes over time: agents adapt,
technologies shift, platforms change monitoring, or the population
composition drifts. Formally, we allow the induced expected utility to
change with t, writing Ut(x)
for the principal’s expected utility in round t if x were posted. We then measure
performance by dynamic regret,
$$
\mathrm{DReg}(T) \;=\; \sum_{t=1}^T \Big(\max_{x\in\mathcal{X}} U_t(x) -
\mathbb{E}[U_t(x_t)]\Big),
$$
and control the extent of nonstationarity via a variation budget $V_T = \sum_{t=2}^T \sup_x
|U_t(x)-U_{t-1}(x)|$. This is deliberately agnostic about
why drift occurs; it captures the operational reality that
“what works” may gradually change even when the contract template itself
is fixed.
Two feedback regimes: outcome-only (bandit) versus evaluative payoff queries. The baseline regime is what many contracting environments naturally provide: each round yields a single realized outcome jt (hence a single payoff ut). This is bandit feedback with endogenous confounding through i*(xt). In particular, the principal cannot directly estimate qi, cannot directly compare actions, and cannot decompose realized payoffs into “reward” and “payment effects” beyond the identity ut = rjt − tjt(xt). Economically, this corresponds to paying and observing performance in the field: each period produces one draw from whatever behavior the agent found optimal.
In some applications, however, the principal can obtain stronger information about U(x) by running repeated evaluations at a chosen parameter x. For example, before deploying an AI system with a given incentive scheme, a firm may run a large offline test suite; a platform may pilot a compensation rule on a randomized cohort; a regulator may require stress tests under standardized scenarios. In these cases, the principal can (at some cost) obtain an approximately unbiased estimate of U(x) with controllable variance—operationally, by averaging many i.i.d. instances of the outcome/payoff process under the same x. We treat this as an “evaluation-feedback” variant: the principal can spend additional measurement effort to reduce noise in utility estimation, and the learning guarantees can improve accordingly. Importantly, even this stronger feedback does not reveal the hidden action; it simply provides cleaner access to the induced expected payoff surface.
Examples and interpretation. In multi-metric AI evaluation, the “outcome” j can encode a discretized vector of audit results (e.g., pass/fail across a battery of safety checks, latency buckets, calibration categories). The reward rj captures the principal’s downstream value from deploying a model with that evaluation profile, while the payment tj(x) could be a bonus schedule for a model developer or an internal team, constrained to be an affine function of a small set of headline metrics (dimension d). The agent’s actions correspond to development choices—training recipes, data filtering, alignment interventions—each with unknown cost and unknown distribution over evaluation outcomes. Repeated contracting arises when the principal iteratively updates incentive weights as more deployments and evaluations occur.
In pay-for-performance platforms, outcomes j can represent realized service quality categories derived from telemetry (on-time rates, complaint flags, customer ratings, compliance checks). The template x represents the platform’s adjustable levers (a few bonus multipliers, thresholds, or linear weights), while the agent’s actions capture effort and strategic responses (e.g., route choice, time allocation, selective acceptance), which shift the distribution of observable outcomes in ways that are not known ex ante. Again, the platform observes outcomes but not the worker’s internal action, and learning is about tuning a low-dimensional payment rule to induce high-value behavior.
Modeling choices and limitations. We deliberately take 𝒜 to be finite and outcomes to be discretized into m bins. This is not because we believe behavior or measurement is truly discrete, but because the central difficulty here is endogenous response under a restricted payment template, not statistical estimation of a smooth response function. Discretization lets us represent rich telemetry while keeping payoffs bounded and the incentive problem well-defined. Similarly, we assume the principal’s reward vector r is known and stable; in many domains r is itself estimated and contested. Our results should therefore be read as describing the learning problem conditional on a chosen evaluation/valuation scheme, rather than solving the full problem of learning values and incentives simultaneously.
With these primitives in place, the next step is to describe how the affine template and best-response behavior impose geometry on the principal’s induced objective U(⋅). That geometry—rather than the raw outcome dimension—will determine what can be learned efficiently from the feedback described above.
The key structural object that makes the learning problem tractable is the implementability partition: the decomposition of the parameter space 𝒳 into regions on which the agent’s best response is constant. Economically, these regions are exactly the sets of contract parameters that satisfy a fixed collection of incentive constraints; geometrically, under an affine template they form a polyhedral subdivision of 𝒳. Our learning rates will scale with the complexity of this subdivision rather than with the raw outcome dimension m or the size of the action set |𝒜|.
Fix two actions i, i′ ∈ 𝒜. Under
parameter x, the agent
compares
UA(i ∣ x) − UA(i′ ∣ x) = (qi − qi′) ⋅ t(x) − (ci − ci′).
With an affine contract t(x) = Ax + b,
this difference is affine in x:
$$
U_A(i\mid x)-U_A(i'\mid x)
= \underbrace{\big(A^\top(q_i-q_{i'})\big)}_{\in\mathbb{R}^d}\cdot x
+ \underbrace{(q_i-q_{i'})\cdot b - (c_i-c_{i'})}_{\in\mathbb{R}} .
$$
Thus, the indifference locus between i and i′ is a hyperplane (when
the coefficient vector is nonzero), and strict preference for i over i′ is a halfspace. This
is the multivariate analogue of the familiar “upper-envelope of lines”
picture for one-dimensional linear contracts: rather than slopes and
intercepts in ℝ, we have normal vectors
and offsets in ℝd.
A convenient way to package what matters for incentives is to define,
for each action i,
wi := A⊤qi ∈ ℝd, αi := qi ⋅ b − ci ∈ ℝ,
so that
UA(i ∣ x) = wi ⋅ x + αi.
From the perspective of contract design within the template,
the potentially high-dimensional outcome distribution qi influences
incentives only through the projected statistic wi = A⊤qi
and the scalar αi. In
particular, two actions with different underlying qi’s can be
incentive-equivalent if they share the same (wi, αi),
meaning the agent is indifferent between them for all x. This observation already hints at
why m should not appear in the
learning rates: the template collapses the outcome space into a d-dimensional set of
incentive-relevant directions.
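The collapse from q_i to (w_i, α_i) can be demonstrated explicitly: perturbing a distribution inside the joint null space of the rows of A⊤, of b, and of the all-ones vector changes the outcome distribution without changing A⊤q, q ⋅ b, or total mass. The snippet below is a sketch of this construction with synthetic A and b (it assumes SciPy is available for the null-space computation); with equal costs, the two resulting actions are incentive-equivalent for every x.

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
m, d = 50, 2
A = rng.uniform(0.0, 0.02, size=(m, d)); b = rng.uniform(0.0, 0.3, size=m)

q1 = rng.dirichlet(np.ones(m))
# Directions that leave A^T q, q.b, and sum(q) unchanged: null space of the (d+2) x m matrix below.
N = null_space(np.vstack([A.T, b[None, :], np.ones((1, m))]))
delta = N[:, 0]
q2 = q1 + 0.5 * delta * q1.min() / np.abs(delta).max()   # small enough to keep q2 >= 0

print("distributions differ:", not np.allclose(q1, q2))
print("same projected statistic w:", np.allclose(A.T @ q1, A.T @ q2))
print("same q.b and total mass:", np.isclose(q1 @ b, q2 @ b), np.isclose(q2.sum(), 1.0))
```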
For any action i, define
its best-response region (or implementability set)
𝒳i := {x ∈ 𝒳: UA(i ∣ x) ≥ UA(i′ ∣ x) ∀i′ ∈ 𝒜}.
Substituting the affine representation, we obtain
𝒳i = 𝒳 ∩ ⋂i′ ∈ 𝒜{x: (wi − wi′) ⋅ x ≥ αi′ − αi}.
That is, 𝒳i is the
intersection of 𝒳 with finitely many
halfspaces, hence a (possibly empty) convex polyhedron. The collection
{𝒳i}i ∈ 𝒜
covers 𝒳, and with a fixed tie-breaking
rule we may view it as a genuine partition up to shared boundaries: each
x is assigned a unique best
response i*(x), and the
set of points inducing a given best response is polyhedral.
This partition is the formal version of an implementability story: a contract parameter x “implements” action i precisely when all incentive constraints UA(i ∣ x) ≥ UA(i′ ∣ x) hold. Under affine templates these constraints are linear, so implementability is described by linear inequalities. This is not merely a mathematical convenience; it captures an important practical rigidity. When organizations restrict attention to a small family of scoring rules, bonus schemes, or linear-weight contracts, they implicitly restrict the set of incentive constraints they can move. The result is that action choice changes only when the parameter crosses a boundary hyperplane where some alternative becomes equally attractive.
We measure the effective “behavioral richness” of a template by
counting how many distinct best-response regions it induces on 𝒳. Let
K := |{ i*(x) : x ∈ 𝒳 }|
be the number of actions that are optimal somewhere in 𝒳 (equivalently, the number of nonempty cells
𝒳i). One can also
define K as the number of
cells in the polyhedral subdivision after accounting for tie-breaking on
boundaries; for our purposes these notions coincide up to irrelevant
measure-zero issues.
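A crude empirical proxy for K is to count the distinct best responses over a fine grid of 𝒳; this can only undercount (a thin region may fall between grid points), but it illustrates how K can be much smaller than |𝒜|. The primitives below are synthetic assumptions, and the printed count depends on them.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
m, d, n_actions = 20, 2, 15
A = rng.uniform(0.0, 0.4, size=(m, d)); b = rng.uniform(0.0, 0.2, size=m)
q = rng.dirichlet(np.ones(m), size=n_actions)
c = rng.uniform(0.0, 0.05, size=n_actions)

w = q @ A                    # projected statistics w_i = A^T q_i, one row per action
alpha = q @ b - c            # intercepts alpha_i = q_i . b - c_i

def i_star(x):
    return int(np.argmax(w @ x + alpha))

grid = np.linspace(0.0, 1.0, 200)
seen = {i_star(np.array(point)) for point in product(grid, repeat=d)}
print(f"|A| = {n_actions}, distinct best responses on the grid (a lower bound on K): {len(seen)}")
```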
Two features of K are worth emphasizing.
K can be far smaller than |𝒜|. Even if the agent has a huge menu of actions, only a limited number may be implementable within the template. Indeed, if many actions share the same projected vector wi = A⊤qi (and comparable intercepts), the template cannot separate them. In contracting terms: a low-dimensional scorecard cannot finely discriminate among high-dimensional behaviors.
K is the right notion of complexity for learning. The principal’s induced objective U(⋅) is nonsmooth only because the agent switches actions; every kink corresponds to crossing a boundary between best-response regions. Thus, the number of such regions governs how hard it is to locate the global maximizer from noisy payoff feedback. In our regret bounds, K plays the role that “number of arms” plays in bandits, but in a geometrically structured way: regions are not independent arms, they are adjacent polyhedra carved out by hyperplanes.
A useful auxiliary quantity is the number M of distinct projected vectors wi = A⊤qi. Since comparisons depend on differences wi − wi′, the arrangement complexity is controlled by how many distinct normals appear. Standard hyperplane arrangement bounds imply that, in d dimensions, the number of cells formed by M affine functions is at most on the order of M^d (up to constants and lower-order terms). This provides an interpretable proxy: increasing template dimension d or increasing the variety of incentive-relevant outcome profiles (captured by M) can explode the number of behavioral regimes the principal may face.
Once an action i is fixed,
the principal’s expected utility is affine:
U(x) = Ri − qi ⋅ (Ax + b) = Ri − (A⊤qi) ⋅ x − qi ⋅ b = βi − wi ⋅ x,
where βi := Ri − qi ⋅ b
is constant in x. Hence on any
region where i*(x) = i,
the gradient exists and equals
∇U(x) = −wi = −A⊤qi.
We encode uniform control of these slopes by assuming a known
bound
∥∇U(x)∥ ≤ L whenever
the gradient is defined,
equivalently maxi∥A⊤qi∥ ≤ L
in the relevant norm. This bound is natural in two ways. First, it is
implied by bounded payments together with bounded template coefficients
(e.g., via ∥A⊤qi∥ ≤ ∥A⊤∥⋅∥qi∥
and ∥qi∥ ≤ 1 under
appropriate norms). Second, it has a clear economic meaning: within a
fixed behavioral regime, how quickly does the principal’s expected
payoff change as we tweak contract parameters? Steeper slopes mean small
parameter mistakes translate into large utility losses, which forces
more precise learning near optima and near region boundaries.
One could also summarize the difficulty of learning U(⋅) using more classical statistical capacity notions, such as the pseudo-dimension of the function class. Recall that the pseudo-dimension Pdim(ℱ) of a real-valued class ℱ is the largest n such that there exist points x1, …, xn and thresholds θ1, …, θn for which the sign patterns of $\{f(x_s) - \theta_s\}_{s=1}^{n}$ can realize all $2^n$ labelings as f ranges over ℱ. For classes defined by maxima of affine functions, pseudo-dimension is typically on the order of d log(number of pieces). In our setting, if we consider the induced utilities consistent with some unknown {(wi, αi, βi)} but with at most K regions and bounded slopes, one expects Pdim scaling like O(d log K) (up to constants and technical conditions).
We nevertheless prefer K (and L) as our primary structural parameters for two reasons. First, K is directly tied to the economics of implementability: it counts how many qualitatively distinct agent behaviors the template can induce. Second, the geometry induced by the implementability partition enables algorithms that exploit adjacency and piecewise affinity, yielding regret bounds that depend polynomially on K and only mildly on d (typically through logarithms or fractional powers), without ever estimating the full m-dimensional outcome distributions.
This sets up the next step. To build intuition for how “kinks” and “regions” drive learning, we begin with the warm-up case d = 1, where the implementability partition becomes intervals, U(⋅) becomes piecewise-linear with K breakpoints, and the upper-envelope geometry can be drawn—and learned—from data.
To see most transparently how the implementability partition drives
learnability, it is useful to collapse the template to a single
parameter. Think of a one-dimensional “intensity knob” (a commission
rate, a single score weight, a bonus multiplier): 𝒳 = [0, 1] ⊂ ℝ and x ∈ [0, 1] is scalar. In this case
each projected statistic wi = A⊤qi
is a real number, and the agent’s expected utility for action i becomes a line
UA(i ∣ x) = wix + αi.
The agent best response i*(x) is
therefore determined by the upper envelope of lines
maxi{wix + αi}.
Two classical geometric facts (familiar from upper-envelope arguments in
screening and mechanism design) become immediate:
Best-response regions are intervals. With a fixed tie-breaking rule, the set of x inducing a given action is an interval (possibly empty). Intuitively, if action i beats i′ at two points on the line, it beats it everywhere between those points because the difference is linear.
There are finitely many switching points. The map x ↦ i*(x) can change only at intersection points of pairs of lines. Thus the implementability partition of [0, 1] is an ordered sequence of intervals separated by at most K − 1 breakpoints, where K counts how many actions appear on the envelope over [0, 1].
This is the one-dimensional analogue of the polyhedral partition described earlier: hyperplanes become points, and cells become intervals. Economically, we can interpret each breakpoint as a threshold at which a marginal change in the contract parameter flips which incentive constraint binds.
On any interval I where the
best response is constant, say i*(x) = i
for all x ∈ I, the
principal’s induced utility is also linear:
U(x) = Ri − qi ⋅ t(x) = βi − wix, with
|wi| ≤ L.
So in d = 1, U is a
piecewise-linear function with at most K linear pieces and slope bounded by
L, with kinks only at the
(unknown) best-response breakpoints.
A small but important simplification occurs in one dimension: on any interval where U is linear, its maximum is achieved at an endpoint of that interval (unless the slope is exactly zero, in which case every point is optimal). Consequently, if we knew the breakpoints, we could restrict attention to a finite candidate set: the domain endpoints {0, 1} together with the K − 1 internal breakpoints. The learning difficulty is precisely that we do not know these switching points ex ante and observe only noisy payoffs ut at the parameters we choose.
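If the agent-side lines w_i x + α_i and the principal-side intercepts β_i were known, the breakpoints and the finite candidate set could be computed directly, as in the sketch below; the numerical values are arbitrary assumptions, and the point is only to make the candidate-set logic concrete.

```python
import numpy as np

rng = np.random.default_rng(4)
n_actions = 8
w = rng.uniform(0.0, 1.0, size=n_actions)        # projected slopes w_i = A^T q_i (scalars when d = 1)
alpha = rng.uniform(-0.5, 0.5, size=n_actions)   # agent intercepts q_i . b - c_i
beta = rng.uniform(0.0, 1.0, size=n_actions)     # principal intercepts R_i - q_i . b

def i_star(x):
    return int(np.argmax(w * x + alpha))          # upper envelope of agent lines

def U(x):
    i = i_star(x)
    return beta[i] - w[i] * x                     # principal utility on the active piece

# Scan a fine grid for action switches, then solve each crossing w_i x + a_i = w_j x + a_j exactly.
grid = np.linspace(0.0, 1.0, 10_001)
winners = np.array([i_star(x) for x in grid])
breaks = []
for k in np.flatnonzero(np.diff(winners)):        # adjacent grid points with different winners
    i, j = winners[k], winners[k + 1]
    breaks.append((alpha[j] - alpha[i]) / (w[i] - w[j]))

candidates = [0.0, 1.0] + breaks                  # domain endpoints plus the K - 1 internal breakpoints
print(f"K = {len(set(winners.tolist()))} pieces; best candidate value = {max(U(x) for x in candidates):.3f}")
```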
This perspective already clarifies why K is the right complexity parameter. A generic Lipschitz function on [0, 1] can hide its maximizer anywhere, forcing exploration on an ε-grid. By contrast, a K-piecewise-linear function can “hide” its maximizer only in the vicinity of a small number of unknown kinks. Learning is therefore a problem of discovering and comparing a small number of regimes, rather than uniformly covering the continuum.
The most useful mental model for algorithms in this warm-up is an adaptive partition of [0, 1] into sub-intervals, combined with two statistical tasks: (i) testing whether the observed payoffs on a sub-interval are consistent with a single linear piece (so that a hidden kink inside it would be detected), and (ii) estimating payoffs accurately enough at the surviving candidate points to compare them.
This suggests a natural adaptive routine (we state it informally because in later sections we will use a higher-dimensional analogue): sample noisy payoffs at the endpoints and the midpoint of each active interval; if the midpoint estimate is inconsistent with the average of the endpoint estimates beyond the noise tolerance, split the interval; otherwise keep only its endpoints as candidates and concentrate further samples on the most promising ones.
Because there are only K − 1 true kinks, only O(K) intervals will ever need to be split “for structural reasons”; all other refinement is purely to sharpen estimates near the best candidate.
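The following is a minimal sketch of that informal routine under a synthetic noisy payoff oracle; the tolerance, sample sizes, and the particular three-line instance are illustrative assumptions rather than tuned constants from our analysis.

```python
import numpy as np

rng = np.random.default_rng(5)

# A synthetic 1D instance: agent lines w_i x + alpha_i and principal intercepts beta_i.
w = np.array([0.8, 0.2, 0.6]); alpha = np.array([-0.1, 0.25, 0.0]); beta = np.array([0.3, 0.9, 0.6])

def noisy_payoff(x):
    i = int(np.argmax(w * x + alpha))             # hidden best response
    return beta[i] - w[i] * x + rng.normal(0, 0.1)

def estimate(x, n=400):
    """Average n noisy payoffs at x (repeated play at the same parameter)."""
    return float(np.mean([noisy_payoff(x) for _ in range(n)]))

def refine(a, b, depth=0, tol=0.02, max_depth=6):
    """Split [a, b] only if the midpoint payoff is inconsistent with a single linear piece."""
    fa, fb, fm = estimate(a), estimate(b), estimate((a + b) / 2)
    if abs(fm - (fa + fb) / 2) <= tol or depth >= max_depth:
        return [(a, fa), (b, fb)]                 # looks linear (or is tiny): keep endpoints as candidates
    mid = (a + b) / 2                             # likely hides a kink: recurse on both halves
    return refine(a, mid, depth + 1, tol, max_depth) + refine(mid, b, depth + 1, tol, max_depth)

x_hat, u_hat = max(refine(0.0, 1.0), key=lambda p: p[1])
print(f"candidate maximizer x_hat = {x_hat:.3f}, estimated payoff = {u_hat:.3f}")
```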
One way to summarize what this buys us is an offline
sample complexity statement: with i.i.d. payoff queries, one can return
x̂ such that
U(x̂) ≥ maxx ∈ [0, 1]U(x) − ε
using on the order of Õ(K ε^{-2})
payoff samples (with constants depending on the noise level and on L). The key point is conceptual: the
dependence is on the number of pieces K, not on the outcome dimension
m, and not on any fine
discretization of [0, 1].
To convert the warm-up intuition into an online regret rate under payoff feedback, we can apply the standard exploration–exploitation balancing that reappears in the general d-dimensional analysis: spend S(ε) rounds locating an ε-optimal parameter (by the offline argument above, S(ε) is on the order of K/ε^2), then commit to that parameter for the remaining rounds at a per-round loss of at most ε.
This gives the familiar regret decomposition
$$
\mathrm{Reg}(T) \;\lesssim\; S(\varepsilon) + \varepsilon T
\;\approx\; \frac{K}{\varepsilon^2} + \varepsilon T .
$$
Optimizing over ε yields ε ≍ (K/T)^{1/3}
and therefore
Reg(T) = Õ(L K^{1/3} T^{2/3}),
where L enters as the natural
scale factor because it converts parameter error into utility loss along
a linear piece.
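For completeness, the ε above is just the balance point of the two terms in the decomposition (treating constants and logarithmic factors loosely):
$$
\frac{K}{\varepsilon^{2}} \;=\; \varepsilon T
\quad\Longleftrightarrow\quad
\varepsilon \;=\; \Big(\frac{K}{T}\Big)^{1/3},
\qquad\text{so that}\qquad
\frac{K}{\varepsilon^{2}} + \varepsilon T \;=\; 2\,K^{1/3}\,T^{2/3}.
$$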
This T^{2/3} behavior is exactly the hallmark of learning problems where one must both localize a small set of unknown structural features (here, breakpoints) and estimate payoffs with noise. In practice, it corresponds to the common situation where a decision-maker tunes a single contract knob over time: most of the horizon is spent deciding “which regime we are in,” not fine-tuning within a regime.
The same geometry also explains why we should not expect to do substantially better in the worst case. A lower-bound construction can embed a K-armed identification problem into [0, 1] by associating each “arm” with an interval and choosing piecewise-linear functions whose maxima occur near the boundaries between intervals. Informally, unless we sample near the right breakpoint, we cannot tell whether the best segment lies to the left or right, and the noise forces repeated sampling. This yields a minimax Ω(K^{1/3} T^{2/3}) barrier (up to logarithmic factors) for the class, even in one dimension.
It is also useful to place this warm-up next to the continuum-armed bandit literature. Because slopes are bounded, U is globally L-Lipschitz on [0, 1]. But the piecewise-linear structure is stronger than mere Lipschitz continuity: away from the K − 1 breakpoints, the function is perfectly linear, so all “curvature” is concentrated on a small set. This resembles the one-sided/weak Lipschitz conditions used to obtain dimension-dependent rates via zooming arguments: local complexity near the optimum, not global covering numbers, governs performance. Our setting differs only in that the locations of the kinks are themselves unknown and must be learned from bandit feedback, which is what restores the T^{2/3} exploration–exploitation tradeoff.
Finally, the upper-envelope picture provides a contracting interpretation for these statistical statements. When we restrict to a one-parameter template, the agent’s behavior changes only at a finite set of incentive thresholds; learning is hard precisely because the principal must discover which threshold is relevant. This is the core intuition we will carry into the general d-dimensional case, where “thresholds” become hyperplane boundaries and “interval refinement” becomes adaptive partitioning of 𝒳.
When we move from a single contract “knob” to a d-dimensional template x ∈ 𝒳 ⊂ ℝd, the geometry becomes richer but the economic source of structure is unchanged: the agent’s best response is still an upper envelope of affine utilities, and the principal’s induced payoff U(⋅) is therefore affine within regions and non-smooth only where the induced action switches. The implementability partition now consists of at most K polyhedral cells in 𝒳, separated by hyperplanes defined by pairwise incentive comparisons.
From a learning perspective, the central difficulty under payoff feedback is that we do not observe the agent’s action nor the region identity. We only observe noisy realizations ut ∈ [−1, 1] whose expectation equals U(xt). Thus the problem is a bandit optimization task over a function class that is globally Lipschitz (with constant L) but has a far stronger “low non-smoothness complexity” captured by K.
Fix any cell P of the
implementability partition and consider the set P ∩ 𝒳, which remains convex. On that
set, U(x) has the
affine form
U(x) = γ − g ⋅ x, ∥g∥ ≤ L,
for some (γ, g)
determined by the induced action. Maximizing an affine function over a
convex set is achieved at an extreme point; hence, if we knew
the cell, the principal could restrict attention to finitely
many candidate points (the vertices of P ∩ 𝒳, or more generally extreme
points). In economic terms: once the relevant incentive constraint is
known, optimizing the contract within that regime is “linear
programming,” not a nonparametric search.
Of course, we do not know the cells, and in higher dimensions we cannot hope to explicitly enumerate all vertices of all regions. The algorithmic goal is therefore more modest and more statistical: we adaptively refine 𝒳 into smaller sets until (i) each active set is likely contained in a single regime, and (ii) the remaining uncertainty in value is too small to justify further refinement.
The one-dimensional midpoint identity has a clean analogue in ℝd. Suppose we focus on a
small convex set C ⊂ 𝒳 (think
of a simplex or a box) and select d + 1 affinely independent “anchor”
points v0, …, vd ∈ C.
Any point y in the simplex
conv{v0, …, vd}
admits barycentric coordinates λ(y) ∈ Δd
such that $y=\sum_{k=0}^d \lambda_k(y)
v_k$. If U is affine on
this simplex, then it must satisfy the identity
$$
U(y) \;=\; \sum_{k=0}^d \lambda_k(y) \, U(v_k).
$$
This gives a concrete statistical “kink detector”: by sampling payoffs
at v0, …, vd
and at one or more interior test points y, we can check whether the observed
values are consistent with a single affine function up to the noise
tolerance. If not, then with high probability C intersects a region boundary (an
action switch), and we should split C into smaller sets.
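A sketch of this barycentric consistency check appears below. The two-piece test surface, the noise level, the tolerance, and the sample size are all hypothetical; the only structural fact used is the identity U(y) = Σ_k λ_k(y) U(v_k) for an affine U.

```python
import numpy as np

rng = np.random.default_rng(6)

def barycentric(y, vertices):
    """Solve for lambda with y = sum_k lambda_k v_k and sum_k lambda_k = 1 (vertices: (d+1) x d array)."""
    M = np.vstack([vertices.T, np.ones(vertices.shape[0])])
    return np.linalg.solve(M, np.append(y, 1.0))

def affine_consistent(payoff, vertices, y, n=500, tol=0.05):
    """Compare the payoff estimate at y with the barycentric combination of the anchor estimates."""
    U_hat = lambda x: float(np.mean([payoff(x) for _ in range(n)]))
    lam = barycentric(y, vertices)
    predicted = lam @ np.array([U_hat(v) for v in vertices])
    return abs(U_hat(y) - predicted) <= tol        # a failure suggests an action switch inside the simplex

# A hypothetical two-piece surface in d = 2 with additive noise.
def payoff(x):
    return min(0.9 - 0.3 * x[0], 0.7 - 0.1 * x[1]) + rng.normal(0, 0.1)

small_simplex = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2]])   # likely inside a single piece
print(affine_consistent(payoff, small_simplex, y=np.array([0.07, 0.07])))
```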
Two aspects are worth emphasizing.
The test is local and does not require estimating qi or ci. We never attempt to infer the agent’s primitives; we only test whether the induced payoff surface behaves like one affine piece on a candidate neighborhood.
The test is scale-sensitive. If C is very small (diameter diam(C)), then even if C crosses a boundary, the maximum possible payoff difference across C is at most L diam(C). At sufficiently fine scales, it becomes statistically unnecessary to resolve whether C contains a kink, because the regret impact is negligible.
These two facts make it plausible that an adaptive refinement strategy can pay only for the “important” boundaries—those near the optimum and at scales where boundary uncertainty translates into material payoff loss.
The online procedure can be viewed as maintaining a growing tree of cells that cover 𝒳. Each cell C maintains a set of d + 1 affinely independent anchor points, empirical payoff estimates Û(v) at those anchors together with confidence radii rad(v), and an optimistic value UCB(C) for the best expected payoff attainable inside the cell.
A simple and robust form of UCB(C) is the maximum over anchors
of their upper confidence plus a Lipschitz slack for moving within the
cell:
UCB(C) = maxv ∈ Anchors(C)(Û(v) + rad(v)) + L diam(C).
This bound is intentionally conservative: it does not assume affinity
inside C. When the algorithm
has statistical evidence that U is affine on C, it can tighten UCB(C) using an estimated affine
model (fit from the anchor points) plus confidence; but the
Lipschitz-based bound suffices to explain the rate and is stable to
misspecification near boundaries.
The online decision rule is then: at each round t, choose a cell Ct with maximal UCB(C), and within that cell choose a point to sample that most efficiently reduces uncertainty—typically an anchor point whose confidence radius is largest, or a designated interior test point used for affine-consistency checks. If the consistency test fails at cell C, split it into smaller cells (e.g., bisect along a coordinate axis, or refine a simplex into subsimplices) and continue.
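The bookkeeping behind “optimism + splitting” can be kept very small. The sketch below stores, for each axis-aligned box cell, its anchor estimates and confidence radii, and scores cells with the conservative Lipschitz-slack bound from above; treating unexplored cells as maximally optimistic is a simplifying assumption, as is the box (rather than simplex) geometry.

```python
import numpy as np
from dataclasses import dataclass, field

L = 1.0   # assumed bound on the per-piece gradient norm

@dataclass
class Cell:
    lo: np.ndarray                              # lower corner of the box
    hi: np.ndarray                              # upper corner of the box
    est: dict = field(default_factory=dict)     # anchor point (tuple) -> (mean payoff, confidence radius)

    def diam(self):
        return float(np.linalg.norm(self.hi - self.lo))

    def ucb(self):
        """Best anchor upper confidence plus a Lipschitz slack for moving anywhere inside the cell."""
        if not self.est:
            return float("inf")                 # never-sampled cells are treated as maximally optimistic
        return max(u + r for (u, r) in self.est.values()) + L * self.diam()

def select_cell(cells):
    """Optimism: play inside the cell with the largest upper confidence bound."""
    return max(cells, key=lambda C: C.ucb())

cells = [Cell(np.zeros(2), np.ones(2), {(0.5, 0.5): (0.40, 0.05)}),
         Cell(np.zeros(2), 0.5 * np.ones(2), {(0.25, 0.25): (0.35, 0.10)})]
print(select_cell(cells).diam())
```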
Economically, this “optimism + splitting” routine has a transparent interpretation: we keep experimenting with contract variants only in those neighborhoods that could still plausibly contain a better incentive regime, and we stop refining regions that cannot outperform the current best by more than a small margin given the slope bound.
The implementability partition has at most K regions; equivalently, there are at most K − 1 action-switching boundaries (in a suitable sense) that matter for the payoff function. An adaptive partition will repeatedly split cells that straddle these boundaries, but crucially: cells contained in a single region eventually pass the affinity test and are never split for structural reasons, and at any fixed scale the number of cells that do straddle a boundary is controlled by the geometry of the arrangement rather than by |𝒜| or m.
This is where the d-dependence enters. In one dimension, boundaries are points and the count is literal. In ℝd, boundaries are (d − 1)-dimensional faces, and the number of small cells intersecting them grows with dimension. The analysis packages this growth into a factor on the order of d log K (up to constants and polylogarithms), reflecting that a K-cell polyhedral complex cannot “shatter” space arbitrarily: the arrangement has limited combinatorial richness relative to K and d.
Putting these ingredients together yields the main performance
guarantee for the bandit-feedback regime. Under bounded noise (or
sub-Gaussian payoffs) and the assumptions that U is K-piecewise-affine with per-piece
gradient norm bounded by L,
there exists an online learning algorithm using only realized payoffs
ut such
that
Reg(T) = Õ(L (d log K)^{1/3} K^{1/3} T^{2/3}),
where Õ(⋅) hides
polylogarithmic factors in (T, K, d) (and
standard confidence parameters).
The economic content of this expression is worth separating from the algebra: the K^{1/3} factor prices the number of distinct incentive regimes that must be compared, the (d log K)^{1/3} factor prices localizing region boundaries in d dimensions, L converts parameter error into payoff loss within a regime, and the T^{2/3} dependence is the signature of one-sample-per-round payoff feedback.
The same structural viewpoint also explains why the dependence on the outcome dimension m disappears: we never estimate qi or the full vector qi ⋅ t(x); the payoff observation is already a one-dimensional unbiased signal of U(x).
Finally, the rate is essentially unimprovable for this feedback model: even in d = 1, one can construct families of K-piecewise-affine payoff functions for which any algorithm must incur regret at least Ω(L K^{1/3} T^{2/3}) up to logarithmic terms. Since the lower bound holds in the simplest geometry, it carries over to higher dimensions.
Two caveats are important if we interpret this as guidance for contract design in practice.
Statistical rate versus computational tractability. The guarantee is information-theoretic: it shows what is learnable from payoff feedback given the structural restriction. Implementing the adaptive partition efficiently in high dimension may require additional assumptions (e.g., using special families of partitions, exploiting sparsity in A, or relying on oracles for optimization over 𝒳).
Why the T^{2/3} rate persists. With only one realized outcome per round, estimating U(x) is noisy, and boundary detection is statistically expensive. This is not a mere artifact of analysis; it is a fundamental limitation of the feedback channel. In the next section we explain how the picture changes when the principal can, at chosen times, obtain a much more accurate estimate of U(x) (for example, by evaluating the contract on a large test suite), effectively reducing the problem to stochastic optimization over a finite-complexity surface and improving the regret rate accordingly.
The T^{2/3} regret rate in the pure payoff-feedback model is not driven by geometry alone; it is driven by information scarcity. When we observe only a single realized outcome jt per posted contract, the principal’s signal about U(xt) is extremely noisy, and distinguishing “this neighborhood is a single affine regime” from “this neighborhood straddles an action switch” becomes statistically expensive.
In many practical settings, however, the principal can obtain a substantially cleaner estimate of the expected payoff of a candidate contract before committing to it at scale. Conceptually, this corresponds to an evaluation suite: after choosing a parameter x, we can run the same contract on M independent instances (or M i.i.d. agents, or M independent tasks presented to the same agent when incentive constraints are stationary) and observe the average realized payoff. This moves us from a one-sample bandit signal to a controlled-precision oracle for U(x).
Fix a query point x ∈ 𝒳.
Suppose we can obtain M i.i.d.
payoff samples
u^{(1)}(x), …, u^{(M)}(x), with u^{(s)}(x) = r_{j^{(s)}} − t_{j^{(s)}}(x),
where each j^{(s)} is drawn
from the outcome distribution induced by the agent’s best response to
x. Then the empirical
mean
$$
\widehat U_M(x) \;:=\; \frac{1}{M}\sum_{s=1}^M u^{(s)}(x)
$$
satisfies 𝔼[Û_M(x)] = U(x),
and since u^{(s)}(x) ∈ [−1, 1],
standard concentration implies sub-Gaussian noise with scale $\sigma \asymp 1/\sqrt{M}$:
Pr(|Û_M(x) − U(x)| ≥ ε) ≤ 2 exp(−c M ε^2).
Equivalently, one can phrase the model as: we have access to an oracle
that returns an unbiased estimate Û(x) with variance proxy
σ2, and achieving
accuracy ±ε costs on the order
of σ2/ε2
samples. This captures not only fixed-size suites M per round, but also an adaptive
regime where the principal chooses evaluation effort.
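A sketch of the evaluation oracle is below; the stand-in payoff distribution is hypothetical, and the confidence radius uses Hoeffding's inequality for payoffs in [−1, 1] with an illustrative failure probability δ.

```python
import numpy as np

rng = np.random.default_rng(7)

def payoff_sample(x):
    """Stand-in for one realized payoff u(x) in [-1, 1] at parameter x (purely illustrative)."""
    return float(np.clip(0.6 - 0.4 * x + rng.normal(0, 0.3), -1.0, 1.0))

def evaluate(x, M, delta=0.05):
    """Evaluation suite of size M at x: unbiased mean plus a Hoeffding confidence radius."""
    mean = np.mean([payoff_sample(x) for _ in range(M)])
    radius = np.sqrt(2.0 * np.log(2.0 / delta) / M)   # payoffs have range 2, so radius ~ sqrt(log(1/delta)/M)
    return mean, radius

for M in (10, 100, 1000):
    mean, rad = evaluate(0.5, M)
    print(f"M = {M:4d}:  U_hat = {mean:+.3f} +/- {rad:.3f}")
```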
Economically, this stronger feedback corresponds to a setting where the principal can pilot a contract: A/B testing in a platform, offline policy evaluation on logged data (when assumptions justify it), or repeated deployment to a stable pool of tasks where the induced action distribution does not drift during the evaluation window.
Under our structural assumptions, U is piecewise-affine over 𝒳 with at most K regions. The key implication is
that the global maximizer of U
must lie in (at least) one of these regions, and within any fixed region
the function has the form
U(x) = γ − g ⋅ x with ∥g∥ ≤ L.
Thus the “hard part” is not estimating a complicated smooth surface; it
is identifying which affine piece is best (and where within that piece
the maximum lies over 𝒳). With only one
noisy payoff per round, we pay heavily to resolve both value
uncertainty and boundary uncertainty. With an evaluation
suite, we can reduce value uncertainty quickly, and boundary uncertainty
becomes manageable because we can afford sharper tests of whether two
nearby points belong to the same affine regime.
This is the sense in which stronger feedback turns our problem into something much closer to stochastic optimization over finitely many structured alternatives. At a high level, we can think of each implementability region as an “expert” that recommends the best contract in that region; learning then becomes selecting among ≤ K experts, rather than performing a fully nonparametric search over 𝒳.
There is an important subtlety: we do not know the regions ex ante, and we do not observe the agent’s action. So we cannot literally index experts by actions i, nor can we enumerate the K cells directly. The algorithmic idea is instead to discover candidate regions on the fly using the same geometric logic as in the payoff-feedback case (adaptive partitioning and affine-consistency checks), but to use the evaluation oracle to make these checks sharp and to enable aggressive elimination.
A representative scheme looks like this:
Maintain a partition of 𝒳 into active cells 𝒞 (e.g., a tree of simplices or boxes). For each cell C, we keep a small set of anchor points v0, …, vd ∈ C that are affinely independent.
Estimate U
accurately at anchors using evaluation suites, producing
confidence intervals
$$
U(v_k) \in \big[\widehat U(v_k) \pm \mathrm{rad}(v_k)\big],
\qquad \mathrm{rad}(v_k) \asymp
\sigma\sqrt{\frac{\log(TK/\delta)}{n_k}},
$$
where nk
is the number of suite-evaluations performed at vk.
Test affinity inside C: choose one or more interior
points y ∈ C with
known barycentric coordinates λ(y) relative to the
anchors, and check whether
$$
U(y) \approx \sum_{k=0}^d \lambda_k(y)\, U(v_k)
$$
within the combined confidence radii. If the test fails, we split C, because it likely intersects a
boundary between regions.
Compute an upper confidence bound for the best value in
C. When C passes the affinity test, we can
fit an affine model to the anchor estimates and (approximately) optimize
it over C ∩ 𝒳 to obtain a
candidate maximizer x̃(C) and an upper bound
UCB(C). When C does not pass (or has not yet been
tested), we can always fall back on a conservative bound such as
UCB(C) = maxv ∈ Anchors(C)(Û(v) + rad(v)) + L diam(C).
Eliminate suboptimal cells: if a cell’s upper bound falls below another cell’s lower bound by more than a tolerance, discard it from the active set. This is the successive-elimination step.
The economic interpretation mirrors a standard “pilot then scale” workflow: we evaluate a small menu of promising contract neighborhoods with enough statistical power to rule out those that cannot be optimal, and we progressively concentrate evaluation budget on the remaining plausible incentive regimes.
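A compressed sketch of the elimination logic is given below, with implementability regions abstracted into a finite set of candidate contracts (for example, the representative points x̃(C) of surviving cells); the true values, the suite sizes, and the confidence constants are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)

true_U = np.array([0.30, 0.42, 0.55, 0.48, 0.20])   # hidden values of the candidate contracts

def query(k, M):
    """Evaluation-suite oracle for candidate k: unbiased estimate with radius ~ 1/sqrt(M)."""
    return true_U[k] + rng.normal(0, 1.0 / np.sqrt(M)), 2.0 / np.sqrt(M)

active = set(range(len(true_U)))
for round_ in range(6):
    stats = {k: query(k, M=200 * 2**round_) for k in active}        # sharpen every surviving candidate
    best_lcb = max(est - rad for est, rad in stats.values())
    active = {k for k, (est, rad) in stats.items() if est + rad >= best_lcb}   # drop provably worse ones
    if len(active) == 1:
        break
print("surviving candidate(s):", sorted(active))
```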
With sufficiently accurate per-query estimates, the exploration–exploitation tradeoff changes. Instead of spending many rounds to average out noise at each point, we can reduce noise quickly via suites and run an elimination-style algorithm whose dominant cost is essentially the number of alternatives that remain plausible.
Formally, in the stronger feedback model where each query returns an
unbiased estimate with sub-Gaussian scale σ, we can achieve regret on the
order of
$$
\mathrm{Reg}(T) \;=\; \tilde O\!\big(\sigma \sqrt{K T}\big),
$$
where the Õ(⋅) hides
polylogarithmic factors (confidence parameters, mild dependence on d, and bookkeeping for adaptive
partitioning). The key point is the $\sqrt{T}$ dependence: once value estimates
are cheap, the problem behaves like learning the best among ≈ K structured candidates rather
than like a continuum-armed bandit.
If the stronger feedback comes from a per-round suite of size M, then $\sigma \asymp 1/\sqrt{M}$ and the bound
becomes
$$
\mathrm{Reg}(T) \;=\; \tilde O\!\Big(\sqrt{\tfrac{K T}{M}}\Big),
$$
highlighting a direct tradeoff between evaluation effort and learning
speed.
Two qualifications matter for interpreting this regime as a prescription for real contracting problems.
Suites assume i.i.d. evaluation under a stable best response. To average M outcomes at the same x, we implicitly assume the agent’s induced action distribution is stable across the M instances and that the instances are comparable (no hidden selection effects). In platform settings, this may require randomization and careful experimental design; in offline evaluation, it may require strong assumptions to justify counterfactual estimates.
We shift burden from time to measurement/computation. The $\tilde O(\sigma\sqrt{KT})$ regret rate is attractive, but it presumes the principal can (i) run large evaluations at chosen x’s and (ii) maintain and optimize over candidate cells (often requiring an optimization oracle over 𝒳). Thus, stronger feedback improves statistical efficiency, but may increase operational complexity.
Within those limits, the message is clear: when the principal can cheaply approximate U(x), the learning problem is no longer bottlenecked by bandit noise, and the relevant complexity parameter becomes the number of implementability regimes K rather than the size of the outcome space m or the sheer richness of 𝒳.
So far we have treated the induced payoff surface U(⋅) as fixed across rounds. In
practice, this is often the wrong abstraction. The mapping from
contracts to outcomes may drift because the agent’s environment changes
(new tools, shifting demand, seasonality), because the population of
agents changes (turnover, composition effects), or because the agent
learns how to better exploit the contract template. Even if the
principal posts parameters xt from a fixed
compact set 𝒳, the effective function
we are optimizing can become time-indexed:
Ut(x) := 𝔼[ut ∣ xt = x],
where the expectation is taken under the (time-t) best response and outcome
distribution induced by x.
This section formalizes a minimal stability condition under which
learning remains possible, and it clarifies the robustness–adaptation
tradeoff that emerges when contracts are re-optimized in a drifting
environment.
Dynamic regret benchmarks the learner against a moving target,
$$
\mathrm{DReg}(T) \;=\; \sum_{t=1}^T \Big( \max_{x\in\mathcal X} U_t(x)
\;-\; \mathbb{E}[U_t(x_t)] \Big).
$$
Without any constraint on how {Ut}t = 1T
evolves, this benchmark is essentially unattainable: an adversarial
environment could make today’s maximizer uninformative about tomorrow’s,
forcing linear regret even with perfect computation. Economically, this
corresponds to a world where incentive constraints and technology can
change arbitrarily fast relative to the principal’s ability to
experiment—no learning rule can keep up.
Accordingly, we adopt a drift model that is intentionally agnostic about why primitives change (costs ci, distributions qi, even rewards r), and instead constrains the observable object that governs payoffs: the induced expected utility function Ut(⋅).
We measure nonstationarity via a sup-norm variation budget
$$
V_T \;:=\; \sum_{t=2}^T \sup_{x\in\mathcal X} \big|U_t(x) -
U_{t-1}(x)\big|.
$$
This definition has two advantages. First, it is directly tied to the
learning signal (payoffs) rather than to unobserved primitives; two very
different underlying changes are treated as equivalent if they produce
the same change in induced payoffs. Second, it allows drift that is
irregular in time—quiet periods punctuated by occasional jumps—so long
as the total movement is controlled.
We maintain our structural assumption each round: Ut is piecewise-affine over 𝒳 with at most K polyhedral regions and slope bound L. Importantly, we do not assume the partition itself is fixed. Boundaries between implementability regions may move over time as incentive comparisons change. The role of K is simply to upper bound per-round geometric complexity.
Conceptually, drift turns a one-shot learning problem into a sequence of “local” learning problems. When the environment is approximately stable for a while, we would like to run our static-regret procedure and harvest its guarantees; when the environment has moved substantially, we must pay to re-identify the best region and re-optimize within it.
A standard and robust way to operationalize this idea is a restarting algorithm: partition the horizon into blocks of length H and, at the start of each block, discard all past data and rerun the static-regret procedure from scratch within that block.
This procedure does not require us to observe the drift directly. The only design choice is the block length H: short blocks adapt quickly but waste samples re-exploring; long blocks exploit more but risk chasing a stale optimum.
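A sketch of the restart wrapper follows; the learner interface (choose/update), the toy ε-greedy learner over a one-dimensional grid, and the drifting environment are all hypothetical stand-ins for whatever static procedure and application one actually uses.

```python
import numpy as np

def run_with_restarts(make_learner, environment, T, H):
    """Restart a static-regret learner every H rounds so that stale estimates are discarded.
    make_learner() returns a fresh learner exposing choose() and update(x, payoff);
    environment(t, x) returns the realized payoff in round t. Both interfaces are hypothetical."""
    total, learner = 0.0, make_learner()
    for t in range(T):
        if t % H == 0:                       # block boundary: forget everything and re-explore
            learner = make_learner()
        x = learner.choose()
        payoff = environment(t, x)
        learner.update(x, payoff)
        total += payoff
    return total

class EpsilonGreedyKnob:
    """Toy static learner over a grid of one-dimensional contract parameters."""
    def __init__(self, grid=None, eps=0.2, seed=11):
        self.grid = np.linspace(0.0, 1.0, 11) if grid is None else grid
        self.eps, self.rng = eps, np.random.default_rng(seed)
        self.sums = np.zeros(len(self.grid)); self.counts = np.full(len(self.grid), 1e-9)
    def choose(self):
        explore = self.rng.random() < self.eps
        self.k = self.rng.integers(len(self.grid)) if explore else int(np.argmax(self.sums / self.counts))
        return self.grid[self.k]
    def update(self, x, payoff):
        self.sums[self.k] += payoff; self.counts[self.k] += 1

env_rng = np.random.default_rng(12)
def drifting_env(t, x):
    peak = 0.2 if t < 2500 else 0.8          # the best contract parameter shifts mid-horizon
    return 0.8 - abs(x - peak) + env_rng.normal(0, 0.1)

print(run_with_restarts(EpsilonGreedyKnob, drifting_env, T=5000, H=500))
```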
The proof strategy mirrors the economics of “pilot then re-pilot.” Within each block we compare performance to the best fixed contract for that block (a stability benchmark), and then we account for the gap between that blockwise benchmark and the fully dynamic optimum maxxUt(x). The variation budget controls the second gap: if {Ut} does not move much over the block, then the best fixed contract in that block is close to the best per-round contract.
Combining (i) the per-block static regret guarantee for our
piecewise-affine class and (ii) a drift-to-restart reduction yields a
dynamic regret bound of the form
DReg(T) = Õ(L (K d V_T)^{1/3} T^{2/3}),
where Õ(⋅) hides
polylogarithmic factors (confidence parameters and mild bookkeeping
terms). Two comparative statics are worth emphasizing.
When VT = 0 (stationarity), we recover the static-learning regime up to logs: the restart schedule can take H = T, and the bound reduces to the familiar T^{2/3} scaling with K and d. In other words, the drift extension does not penalize us when there is no drift.
When VT increases, the factor V_T^{1/3} captures the additional price of adaptation. Intuitively, higher variation forces more frequent “recalibration” of which incentive regime is currently optimal. This is exactly the robustness–learning tension: a learner that optimizes aggressively for the current regime must be ready to pay again when the regime changes.
It is also useful to interpret the rate in terms of time scales. A small VT means that the environment only drifts materially over long horizons, so the principal can afford longer blocks (more exploitation per relearning episode). A large VT compresses the horizon over which information remains valid, shrinking the effective sample size for identifying the best region.
Because VT is defined directly on Ut, we do not need to specify whether drift comes from qi, from ci, or from changes in the reward mapping r. This abstraction is deliberate: in many applications, the principal cannot separately identify these components anyway. At the same time, it is a limitation: the model allows drift that may be economically implausible (payoffs changing in ways that could not arise from any reasonable change in agent technology), and it does not attempt to exploit additional structure that might exist in the primitives (e.g., a parametric trend in costs).
The key maintained restriction is per-round implementability complexity: each Ut is still a K-piece piecewise-affine function with slopes bounded by L. Thus the outcome alphabet size m remains irrelevant for learning rates; drift makes learning harder by changing which piece is active, not by increasing the dimensionality of outcome distributions.
From a contracting perspective, the variation-budget result formalizes an often implicit operational norm: contracts are revised periodically, not continuously. A firm might recalibrate sales commissions quarterly, a platform might update its scoring-and-payment rule monthly, and a regulator might adjust procurement scoring rules annually. The bound above gives a stylized justification: in a drifting environment, continually chasing the estimated optimum can be wasteful because the optimum itself is moving, yet never updating leaves persistent money on the table.
There is also a subtler point about robust contracts. When VT is large, the dynamic benchmark maxxUt(x) is extremely demanding—it rewards rapid adaptation even if this requires heavy exploration. In contrast, a principal may prefer a contract that is slightly suboptimal today but performs reasonably well across plausible regimes (a kind of maximin or worst-case criterion). Our framework does not hard-code such preferences; instead it quantifies the statistical cost of pursuing the moving optimum. This makes explicit a practical design lever: one can trade off responsiveness (lower bias to current conditions) against stability (lower variance and fewer costly restarts).
Finally, two implementation issues deserve acknowledgment.
Choosing H without knowing VT. The cleanest tuning uses VT (or an upper bound) to select a block length. If VT is unknown, one can use standard meta-techniques—e.g., a grid of candidate block lengths combined with expert aggregation, or a doubling-style schedule that gradually increases restart frequency when performance degrades. These add logarithmic factors but preserve the essential scaling.
Endogenous nonstationarity. Our drift model treats $\{U_t\}$ as exogenous. In many real systems, however, the principal's learning and the agent's adaptation are intertwined: agents may change behavior in response to observed experimentation, or the population may sort in or out under different contracts. The variation budget can be read as a reduced-form way to bound these effects, but it does not explain them. Where such strategic feedback loops are central, one would want a richer dynamic model; our claim here is narrower—if the induced payoff surface does not move too much in total, then the principal can track it with provable dynamic-regret guarantees.
Taken together, the variation-budget extension shows that our geometric learning picture is not confined to fully stationary settings. As long as the induced payoff function evolves with bounded total movement, we can remain near the moving optimum at a cost that scales transparently with implementability complexity K, dimension d, slope bound L, and drift magnitude $V_T$.
Our upper bounds make a strong claim: once we restrict attention to an affine template t(x) = Ax + b, the relevant statistical complexity is not the size of the outcome alphabet m, nor even the number of underlying actions |𝒜|, but the implementability complexity K—the number of best-response regions induced on 𝒳. A natural question is whether this is merely an artifact of our analysis and algorithms. In this section we explain why it is not: even in one dimension, and even for extremely benign-looking payoff functions, any learner with only noisy payoff feedback must incur regret scaling like $K^{1/3}T^{2/3}$ (up to logarithmic factors). We then show how to embed such hard instances inside our principal–agent model using only a constant number of outcomes, which cleanly separates “hardness due to incentives” (captured by K) from “hardness due to telemetry richness” (captured by m).
The core difficulty is that when U(⋅) is piecewise-affine with K pieces, it can contain K distinct regimes separated by unknown boundaries. If the principal knew the partition, learning would be much easier: one would optimize a single affine function on the relevant cell. But the partition is induced by the agent's (unobserved) best response, and therefore must be inferred from payoff data. In the bandit feedback model (one realized payoff $u_t$ per round), inferring which regime we are in is informationally similar to identifying which “arm” is best among K alternatives when the identity of the arm depends on where we query.
A useful mental picture is the following 1D special case. Let 𝒳 = [0, 1] and suppose U(x) is piecewise-constant on K disjoint intervals $I_1, \dots, I_K$ that partition [0, 1]: $U(x) = \mu_k$ for $x \in I_k$, with $\mu_k \in [0, 1]$. Choosing $x_t$ simply selects an interval $I_k$ and yields a noisy sample of $\mu_k$. The learner does not know the $\mu_k$'s nor which interval contains which mean (since the interval boundaries themselves are unknown). This is at least as hard as a K-armed bandit problem, but with the extra complication that exploration must also localize the intervals. Lower bounds for such “regime identification” problems precisely produce the $K^{1/3}T^{2/3}$ scaling: we can neither avoid exploring many candidate regimes (a K effect) nor avoid averaging enough samples within a regime to overcome noise (a T effect).
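The following minimal Python sketch instantiates this picture: a piecewise-constant payoff on K equal-length intervals, one of which is better by a gap, observed only through single noisy payoffs. The gap size, noise level, and interval layout are illustrative assumptions, not part of the formal construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hard_instance(K, gap, rng):
    """K equal-length intervals on [0,1]; one (hidden) interval has mean higher by `gap`."""
    good = rng.integers(K)
    means = np.full(K, 0.5)
    means[good] += gap
    return means

def pull(x, means, rng, noise=0.1):
    """Bandit feedback: choosing x reveals one noisy payoff U(x) + eta, nothing else."""
    K = len(means)
    k = min(int(x * K), K - 1)        # which interval x falls into
    return means[k] + noise * rng.standard_normal()

# Example: the learner only sees scalar payoffs, so telling the good interval apart
# from the others requires on the order of 1/gap^2 samples per candidate interval.
K, T = 10, 5000
gap = (K / T) ** (1 / 3)              # the scale at which instances become hard
means = make_hard_instance(K, gap, rng)
ys = [pull(rng.random(), means, rng) for _ in range(20)]
```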
We state a representative lower bound in a simplified statistical model that captures the feedback structure we face: on each round t, the learner chooses $x_t \in [0, 1]$ and observes a noisy payoff $y_t = U(x_t) + \eta_t$, where $\eta_t$ is mean-zero sub-Gaussian noise (or simply bounded noise). The benchmark is $\max_{x \in [0, 1]} U(x)$.
Consider the class ℱ(K, L) of functions U : [0, 1] → ℝ that are piecewise-affine with at most K pieces and slope bounded by L. A standard construction (which we can think of as “K disjoint wells with one slightly higher than the others”) yields, up to logarithmic factors,
$$
\inf_{\text{alg}} \; \sup_{U \in \mathcal{F}(K, L)} \; \mathbb{E}[\mathrm{Reg}(T)] \;\ge\; \Omega\!\left(L\, K^{1/3} T^{2/3}\right).
$$
The proof idea is more important than the exact constants. We choose K disjoint intervals of equal length and define U to be (nearly) flat on each interval. One interval—unknown to the learner—is “good” with mean larger by a gap Δ, and the remaining K − 1 intervals are “bad.” If the learner fails to spend enough rounds sampling from the good interval, it suffers regret roughly Δ per round. But to reliably tell which interval is good under noise, the learner needs on the order of $1/\Delta^2$ samples from each plausible interval. Balancing these two forces gives the usual bandit-style tradeoff:
$$
\text{exploration cost} \;\approx\; \frac{K}{\Delta^2},
\qquad
\text{exploitation cost} \;\approx\; \Delta T.
$$
Optimizing over Δ yields $\Delta \asymp (K/T)^{1/3}$ and regret $\asymp K^{1/3}T^{2/3}$.
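For completeness, the balancing step can be written out explicitly (constants suppressed):
$$
\frac{d}{d\Delta}\Bigl(\frac{K}{\Delta^{2}} + \Delta T\Bigr) \;=\; -\frac{2K}{\Delta^{3}} + T \;=\; 0
\;\;\Longrightarrow\;\;
\Delta \asymp \Bigl(\frac{K}{T}\Bigr)^{1/3},
\qquad
\frac{K}{\Delta^{2}} \;\asymp\; \Delta T \;\asymp\; K^{1/3} T^{2/3}.
$$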
Two remarks are worth stressing.
The lower bound already holds for piecewise-constant functions, which are a special case of piecewise-affine functions (take slope 0 on each cell). Thus, allowing nonzero slopes does not create the hardness; the hardness comes from the unknown regime boundaries.
The bound does not rely on high dimension: it holds even when d = 1. Higher-dimensional parameter spaces can only make the inference problem harder, but d is not the driver of the $T^{2/3}$ phenomenon.
This argument is also the reason we should not expect $\sqrt{T}$-type regret under single-sample payoff feedback without stronger structural assumptions (e.g., global concavity, Lipschitz continuity without kinks, or a margin condition guaranteeing that regime boundaries are rarely near-optimal). In our setting, kinks are endogenous outcomes of incentive constraints, and in the worst case they can appear anywhere.
The previous lower bound is stated directly in terms of payoff functions U(⋅). To justify our emphasis on K as an implementability parameter, we also want to know that such hard U(⋅) can actually arise from a principal–agent environment with affine contracts and best responses.
This embedding is conceptually straightforward. In one dimension (d = 1), the agent's expected utility for action i under parameter x is affine:
$$
U_A(i \mid x) \;=\; T_i(x) - c_i \;=\; (q_i \cdot A)\, x + (q_i \cdot b - c_i).
$$
Thus each action corresponds to a line in x, and the best response $i^*(x)$ is the upper envelope of these lines. By choosing a family of actions whose lines form an upper envelope with K segments, we induce exactly K best-response regions. On the interior of each region, the principal's expected utility is affine as well:
$$
U(x) \;=\; R_{i^*(x)} - q_{i^*(x)} \cdot (Ax + b).
$$
By selecting rewards and payments so that U is nearly flat on each region and differs slightly across regions, we can match the “one good interval” construction above. Importantly, none of this requires a large outcome space: we can realize distinct slopes $q_i \cdot A$ even with m = 2 outcomes by varying the Bernoulli probability $q_{i1} \in [0, 1]$, since $q_i$ is then determined by a single scalar. In other words, the lower bound can be implemented with constant m, while K grows through the geometry of best responses.
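A small Python sketch makes the envelope picture concrete: given hypothetical $(q_i, c_i)$ with m = 2 outcomes and a one-dimensional template, it computes the agent's best response on a grid and counts the induced regions. The specific numbers are illustrative.

```python
import numpy as np

def best_response_regions(q, c, A, b, xs):
    """For d = 1: each action i gives the agent an affine utility
    (q_i . A) x + (q_i . b - c_i); the best response is the upper envelope.
    Returns the number of distinct best-response regions along the grid xs."""
    slopes = q @ A                                   # shape (n_actions,)
    intercepts = q @ b - c
    utilities = np.outer(xs, slopes) + intercepts    # (n_grid, n_actions)
    br = utilities.argmax(axis=1)                    # best response at each x
    return 1 + int(np.count_nonzero(np.diff(br))), br

# Illustrative instance with m = 2 outcomes: q_i = (p_i, 1 - p_i) is Bernoulli,
# so each action's slope is determined by the single scalar p_i.
p = np.array([0.1, 0.4, 0.7, 0.95])
q = np.column_stack([p, 1 - p])
c = np.array([0.00, 0.05, 0.15, 0.30])               # costs increasing in "effort"
A = np.array([1.0, 0.0])                             # pay only on outcome 1: t(x) = A x + b
b = np.array([0.0, 0.0])

xs = np.linspace(0.0, 1.0, 2001)
K_est, br = best_response_regions(q, c, A, b, xs)
print("grid-estimated number of best-response regions K =", K_est)
```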
This is the cleanest sense in which the hardness is about incentive-induced non-smoothness rather than about estimating a high-dimensional distribution over outcomes. The agent's private information ($c_i$ and $q_i$) only matters insofar as it shapes where the envelope switches, i.e., how many distinct implementability cells exist and how hard they are to localize.
Because the hard instance can be implemented with m = 2, any regret lower bound that grows with m cannot be fundamental for our template-based problem—it would reflect a different modeling choice (for example, insisting on learning $q_i$ in full, or allowing contracts to condition arbitrarily on all m outcomes). Our results—and the lower bound above—support a sharper message:
If the template t(x) = Ax + b compresses outcomes into a low-dimensional statistic, then learning rates should depend on the induced geometric complexity K, not on the raw number of bins m.
Conversely, increasing m does not necessarily make learning harder unless it increases the number of implementability regions (for example, by allowing the contract to respond to many outcomes in ways that create many distinct incentive comparisons).
This distinction matters in practice. A platform might collect thousands of telemetry features, but if it commits to paying as a function of a low-dimensional score, then the relevant learning problem is governed by the score’s induced regime structure, not by the full feature space.
Lower bounds are ultimately design guidance: they tell us which “knobs” cannot be turned away by better algorithms. Here, the unavoidable dependence on K says that overly expressive templates can be statistically expensive, even if they look innocuous from a mechanism-design perspective. Each additional degree of freedom in x and each additional way of mapping outcomes into payments can create new best-response regions, increasing K and thereby increasing the exploration cost required to locate the optimal regime.
That does not mean “simpler is always better.” A richer template may raise $\max_{x \in \mathcal{X}} U(x)$ by enabling better incentives; the point is that richer templates also raise the price of learning this optimum from data. In operational terms, we should read K as a form of endogenous capacity control: it measures how many qualitatively distinct behavioral responses the template can induce. The next section takes this perspective seriously and discusses how one might choose templates to keep K small (or at least controlled) while retaining enough flexibility to perform well—connecting implementability complexity to more familiar notions of statistical capacity such as pseudo-dimension.
The lower bound in the previous section is not merely a pessimistic statement about bandit feedback; it is a design constraint. Once we commit to learning a contract from data—rather than computing it from a fully specified model—the contract class itself becomes an object of statistical choice. Our results suggest that the right “complexity knob” is not the outcome alphabet size m nor the number of latent actions |𝒜|, but the implementability complexity K: the number of best-response regions induced on 𝒳 by the affine template t(x) = Ax + b. Put differently, a template is not only a mapping from outcomes to payments; it is also a mapping from the agent's private model $(q_i, c_i)$ into a particular geometry of kinks in U(⋅). Learning performance is governed by that geometry.
Contract designers typically reason in terms of expressiveness: a richer template (larger d, more degrees of freedom in A, more outcome-contingencies) can implement a wider set of incentives and potentially increase $\max_{x \in \mathcal{X}} U(x)$. Our regret bounds emphasize a second, orthogonal axis: learnability. Richer templates can increase K, which raises the exploration cost required to identify the right regime. This generates an endogenous bias–variance style tradeoff: richer template ⇒ higher attainable optimum, but larger K ⇒ slower learning.
The point is not that one should minimize K in isolation. Rather, K is the “price of adaptation”: if we anticipate limited data, short horizons, or noisy outcomes, then a template that keeps K controlled may dominate a more expressive alternative even if the latter has a higher first-best value.
This perspective aligns with practical contracting constraints. Platforms and organizations rarely pay on a fully outcome-contingent vector in $\mathbb{R}^m$; they commit to a scorecard, a rubric, or a small set of performance indicators. Our framework gives a sharp reason: low-dimensional templates compress the incentive comparisons the agent can exploit, thereby compressing the number of distinct best-response regimes the principal must learn.
A helpful way to operationalize “keep K small” is to revisit what drives best-response boundaries under affine templates. Incentive comparisons depend on the projected statistics $w_i := A^{\top} q_i \in \mathbb{R}^d$ and the intercept terms $q_i \cdot b - c_i$. When many actions share the same (or similar) $w_i$, the template makes them hard to distinguish from the agent's perspective, and the upper envelope has fewer distinct linear forms competing. In the most extreme case, if A projects all $q_i$'s onto a one-dimensional score and that score takes only a few distinct values across economically relevant actions, then K is small even if |𝒜| is large.
This suggests a concrete design heuristic: choose A to emphasize outcome directions that are (i) strongly aligned with the principal's rewards r and (ii) likely to separate “good” from “bad” behavior, but avoid adding directions that merely create fine-grained distinctions among many near-equivalent behaviors. Since the principal does not know $\{q_i\}$, this cannot be done perfectly; still, domain knowledge often provides coarse priors. For example, in fraud prevention, paying on a small number of robust fraud signals may induce only a few qualitatively different response regimes, whereas paying on hundreds of correlated telemetry features can create many knife-edge switches among manipulation strategies, inflating K without commensurate gains in value.
Formally, our arrangement-style bound $K \le O(M^d)$, where M is the number of distinct $w_i$'s, should be read as guidance: it is not the raw action count that matters, but how many distinct incentive-relevant projections the template creates. A template that is expressive in the ambient outcome space can still be learnable if it induces a small M (and hence small K) over the relevant action set.
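As a rough illustration of this reading, the following sketch counts the distinct projected statistics $w_i = A^{\top} q_i$ for a hypothetical action set in which many actions differ only in outcome dimensions the score ignores. The instance, the rounding tolerance, and the sizes are assumptions.

```python
import numpy as np

def count_distinct_projections(A, Q, tol=1e-6):
    """Count M = number of distinct projected statistics w_i = A^T q_i (rows of Q are q_i).
    Only distinct projections create distinct incentive comparisons; with d template
    parameters the arrangement-style bound then reads K = O(M^d)."""
    W = Q @ A                                  # shape (n_actions, d)
    W_rounded = np.round(W / tol) * tol        # merge numerically equal projections
    return np.unique(W_rounded, axis=0).shape[0]

# Illustrative: many actions that differ only in outcome dimensions the score ignores.
rng = np.random.default_rng(1)
m, n_actions, scored = 50, 100, 10             # the score weights only the first 10 outcomes
levels = np.array([0.2, 0.5, 0.8])             # 3 possible total masses on scored outcomes
Q = np.zeros((n_actions, m))
for i in range(n_actions):
    mass = levels[i % 3]
    Q[i, :scored] = mass / scored                                  # uniform over scored outcomes
    Q[i, scored:] = rng.dirichlet(np.ones(m - scored)) * (1 - mass)
A = np.concatenate([np.ones(scored), np.zeros(m - scored)]).reshape(m, 1)  # d = 1 score
M = count_distinct_projections(A, Q)
print("distinct incentive-relevant projections M =", M)   # 3, despite 100 actions and m = 50
```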
A common intuition is that collecting more telemetry (larger m) should make contracting easier because it gives the principal more signals to condition on. Our results cut against that intuition in a precise way. Telemetry only helps if it is used to build a template that improves payoffs without exploding K. If additional features simply provide more levers for the contract to create finely separated expected payments $T_i(x) = q_i \cdot (Ax + b)$, then they can proliferate best-response regions, leading to slower learning under bandit feedback.
This is a cautionary point for practice: “high-dimensional scoring” can be counterproductive when contracts must be learned online. It may raise the theoretical optimum, but it also makes the induced payoff surface more non-smooth and thus statistically more expensive to navigate. Conversely, committing to a simple score—say, a scalar $s(j)$ and an affine payment $t(j) = \alpha s(j) + \beta$—often keeps K manageable, and may therefore deliver higher realized welfare over a finite horizon.
Although K is an implementability object (a property of the best-response partition), it is closely connected to familiar capacity measures from statistical learning. One way to see this is to view the induced payoff U(⋅) as belonging to a function class 𝒰 consisting of piecewise-affine functions over 𝒳 with at most K pieces and bounded slopes L. Such classes have pseudo-dimension growing with both the ambient parameter dimension d and the number of pieces K. Informally, each piece contributes degrees of freedom (an affine form), while the unknown placement of boundaries adds combinatorial richness. This is exactly the mixture of “parametric” and “combinatorial” complexity that shows up in our regret rates through factors like $(d \log K)^{1/3} K^{1/3}$.
The practical lesson is that implementability complexity functions like a capacity control for contracts. Template design choices that increase K expand the effective hypothesis class the principal must search over using noisy payoff samples. This perspective also suggests why naive structural risk minimization over very rich templates can fail under bandit feedback: the cost of identifying which region one is in is not captured by within-region smoothness alone.
At the same time, the link to pseudo-dimension is constructive. It implies that we can borrow tools from learning theory—regularization, nested classes, and complexity-penalized model selection—to choose between templates. For instance, one can organize templates $\{\mathcal{T}_\ell\}_{\ell=1}^{L}$ of increasing expressiveness (and plausibly increasing $K_\ell$), and allocate exploration to identify the best level ℓ given the horizon T. Even rough upper bounds on $K_\ell$ can serve as penalties that prevent overfitting through excessive regime complexity.
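As a stylized illustration (not an algorithm we analyze), one could score nested templates by a pilot value estimate minus a regret-style penalty built from crude upper bounds on $K_\ell$. The penalty form mirrors the rates above expressed per round; the constant and all inputs below are assumptions.

```python
import math

def select_template(value_estimates, K_upper, dims, T, c=1.0):
    """Pick the template level whose estimated value minus a regret-style penalty is largest.

    value_estimates[l]: pilot estimate of max_x U(x) under template l (e.g., from a short
    exploration phase); K_upper[l]: an upper bound on its implementability complexity;
    dims[l]: its parameter dimension. The per-round penalty mirrors the
    (d log K)^{1/3} K^{1/3} T^{2/3} scaling; the constant c is an assumption.
    """
    def penalty(l):
        K, d = K_upper[l], dims[l]
        return c * ((d * math.log(max(K, 2))) ** (1 / 3)) * (K ** (1 / 3)) * (T ** (-1 / 3))
    scores = [value_estimates[l] - penalty(l) for l in range(len(value_estimates))]
    return max(range(len(scores)), key=scores.__getitem__)

# Example: the richest template promises a slightly higher optimum but far more regions,
# so a middle level wins at this horizon.
best = select_template(value_estimates=[0.50, 0.63, 0.64],
                       K_upper=[4, 10, 4000], dims=[1, 2, 6], T=20_000)
print("chosen template level:", best)
```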
We distill several design principles that follow from the theory.
First, prefer low-dimensional parameterizations unless there is a clear incentive reason to expand d. Increasing d improves expressiveness, but it also increases the space in which hyperplane arrangements can create many regions. In settings where online adaptation is critical, a small number of interpretable parameters (a scalar intensity, a few tier thresholds, or a small vector of weights on stable metrics) often provides a better learnability–performance tradeoff.
Second, avoid templates that create many near-ties among actions. When the agent’s utility lines are close, small changes in x can trigger frequent action switches, effectively increasing K in the portion of 𝒳 the learner explores. Practically, this motivates “smoother” contract families—e.g., restricting weight magnitudes, limiting the number of outcome bins that receive distinct payments, or imposing monotonicity/shape constraints on t(⋅)—not because smoothness is intrinsically good, but because it reduces the proliferation of best-response boundaries.
Third, design around robust regimes, not razor-thin optima. Even if worst-case theory allows kinks anywhere, many applications admit margin-like regularities: the optimal contract may lie well inside a region where the best response is stable. Template choices that enlarge such stable interiors (for example, by compressing irrelevant telemetry dimensions) can dramatically reduce the effective difficulty, even if worst-case K remains large.
Finally, treat template choice as a joint economic–statistical decision. If a rich template is necessary to implement a materially better action (higher $R_i$ at acceptable payments), it may be worth paying the learning cost. But when incremental expressiveness merely reshuffles behavior among many similar actions, it is likely to raise K without raising value, and should be avoided.
We emphasize that K is typically unknown ex ante because it depends on the agent's latent $(q_i, c_i)$. Template design therefore cannot literally “minimize K” in a vacuum; instead, it should aim to control worst-case or plausible-case regime complexity using structural restrictions and domain knowledge. Moreover, our analysis is tailored to single-agent, single-contract interactions with hidden actions and bandit feedback. Once we allow menus of contracts, audits, multiple agents, or richer feedback channels, the mapping from template design to learnability can change qualitatively.
These considerations point directly to the broader agenda: understanding which modeling enrichments genuinely mitigate the hidden-action feedback bottleneck, and which simply move complexity elsewhere. We turn to these questions next.
Our analysis highlights a stark but useful message: when the principal observes only realized outcomes (and hence only realized payoffs), the statistical difficulty of contract learning is driven by behavioral discontinuities—the kinks induced by the agent’s best response—rather than by the raw size of the outcome space. This section discusses what we see as the main limitations of the hidden-action feedback model, and outlines several extensions that are both economically natural and technically open.
The core friction is not that the principal faces noise per se, but that the principal faces endogenous noise generated by unobserved action switching. Even if outcomes are perfectly observed, the mapping x ↦ j is mediated by $i^*(x)$, which can change abruptly. This places our setting closer to partial-monitoring and piecewise-structured bandits than to smooth stochastic optimization.
A useful way to phrase the limitation is: the principal does not get local comparative statics “for free.” In a smooth model, trying x and x′ gives information that can be interpolated across the neighborhood. Here, a small perturbation can move us to a different best-response region, so local interpolation is hazardous exactly where the optimum tends to sit (near binding incentive constraints). This is why the worst-case regret remains on the order of $T^{2/3}$: the principal must spend nontrivial exploration budget locating the relevant region before it can even exploit the affine structure inside it.
An open theoretical direction is to identify economically interpretable conditions under which the bottleneck disappears. Two candidates are particularly natural.
Margin (robustness) conditions. Suppose there exists an optimal contract $x^\star$ such that the agent's best response is separated by a uniform gap:
$$
\bigl(q_{i^\star} \cdot t(x^\star) - c_{i^\star}\bigr) \;-\; \max_{i \neq i^\star}\bigl(q_i \cdot t(x^\star) - c_i\bigr) \;\ge\; \gamma.
$$
Intuitively, if γ is large, then nearby experimentation does not change behavior, and the environment becomes locally “smooth.” Can one convert such a margin into fast rates (e.g., $\tilde O(\sqrt{T})$) while still allowing worst-case kinks elsewhere? (A small computational sketch of this margin check appears after the next item.)
Smoothed (perturbed) instances. In many practical settings, costs and outcome distributions are not adversarially aligned to create knife-edge ties across a large fraction of 𝒳. A smoothed analysis in which $(q_i, c_i)$ are perturbed may imply that “effective” K is much smaller than worst-case arrangement bounds, yielding substantially better typical performance. Formalizing this would help reconcile pessimistic minimax rates with the empirical success of simple scoring rules.
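As referenced above, here is a minimal diagnostic for the margin condition: given hypothesized primitives $(q_i, c_i)$ and a candidate $x^\star$, it computes the gap γ between the best and second-best action. It is purely illustrative, since in the learning problem these primitives are unobserved.

```python
import numpy as np

def best_response_margin(x_star, q, c, A, b):
    """Margin gamma at x_star: agent utility of the best action minus the best alternative.
    q: (n_actions, m) outcome distributions; c: (n_actions,) costs; t(x) = A x + b.
    Purely diagnostic: in the learning problem (q, c) are not observed."""
    t = A @ x_star + b                      # payment vector over outcomes
    u = q @ t - c                           # agent utility per action
    order = np.argsort(u)
    return u[order[-1]] - u[order[-2]]      # gap between best and second-best action

# Illustrative check on an m = 2, d = 1 instance like the embedding above.
p = np.array([0.1, 0.4, 0.7, 0.95])
q = np.column_stack([p, 1 - p]); c = np.array([0.0, 0.05, 0.15, 0.30])
A = np.array([[1.0], [0.0]]); b = np.zeros(2)
gamma = best_response_margin(np.array([0.8]), q, c, A, b)
print("margin at x* = 0.8:", round(float(gamma), 3))
```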
Many real principals do not post a single contract but a menu: a set $\{x^{(1)}, \dots, x^{(M)}\} \subset \mathcal{X}$, after which the agent selects a contract and then an action. Menus can strictly improve implementability: they allow screening across heterogeneous agents, and even for a single agent they can separate incentive constraints that collide under a single contract.
From a learning perspective, however, menus create a new complexity object. The agent's decision rule becomes
$$
(\ell^*, i^*) \;\in\; \arg\max_{\ell \in [M],\, i \in \mathcal{A}} \bigl(q_i \cdot t(x^{(\ell)}) - c_i\bigr),
$$
so the induced payoff is piecewise-affine over $\mathcal{X}^M$, but with a combinatorial explosion in potential regions (now indexed by contract–action pairs). Whether menus reduce or increase the relevant complexity is ambiguous: a menu could collapse many actions into a few stable regimes (helping learning), or it could introduce new boundaries due purely to contract choice (hurting learning).
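A minimal sketch of the agent's joint contract–action choice makes this combinatorial structure explicit; the menu and primitives below are illustrative.

```python
import numpy as np

def menu_best_response(menu, q, c, A, b):
    """Agent picks (contract index l, action i) maximizing q_i . t(x^(l)) - c_i.
    menu: (M, d) array of posted parameters; q: (n_actions, m); c: (n_actions,)."""
    payments = menu @ A.T + b                  # (M, m): payment vector of each menu element
    U = payments @ q.T - c                     # (M, n_actions): agent utility per (l, i) pair
    l_star, i_star = np.unravel_index(np.argmax(U), U.shape)
    return int(l_star), int(i_star)

# Illustrative m = 2, d = 1 instance with a two-element menu.
p = np.array([0.1, 0.4, 0.7, 0.95])
q = np.column_stack([p, 1 - p]); c = np.array([0.0, 0.05, 0.15, 0.30])
A = np.array([[1.0], [0.0]]); b = np.zeros(2)
menu = np.array([[0.2], [0.9]])                # two posted contracts x^(1), x^(2)
print(menu_best_response(menu, q, c, A, b))
```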
We view the main open problem as defining and bounding an appropriate “menu implementability complexity” $K_M$, and relating it to achievable regret. A promising direction is to treat each menu element as creating a “candidate regime,” and to analyze when increasing M reduces the number of regimes that compete near the optimum. Conceptually, menus may act like a discretization of 𝒳 chosen endogenously by the principal, suggesting a connection to adaptive discretization and hierarchical bandits.
Organizations often possess costly mechanisms—audits, spot checks, manual reviews, or instrumented test tasks—that partially reveal the agent’s action or intermediate signals. These mechanisms are central in practice precisely because pure outcome-based incentives are manipulable.
A natural extension is an audit-augmented model: in each round, the principal chooses $x_t$ and also whether to audit, paying cost κ, and receiving additional feedback (e.g., the realized action $i_t$, or a noisy proxy). This creates an explicit economic tradeoff between incentive payments and information acquisition.
Technically, this resembles bandits with side observations or active learning with query costs. The central question becomes: how many audits are needed to reduce regret from $T^{2/3}$ to $\tilde O(\sqrt{T})$, and how should audits be scheduled? One conjecture is that audits are most valuable near suspected boundaries between best-response regions, where payoff observations are least informative. This suggests algorithms that maintain uncertainty sets over regions and allocate audits to resolve ambiguity only when it affects the argmax of U(⋅). A clean theory would characterize regret as a function of an audit budget B (or total audit cost $B\kappa$), interpolating between pure bandit learning and full-information region identification.
Our baseline model effectively treats the agent as fixed over time. In many platforms, the principal interacts with a population of agents, with types (q, c) drawn repeatedly. This changes the object being learned: rather than a single best-response partition, the principal faces an aggregation over types, leading to an expected utility
$$
\bar U(x) \;=\; \mathbb{E}_{(q, c)}\bigl[\, R_{i^*(x;\, q, c)} - q_{i^*(x;\, q, c)} \cdot t(x) \,\bigr].
$$
Even if each individual best response is discontinuous, the population average may be smoother, potentially allowing faster optimization. At the same time, heterogeneity introduces new concerns: a contract that performs well on average may have undesirable distributional effects (e.g., pushing some agents into inefficient actions), and fairness or participation constraints may bind.
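A simple Monte Carlo sketch of this population-averaged payoff, under an assumed type distribution, illustrates how averaging over heterogeneous best responses can smooth the induced surface; all primitives here are hypothetical.

```python
import numpy as np

def population_payoff(x, type_sampler, A, b, R, n_samples=2000, seed=0):
    """Monte Carlo estimate of U_bar(x) = E_{(q,c)}[ R_{i*} - q_{i*} . t(x) ],
    where i* is the sampled agent's best response to t(x) = A x + b.
    `type_sampler(rng)` returns (q, c) for one agent; R[i] is the principal's reward
    from action i. All primitives here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    t = A @ x + b
    total = 0.0
    for _ in range(n_samples):
        q, c = type_sampler(rng)            # q: (n_actions, m), c: (n_actions,)
        i_star = np.argmax(q @ t - c)       # this agent's best response to the contract
        total += R[i_star] - q[i_star] @ t
    return total / n_samples

# Illustrative heterogeneous population: costs are scaled by a random type factor.
p = np.array([0.1, 0.4, 0.7, 0.95]); base_c = np.array([0.0, 0.05, 0.15, 0.30])
q_fixed = np.column_stack([p, 1 - p]); R = np.array([0.1, 0.5, 0.9, 1.2])
sampler = lambda rng: (q_fixed, base_c * rng.uniform(0.5, 1.5))
A = np.array([[1.0], [0.0]]); b = np.zeros(2)
print(round(population_payoff(np.array([0.6]), sampler, A, b, R), 3))
```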
Open problems here include (i) deriving regret bounds for learning under i.i.d. or drifting populations, (ii) understanding whether the relevant complexity is the expected number of active regions under the type distribution rather than worst-case K, and (iii) incorporating participation and retention dynamics, where today’s contract changes tomorrow’s pool of agents.
In many applications (gig platforms, ad exchanges, procurement), principals compete. Agents can choose among principals, making the effective environment endogenous: a principal’s contract affects not only effort but also selection. This raises questions that are simultaneously economic (equilibrium and welfare) and statistical (learning under strategic feedback).
A basic model would let principals post $x_t^p$ and agents select a principal p and an action i. Then each principal observes outcomes only for participating agents, yielding censored, selection-biased feedback. The open challenge is to characterize when no-regret dynamics converge to meaningful equilibria, and how implementability complexity interacts with market thickness and competition. In practice, these settings are where template simplicity is often most valuable—precisely because competitive pressure limits the time available for exploration.
Finally, we see empirical validation as essential, because the central complexity parameter K is defined by an unobserved best-response partition. We propose a three-part empirical program.
Synthetic ground-truth experiments. Construct environments with known $(q_i, c_i)$, vary the template dimension d and the induced number of regions K, and compare learning curves across algorithms. Because K can be computed exactly in small instances (via arrangement enumeration or Monte Carlo boundary tracing), these experiments can directly test the predicted scaling in K and T, and can evaluate whether “effective K” is smaller under random perturbations (smoothed instances).
Semi-synthetic replay on observational data. In platform settings with logged outcomes and rewards, we can define a family of affine templates t(x) and use off-policy evaluation tools to estimate U(x) for a subset of x’s, then simulate online learning via replay. While this does not solve the hidden-action identification problem (the true best response remains latent), it allows us to test whether piecewise-affine approximations and region-based exploration capture real variation in payoffs.
Field tests with richer feedback variants. Where feasible, implement controlled “many-testcase” evaluations (or audit-like checks) for a small fraction of rounds to approximate unbiased estimates of U(x). The theory predicts a sharp improvement when such feedback is available. A practical design is to run a hybrid policy: mostly standard online learning from realized payoffs, with occasional intensive evaluation at candidate x’s to accelerate region elimination.
Across all three, we would treat K not as a fixed known constant but as a latent feature to be estimated or upper bounded from observed switching behavior and sensitivity to perturbations in x. This would also inform a practical template-selection pipeline: start with simple templates, expand expressiveness only when evidence suggests that the induced payoff surface is stable enough to justify richer structure.
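One concrete way to upper bound K from synthetic ground truth is boundary tracing along a slice of the parameter space: count how often the (computable) best response switches on a fine grid. The sketch below does this for an m = 2 instance like the one used earlier; on real data, switches would instead have to be inferred from payoff jumps or behavioral changes under small perturbations of x.

```python
import numpy as np

def trace_regions_1d(best_response, lo=0.0, hi=1.0, n_grid=4001):
    """Estimate the number of best-response regions along a 1D slice of the parameter
    space by counting switches of `best_response(x)` on a fine grid. In synthetic
    experiments the best response can be computed exactly; this serves as a grid-based
    upper-bounding procedure for K along the slice."""
    xs = np.linspace(lo, hi, n_grid)
    labels = np.array([best_response(x) for x in xs])
    return 1 + int(np.count_nonzero(labels[1:] != labels[:-1]))

# Synthetic ground truth: an m = 2 embedding whose regions we recover along x in [0, 1].
p = np.array([0.1, 0.4, 0.7, 0.95]); c = np.array([0.0, 0.05, 0.15, 0.30])
q = np.column_stack([p, 1 - p]); A = np.array([1.0, 0.0]); b = np.zeros(2)
br = lambda x: int(np.argmax(q @ (A * x + b) - c))
print("estimated K along the slice:", trace_regions_1d(br))
```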
In sum, the main conceptual limitation of hidden-action learning is the discontinuity induced by incentives. The main opportunity is that contract design itself can create or remove discontinuities—through menus, audits, population structure, and market institutions. Understanding these interactions is, in our view, the most promising route toward a contract theory that is simultaneously implementable, learnable, and empirically testable.