Digital platforms increasingly make allocation, pricing, and ranking decisions by learning from past interactions. A marketplace tunes reserve prices from realized bids; a gig platform adjusts surge multipliers from observed acceptance rates; an ad exchange updates relevance and pacing rules from click and conversion feedback. In each case, the platform’s decision rule is not fixed ex ante but evolves as an online algorithm processes a growing history of reports and outcomes. This adaptivity is operationally compelling, yet it reopens a foundational question in mechanism design: when agents understand that today’s report shapes tomorrow’s mechanism, what prevents strategic manipulation of the learning process?
The economic logic of such manipulation is straightforward. In a static mechanism, a deviation is evaluated round-by-round: an agent compares the immediate utility from truthful reporting to the immediate utility from misreporting. Under learning, that comparison becomes intertemporal. A seller might shade bids today to make the platform infer lower demand and reduce future fees; a bidder may occasionally “sacrifice” surplus to push a reserve price downward; a contractor may misstate availability to influence future matching rules. These behaviors are not pathological edge cases. They are precisely the kind of forward-looking distortions one expects when (i) the mechanism depends on past reports and (ii) agents repeatedly interact with the same platform and discount the future slowly enough that influencing the future is valuable.
A common reaction is to seek a clean separation: let the platform learn, but ensure the learner is “stable” so that any single report has limited impact on future decisions. This is more than an engineering heuristic; it is an economic commitment device. If a learner is sufficiently insensitive to any one agent’s report, then the value of manipulating the future is bounded. Recent work has formalized this idea using stability notions that are equivalent, or closely related, to differential privacy and max-divergence bounds. In our setting, we interpret such a bound as a quantitative cap on the gain an agent can obtain by changing one report while holding everything else fixed. Stability is therefore a natural first pillar: it makes “teaching the algorithm” costly in expectation because the algorithm refuses to learn too much from any one individual interaction.
However, stability alone is not a full incentive solution. Even if future influence is limited, an agent may still benefit from misreporting for immediate reasons (e.g., to receive a better allocation today), and the platform typically cannot observe true types to enforce truthfulness directly. In classical repeated games, one might propose punishments triggered by detected deviations; in repeated mechanism design, one might attempt to embed a commitment gap—a built-in rule that makes any misreport strictly worse than truth-telling by a fixed margin. This idea is powerful in discrete-type environments because one can often guarantee a uniform gap: if a report differs from the truth, the mechanism can impose (or threaten) a fixed expected loss relative to truthful play.
The difficulty is that many economically relevant type spaces are continuous. Values, qualities, delivery costs, and risk preferences are more naturally modeled as elements of Θ = [0, 1]d than as a small discrete set. And in continuous spaces, the very notion of a uniform penalty gap becomes fragile. If an agent can misreport by an arbitrarily small amount, then any penalty rule that is even mildly regular (or that respects limited observability) tends to generate vanishing differences between truthful and near-truthful reports. Put differently, the “distance” between two reports can be made so small that any feasible enforcement signal has negligible power to distinguish them. The familiar discrete-type logic—“any deviation triggers a detectable inconsistency”—breaks because there is no minimal deviation size. In practical terms, a platform cannot credibly commit to punishing every infinitesimal discrepancy between a reported value and an unobserved true value, especially when verification is noisy and costly.
This is where we believe a key design primitive for 2026 is neither pure commitment nor pure stability, but auditing. Audits formalize an increasingly common institutional reality: platforms and regulators can sometimes verify, ex post and imperfectly, whether a report was materially accurate. Marketplaces cross-check shipping times and cancellation reasons; labor platforms verify location and time-on-task; financial intermediaries verify identity and transaction provenance; advertising platforms conduct measurement audits and fraud checks. Crucially, these verifications are rarely perfect—signals are noisy, delayed, and partial—yet they can still support meaningful deterrence if used probabilistically and coupled with penalties that scale with detected discrepancies.
Our perspective is that audits resolve the continuous-type obstacle by shifting the goal from exact truthfulness to approximate, tolerance-based truthfulness. Instead of attempting to penalize every deviation, we select a tolerance level ε > 0 and aim to deter deviations of magnitude at least ε. Economically, ε captures the platform’s enforcement resolution: the smallest distortion that is both meaningful for outcomes and detectable given the available verification technology. Deviations smaller than ε are treated as within an acceptable margin of error—either because they are hard to detect, because punishing them would generate excessive false positives, or because the induced welfare loss is second-order. This reframes incentives in a way that is both theoretically tractable and institutionally plausible.
Audits also complement stability in a precise way. Stability limits how much an agent can gain by manipulating the learner’s behavior, but it does not impose an immediate expected cost on misreporting. Auditing does: when an audit occurs, the platform draws a verification signal (for example, z = θ + ξ) and applies a penalty based on the discrepancy between the report b and the signal z. If the penalty rule is chosen so that any deviation ∥b − θ∥ ≥ ε increases expected penalties by at least a gap β(ε) > 0, then misreporting beyond tolerance carries a quantifiable expected loss. When audits occur with probability λ, the expected enforcement intensity is λβ(ε). The central economic tradeoff is then transparent: we only need enough audit intensity to dominate the bounded dynamic gain implied by stability and the agent’s effective planning horizon.
This design logic helps explain why audits have become so salient in practice. First, they are modular: a platform can run its preferred allocation rule most of the time, while occasionally switching to an audit procedure without fully redesigning the system. Second, they are scalable: audits can be randomized and targeted, focusing limited verification resources where incentives are strongest. Third, they match the informational constraints of modern markets: platforms often cannot observe types directly, but they can often obtain noisy proxies (device telemetry, transaction records, third-party attestations) that are informative enough to distinguish gross misreports from honest noise. Finally, audits are legible to policy. Regulators increasingly require “reasonable” monitoring and enforcement (e.g., against fraud, misclassification, or manipulation) rather than demanding impossible perfect observability. A randomized audit scheme with explicit tolerances and penalty schedules is an implementable compliance object.
At the same time, we do not want to oversell audits as a free lunch. Audits are costly—not only operationally but also in terms of user experience and fairness. Noisy verification generates false positives: truthful agents may be penalized due to measurement error, and this can violate individual rationality or deter participation. Penalties may be legally capped or reputationally constrained. Moreover, audits can interact with privacy concerns: collecting verification signals may itself be sensitive, and the platform may need to balance enforcement against confidentiality. These constraints motivate our emphasis on probabilistic auditing (audits happen with small probability λ) and on tolerance (we only target deviations above ε), both of which reduce the frequency and severity of enforcement while still deterring economically meaningful manipulation.
Conceptually, our model illuminates a three-way frontier between (i) learnability and performance, (ii) incentive protection against dynamic manipulation, and (iii) enforcement costs (including false positives). Stability of the learning rule improves incentives but can slow adaptation; stronger auditing improves incentives but directly reduces welfare when an audit mechanism overrides the learner’s preferred choice or when truthful agents are mistakenly penalized; larger tolerance ε reduces false positives and makes detection easier, but permits more slack in reporting and can degrade allocations. The economic problem is therefore not to achieve an idealized exact incentive constraint, but to choose (λ, ε) (and audit design parameters) to achieve the best attainable performance subject to plausible enforcement.
A final motivation for our approach is methodological. Continuous types and adaptive mechanisms are now the default in many applications, yet much of the incentive analysis still relies on discrete approximations or on equilibrium concepts that do not directly capture “teaching the algorithm.” By explicitly combining a stability bound (interpretable as a limit on strategic influence over the learner’s output distribution) with an audit-induced effective penalty gap for deviations above ε, we obtain a simple and portable recipe: keep the learner η-stable, audit with probability λ, and choose these parameters so that the expected audit loss λβ(ε) dominates the bounded dynamic gain from manipulation. This recipe yields incentive guarantees that are robust to the realized sequence of types and objectives and that coexist with standard online learning regret analyses, thereby connecting strategic behavior to algorithmic performance in a single quantitative framework.
The remainder of the paper develops this logic formally and studies its implications for audit design and online regret. We then place our results in context. The next section discusses related work on strategic online pricing and auctions, differential privacy as a tool for incentives and stability, the weak-DP perspective on repeated interactions, and the extensive literature on verification and auditing in economic design.
Our setting sits at the intersection of online learning, dynamic mechanism design, and verification. Because each of these literatures uses different primitives and benchmarks, it is useful to be explicit about what we borrow and what we do not. Broadly, we see four strands as most closely connected: (i) strategic behavior in online pricing and auctions, (ii) stability–especially differential privacy (DP)–as an incentive tool for adaptive algorithms, (iii) the “weak DP” sequence perspective developed by Huh–Kandasamy and related papers on repeated interactions with long-sighted agents, and (iv) auditing and verification mechanisms in economic design. A fifth, more technical strand comes from stability theory in online convex optimization (OCO), which supplies convenient sufficient conditions for the stability bounds we assume.
A large literature studies online decision-making for pricing and auctions when the platform learns unknown demand or value distributions; canonical examples include dynamic pricing with bandit feedback and reserve-price learning in repeated auctions. Much of this work takes agents as myopic or non-strategic with respect to the learning rule: buyers respond to posted prices, bidders submit bids, and the platform optimizes regret against a static benchmark (e.g., the best fixed price or reserve in hindsight). These models capture an important operational reality—platforms do learn from data—but they treat the data-generating process as exogenous.
A more recent literature makes the endogeneity explicit: agents understand that their reports and actions affect the platform’s future decisions, and therefore may “teach” the algorithm. This concern appears in repeated auctions where bidders shade bids to reduce future reserve prices, in dynamic pricing where customers strategically delay purchases to influence future prices, and in broader settings where participants manipulate feedback signals (ratings, clicks, conversions) to steer learning. Across these domains, a recurring lesson is that standard regret guarantees can be misleading in strategic environments: an algorithm that is optimal against an exogenous sequence can perform poorly once the sequence is chosen by forward-looking agents. One response is to restrict the agent side (e.g., bounded rationality, limited foresight, or no commitment to intertemporal strategies). Another is to seek algorithmic properties that make manipulation unprofitable, which leads directly to stability.
Our contribution is not to propose a new auction format or a bespoke dynamic pricing rule. Instead, we provide a modular deterrence principle that can wrap around an arbitrary online learner over a mechanism class Π. This is closer in spirit to the platform problem faced in practice: teams often have a preferred learning stack (Hedge over discretized rules, online gradient methods, contextual bandits, etc.), and the design question is how to add enforceable guardrails against strategic distortion without rebuilding the entire system.
The idea that algorithmic stability can function as an incentive device has been developed along several lines. In privacy-aware mechanism design, differential privacy and related max-divergence notions are used to ensure that no single agent’s report substantially changes the distribution of outcomes, which implies approximate truthfulness when utilities are bounded. This insight goes back to early connections between DP and approximate dominant-strategy incentive compatibility in static settings, and it has been refined in many subsequent works (including variants that handle payments, social welfare objectives, and approximate equilibrium concepts).
In the online setting, the relevant issue is not only how a report affects the current outcome, but also how it affects the learner’s future behavior. Here the DP/stability lens becomes particularly attractive: a per-round max-divergence bound on the learner’s mapping from history to a distribution over mechanisms provides an explicit handle on the magnitude of dynamic influence. Our use of weak, per-round stability (rather than full transcript DP) reflects the economics of the problem: what matters for incentives is the agent’s ability to steer the platform’s future choice of mechanism, and bounding that influence round-by-round is enough to control the continuation-value gain from a single deviation. At the same time, we emphasize a limitation that is sometimes glossed over: stability alone is typically insufficient to prevent deviations that are immediately profitable within a round. This is precisely why our model introduces an additional enforcement primitive (auditing) rather than relying on stability as a complete substitute for incentive constraints.
Our incentive analysis is most directly inspired by the framework of Huh and Kandasamy and subsequent work on repeated interactions with long-sighted agents. The key conceptual move in that line is to formalize how far ahead agents effectively plan, via a long-sightedness parameter α, and to connect α to stability properties of the platform’s learning rule. Roughly, if a single report can change the distribution over future actions only by a multiplicative factor eη each round (a max-divergence/η-stability condition), then the total strategic benefit from manipulating the learner across the effective horizon can be upper bounded by a term of order ηα (up to constants and normalization).
We adopt this perspective because it provides a clean separation between two objects that are otherwise hard to disentangle: the platform’s adaptivity (captured by the learner and its stability parameter η) and the agent’s forward-looking capacity (captured by α). Importantly, this approach does not require us to specify a particular parametric belief model for agents or to assume stationarity in types; it is compatible with ex-post incentive statements that hold for realized sequences. Where we depart from the baseline weak-DP sequence story is in how we generate a deterrent to offset the bounded continuation gain. In discrete-type models one can sometimes posit an exogenous commitment gap: any non-truthful report triggers a fixed expected penalty difference. In continuous types, such a uniform gap typically fails without additional structure. Our audit mechanism is precisely the missing ingredient: it endogenizes a positive “effective penalty gap” β(ε) for deviations larger than a tolerance ε, thereby restoring the gap logic in a continuous environment.
Verification has a long history in economics, from costly state verification and auditing in contract theory to mechanism design with evidence, inspection, and costly monitoring. In many applications, the designer cannot observe types directly but can sometimes obtain informative signals after the fact (e.g., through accounting audits, document checks, third-party attestations, or forensic analysis). These models clarify the basic tradeoff: stronger verification improves incentives but consumes resources and can harm honest agents when verification is noisy or imperfect.
A separate but related literature studies how to elicit truthful information when direct verification is unavailable, using peer prediction, scoring rules, and correlated signals. Those approaches are powerful when one can leverage cross-agent correlations or repeated tasks, but they often require strong assumptions about common priors, conditional independence, or the availability of multiple reports about the same underlying event. In contrast, our audit primitive is deliberately minimal: we assume the platform can sometimes draw a noisy signal z = θ + ξ about an agent’s true type (or a proxy for it), and can commit to penalties based on the discrepancy between b and z. This is meant to capture institutional features that are increasingly common in digital markets: platforms can occasionally verify delivery times, locations, identity documents, or transaction provenance, but only imperfectly and not at scale every round.
Our focus on a tolerance level ε connects directly to the verification literature’s concern with false positives. If verification is noisy, then punishing every deviation from z would penalize honest agents with nontrivial probability. The tolerance/threshold approach is a practical mitigation: only “large enough” discrepancies are treated as evidence of misreporting, which reduces false positives and makes enforcement legible. This framing also aligns with policy practice, where compliance regimes often specify materiality thresholds and probabilistic audits rather than demanding exact agreement with a latent ground truth.
On the algorithmic side, we draw on results linking OCO procedures to stability. Many online learners commonly used in platforms—regularized follow-the-leader, mirror descent, Hedge over discretizations, and related methods—admit sensitivity bounds showing that changing one data point (or one round’s loss vector) only slightly perturbs the chosen action distribution. In privacy theory, such sensitivity is often turned into DP guarantees via noise addition or exponential mechanisms; in learning theory it yields generalization; and in online learning it can be interpreted as robustness. We do not re-derive these results, but we rely on them as a justification that our stability parameter η is not an abstract assumption: it can be achieved by standard algorithms, sometimes with mild randomization, and it can be tuned via learning rates and regularization strengths.
The economic reason we care about this connection is that it clarifies the cost of incentive protection. Imposing stronger stability (smaller η) typically slows down adaptation and worsens regret, echoing the general privacy–accuracy and robustness–adaptivity tradeoffs. Our regret decomposition makes this explicit: we pay a learning term (e.g., $(\log|\Pi|)\sqrt{\alpha T}$ under Hedge) and an enforcement term that scales with 1/β(ε) through the required audit rate. This allows us to talk about incentive constraints and algorithmic performance in a single quantitative language, rather than treating truthfulness as a hard constraint divorced from learning.
Relative to existing strategic online learning and repeated auction models, our approach is intentionally “wrapper-like”: we ask for (i) a stability bound on the learner and (ii) an audit mechanism that yields an effective penalty gap above tolerance. This yields a simple implementable condition, λβ(ε) ≥ 4ηα, under which truthful reporting is an ε-Nash equilibrium (ex-post), and it delivers a regret bound that makes the audit–approximation frontier explicit. The point is not that stability and audits are the only route to incentive compatibility; rather, they are a tractable and policy-relevant route when types are continuous, verification is imperfect, and the platform insists on using adaptive learning.
At the same time, the framework abstracts away from several important issues studied elsewhere. We take the audit signal structure as given rather than deriving it from an underlying measurement or investigation process; we summarize agent forward-looking behavior via α rather than modeling full beliefs and strategic sophistication; and we treat penalties in reduced form, ignoring legal and behavioral constraints beyond individual rationality. These choices are deliberate: the goal is a portable economic logic that can be plugged into different application domains. We return to several of these extensions (targeted audits, heterogeneous noise, and IR constraints with false positives) after formalizing the baseline model.
The next section specifies the repeated interaction, the continuous type environment, our benchmark for performance (objective regret against the best fixed mechanism in Π), and the long-sightedness model captured by α.
We model a platform (or planner) that repeatedly selects an economic mechanism while facing agents who understand that today’s reports can affect tomorrow’s policy. The central difficulty is that the data used for learning are endogenous: the platform updates from reported types, but those reports come from forward-looking agents. Our goal in this section is to make the repeated interaction explicit, to define a performance benchmark that remains meaningful under adaptivity, and to formalize how we parameterize agents’ forward-looking behavior through a long-sightedness bound α.
Fix a horizon T ∈ ℕ and a
set of agents indexed by i ∈ [n]. In each round
t ∈ [T], agent i (if present) has a privately
observed type
θi, t ∈ Θ = [0, 1]d,
where d captures the
dimensionality of the information relevant for the mechanism (e.g., a
value and a budget; a location and a time window; a vector of quality
attributes). We impose no stationarity assumption on (θi, t)t ≤ T:
types may drift, jump, or be adversarially selected. This is deliberate.
In many applications, what the platform observes is a nonstationary
environment (seasonality, changing composition of users, strategic
adaptation), and we want statements that are valid for the realized
sequence.
To allow varying participation, let Ti ⊆ [T] denote the set of rounds in which agent i is active. When i is inactive at time t ∉ Ti, we treat the report as a null symbol bi, t = ⊥ and the mechanism ignores it.
In each active round, agent i submits a report bi, t ∈ Θ. Let bt = (bi, t)i ∈ [n] be the report profile at time t. An outcome st is then realized in an outcome space S. We keep S abstract: it may include allocations, matchings, posted prices, payments, service levels, or other operational decisions.
Agent i’s stage utility
depends on her type and the realized outcome:
ui(θi, t, st) ∈ [0, 1].
The [0, 1] normalization is without
loss of generality for our incentive analysis; it pins down the scale at
which stability and enforcement trade off. (Equivalently, one may start
with bounded utilities and rescale.)
The platform’s per-round objective is given by a function
Gt(θt, st) ∈ [−1, 1], θt = (θi, t)i ∈ [n].
We allow Gt to vary with
t to accommodate time-varying
planner goals (e.g., shifting weight between revenue and quality of
service) and time-varying feasibility constraints. Examples include
revenue, social welfare, a congestion or matching-quality score, or a
composite metric used in practice for policy evaluation.
A single-round mechanism is a (possibly randomized) mapping
π : (Θ ∪ {⊥})n → Δ(S),
which takes the report profile and returns a distribution over outcomes.
We focus on a finite (or discretized) mechanism class Π, which should be interpreted as
the platform’s design space: for instance, a grid of reserve prices, a
menu of scoring rules, a finite family of allocation rules, or a
discretized set of parameters for a differentiable mechanism. The
finiteness is what permits standard online learning guarantees; any
approximation cost from discretization will later appear as a separate
term.
An online mechanism (or platform policy) M
is the repeated process that chooses, at each round t, a distribution qt ∈ Δ(Π)
based on the public history and then draws a mechanism πt ∼ qt
to execute. The realized outcome is then sampled as
st ∼ πt(bt).
We assume the platform can commit to its randomized choice rule (i.e.,
the distribution qt and the
randomness used to draw πt are part of
the publicly understood protocol). This commitment assumption reflects
how platform policies are often implemented: the mapping from logged
data to the next policy is encoded in a deployed learning system, and
agents can (at least approximately) infer the platform’s update rule
over time.
Let ht − 1 denote the public history up to time t − 1 (past reports, realized outcomes, and any publicly revealed evaluation signals). The learner is a mapping ht − 1 ↦ qt(⋅ ∣ ht − 1). In the results that follow we will impose a weak, per-round stability requirement on this mapping (a max-divergence/η condition), but for now the point is simply that the platform’s mechanism choice is adaptive: qt depends on earlier reports and earlier realized objectives.
Each agent can condition her report on her current type and the
observed history. Formally, a (behavioral) strategy for agent i is a sequence of mappings
σi, t : Θ × ℋt − 1 → Δ(Θ),
where ℋt − 1 is the
history space. Truthful reporting corresponds to σi, t(θi, t, ht − 1) = δθi, t.
Because types are continuous and the platform policy is adaptive, exact incentive compatibility is typically too strong: small perturbations of a report may have only second-order effects, and any realistic enforcement/verification system will have limited resolution. We therefore work with an approximate, ε-tolerant notion: we will aim to make truthful reporting an ε-Nash equilibrium in the repeated interaction, where deviations within an ε-ball are treated as economically negligible. The role of ε is thus twofold: it is a tolerance level for incentives, and (later) a knob that trades off enforcement intensity against accuracy.
To capture forward-looking behavior while remaining agnostic about
beliefs and sophistication, we summarize how much agents value the
future through discount factors γi(t) ≥ 0.
Agent i’s total discounted
utility under an online mechanism M and a strategy profile σ is
Ui(σ; M) = 𝔼 [∑t ∈ Tiγi(t) ui(θi, t, st)],
with the expectation over the platform’s randomization, the mechanism’s
randomization, and any randomness in strategies. (We will later subtract
expected audit penalties; for now, the baseline intertemporal utility
highlights the channel through which an agent might benefit from
manipulating the learning dynamics.)
The key quantity for our incentive bounds is not a particular
functional form of discounting, but an upper bound on the weight of the future relative to the present from any
period. Following the weak-DP sequence perspective, we define the
long-sightedness of agent i at
time t as
$$
\alpha_{i}(t)\;:=\;\frac{1}{\gamma_i(t)}\sum_{\substack{t'\in T_i\\
t'>t}}\gamma_i(t'),
$$
and we assume a uniform bound
α ≥ maxi ∈ [n]maxt ∈ Tiαi(t).
Intuitively, α measures how
many “current-round utility units” the agent can still gain or lose in
the future relative to the present. Because per-round utilities lie in
[0, 1], the continuation value from
time t onward is at most 1 + α in our normalization, and the
continuation value beyond the current round is at most α.
This abstraction nests familiar special cases. Under geometric discounting with γi(t) = δt − 1 and full participation, we have α = δ/(1 − δ), so α blows up as δ → 1 (very patient agents). Under a hard planning window of length K (agents care only about the next K active rounds), one obtains α ≤ K. In practice, α can be interpreted as a reduced-form proxy for institutional or behavioral frictions that limit long-run manipulation: limited attention, bounded tenure on the platform, or uncertainty about future participation.
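To make the α abstraction concrete, the following sketch (illustrative only; it assumes full participation and a fixed discount-weight sequence) computes the long-sightedness bound from discount weights and checks the geometric-discounting case against δ/(1 − δ).

```python
# Illustrative sketch (not part of the formal model): computing the
# long-sightedness bound alpha from a discount-weight sequence, assuming
# full participation (T_i = [T]).
import numpy as np

def long_sightedness_bound(gamma):
    """max_t alpha_i(t), where alpha_i(t) = (1/gamma[t]) * sum_{t' > t} gamma[t']."""
    gamma = np.asarray(gamma, dtype=float)
    suffix = np.cumsum(gamma[::-1])[::-1]      # suffix[t] = sum_{t' >= t} gamma[t']
    return np.max((suffix - gamma) / gamma)

T, delta = 2000, 0.95
alpha_geom = long_sightedness_bound(delta ** np.arange(T))
print(f"geometric discounting: alpha = {alpha_geom:.3f}, delta/(1-delta) = {delta/(1-delta):.3f}")
```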
The platform cares about cumulative objective value
$$
\mathrm{Alg}(M)\;:=\;\mathbb E\!\left[\sum_{t=1}^T
G_t(\theta_t,s_t)\right].
$$
To evaluate an adaptive policy, we compare it to a static benchmark: the
best fixed mechanism in Π
chosen in hindsight for the realized sequence. Concretely, if we were to
run a fixed π ∈ Π in
every round (with agents reporting truthfully), the resulting value
would be
$$
\mathrm{Alg}(\pi)\;:=\;\mathbb E\!\left[\sum_{t=1}^T
G_t(\theta_t,s_t)\,\middle|\, s_t\sim \pi(\theta_t)\right].
$$
We then define the (truthful-play) regret as
RegT(M) := maxπ ∈ ΠAlg(π) − Alg(M).
This benchmark is intentionally conservative: it does not ask the
platform to compete with the best dynamic policy or the best mechanism tailored
to each t, which would
conflate learning with clairvoyance. Instead, it isolates the cost of
learning within the chosen design space Π.
Two remarks are important for interpretation. First, the definition conditions on truthful reports for the benchmark mechanisms, because the planner’s goal is to perform well when the system is not strategically distorted. In other words, we aim to design M so that truthful reporting is (approximately) an equilibrium, and then we evaluate the resulting objective relative to the best fixed truthful mechanism in Π. Second, the discretization embedded in Π is a modeling choice: it reflects engineering constraints (a finite catalog of policies that can be deployed, tested, and audited) as much as mathematical convenience. Any loss from restricting to Π—relative to an idealized continuous family—will later be accounted for separately as an approximation/discretization term.
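As a small illustration of the benchmark (a sketch under assumptions: the matrix G of realized per-round objective values for each candidate mechanism under truthful play is taken as given, and the platform’s mechanism draws are synthetic stand-ins), truthful-play regret against the best fixed mechanism in Π can be computed as follows.

```python
# Illustrative computation of Reg_T(M) = max_pi Alg(pi) - Alg(M) from logged
# per-round objective values; all data here are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
T, num_mechanisms = 1000, 8
G = rng.uniform(-1.0, 1.0, size=(T, num_mechanisms))     # G[t, j]: objective of mechanism j in round t
platform_choice = rng.integers(num_mechanisms, size=T)   # stand-in for the learner's draws pi_t
alg_value = G[np.arange(T), platform_choice].sum()
best_fixed = G.sum(axis=0).max()
print(f"Alg(M) = {alg_value:.1f}, best fixed = {best_fixed:.1f}, Reg_T = {best_fixed - alg_value:.1f}")
```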
At this stage we have specified the repeated game and the performance yardstick, but we have not yet specified how the platform prevents within-round profitable misreports, especially with continuous types. Stability of the learner will eventually limit how much an agent can profit by steering mechanism choices, but it does not by itself rule out immediate gains from shading a report in the current round. Addressing that missing per-round deterrence is exactly the role of our audit primitive, to which we turn next.
Our learning-based policy is only as good as the data it learns from. When agents are forward-looking, a report is not merely a description of current preferences; it is also an instrument for shaping future mechanisms. Per-round learner stability (which we study next) limits how much an agent can profit by steering the distribution over mechanisms, but it does not by itself rule out profitable shading. With continuous types, this within-round issue is particularly stark: without some additional enforcement, an agent can typically move her report by an arbitrarily small amount in a direction that (locally) improves her stage payoff, and we cannot hope to police every infinitesimal perturbation.
We therefore introduce an audit primitive: an occasional, publicly committed verification-and-penalty procedure that makes sufficiently large misreports costly in expectation. The key modeling choice is that verification is noisy, so the audit cannot perfectly distinguish a small lie from measurement error. This naturally leads to an ε-tolerance notion: we aim to deter deviations larger than ε, while treating smaller deviations as economically negligible. The role of this section is to formalize audits and to define a single summary statistic, β(ε), that measures how strongly audits deter ε-scale deviations. This scalar will later plug directly into the incentive condition λ β(ε) ≥ 4ηα.
Fix an agent i and a round
t in which she participates.
If an audit occurs, the platform observes a verification signal
zi, t = θi, t + ξi, t,
where ξi, t
is exogenous noise (measurement error, privacy noise, or an imperfect
proxy for the latent type). We take the audit signal model as a
primitive: the distribution of ξi, t
is known to all parties and independent of reports. This captures
settings where types can be partially verified through ex post records
(delivery times, realized clicks, observed quality), through third-party
checks, or through internal instrumentation that is accurate only up to
noise.
The audit mechanism then applies a penalty based on the reported type
and the audit signal:
peni(bi, t, zi, t) ∈ [0, P],
where P is the maximal
punishment expressed in normalized utility units. Importantly, we do not
require the platform to learn from the audit itself; the audit’s purpose
is deterrence. In the repeated protocol, the platform commits publicly
to mixing in the audit mechanism with probability λ each round. This commitment is
operationally natural (e.g., a known auditing schedule or a fixed
inspection lottery) and analytically convenient: it turns a complex
dynamic incentive problem into a comparison between (i) the maximum gain
from manipulating learning and (ii) the expected audit loss from
deviating.
Because Θ = [0, 1]d is continuous, it is neither feasible nor desirable to enforce exact truth-telling at infinite precision. Even with perfect verification, mechanisms are often robust only up to a resolution limit, and any real audit will have finite measurement granularity. We therefore fix a tolerance parameter ε > 0 and interpret it as the smallest economically meaningful deviation. The enforcement goal becomes: make any misreport of magnitude at least ε unprofitable in expectation.
This is an ε-tolerant incentive requirement, aligned with our ε-Nash notion. It also reflects a policy choice: in many applications, we care about preventing material misrepresentation rather than micro-manipulation.
We summarize the deterrence power of an audit design by the smallest incremental expected penalty it induces against ε-scale misreports.
Fix a penalty rule pen(b, z) and an audit
signal model z = θ + ξ. For
ε > 0, define
β(ε) := infθ ∈ Θ infb ∈ Θ: ∥b − θ∥ ≥ ε(𝔼[pen(b, z) ∣ θ] − 𝔼[pen(θ, z) ∣ θ]).
By construction, β(ε) ≥ 0. When β(ε) > 0, any report that is at least ε away from truth incurs an extra expected audit loss of at least β(ε) relative to truthful reporting. In other words, β(ε) is the audit’s marginal deterrence at scale ε.
A convenient way to think about β(ε) is through a
decomposition into (i) a detection advantage and (ii) a penalty
magnitude. Many audits have a binary “fail test” event A(b, z) ∈ {0, 1}
and a penalty of the form pen(b, z) = P ⋅ A(b, z).
Then
β(ε) = P ⋅ ρ(ε), ρ(ε) := infθinf∥b − θ∥ ≥ ε(Pr (A(b, z) = 1 ∣ θ) − Pr (A(θ, z) = 1 ∣ θ)),
where ρ(ε) is the
minimal increase in the probability of failing the audit when deviating
by at least ε. This separation
is useful in practice: ρ(ε) is governed by
measurement quality (noise), while P is governed by institutional
constraints (legal limits, fairness, and individual rationality).
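In design calibration, ρ(ε) (and hence β(ε) = P·ρ(ε)) can be estimated by brute force. The sketch below is an illustrative example, not the paper’s prescribed procedure: it uses a one-dimensional threshold test with Laplace verification noise and a coarse grid over true types and deviating reports; the same approach extends to other noise models and to d > 1.

```python
# Illustrative Monte Carlo estimate of rho(eps) and beta(eps) = P * rho(eps)
# for the binary threshold test A(b, z) = 1{|b - z| > tau}, z = theta + xi.
import numpy as np

rng = np.random.default_rng(1)

def fail_prob(b, theta, tau, sigma, n_mc=10_000):
    """Estimate Pr(A(b, z) = 1 | theta) with xi ~ Laplace(0, sigma)."""
    xi = rng.laplace(0.0, sigma, size=n_mc)
    return np.mean(np.abs(b - (theta + xi)) > tau)

def rho(eps, tau, sigma, grid=np.linspace(0.0, 1.0, 11)):
    worst = np.inf
    for theta in grid:
        p_truth = fail_prob(theta, theta, tau, sigma)
        for b in grid:
            if abs(b - theta) >= eps:
                worst = min(worst, fail_prob(b, theta, tau, sigma) - p_truth)
    return worst

P, tau, sigma, eps = 1.0, 0.05, 0.05, 0.10
print("estimated beta(eps) ~", P * rho(eps, tau, sigma))
```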
A particularly transparent design is a threshold test that penalizes
a report when it is “too far” from the noisy verification:
pen(b, z) = P ⋅ 1{∥b − z∥ > τ},
with threshold τ ≥ 0. This
audit does not attempt to infer θ
from z; it only checks whether
the report is plausible given the measurement noise. The threshold τ controls false positives: larger
τ means truthful agents are
penalized less often, but it also makes it harder to detect small
deviations.
Even in one dimension (d = 1), this simple form already captures a realistic compliance logic (“reports must be consistent with observed records up to tolerance τ”). It also makes the effect of noise transparent, and it yields closed forms for common noise distributions.
Suppose d = 1, ξ ∼ Laplace(0, σ), and
pen(b, z) = P ⋅ 1{|b − z| > τ}.
Let Δ := |b − θ| ≥ 0.
Then
βτ(Δ) = P(Pr (|Δ + ξ| > τ) − Pr (|ξ| > τ)), Pr (|ξ| > τ) = e−τ/σ.
A closed form for Pr (|Δ + ξ| > τ)
is piecewise:
$$
\Pr(|\Delta+\xi|>\tau)=
\begin{cases}
\frac{1}{2}e^{-(\tau-\Delta)/\sigma}+\frac{1}{2}e^{-(\tau+\Delta)/\sigma}
= e^{-\tau/\sigma}\cosh(\Delta/\sigma),
& 0\le \Delta\le \tau,\\[6pt]
1-\frac{1}{2}e^{-(\Delta-\tau)/\sigma}\big(1-e^{-2\tau/\sigma}\big),
& \Delta\ge \tau.
\end{cases}
$$
Two implications are worth highlighting. First, for Δ ≤ τ, the audit’s
detection advantage is second-order at the origin (since cosh (Δ/σ) = 1 + O(Δ2)):
tiny deviations are essentially indistinguishable from noise, which is
precisely why we adopt an ε-tolerance notion. Second, once
Δ > τ, the gap
becomes uniformly positive and increases with Δ, so
β(ε) ≥ βτ(ε) > 0 whenever
ε > τ.
Thus, under Laplace verification, we can choose a threshold τ that protects truthful agents from
frequent false positives, and still obtain a strictly positive
enforcement gap against deviations larger than ε.
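The closed form is easy to sanity-check; the snippet below (illustrative parameter values) compares it to Monte Carlo draws and evaluates βτ(Δ) at several deviation sizes.

```python
# Check of the Laplace closed form for Pr(|Delta + xi| > tau) and the induced
# gap beta_tau(Delta) = P * (Pr(|Delta + xi| > tau) - Pr(|xi| > tau)).
import numpy as np

def fail_prob_laplace(delta, tau, sigma):
    """Closed form for Pr(|Delta + xi| > tau), xi ~ Laplace(0, sigma), Delta >= 0."""
    if delta <= tau:
        return np.exp(-tau / sigma) * np.cosh(delta / sigma)
    return 1.0 - 0.5 * np.exp(-(delta - tau) / sigma) * (1.0 - np.exp(-2.0 * tau / sigma))

rng = np.random.default_rng(2)
P, tau, sigma = 1.0, 0.05, 0.05
for delta in (0.0, 0.03, 0.05, 0.10, 0.20):
    xi = rng.laplace(0.0, sigma, size=200_000)
    mc = np.mean(np.abs(delta + xi) > tau)
    beta = P * (fail_prob_laplace(delta, tau, sigma) - np.exp(-tau / sigma))
    print(f"Delta={delta:.2f}: closed form={fail_prob_laplace(delta, tau, sigma):.4f}, "
          f"Monte Carlo={mc:.4f}, beta_tau(Delta)={beta:.4f}")
```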
If instead ξ ∼ 𝒩(0, σ2) and
we keep the same threshold audit, then
$$
\Pr(|\xi|>\tau)=2\Phi\!\left(-\frac{\tau}{\sigma}\right),
$$
and for Δ ≥ 0,
$$
\Pr(|\Delta+\xi|>\tau)
=
\Pr(\xi>\tau-\Delta)+\Pr(\xi<-\tau-\Delta)
=
\Phi\!\left(-\frac{\tau-\Delta}{\sigma}\right)+\Phi\!\left(-\frac{\tau+\Delta}{\sigma}\right),
$$
where Φ is the standard normal
CDF. Hence
$$
\beta_\tau(\Delta)
=
P\Bigg(
\Phi\!\left(-\frac{\tau-\Delta}{\sigma}\right)+\Phi\!\left(-\frac{\tau+\Delta}{\sigma}\right)
-2\Phi\!\left(-\frac{\tau}{\sigma}\right)
\Bigg),
\qquad
\beta(\varepsilon)\ge \inf_{\Delta\ge \varepsilon}\beta_\tau(\Delta).
$$
In contrast to Laplace, there is generally no simplification beyond
Φ(⋅), but the function is
one-dimensional and easy to evaluate numerically. This is exactly the
kind of object we want: the audit enters our later incentive condition
only through β(ε),
which can be computed (or lower bounded) as part of design
calibration.
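A minimal numeric evaluation (illustrative parameters; the normal CDF is implemented via the error function) computes βτ(Δ) under Gaussian noise and lower bounds β(ε) by minimizing over Δ ≥ ε; since βτ is increasing in Δ, the infimum is attained at Δ = ε.

```python
# Gaussian-noise threshold audit: evaluate beta_tau(Delta) and a lower bound
# on beta(eps) numerically. Parameter values are illustrative.
import math
import numpy as np

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def beta_tau(delta, tau, sigma, P=1.0):
    gap = (normal_cdf(-(tau - delta) / sigma) + normal_cdf(-(tau + delta) / sigma)
           - 2.0 * normal_cdf(-tau / sigma))
    return P * gap

tau, sigma, eps = 0.05, 0.05, 0.10
beta_eps = min(beta_tau(d, tau, sigma) for d in np.linspace(eps, 1.0, 500))
print(f"beta(eps) >= {beta_eps:.4f} for eps = {eps} (minimum attained at Delta = eps)")
```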
In some applications, verification error is bounded: for example, a recorded timestamp may be accurate up to a known rounding unit, or a quality score may have bounded measurement error. If ξ is uniform on [−a, a] and d = 1, then Pr (|ξ| > τ) = 0 for τ ≥ a, and the threshold audit yields a gap that is piecewise linear in Δ. In the extreme case τ ≥ a, truthful agents are never falsely penalized, and any deviation larger than τ + a is detected with probability one. This illustrates a general design lesson: when verification is “sharp,” one can set τ to essentially eliminate false positives while preserving strong deterrence; when verification is “noisy-tailed,” one must trade off false positives and enforcement through (τ, ε, P).
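A quick simulation confirms the bounded-noise logic (numbers are illustrative): with τ ≥ a there are no false positives, deviations above τ + a are always detected, and deviations between τ − a and τ + a are detected with interior probability.

```python
# Bounded (uniform) verification noise: false positives and detection regions.
import numpy as np

rng = np.random.default_rng(3)
a, tau = 0.02, 0.03                                  # tau >= a: no false positives
xi = rng.uniform(-a, a, size=100_000)
print(np.mean(np.abs(xi) > tau))                     # 0.0: truthful false-positive rate
print(np.mean(np.abs(0.06 + xi) > tau))              # 1.0: Delta = 0.06 > tau + a, always detected
print(np.mean(np.abs(0.04 + xi) > tau))              # strictly between 0 and 1
```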
When θ ∈ [0, 1]d, the same primitives apply, but closed forms depend on the geometry of the test and the noise. Two common implementations are a norm-ball test (penalize when ∥b − z∥ exceeds a threshold τ) and coordinate-wise tests (penalize when any coordinate of b − z exceeds a per-coordinate threshold). The main conceptual point is unchanged: we do not need to “solve” the audit statistically; we only need a uniform lower bound on the incremental expected penalty for ε-scale deviations.
Noisy audits inevitably punish some truthful agents. Under the
threshold audit, the truthful false-positive rate is
pfp(τ) := Pr (∥ξ∥ > τ),
so a truthful agent faces expected audit disutility λ P pfp(τ)
per round (up to discounting and participation). This creates an
individual-rationality (IR) constraint: if participation is voluntary,
λPpfp(τ)
cannot exceed the expected surplus that the mechanism delivers to
truthful agents. Put differently, we cannot set P arbitrarily large or τ arbitrarily small without risking
exclusion of exactly the agents we want to elicit information from.
This is where the ε-tolerance viewpoint becomes practically valuable. If we only require deterrence for deviations exceeding ε, we can choose τ to keep pfp(τ) small (protecting truthful agents), and choose ε modestly larger than τ so that β(ε) remains strictly positive. The resulting design is not “perfect truth-telling,” but it is often the right economic compromise: it targets the manipulation that meaningfully distorts learning and policy, while tolerating noise-level discrepancies.
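One simple calibration route under Laplace noise (a sketch with assumed numbers, not a recommendation) is to choose τ from a target false-positive rate via pfp(τ) = e^{−τ/σ} and then check that a truthful agent’s expected audit loss λ P pfp(τ) stays below an assumed per-round surplus.

```python
# Illustrative IR-style calibration: pick tau from a target false-positive rate
# under Laplace noise, then compare the truthful agent's expected audit loss to
# an assumed per-round participation surplus.
import numpy as np

sigma, target_fp = 0.05, 0.05
tau = sigma * np.log(1.0 / target_fp)            # p_fp(tau) = exp(-tau/sigma) = target_fp
lam, P, truthful_surplus = 0.05, 1.0, 0.10       # audit rate, max penalty, assumed surplus
expected_audit_loss = lam * P * target_fp
print(f"tau = {tau:.3f}, truthful expected audit loss = {expected_audit_loss:.5f}, "
      f"IR slack = {truthful_surplus - expected_audit_loss:.5f}")
```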
The reason we labor to define β(ε) is that it is the exact quantity that trades off against the benefit of strategic manipulation in the dynamic game. When audits occur with probability λ, any deviation larger than ε suffers an expected one-shot loss of at least λ β(ε). In the next section, we formalize how learner stability upper bounds the continuation-value gain from a one-shot misreport by a term on the order of ηα. The enforcement logic is then immediate: if λ β(ε) exceeds that maximal dynamic gain (up to constants), large deviations are unprofitable, and truthful reporting becomes an ε-Nash equilibrium of the repeated interaction.
The broader message is that audits and stability are complements. Stability alone limits intertemporal manipulation but leaves within-round incentives unresolved; audits alone can deter misreports but would be too costly if used constantly. The mechanism we study mixes the two, and β(ε) is the bridge that lets us quantify the resulting incentive–performance tradeoff.
The dynamic incentive problem in online mechanism selection is not only that an agent may want to shade her report to affect the outcome, but also that she may want to perturb the learner so as to tilt the distribution over mechanisms in her favor. Our approach to controlling this second channel is deliberately algorithmic: we ask the learner to be stable, in the sense that any single report has only a limited effect on the distribution over mechanisms selected in later rounds. Intuitively, a stable learning rule makes “steering the algorithm” a low-leverage instrument.
There are several notions of stability in learning theory (uniform stability, on-average stability, Lipschitzness of iterates). We use a differential-privacy-style notion because it gives exactly the inequality we later need in a repeated-game argument: a bound on how much the distribution over actions can change when one entry of the input history changes. This bound is strong enough to control an agent’s worst-case continuation-value gain from a single misreport, yet weak enough to be satisfied by standard no-regret learners with mild regularization.
Let ht − 1 denote
the public history available to the learner before round t (e.g., past report profiles and
any past objective feedback the platform uses). The learner is a
mapping
ht − 1 ↦ qt(⋅ ∣ ht − 1) ∈ Δ(Π),
where qt
is the distribution over mechanisms for round t (we suppress dependence on t when notationally convenient).
We say that two histories ht − 1 and ht − 1′ are adjacent if they differ only in a single report entry (one agent in one round), with everything else held fixed. This is the natural adjacency relation for strategic manipulation: it isolates the marginal effect of one agent’s one-shot deviation on the learner’s subsequent behavior.
The stability condition we impose is the following weak form of η-differential privacy (equivalently, an η max-divergence bound) applied to the learner’s output distribution.
We say the learner is weakly η-stable if for every round t, every pair of adjacent histories
ht − 1 ∼ ht − 1′,
and every event A ⊆ Π,
Pr (πt ∈ A ∣ ht − 1) ≤ eη Pr (πt ∈ A ∣ ht − 1′).
Equivalently, for every π ∈ Π,
qt(π ∣ ht − 1) ≤ eη qt(π ∣ ht − 1′).
Two remarks help interpret this condition.
First, it is a one-sided likelihood-ratio bound: it upper bounds how much probability mass can be increased by changing one report. Since adjacency is symmetric (we can swap h and h′), it also implies the reverse inequality with h, h′ swapped, hence the likelihood ratio is bounded in both directions. In particular, no single report can make a particular mechanism dramatically more (or less) likely in the next step.
Second, it is precisely what we need for incentives: if an agent’s continuation value is some bounded function of the future mechanism draw, then the inequality above bounds how much that continuation value can change under a one-shot deviation. This is exactly what we will exploit in the backward-induction argument in the next section.
The multiplicative form is convenient because it composes cleanly
over time and because it is agnostic to the agent’s utility function
beyond boundedness. Concretely, if f : Π → [0, 1] is any
bounded function (think: expected continuation value from each
mechanism), then max-divergence implies a uniform expectation bound such
as
𝔼qt(⋅ ∣ h)[f(π)] ≤ eη 𝔼qt(⋅ ∣ h′)[f(π)],
and hence an additive bound of order (eη − 1) ≈ η
when η is small. This is the
sense in which η is the
“per-round manipulability budget” of the learner: smaller η means less scope for any report to
influence the distribution over future mechanisms.
From a practical standpoint, this stability is also a desirable governance property. Mechanisms chosen by a stable rule do not swing sharply in response to a single data point, which aligns with how platforms often justify policy updates (gradualism, robustness to anomalous reports, and reduced volatility).
We now verify that standard exponential-weights learning yields exactly the required stability, with η essentially equal to the learning rate. We present the argument in the full-information setting (the one underlying our regret bound later), where the learner can form a bounded “loss” (or negative reward) for each candidate mechanism each round.
Let ℓτ(π) ∈ [0, 1]
denote the loss assigned to mechanism π on round τ based on the observed feedback at
that round. Hedge maintains weights
$$
w_t(\pi)\;=\;\exp\!\Big(-\eta_H \sum_{\tau=1}^{t-1} \ell_\tau(\pi)\Big),
\qquad
q_t(\pi)\;=\;\frac{w_t(\pi)}{\sum_{\pi'\in\Pi} w_t(\pi')},
$$
where ηH > 0 is the
learning rate.
Consider two adjacent histories ht − 1 ∼ ht − 1′
that differ only in a single report entry at some round τ⋆ < t. This
report change may alter the learner’s computed loss vector for that
round, so let ℓτ⋆
and ℓτ⋆′
be the corresponding losses. Suppose we have the natural
bounded-sensitivity condition
supπ ∈ Π |ℓτ⋆(π) − ℓτ⋆′(π)| ≤ 1,
i.e., changing one report can change the assessed loss of any fixed
mechanism by at most one unit. (With gains in [−1, 1], the analogous bound is 2; all conclusions below scale
accordingly.)
Then for every π,
$$
\frac{w_t(\pi\mid h)}{w_t(\pi\mid h')}
=
\exp\!\Big(-\eta_H(\ell_{\tau^\star}(\pi)-\ell'_{\tau^\star}(\pi))\Big)
\;\le\; e^{\eta_H}.
$$
A standard normalization argument then implies a bound of the same form for the normalized probabilities:
$$
\frac{q_t(\pi\mid h)}{q_t(\pi\mid h')}
\;\le\; e^{\eta_H},
$$
and hence for any event A ⊆ Π,
Pr (πt ∈ A ∣ h) ≤ eηHPr (πt ∈ A ∣ h′).
Thus Hedge is weakly η-stable
with η = ηH
(up to the sensitivity scaling). The key economic interpretation is
immediate: A smaller learning rate makes the platform harder to
manipulate intertemporally, because it tempers the extent to which any
single report can tilt future mechanism probabilities.
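The likelihood-ratio property is easy to probe empirically. The sketch below (illustrative: adjacent histories are modeled as two loss matrices that differ in one round by at most one unit per mechanism) compares the resulting Hedge distributions against the conservative two-sided bound e^{2ηH} obtained by applying the weight-ratio bound to both the numerator and the normalizing sum.

```python
# Empirical check of weak stability for Hedge under a one-round loss perturbation.
import numpy as np

rng = np.random.default_rng(4)

def hedge_distribution(losses, eta):
    """q_t(pi) proportional to exp(-eta * cumulative loss of pi)."""
    scores = -eta * losses.sum(axis=0)
    w = np.exp(scores - scores.max())            # numerically stabilized softmax
    return w / w.sum()

T_hist, K, eta = 200, 10, 0.05
losses = rng.uniform(0.0, 1.0, size=(T_hist, K))
losses_adj = losses.copy()
losses_adj[37] = np.clip(losses_adj[37] + rng.uniform(-1.0, 1.0, size=K), 0.0, 1.0)

q, q_adj = hedge_distribution(losses, eta), hedge_distribution(losses_adj, eta)
max_ratio = np.max(np.maximum(q / q_adj, q_adj / q))
print(f"max likelihood ratio = {max_ratio:.4f} <= exp(2*eta) = {np.exp(2 * eta):.4f}")
```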
The same stability arises if we implement the learner via entropic
follow-the-regularized-leader (FTRL). Let Lt − 1(π) = ∑τ < tℓτ(π)
be cumulative loss. Entropic FTRL chooses a distribution
$$
q_t \;\in\; \arg\min_{q\in\Delta(\Pi)}\Big\{\langle q, L_{t-1}\rangle +
\frac{1}{\eta_F}\mathrm{KL}(q\|u)\Big\},
$$
where u is a reference
distribution (often uniform). The solution is again a softmax:
qt(π) ∝ u(π)exp ( − ηFLt − 1(π)),
so the same likelihood-ratio bound holds with η = ηF
under the same bounded-sensitivity condition. This representation
clarifies what stability is doing: rather than “chasing” the single
empirically best mechanism, the learner plays a best response to past
data, and the smoothing parameter governs manipulability.
This also highlights a limitation. If we used an unsmoothed rule such as follow-the-leader (pure argmin of Lt − 1), the mapping from histories to πt can be discontinuous: a single report can flip the identity of the empirical minimizer, producing large changes in future play and therefore large dynamic incentives to misreport. In our setting, such instability is not merely a technical nuisance; it is precisely the vulnerability strategic agents exploit.
The stability discussion above is cleanest when Π is finite, since the learner outputs a distribution over finitely many mechanisms and the max-divergence inequality is literal. Many economically relevant mechanism families, however, are parameterized by continuous variables: reserve prices, posted-price vectors, scoring weights, or allocation thresholds. In those cases, we typically work with a discretized subclass Π obtained by gridding the parameter space.
This discretization serves two purposes. Statistically/computationally, it makes online learning feasible with standard finite-action algorithms and yields regret terms depending on log |Π|. Economically, it aligns with the fact that most deployed mechanisms are implemented at finite resolution (prices in cents, reserves in dollars, and policy parameters rounded to operationally meaningful units).
The cost is an approximation term. Let 𝒰 denote an underlying continuous benchmark class and Π ⊆ 𝒰 a finite δ-net under a metric that controls the planner’s objective. Then the best fixed mechanism in Π may be worse than the best in 𝒰 by a discretization error disc(Π), which we carry explicitly in our regret decomposition later. When the objective is Lipschitz in parameters (as is common in pricing and auction applications under mild regularity), a grid with mesh δ typically yields disc(Π) = O(δ), while |Π| grows like δ−k for k effective dimensions, so log |Π| = O(klog (1/δ)). This is the familiar complexity–approximation tradeoff: finer grids improve approximation but increase learning burden.
A further subtlety is that stability can degrade if we refine Π without adjusting learning rates. Since our stability parameter is essentially the softmax temperature, and our regret-optimal learning rate depends on T and log |Π|, there is an endogenous link between discretization choice and incentive robustness. Put plainly: making the mechanism class very rich increases the platform’s capacity to fit data, but it can also increase the platform’s responsiveness to any single report unless we correspondingly regularize (lower ηH).
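To visualize the tradeoff, the sketch below (all constants are assumptions chosen purely for illustration) tabulates, for several mesh sizes δ, the learning term (log|Π|)√(αT) cited earlier and a Lipschitz-style discretization term of order δT.

```python
# Illustrative complexity-approximation tradeoff for a gridded mechanism class:
# |Pi| ~ delta^{-k}, learning term ~ log|Pi| * sqrt(alpha * T), discretization ~ C * delta * T.
import numpy as np

T, alpha, k, C_disc = 100_000, 20.0, 2, 1.0
for delta in (0.2, 0.1, 0.05, 0.02, 0.01):
    num_mechanisms = (1.0 / delta) ** k
    learning = np.log(num_mechanisms) * np.sqrt(alpha * T)
    discretization = C_disc * delta * T
    print(f"delta={delta:.3f}: |Pi|={num_mechanisms:.0f}, learning~{learning:.0f}, "
          f"discretization~{discretization:.0f}, total~{learning + discretization:.0f}")
```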
We will treat weak per-round η-stability as the learner-side primitive: it caps the dynamic benefit from manipulating future mechanism selection. Exponential-weights (Hedge) and entropic FTRL satisfy it transparently, with η controlled by a learning-rate/regularization parameter under bounded feedback sensitivity. In the next section, we combine this stability control with the audit-based one-shot deterrence to establish ε-Nash incentive compatibility via a backward-induction argument.
We now state and prove the central incentive result: a weakly-stable learner can be made robust to dynamic manipulation by mixing in an audit mechanism that creates an effective penalty gap for “large” misreports. Economically, the proof formalizes a simple comparison: a deviation can only help an agent by (i) slightly steering the learner’s future distribution over mechanisms, while it hurts her by (ii) increasing her expected audit exposure. Stability bounds (i); the audit gap enforces (ii). When (ii) dominates (i) over the agent’s effective planning horizon, truth-telling becomes an equilibrium of the repeated interaction.
We encode forward-looking behavior using the standard “effective
horizon” bound. For each agent i, let γi(t) ≥ 0
be the discount factor weight on period t, and assume there is a finite
α such that for every round
t,
$$
\sum_{\substack{t'\in T_i\\ t'>t}}\gamma_i(t')\;\le\;\alpha\,\gamma_i(t).
$$
Thus, conditional on being at round t, the total discounted weight the
agent places on future rounds is at most α
in units of the current-period weight. This is the precise sense in
which α upper bounds how much
future steering is worth.
Fix a norm ∥⋅∥ on Θ and a tolerance parameter ε > 0. The audit rule is assumed
to satisfy an effective penalty gap condition: for every agent i, every true type θ ∈ Θ, and every report
b with ∥b − θ∥ ≥ ε,
$$
\mathbb E[\mathrm{pen}(b,z)\mid \theta]\;-\;\mathbb E[\mathrm{pen}(\theta,z)\mid \theta]\;\ge\;\beta(\varepsilon).
$$
We emphasize what this is (and is not) buying us. It does not require exact
verification; it only requires that sufficiently large misreports become
detectably more likely to trigger penalties. The tolerance ε is a modeling choice reflecting
practical limits: we do not attempt to police “microscopic” deviations
in continuous type spaces, but we will ensure that deviations that move
the report by at least ε are
unprofitable.
Let Mλ denote the repeated procedure that, each round t, flips an independent coin: with probability 1 − λ it executes a mechanism drawn from the learner’s distribution qt(⋅ ∣ ht − 1), and with probability λ it executes the audit mechanism πaud (which includes drawing verification signals and applying penalties). The coin flip is public and committed ex ante.
The proof uses a standard conversion from likelihood-ratio stability
to value stability. Suppose two distributions q, q′ on Π satisfy the pointwise bound q(π) ≤ eηq′(π)
for all π. Then for any
function f : Π → [0, B],
$$
\mathbb E_{q}[f(\pi)]\;\le\;e^{\eta}\,\mathbb E_{q'}[f(\pi)]
\;\le\;\mathbb E_{q'}[f(\pi)]+(e^{\eta}-1)B.
$$
Swapping q and q′ yields the reverse
inequality, so
$$
\big|\mathbb E_{q}[f(\pi)]-\mathbb E_{q'}[f(\pi)]\big|\;\le\;(e^{\eta}-1)\,B.
$$
When η ≤ 1, we can use eη − 1 ≤ 2η
to obtain the convenient bound
$$
\big|\mathbb E_{q}[f(\pi)]-\mathbb E_{q'}[f(\pi)]\big|\;\le\;2\eta B.
$$
This is where the constant 4 in the
main condition comes from: we pay (i) a factor 2 to turn a one-sided likelihood ratio into a
two-sided absolute difference, and (ii) a factor 2 to linearize eη − 1 for small
η.
We can now state the incentive guarantee in the form we will use later.
Assume: (a) stage utilities are normalized so that ui(θ, s) ∈ [0, 1];
(b) the learner is weakly η-stable in the per-round
max-divergence sense (as defined in the previous section) with respect
to changing a single report entry in the history; (c) the audit
mechanism satisfies the effective penalty gap for deviations of size at
least ε; and (d) each agent
i satisfies the
effective-horizon bound above with parameter α. If the audit mixing probability λ satisfies
$$
\lambda\,\beta(\varepsilon)\;\ge\;4\,\eta\,\alpha,
$$
then truthful reporting is an ε-Nash equilibrium of the repeated
interaction induced by Mλ (ex-post,
i.e., for every realized sequence of types and all realized public
randomness).
Fix an agent i and imagine she deviates by submitting a report bi, t far from θi, t at some round t. Because of η-stability, this single change can only tilt the platform’s future distribution over mechanisms by a bounded amount each round, which translates into a bounded increase in the agent’s expected continuation value. The total discounted value of those future tilts is at most on the order of ηα. Meanwhile, the audit lottery imposes an expected cost of at least λβ(ε) whenever the deviation is of magnitude at least ε. The condition λβ(ε) ≥ 4ηα makes the expected audit loss dominate the maximal dynamic gain, so such a deviation is not profitable. A backward-induction (one-shot deviation) argument then extends this to arbitrary deviation strategies.
Fix an arbitrary realized type sequence (θi, t)t ≤ T for all agents, and fix any public randomness outside the learner’s own randomization and the audit noise. Fix an agent i and hold all other agents to truthful play.
For each round t, define
agent i’s from t + 1 onward (measured in units of
period-t weight) under
truthful play as
$$
W_{t+1}(h_t)
\;:=\;
\mathbb E\Big[\sum_{\tau=t+1}^T
\frac{\gamma_i(\tau)}{\gamma_i(t)}\Big(u_i(\theta_{i,\tau},s_\tau)-\mathbf
1\{\text{audit at }\tau\}\cdot
\mathrm{pen}_i(\cdot)\Big)\,\Big|\,h_t\Big].
$$
By boundedness of ui and
nonnegativity of penalties, we have the crude upper bound Wt + 1(ht) ≤ α,
using the effective-horizon bound.
Consider any deviation strategy of agent i. Let t⋆ be the first round at which the deviating strategy differs from truthful reporting (if it never differs, we are done). By standard one-shot deviation logic in finite-horizon dynamic games, it suffices to show that, holding play truthful from t⋆ + 1 onward, the agent cannot gain by deviating at t⋆ alone by reporting some b ≠ θi, t⋆.
So fix round t = t⋆, fix the
public history ht − 1 that
would obtain under truthful play up to t − 1, and compare two
counterfactual histories at the end of round t:
ht = (ht − 1, bi, t = θi, t, b−i, t = θ−i, t, public
signals), ht′ = (ht − 1, bi, t = b, b−i, t = θ−i, t, public
signals′).
These two histories differ only through agent i’s round-t report entry and whatever
(bounded) derived learner inputs that report affects; by assumption, the
learner’s stability guarantee applies to such a change.
Write the deviator’s net gain from choosing b at round t instead of truth-telling as
$$
\Delta U_t(b)
=
\underbrace{\mathbb E\big[u_i(\theta_{i,t},s_t)\mid b\big]-\mathbb
E\big[u_i(\theta_{i,t},s_t)\mid \theta_{i,t}\big]}_{\text{(A)
within-round gain}}
\;-\;
\underbrace{\lambda\Big(\mathbb E[\mathrm{pen}(b,z_{i,t})\mid
\theta_{i,t}]-\mathbb E[\mathrm{pen}(\theta_{i,t},z_{i,t})\mid
\theta_{i,t}]\Big)}_{\text{(B) expected audit gap at }t}
\;+\;
\underbrace{\mathbb E[W_{t+1}(h_t')]-\mathbb E[W_{t+1}(h_t)]}_{\text{(C)
continuation gain}}.
$$
Term (B) is exactly where the effective penalty gap β(ε) enters: if ∥b − θi, t∥ ≥ ε,
then (B) ≥ λβ(ε).
Term (C) is where stability and α enter. The continuation value
Wt + 1 is
an expectation over future mechanism draws (from the learner) and future
audit signals. Conditional on the realized future public history, each
round τ > t the
learner outputs a distribution over mechanisms that is weakly η-stable with respect to changing
agent i’s round-t report. Therefore, applying the conversion bound |𝔼q[f] − 𝔼q′[f]| ≤ 4ηB round-by-round, with B equal to the remaining discounted mass from that round onward, and summing over τ > t, we obtain the uniform bound
𝔼[Wt + 1(ht′)] − 𝔼[Wt + 1(ht)] ≤ 4η α.
(Informally: at each future round, stability limits how much the
mechanism distribution can be tilted; bounded utilities convert this
into a bounded per-round payoff tilt; α aggregates these tilts across the
future.)
Putting these pieces together, for any deviation with ∥b − θi, t∥ ≥ ε,
ΔUt(b) ≤ (A) − λβ(ε) + 4ηα.
Under the maintained modeling choice that our mechanism class is myopically truthful (so (A) ≤ 0 conditional on any fixed mechanism draw), the condition λβ(ε) ≥ 4ηα implies ΔUt(b) ≤ 0.
Hence there is no profitable “large” deviation at the first deviation
time. By one-shot deviation, there is no profitable deviation strategy
that ever makes a deviation of size at least ε.
Finally, the ε-approximate nature of the equilibrium reflects precisely what we choose to police: reports within distance ε are left unconstrained by the audit gap condition. If, as is common in applications, the platform rounds reports to an ε-grid (so deviations smaller than ε do not change any learner input or allocation), then even these deviations are payoff-irrelevant and the same argument yields exact truthfulness. More generally, our conclusion is the stated ε-NIC guarantee.
The condition λβ(ε) ≥ 4ηα is a transparent policy rule: enforcement intensity λ times “detectability × severity” β(ε) must exceed the algorithm’s manipulability budget 4ηα. It also highlights a limitation: if verification is very noisy (small β(ε)) or agents are very forward-looking (large α), then maintaining incentives requires either heavy auditing (large λ) or stronger penalties (larger P inside β), both of which can be costly or infeasible. This is exactly the tradeoff we quantify next, when we decompose regret into learning performance, audit distortion, and approximation error and then optimize the design parameters.
Having established that mixing audits into a weakly-stable learner restores incentives, we now ask the complementary performance question: how much objective value does the platform give up in exchange? The answer is naturally stated as a regret decomposition. One component is standard learning regret (we compete with the best fixed mechanism in a benchmark class). A second component is audit distortion: whenever we audit, we do not run the learner’s preferred mechanism, so we mechanically lose objective value. The final components are approximation errors: (i) the tolerance region ε leaves slack in reports, which can translate into objective loss when the objective is sensitive to small type perturbations; and (ii) if we discretize a continuous mechanism family into a finite class Π, we pay discretization error.
Fix a finite mechanism class Π (either intrinsically finite or
obtained by discretizing a continuous design space). For the online
procedure Mλ (learner with
audit mixing rate λ), we
evaluate performance along the truthful equilibrium path guaranteed by
the incentive theorem. Let $\mathrm{Alg}(M_\lambda)=\mathbb E[\sum_{t=1}^T
G_t(\theta_t,s_t)]$ denote the planner’s realized objective under
Mλ
(expectation over the learner’s randomization and audit randomness,
conditional on the type sequence). We compare against the best fixed
mechanism in hindsight within Π:
$$
\mathrm{Opt}_\Pi
\;:=\;
\max_{\pi\in\Pi}\;
\mathbb E\Big[\sum_{t=1}^T G_t(\theta_t,s_t(\pi))\Big],
$$
where st(π)
denotes the outcome induced at round t when π is run on the (truthful) report
profile. The (truthful-play) regret is then
RegT(Mλ) := OptΠ − Alg(Mλ).
Because Gt ∈ [−1, 1],
regret is always bounded by 2T, and the goal is to obtain
sublinear dependence on T up
to the unavoidable approximation terms.
It is helpful to conceptually “peel off” the losses we incur relative to OptΠ.
Suppose first there were no audits and no incentive issues, so that the
platform simply ran a standard online algorithm over Π against the realized objectives.
Under full-information feedback (the platform can evaluate Gt(θt, st(π))
for every π ∈ Π after
round t), Hedge delivers the
familiar guarantee
$$
\mathrm{Reg}_T^{\mathrm{learn}}
\;\le\;
\frac{\log|\Pi|}{\eta_H} \;+\; O(\eta_H T),
$$
for a learning rate ηH > 0 (here
the constant in O(⋅) depends
only on the bounded range of the objective). This is the purely
statistical price of searching over a richer mechanism class: log |Π| enters multiplicatively.
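For concreteness, a minimal Hedge sketch over a finite mechanism class is given below. The synthetic gains array is a stand-in for the full-information feedback Gt(θt, st(π)) and is purely illustrative; the point is that the same learning rate ηH that drives the regret bound also controls the per-round stability used in the incentive argument.

```python
import numpy as np

def hedge_distributions(gains, eta_H):
    """Multiplicative-weights (Hedge) over a finite class Pi.

    gains: T x K array, gains[t, k] = G_t(theta_t, s_t(pi_k)) in [-1, 1]
           (full-information feedback: every mechanism is evaluated each round).
    Returns the T x K array of distributions q_t(. | h_{t-1}) actually played.
    """
    T, K = gains.shape
    log_w = np.zeros(K)                 # log-weights, for numerical stability
    played = np.zeros((T, K))
    for t in range(T):
        q = np.exp(log_w - log_w.max())
        q /= q.sum()
        played[t] = q                   # distribution used at round t
        log_w += eta_H * gains[t]       # exponential-weights update
    return played

# Tiny demo with synthetic bounded gains (illustrative only).
rng = np.random.default_rng(1)
T, K = 2000, 8
gains = rng.uniform(-1, 1, size=(T, K)) + 0.3 * (np.arange(K) == 3)  # arm 3 slightly better
gains = np.clip(gains, -1, 1)
q_t = hedge_distributions(gains, eta_H=np.sqrt(np.log(K) / T))
alg = (q_t * gains).sum()
best_fixed = gains.sum(axis=0).max()
print(f"regret ≈ {best_fixed - alg:.1f}  (bound ~ O(sqrt(T log K)))")
```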
Under Mλ,
with probability λ we run
πaud instead of the
learner-chosen mechanism. Even if πaud is “reasonable,” it
is generally not the optimizer of Gt at that
round. Since Gt ∈ [−1, 1], a
crude worst-case bound is that substituting any other mechanism can
reduce the round objective by at most 2. Thus the expected cumulative distortion
from auditing is at most
Lossaudit ≤ 2λT.
This term is where incentives directly trade off against welfare:
stronger enforcement (larger λ) worsens the objective
mechanically.
Our incentive guarantee is ε-NIC: deviations smaller than ε are not deterred by the audit gap
condition. To convert this into a welfare statement, we impose a
standard sensitivity bound: there exists a constant Lθ such that
changing the report profile by at most ε in norm changes the induced
objective by at most Lθε per
round (this can be derived, for example, from Lipschitz utilities and
continuity of the allocation rule in types). Then the worst-case
objective loss from “allowed” within-ε deviations is
Lossε ≤ O(Lθ ε T).
Separately, if Π discretizes a
continuous design set 𝒰, we incur a
benchmark mismatch
$$
\mathrm{disc}(\Pi)
\;:=\;
\sup_{\text{type sequences}}\Big(\max_{\pi\in\mathcal U}\sum_{t=1}^T
G_t(\theta_t,s_t(\pi))
\;-\;
\max_{\pi\in\Pi}\sum_{t=1}^T G_t(\theta_t,s_t(\pi))\Big),
$$
which is application-specific and typically decreases as the
discretization is refined (at the expense of larger |Π|).
The preceding terms would appear in any “audit-mixing” scheme, but
our setting adds a crucial coupling: we cannot choose λ freely, because it must satisfy
the incentive feasibility condition
λ β(ε) ≥ 4η α,
where η is the learner’s
per-round stability and α is
the effective horizon. Thus, any design that makes the learner more stable (smaller η) relaxes the required audit
rate and therefore reduces audit distortion. Conversely, a more
aggressive learner can have smaller learning regret but may force more
auditing.
For standard multiplicative-weights updates (Hedge), stability is
controlled by the learning rate: when payoffs are bounded, the induced
mapping from histories to distributions over Π satisfies a max-divergence bound
of the form η ≲ c ηH
for a constant c. Substituting
the IC constraint at equality (the minimal auditing that still deters
large deviations),
$$
\lambda^\star(\varepsilon)
\;=\;
\frac{4\eta\alpha}{\beta(\varepsilon)}
\;\approx\;
\frac{4c\,\eta_H\,\alpha}{\beta(\varepsilon)},
$$
and then inserting into Lossaudit ≤ 2λT
yields an audit-loss contribution of order
$$
\mathrm{Loss}^{\mathrm{audit}}
\;\lesssim\;
\frac{\eta_H\,\alpha\,T}{\beta(\varepsilon)}.
$$
This is the key point: the audit loss is not an independent design knob;
it is an endogenous consequence of stability and detectability.
Putting the pieces together, one obtains a bound of the form
$$
\mathrm{Reg}_T(M_\lambda)
\;\le\;
\frac{\log|\Pi|}{\eta_H} \;+\; O(\eta_H T)
\;+\; O\Big(\tfrac{\eta_H\,\alpha\,T}{\beta(\varepsilon)}\Big)
\;+\; O(L_\theta\,\varepsilon\,T)
\;+\; O(\mathrm{disc}(\Pi)).
$$
Choosing ηH to balance
the first two (sublinear-in-T)
terms yields a convenient scaling. If we take $\eta_H\asymp 1/\sqrt{\alpha T}$ (which makes
the learner increasingly stable as agents become more forward-looking),
then the learning term becomes $O((\log|\Pi|)\sqrt{\alpha T})$ and the
induced audit term becomes $O(\sqrt{\alpha
T}/\beta(\varepsilon))$, matching the headline form we will use
going forward:
$$
\mathrm{Reg}_T(M_\lambda)
\;\le\;
O\Big((\log|\Pi|)\sqrt{\alpha T}\Big)
\;+\;
O\Big(\tfrac{\sqrt{\alpha T}}{\beta(\varepsilon)}\Big)
\;+\;
O(L_\theta\,\varepsilon\,T)
\;+\;
O(\mathrm{disc}(\Pi)).
$$
We view this expression as the planner’s “audit–regret frontier”: it
isolates the single scalar function β(ε)—the
detectability/severity of deviations of size at least ε—as the object that maps
verification technology into welfare guarantees.
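To make the accounting concrete, the following sketch simply tabulates the four terms of the headline bound for given primitives; all hidden constants are set to one, so it is an order-of-magnitude accounting device rather than a certified bound, and the numerical inputs are hypothetical.

```python
import math

def regret_frontier_terms(n_mechs, alpha, T, beta_eps, L_theta, eps, disc=0.0):
    """Order-of-magnitude accounting of the audit-regret frontier
    (all hidden constants set to 1; illustrative only)."""
    learning  = math.log(n_mechs) * math.sqrt(alpha * T)   # O(log|Pi| * sqrt(alpha T))
    audit     = math.sqrt(alpha * T) / beta_eps            # O(sqrt(alpha T) / beta(eps))
    tolerance = L_theta * eps * T                          # O(L_theta * eps * T)
    return {"learning": learning, "audit": audit,
            "tolerance": tolerance, "discretization": disc,
            "total": learning + audit + tolerance + disc}

print(regret_frontier_terms(n_mechs=100, alpha=50, T=10_000,
                            beta_eps=0.2, L_theta=1.0, eps=0.02))
```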
For a fixed tolerance ε,
the welfare-optimal choice of audit frequency is typically to audit only as much as incentives require, i.e.,
$$
\lambda
\;=\;
\min\Big\{1,\;\frac{4\eta\alpha}{\beta(\varepsilon)}\Big\}.
$$
Auditing more than this cannot improve incentives (once large deviations
are already unprofitable) but continues to distort the objective. The
truncation at 1 highlights a
feasibility limitation: if β(ε) is too small (noisy
verification, weak penalties, or too small a tolerance window), then
even λ = 1 cannot satisfy the
IC inequality, and one must revisit the audit design (increase P, change the test, or accept a
larger ε).
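A minimal sketch of this policy rule, with the truncation at 1 surfaced as an infeasibility flag, is below; the numerical inputs are hypothetical.

```python
def audit_rate(eta, alpha, beta_eps):
    """Minimal audit probability satisfying lambda * beta(eps) >= 4 * eta * alpha,
    truncated at 1. Returns (lambda, feasible)."""
    required = 4.0 * eta * alpha / beta_eps
    if required > 1.0:
        # Even auditing every round cannot deter large deviations:
        # revisit the audit design (raise P, change the test, or enlarge eps).
        return 1.0, False
    return required, True

print(audit_rate(eta=0.001, alpha=20, beta_eps=0.4))   # (0.2, True)
print(audit_rate(eta=0.01,  alpha=50, beta_eps=0.3))   # (1.0, False)
```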
The remaining design choice is ε, which enters the regret bound in
two opposing ways:
incentives become cheaper as
ε↑ (β(ε) typically increases), but
allocations become noisier as
ε↑ (LθεT
grows).
Abstracting away log |Π| and
discretization terms (which do not depend on ε), we are led to the
one-dimensional optimization problem
$$
\min_{\varepsilon>0}\;
\Big\{\,C_1\frac{\sqrt{\alpha T}}{\beta(\varepsilon)} \;+\; C_2
L_\theta\,\varepsilon\,T\,\Big\},
$$
for constants C1, C2
coming from the preceding inequalities. This objective is typically
U-shaped in ε: very small
ε makes β(ε) tiny (audits must be
frequent), while very large ε
creates large tolerated misreporting regions that degrade outcomes
directly. In applications where β(ε) has a closed form
(e.g., Laplace verification with threshold tests), the minimizer can be
computed analytically or by a simple line search; otherwise, a numerical
minimization over ε ∈ (0, 1]
is straightforward because the problem is one-dimensional and β(ε) is monotone.
To illustrate what the frontier can look like, consider Laplace verification z = θ + ξ with ξ ∼ Laplace(0, σ) and a threshold penalty rule pen(b, z) = P ⋅ 1{|b − z| > τ}. For deviations Δ = ∥b − θ∥ ≥ ε (and in the scalar case for simplicity), Proposition 1 provides a closed form. In the informative regime ε ≥ τ, β(ε) is increasing in ε and approaches the limit P(1 − e−τ/σ) as ε → ∞: once a misreport is “large enough,” the test almost surely flags it whenever auditing occurs, so detectability saturates. The frontier implication is immediate: beyond some point, increasing ε barely reduces required auditing (because β(ε) has plateaued) but continues to increase the direct tolerance loss LθεT. Hence the optimal ε is often near the scale at which verification becomes meaningfully informative, rather than arbitrarily large.
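To make this concrete, the sketch below re-derives the detection probabilities from Laplace tail formulas (an assumption consistent with the stated saturation limit P(1 − e−τ/σ), not a transcription of Proposition 1) and then runs the one-dimensional search for ε⋆ on the frontier; all numerical values are hypothetical.

```python
import numpy as np

def laplace_tail(x, sigma):
    """P(xi > x) for xi ~ Laplace(0, sigma)."""
    return np.where(x >= 0, 0.5 * np.exp(-x / sigma), 1.0 - 0.5 * np.exp(x / sigma))

def beta_laplace(eps, sigma, tau, P):
    """Effective audit gap for a scalar deviation of size >= eps under the
    threshold test pen(b, z) = P * 1{|b - z| > tau} with Laplace(0, sigma) noise.
    Detection probability is monotone in deviation size, so the infimum over
    deviations of size >= eps is attained at eps."""
    p_flag_truthful = np.exp(-tau / sigma)                 # false-positive rate e^{-tau/sigma}
    p_flag_deviator = laplace_tail(tau - eps, sigma) + laplace_tail(tau + eps, sigma)
    return P * (p_flag_deviator - p_flag_truthful)

# One-dimensional search for eps* on C1*sqrt(alpha*T)/beta(eps) + C2*L_theta*eps*T.
sigma, tau, P = 0.05, 0.1, 1.0
alpha, T, L_theta, C1, C2 = 50, 10_000, 1.0, 1.0, 1.0
eps_grid = np.linspace(0.01, 1.0, 500)
frontier = C1 * np.sqrt(alpha * T) / beta_laplace(eps_grid, sigma, tau, P) \
           + C2 * L_theta * eps_grid * T
print("eps* ≈", eps_grid[np.argmin(frontier)])
print("saturation limit of beta:", P * (1 - np.exp(-tau / sigma)))
```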
The final knob is the richness of Π. A finer discretization reduces disc(Π) but enlarges log |Π| and therefore the learning term. Practically, this suggests a two-step tuning heuristic: (1) pick a discretization coarse enough that log |Π| does not dominate $\sqrt{\alpha T}$-scaling terms; (2) verify that disc(Π) is below the economic tolerance for approximation error in the application (e.g., reserve-price grids in auctions, menu-size constraints in pricing, or tax schedules in policy design). Because audits already introduce an unavoidable distortion, it is rarely optimal to drive disc(Π) to zero if doing so explodes |Π| and learning regret.
Two caveats are worth keeping in view. First, the cleanest bounds above assume full-information feedback; with bandit feedback (only observing Gt for the executed mechanism) one can use Exp3-style algorithms, but the learning term worsens and the stability analysis requires additional care. Second, the Lipschitz conversion O(LθεT) is a reduced-form way to capture how tolerant incentives translate into welfare loss; in market design settings with discontinuities (e.g., tie-breaking, capacity constraints), establishing such a bound may require smoothing or explicit randomization.
What this section delivers, however, is a tractable design lens: once we can (i) quantify β(ε) from the verification technology and penalty rule, and (ii) bound Lθ from the allocation environment, the tuning of (ηH, λ, ε) follows from an explicit frontier. In the next section we use this lens to derive comparative statics—how optimal audit rates and tolerances move with long-sightedness α, verification noise, penalty magnitude P, and the environment’s sensitivity Lθ.
The regret–incentives decomposition above collapses the design problem to a small number of economically interpretable primitives: (i) how forward-looking agents are (their effective horizon α), (ii) how manipulable the learner is (its per-round stability η), (iii) how informative verification is (captured by the gap β(ε), which in turn depends on the noise scale σ, the test/threshold τ, and penalty magnitude P), and (iv) how sensitive the objective is to small type perturbations (the Lipschitz constant Lθ). We now spell out how optimal auditing and tolerance choices move with these primitives, and we translate the resulting comparative statics into a practical tuning workflow.
Across environments, one robust principle is that auditing is
typically welfare-improving only insofar as it is needed for incentives.
Because audits mechanically substitute away from the learner’s preferred
mechanism, the welfare-optimal policy (holding fixed ε and the audit design) is to
enforce at the boundary of the incentive constraint:
$$
\lambda^\star(\varepsilon)
\;=\;
\min\Big\{1,\;\frac{4\eta\,\alpha}{\beta(\varepsilon)}\Big\}.
$$
All comparative statics for auditing therefore flow through two
channels: parameters that increase the manipulation benefit (via η or α) push λ⋆ up, while parameters
that improve detectability/severity (via β(ε)) push λ⋆ down. The remaining
design degrees of freedom—ε
(how large a lie we bother deterring) and the audit test
itself—determine which side of this inequality is cheapest to move.
The first and most direct monotonicity is in α. A larger α means an agent can rationally incur a short-run cost in exchange for steering the learner’s future mechanism choices, so the continuation-value gain from a single-round misreport is larger. Because the incentive constraint is λβ(ε) ≥ 4ηα, we need either (i) more frequent audits (λ↑), (ii) a more punitive/more informative audit that raises β(ε), or (iii) a more stable learner (η↓).
This observation has a useful operational corollary: when we expect
long-lived participants (large α), it is often better to slow the learner down than to ramp up auditing. Indeed,
for multiplicative-weights learners, stability is controlled by the
learning rate, so we can deliberately slow down updates to reduce η and thereby reduce the required
audit rate. Concretely, with Hedge we typically have η ≲ c ηH,
so λ⋆(ε) ≈ 4c ηHα/β(ε).
Choosing $\eta_H\asymp 1/\sqrt{\alpha
T}$ makes the induced audit frequency scale like
$$
\lambda^\star(\varepsilon)
\;\asymp\;
\frac{\sqrt{\alpha/T}}{\beta(\varepsilon)},
$$
which grows only like $\sqrt{\alpha}$
rather than linearly. Economically, this is the familiar “patient agents require gentle policies” lesson: if we insist on a fast-reacting
learner, we must pay for it with more enforcement.
The audit gap β(ε) is the reduced-form object translating a noisy verification technology into deterrence. Under essentially any reasonable noise model, increasing the noise scale σ makes the distribution of z = θ + ξ less informative about θ, weakening the detection advantage of misreports and therefore decreasing β(ε) for fixed (ε, τ, P). The incentive boundary then implies λ⋆ must rise, possibly to the feasibility limit λ = 1.
The more subtle comparative static is on the optimal tolerance ε⋆. When verification becomes noisier, it becomes expensive (in audit frequency) to deter small lies, because small deviations are hard to distinguish from noise. The frontier optimization therefore tends to increase ε⋆: we stop trying to police fine-grained deviations and instead focus enforcement on economically meaningful lies that are detectable. This logic mirrors many real compliance systems: when measurement error is high (e.g., self-reported inputs with weak third-party verification), policy often shifts from precise truth enforcement to coarse rules that deter only large misstatements.
In settings where β(ε) exhibits saturation (as in threshold tests with light-tailed noise), the implication is particularly sharp. Once ε is large enough that deviations are detected with near-maximal probability conditional on an audit, further increasing ε yields little additional incentive benefit (because β(ε) is near its plateau) but continues to increase the direct tolerance loss LθεT. Thus noisier verification pushes ε⋆ upward only until we reach the “informative” scale of the signal; beyond that, improving welfare requires changing the audit technology (lowering σ) or increasing effective penalties (raising P), not simply loosening tolerance.
When the penalty rule scales linearly with P (as in pen(b, z) = P ⋅ 1{flag}),
the effective gap typically scales linearly as well: β(ε) = P ⋅ β̃(ε),
where β̃(ε) depends on
the noise model and the test. Then the boundary audit probability
obeys
$$
\lambda^\star(\varepsilon)
\;\approx\;
\frac{4\eta\alpha}{P\,\tilde\beta(\varepsilon)},
$$
so increasing P reduces
required auditing one-for-one. This is the cleanest comparative static
in the model: if we can double enforceable penalties, we can (roughly)
halve audit frequency while preserving incentives.
The limitation is that P is
not a free variable in many applications. Legally, penalties may be
capped; operationally, harsh penalties may induce churn or
nonparticipation; ethically, noisy audits create false positives whose
burden scales with P. In
practice, P often must be
chosen jointly with the audit threshold τ to satisfy an IR-type constraint
of the form
expected surplus from
participation ≥ λ P pfp(τ),
where pfp(τ)
is the false-positive probability for truthful agents. This constraint
introduces a familiar compliance tradeoff: higher penalties can reduce
audit frequency, but only if we simultaneously reduce false positives
(e.g., by raising τ or
improving measurement), otherwise we simply transfer welfare away from
truthful participants.
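A minimal check of this IR-type constraint, using the Laplace false-positive rate e−τ/σ from the earlier audit example and a hypothetical surplus level, is sketched below; it also illustrates that doubling P while halving λ leaves the truthful false-positive burden unchanged unless τ (or measurement quality) also improves.

```python
import math

def ir_satisfied(expected_surplus, lam, P, tau, sigma):
    """Participation check: expected surplus from truthful play must cover the
    expected false-positive penalty burden lambda * P * p_fp(tau), where p_fp
    is the truthful flag probability under Laplace(0, sigma) noise."""
    p_fp = math.exp(-tau / sigma)
    burden = lam * P * p_fp
    return expected_surplus >= burden, burden

# Hypothetical numbers: doubling P and halving lambda preserves deterrence
# but leaves the false-positive burden on truthful agents unchanged.
print(ir_satisfied(expected_surplus=0.05, lam=0.20, P=1.0, tau=0.1, sigma=0.05))
print(ir_satisfied(expected_surplus=0.05, lam=0.10, P=2.0, tau=0.1, sigma=0.05))
```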
The Lipschitz constant Lθ governs how costly an ε-slack equilibrium is in terms of realized objective value. When Lθ is large, even small within-ε deviations can translate into meaningful objective loss, so the tolerance term O(LθεT) pushes us toward a smaller ε⋆. But smaller ε typically reduces β(ε), which then raises λ⋆ via the incentive boundary. Thus, higher Lθ tends to shift the optimal design toward stricter enforcement: either more audits, or stronger/more informative audits, or a more stable learner (smaller η) to keep audit distortion manageable.
This comparative static also offers a concrete modeling lesson. In environments with discontinuities—e.g., capacity constraints, tie-breaking, threshold eligibility rules—Lθ may be effectively large or ill-behaved, making ε-slack costly. In such cases, it can be welfare-improving to smooth the allocation environment directly (randomized tie-breaking, smoothed allocation rules, or deliberately coarse outcome granularity) to reduce Lθ. In our framework, decreasing Lθ is economically equivalent to making tolerance cheaper, which reduces the need to audit aggressively for fine deviations.
While the exact optimizer of ε is application-specific through
β(ε), the frontier
problem has a generic shape:
$$
\min_{\varepsilon>0}\;\Big\{\,C_1\frac{\sqrt{\alpha
T}}{\beta(\varepsilon)} \;+\; C_2 L_\theta\,\varepsilon\,T\,\Big\},
$$
where C1, C2
absorb constants from the learning and incentive bounds. A useful
back-of-the-envelope rule emerges if β(ε) grows approximately
linearly over the relevant range, say β(ε) ≈ k ε
(this is a reasonable local approximation when ε is near a detection threshold and
far from saturation). Then the optimizer scales as
$$
\varepsilon^\star
\;\asymp\;
\sqrt{\frac{C_1}{C_2}}\;
\Big(\frac{\sqrt{\alpha T}}{k\,L_\theta\,T}\Big)^{1/2}
\;=\;
\sqrt{\frac{C_1}{C_2}}\;\frac{\alpha^{1/4}}{\sqrt{k\,L_\theta}\;T^{1/4}}.
$$
This heuristic captures the qualitative directions emphasized above:
ε⋆ increases with
long-sightedness α (we
tolerate larger lies because deterring small ones becomes expensive),
decreases with objective sensitivity Lθ, and
decreases slowly with the horizon T (longer horizons justify finer
enforcement because the per-round tolerance loss accumulates).
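For completeness, the scaling above is just the first-order condition of the approximated frontier under the local assumption β(ε) ≈ k ε:
$$
\frac{d}{d\varepsilon}\Big(C_1\frac{\sqrt{\alpha T}}{k\,\varepsilon}+C_2 L_\theta\,\varepsilon\,T\Big)
\;=\;-\,C_1\frac{\sqrt{\alpha T}}{k\,\varepsilon^{2}}+C_2 L_\theta\,T\;=\;0
\quad\Longrightarrow\quad
\varepsilon^\star=\Big(\frac{C_1\sqrt{\alpha T}}{C_2\,k\,L_\theta\,T}\Big)^{1/2}
=\sqrt{\frac{C_1}{C_2}}\;\frac{\alpha^{1/4}}{\sqrt{k\,L_\theta}\;T^{1/4}}.
$$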
Of course, if β(ε) saturates, the relevant regime is instead “choose ε near the onset of saturation,” after which further increases in ε are dominated by LθεT. This is precisely where closed-form audit models (like Laplace threshold tests) are valuable: they let us locate, quantitatively, the scale at which marginal detectability gains become small.
For applications, we find it helpful to treat the model as a checklist that forces all enforcement costs into a single accounting identity.
Step 1. Fix an audit test and estimate (or upper bound) the noise scale σ and the false-positive rate pfp(τ) for truthful behavior. Use these to compute or lower bound β(ε) over a grid of ε values. If β(ε) is uniformly tiny, no amount of clever learning will restore incentives without infeasible auditing; the binding constraint is measurement, not optimization.
Step 2. Pick P and τ to satisfy participation/acceptability constraints (formal IR, legal constraints, or reputational considerations). This step caps the maximal achievable β(ε).
Step 3. Choose ηH not only to minimize statistical regret, but to control η and thereby reduce the enforcement burden. When α is large, stability is a first-order design parameter.
Step 4. For each candidate ε, set λ = λ⋆(ε) and evaluate the resulting bound (or a simulation-based proxy for it). Choose ε by a one-dimensional search. If λ⋆(ε) hits 1 over a wide range, the feasible region is telling us that the audit technology is too weak for the strategic environment; we must revisit Steps 1–2 or accept weaker incentive guarantees.
These monotone directions are robust, but two caveats matter in practice. First, α is a reduced-form summary of intertemporal incentives; different discount profiles can yield the same α but different short-run behavior, which can matter if audits are bursty or if participation is intermittent. Second, Lθ hides environment-specific non-smoothness; in some markets the true welfare impact of small misreports is highly state-dependent, so a single global Lipschitz constant may be conservative.
These considerations motivate the extensions that follow. In many platforms, agents are heterogeneous in patience and in detectability (different αi, different noise levels, different compliance histories), making targeted auditing natural. Participation constraints may also be endogenous: auditing and penalties can change who shows up and when. Finally, with multi-dimensional types the choice of norm in ∥b − θ∥ ≥ ε becomes economically meaningful, because it determines which deviations are treated as “large”—a modeling choice that can be aligned with which dimensions are verifiable or welfare-critical.
We now sketch three extensions that arise almost immediately once we try to operationalize the framework on real platforms: heterogeneous agents (and the case for targeted audits), endogenous participation constraints, and genuinely multi-dimensional types where the choice of norm in “∥b − θ∥ ≥ ε” becomes a design decision rather than a mathematical convenience. Our goal here is not to redo the full analysis in each case, but to show how the same two levers—(i) stability limiting dynamic manipulation and (ii) audits creating an effective penalty gap—continue to organize the economics.
In many environments, agents differ along exactly the dimensions that determine the incentive boundary: patience/retention (an αi or a discount profile γi(⋅)), manipulability (how strongly their report affects the learner’s future actions), and detectability (a noise scale σi or an audit-gap function βi(ε)). A uniform audit rate λ designed for the worst case can therefore be extremely conservative: it over-audits short-lived or easily verified participants, and under-audits the few participants with both high strategic leverage and weak verifiability.
A minimal formal extension replaces the single constraint by
agent-specific constraints. If agent i has effective horizon αi and audit
technology implying an effective gap βi(ε),
then the same logic yields a sufficient condition of the form
λi βi(ε) ≥ 4η αi for
all i,
where λi
is the probability that agent i is audited in a given round (or,
more generally, the marginal audit probability applied to their report
in that round). This immediately suggests a targeted auditing
policy:
$$
\lambda_i^\star(\varepsilon)\;=\;\min\Big\{1,\;\frac{4\eta\,\alpha_i}{\beta_i(\varepsilon)}\Big\}.
$$
The economic content is straightforward: we devote auditing capacity to
where it purchases the most deterrence per unit of distortion. In
platform terms, this is the familiar move from “random checks on
everyone” toward “risk-based” or “priority” checks.
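A sketch of such risk-based allocation of audit probability, applying the per-agent rule above with hypothetical horizons and detectability inputs, is below.

```python
def targeted_audit_rates(eta, alphas, betas_eps):
    """Per-agent audit probabilities lambda_i = min{1, 4*eta*alpha_i / beta_i(eps)}.
    alphas and betas_eps are dicts keyed by agent id (hypothetical inputs)."""
    return {i: min(1.0, 4.0 * eta * alphas[i] / betas_eps[i]) for i in alphas}

# Illustrative: a patient, hard-to-verify agent absorbs most of the audit budget.
alphas = {"short_lived": 2.0, "typical": 20.0, "patient_opaque": 80.0}
betas  = {"short_lived": 0.5, "typical": 0.4, "patient_opaque": 0.2}
print(targeted_audit_rates(eta=0.001, alphas=alphas, betas_eps=betas))
# {'short_lived': 0.016, 'typical': 0.2, 'patient_opaque': 1.0}
```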
Once λ depends on the agent or on history, the audit policy itself becomes something agents may try to influence. If we tell agents “those who look suspicious will be checked,” then “looking suspicious” becomes a strategic state variable. This is not merely a behavioral concern; it is a mechanism-design issue because the targeting function can reintroduce dynamic incentives even if the learner over Π is stable.
One principled way to keep targeting from being manipulable is to
make the audit lottery depend only on exogenous or hard-to-manipulate signals:
tenure, transaction volume, coarse categories, or delayed verification
signals that are realized after the report is submitted. Another is to
commit to a targeting policy that is itself stable in the same
max-divergence sense: small changes in a report should not sharply
change audit probability. Concretely, one can require a “targeting
stability” condition such as
$$
\Pr(\text{audit } i \mid h)\;\le\;e^{\eta_{\mathrm{aud}}}\,\Pr(\text{audit } i \mid h')
$$
for adjacent histories h, h′ differing
in one report entry, and then fold ηaud into the same
incentive accounting. Intuitively, if audit probability can jump
discontinuously when an agent deviates, then audits become an additional
punishment channel (often attractive in practice, but normatively and
legally delicate); if audit probability changes smoothly, then it
behaves like a standard enforcement budget rather than a discretionary
sanction.
A further practical complication is that αi and βi(⋅) may not be known a priori. Platforms often infer patience from retention, and infer detectability from past audit outcomes. This creates a feedback loop: we audit to learn who is risky, but auditing itself changes incentives and behavior.
Within our framework, a conservative approach is to treat
heterogeneity via worst-case bounds: maintain a lower confidence bound $\underline\beta_i(\varepsilon)$ for each
agent based on audit data, and set
$$
\lambda_{i,t}\;=\;\min\Big\{1,\;\frac{4\eta\,\bar\alpha_i}{\underline\beta_i(\varepsilon)}\Big\},
$$
where ᾱi
is an upper bound on effective horizon (or a regulatorily chosen
worst-case for a class of agents). This is economically analogous to
“prudential” enforcement: we audit more when we are uncertain about
detectability. The limitation, of course, is that this can be overly
conservative early on; tightening it requires an explicit exploration
policy for audits, which reopens the question of whether the exploration
itself can be manipulated. Addressing this rigorously typically requires
combining strategic incentive bounds with bandit-style confidence
arguments, and—crucially—ensuring that the audit-allocation algorithm is
itself stable.
The earlier comparative statics treated participation constraints as an external cap on feasible penalties and false positives. In many applications, participation is an endogenous margin: agents can opt out, multi-home, reduce activity, or shift into less verifiable channels. Audits then affect not only truthfulness but also who shows up and what types are represented in the data used by the learner.
A simple reduced-form way to incorporate this is to introduce an
outside option Uiout
and require an ex-ante (or interim) individual rationality
constraint:
$$
\mathbb E\Big[\sum_{t\in T_i}\gamma_i(t)\big(u_i(\theta_{i,t},s_t)-\mathbf 1\{\text{audit at }t\}\cdot\mathrm{pen}_i(b_{i,t},z_{i,t})\big)\Big]\;\ge\;U_i^{\mathrm{out}}
\quad\text{for all relevant types.}
$$
Even if the incentive constraint is satisfied, an aggressive audit
policy can violate IR through false positives or through the perceived
“tax” of being monitored. Empirically, many platforms experience this as
a trust and retention problem: enforcement that is statistically
justified may still be commercially infeasible.
This produces a three-way tradeoff: (i) audits deter manipulation and protect the learner; (ii) audits distort current outcomes (because we mix away from the learner-chosen mechanism); and (iii) audits can shrink participation, which reduces both welfare directly and the informational richness of the environment the learner faces. In such settings, it can be optimal to shift effort from punitive enforcement toward design choices that raise baseline verifiability (lower σ), reduce the damage of small misreports (lower effective Lθ through smoother rules), or provide insurance against false positives (e.g., refundable deposits, appeal processes, or penalty escrow). Formally, these are ways to relax IR without giving up the effective penalty gap on large deviations.
One practical mechanism that sits naturally inside our model is to implement penalties via temporary holds or deposits rather than irreversible fines. Economically, a deposit can preserve deterrence while reducing expected welfare loss for truthful agents if disputes can be resolved with additional information. In reduced form, this corresponds to replacing pen(b, z) with a two-stage function: a provisional penalty applied when the signal is extreme, and a refund probability that depends on subsequent evidence. The effective gap β(ε) then depends on the net expected loss from misreporting after the refund stage, while the IR cost depends on the expected (possibly time-discounted) burden on truthful agents. This distinction matters when agents are liquidity constrained or when delayed refunds impose real costs—yet it also offers a policy-relevant design knob that is often more acceptable than raising P outright.
With Θ = [0, 1]d, the event “∥b − θ∥ ≥ ε” is not innocuous: it defines which deviations we invest in deterring. In one dimension, deviation size is unambiguous; in d dimensions, the same scalar ε can correspond to very different sets of deviations (balls in different norms) and therefore different behavioral margins.
Two forces push us toward careful norm choice. First, verifiability is frequently anisotropic: some coordinates are easy to verify (e.g., past transactions), others are inherently private (e.g., intent or cost). Second, the planner objective may be far more sensitive in some dimensions than others (a “high-stakes” coordinate). A single Euclidean tolerance may therefore be poorly aligned with either enforcement or welfare.
A flexible specification is to use a weighted norm, for example
$$
\|x\|_{W,1}=\sum_{k=1}^d w_k|x_k|
\qquad\text{or}\qquad
\|x\|_{W,2}=\sqrt{x^\top W x},
$$
and define deviations as “large” when ∥b − θ∥W ≥ ε.
Choosing W lets us encode “we
care more (or can verify more) along these dimensions.” On the audit
side, the effective gap becomes a function of this geometry:
$$
\beta(\varepsilon)\;=\;\inf_{\|b-\theta\|_W\ge\varepsilon}\Big(\mathbb E[\mathrm{pen}(b,z)\mid\theta]-\mathbb E[\mathrm{pen}(\theta,z)\mid\theta]\Big).
$$
On the welfare side, the tolerance loss is governed by an appropriate
Lipschitz constant in the same norm (or by separate constants per
coordinate). The design problem is then not only choosing ε, but choosing the enforcement
geometry that best matches both detection and objective sensitivity.
A further issue is statistical: if we use a single radial test in ℝd (e.g., penalize when ∥b − z∥2 > τ), then in high dimension the noise distribution’s tails and concentration can behave counterintuitively, and the gap β(ε) can deteriorate unless thresholds scale with d. In practice, platforms often adopt coordinate-wise or rule-based checks (“if any critical field is inconsistent, flag”) precisely to avoid this.
In our notation, a coordinate-wise threshold audit might take the
form
$$
\mathrm{pen}(b,z)\;=\;P\cdot\mathbf 1\Big\{\max_{k\le d}|b_k-z_k|>\tau_k\Big\},
$$
or a weighted sum of flags. Such rules can yield a larger and more
interpretable β(ε)
for deviations concentrated in sensitive coordinates, while keeping
false positives controlled by tuning the τk’s. The
limitation is that they privilege sparse deviations: a manipulator could
spread a lie across many dimensions, staying below every coordinate
threshold while still shifting the allocation. This is exactly where the
choice of norm and weights matters—an ℓ∞-style audit pairs
naturally with an ℓ∞-style tolerance, while
an ℓ1-style welfare
sensitivity may require audits that aggregate evidence across
coordinates.
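A small sketch of the limitation just noted: a coordinate-wise (ℓ∞-style) flag misses a lie spread across many dimensions even though its aggregate (ℓ1) size is large. The thresholds, the deviations, and the noiseless signal are all hypothetical choices made for illustration.

```python
import numpy as np

def coordinatewise_flag(b, z, taus):
    """Flag if any coordinate of the report disagrees with the verification
    signal by more than its coordinate threshold."""
    return bool(np.any(np.abs(b - z) > taus))

d = 10
theta = np.full(d, 0.5)
taus = np.full(d, 0.05)
z = theta.copy()                       # noiseless signal, so flags reflect the misreport alone

concentrated_lie = theta.copy()
concentrated_lie[0] += 0.3             # one big coordinate: caught
spread_lie = theta + 0.04              # 0.04 on every coordinate: evades every threshold

print(coordinatewise_flag(concentrated_lie, z, taus))                 # True
print(coordinatewise_flag(spread_lie, z, taus))                       # False
print("l1 size of the spread lie:", np.abs(spread_lie - theta).sum()) # 0.4, not small
```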
Heterogeneity and multidimensionality interact: if we can only verify
some coordinates well, then an “audit probability” is not the only
targeting dimension—we can also target which dimensions to verify more carefully. This suggests a richer audit lottery that chooses an audit mode (which
coordinates, which thresholds, which follow-up steps) rather than a
single πaud.
Conceptually, we can index audit modes by m ∈ ℳ, each delivering a gap βm(ε)
and a false-positive cost, and then choose a mixture over m to satisfy
$$
\sum_{m\in\mathcal M}\lambda_m\,\beta_m(\varepsilon)\;\ge\;4\,\eta\,\alpha
$$
at minimum expected distortion/IR cost. This is a direct analogue of
allocating an enforcement budget across different investigative
tools.
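This allocation can be written as a small linear program. The sketch below uses scipy.optimize.linprog for illustration; the per-mode gaps and distortion costs are hypothetical, and the additional per-round budget constraint (total audit probability at most one) is an assumption added to make the mixture interpretable as a single lottery.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical audit modes: (effective gap beta_m(eps), per-use distortion/IR cost).
betas = np.array([0.60, 0.30, 0.10])   # document check, automated anomaly test, light ping
costs = np.array([0.80, 0.30, 0.05])
eta, alpha = 0.002, 40.0
deterrence_needed = 4.0 * eta * alpha   # = 0.32

# minimize  sum_m cost_m * lambda_m
# s.t.      sum_m beta_m * lambda_m >= 4 * eta * alpha   (incentive constraint)
#           sum_m lambda_m <= 1,  0 <= lambda_m <= 1     (per-round audit budget)
res = linprog(c=costs,
              A_ub=np.vstack([-betas, np.ones_like(betas)]),
              b_ub=np.array([-deterrence_needed, 1.0]),
              bounds=[(0.0, 1.0)] * len(betas),
              method="highs")
print("audit mixture:", np.round(res.x, 3), "expected cost:", round(res.fun, 3))
```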
These extensions sharpen the main message rather than overturn it. As we add heterogeneity, participation responses, and multidimensional types, the same accounting identity persists: dynamic manipulation benefits scale with stability and horizon, while deterrence comes from an expected penalty gap created by audits. What changes is that the designer now has additional degrees of freedom—who to audit, when to audit, which dimensions to verify, and how to structure penalties to preserve participation.
The pragmatic takeaway is that “audit rate” is not a single scalar in realistic deployments. It is a policy surface over agents, histories, and type dimensions, and it must be designed with two constraints in mind: it should be hard to game (or explicitly stable), and it should be aligned with what the platform can actually verify and what the objective truly cares about. These considerations set the stage for the final discussion, where we connect the reduced-form objects in our theory to concrete platform mechanisms such as refund holds, delayed allocation, and random checks, and to the policy questions they raise.
Our model treats “auditing” as an abstract enforcement primitive: with probability λ we switch from the learner-chosen mechanism to an audit mechanism πaud that produces an expected incremental loss of at least β(ε) for deviations larger than ε. On platforms, however, enforcement rarely looks like a literal re-running of the transaction under a separate rule. Instead, it is implemented through operational controls—refund holds, delayed allocation, and random checks—that simultaneously (i) create a verifiable signal linked to the underlying type, and (ii) impose a cost on misreporting. The practical value of the framework is that it lets us translate these controls into the same two accounting objects: a deterrence term (an effective penalty gap) and a distortion term (how much auditing disrupts allocation and participation).
A widely used enforcement tool is not an irreversible fine, but a temporary hold: the platform withholds a payment, blocks a withdrawal, freezes a listing, or delays a rebate until some verification clears. This is economically close to a deposit mechanism. The key observation is that a hold can generate a large β(ε) even when the platform is reluctant (legally or reputationally) to levy harsh punishments. If a misreport increases the probability that funds are held (or held longer), then the expected cost to the agent can be substantial once we account for discounting and liquidity constraints.
In our notation, this corresponds to a penalty that is not a one-shot
subtraction P ⋅ 1{|b − z| > τ},
but a stream of losses: a hold amount H applied with some probability and
a release time that depends on subsequent evidence. Reduced-form, one
can write an equivalent per-round expected penalty in utility units
as
𝔼[pen(b, z) ∣ θ] ≈ Pr (hold ∣ b, θ) ⋅ κ(H, duration),
where κ(⋅) converts the hold
into an effective utility cost given the agent’s time preferences and
liquidity. The platform designer’s advantage is that the effective cost κ can be large even when the permanent monetary penalty is small. This helps with participation constraints: a
truthful agent may face an occasional hold (a false positive),
but—conditional on providing documentation or waiting out the hold—the
expected permanent loss can be near zero. In terms of our earlier
objects, deposit-like enforcement can keep β(ε) high while reducing
the IR burden relative to raising P.
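As a reduced-form illustration (every functional form here is an assumption, not the paper’s model), the sketch below converts a temporary hold into an expected per-round penalty by combining a flag probability, a liquidity cost of delayed funds, and a refund probability.

```python
def expected_hold_penalty(p_hold, H, delay_rounds, per_round_discount, refund_prob):
    """Reduced-form expected cost of a temporary hold (illustrative assumptions).

    p_hold:             probability the signal triggers a hold given (b, theta)
    H:                  amount withheld
    delay_rounds:       rounds until resolution
    per_round_discount: agent's per-round discount factor (liquidity proxy)
    refund_prob:        probability the hold is released after review
    """
    liquidity_cost = H * (1.0 - per_round_discount ** delay_rounds)  # cost of waiting
    permanent_loss = H * (1.0 - refund_prob)                         # cost if not refunded
    return p_hold * (liquidity_cost + permanent_loss)

# Hypothetical numbers: a truthful agent is flagged rarely and almost always
# refunded; a large misreport is flagged often and rarely refunded.
truthful = expected_hold_penalty(p_hold=0.05, H=1.0, delay_rounds=5,
                                 per_round_discount=0.99, refund_prob=0.98)
deviator = expected_hold_penalty(p_hold=0.60, H=1.0, delay_rounds=5,
                                 per_round_discount=0.99, refund_prob=0.10)
print(f"truthful burden ≈ {truthful:.4f}, deviation cost ≈ {deviator:.4f}, "
      f"effective gap ≈ {deviator - truthful:.4f}")
```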
The limitation is that holds impose real costs that our simple ui(θ, s) ∈ [0, 1] normalization hides: liquidity, risk, and administrative burden. Moreover, holds create a new distributional concern: agents with fewer outside financing options bear a larger welfare loss for the same nominal hold. This immediately makes enforcement policy-relevant, because “equal treatment” in nominal penalties is not equal treatment in economic incidence.
A second common mechanism is to delay the final allocation or reward until some ex post outcome is observed: for example, pay sellers only after delivery confirmation, release gig payments after customer validation, or allocate prominent search ranking only after quality signals arrive. Economically, delay serves two roles. First, it creates a verification signal correlated with the hidden type (quality, effort, cost): zi, t need not be a noisy measurement of θi, t directly; it can be an outcome proxy (returns, disputes, churn) that becomes informative only with time. Second, delay changes incentives even without explicit penalties: if misreporting yields short-run gains but increases the probability of later reversal, chargeback, or de-ranking, then the expected net gain from deviating falls.
In our framework, these policies enlarge the designer’s menu for achieving a positive effective gap β(ε). Instead of relying on a direct statistical test of |b − z|, the platform can engineer z by choosing what to reveal and when. One can think of “delayed allocation” as moving some of the stage utility into the future, where it is conditioned on realized evidence. The mechanism remains online, but the verification becomes endogenous to the platform’s timing choices.
The main tradeoff is immediate: delay is itself a distortion. It can reduce matching efficiency, slow market clearing, and harm user experience. Thus it plays the same role as λ in the regret decomposition: it is a lever that improves incentives but can worsen the objective in the short run. In high-velocity environments (ad auctions, real-time pricing), even small delays can be infeasible, so the relevant analog is “delayed settlement” or “post-auction adjustments” rather than delayed allocation per se.
The third operational primitive is the simplest: random checks. Platforms often implement enforcement as a fixed-rate sampling rule: “x% of transactions are reviewed,” “x% of accounts are verified,” or “x% of claims require documentation.” This corresponds most directly to our constant λ mixing, and it has a normative advantage: transparency and predictability. If agents can anticipate the enforcement environment, then the platform can credibly claim due process and reduce perceptions of arbitrary punishment.
From an incentive perspective, random checks are attractive because they are hard to game. They do not create an obvious targeting rule that agents can manipulate by changing their “suspicion profile.” In our language, random auditing makes the audit lottery stable almost by construction. The economics, however, is that uniform randomness is rarely cost-effective: it expends auditing capacity on low-risk events. This is why many platforms migrate to risk-based systems. The challenge is to preserve the anti-manipulation properties of uniform randomness while improving cost-effectiveness—precisely the point where stability-style restrictions on the targeting policy become valuable.
Once we interpret πaud as an operational enforcement system, several policy constraints become central rather than peripheral.
First is the handling of false positives. If penalties (including holds and de-rankings) are triggered by noisy signals, then false positives are unavoidable. A platform can respond by lowering P, raising thresholds τ, or reducing λ, but these moves weaken deterrence unless they are offset by improved verifiability (higher β(ε)). An alternative response—common in practice—is procedural: appeals, second-stage checks, and human review. In our reduced-form language, these procedures change both β(ε) and the truthful-agent cost by reshaping the distribution of net penalties conditional on truthful behavior.
Second is proportionality. Regulators and courts often evaluate enforcement not only by error rates but also by penalty severity relative to the harm. Our model makes explicit that P enters deterrence linearly, but proportionality may cap feasible P (or restrict the form of pen(⋅)). This pushes platforms toward enforcement mechanisms that are “large enough in expectation” to deter manipulation while remaining proportional in any single realization—again pointing to deposits and staged penalties rather than large irreversible fines.
Third is privacy and transparency. Auditing requires collecting and processing data; verification signals zi, t often include sensitive information. Paradoxically, a platform may need to be transparent about the existence and intensity of audits to sustain deterrence, while being constrained in what it can reveal about targeting and evidence. Our framework suggests a practical compromise: publicly commit to a coarse audit lottery (a baseline λ and broad categories), while keeping detailed detection features private, and ensure that whatever targeting remains is stable enough to limit strategic gaming.
A recurring operational lesson is that “audit more” is usually the least efficient knob. In the regret bound, audits harm the objective directly through mixing, whereas better verification improves incentives without requiring as much distortion. Interpreted literally, investing in verification systems—better documentation pipelines, stronger identity checks, cryptographic receipts, better anomaly detection, or improved measurement that reduces noise scale σ—is equivalent to increasing β(ε) and thereby reducing the minimum λ needed for incentive compatibility.
This framing also clarifies why platforms often bundle enforcement with product changes. For example, requiring standardized invoices or escrowed shipment tracking is not merely an anti-fraud policy; it is a way to raise the effective penalty gap for misreporting by making audits more informative. In our terms, the platform is engineering a larger ρ(ε) (detection advantage) and thus a larger β(ε) for the same penalty schedule.
Several modeling gaps become salient once we take platform details seriously.
In many environments, the audit signal is not exogenous noise around θ, but an outcome influenced by the agent (or by both sides of a transaction). Returns, ratings, and disputes can be gamed. This turns the audit stage into an additional strategic interaction, and β(ε) becomes equilibrium-dependent. A robust theory would treat auditing as a mechanism with its own incentive constraints, rather than as a fixed penalty gap.
Our stability condition bounds the effect of a single report change on the learner’s output. On platforms, manipulation may be coordinated: fake accounts, rings, or coalitions that submit correlated reports. Extending stability and the incentive argument to group deviations would require group-privacy-style bounds and a corresponding strengthening of enforcement. Practically, this points to audit designs that target network structure (shared payment instruments, shared devices) rather than only pointwise report inconsistencies.
Many platforms do not observe a full-information objective Gt(θt, st) each round; they observe noisy proxies or bandit feedback (clicks, conversions). This complicates both learning and incentive control. The platform’s enforcement policy then affects not only agent incentives but also the statistical quality of feedback, potentially biasing the learner. A unified analysis would couple bandit regret with incentive constraints and would treat audits as both deterrence and exploration.
Targeted audits and holds can disproportionately burden certain groups if verification error rates differ. In our reduced form, this means σ (or β(ε)) varies systematically across subpopulations, and a uniform enforcement rule can generate unequal false-positive rates. Addressing this requires adding explicit fairness constraints to the enforcement optimization (constraints on false-positive incidence, hold duration, or appeal success), and then re-solving the incentive-feasibility condition under those constraints.
Taken together, refund holds, delayed allocation, and random checks are not ad hoc “trust and safety” patches; they are implementations of a common economic idea: create an expected loss from large deviations that is hard to manipulate and acceptable under participation and policy constraints. Our framework is intentionally reduced-form, but it clarifies what practitioners already grapple with: enforcement is most effective when it improves the of verification (raising β(ε)) rather than merely increasing audit frequency λ, and it is most sustainable when it is procedurally legitimate (controlling false positives, enabling appeal) rather than maximally punitive. The open technical agenda is to endogenize the audit signal, incorporate partial-feedback learning, and formalize fairness and due-process constraints—so that the same stability-plus-audits logic can be carried from a stylized repeated game into the operational realities that platforms and regulators actually face.