Digital platforms increasingly make allocation, pricing, and ranking decisions by learning from past interactions. A marketplace tunes reserve prices from realized bids; a gig platform adjusts surge multipliers from observed acceptance rates; an ad exchange updates relevance and pacing rules from click and conversion feedback. In each case, the platform’s decision rule is not fixed ex ante but evolves as an online algorithm processes a growing history of reports and outcomes. This adaptivity is operationally compelling, yet it reopens a foundational question in mechanism design: when agents understand that today’s report shapes tomorrow’s mechanism, what prevents strategic manipulation of the learning process?
The economic logic of such manipulation is straightforward. In a static mechanism, a deviation is evaluated round-by-round: an agent compares the immediate utility from truthful reporting to the immediate utility from misreporting. Under learning, that comparison becomes intertemporal. A seller might shade bids today to make the platform infer lower demand and reduce future fees; a bidder may occasionally “sacrifice” surplus to push a reserve price downward; a contractor may misstate availability to influence future matching rules. These behaviors are not pathological edge cases. They are precisely the kind of forward-looking distortions one expects when (i) the mechanism depends on past reports and (ii) agents repeatedly interact with the same platform and discount the future slowly enough that influencing the future is valuable.
A common reaction is to seek a clean separation: let the platform learn, but ensure the learner is “stable” so that any single report has limited impact on future decisions. This is more than an engineering heuristic; it is an economic commitment device. If a learner is sufficiently insensitive to any one agent’s report, then the value of manipulating the future is bounded. Recent work has formalized this idea using stability notions that are equivalent, or closely related, to differential privacy and max-divergence bounds. In our setting, we interpret such a bound as a quantitative cap on the gain an agent can obtain by changing one report while holding everything else fixed. Stability is therefore a natural first pillar: it makes “teaching the algorithm” costly in expectation because the algorithm refuses to learn too much from any one individual interaction.
However, stability alone is not a full incentive solution. Even if future influence is limited, an agent may still benefit from misreporting for immediate reasons (e.g., to receive a better allocation today), and the platform typically cannot observe true types to enforce truthfulness directly. In classical repeated games, one might propose punishments triggered by detected deviations; in repeated mechanism design, one might attempt to embed a commitment gap—a built-in rule that makes any misreport strictly worse than truth-telling by a fixed margin. This idea is powerful in discrete-type environments because one can often guarantee a uniform gap: if a report differs from the truth, the mechanism can impose (or threaten) a fixed expected loss relative to truthful play.
The difficulty is that many economically relevant type spaces are continuous. Values, qualities, delivery costs, and risk preferences are more naturally modeled as elements of Θ = [0, 1]d than as a small discrete set. And in continuous spaces, the very notion of a uniform penalty gap becomes fragile. If an agent can misreport by an arbitrarily small amount, then any penalty rule that is even mildly regular (or that respects limited observability) tends to generate vanishing differences between truthful and near-truthful reports. Put differently, the “distance” between two reports can be made so small that any feasible enforcement signal has negligible power to distinguish them. The familiar discrete-type logic—“any deviation triggers a detectable inconsistency”—breaks because there is no minimal deviation size. In practical terms, a platform cannot credibly commit to punishing every infinitesimal discrepancy between a reported value and an unobserved true value, especially when verification is noisy and costly.
This is where we believe a key design primitive for 2026 is neither pure commitment nor pure stability, but auditing. Audits formalize an increasingly common institutional reality: platforms and regulators can sometimes verify, ex post and imperfectly, whether a report was materially accurate. Marketplaces cross-check shipping times and cancellation reasons; labor platforms verify location and time-on-task; financial intermediaries verify identity and transaction provenance; advertising platforms conduct measurement audits and fraud checks. Crucially, these verifications are rarely perfect—signals are noisy, delayed, and partial—yet they can still support meaningful deterrence if used probabilistically and coupled with penalties that scale with detected discrepancies.
Our perspective is that audits resolve the continuous-type obstacle by shifting the goal from exact truthfulness to approximate, tolerance-based truthfulness. Instead of attempting to penalize every deviation, we select a tolerance level ε > 0 and aim to deter deviations of magnitude at least ε. Economically, ε captures the platform’s enforcement resolution: the smallest distortion that is both meaningful for outcomes and detectable given the available verification technology. Deviations smaller than ε are treated as within an acceptable margin of error—either because they are hard to detect, because punishing them would generate excessive false positives, or because the induced welfare loss is second-order. This reframes incentives in a way that is both theoretically tractable and institutionally plausible.
Audits also complement stability in a precise way. Stability limits how much an agent can gain by manipulating the learner’s behavior, but it does not impose an immediate expected cost on misreporting. Auditing does: when an audit occurs, the platform draws a verification signal (for example, z = θ + ξ) and applies a penalty based on the discrepancy between the report b and the signal z. If the penalty rule is chosen so that any deviation ∥b − θ∥ ≥ ε increases expected penalties by at least a gap β(ε) > 0, then misreporting beyond tolerance carries a quantifiable expected loss. When audits occur with probability λ, the expected enforcement intensity is λβ(ε). The central economic tradeoff is then transparent: we only need enough audit intensity to dominate the bounded dynamic gain implied by stability and the agent’s effective planning horizon.
This design logic helps explain why audits have become so salient in practice. First, they are modular: a platform can run its preferred allocation rule most of the time, while occasionally switching to an audit procedure without fully redesigning the system. Second, they are scalable: audits can be randomized and targeted, focusing limited verification resources where incentives are strongest. Third, they match the informational constraints of modern markets: platforms often cannot observe types directly, but they can often obtain noisy proxies (device telemetry, transaction records, third-party attestations) that are informative enough to distinguish gross misreports from honest noise. Finally, audits are legible to policy. Regulators increasingly require “reasonable” monitoring and enforcement (e.g., against fraud, misclassification, or manipulation) rather than demanding impossible perfect observability. A randomized audit scheme with explicit tolerances and penalty schedules is an implementable compliance object.
At the same time, we do not want to oversell audits as a free lunch. Audits are costly—not only operationally but also in terms of user experience and fairness. Noisy verification generates false positives: truthful agents may be penalized due to measurement error, and this can violate individual rationality or deter participation. Penalties may be legally capped or reputationally constrained. Moreover, audits can interact with privacy concerns: collecting verification signals may itself be sensitive, and the platform may need to balance enforcement against confidentiality. These constraints motivate our emphasis on probabilistic auditing (audits happen with small probability λ) and on tolerance (we only target deviations above ε), both of which reduce the frequency and severity of enforcement while still deterring economically meaningful manipulation.
Conceptually, our model illuminates a three-way frontier between (i) learnability and performance, (ii) incentive protection against dynamic manipulation, and (iii) enforcement costs (including false positives). Stability of the learning rule improves incentives but can slow adaptation; stronger auditing improves incentives but directly reduces welfare when an audit mechanism overrides the learner’s preferred choice or when truthful agents are mistakenly penalized; larger tolerance ε reduces false positives and makes detection easier, but permits more slack in reporting and can degrade allocations. The economic problem is therefore not to achieve an idealized exact incentive constraint, but to choose (λ, ε) (and audit design parameters) to achieve the best attainable performance subject to plausible enforcement.
A final motivation for our approach is methodological. Continuous types and adaptive mechanisms are now the default in many applications, yet much of the incentive analysis still relies on discrete approximations or on equilibrium concepts that do not directly capture “teaching the algorithm.” By explicitly combining a stability bound (interpretable as a limit on strategic influence over the learner’s output distribution) with an audit-induced effective penalty gap for deviations above ε, we obtain a simple and portable recipe: keep the learner η-stable, audit with probability λ, and choose these parameters so that the expected audit loss λβ(ε) dominates the bounded dynamic gain from manipulation. This recipe yields incentive guarantees that are robust to the realized sequence of types and objectives and that coexist with standard online learning regret analyses, thereby connecting strategic behavior to algorithmic performance in a single quantitative framework.
The remainder of the paper develops this logic formally and studies its implications for audit design and online regret. We then place our results in context. The next section discusses related work on strategic online pricing and auctions, differential privacy as a tool for incentives and stability, the weak-DP perspective on repeated interactions, and the extensive literature on verification and auditing in economic design.
Our setting sits at the intersection of online learning, dynamic mechanism design, and verification. Because each of these literatures uses different primitives and benchmarks, it is useful to be explicit about what we borrow and what we do not. Broadly, we see four strands as most closely connected: (i) strategic behavior in online pricing and auctions, (ii) stability–especially differential privacy (DP)–as an incentive tool for adaptive algorithms, (iii) the “weak DP” sequence perspective developed by Huh–Kandasamy and related papers on repeated interactions with long-sighted agents, and (iv) auditing and verification mechanisms in economic design. A fifth, more technical strand comes from stability theory in online convex optimization (OCO), which supplies convenient sufficient conditions for the stability bounds we assume.
A large literature studies online decision-making for pricing and auctions when the platform learns unknown demand or value distributions; canonical examples include dynamic pricing with bandit feedback and reserve-price learning in repeated auctions. Much of this work takes agents as myopic or non-strategic with respect to the learning rule: buyers respond to posted prices, bidders submit bids, and the platform optimizes regret against a static benchmark (e.g., the best fixed price or reserve in hindsight). These models capture an important operational reality—platforms do learn from data—but they treat the data-generating process as exogenous.
A more recent literature makes the endogeneity explicit: agents understand that their reports and actions affect the platform’s future decisions, and therefore may “teach” the algorithm. This concern appears in repeated auctions where bidders shade bids to reduce future reserve prices, in dynamic pricing where customers strategically delay purchases to influence future prices, and in broader settings where participants manipulate feedback signals (ratings, clicks, conversions) to steer learning. Across these domains, a recurring lesson is that standard regret guarantees can be misleading in strategic environments: an algorithm that is optimal against an exogenous sequence can perform poorly once the sequence is chosen by forward-looking agents. One response is to restrict the agent side (e.g., bounded rationality, limited foresight, or no commitment to intertemporal strategies). Another is to seek algorithmic properties that make manipulation unprofitable, which leads directly to stability.
Our contribution is not to propose a new auction format or a bespoke dynamic pricing rule. Instead, we provide a modular deterrence principle that can wrap around an arbitrary online learner over a mechanism class Π. This is closer in spirit to the platform problem faced in practice: teams often have a preferred learning stack (Hedge over discretized rules, online gradient methods, contextual bandits, etc.), and the design question is how to add enforceable guardrails against strategic distortion without rebuilding the entire system.
The idea that algorithmic stability can function as an incentive device has been developed along several lines. In privacy-aware mechanism design, differential privacy and related max-divergence notions are used to ensure that no single agent’s report substantially changes the distribution of outcomes, which implies approximate truthfulness when utilities are bounded. This insight goes back to early connections between DP and approximate dominant-strategy incentive compatibility in static settings, and it has been refined in many subsequent works (including variants that handle payments, social welfare objectives, and approximate equilibrium concepts).
In the online setting, the relevant issue is not only how a report affects the current outcome, but also how it affects the learner’s future behavior. Here the DP/stability lens becomes particularly attractive: a per-round max-divergence bound on the learner’s mapping from history to a distribution over mechanisms provides an explicit handle on the magnitude of dynamic influence. Our use of weak, per-round stability (rather than full transcript DP) reflects the economics of the problem: what matters for incentives is the agent’s ability to steer the platform’s future choice of mechanism, and bounding that influence round-by-round is enough to control the continuation-value gain from a single deviation. At the same time, we emphasize a limitation that is sometimes glossed over: stability alone is typically insufficient to prevent deviations that are immediately profitable within a round. This is precisely why our model introduces an additional enforcement primitive (auditing) rather than relying on stability as a complete substitute for incentive constraints.
Our incentive analysis is most directly inspired by the framework of Huh and Kandasamy and subsequent work on repeated interactions with long-sighted agents. The key conceptual move in that line is to formalize how far ahead agents effectively plan, via a long-sightedness parameter α, and to connect α to stability properties of the platform’s learning rule. Roughly, if a single report can change the distribution over future actions only by a multiplicative factor eη each round (a max-divergence/η-stability condition), then the total strategic benefit from manipulating the learner across the effective horizon can be upper bounded by a term of order ηα (up to constants and normalization).
We adopt this perspective because it provides a clean separation between two objects that are otherwise hard to disentangle: the platform’s adaptivity (captured by the learner and its stability parameter η) and the agent’s forward-looking capacity (captured by α). Importantly, this approach does not require us to specify a particular parametric belief model for agents or to assume stationarity in types; it is compatible with ex-post incentive statements that hold for realized sequences. Where we depart from the baseline weak-DP sequence story is in how we generate a deterrent to offset the bounded continuation gain. In discrete-type models one can sometimes posit an exogenous commitment gap: any non-truthful report triggers a fixed expected penalty difference. In continuous types, such a uniform gap typically fails without additional structure. Our audit mechanism is precisely the missing ingredient: it endogenizes a positive “effective penalty gap” β(ε) for deviations larger than a tolerance ε, thereby restoring the gap logic in a continuous environment.
Verification has a long history in economics, from costly state verification and auditing in contract theory to mechanism design with evidence, inspection, and costly monitoring. In many applications, the designer cannot observe types directly but can sometimes obtain informative signals after the fact (e.g., through accounting audits, document checks, third-party attestations, or forensic analysis). These models clarify the basic tradeoff: stronger verification improves incentives but consumes resources and can harm honest agents when verification is noisy or imperfect.
A separate but related literature studies how to elicit truthful information when direct verification is unavailable, using peer prediction, scoring rules, and correlated signals. Those approaches are powerful when one can leverage cross-agent correlations or repeated tasks, but they often require strong assumptions about common priors, conditional independence, or the availability of multiple reports about the same underlying event. In contrast, our audit primitive is deliberately minimal: we assume the platform can sometimes draw a noisy signal z = θ + ξ about an agent’s true type (or a proxy for it), and can commit to penalties based on the discrepancy between b and z. This is meant to capture institutional features that are increasingly common in digital markets: platforms can occasionally verify delivery times, locations, identity documents, or transaction provenance, but only imperfectly and not at scale every round.
Our focus on a tolerance level ε connects directly to the verification literature’s concern with false positives. If verification is noisy, then punishing every deviation from z would penalize honest agents with nontrivial probability. The tolerance/threshold approach is a practical mitigation: only “large enough” discrepancies are treated as evidence of misreporting, which reduces false positives and makes enforcement legible. This framing also aligns with policy practice, where compliance regimes often specify materiality thresholds and probabilistic audits rather than demanding exact agreement with a latent ground truth.
On the algorithmic side, we draw on results linking OCO procedures to stability. Many online learners commonly used in platforms—regularized follow-the-leader, mirror descent, Hedge over discretizations, and related methods—admit sensitivity bounds showing that changing one data point (or one round’s loss vector) only slightly perturbs the chosen action distribution. In privacy theory, such sensitivity is often turned into DP guarantees via noise addition or exponential mechanisms; in learning theory it yields generalization; and in online learning it can be interpreted as robustness. We do not re-derive these results, but we rely on them as a justification that our stability parameter η is not an abstract assumption: it can be achieved by standard algorithms, sometimes with mild randomization, and it can be tuned via learning rates and regularization strengths.
The economic reason we care about this connection is that it clarifies the cost of incentive protection. Imposing stronger stability (smaller η) typically slows down adaptation and worsens regret, echoing the general privacy–accuracy and robustness–adaptivity tradeoffs. Our regret decomposition makes this explicit: we pay a learning term (e.g., $(\log|\Pi|)\sqrt{\alpha T}$ under Hedge) and an enforcement term that scales with 1/β(ε) through the required audit rate. This allows us to talk about incentive constraints and algorithmic performance in a single quantitative language, rather than treating truthfulness as a hard constraint divorced from learning.
Relative to existing strategic online learning and repeated auction models, our approach is intentionally “wrapper-like”: we ask for (i) a stability bound on the learner and (ii) an audit mechanism that yields an effective penalty gap above tolerance. This yields a simple implementable condition, λβ(ε) ≥ 4ηα, under which truthful reporting is an ε-Nash equilibrium (ex-post), and it delivers a regret bound that makes the audit–approximation frontier explicit. The point is not that stability and audits are the only route to incentive compatibility; rather, they are a tractable and policy-relevant route when types are continuous, verification is imperfect, and the platform insists on using adaptive learning.
At the same time, the framework abstracts away from several important issues studied elsewhere. We take the audit signal structure as given rather than deriving it from an underlying measurement or investigation process; we summarize agent forward-looking behavior via α rather than modeling full beliefs and strategic sophistication; and we treat penalties in reduced form, ignoring legal and behavioral constraints beyond individual rationality. These choices are deliberate: the goal is a portable economic logic that can be plugged into different application domains. We return to several of these extensions (targeted audits, heterogeneous noise, and IR constraints with false positives) after formalizing the baseline model.
The next section specifies the repeated interaction, the continuous type environment, our benchmark for performance (objective regret against the best fixed mechanism in Π), and the long-sightedness model captured by α.
We model a platform (or planner) that repeatedly selects an economic mechanism while facing agents who understand that today’s reports can affect tomorrow’s policy. The central difficulty is that the data used for learning are endogenous: the platform updates from reported types, but those reports come from forward-looking agents. Our goal in this section is to make the repeated interaction explicit, to define a performance benchmark that remains meaningful under adaptivity, and to formalize how we parameterize agents’ forward-looking behavior through a long-sightedness bound α.
Fix a horizon T ∈ ℕ and a
set of agents indexed by i ∈ [n]. In each round
t ∈ [T], agent i (if present) has a privately
observed type
θi, t ∈ Θ = [0, 1]d,
where d captures the
dimensionality of the information relevant for the mechanism (e.g., a
value and a budget; a location and a time window; a vector of quality
attributes). We impose no stationarity assumption on (θi, t)t ≤ T:
types may drift, jump, or be adversarially selected. This is deliberate.
In many applications, what the platform observes is a nonstationary
environment (seasonality, changing composition of users, strategic
adaptation), and we want statements that are valid for the realized
sequence.
To allow varying participation, let Ti ⊆ [T] denote the set of rounds in which agent i is active. When i is inactive at time t ∉ Ti, we treat the report as a null symbol bi, t = ⊥ and the mechanism ignores it.
In each active round, agent i submits a report bi, t ∈ Θ. Let bt = (bi, t)i ∈ [n] be the report profile at time t. An outcome st is then realized in an outcome space S. We keep S abstract: it may include allocations, matchings, posted prices, payments, service levels, or other operational decisions.
Agent i’s stage utility
depends on her type and the realized outcome:
ui(θi, t, st) ∈ [0, 1].
The [0, 1] normalization is without
loss of generality for our incentive analysis; it pins down the scale at
which stability and enforcement trade off. (Equivalently, one may start
with bounded utilities and rescale.)
The platform’s per-round objective is given by a function
Gt(θt, st) ∈ [−1, 1], θt = (θi, t)i ∈ [n].
We allow Gt to vary with
t to accommodate time-varying
planner goals (e.g., shifting weight between revenue and quality of
service) and time-varying feasibility constraints. Examples include
revenue, social welfare, a congestion or matching-quality score, or a
composite metric used in practice for policy evaluation.
A single-round mechanism is a (possibly randomized) mapping
π : (Θ ∪ {⊥})n → Δ(S),
which takes the report profile and returns a distribution over outcomes.
We focus on a finite (or discretized) mechanism class Π, which should be interpreted as
the platform’s design space: for instance, a grid of reserve prices, a
menu of scoring rules, a finite family of allocation rules, or a
discretized set of parameters for a differentiable mechanism. The
finiteness is what permits standard online learning guarantees; any
approximation cost from discretization will later appear as a separate
term.
An online mechanism (or platform policy) M
is the repeated process that chooses, at each round t, a distribution qt ∈ Δ(Π)
based on the public history and then draws a mechanism πt ∼ qt
to execute. The realized outcome is then sampled as
st ∼ πt(bt).
We assume the platform can commit to its randomized choice rule (i.e.,
the distribution qt and the
randomness used to draw πt are part of
the publicly understood protocol). This commitment assumption reflects
how platform policies are often implemented: the mapping from logged
data to the next policy is encoded in a deployed learning system, and
agents can (at least approximately) infer the platform’s update rule
over time.
Let ht − 1 denote the public history up to time t − 1 (past reports, realized outcomes, and any publicly revealed evaluation signals). The learner is a mapping ht − 1 ↦ qt(⋅ ∣ ht − 1). In the results that follow we will impose a weak, per-round stability requirement on this mapping (a max-divergence/η condition), but for now the point is simply that the platform’s mechanism choice is adaptive: qt depends on earlier reports and earlier realized objectives.
Each agent can condition her report on her current type and the
observed history. Formally, a (behavioral) strategy for agent i is a sequence of mappings
σi, t : Θ × ℋt − 1 → Δ(Θ),
where ℋt − 1 is the
history space. Truthful reporting corresponds to σi, t(θi, t, ht − 1) = δθi, t.
Because types are continuous and the platform policy is adaptive, exact incentive compatibility is typically too strong: small perturbations of a report may have only second-order effects, and any realistic enforcement/verification system will have limited resolution. We therefore work with an approximate, ε-tolerant notion: we will aim to make truthful reporting an ε-Nash equilibrium in the repeated interaction, where deviations within an ε-ball are treated as economically negligible. The role of ε is thus twofold: it is a tolerance level for incentives, and (later) a knob that trades off enforcement intensity against accuracy.
To capture forward-looking behavior while remaining agnostic about
beliefs and sophistication, we summarize how much agents value the
future through discount factors γi(t) ≥ 0.
Agent i’s total discounted
utility under an online mechanism M and a strategy profile σ is
Ui(σ; M) = 𝔼 [∑t ∈ Tiγi(t) ui(θi, t, st)],
with the expectation over the platform’s randomization, the mechanism’s
randomization, and any randomness in strategies. (We will later subtract
expected audit penalties; for now, the baseline intertemporal utility
highlights the channel through which an agent might benefit from
manipulating the learning dynamics.)
The key quantity for our incentive bounds is not a particular
functional form of discounting, but an upper bound on the weight of the future relative to the present from any
period. Following the weak-DP sequence perspective, we define the
long-sightedness of agent i at
time t as
$$
\alpha_{i}(t)\;:=\;\frac{1}{\gamma_i(t)}\sum_{\substack{t'\in T_i\\
t'>t}}\gamma_i(t'),
$$
and we assume a uniform bound
α ≥ maxi ∈ [n]maxt ∈ Tiαi(t).
Intuitively, α measures how
many “current-round utility units” the agent can still gain or lose in
the future relative to the present. Because per-round utilities lie in
[0, 1], the continuation value from
time t onward is at most 1 + α in our normalization, and the
continuation value beyond the current round is at most α.
This abstraction nests familiar special cases. Under geometric discounting with γi(t) = δt − 1 and full participation, we have α = δ/(1 − δ), so α blows up as δ → 1 (very patient agents). Under a hard planning window of length K (agents care only about the next K active rounds), one obtains α ≤ K. In practice, α can be interpreted as a reduced-form proxy for institutional or behavioral frictions that limit long-run manipulation: limited attention, bounded tenure on the platform, or uncertainty about future participation.
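To make the α abstraction concrete, the following sketch (illustrative only; it assumes full participation and a fixed discount-weight sequence) computes the long-sightedness bound from discount weights and checks the geometric-discounting case against δ/(1 − δ).

```python
# Illustrative sketch (not part of the formal model): computing the
# long-sightedness bound alpha from a discount-weight sequence, assuming
# full participation (T_i = [T]).
import numpy as np

def long_sightedness_bound(gamma):
    """max_t alpha_i(t), where alpha_i(t) = (1/gamma[t]) * sum_{t' > t} gamma[t']."""
    gamma = np.asarray(gamma, dtype=float)
    suffix = np.cumsum(gamma[::-1])[::-1]      # suffix[t] = sum_{t' >= t} gamma[t']
    return np.max((suffix - gamma) / gamma)

T, delta = 2000, 0.95
alpha_geom = long_sightedness_bound(delta ** np.arange(T))
print(f"geometric discounting: alpha = {alpha_geom:.3f}, delta/(1-delta) = {delta/(1-delta):.3f}")
```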
The platform cares about cumulative objective value
$$
\mathrm{Alg}(M)\;:=\;\mathbb E\!\left[\sum_{t=1}^T
G_t(\theta_t,s_t)\right].
$$
To evaluate an adaptive policy, we compare it to a static benchmark: the
best fixed mechanism in Π
chosen in hindsight for the realized sequence. Concretely, if we were to
run a fixed π ∈ Π in
every round (with agents reporting truthfully), the resulting value
would be
$$
\mathrm{Alg}(\pi)\;:=\;\mathbb E\!\left[\sum_{t=1}^T
G_t(\theta_t,s_t)\,\middle|\, s_t\sim \pi(\theta_t)\right].
$$
We then define the (truthful-play) regret as
RegT(M) := maxπ ∈ ΠAlg(π) − Alg(M).
This benchmark is intentionally conservative: it does not ask the
platform to compete with the best dynamic policy or the best mechanism tailored
to each t, which would
conflate learning with clairvoyance. Instead, it isolates the cost of
learning within the chosen design space Π.
Two remarks are important for interpretation. First, the definition conditions on truthful reports for the benchmark mechanisms, because the planner’s goal is to perform well when the system is not strategically distorted. In other words, we aim to design M so that truthful reporting is (approximately) an equilibrium, and then we evaluate the resulting objective relative to the best fixed truthful mechanism in Π. Second, the discretization embedded in Π is a modeling choice: it reflects engineering constraints (a finite catalog of policies that can be deployed, tested, and audited) as much as mathematical convenience. Any loss from restricting to Π—relative to an idealized continuous family—will later be accounted for separately as an approximation/discretization term.
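As a small illustration of the benchmark (a sketch under assumptions: the matrix G of realized per-round objective values for each candidate mechanism under truthful play is taken as given, and the platform’s mechanism draws are synthetic stand-ins), truthful-play regret against the best fixed mechanism in Π can be computed as follows.

```python
# Illustrative computation of Reg_T(M) = max_pi Alg(pi) - Alg(M) from logged
# per-round objective values; all data here are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
T, num_mechanisms = 1000, 8
G = rng.uniform(-1.0, 1.0, size=(T, num_mechanisms))     # G[t, j]: objective of mechanism j in round t
platform_choice = rng.integers(num_mechanisms, size=T)   # stand-in for the learner's draws pi_t
alg_value = G[np.arange(T), platform_choice].sum()
best_fixed = G.sum(axis=0).max()
print(f"Alg(M) = {alg_value:.1f}, best fixed = {best_fixed:.1f}, Reg_T = {best_fixed - alg_value:.1f}")
```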
At this stage we have specified the repeated game and the performance yardstick, but we have not yet specified how the platform prevents within-round profitable misreports, especially with continuous types. Stability of the learner will eventually limit how much an agent can profit by steering mechanism choices, but it does not by itself rule out immediate gains from shading a report in the current round. Addressing that missing per-round deterrence is exactly the role of our audit primitive, to which we turn next.
Our learning-based policy is only as good as the data it learns from. When agents are forward-looking, a report is not merely a description of current preferences; it is also an instrument for shaping future mechanisms. Per-round learner stability (which we study next) limits how much an agent can profit by steering the distribution over mechanisms, but it does not by itself rule out profitable shading. With continuous types, this within-round issue is particularly stark: without some additional enforcement, an agent can typically move her report by an arbitrarily small amount in a direction that (locally) improves her stage payoff, and we cannot hope to police every infinitesimal perturbation.
We therefore introduce an audit primitive: an occasional, publicly committed verification-and-penalty procedure that makes sufficiently large misreports costly in expectation. The key modeling choice is that verification is noisy, so the audit cannot perfectly distinguish a small lie from measurement error. This naturally leads to an ε-tolerance notion: we aim to deter deviations larger than ε, while treating smaller deviations as economically negligible. The role of this section is to formalize audits and to define a single summary statistic, β(ε), that measures how strongly audits deter ε-scale deviations. This scalar will later plug directly into the incentive condition λ β(ε) ≥ 4ηα.
Fix an agent i and a round
t in which she participates.
If an audit occurs, the platform observes a verification signal
zi, t = θi, t + ξi, t,
where ξi, t
is exogenous noise (measurement error, privacy noise, or an imperfect
proxy for the latent type). We take the audit signal model as a
primitive: the distribution of ξi, t
is known to all parties and independent of reports. This captures
settings where types can be partially verified through ex post records
(delivery times, realized clicks, observed quality), through third-party
checks, or through internal instrumentation that is accurate only up to
noise.
The audit mechanism then applies a penalty based on the reported type
and the audit signal:
peni(bi, t, zi, t) ∈ [0, P],
where P is the maximal
punishment expressed in normalized utility units. Importantly, we do not
require the platform to learn from the audit itself; the audit’s purpose
is deterrence. In the repeated protocol, the platform commits publicly
to mixing in the audit mechanism with probability λ each round. This commitment is
operationally natural (e.g., a known auditing schedule or a fixed
inspection lottery) and analytically convenient: it turns a complex
dynamic incentive problem into a comparison between (i) the maximum gain
from manipulating learning and (ii) the expected audit loss from
deviating.
Because Θ = [0, 1]d is continuous, it is neither feasible nor desirable to enforce exact truth-telling at infinite precision. Even with perfect verification, mechanisms are often robust only up to a resolution limit, and any real audit will have finite measurement granularity. We therefore fix a tolerance parameter ε > 0 and interpret it as the smallest economically meaningful deviation. The enforcement goal becomes: make any misreport of magnitude at least ε unprofitable in expectation.
This is an ε-tolerant incentive requirement, aligned with our ε-Nash notion. It also reflects a policy choice: in many applications, we care about preventing material misrepresentation rather than micro-manipulation.
We summarize the deterrence power of an audit design by the smallest incremental expected penalty it induces against ε-scale misreports.
Fix a penalty rule pen(b, z) and an audit
signal model z = θ + ξ. For
ε > 0, define
β(ε) := infθ ∈ Θ infb ∈ Θ: ∥b − θ∥ ≥ ε(𝔼[pen(b, z) ∣ θ] − 𝔼[pen(θ, z) ∣ θ]).
By construction, β(ε) ≥ 0. When β(ε) > 0, any report that is at least ε away from truth incurs an extra expected audit loss of at least β(ε) relative to truthful reporting. In other words, β(ε) is the audit’s marginal deterrence at scale ε.
A convenient way to think about β(ε) is through a
decomposition into (i) a detection advantage and (ii) a penalty
magnitude. Many audits have a binary “fail test” event A(b, z) ∈ {0, 1}
and a penalty of the form pen(b, z) = P ⋅ A(b, z).
Then
β(ε) = P ⋅ ρ(ε), ρ(ε) := infθinf∥b − θ∥ ≥ ε(Pr (A(b, z) = 1 ∣ θ) − Pr (A(θ, z) = 1 ∣ θ)),
where ρ(ε) is the
minimal increase in the probability of failing the audit when deviating
by at least ε. This separation
is useful in practice: ρ(ε) is governed by
measurement quality (noise), while P is governed by institutional
constraints (legal limits, fairness, and individual rationality).
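In design calibration, ρ(ε) (and hence β(ε) = P·ρ(ε)) can be estimated by brute force. The sketch below is an illustrative example, not the paper’s prescribed procedure: it uses a one-dimensional threshold test with Laplace verification noise and a coarse grid over true types and deviating reports; the same approach extends to other noise models and to d > 1.

```python
# Illustrative Monte Carlo estimate of rho(eps) and beta(eps) = P * rho(eps)
# for the binary threshold test A(b, z) = 1{|b - z| > tau}, z = theta + xi.
import numpy as np

rng = np.random.default_rng(1)

def fail_prob(b, theta, tau, sigma, n_mc=10_000):
    """Estimate Pr(A(b, z) = 1 | theta) with xi ~ Laplace(0, sigma)."""
    xi = rng.laplace(0.0, sigma, size=n_mc)
    return np.mean(np.abs(b - (theta + xi)) > tau)

def rho(eps, tau, sigma, grid=np.linspace(0.0, 1.0, 11)):
    worst = np.inf
    for theta in grid:
        p_truth = fail_prob(theta, theta, tau, sigma)
        for b in grid:
            if abs(b - theta) >= eps:
                worst = min(worst, fail_prob(b, theta, tau, sigma) - p_truth)
    return worst

P, tau, sigma, eps = 1.0, 0.05, 0.05, 0.10
print("estimated beta(eps) ~", P * rho(eps, tau, sigma))
```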
A particularly transparent design is a threshold test that penalizes
a report when it is “too far” from the noisy verification:
pen(b, z) = P ⋅ 1{∥b − z∥ > τ},
with threshold τ ≥ 0. This
audit does not attempt to infer θ
from z; it only checks whether
the report is plausible given the measurement noise. The threshold τ controls false positives: larger
τ means truthful agents are
penalized less often, but it also makes it harder to detect small
deviations.
Even in one dimension (d = 1), this simple form already captures a realistic compliance logic (“reports must be consistent with observed records up to tolerance τ”). It also makes the effect of noise transparent, and it yields closed forms for common noise distributions.
Suppose d = 1, ξ ∼ Laplace(0, σ), and
pen(b, z) = P ⋅ 1{|b − z| > τ}.
Let Δ := |b − θ| ≥ 0.
Then
βτ(Δ) = P(Pr (|Δ + ξ| > τ) − Pr (|ξ| > τ)), Pr (|ξ| > τ) = e−τ/σ.
A closed form for Pr (|Δ + ξ| > τ)
is piecewise:
$$
\Pr(|\Delta+\xi|>\tau)=
\begin{cases}
\frac{1}{2}e^{-(\tau-\Delta)/\sigma}+\frac{1}{2}e^{-(\tau+\Delta)/\sigma}
= e^{-\tau/\sigma}\cosh(\Delta/\sigma),
& 0\le \Delta\le \tau,\\[6pt]
1-\frac{1}{2}e^{-(\Delta-\tau)/\sigma}\big(1-e^{-2\tau/\sigma}\big),
& \Delta\ge \tau.
\end{cases}
$$
Two implications are worth highlighting. First, for Δ ≤ τ, the audit’s
detection advantage is second-order at the origin (since cosh (Δ/σ) = 1 + O(Δ2)):
tiny deviations are essentially indistinguishable from noise, which is
precisely why we adopt an ε-tolerance notion. Second, once
Δ > τ, the gap
becomes uniformly positive and increases with Δ, so
β(ε) ≥ βτ(ε) > 0 whenever
ε > τ.
Thus, under Laplace verification, we can choose a threshold τ that protects truthful agents from
frequent false positives, and still obtain a strictly positive
enforcement gap against deviations larger than ε.
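The closed form is easy to sanity-check; the snippet below (illustrative parameter values) compares it to Monte Carlo draws and evaluates βτ(Δ) at several deviation sizes.

```python
# Check of the Laplace closed form for Pr(|Delta + xi| > tau) and the induced
# gap beta_tau(Delta) = P * (Pr(|Delta + xi| > tau) - Pr(|xi| > tau)).
import numpy as np

def fail_prob_laplace(delta, tau, sigma):
    """Closed form for Pr(|Delta + xi| > tau), xi ~ Laplace(0, sigma), Delta >= 0."""
    if delta <= tau:
        return np.exp(-tau / sigma) * np.cosh(delta / sigma)
    return 1.0 - 0.5 * np.exp(-(delta - tau) / sigma) * (1.0 - np.exp(-2.0 * tau / sigma))

rng = np.random.default_rng(2)
P, tau, sigma = 1.0, 0.05, 0.05
for delta in (0.0, 0.03, 0.05, 0.10, 0.20):
    xi = rng.laplace(0.0, sigma, size=200_000)
    mc = np.mean(np.abs(delta + xi) > tau)
    beta = P * (fail_prob_laplace(delta, tau, sigma) - np.exp(-tau / sigma))
    print(f"Delta={delta:.2f}: closed form={fail_prob_laplace(delta, tau, sigma):.4f}, "
          f"Monte Carlo={mc:.4f}, beta_tau(Delta)={beta:.4f}")
```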
If instead ξ ∼ 𝒩(0, σ2) and
we keep the same threshold audit, then
$$
\Pr(|\xi|>\tau)=2\Phi\!\left(-\frac{\tau}{\sigma}\right),
$$
and for Δ ≥ 0,
$$
\Pr(|\Delta+\xi|>\tau)
=
\Pr(\xi>\tau-\Delta)+\Pr(\xi<-\tau-\Delta)
=
\Phi\!\left(-\frac{\tau-\Delta}{\sigma}\right)+\Phi\!\left(-\frac{\tau+\Delta}{\sigma}\right),
$$
where Φ is the standard normal
CDF. Hence
$$
\beta_\tau(\Delta)
=
P\Bigg(
\Phi\!\left(-\frac{\tau-\Delta}{\sigma}\right)+\Phi\!\left(-\frac{\tau+\Delta}{\sigma}\right)
-2\Phi\!\left(-\frac{\tau}{\sigma}\right)
\Bigg),
\qquad
\beta(\varepsilon)\ge \inf_{\Delta\ge \varepsilon}\beta_\tau(\Delta).
$$
In contrast to Laplace, there is generally no simplification beyond
Φ(⋅), but the function is
one-dimensional and easy to evaluate numerically. This is exactly the
kind of object we want: the audit enters our later incentive condition
only through β(ε),
which can be computed (or lower bounded) as part of design
calibration.
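A minimal numeric evaluation (illustrative parameters; the normal CDF is implemented via the error function) computes βτ(Δ) under Gaussian noise and lower bounds β(ε) by minimizing over Δ ≥ ε; since βτ is increasing in Δ, the infimum is attained at Δ = ε.

```python
# Gaussian-noise threshold audit: evaluate beta_tau(Delta) and a lower bound
# on beta(eps) numerically. Parameter values are illustrative.
import math
import numpy as np

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def beta_tau(delta, tau, sigma, P=1.0):
    gap = (normal_cdf(-(tau - delta) / sigma) + normal_cdf(-(tau + delta) / sigma)
           - 2.0 * normal_cdf(-tau / sigma))
    return P * gap

tau, sigma, eps = 0.05, 0.05, 0.10
beta_eps = min(beta_tau(d, tau, sigma) for d in np.linspace(eps, 1.0, 500))
print(f"beta(eps) >= {beta_eps:.4f} for eps = {eps} (minimum attained at Delta = eps)")
```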
In some applications, verification error is bounded: for example, a recorded timestamp may be accurate up to a known rounding unit, or a quality score may have bounded measurement error. If ξ is uniform on [−a, a] and d = 1, then Pr (|ξ| > τ) = 0 for τ ≥ a, and the threshold audit yields a gap that is piecewise linear in Δ. In the extreme case τ ≥ a, truthful agents are never falsely penalized, and any deviation larger than τ + a is detected with probability one. This illustrates a general design lesson: when verification is “sharp,” one can set τ to essentially eliminate false positives while preserving strong deterrence; when verification is “noisy-tailed,” one must trade off false positives and enforcement through (τ, ε, P).
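A quick simulation confirms the bounded-noise logic (numbers are illustrative): with τ ≥ a there are no false positives, deviations above τ + a are always detected, and deviations between τ − a and τ + a are detected with interior probability.

```python
# Bounded (uniform) verification noise: false positives and detection regions.
import numpy as np

rng = np.random.default_rng(3)
a, tau = 0.02, 0.03                                  # tau >= a: no false positives
xi = rng.uniform(-a, a, size=100_000)
print(np.mean(np.abs(xi) > tau))                     # 0.0: truthful false-positive rate
print(np.mean(np.abs(0.06 + xi) > tau))              # 1.0: Delta = 0.06 > tau + a, always detected
print(np.mean(np.abs(0.04 + xi) > tau))              # strictly between 0 and 1
```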
When θ ∈ [0, 1]d, the same primitives apply, but closed forms depend on the geometry of the test and the noise. Two common implementations are a norm-ball test (penalize when ∥b − z∥ exceeds a threshold τ) and coordinate-wise tests (penalize when any coordinate of b − z exceeds a per-coordinate threshold). The main conceptual point is unchanged: we do not need to “solve” the audit statistically; we only need a uniform lower bound on the incremental expected penalty for ε-scale deviations.
Noisy audits inevitably punish some truthful agents. Under the
threshold audit, the truthful false-positive rate is
pfp(τ) := Pr (∥ξ∥ > τ),
so a truthful agent faces expected audit disutility λ P pfp(τ)
per round (up to discounting and participation). This creates an
individual-rationality (IR) constraint: if participation is voluntary,
λPpfp(τ)
cannot exceed the expected surplus that the mechanism delivers to
truthful agents. Put differently, we cannot set P arbitrarily large or τ arbitrarily small without risking
exclusion of exactly the agents we want to elicit information from.
This is where the ε-tolerance viewpoint becomes practically valuable. If we only require deterrence for deviations exceeding ε, we can choose τ to keep pfp(τ) small (protecting truthful agents), and choose ε modestly larger than τ so that β(ε) remains strictly positive. The resulting design is not “perfect truth-telling,” but it is often the right economic compromise: it targets the manipulation that meaningfully distorts learning and policy, while tolerating noise-level discrepancies.
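One simple calibration route under Laplace noise (a sketch with assumed numbers, not a recommendation) is to choose τ from a target false-positive rate via pfp(τ) = e^{−τ/σ} and then check that a truthful agent’s expected audit loss λ P pfp(τ) stays below an assumed per-round surplus.

```python
# Illustrative IR-style calibration: pick tau from a target false-positive rate
# under Laplace noise, then compare the truthful agent's expected audit loss to
# an assumed per-round participation surplus.
import numpy as np

sigma, target_fp = 0.05, 0.05
tau = sigma * np.log(1.0 / target_fp)            # p_fp(tau) = exp(-tau/sigma) = target_fp
lam, P, truthful_surplus = 0.05, 1.0, 0.10       # audit rate, max penalty, assumed surplus
expected_audit_loss = lam * P * target_fp
print(f"tau = {tau:.3f}, truthful expected audit loss = {expected_audit_loss:.5f}, "
      f"IR slack = {truthful_surplus - expected_audit_loss:.5f}")
```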
The reason we labor to define β(ε) is that it is the exact quantity that trades off against the benefit of strategic manipulation in the dynamic game. When audits occur with probability λ, any deviation larger than ε suffers an expected one-shot loss of at least λ β(ε). In the next section, we formalize how learner stability upper bounds the continuation-value gain from a one-shot misreport by a term on the order of ηα. The enforcement logic is then immediate: if λ β(ε) exceeds that maximal dynamic gain (up to constants), large deviations are unprofitable, and truthful reporting becomes an ε-Nash equilibrium of the repeated interaction.
The broader message is that audits and stability are complements. Stability alone limits intertemporal manipulation but leaves within-round incentives unresolved; audits alone can deter misreports but would be too costly if used constantly. The mechanism we study mixes the two, and β(ε) is the bridge that lets us quantify the resulting incentive–performance tradeoff.
The dynamic incentive problem in online mechanism selection is not only that an agent may want to shade her report to affect the outcome, but also that she may want to perturb the learner so as to tilt the distribution over mechanisms in her favor. Our approach to controlling this second channel is deliberately algorithmic: we ask the learner to be stable, in the sense that any single report has only a limited effect on the distribution over mechanisms selected in later rounds. Intuitively, a stable learning rule makes “steering the algorithm” a low-leverage instrument.
There are several notions of stability in learning theory (uniform stability, on-average stability, Lipschitzness of iterates). We use a differential-privacy-style notion because it gives exactly the inequality we later need in a repeated-game argument: a bound on how much the distribution over actions can change when one entry of the input history changes. This bound is strong enough to control an agent’s worst-case continuation-value gain from a single misreport, yet weak enough to be satisfied by standard no-regret learners with mild regularization.
Let ht − 1 denote
the public history available to the learner before round t (e.g., past report profiles and
any past objective feedback the platform uses). The learner is a
mapping
ht − 1 ↦ qt(⋅ ∣ ht − 1) ∈ Δ(Π),
where qt
is the distribution over mechanisms for round t (we suppress dependence on t when notationally convenient).
We say that two histories ht − 1 and ht − 1′ are adjacent if they differ only in a single report entry (one agent in one round), with everything else held fixed. This is the natural adjacency relation for strategic manipulation: it isolates the marginal effect of one agent’s one-shot deviation on the learner’s subsequent behavior.
The stability condition we impose is the following weak form of η-differential privacy (equivalently, an η max-divergence bound) applied to the learner’s output distribution.
We say the learner is weakly η-stable if for every round t, every pair of adjacent histories
ht − 1 ∼ ht − 1′,
and every event A ⊆ Π,
Pr (πt ∈ A ∣ ht − 1) ≤ eη Pr (πt ∈ A ∣ ht − 1′).
Equivalently, for every π ∈ Π,
qt(π ∣ ht − 1) ≤ eη qt(π ∣ ht − 1′).
Two remarks help interpret this condition.
First, it is a one-sided likelihood-ratio bound: it upper bounds how much probability mass can be increased by changing one report. Since adjacency is symmetric (we can swap h and h′), it also implies the reverse inequality with h, h′ swapped, hence the likelihood ratio is bounded in both directions. In particular, no single report can make a particular mechanism dramatically more (or less) likely in the next step.
Second, it is precisely what we need for incentives: if an agent’s continuation value is some bounded function of the future mechanism draw, then the inequality above bounds how much that continuation value can change under a one-shot deviation. This is exactly what we will exploit in the backward-induction argument in the next section.
The multiplicative form is convenient because it composes cleanly
over time and because it is agnostic to the agent’s utility function
beyond boundedness. Concretely, if f : Π → [0, 1] is any
bounded function (think: expected continuation value from each
mechanism), then max-divergence implies a uniform expectation bound such
as
𝔼qt(⋅ ∣ h)[f(π)] ≤ eη 𝔼qt(⋅ ∣ h′)[f(π)],
and hence an additive bound of order (eη − 1) ≈ η
when η is small. This is the
sense in which η is the
“per-round manipulability budget” of the learner: smaller η means less scope for any report to
influence the distribution over future mechanisms.
From a practical standpoint, this stability is also a desirable governance property. Mechanisms chosen by a stable rule do not swing sharply in response to a single data point, which aligns with how platforms often justify policy updates (gradualism, robustness to anomalous reports, and reduced volatility).
We now verify that standard exponential-weights learning yields exactly the required stability, with η essentially equal to the learning rate. We present the argument in the full-information setting (the one underlying our regret bound later), where the learner can form a bounded “loss” (or negative reward) for each candidate mechanism each round.
Let ℓτ(π) ∈ [0, 1]
denote the loss assigned to mechanism π on round τ based on the observed feedback at
that round. Hedge maintains weights
$$
w_t(\pi)\;=\;\exp\!\Big(-\eta_H \sum_{\tau=1}^{t-1} \ell_\tau(\pi)\Big),
\qquad
q_t(\pi)\;=\;\frac{w_t(\pi)}{\sum_{\pi'\in\Pi} w_t(\pi')},
$$
where ηH > 0 is the
learning rate.
Consider two adjacent histories ht − 1 ∼ ht − 1′
that differ only in a single report entry at some round τ⋆ < t. This
report change may alter the learner’s computed loss vector for that
round, so let ℓτ⋆
and ℓτ⋆′
be the corresponding losses. Suppose we have the natural
bounded-sensitivity condition
supπ ∈ Π |ℓτ⋆(π) − ℓτ⋆′(π)| ≤ 1,
i.e., changing one report can change the assessed loss of any fixed
mechanism by at most one unit. (With gains in [−1, 1], the analogous bound is 2; all conclusions below scale
accordingly.)
Then for every π,
$$
\frac{w_t(\pi\mid h)}{w_t(\pi\mid h')}
=
\exp\!\Big(-\eta_H(\ell_{\tau^\star}(\pi)-\ell'_{\tau^\star}(\pi))\Big)
\;\le\; e^{\eta_H}.
$$
A standard normalization argument then implies a bound of the same form for the normalized probabilities:
$$
\frac{q_t(\pi\mid h)}{q_t(\pi\mid h')}
\;\le\; e^{\eta_H},
$$
and hence for any event A ⊆ Π,
Pr (πt ∈ A ∣ h) ≤ eηHPr (πt ∈ A ∣ h′).
Thus Hedge is weakly η-stable
with η = ηH
(up to the sensitivity scaling). The key economic interpretation is
immediate: A smaller learning rate makes the platform harder to
manipulate intertemporally, because it tempers the extent to which any
single report can tilt future mechanism probabilities.
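The likelihood-ratio property is easy to probe empirically. The sketch below (illustrative: adjacent histories are modeled as two loss matrices that differ in one round by at most one unit per mechanism) compares the resulting Hedge distributions against the conservative two-sided bound e^{2ηH} obtained by applying the weight-ratio bound to both the numerator and the normalizing sum.

```python
# Empirical check of weak stability for Hedge under a one-round loss perturbation.
import numpy as np

rng = np.random.default_rng(4)

def hedge_distribution(losses, eta):
    """q_t(pi) proportional to exp(-eta * cumulative loss of pi)."""
    scores = -eta * losses.sum(axis=0)
    w = np.exp(scores - scores.max())            # numerically stabilized softmax
    return w / w.sum()

T_hist, K, eta = 200, 10, 0.05
losses = rng.uniform(0.0, 1.0, size=(T_hist, K))
losses_adj = losses.copy()
losses_adj[37] = np.clip(losses_adj[37] + rng.uniform(-1.0, 1.0, size=K), 0.0, 1.0)

q, q_adj = hedge_distribution(losses, eta), hedge_distribution(losses_adj, eta)
max_ratio = np.max(np.maximum(q / q_adj, q_adj / q))
print(f"max likelihood ratio = {max_ratio:.4f} <= exp(2*eta) = {np.exp(2 * eta):.4f}")
```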
The same stability arises if we implement the learner via entropic
follow-the-regularized-leader (FTRL). Let Lt − 1(π) = ∑τ < tℓτ(π)
be cumulative loss. Entropic FTRL chooses a distribution
$$
q_t \;\in\; \arg\min_{q\in\Delta(\Pi)}\Big\{\langle q, L_{t-1}\rangle +
\frac{1}{\eta_F}\mathrm{KL}(q\|u)\Big\},
$$
where u is a reference
distribution (often uniform). The solution is again a softmax:
qt(π) ∝ u(π)exp ( − ηFLt − 1(π)),
so the same likelihood-ratio bound holds with η = ηF
under the same bounded-sensitivity condition. This representation
clarifies what stability is doing: rather than “chasing” the single
empirically best mechanism, the learner plays a best response to past
data, and the smoothing parameter governs manipulability.
This also highlights a limitation. If we used an unsmoothed rule such as follow-the-leader (pure argmin of Lt − 1), the mapping from histories to πt can be discontinuous: a single report can flip the identity of the empirical minimizer, producing large changes in future play and therefore large dynamic incentives to misreport. In our setting, such instability is not merely a technical nuisance; it is precisely the vulnerability strategic agents exploit.
The stability discussion above is cleanest when Π is finite, since the learner outputs a distribution over finitely many mechanisms and the max-divergence inequality is literal. Many economically relevant mechanism families, however, are parameterized by continuous variables: reserve prices, posted-price vectors, scoring weights, or allocation thresholds. In those cases, we typically work with a discretized subclass Π obtained by gridding the parameter space.
This discretization serves two purposes. Statistically/computationally, it makes online learning feasible with standard finite-action algorithms and yields regret terms depending on log |Π|. Economically, it aligns with the fact that most deployed mechanisms are implemented at finite resolution (prices in cents, reserves in dollars, and policy parameters rounded to operationally meaningful units).
The cost is an approximation term. Let 𝒰 denote an underlying continuous benchmark class and Π ⊆ 𝒰 a finite δ-net under a metric that controls the planner’s objective. Then the best fixed mechanism in Π may be worse than the best in 𝒰 by a discretization error disc(Π), which we carry explicitly in our regret decomposition later. When the objective is Lipschitz in parameters (as is common in pricing and auction applications under mild regularity), a grid with mesh δ typically yields disc(Π) = O(δ), while |Π| grows like δ−k for k effective dimensions, so log |Π| = O(klog (1/δ)). This is the familiar complexity–approximation tradeoff: finer grids improve approximation but increase learning burden.
A further subtlety is that stability can degrade if we refine Π without adjusting learning rates. Since our stability parameter is essentially the softmax temperature, and our regret-optimal learning rate depends on T and log |Π|, there is an endogenous link between discretization choice and incentive robustness. Put plainly: making the mechanism class very rich increases the platform’s capacity to fit data, but it can also increase the platform’s responsiveness to any single report unless we correspondingly regularize (lower ηH).
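To visualize the tradeoff, the sketch below (all constants are assumptions chosen purely for illustration) tabulates, for several mesh sizes δ, the learning term (log|Π|)√(αT) cited earlier and a Lipschitz-style discretization term of order δT.

```python
# Illustrative complexity-approximation tradeoff for a gridded mechanism class:
# |Pi| ~ delta^{-k}, learning term ~ log|Pi| * sqrt(alpha * T), discretization ~ C * delta * T.
import numpy as np

T, alpha, k, C_disc = 100_000, 20.0, 2, 1.0
for delta in (0.2, 0.1, 0.05, 0.02, 0.01):
    num_mechanisms = (1.0 / delta) ** k
    learning = np.log(num_mechanisms) * np.sqrt(alpha * T)
    discretization = C_disc * delta * T
    print(f"delta={delta:.3f}: |Pi|={num_mechanisms:.0f}, learning~{learning:.0f}, "
          f"discretization~{discretization:.0f}, total~{learning + discretization:.0f}")
```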
We will treat weak per-round η-stability as the learner-side primitive: it caps the dynamic benefit from manipulating future mechanism selection. Exponential-weights (Hedge) and entropic FTRL satisfy it transparently, with η controlled by a learning-rate/regularization parameter under bounded feedback sensitivity. In the next section, we combine this stability control with the audit-based one-shot deterrence to establish ε-Nash incentive compatibility via a backward-induction argument.
We now state and prove the central incentive result: a weakly-stable learner can be made robust to dynamic manipulation by mixing in an audit mechanism that creates an effective penalty gap for “large” misreports. Economically, the proof formalizes a simple comparison: a deviation can only help an agent by (i) slightly steering the learner’s future distribution over mechanisms, while it hurts her by (ii) increasing her expected audit exposure. Stability bounds (i); the audit gap enforces (ii). When (ii) dominates (i) over the agent’s effective planning horizon, truth-telling becomes an equilibrium of the repeated interaction.
We encode forward-looking behavior using the standard “effective
horizon” bound. For each agent i, let γi(t) ≥ 0
be the discount factor weight on period t, and assume there is a finite
α such that for every round
t,
$$
\sum_{\substack{t'\in T_i\\ t'>t}}\gamma_i(t')\;\le\;\alpha\,\gamma_i(t).
$$
Thus, conditional on being at round t, the total discounted weight the
agent places on future rounds is at most α
in units of the current-period weight. This is the precise sense in
which α upper bounds how much
future steering is worth.
Fix a norm ∥⋅∥ on Θ and a tolerance parameter ε > 0. The audit rule is assumed
to satisfy an effective penalty gap condition: for every agent i, every true type θ ∈ Θ, and every report
b with ∥b − θ∥ ≥ ε,
$$
\mathbb E[\mathrm{pen}(b,z)\mid \theta]\;-\;\mathbb E[\mathrm{pen}(\theta,z)\mid \theta]\;\ge\;\beta(\varepsilon).
$$
We emphasize what this is (and is not) buying us. It does not require exact
verification; it only requires that sufficiently large misreports become
detectably more likely to trigger penalties. The tolerance ε is a modeling choice reflecting
practical limits: we do not attempt to police “microscopic” deviations
in continuous type spaces, but we will ensure that deviations that move
the report by at least ε are
unprofitable.
Let Mλ denote the repeated procedure that, each round t, flips an independent coin: with probability 1 − λ it executes a mechanism drawn from the learner’s distribution qt(⋅ ∣ ht − 1), and with probability λ it executes the audit mechanism πaud (which includes drawing verification signals and applying penalties). The coin flip is public and committed ex ante.
The proof uses a standard conversion from likelihood-ratio stability
to value stability. Suppose two distributions q, q′ on Π satisfy the pointwise bound q(π) ≤ eηq′(π)
for all π. Then for any
function f : Π → [0, B],
$$
\mathbb E_{q}[f(\pi)]\;\le\;e^{\eta}\,\mathbb E_{q'}[f(\pi)]
\;\le\;\mathbb E_{q'}[f(\pi)]+(e^{\eta}-1)B.
$$
Swapping q and q′ yields the reverse
inequality, so
$$
\big|\mathbb E_{q}[f(\pi)]-\mathbb E_{q'}[f(\pi)]\big|\;\le\;(e^{\eta}-1)\,B.
$$
When η ≤ 1, we can use eη − 1 ≤ 2η
to obtain the convenient bound
$$
\big|\mathbb E_{q}[f(\pi)]-\mathbb E_{q'}[f(\pi)]\big|\;\le\;2\eta B.
$$
This is where the constant 4 in the
main condition comes from: we pay (i) a factor 2 to turn a one-sided likelihood ratio into a
two-sided absolute difference, and (ii) a factor 2 to linearize eη − 1 for small
η.
We can now state the incentive guarantee in the form we will use later.
Assume: (a) stage utilities are normalized so that ui(θ, s) ∈ [0, 1];
(b) the learner is weakly η-stable in the per-round
max-divergence sense (as defined in the previous section) with respect
to changing a single report entry in the history; (c) the audit
mechanism satisfies the effective penalty gap for deviations of size at
least ε; and (d) each agent
i satisfies the
effective-horizon bound above with parameter α. If the audit mixing probability λ satisfies
$$
\lambda\,\beta(\varepsilon)\;\ge\;4\,\eta\,\alpha,
$$
then truthful reporting is an ε-Nash equilibrium of the repeated
interaction induced by Mλ (ex-post,
i.e., for every realized sequence of types and all realized public
randomness).
Fix an agent i and imagine she deviates by submitting a report bi, t far from θi, t at some round t. Because of η-stability, this single change can only tilt the platform’s future distribution over mechanisms by a bounded amount each round, which translates into a bounded increase in the agent’s expected continuation value. The total discounted value of those future tilts is at most on the order of ηα. Meanwhile, the audit lottery imposes an expected cost of at least λβ(ε) whenever the deviation is of magnitude at least ε. The condition λβ(ε) ≥ 4ηα makes the expected audit loss dominate the maximal dynamic gain, so such a deviation is not profitable. A backward-induction (one-shot deviation) argument then extends this to arbitrary deviation strategies.
Fix an arbitrary realized type sequence (θi, t)t ≤ T for all agents, and fix any public randomness outside the learner’s own randomization and the audit noise. Fix an agent i and hold all other agents to truthful play.
For each round t, define
agent i’s from t + 1 onward (measured in units of
period-t weight) under
truthful play as
$$
W_{t+1}(h_t)
\;:=\;
\mathbb E\Big[\sum_{\tau=t+1}^T
\frac{\gamma_i(\tau)}{\gamma_i(t)}\Big(u_i(\theta_{i,\tau},s_\tau)-\mathbf
1\{\text{audit at }\tau\}\cdot
\mathrm{pen}_i(\cdot)\Big)\,\Big|\,h_t\Big].
$$
By boundedness of ui and
nonnegativity of penalties, we have the crude upper bound Wt + 1(ht) ≤ α,
using the effective-horizon bound.
Consider any deviation strategy of agent i. Let t⋆ be the first round at which the deviating strategy differs from truthful reporting (if it never differs, we are done). By standard one-shot deviation logic in finite-horizon dynamic games, it suffices to show that, holding play truthful from t⋆ + 1 onward, the agent cannot gain by deviating at t⋆ alone by reporting some b ≠ θi, t⋆.
So fix round t = t⋆, fix the
public history ht − 1 that
would obtain under truthful play up to t − 1, and compare two
counterfactual histories at the end of round t:
ht = (ht − 1, bi, t = θi, t, b−i, t = θ−i, t, public
signals), ht′ = (ht − 1, bi, t = b, b−i, t = θ−i, t, public
signals′).
These two histories differ only through agent i’s round-t report entry and whatever
(bounded) derived learner inputs that report affects; by assumption, the
learner’s stability guarantee applies to such a change.
Write the deviator’s net gain from choosing b at round t instead of truth-telling as
$$
\Delta U_t(b)
=
\underbrace{\mathbb E\big[u_i(\theta_{i,t},s_t)\mid b\big]-\mathbb
E\big[u_i(\theta_{i,t},s_t)\mid \theta_{i,t}\big]}_{\text{(A)
within-round gain}}
\;-\;
\underbrace{\lambda\Big(\mathbb E[\mathrm{pen}(b,z_{i,t})\mid
\theta_{i,t}]-\mathbb E[\mathrm{pen}(\theta_{i,t},z_{i,t})\mid
\theta_{i,t}]\Big)}_{\text{(B) expected audit gap at }t}
\;+\;
\underbrace{\mathbb E[W_{t+1}(h_t')]-\mathbb E[W_{t+1}(h_t)]}_{\text{(C)
continuation gain}}.
$$
Term (B) is exactly where the effective penalty gap β(ε) enters: if ∥b − θi, t∥ ≥ ε,
then (B) ≥ λβ(ε).
Term (C) is where stability and α enter. The continuation value
Wt + 1 is
an expectation over future mechanism draws (from the learner) and future
audit signals. Conditional on the realized future public history, each
round τ > t the
learner outputs a distribution over mechanisms that is weakly η-stable with respect to changing
agent i’s round-t report. Therefore, applying the conversion bound |𝔼q[f] − 𝔼q′[f]| ≤ 4ηB round-by-round, with B equal to the remaining discounted mass from that round onward, and summing over τ > t, we obtain the uniform bound
𝔼[Wt + 1(ht′)] − 𝔼[Wt + 1(ht)] ≤ 4η α.
(Informally: at each future round, stability limits how much the
mechanism distribution can be tilted; bounded utilities convert this
into a bounded per-round payoff tilt; α aggregates these tilts across the
future.)
Putting these pieces together, for any deviation with ∥b − θi, t∥ ≥ ε,
ΔUt(b) ≤ (A) − λβ(ε) + 4ηα.
Under the maintained modeling choice that our mechanism class is myopically truthful (so (A) ≤ 0 conditional on any fixed mechanism draw), the condition λβ(ε) ≥ 4ηα implies ΔUt(b) ≤ 0.
Hence there is no profitable “large” deviation at the first deviation
time. By one-shot deviation, there is no profitable deviation strategy
that ever makes a deviation of size at least ε.
Finally, the ε-approximate nature of the equilibrium reflects precisely what we choose to police: reports within distance ε are left unconstrained by the audit gap condition. If, as is common in applications, the platform rounds reports to an ε-grid (so deviations smaller than ε do not change any learner input or allocation), then even these deviations are payoff-irrelevant and the same argument yields exact truthfulness. More generally, our conclusion is the stated ε-NIC guarantee.
The condition λβ(ε) ≥ 4ηα is a transparent policy rule: enforcement intensity λ times “detectability × severity” β(ε) must exceed the algorithm’s manipulability budget 4ηα. It also highlights a limitation: if verification is very noisy (small β(ε)) or agents are very forward-looking (large α), then maintaining incentives requires either heavy auditing (large λ) or stronger penalties (larger P inside β), both of which can be costly or infeasible. This is exactly the tradeoff we quantify next, when we decompose regret into learning performance, audit distortion, and approximation error and then optimize the design parameters.
Having established that mixing audits into a weakly-stable learner restores incentives, we now ask the complementary performance question: how much objective value does the platform give up in exchange? The answer is naturally stated as a regret decomposition. One component is standard learning regret (we compete with the best fixed mechanism in a benchmark class). A second component is audit distortion: whenever we audit, we do not run the learner’s preferred mechanism, so we mechanically lose objective value. The final components are approximation errors: (i) the tolerance region ε leaves slack in reports, which can translate into objective loss when the objective is sensitive to small type perturbations; and (ii) if we discretize a continuous mechanism family into a finite class Π, we pay discretization error.
Fix a finite mechanism class Π (either intrinsically finite or
obtained by discretizing a continuous design space). For the online
procedure Mλ (learner with
audit mixing rate λ), we
evaluate performance along the truthful equilibrium path guaranteed by
the incentive theorem. Let $\mathrm{Alg}(M_\lambda)=\mathbb E[\sum_{t=1}^T
G_t(\theta_t,s_t)]$ denote the planner’s realized objective under
Mλ
(expectation over the learner’s randomization and audit randomness,
conditional on the type sequence). We compare against the best fixed
mechanism in hindsight within Π:
$$
\mathrm{Opt}_\Pi
\;:=\;
\max_{\pi\in\Pi}\;
\mathbb E\Big[\sum_{t=1}^T G_t(\theta_t,s_t(\pi))\Big],
$$
where st(π)
denotes the outcome induced at round t when π is run on the (truthful) report
profile. The (truthful-play) regret is then
RegT(Mλ) := OptΠ − Alg(Mλ).
Because Gt ∈ [−1, 1],
regret is always bounded by 2T, and the goal is to obtain
sublinear dependence on T up
to the unavoidable approximation terms.
It is helpful to conceptually “peel off” the losses we incur relative to OptΠ.
Suppose first there were no audits and no incentive issues, so that the
platform simply ran a standard online algorithm over Π against the realized objectives.
Under full-information feedback (the platform can evaluate Gt(θt, st(π))
for every π ∈ Π after
round t), Hedge delivers the
familiar guarantee
$$
\mathrm{Reg}_T^{\mathrm{learn}}
\;\le\;
\frac{\log|\Pi|}{\eta_H} \;+\; O(\eta_H T),
$$
for a learning rate ηH > 0 (here
the constant in O(⋅) depends
only on the bounded range of the objective). This is the purely
statistical price of searching over a richer mechanism class: log |Π| enters multiplicatively.
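For concreteness, a minimal Hedge sketch over a finite mechanism class is given below. The synthetic gains array is a stand-in for the full-information feedback Gt(θt, st(π)) and is purely illustrative; the point is that the same learning rate ηH that drives the regret bound also controls the per-round stability used in the incentive argument.

```python
import numpy as np

def hedge_distributions(gains, eta_H):
    """Multiplicative-weights (Hedge) over a finite class Pi.

    gains: T x K array, gains[t, k] = G_t(theta_t, s_t(pi_k)) in [-1, 1]
           (full-information feedback: every mechanism is evaluated each round).
    Returns the T x K array of distributions q_t(. | h_{t-1}) actually played.
    """
    T, K = gains.shape
    log_w = np.zeros(K)                 # log-weights, for numerical stability
    played = np.zeros((T, K))
    for t in range(T):
        q = np.exp(log_w - log_w.max())
        q /= q.sum()
        played[t] = q                   # distribution used at round t
        log_w += eta_H * gains[t]       # exponential-weights update
    return played

# Tiny demo with synthetic bounded gains (illustrative only).
rng = np.random.default_rng(1)
T, K = 2000, 8
gains = rng.uniform(-1, 1, size=(T, K)) + 0.3 * (np.arange(K) == 3)  # arm 3 slightly better
gains = np.clip(gains, -1, 1)
q_t = hedge_distributions(gains, eta_H=np.sqrt(np.log(K) / T))
alg = (q_t * gains).sum()
best_fixed = gains.sum(axis=0).max()
print(f"regret ≈ {best_fixed - alg:.1f}  (bound ~ O(sqrt(T log K)))")
```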
Under Mλ,
with probability λ we run
πaud instead of the
learner-chosen mechanism. Even if πaud is “reasonable,” it
is generally not the optimizer of Gt at that
round. Since Gt ∈ [−1, 1], a
crude worst-case bound is that substituting any other mechanism can
reduce the round objective by at most 2. Thus the expected cumulative distortion
from auditing is at most
Lossaudit ≤ 2λT.
This term is where incentives directly trade off against welfare:
stronger enforcement (larger λ) worsens the objective
mechanically.
Our incentive guarantee is ε-NIC: deviations smaller than ε are not deterred by the audit gap
condition. To convert this into a welfare statement, we impose a
standard sensitivity bound: there exists a constant Lθ such that
changing the report profile by at most ε in norm changes the induced
objective by at most Lθε per
round (this can be derived, for example, from Lipschitz utilities and
continuity of the allocation rule in types). Then the worst-case
objective loss from “allowed” within-ε deviations is
Lossε ≤ O(Lθ ε T).
Separately, if Π discretizes a
continuous design set 𝒰, we incur a
benchmark mismatch
$$
\mathrm{disc}(\Pi)
\;:=\;
\sup_{\text{type sequences}}\Big(\max_{\pi\in\mathcal U}\sum_{t=1}^T
G_t(\theta_t,s_t(\pi))
\;-\;
\max_{\pi\in\Pi}\sum_{t=1}^T G_t(\theta_t,s_t(\pi))\Big),
$$
which is application-specific and typically decreases as the
discretization is refined (at the expense of larger |Π|).
The preceding terms would appear in any “audit-mixing” scheme, but
our setting adds a crucial coupling: we cannot choose λ freely, because it must satisfy
the incentive feasibility condition
λ β(ε) ≥ 4η α,
where η is the learner’s
per-round stability and α is
the effective horizon. Thus, any design that makes the learner more stable (smaller η) relaxes the required audit
rate and therefore reduces audit distortion. Conversely, a more
aggressive learner can have smaller learning regret but may force more
auditing.
For standard multiplicative-weights updates (Hedge), stability is
controlled by the learning rate: when payoffs are bounded, the induced
mapping from histories to distributions over Π satisfies a max-divergence bound
of the form η ≲ c ηH
for a constant c. Substituting
the IC constraint at equality (the minimal auditing that still deters
large deviations),
$$
\lambda^\star(\varepsilon)
\;=\;
\frac{4\eta\alpha}{\beta(\varepsilon)}
\;\approx\;
\frac{4c\,\eta_H\,\alpha}{\beta(\varepsilon)},
$$
and then inserting into Lossaudit ≤ 2λT
yields an audit-loss contribution of order
$$
\mathrm{Loss}^{\mathrm{audit}}
\;\lesssim\;
\frac{\eta_H\,\alpha\,T}{\beta(\varepsilon)}.
$$
This is the key point: the audit loss is not an independent design knob;
it is an endogenous consequence of stability and detectability.
Putting the pieces together, one obtains a bound of the form
$$
\mathrm{Reg}_T(M_\lambda)
\;\le\;
\frac{\log|\Pi|}{\eta_H} \;+\; O(\eta_H T)
\;+\; O\Big(\tfrac{\eta_H\,\alpha\,T}{\beta(\varepsilon)}\Big)
\;+\; O(L_\theta\,\varepsilon\,T)
\;+\; O(\mathrm{disc}(\Pi)).
$$
Choosing ηH to balance
the first two (sublinear-in-T)
terms yields a convenient scaling. If we take $\eta_H\asymp 1/\sqrt{\alpha T}$ (which makes
the learner increasingly stable as agents become more forward-looking),
then the learning term becomes $O((\log|\Pi|)\sqrt{\alpha T})$ and the
induced audit term becomes $O(\sqrt{\alpha
T}/\beta(\varepsilon))$, matching the headline form we will use
going forward:
$$
\mathrm{Reg}_T(M_\lambda)
\;\le\;
O\Big((\log|\Pi|)\sqrt{\alpha T}\Big)
\;+\;
O\Big(\tfrac{\sqrt{\alpha T}}{\beta(\varepsilon)}\Big)
\;+\;
O(L_\theta\,\varepsilon\,T)
\;+\;
O(\mathrm{disc}(\Pi)).
$$
We view this expression as the planner’s “audit–regret frontier”: it
isolates the single scalar function β(ε)—the
detectability/severity of deviations of size at least ε—as the object that maps
verification technology into welfare guarantees.
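To make the accounting concrete, the following sketch simply tabulates the four terms of the headline bound for given primitives; all hidden constants are set to one, so it is an order-of-magnitude accounting device rather than a certified bound, and the numerical inputs are hypothetical.

```python
import math

def regret_frontier_terms(n_mechs, alpha, T, beta_eps, L_theta, eps, disc=0.0):
    """Order-of-magnitude accounting of the audit-regret frontier
    (all hidden constants set to 1; illustrative only)."""
    learning  = math.log(n_mechs) * math.sqrt(alpha * T)   # O(log|Pi| * sqrt(alpha T))
    audit     = math.sqrt(alpha * T) / beta_eps            # O(sqrt(alpha T) / beta(eps))
    tolerance = L_theta * eps * T                          # O(L_theta * eps * T)
    return {"learning": learning, "audit": audit,
            "tolerance": tolerance, "discretization": disc,
            "total": learning + audit + tolerance + disc}

print(regret_frontier_terms(n_mechs=100, alpha=50, T=10_000,
                            beta_eps=0.2, L_theta=1.0, eps=0.02))
```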
For a fixed tolerance ε,
the welfare-optimal choice of audit frequency is typically to audit only as much as incentives require, i.e.,
$$
\lambda
\;=\;
\min\Big\{1,\;\frac{4\eta\alpha}{\beta(\varepsilon)}\Big\}.
$$
Auditing more than this cannot improve incentives (once large deviations
are already unprofitable) but continues to distort the objective. The
truncation at 1 highlights a
feasibility limitation: if β(ε) is too small (noisy
verification, weak penalties, or too small a tolerance window), then
even λ = 1 cannot satisfy the
IC inequality, and one must revisit the audit design (increase P, change the test, or accept a
larger ε).
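A minimal sketch of this policy rule, with the truncation at 1 surfaced as an infeasibility flag, is below; the numerical inputs are hypothetical.

```python
def audit_rate(eta, alpha, beta_eps):
    """Minimal audit probability satisfying lambda * beta(eps) >= 4 * eta * alpha,
    truncated at 1. Returns (lambda, feasible)."""
    required = 4.0 * eta * alpha / beta_eps
    if required > 1.0:
        # Even auditing every round cannot deter large deviations:
        # revisit the audit design (raise P, change the test, or enlarge eps).
        return 1.0, False
    return required, True

print(audit_rate(eta=0.001, alpha=20, beta_eps=0.4))   # (0.2, True)
print(audit_rate(eta=0.01,  alpha=50, beta_eps=0.3))   # (1.0, False)
```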
The remaining design choice is ε, which enters the regret bound in
two opposing ways:
incentives become cheaper as
ε↑ (β(ε) typically increases), but
allocations become noisier as
ε↑ (LθεT
grows).
Abstracting away log |Π| and
discretization terms (which do not depend on ε), we are led to the
one-dimensional optimization problem
$$
\min_{\varepsilon>0}\;
\Big\{\,C_1\frac{\sqrt{\alpha T}}{\beta(\varepsilon)} \;+\; C_2
L_\theta\,\varepsilon\,T\,\Big\},
$$
for constants C1, C2
coming from the preceding inequalities. This objective is typically
U-shaped in ε: very small
ε makes β(ε) tiny (audits must be
frequent), while very large ε
creates large tolerated misreporting regions that degrade outcomes
directly. In applications where β(ε) has a closed form
(e.g., Laplace verification with threshold tests), the minimizer can be
computed analytically or by a simple line search; otherwise, a numerical
minimization over ε ∈ (0, 1]
is straightforward because the problem is one-dimensional and β(ε) is monotone.
To illustrate what the frontier can look like, consider Laplace verification z = θ + ξ with ξ ∼ Laplace(0, σ) and a threshold penalty rule pen(b, z) = P ⋅ 1{|b − z| > τ}. For deviations Δ = ∥b − θ∥ ≥ ε (and in the scalar case for simplicity), Proposition 1 provides a closed form. In the informative regime ε ≥ τ, β(ε) is increasing in ε and approaches the limit P(1 − e−τ/σ) as ε → ∞: once a misreport is “large enough,” the test almost surely flags it whenever auditing occurs, so detectability saturates. The frontier implication is immediate: beyond some point, increasing ε barely reduces required auditing (because β(ε) has plateaued) but continues to increase the direct tolerance loss LθεT. Hence the optimal ε is often near the scale at which verification becomes meaningfully informative, rather than arbitrarily large.
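To make this concrete, the sketch below re-derives the detection probabilities from Laplace tail formulas (an assumption consistent with the stated saturation limit P(1 − e−τ/σ), not a transcription of Proposition 1) and then runs the one-dimensional search for ε⋆ on the frontier; all numerical values are hypothetical.

```python
import numpy as np

def laplace_tail(x, sigma):
    """P(xi > x) for xi ~ Laplace(0, sigma)."""
    return np.where(x >= 0, 0.5 * np.exp(-x / sigma), 1.0 - 0.5 * np.exp(x / sigma))

def beta_laplace(eps, sigma, tau, P):
    """Effective audit gap for a scalar deviation of size >= eps under the
    threshold test pen(b, z) = P * 1{|b - z| > tau} with Laplace(0, sigma) noise.
    Detection probability is monotone in deviation size, so the infimum over
    deviations of size >= eps is attained at eps."""
    p_flag_truthful = np.exp(-tau / sigma)                 # false-positive rate e^{-tau/sigma}
    p_flag_deviator = laplace_tail(tau - eps, sigma) + laplace_tail(tau + eps, sigma)
    return P * (p_flag_deviator - p_flag_truthful)

# One-dimensional search for eps* on C1*sqrt(alpha*T)/beta(eps) + C2*L_theta*eps*T.
sigma, tau, P = 0.05, 0.1, 1.0
alpha, T, L_theta, C1, C2 = 50, 10_000, 1.0, 1.0, 1.0
eps_grid = np.linspace(0.01, 1.0, 500)
frontier = C1 * np.sqrt(alpha * T) / beta_laplace(eps_grid, sigma, tau, P) \
           + C2 * L_theta * eps_grid * T
print("eps* ≈", eps_grid[np.argmin(frontier)])
print("saturation limit of beta:", P * (1 - np.exp(-tau / sigma)))
```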
The final knob is the richness of Π. A finer discretization reduces disc(Π) but enlarges log |Π| and therefore the learning term. Practically, this suggests a two-step tuning heuristic: (1) pick a discretization coarse enough that log |Π| does not dominate $\sqrt{\alpha T}$-scaling terms; (2) verify that disc(Π) is below the economic tolerance for approximation error in the application (e.g., reserve-price grids in auctions, menu-size constraints in pricing, or tax schedules in policy design). Because audits already introduce an unavoidable distortion, it is rarely optimal to drive disc(Π) to zero if doing so explodes |Π| and learning regret.
Two caveats are worth keeping in view. First, the cleanest bounds above assume full-information feedback; with bandit feedback (only observing Gt for the executed mechanism) one can use Exp3-style algorithms, but the learning term worsens and the stability analysis requires additional care. Second, the Lipschitz conversion O(LθεT) is a reduced-form way to capture how tolerant incentives translate into welfare loss; in market design settings with discontinuities (e.g., tie-breaking, capacity constraints), establishing such a bound may require smoothing or explicit randomization.
What this section delivers, however, is a tractable design lens: once we can (i) quantify β(ε) from the verification technology and penalty rule, and (ii) bound Lθ from the allocation environment, the tuning of (ηH, λ, ε) follows from an explicit frontier. In the next section we use this lens to derive comparative statics—how optimal audit rates and tolerances move with long-sightedness α, verification noise, penalty magnitude P, and the environment’s sensitivity Lθ.
The regret–incentives decomposition above collapses the design problem to a small number of economically interpretable primitives: (i) how forward-looking agents are (their effective horizon α), (ii) how manipulable the learner is (its per-round stability η), (iii) how informative verification is (captured by the gap β(ε), which in turn depends on the noise scale σ, the test/threshold τ, and penalty magnitude P), and (iv) how sensitive the objective is to small type perturbations (the Lipschitz constant Lθ). We now spell out how optimal auditing and tolerance choices move with these primitives, and we translate the resulting comparative statics into a practical tuning workflow.
Across environments, one robust principle is that auditing is
typically welfare-improving only insofar as it is needed for incentives.
Because audits mechanically substitute away from the learner’s preferred
mechanism, the welfare-optimal policy (holding fixed ε and the audit design) is to
enforce at the boundary of the incentive constraint:
$$
\lambda^\star(\varepsilon)
\;=\;
\min\Big\{1,\;\frac{4\eta\,\alpha}{\beta(\varepsilon)}\Big\}.
$$
All comparative statics for auditing therefore flow through two
channels: parameters that increase the manipulation benefit (via η or α) push λ⋆ up, while parameters
that improve detectability/severity (via β(ε)) push λ⋆ down. The remaining
design degrees of freedom—ε
(how large a lie we bother deterring) and the audit test
itself—determine which side of this inequality is cheapest to move.
The first and most direct monotonicity is in α. A larger α means an agent can rationally incur a short-run cost in exchange for steering the learner’s future mechanism choices, so the continuation-value gain from a single-round misreport is larger. Because the incentive constraint is λβ(ε) ≥ 4ηα, we need either (i) more frequent audits (λ↑), (ii) a more punitive/more informative audit that raises β(ε), or (iii) a more stable learner (η↓).
This observation has a useful operational corollary: when we expect
long-lived participants (large α), it is often better to slow the learner down than to ramp up auditing. Indeed,
for multiplicative-weights learners, stability is controlled by the
learning rate, so we can deliberately slow down updates to reduce η and thereby reduce the required
audit rate. Concretely, with Hedge we typically have η ≲ c ηH,
so λ⋆(ε) ≈ 4c ηHα/β(ε).
Choosing $\eta_H\asymp 1/\sqrt{\alpha
T}$ makes the induced audit frequency scale like
$$
\lambda^\star(\varepsilon)
\;\asymp\;
\frac{\sqrt{\alpha/T}}{\beta(\varepsilon)},
$$
which grows only like $\sqrt{\alpha}$
rather than linearly. Economically, this is the familiar “patient agents require gentle policies” lesson: if we insist on a fast-reacting
learner, we must pay for it with more enforcement.
The audit gap β(ε) is the reduced-form object translating a noisy verification technology into deterrence. Under essentially any reasonable noise model, increasing the noise scale σ makes the distribution of z = θ + ξ less informative about θ, weakening the detection advantage of misreports and therefore decreasing β(ε) for fixed (ε, τ, P). The incentive boundary then implies λ⋆ must rise, possibly to the feasibility limit λ = 1.
The more subtle comparative static is on the optimal tolerance ε⋆. When verification becomes noisier, it becomes expensive (in audit frequency) to deter small lies, because small deviations are hard to distinguish from noise. The frontier optimization therefore tends to increase ε⋆: we stop trying to police fine-grained deviations and instead focus enforcement on economically meaningful lies that are detectable. This logic mirrors many real compliance systems: when measurement error is high (e.g., self-reported inputs with weak third-party verification), policy often shifts from precise truth enforcement to coarse rules that deter only large misstatements.
In settings where β(ε) exhibits saturation (as in threshold tests with light-tailed noise), the implication is particularly sharp. Once ε is large enough that deviations are detected with near-maximal probability conditional on an audit, further increasing ε yields little additional incentive benefit (because β(ε) is near its plateau) but continues to increase the direct tolerance loss LθεT. Thus noisier verification pushes ε⋆ upward only until we reach the “informative” scale of the signal; beyond that, improving welfare requires changing the audit technology (lowering σ) or increasing effective penalties (raising P), not simply loosening tolerance.
When the penalty rule scales linearly with P (as in pen(b, z) = P ⋅ 1{flag}),
the effective gap typically scales linearly as well: β(ε) = P ⋅ β̃(ε),
where β̃(ε) depends on
the noise model and the test. Then the boundary audit probability
obeys
$$
\lambda^\star(\varepsilon)
\;\approx\;
\frac{4\eta\alpha}{P\,\tilde\beta(\varepsilon)},
$$
so increasing P reduces
required auditing one-for-one. This is the cleanest comparative static
in the model: if we can double enforceable penalties, we can (roughly)
halve audit frequency while preserving incentives.
The limitation is that P is
not a free variable in many applications. Legally, penalties may be
capped; operationally, harsh penalties may induce churn or
nonparticipation; ethically, noisy audits create false positives whose
burden scales with P. In
practice, P often must be
chosen jointly with the audit threshold τ to satisfy an IR-type constraint
of the form
expected surplus from
participation ≥ λ P pfp(τ),
where pfp(τ)
is the false-positive probability for truthful agents. This constraint
introduces a familiar compliance tradeoff: higher penalties can reduce
audit frequency, but only if we simultaneously reduce false positives
(e.g., by raising τ or
improving measurement), otherwise we simply transfer welfare away from
truthful participants.
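A minimal check of this IR-type constraint, using the Laplace false-positive rate e−τ/σ from the earlier audit example and a hypothetical surplus level, is sketched below; it also illustrates that doubling P while halving λ leaves the truthful false-positive burden unchanged unless τ (or measurement quality) also improves.

```python
import math

def ir_satisfied(expected_surplus, lam, P, tau, sigma):
    """Participation check: expected surplus from truthful play must cover the
    expected false-positive penalty burden lambda * P * p_fp(tau), where p_fp
    is the truthful flag probability under Laplace(0, sigma) noise."""
    p_fp = math.exp(-tau / sigma)
    burden = lam * P * p_fp
    return expected_surplus >= burden, burden

# Hypothetical numbers: doubling P and halving lambda preserves deterrence
# but leaves the false-positive burden on truthful agents unchanged.
print(ir_satisfied(expected_surplus=0.05, lam=0.20, P=1.0, tau=0.1, sigma=0.05))
print(ir_satisfied(expected_surplus=0.05, lam=0.10, P=2.0, tau=0.1, sigma=0.05))
```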
The Lipschitz constant Lθ governs how costly an ε-slack equilibrium is in terms of realized objective value. When Lθ is large, even small within-ε deviations can translate into meaningful objective loss, so the tolerance term O(LθεT) pushes us toward a smaller ε⋆. But smaller ε typically reduces β(ε), which then raises λ⋆ via the incentive boundary. Thus, higher Lθ tends to shift the optimal design toward stricter enforcement: either more audits, or stronger/more informative audits, or a more stable learner (smaller η) to keep audit distortion manageable.
This comparative static also offers a concrete modeling lesson. In environments with discontinuities—e.g., capacity constraints, tie-breaking, threshold eligibility rules—Lθ may be effectively large or ill-behaved, making ε-slack costly. In such cases, it can be welfare-improving to smooth the allocation environment directly (randomized tie-breaking, smoothed allocation rules, or deliberately coarse outcome granularity) to reduce Lθ. In our framework, decreasing Lθ is economically equivalent to making tolerance cheaper, which reduces the need to audit aggressively for fine deviations.
While the exact optimizer of ε is application-specific through
β(ε), the frontier
problem has a generic shape:
$$
\min_{\varepsilon>0}\;\Big\{\,C_1\frac{\sqrt{\alpha
T}}{\beta(\varepsilon)} \;+\; C_2 L_\theta\,\varepsilon\,T\,\Big\},
$$
where C1, C2
absorb constants from the learning and incentive bounds. A useful
back-of-the-envelope rule emerges if β(ε) grows approximately
linearly over the relevant range, say β(ε) ≈ k ε
(this is a reasonable local approximation when ε is near a detection threshold and
far from saturation). Then the optimizer scales as
$$
\varepsilon^\star
\;\asymp\;
\sqrt{\frac{C_1}{C_2}}\;
\Big(\frac{\sqrt{\alpha T}}{k\,L_\theta\,T}\Big)^{1/2}
\;=\;
\sqrt{\frac{C_1}{C_2}}\;\frac{\alpha^{1/4}}{\sqrt{k\,L_\theta}\;T^{1/4}}.
$$
This heuristic captures the qualitative directions emphasized above:
ε⋆ increases with
long-sightedness α (we
tolerate larger lies because deterring small ones becomes expensive),
decreases with objective sensitivity Lθ, and
decreases slowly with the horizon T (longer horizons justify finer
enforcement because the per-round tolerance loss accumulates).
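For completeness, the scaling above is just the first-order condition of the approximated frontier under the local assumption β(ε) ≈ k ε:
$$
\frac{d}{d\varepsilon}\Big(C_1\frac{\sqrt{\alpha T}}{k\,\varepsilon}+C_2 L_\theta\,\varepsilon\,T\Big)
\;=\;-\,C_1\frac{\sqrt{\alpha T}}{k\,\varepsilon^{2}}+C_2 L_\theta\,T\;=\;0
\quad\Longrightarrow\quad
\varepsilon^\star=\Big(\frac{C_1\sqrt{\alpha T}}{C_2\,k\,L_\theta\,T}\Big)^{1/2}
=\sqrt{\frac{C_1}{C_2}}\;\frac{\alpha^{1/4}}{\sqrt{k\,L_\theta}\;T^{1/4}}.
$$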
Of course, if β(ε) saturates, the relevant regime is instead “choose ε near the onset of saturation,” after which further increases in ε are dominated by LθεT. This is precisely where closed-form audit models (like Laplace threshold tests) are valuable: they let us locate, quantitatively, the scale at which marginal detectability gains become small.
For applications, we find it helpful to treat the model as a checklist that forces all enforcement costs into a single accounting identity.
Step 1. Fix an audit test and estimate (or upper bound) the noise scale σ and the false-positive rate pfp(τ) for truthful behavior. Use these to compute or lower bound β(ε) over a grid of ε values. If β(ε) is uniformly tiny, no amount of clever learning will restore incentives without infeasible auditing; the binding constraint is measurement, not optimization.
Step 2. Pick P and τ to satisfy participation/acceptability constraints (formal IR, legal constraints, or reputational considerations). This step caps the maximal achievable β(ε).
Step 3. Choose ηH not only to minimize statistical regret, but to control η and thereby reduce the enforcement burden. When α is large, stability is a first-order design parameter.
Step 4. For each candidate ε, set λ = λ⋆(ε) and evaluate the resulting bound (or a simulation-based proxy for it). Choose ε by a one-dimensional search. If λ⋆(ε) hits 1 over a wide range, the feasible region is telling us that the audit technology is too weak for the strategic environment; we must revisit Steps 1–2 or accept weaker incentive guarantees.
These monotone directions are robust, but two caveats matter in practice. First, α is a reduced-form summary of intertemporal incentives; different discount profiles can yield the same α but different short-run behavior, which can matter if audits are bursty or if participation is intermittent. Second, Lθ hides environment-specific non-smoothness; in some markets the true welfare impact of small misreports is highly state-dependent, so a single global Lipschitz constant may be conservative.
These considerations motivate the extensions that follow. In many platforms, agents are heterogeneous in patience and in detectability (different αi, different noise levels, different compliance histories), making targeted auditing natural. Participation constraints may also be endogenous: auditing and penalties can change who shows up and when. Finally, with multi-dimensional types the choice of norm in ∥b − θ∥ ≥ ε becomes economically meaningful, because it determines which deviations are treated as “large”—a modeling choice that can be aligned with which dimensions are verifiable or welfare-critical.
We now sketch three extensions that arise almost immediately once we try to operationalize the framework on real platforms: heterogeneous agents (and the case for targeted audits), endogenous participation constraints, and genuinely multi-dimensional types where the choice of norm in “∥b − θ∥ ≥ ε” becomes a design decision rather than a mathematical convenience. Our goal here is not to redo the full analysis in each case, but to show how the same two levers—(i) stability limiting dynamic manipulation and (ii) audits creating an effective penalty gap—continue to organize the economics.
In many environments, agents differ along exactly the dimensions that determine the incentive boundary: patience/retention (an αi or a discount profile γi(⋅)), manipulability (how strongly their report affects the learner’s future actions), and detectability (a noise scale σi or an audit-gap function βi(ε)). A uniform audit rate λ designed for the worst case can therefore be extremely conservative: it over-audits short-lived or easily verified participants, and under-audits the few participants with both high strategic leverage and weak verifiability.
A minimal formal extension replaces the single constraint by
agent-specific constraints. If agent i has effective horizon αi and audit
technology implying an effective gap βi(ε),
then the same logic yields a sufficient condition of the form
λi βi(ε) ≥ 4η αi for
all i,
where λi
is the probability that agent i is audited in a given round (or,
more generally, the marginal audit probability applied to their report
in that round). This immediately suggests a targeted auditing
policy:
$$
\lambda_i^\star(\varepsilon)\;=\;\min\Big\{1,\;\frac{4\eta\,\alpha_i}{\beta_i(\varepsilon)}\Big\}.
$$
The economic content is straightforward: we devote auditing capacity to
where it purchases the most deterrence per unit of distortion. In
platform terms, this is the familiar move from “random checks on
everyone” toward “risk-based” or “priority” checks.
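A sketch of such risk-based allocation of audit probability, applying the per-agent rule above with hypothetical horizons and detectability inputs, is below.

```python
def targeted_audit_rates(eta, alphas, betas_eps):
    """Per-agent audit probabilities lambda_i = min{1, 4*eta*alpha_i / beta_i(eps)}.
    alphas and betas_eps are dicts keyed by agent id (hypothetical inputs)."""
    return {i: min(1.0, 4.0 * eta * alphas[i] / betas_eps[i]) for i in alphas}

# Illustrative: a patient, hard-to-verify agent absorbs most of the audit budget.
alphas = {"short_lived": 2.0, "typical": 20.0, "patient_opaque": 80.0}
betas  = {"short_lived": 0.5, "typical": 0.4, "patient_opaque": 0.2}
print(targeted_audit_rates(eta=0.001, alphas=alphas, betas_eps=betas))
# {'short_lived': 0.016, 'typical': 0.2, 'patient_opaque': 1.0}
```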
Once λ depends on the agent or on history, the audit policy itself becomes something agents may try to influence. If we tell agents “those who look suspicious will be checked,” then “looking suspicious” becomes a strategic state variable. This is not merely a behavioral concern; it is a mechanism-design issue because the targeting function can reintroduce dynamic incentives even if the learner over Π is stable.
One principled way to keep targeting from being manipulable is to
make the audit lottery depend only on exogenous or hard-to-manipulate signals:
tenure, transaction volume, coarse categories, or delayed verification
signals that are realized after the report is submitted. Another is to
commit to a targeting policy that is itself stable in the same
max-divergence sense: small changes in a report should not sharply
change audit probability. Concretely, one can require a “targeting
stability” condition such as
$$
\Pr(\text{audit } i \mid h)\;\le\;e^{\eta_{\mathrm{aud}}}\,\Pr(\text{audit } i \mid h')
$$
for adjacent histories h, h′ differing
in one report entry, and then fold ηaud into the same
incentive accounting. Intuitively, if audit probability can jump
discontinuously when an agent deviates, then audits become an additional
punishment channel (often attractive in practice, but normatively and
legally delicate); if audit probability changes smoothly, then it
behaves like a standard enforcement budget rather than a discretionary
sanction.
A further practical complication is that αi and βi(⋅) may not be known a priori. Platforms often infer patience from retention, and infer detectability from past audit outcomes. This creates a feedback loop: we audit to learn who is risky, but auditing itself changes incentives and behavior.
Within our framework, a conservative approach is to treat
heterogeneity via worst-case bounds: maintain a lower confidence bound $\underline\beta_i(\varepsilon)$ for each
agent based on audit data, and set
$$
\lambda_{i,t}\;=\;\min\Big\{1,\;\frac{4\eta\,\bar\alpha_i}{\underline\beta_i(\varepsilon)}\Big\},
$$
where ᾱi
is an upper bound on effective horizon (or a regulatorily chosen
worst-case for a class of agents). This is economically analogous to
“prudential” enforcement: we audit more when we are uncertain about
detectability. The limitation, of course, is that this can be overly
conservative early on; tightening it requires an explicit exploration
policy for audits, which reopens the question of whether the exploration
itself can be manipulated. Addressing this rigorously typically requires
combining strategic incentive bounds with bandit-style confidence
arguments, and—crucially—ensuring that the audit-allocation algorithm is
itself stable.
The earlier comparative statics treated participation constraints as an external cap on feasible penalties and false positives. In many applications, participation is an endogenous margin: agents can opt out, multi-home, reduce activity, or shift into less verifiable channels. Audits then affect not only truthfulness but also who shows up and what types are represented in the data used by the learner.
A simple reduced-form way to incorporate this is to introduce an
outside option Uiout
and require an ex-ante (or interim) individual rationality
constraint:
$$
\mathbb E\Big[\sum_{t\in T_i}\gamma_i(t)\big(u_i(\theta_{i,t},s_t)-\mathbf 1\{\text{audit at }t\}\cdot\mathrm{pen}_i(b_{i,t},z_{i,t})\big)\Big]\;\ge\;U_i^{\mathrm{out}}
\quad\text{for all relevant types.}
$$
Even if the incentive constraint is satisfied, an aggressive audit
policy can violate IR through false positives or through the perceived
“tax” of being monitored. Empirically, many platforms experience this as
a trust and retention problem: enforcement that is statistically
justified may still be commercially infeasible.
This produces a three-way tradeoff: (i) audits deter manipulation and protect the learner; (ii) audits distort current outcomes (because we mix away from the learner-chosen mechanism); and (iii) audits can shrink participation, which reduces both welfare directly and the informational richness of the environment the learner faces. In such settings, it can be optimal to shift effort from punitive enforcement toward design choices that raise baseline verifiability (lower σ), reduce the damage of small misreports (lower effective Lθ through smoother rules), or provide insurance against false positives (e.g., refundable deposits, appeal processes, or penalty escrow). Formally, these are ways to relax IR without giving up the effective penalty gap on large deviations.
One practical mechanism that sits naturally inside our model is to implement penalties via temporary holds or deposits rather than irreversible fines. Economically, a deposit can preserve deterrence while reducing expected welfare loss for truthful agents if disputes can be resolved with additional information. In reduced form, this corresponds to replacing pen(b, z) with a two-stage function: a provisional penalty applied when the signal is extreme, and a refund probability that depends on subsequent evidence. The effective gap β(ε) then depends on the net expected loss from misreporting after the refund stage, while the IR cost depends on the expected (possibly time-discounted) burden on truthful agents. This distinction matters when agents are liquidity constrained or when delayed refunds impose real costs—yet it also offers a policy-relevant design knob that is often more acceptable than raising P outright.
With Θ = [0, 1]d, the event “∥b − θ∥ ≥ ε” is not innocuous: it defines which deviations we invest in deterring. In one dimension, deviation size is unambiguous; in d dimensions, the same scalar ε can correspond to very different sets of deviations (balls in different norms) and therefore different behavioral margins.
Two forces push us toward careful norm choice. First, verifiability is frequently anisotropic: some coordinates are easy to verify (e.g., past transactions), others are inherently private (e.g., intent or cost). Second, the planner objective may be far more sensitive in some dimensions than others (a “high-stakes” coordinate). A single Euclidean tolerance may therefore be poorly aligned with either enforcement or welfare.
A flexible specification is to use a weighted norm, for example
$$
\|x\|_{W,1}=\sum_{k=1}^d w_k|x_k|
\qquad\text{or}\qquad
\|x\|_{W,2}=\sqrt{x^\top W x},
$$
and define deviations as “large” when ∥b − θ∥W ≥ ε.
Choosing W lets us encode “we
care more (or can verify more) along these dimensions.” On the audit
side, the effective gap becomes a function of this geometry:
$$
\beta(\varepsilon)\;=\;\inf_{\|b-\theta\|_W\ge\varepsilon}\Big(\mathbb E[\mathrm{pen}(b,z)\mid\theta]-\mathbb E[\mathrm{pen}(\theta,z)\mid\theta]\Big).
$$
On the welfare side, the tolerance loss is governed by an appropriate
Lipschitz constant in the same norm (or by separate constants per
coordinate). The design problem is then not only choosing ε, but choosing the enforcement
geometry that best matches both detection and objective sensitivity.
A further issue is statistical: if we use a single radial test in ℝd (e.g., penalize when ∥b − z∥2 > τ), then in high dimension the noise distribution’s tails and concentration can behave counterintuitively, and the gap β(ε) can deteriorate unless thresholds scale with d. In practice, platforms often adopt coordinate-wise or rule-based checks (“if any critical field is inconsistent, flag”) precisely to avoid this.
In our notation, a coordinate-wise threshold audit might take the
form
$$
\mathrm{pen}(b,z)\;=\;P\cdot\mathbf 1\Big\{\max_{k\le d}|b_k-z_k|>\tau_k\Big\},
$$
or a weighted sum of flags. Such rules can yield a larger and more
interpretable β(ε)
for deviations concentrated in sensitive coordinates, while keeping
false positives controlled by tuning the τk’s. The
limitation is that they privilege sparse deviations: a manipulator could
spread a lie across many dimensions, staying below every coordinate
threshold while still shifting the allocation. This is exactly where the
choice of norm and weights matters—an ℓ∞-style audit pairs
naturally with an ℓ∞-style tolerance, while
an ℓ1-style welfare
sensitivity may require audits that aggregate evidence across
coordinates.
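A small sketch of the limitation just noted: a coordinate-wise (ℓ∞-style) flag misses a lie spread across many dimensions even though its aggregate (ℓ1) size is large. The thresholds, the deviations, and the noiseless signal are all hypothetical choices made for illustration.

```python
import numpy as np

def coordinatewise_flag(b, z, taus):
    """Flag if any coordinate of the report disagrees with the verification
    signal by more than its coordinate threshold."""
    return bool(np.any(np.abs(b - z) > taus))

d = 10
theta = np.full(d, 0.5)
taus = np.full(d, 0.05)
z = theta.copy()                       # noiseless signal, so flags reflect the misreport alone

concentrated_lie = theta.copy()
concentrated_lie[0] += 0.3             # one big coordinate: caught
spread_lie = theta + 0.04              # 0.04 on every coordinate: evades every threshold

print(coordinatewise_flag(concentrated_lie, z, taus))                 # True
print(coordinatewise_flag(spread_lie, z, taus))                       # False
print("l1 size of the spread lie:", np.abs(spread_lie - theta).sum()) # 0.4, not small
```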
Heterogeneity and multidimensionality interact: if we can only verify
some coordinates well, then an “audit probability” is not the only
targeting dimension—we can also target which dimensions to verify more carefully. This suggests a richer audit lottery that chooses an audit mode (which
coordinates, which thresholds, which follow-up steps) rather than a
single πaud.
Conceptually, we can index audit modes by m ∈ ℳ, each delivering a gap βm(ε)
and a false-positive cost, and then choose a mixture over m to satisfy
$$
\sum_{m\in\mathcal M}\lambda_m\,\beta_m(\varepsilon)\;\ge\;4\,\eta\,\alpha
$$
at minimum expected distortion/IR cost. This is a direct analogue of
allocating an enforcement budget across different investigative
tools.
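This allocation can be written as a small linear program. The sketch below uses scipy.optimize.linprog for illustration; the per-mode gaps and distortion costs are hypothetical, and the additional per-round budget constraint (total audit probability at most one) is an assumption added to make the mixture interpretable as a single lottery.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical audit modes: (effective gap beta_m(eps), per-use distortion/IR cost).
betas = np.array([0.60, 0.30, 0.10])   # document check, automated anomaly test, light ping
costs = np.array([0.80, 0.30, 0.05])
eta, alpha = 0.002, 40.0
deterrence_needed = 4.0 * eta * alpha   # = 0.32

# minimize  sum_m cost_m * lambda_m
# s.t.      sum_m beta_m * lambda_m >= 4 * eta * alpha   (incentive constraint)
#           sum_m lambda_m <= 1,  0 <= lambda_m <= 1     (per-round audit budget)
res = linprog(c=costs,
              A_ub=np.vstack([-betas, np.ones_like(betas)]),
              b_ub=np.array([-deterrence_needed, 1.0]),
              bounds=[(0.0, 1.0)] * len(betas),
              method="highs")
print("audit mixture:", np.round(res.x, 3), "expected cost:", round(res.fun, 3))
```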
These extensions sharpen the main message rather than overturn it. As we add heterogeneity, participation responses, and multidimensional types, the same accounting identity persists: dynamic manipulation benefits scale with stability and horizon, while deterrence comes from an expected penalty gap created by audits. What changes is that the designer now has additional degrees of freedom—who to audit, when to audit, which dimensions to verify, and how to structure penalties to preserve participation.
The pragmatic takeaway is that “audit rate” is not a single scalar in realistic deployments. It is a policy surface over agents, histories, and type dimensions, and it must be designed with two constraints in mind: it should be hard to game (or explicitly stable), and it should be aligned with what the platform can actually verify and what the objective truly cares about. These considerations set the stage for the final discussion, where we connect the reduced-form objects in our theory to concrete platform mechanisms such as refund holds, delayed allocation, and random checks, and to the policy questions they raise.
Our model treats “auditing” as an abstract enforcement primitive: with probability λ we switch from the learner-chosen mechanism to an audit mechanism πaud that produces an expected incremental loss of at least β(ε) for deviations larger than ε. On platforms, however, enforcement rarely looks like a literal re-running of the transaction under a separate rule. Instead, it is implemented through operational controls—refund holds, delayed allocation, and random checks—that simultaneously (i) create a verifiable signal linked to the underlying type, and (ii) impose a cost on misreporting. The practical value of the framework is that it lets us translate these controls into the same two accounting objects: a deterrence term (an effective penalty gap) and a distortion term (how much auditing disrupts allocation and participation).
A widely used enforcement tool is not an irreversible fine, but a temporary hold: the platform withholds a payment, blocks a withdrawal, freezes a listing, or delays a rebate until some verification clears. This is economically close to a deposit mechanism. The key observation is that a hold can generate a large β(ε) even when the platform is reluctant (legally or reputationally) to levy harsh punishments. If a misreport increases the probability that funds are held (or held longer), then the expected cost to the agent can be substantial once we account for discounting and liquidity constraints.
In our notation, this corresponds to a penalty that is not a one-shot
subtraction P ⋅ 1{|b − z| > τ},
but a stream of losses: a hold amount H applied with some probability and
a release time that depends on subsequent evidence. Reduced-form, one
can write an equivalent per-round expected penalty in utility units
as
𝔼[pen(b, z) ∣ θ] ≈ Pr (hold ∣ b, θ) ⋅ κ(H, duration),
where κ(⋅) converts the hold
into an effective utility cost given the agent’s time preferences and
liquidity. The platform designer’s advantage is that the effective cost κ can be large even when the permanent monetary penalty is small. This helps with participation constraints: a
truthful agent may face an occasional hold (a false positive),
but—conditional on providing documentation or waiting out the hold—the
expected permanent loss can be near zero. In terms of our earlier
objects, deposit-like enforcement can keep β(ε) high while reducing
the IR burden relative to raising P.
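As a reduced-form illustration (every functional form here is an assumption, not the paper’s model), the sketch below converts a temporary hold into an expected per-round penalty by combining a flag probability, a liquidity cost of delayed funds, and a refund probability.

```python
def expected_hold_penalty(p_hold, H, delay_rounds, per_round_discount, refund_prob):
    """Reduced-form expected cost of a temporary hold (illustrative assumptions).

    p_hold:             probability the signal triggers a hold given (b, theta)
    H:                  amount withheld
    delay_rounds:       rounds until resolution
    per_round_discount: agent's per-round discount factor (liquidity proxy)
    refund_prob:        probability the hold is released after review
    """
    liquidity_cost = H * (1.0 - per_round_discount ** delay_rounds)  # cost of waiting
    permanent_loss = H * (1.0 - refund_prob)                         # cost if not refunded
    return p_hold * (liquidity_cost + permanent_loss)

# Hypothetical numbers: a truthful agent is flagged rarely and almost always
# refunded; a large misreport is flagged often and rarely refunded.
truthful = expected_hold_penalty(p_hold=0.05, H=1.0, delay_rounds=5,
                                 per_round_discount=0.99, refund_prob=0.98)
deviator = expected_hold_penalty(p_hold=0.60, H=1.0, delay_rounds=5,
                                 per_round_discount=0.99, refund_prob=0.10)
print(f"truthful burden ≈ {truthful:.4f}, deviation cost ≈ {deviator:.4f}, "
      f"effective gap ≈ {deviator - truthful:.4f}")
```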
The limitation is that holds impose real costs that our simple ui(θ, s) ∈ [0, 1] normalization hides: liquidity, risk, and administrative burden. Moreover, holds create a new distributional concern: agents with fewer outside financing options bear a larger welfare loss for the same nominal hold. This immediately makes enforcement policy-relevant, because “equal treatment” in nominal penalties is not equal treatment in economic incidence.
A second common mechanism is to delay the final allocation or reward until some ex post outcome is observed: for example, pay sellers only after delivery confirmation, release gig payments after customer validation, or allocate prominent search ranking only after quality signals arrive. Economically, delay serves two roles. First, it creates a verification signal correlated with the hidden type (quality, effort, cost): zi, t need not be a noisy measurement of θi, t directly; it can be an outcome proxy (returns, disputes, churn) that becomes informative only with time. Second, delay changes incentives even without explicit penalties: if misreporting yields short-run gains but increases the probability of later reversal, chargeback, or de-ranking, then the expected net gain from deviating falls.
In our framework, these policies enlarge the designer’s menu for achieving a positive effective gap β(ε). Instead of relying on a direct statistical test of |b − z|, the platform can engineer z by choosing what to reveal and when. One can think of “delayed allocation” as moving some of the stage utility into the future, where it is conditioned on realized evidence. The mechanism remains online, but the verification becomes endogenous to the platform’s timing choices.
The main tradeoff is immediate: delay is itself a distortion. It can reduce matching efficiency, slow market clearing, and harm user experience. Thus it plays the same role as λ in the regret decomposition: it is a lever that improves incentives but can worsen the objective in the short run. In high-velocity environments (ad auctions, real-time pricing), even small delays can be infeasible, so the relevant analog is “delayed settlement” or “post-auction adjustments” rather than delayed allocation per se.
The third operational primitive is the simplest: random checks. Platforms often implement enforcement as a fixed-rate sampling rule: “x% of transactions are reviewed,” “x% of accounts are verified,” or “x% of claims require documentation.” This corresponds most directly to our constant λ mixing, and it has a normative advantage: transparency and predictability. If agents can anticipate the enforcement environment, then the platform can credibly claim due process and reduce perceptions of arbitrary punishment.
From an incentive perspective, random checks are attractive because they are hard to game. They do not create an obvious targeting rule that agents can manipulate by changing their “suspicion profile.” In our language, random auditing makes the audit lottery stable almost by construction. The economics, however, is that uniform randomness is rarely cost-effective: it expends auditing capacity on low-risk events. This is why many platforms migrate to risk-based systems. The challenge is to preserve the anti-manipulation properties of uniform randomness while improving cost-effectiveness—precisely the point where stability-style restrictions on the targeting policy become valuable.
Once we interpret πaud as an operational enforcement system, several policy constraints become central rather than peripheral.
First is the handling of false positives. If penalties (including holds and de-rankings) are triggered by noisy signals, then false positives are unavoidable. A platform can respond by lowering P, raising thresholds τ, or reducing λ, but these moves weaken deterrence unless they are offset by improved verifiability (higher β(ε)). An alternative response—common in practice—is procedural: appeals, second-stage checks, and human review. In our reduced-form language, these procedures change both β(ε) and the truthful-agent cost by reshaping the distribution of net penalties conditional on truthful behavior.
Second is proportionality. Regulators and courts often evaluate enforcement not only by error rates but also by penalty severity relative to the harm. Our model makes explicit that P enters deterrence linearly, but proportionality may cap feasible P (or restrict the form of pen(⋅)). This pushes platforms toward enforcement mechanisms that are “large enough in expectation” to deter manipulation while remaining proportional in any single realization—again pointing to deposits and staged penalties rather than large irreversible fines.
Third is privacy and transparency. Auditing requires collecting and processing data; verification signals zi, t often include sensitive information. Paradoxically, a platform may need to be transparent about the existence and intensity of audits to sustain deterrence, while being constrained in what it can reveal about targeting and evidence. Our framework suggests a practical compromise: publicly commit to a coarse audit lottery (a baseline λ and broad categories), while keeping detailed detection features private, and ensure that whatever targeting remains is stable enough to limit strategic gaming.
A recurring operational lesson is that “audit more” is usually the least efficient knob. In the regret bound, audits harm the objective directly through mixing, whereas better verification improves incentives without requiring as much distortion. Interpreted literally, investing in verification systems—better documentation pipelines, stronger identity checks, cryptographic receipts, better anomaly detection, or improved measurement that reduces noise scale σ—is equivalent to increasing β(ε) and thereby reducing the minimum λ needed for incentive compatibility.
This framing also clarifies why platforms often bundle enforcement with product changes. For example, requiring standardized invoices or escrowed shipment tracking is not merely an anti-fraud policy; it is a way to raise the effective penalty gap for misreporting by making audits more informative. In our terms, the platform is engineering a larger ρ(ε) (detection advantage) and thus a larger β(ε) for the same penalty schedule.
Several modeling gaps become salient once we take platform details seriously.
In many environments, the audit signal is not exogenous noise around θ, but an outcome influenced by the agent (or by both sides of a transaction). Returns, ratings, and disputes can be gamed. This turns the audit stage into an additional strategic interaction, and β(ε) becomes equilibrium-dependent. A robust theory would treat auditing as a mechanism with its own incentive constraints, rather than as a fixed penalty gap.
Our stability condition bounds the effect of a single report change on the learner’s output. On platforms, manipulation may be coordinated: fake accounts, rings, or coalitions that submit correlated reports. Extending stability and the incentive argument to group deviations would require group-privacy-style bounds and a corresponding strengthening of enforcement. Practically, this points to audit designs that target network structure (shared payment instruments, shared devices) rather than only pointwise report inconsistencies.
Many platforms do not observe a full-information objective Gt(θt, st) each round; they observe noisy proxies or bandit feedback (clicks, conversions). This complicates both learning and incentive control. The platform’s enforcement policy then affects not only agent incentives but also the statistical quality of feedback, potentially biasing the learner. A unified analysis would couple bandit regret with incentive constraints and would treat audits as both deterrence and exploration.
Targeted audits and holds can disproportionately burden certain groups if verification error rates differ. In our reduced form, this means σ (or β(ε)) varies systematically across subpopulations, and a uniform enforcement rule can generate unequal false-positive rates. Addressing this requires adding explicit fairness constraints to the enforcement optimization (constraints on false-positive incidence, hold duration, or appeal success), and then re-solving the incentive-feasibility condition under those constraints.
Taken together, refund holds, delayed allocation, and random checks are not ad hoc “trust and safety” patches; they are implementations of a common economic idea: create an expected loss from large deviations that is hard to manipulate and acceptable under participation and policy constraints. Our framework is intentionally reduced-form, but it clarifies what practitioners already grapple with: enforcement is most effective when it improves the of verification (raising β(ε)) rather than merely increasing audit frequency λ, and it is most sustainable when it is procedurally legitimate (controlling false positives, enabling appeal) rather than maximally punitive. The open technical agenda is to endogenize the audit signal, incorporate partial-feedback learning, and formalize fairness and due-process constraints—so that the same stability-plus-audits logic can be carried from a stylized repeated game into the operational realities that platforms and regulators actually face.