Online platforms increasingly learn and redesign their rules while interacting with strategic users. Marketplaces update ranking and fee policies, ride-hailing systems adapt dispatch and surge, and subscription services tune personalized offers and product features. A natural theoretical lens is online mechanism learning: in each round t ∈ [T], the platform selects a mechanism (or a lottery over mechanisms), agents observe it and interact, and the platform updates using the realized feedback. The promise of this paradigm is to combine economic incentives with the statistical efficiency of online learning.
A key friction, however, sits one step before the usual incentive-compatibility question of truthful reporting: in most platform settings, participation itself is discretionary and strategically meaningful. A worker can reject a gig, a seller can list elsewhere, and a user can churn. This participation decision is not merely an exogenous missing-data problem. It is an action that agents can use to manage current payoffs and, when the platform learns from observed interactions, to shape the platform’s future policy. Put differently, even if we succeed in making truthful reporting optimal conditional on showing up, we may still fail to obtain economically meaningful guarantees if agents can profitably manipulate the learning process by selectively withholding information.
We view endogenous participation as the missing piece in incentive-compatible online learning. The classical logic of dominant-strategy or Bayesian IC treats the mechanism as fixed; the agent’s report affects the current allocation and payment, but not the mechanism itself. In contrast, when the platform’s mechanism evolves with history, a strategic agent may tolerate a small immediate loss to push the learner toward more favorable future mechanisms. The literature has recently shown that stability of the learning rule—formalized via differential privacy–type notions—limits such intertemporal manipulation incentives. Yet most of these results assume that in each round the agent must “play” (report, bid, or demand) and cannot opt out without consequence. On real platforms, abstention is ubiquitous and often costless up to an outside option: the driver keeps driving elsewhere, the seller sells on another marketplace, or the user simply does nothing.
Our starting observation is that abstention is itself a message—a report of ⟂—that changes the data stream seen by the learner. When the platform’s policy updates are sensitive to observed interactions, an agent may benefit from strategically replacing informative interaction with nonparticipation precisely in rounds where participation would push the platform in an unfavorable direction. The resulting selection can be self-fulfilling: the platform learns from a distorted population, adapts toward rules that fit the selected participants, and thereby further discourages the missing users from returning. This is a familiar pattern in practice (platform “death spirals” and adverse selection), but it is rarely integrated into the core incentive-compatibility requirement of online mechanism design.
Three motivating examples illustrate why participation is central.
In many labor platforms, workers observe an offered contract or dispatch rule and then choose whether to accept. Acceptance is both an allocation decision and an information revelation: by accepting, a worker reveals something about their type (availability, costs, preferences) and generates performance data that informs future dispatch and pricing. A sufficiently forward-looking worker may reject marginal jobs to “train” the system—e.g., to increase future offered wages or to avoid being classified as willing to accept low-pay tasks. This behavior is qualitatively different from misreporting within a fixed contract; it is manipulating the learning channel by controlling whether one’s signal enters the dataset at all.
Sellers frequently list across multiple marketplaces and can shift inventory in response to platform policies. When a marketplace adjusts fees, ranking, or advertising rules based on observed listings and sales, participation is a strategic lever. Withholding high-quality inventory today can make the platform infer lower demand elasticity or lower quality mix, potentially changing tomorrow’s fees or search weights. Conversely, a seller might temporarily flood the platform to influence the learner in a favorable direction. In both cases, the platform’s learning problem is entangled with the sellers’ participation incentives, and the induced selection bias is endogenously generated.
In subscription or freemium models, users can disengage at any time. Platforms learn from engagement signals and churn events to tailor retention offers and content. If users anticipate that disengagement triggers discounts or premium features, then opting out becomes a strategic bargaining tool. More subtly, even absent explicit “win-back” offers, users can affect the platform’s learning about demand by disappearing, thereby shaping future product or pricing policies. This motivates a participation-aware notion of IC: we cannot evaluate welfare or regret on a population that endogenously self-selects in response to the algorithm itself.
These examples point to a broader economic theme: learning and incentives jointly determine who generates the data, not just how agents behave within a given mechanism. Once participation is endogenous, standard performance metrics become ambiguous. A platform can appear to have low regret “on the observed sample” simply because the sample has been strategically filtered to be easy to serve. Conversely, a platform can fail to learn not because of statistical limitations, but because strategic users rationally deny it the data needed to improve.
Our contribution is to integrate participation decisions into the incentive and learning guarantees in a simple, modular way. We model a repeated environment in which, at each round t, each agent who has an opportunity to interact observes the platform’s announced mechanism lottery and their private type θi, t, chooses a participation action ai, t ∈ {0, 1}, and, if participating, submits a report bi, t. Nonparticipation yields an outside option payoff oi, t, capturing the fact that abstention is rarely “punished” by the platform in the same way as a detected misreport could be. The platform’s learner updates from public history and observed outcomes, and thus today’s participation and reports can influence tomorrow’s mechanism choice. Agents may be long-sighted, optimizing discounted utility; we summarize their effective degree of forward-lookingness by a parameter α(A) for an agent class A.
Within this model, we focus on a participation-and-reporting notion of Nash incentive compatibility (PR-NIC): no agent should profit by any joint deviation that alters both their participation pattern and their reports over time. This requirement is intentionally stronger than “truthful reporting conditional on participation,” because the main strategic problem is precisely the option to withdraw and thereby manipulate the learning rule. Importantly, we do not treat missingness as noise; we treat it as equilibrium behavior.
The economic logic of our main conditions is transparent. If the platform’s learning rule is stable—in the sense that changing one agent’s current effective report between θ and ⟂ cannot substantially change the distribution over future mechanisms—then the agent’s potential future benefit from manipulation is bounded. Stability is thus a lever for incentive provision: it limits how much an agent can gain by either misreporting or abstaining. But stability alone does not create truthful behavior; it only upper bounds the value of strategic manipulation. To turn that bound into a strict incentive constraint, we combine stability with two “immediate payoff” forces that operate in the current round.
First, to deter misreporting, we allow the platform to mix in a commitment or audit-like mechanism with a penalty gap β > 0—a primitive that captures credible consequences for lying when the enforcement mechanism is used. The platform plays this commitment mechanism with probability λ each round. The product λβ is the expected per-round deterrence from misreporting, and it must dominate the maximum discounted future gain from manipulating the learner, which stability bounds in terms of η and α(A).
Second, to deter strategic abstention, we require a participation margin: truthful participation must beat the outside option by at least a buffer. Formally, letting κi, t denote the gap between the agent’s expected utility from truthful participation in round t and the outside option oi, t, we require κi, t to exceed the same manipulation bound derived from stability. Economically, this condition says that if abstention is almost as good as participation, then a long-sighted agent can cheaply “buy” influence over the learning process by withholding data. Preventing this requires either (i) making participation strictly attractive (e.g., via better terms, subsidies, minimum earnings guarantees), or (ii) reducing the agent’s ability to influence future mechanisms (more stability), or (iii) reducing long-sightedness (e.g., shorter horizons, myopia), which is typically not under the designer’s control.
These ingredients deliver a clean message for platform design: to learn from strategic users, the platform must make participation strictly worthwhile. In practice, this aligns with policies such as guaranteed earnings in gig platforms, listing fee rebates to retain sellers, or retention benefits for subscribers. In our framework, these instruments are not merely participation boosters; they are incentive constraints that stabilize the data-generating process needed for learning.
Conditional on equilibrium participation and truthful reporting, the platform faces a standard online learning problem over the realized stream of effective types (with ⟂ encoding absence). This yields a participant-robust notion of regret: performance is compared to the best fixed benchmark mechanism evaluated on the realized participating population. We emphasize both the appeal and the limitation of this metric. It is appealing because it matches what the platform can actually serve—agents who are present—and it allows us to apply familiar regret guarantees once incentives pin down the data stream. It is limited because it does not automatically translate into welfare for the full population when participation is low or highly selected; bridging that gap requires additional assumptions (e.g., a lower bound on participation or an additive objective structure).
Finally, we highlight a sharp impossibility intuition that emerges when outside options are high and agents are very forward-looking. If κi, t can be driven arbitrarily close to zero for many rounds, then abstention becomes an almost costless manipulation channel. In such environments, no mechanism can simultaneously (i) be robustly incentive compatible in the participation-and-reporting sense and (ii) learn quickly enough to guarantee sublinear regret against meaningful benchmark classes. This negative result is not an artifact of our proof technique; it reflects a genuine tension between adaptivity and voluntary participation that platform designers confront in practice.
The remainder of the paper formalizes these ideas, derives sufficient conditions under which truthful threshold participation is a Nash equilibrium, and characterizes the resulting learning guarantees and tradeoffs. We then relate our approach to existing work on stability-based incentives, dynamic pricing, individual rationality, and selection bias in learning, situating endogenous participation as a central design constraint rather than a secondary complication.
Our paper sits at the intersection of online learning, mechanism design, and strategic data generation. The closest conceptual starting point is the line of work showing that stability of the platform’s learning rule can serve as a substitute for commitment in repeated strategic environments. In particular, Huh–Kandasamy develop a framework in which an online algorithm selects a mechanism each round and agents may attempt to manipulate future choices by deviating today. Their key technical move is to impose a weak differential-privacy-type condition (or, more broadly, a bounded-influence condition) on the learning dynamics, which limits how much any single agent can shift the distribution over future mechanisms. When combined with a simple commitment device (or a small amount of “forced exploration” with penalties), this yields incentive compatibility and no-regret learning guarantees in a repeated setting. We adopt this stability-first logic, but we expand the action space in the direction that is most salient for platforms: agents can choose whether to participate at all, and nonparticipation yields an outside option. This modifies the strategic problem: abstention becomes a manipulation channel that operates through missingness rather than through misreports. Formally, it is natural to treat abstention as an effective report ⟂, but economically it is distinct because it is typically not directly punishable in the same way as a detected lie. Our conditions therefore require not only a reporting deterrent (via commitment penalties) but also a participation margin that makes withdrawing information costly enough relative to the bounded future benefit implied by stability.
A second related strand studies online mechanism design and dynamic pricing with strategic buyers. In dynamic pricing, the seller posts prices over time while learning demand, and buyers may time purchases or misrepresent willingness-to-pay to influence future prices. This literature highlights the basic intertemporal manipulation problem: an agent may forgo a profitable trade today to induce lower prices tomorrow. Mechanism-learning versions of this problem arise in repeated auctions, posted-price mechanisms, and bandit-style pricing with strategic demand. Many models assume that buyers arrive exogenously and are present each period, so the strategic decision is primarily when to buy or what to report, rather than a richer participation choice that affects what data enters the learner. Our participation model can be read as capturing exactly this acceptance/rejection channel, but we emphasize a general platform setting in which (i) the platform learns not only a scalar demand curve but a mapping from reported types to allocations, and (ii) nonparticipation carries an outside option that may be close to indifference. The latter is crucial: when the outside option makes the per-round participation margin κi, t small, strategic delay or refusal becomes an inexpensive way to influence learning, and we show this creates a genuine barrier to sublinear regret under robust incentives. This complements dynamic pricing results that identify regimes in which strategic buyers force the seller to slow learning or commit to stable policies.
Our work also relates to the large literature on individual rationality (IR) and participation constraints in mechanism design. In static mechanism design, IR ensures that truthful participation yields nonnegative utility relative to an outside option (often normalized to zero), and the designer may use subsidies, entry fees, or distortions to satisfy participation constraints. In dynamic environments, IR becomes more subtle: participation can be voluntary in each period, outside options may be time-varying, and agents may care about continuation values. Our focus is not on optimizing subject to dynamic IR per se, but on using a participation margin as an incentive instrument that prevents strategic selection from undermining learning. The condition κi, t ≥ 4η α(A) is stronger than conventional per-round IR (κi, t ≥ 0): it is a buffer that absorbs the maximum discounted benefit from manipulating the learner. Economically, this aligns with practice in which platforms use minimum earnings guarantees, temporary rebates, or retention benefits not merely to increase usage but to stabilize the data-generating process. At the same time, we acknowledge the limitation: such margins may be expensive or infeasible when competition raises outside options, and our impossibility discussion can be interpreted as a formalization of the resulting “race to the outside option” that constrains platform learning.
A fourth connection is to the machine-learning literature on selection bias, missing data, and strategic data acquisition. Classical statistical treatments often model missingness as exogenous (e.g., missing at random) or treat selection as an adversarial but non-strategic process. In many platform domains, however, selection is endogenous and strategic: the mechanism affects who participates, which affects what the platform learns, which affects the next mechanism. This feedback loop resembles strategic classification and performative prediction, where agents respond to a deployed model and thereby shift the data distribution. Our model makes this loop explicit in an online mechanism context by letting the effective type profile θ̃t depend on the mechanism lottery through agents’ participation decisions. The participant-robust regret benchmark we use is deliberately conservative: it evaluates performance on the realized effective types (including ⟂). This choice parallels evaluation in learning under selective labels, where performance guarantees are sometimes only possible on the observed subpopulation unless one imposes additional structure. We therefore view our welfare translation results (under additive objectives and bounded selection) as an analogue of assumptions that connect selective-sample performance to population performance.
Relatedly, there is growing interest in incentive-aware online learning where agents may strategically manipulate observations. Differential privacy and other algorithmic stability notions have emerged as a unifying toolkit: stability simultaneously supports generalization and limits the benefit of influencing future decisions. In repeated games and adaptive mechanism design, stability bounds the value of intertemporal deviations because the future policy distribution does not react too much to any single-period change. Our weak-DP assumption follows this tradition but is tailored to a natural deviation in platform settings: switching between participation and abstention, i.e., between θ and ⟂. This is weaker than full per-agent DP over all possible reports, and it matches the economic lever we care about. The novelty is not the stability inequality itself but the way it interacts with outside options: even a small bounded future influence can dominate incentives when abstention is nearly costless. This is precisely why we separate the deterrence of misreports (handled by commitment penalties) from the deterrence of abstention (handled by participation margins).
We also connect to the literature on regret notions in strategic environments. Standard no-regret guarantees treat the data sequence as exogenous. In strategic settings, the data are endogenous, so one must specify (i) an equilibrium concept generating the sequence and (ii) a regret benchmark that remains meaningful. Some papers study policy regret, Stackelberg regret, or equilibrium regret notions that incorporate the effect of today’s action on tomorrow’s data. Our approach is closer in spirit to a two-step reduction: first, enforce a participation-and-reporting Nash equilibrium (PR-NIC) using stability and a commitment mixture; second, conditional on this equilibrium, apply a standard full-information regret analysis to the realized effective-type sequence. This yields a clean modular guarantee, but it also highlights the boundary of what can be achieved: absent conditions like κi, t ≥ 4η α(A), the equilibrium itself may generate an arbitrarily selected sample, in which case learning-theoretic regret becomes decoupled from welfare.
Finally, our modeling choice is related to empirical and theoretical work on platform retention, multi-homing, and dynamic participation. In industrial organization and market design, platform policies influence entry, churn, and cross-platform allocation of supply, generating dynamics that can amplify small design changes into large participation shifts. These models often treat platform policy as chosen by the platform with some commitment, and participation responds as a demand or supply function. We instead emphasize adaptive policy choice: the platform learns, and the resulting endogeneity of participation affects what can be learned. Our “design frontier”—which makes the enforcement intensity λ depend on long-sightedness α(A), stability η, commitment strength β, and participation margins—can be read as a compact comparative-statics summary of a broader retention-versus-learning tradeoff. We do not attempt to endogenize outside options through competition or to model rich cross-platform dynamics; rather, we provide a mechanism-learning primitive that can be embedded into such IO environments.
In sum, the paper contributes to a growing view that incentive-compatible learning is not only about eliciting truthful reports but also about sustaining the voluntary participation that generates the data. Stability-based tools remain powerful, but once abstention is available, they must be paired with both (i) enforceable penalties for lying and (ii) a strictly positive participation margin (or an equivalent subsidy/benefit) that blocks strategic withholding of data. The next section formalizes this environment and makes precise how weak stability, commitment mixing, and outside options jointly determine equilibrium behavior and learning guarantees.
We study an online platform that repeatedly deploys a mechanism while learning from strategically generated data. Time is discrete, indexed by rounds t ∈ [T]. There is a (potentially large) population of agents i ∈ [n], but in any given round only some agents have an opportunity to interact with the platform. We represent this by an availability set Ti ⊆ [T]: agent i can potentially participate only at rounds t ∈ Ti. This captures, for example, sporadic arrivals in a marketplace, heterogeneous work schedules in a labor platform, or intermittent eligibility in an allocation system.
At each round t ∈ Ti, agent i privately observes a type θi, t ∈ Θi (discrete in our core theorem, with standard extensions). The crucial feature is that participation is endogenous. After observing the platform’s current mechanism (more precisely, a publicly announced ), the agent chooses a participation decision ai, t ∈ {0, 1}. If ai, t = 1, the agent submits a report bi, t ∈ Θi; if ai, t = 0, the agent does not interact with the mechanism and instead receives an outside option payoff oi, t. We allow oi, t to vary across agents and time, reflecting changing alternative opportunities, multi-homing, or opportunity costs; for simplicity it is realized immediately and does not depend on the report (since no report is submitted when abstaining). As in the rest of the paper, we normalize utilities to be bounded: the per-round utility from participating satisfies ui(θ, s) ∈ [−1, 1], and outside options may be taken in the same bounded range without loss of generality.
A convenient formal device is to treat abstention as a special “null” report ⟂. Define the effective type and effective report
$$
\tilde\theta_{i,t}=\begin{cases}
\theta_{i,t} & \text{if } a_{i,t}=1,\\
\perp & \text{if } a_{i,t}=0,
\end{cases}
\qquad
\tilde b_{i,t}=\begin{cases}
b_{i,t} & \text{if } a_{i,t}=1,\\
\perp & \text{if } a_{i,t}=0.
\end{cases}
$$
Let θ̃t = (θ̃1, t, …, θ̃n, t)
and b̃t = (b̃1, t, …, b̃n, t).
Mechanisms are defined on the augmented report space that includes ⟂, so the platform’s allocation/payment rule
is well-defined even when some agents are absent. Economically, however,
we emphasize that ⟂ is not “just
another message”: it corresponds to opting out and collecting oi, t,
which in many applications cannot be penalized in the same direct way as
an identified misreport.
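As a concrete illustration (a minimal Python sketch; the function name and the use of None for ⟂ are our own conventions, not part of the model):

```python
from typing import Sequence

ABSTAIN = None  # sentinel standing in for the null report ⟂

def effective_reports(a: Sequence[int], b: Sequence[object]) -> list:
    """Assemble the effective report profile b̃_t from participation
    decisions a_{i,t} ∈ {0,1} and submitted reports b_{i,t}; agents
    with a_{i,t} = 0 are recorded as ABSTAIN (⟂)."""
    return [b_i if a_i == 1 else ABSTAIN for a_i, b_i in zip(a, b)]

# effective_reports([1, 0, 1], ["high", "low", "mid"]) -> ["high", None, "mid"]
```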
Each round produces an outcome st in an outcome space S (which may include allocations, prices, and transfers). The platform evaluates outcomes using an objective function Gt(θ̃t, st) ∈ [−1, 1]. The dependence on θ̃t allows the objective to explicitly treat missing participation as part of the state (e.g., welfare over realized trades, service level conditional on active supply, or revenue net of empty allocations). We will work in a full-information feedback model: after the round concludes, the platform observes Gt (equivalently, it can evaluate the per-round payoff that each candidate mechanism in Π would have produced given the realized interaction). This assumption is standard in stability-based mechanism-learning analyses and isolates the strategic-data issue from bandit feedback complications.
A (single-round) mechanism π is a map from effective report
profiles to a distribution over outcomes: for each b̃t, the
mechanism draws st ∼ π(b̃t).
The platform does not commit to a single mechanism ex ante; rather, it
runs an online learning algorithm that selects a mechanism each round as
a function of public history. Formally, let h < t denote
the public history up to t − 1
(including past announced lotteries, realized reports b̃ < t,
realized outcomes, and objectives). In round t the platform samples a “learning
mechanism” πtL
from a distribution qt(h < t)
over a benchmark class Π:
πtL ∼ qt(b̃ < t, G < t).
We interpret Π as a set of mechanisms that are well-behaved in a one-shot sense (e.g., each π ∈ Π induces truthful reporting conditional on participation), while the online learner chooses among them to track the evolving objective sequence.
Because the learning rule itself creates intertemporal
incentives—agents may distort today’s report or participation to
influence future choices of qt′—we
augment the learner with a simple commitment/audit device. In each round
t the platform publicly
announces and implements a mixture
πt = (1 − λ) πtL + λ πcom,
where λ ∈ [0, 1] is a fixed mixing probability and πcom is a commitment (audit) mechanism.
Operationally, one can think of πcom as a mechanism with
an enforceable verification/penalty rule, used infrequently but
predictably enough that deviations become unattractive. The platform’s
choice of λ will govern a
familiar tradeoff: larger λ
strengthens incentives but sacrifices learning performance because the
platform sometimes foregoes the learner’s preferred mechanism.
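The per-round protocol can be sketched as follows (Python pseudocode under our own naming; the paper specifies these objects abstractly, so the callables and their signatures are illustrative assumptions):

```python
import random

def run_round(q_t, pi_com, lam, history, get_reports, rng=random):
    """One platform round under the announced mixture
    π_t = (1 − λ)·π_t^L + λ·π_com (a sketch).

    q_t:         maps the public history h_<t to a sampled learning mechanism π_t^L.
    pi_com:      the commitment/audit mechanism, played with probability λ.
    get_reports: returns the effective report profile b̃_t chosen by agents
                 after observing h_<t and the announced lottery.
    """
    b_tilde = get_reports(history, lam)        # agents move on the announced lottery
    pi_learn = q_t(history)                    # π_t^L ~ q_t(h_<t)
    pi_t = pi_com if rng.random() < lam else pi_learn
    s_t = pi_t(b_tilde)                        # s_t ~ π_t(b̃_t)
    return b_tilde, s_t
```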
Agents observe h < t and the announced lottery πt when making participation and reporting decisions. The platform observes only effective reports and the objective feedback, as in standard online learning protocols.
Agents may be forward-looking. Let γi(t)
denote a (weakly) non-increasing discount factor applied to payoffs at
time t. Given a strategy
profile σ (which specifies
both participation and reporting contingently on histories and types),
the expected discounted utility of agent i is
Ui(σ; M, A) = 𝔼[∑t ∈ Tiγi(t)(ai, t ui(θi, t, st) + (1 − ai, t) oi, t)],
where the expectation is over types, mechanism randomness, and any
randomness in strategies. Following Huh–Kandasamy, we summarize the
degree of forward-looking behavior within a class of agents A by a long-sightedness parameter
α(A) that
upper-bounds the discounted value of future influence an agent can
obtain when the learning rule is stable. We do not re-derive α(A) here; we treat it as a
primitive carried by the agent population and discounting structure, and
we use it to express incentive constraints in closed form.
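Along a realized path, the bracketed sum has a direct computational analogue (a sketch; the trajectory format and names are ours):

```python
def discounted_utility(trajectory, gamma):
    """Realized-path analogue of U_i(σ; M, A): sum over rounds t ∈ T_i of
    γ_i(t)·(a_{i,t}·u_i(θ_{i,t}, s_t) + (1 − a_{i,t})·o_{i,t}).

    trajectory: iterable of tuples (t, a_it, u_it, o_it).
    gamma:      callable t -> γ_i(t), a weakly non-increasing discount.
    """
    return sum(gamma(t) * (u if a == 1 else o) for t, a, u, o in trajectory)
```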
The central algorithmic assumption is that the learner’s mapping from histories to distributions over mechanisms is stable with respect to any single agent’s ability to perturb the observed data stream. We adopt a weak differential privacy notion tailored to the participation channel: the learner should not react too strongly when one agent’s effective report at some round switches between an actual type and the null symbol ⟂.
Concretely, we assume that for each round t, the map qt is weakly
η-differentially private with
respect to one-agent changes of the form b̃i, τ = θ
versus b̃i, τ = ⟂
(holding everything else in the history fixed). In its standard
measurable form, for any two histories h < t and
h < t′
that differ only in one such effective report entry, and any event ℰ ⊆ Π,
qt(h < t)(ℰ) ≤ eη qt(h < t′)(ℰ).
This is weaker than requiring privacy with respect to arbitrary report
substitutions (it focuses on the economically salient deviation of “show
up versus withhold”), but it is strong enough to bound the maximum
change in the distribution over future mechanisms induced by one agent’s
one-round participation choice. The parameter η thus plays a dual role: smaller
η means more stability (harder
to manipulate) but typically slower adaptation; larger η means faster learning but greater
susceptibility to strategic influence.
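For a finite benchmark class Π, the event-wise inequality is equivalent to a pointwise likelihood-ratio bound, which can be checked directly; a sketch (our own helper, assuming the two distributions come from neighbouring histories differing in a single θ-versus-⟂ entry):

```python
import math

def is_weakly_dp(q, q_prime, eta, tol=1e-12):
    """Check q(π) ≤ e^η·q'(π) and q'(π) ≤ e^η·q(π) for every π in a finite Π.
    For discrete distributions this pointwise bound implies (and is implied by)
    the event-wise condition q_t(h)(E) ≤ e^η·q_t(h')(E)."""
    bound = math.exp(eta)
    for pi in set(q) | set(q_prime):
        a, b = q.get(pi, 0.0), q_prime.get(pi, 0.0)
        if a > bound * b + tol or b > bound * a + tol:
            return False
    return True
```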
Stability alone does not rule out profitable misreports when the agent participates; it only bounds the future benefit from any single deviation. To directly deter within-round deviations, we assume access to a commitment mechanism πcom that can impose a minimum expected utility loss on deviations. We summarize this property by a penalty gap β > 0: informally, conditional on participating, any deviation from truthful reporting under πcom reduces the agent’s expected contemporaneous utility by at least β relative to truthful reporting (for the relevant deviation set; our later results only need this gap against the manipulations that matter for learning). The platform mixes in πcom with probability λ, so an agent contemplating a misreport faces an expected penalty of order λβ, which we will compare to the maximal discounted manipulation benefit implied by η and α(A).
The platform’s goal is to maximize the cumulative objective:
$$
\mathrm{Alg}(M)=\mathbb{E}\Big[\sum_{t=1}^T
G_t(\tilde\theta_t,s_t)\Big],
$$
where θ̃t
is endogenously determined by agents’ participation decisions and st is the
realized outcome drawn from the chosen mechanism. Because the realized
data stream is equilibrium-generated, we define regret against a
benchmark that conditions on the realized effective types. Specifically,
for a target class Π of
single-round mechanisms, we measure
$$
\mathrm{Reg}^{\mathrm{part}}_T(M,\Pi)
=\max_{\pi\in\Pi}\sum_{t=1}^T\mathbb{E}_{s\sim\pi(\tilde\theta_t)}\big[G_t(\tilde\theta_t,s)\big]-\mathrm{Alg}(M),
$$
where θ̃t
includes ⟂ entries for agents who
abstain. This benchmark is deliberately conservative: it asks whether,
given the actual selection induced by equilibrium participation, the
platform performs nearly as well as the best fixed mechanism in
hindsight on that same realized sequence. In later sections we discuss
when and how such participant-level guarantees translate into
population-level welfare—an issue that turns on additional structure
(e.g., additive objectives and lower bounds on participation rates).
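On a realized run with full-information feedback and a finite Π, this regret is directly computable; a sketch (names and the payoff oracle are our own abstractions):

```python
def participation_regret(Pi, G, theta_tilde_seq, alg_payoffs):
    """Reg^part_T(M, Π) on one realized path (a sketch).

    Pi:              finite benchmark class of single-round mechanisms.
    G:               callable G(t, theta_tilde, pi) returning the expected
                     per-round objective E_{s~π(θ̃_t)}[G_t(θ̃_t, s)].
    theta_tilde_seq: realized effective-type profiles θ̃_1, ..., θ̃_T
                     (⟂ entries for abstainers), indexed here from t = 0.
    alg_payoffs:     the platform's realized per-round objectives G_t(θ̃_t, s_t).
    """
    best_fixed = max(
        sum(G(t, th, pi) for t, th in enumerate(theta_tilde_seq))
        for pi in Pi
    )
    return best_fixed - sum(alg_payoffs)
```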
This completes the description of the environment. The next step is to formalize the strategic solution concepts appropriate to this participation-and-reporting game and to state incentive and learning guarantees under weak stability and commitment mixing.
The defining difficulty in online mechanism learning is that the platform’s future behavior is endogenous to past interaction. As a result, even when each single-round mechanism in the benchmark class Π is truthful in the usual (static) sense, an agent may find it profitable to deviate today—either by misreporting while participating or by strategically abstaining—in order to influence the learner’s future distribution over mechanisms. We therefore separate three layers of incentives: (i) incentives over reports holding participation fixed (as in Huh–Kandasamy), (ii) incentives over participation and reports (our participation-and-reporting notion), and (iii) the participation/IR constraints that ensure the equilibrium we analyze is consistent with voluntary entry.
A (behavioral) strategy for agent i specifies, at each round t ∈ Ti,
how the agent maps the publicly observed history and the current
mechanism lottery into a participation decision, and then (if
participating) into a report. Formally, writing the public history up to
t − 1 as h < t, and
recalling that the platform publicly announces the round-t lottery πt before agents
move, a strategy can be represented as
σi, t: (h < t, πt, θi, t) ↦ ai, t ∈ {0, 1}, ρi, t: (h < t, πt, θi, t) ↦ bi, t ∈ Θi,
with the convention that ρi, t
is only payoff-relevant on histories where σi, t
prescribes ai, t = 1.
We write σi = (σi, t, ρi, t)t ∈ Ti
and σ = (σ1, …, σn).
Given a mechanism-learning protocol M (i.e., a rule that maps histories into the announced lottery over mechanisms, plus the outcome realization rule), any strategy profile σ induces a distribution over the full play path and hence an expected discounted utility Ui(σ; M, A) as defined in the model. Since the game has public monitoring (the learner’s announcements, realized effective reports, and objective feedback are public), it is also convenient to speak in terms of continuation utilities. For any public history h < t and type realization θi, t, let Ui(σ ∣ h < t, θi, t) denote the expected continuation payoff from round t onward when play follows σ thereafter. Our equilibrium notions require that, for each agent and each information set (public history plus the agent’s current type), the prescribed action maximizes this continuation value against others’ prescribed play.
We first isolate the incentive problem studied in stability-based mechanism learning without endogenous entry. Fix an online protocol M and suppose agent i is forced (or committed) to participate whenever available, so ai, t ≡ 1 for all t ∈ Ti. In that reduced game, the only strategic choice is the report bi, t. We say that M is reporting Nash incentive compatible (reporting-NIC) for a class of agents A if truthful reporting is a Nash equilibrium of this reduced reporting-only game.
Concretely, let ρtr denote the truthful reporting strategy profile (with mandatory participation) under which bi, t = θi, t for all i, t. Then
reporting-NIC requires that for every agent i and every alternative reporting
strategy ρi′
(allowing dependence on public histories and types),
Ui((ρitr, ρ−itr); M, A) ≥ Ui((ρi′, ρ−itr); M, A).
This notion matches the benchmark in Huh–Kandasamy: it captures
intertemporal manipulations through the reporting channel alone, and it
is the relevant target when participation is exogenous (or when the
platform can perfectly compel participation). In our environment,
however, reporting-NIC is insufficient because agents can also
manipulate the learning dynamics by withholding their data.
Our main solution concept restores the economically salient outside option and treats participation as a strategic action. Intuitively, the platform should be robust to two classes of deviations: (a) an agent participates but misreports in order to steer the learner, and (b) an agent abstains in order to suppress information and thereby steer the learner. The latter is particularly important because, unlike misreporting, abstention is typically hard to punish directly.
We say that an online protocol M is participation-and-reporting Nash incentive compatible (PR-NIC) for a class of agents
A if the following is a Nash
equilibrium of the full game with voluntary participation: in each
round, participate whenever participation yields at least the outside
option (under truthful reporting), and when participating report
truthfully. Formally, define the truthful threshold strategy σthr by
ai, t = 1 and bi, t = θi, t if 𝔼 [ui(θi, t, st) ∣ h < t, πt, ai, t = 1, bi, t = θi, t] ≥ oi, t,
and ai, t = 0
otherwise (with bi, t
arbitrary off-path). PR-NIC requires that no agent can profitably
deviate from σthr
by choosing an alternative joint participation-and-reporting strategy σi′:
∀i, ∀σi′: Ui((σithr, σ−ithr); M, A) ≥ Ui((σi′, σ−ithr); M, A).
Equivalently, we can view PR-NIC as reporting-NIC in an augmented
message space where the null symbol ⟂
is available, together with the understanding that choosing ⟂ yields the outside option oi, t
rather than the mechanism’s outcome-contingent utility. This
interpretation is conceptually useful because it aligns abstention with
a particular kind of data perturbation (switching b̃i, t
between a type and ⟂), which is exactly
the perturbation against which we assume the learner is stable.
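At a single information set, σthr reduces to a one-line comparison; a sketch (with None again standing in for ⟂; the expected utility is assumed to be supplied by the agent’s own forecast):

```python
def truthful_threshold_action(expected_truthful_utility, outside_option, theta):
    """σ^thr at one information set: participate and report θ_{i,t} truthfully
    when truthful participation weakly beats the outside option o_{i,t},
    otherwise abstain (report ⟂)."""
    if expected_truthful_utility >= outside_option:
        return 1, theta          # a_{i,t} = 1, b_{i,t} = θ_{i,t}
    return 0, None               # a_{i,t} = 0, b̃_{i,t} = ⟂
```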
Our sufficient conditions in the next section rule out both: the commitment component controls within-round misreports by imposing a penalty gap, while stability controls the discounted continuation benefit from either kind of deviation, and participation margins ensure abstention is not worthwhile in equilibrium.
Because this is a dynamic game with evolving public histories, one might ask whether we require a refinement such as subgame perfection. For our purposes, Nash equilibrium in behavioral strategies is the right baseline: agents condition on the entire publicly observed history h < t and their current private type θi, t, and we require that no agent can improve their discounted payoff by switching to a different such history-dependent strategy. Importantly, our proofs proceed via a backward-induction / one-shot deviation argument (as in Huh–Kandasamy): we show that at every history and every round, the prescribed action is optimal given that future play returns to the equilibrium strategy. Under standard regularity (perfect recall and bounded payoffs), establishing the absence of profitable one-shot deviations at every information set is sufficient to conclude that σthr constitutes a sequentially rational equilibrium path. Thus, while we state PR-NIC as a Nash condition for compactness, the guarantees we derive are inherently sequential and hold history-by-history.
Voluntary participation introduces a participation constraint even when incentives are otherwise aligned. In our model, the natural outside option is oi, t, which is realized immediately upon abstention. Accordingly, we will use an interim, per-round IR notion: conditional on the public history and the agent’s current type, participating truthfully should yield at least the outside option in expectation over the mechanism’s internal randomness and other agents’ types/strategies.
Fix a candidate equilibrium strategy profile σ (typically σthr) and a protocol
M. We say that M is individually rational at (i, t) under σ if, whenever t ∈ Ti
and the strategy recommends participation, we have
𝔼 [ui(θi, t, st) ∣ h < t, πt, σ] ≥ oi, t.
More generally, we say M is ε-individually rational at
(i, t) under σ if truthful participation yields
utility within ε of the
outside option:
𝔼 [ui(θi, t, st) ∣ h < t, πt, σ] ≥ oi, t − ε.
The ε-relaxation is useful for
two reasons. First, it accommodates small approximation errors that
arise when we trade off stability and learning performance. Second, it
clarifies the economic role of the participation margin κi, t:
when κi, t
is strictly positive (and uniformly bounded away from 0), IR holds with slack, and that slack can
be “spent” to deter strategic abstention that would otherwise be
attractive purely for its effect on the learner.
To connect IR to the manipulation problem, we will repeatedly
summarize the participation incentive at (i, t) by the participation margin
κi, t := 𝔼 [ui(θi, t, st) ∣ h < t, πt, ai, t = 1, bi, t = θi, t] − oi, t.
When κi, t ≥ 0,
the agent weakly prefers truthful participation to abstention under the
continuation induced by truthful play. But PR-NIC requires more: we must
ensure that an agent cannot justify a short-run loss (foregoing a
positive κi, t)
by gaining enough discounted future influence over qt′.
This is exactly where stability and the long-sightedness parameter α(A) enter: stability
limits how much any single ⟂ versus
θ perturbation can change
future mechanism distributions, and α(A) aggregates that
bounded influence over time under discounting.
With these definitions in place, our main theorem will provide closed-form sufficient conditions under which σthr is an equilibrium (PR-NIC): a condition of the form λβ large enough to deter misreports, and a condition of the form κi, t large enough (uniformly) to deter strategic abstention. Conditional on PR-NIC equilibrium play, the platform then faces an ordinary online learning problem on the realized effective-type stream, so standard no-regret guarantees can be stated against the benchmark class Π evaluated on realized participation.
The equilibrium problem in our environment is not that agents fail to understand single-round incentive constraints. Rather, the difficulty is inherently dynamic: because the platform adapts, an agent’s current action can be valuable primarily through its effect on the platform’s mechanism choices. This creates two manipulation channels that must be controlled simultaneously. First, an agent can misreport while participating to tilt both the current allocation/payment and the learner’s subsequent updates. Second—and more subtly—an agent can abstain in order to suppress information, steering the learner by selectively removing their data from the training stream. The latter channel is the central departure from the baseline “always participate” model: abstention is often difficult to punish directly, so deterring it requires either slack in individual rationality (a participation margin) or a structural restriction on how much abstention can influence the learner.
Our main sufficient conditions combine three ingredients, each playing a distinct economic role. (i) Stability of the learner ensures that changing a single agent’s effective message at a single time—either from θ to some θ′ (misreport) or from θ to ⟂ (abstention)—cannot significantly change the distribution over future mechanisms. This bounds the discounted continuation value of any manipulation. (ii) A commitment component with a penalty gap β > 0 converts the static “truthful reporting” benchmark of Π into an intertemporal deterrent: with positive probability, misreports are disciplined by an outcome rule that makes lying strictly costly relative to truth-telling. (iii) A participation margin κi, t ensures that even if abstention could slightly improve future treatment via learning, the agent is unwilling to sacrifice enough current utility to make that manipulation worthwhile.
The stability requirement we impose is intentionally minimal and directly aligned with the participation problem. We do not need full differential privacy with respect to arbitrary changes in the entire history; instead, we require a per-round bound on how much the learner’s output distribution can shift when a single agent’s round-t effective report changes between a real type and absence. Concretely, for each t, the learner map from public history to a distribution over Π is assumed to be weakly η-DP with respect to the perturbation b̃i, t ∈ {θ, ⟂ }. Economically, this captures the idea that the platform’s learning rule is not “too reactive” to any one agent’s presence/absence in any one round. This is exactly the form of robustness needed to argue that strategic withholding of data has limited power.
Stability alone cannot rule out profitable misreports: even if misreporting has limited effect, it may still improve utility within the learning mechanism. The commitment mechanism πcom addresses this by guaranteeing a loss from lying. We summarize this by a penalty gap parameter β > 0, interpreted as a lower bound on the (expected) utility difference between truth-telling and the best misreport under πcom, holding fixed others’ reports. Mixing in πcom with probability λ therefore creates an expected one-shot cost of at least λβ for misreporting. This is the familiar “audit probability × sanction severity” logic from enforcement models, imported here into a mechanism-learning context.
Unlike misreports, abstention cannot always be punished by the
mechanism, because an absent agent is outside the platform’s
jurisdiction. The only generic lever is the fact that abstention forgoes
whatever surplus the agent would have obtained by participating
truthfully. The participation margin
κi, t := 𝔼 [ui(θi, t, st) ∣ h < t, πt, ai, t = 1, bi, t = θi, t] − oi, t
is therefore the natural “currency” with which we can pay for incentive
alignment: if κi, t
is uniformly bounded away from zero, then an agent must incur a real,
immediate loss to abstain, and stability ensures the corresponding
future gain is bounded.
We now state the core equilibrium guarantee. The key feature is that the same stability term 4η α(A) upper bounds the value of single-round manipulation (misreport or abstain), while the mechanism provides two distinct short-run deterrents: λβ for misreports and κi, t for abstention.
Theorem 1 (informal). Suppose that in each round the platform announces and implements the mixture
πt = (1 − λ) πtL + λ πcom, πtL ∼ qt(h < t),
that each qt is weakly η-differentially private with respect to single-agent changes between θ and ⟂, and that
λβ ≥ 4η α(A) and κi, t ≥ 4η α(A) for all potentially participating (i, t).
Then the truthful threshold strategy profile σthr is a Nash equilibrium of the participation-and-reporting game; that is, M is PR-NIC for the agent class A.
The proof follows the backward-induction logic pioneered in stability-based online mechanism design, with one conceptual modification: we treat abstention as the message ⟂, but we remember that choosing ⟂ changes the current payoff to oi, t. The weak DP property implies that changing b̃i, t at a single round can shift future mechanism distributions by at most O(η) in a likelihood-ratio sense; translating this into utilities yields a bound on the maximal change in expected future discounted payoff. The role of α(A) is precisely to aggregate these per-round stability effects across time under the agent’s discounting. Intuitively, α(A) is large when the agent is patient (or otherwise puts significant weight on distant future rounds), in which case even small changes in future distributions could in principle be valuable. The stability guarantee ensures that this “option value of manipulation” is no more than 4η α(A).
Consider a one-shot deviation in which an agent participates but reports bi, t ≠ θi, t. Relative to truthful reporting, the agent potentially gains (i) a current-round improvement under the learning component, plus (ii) a future improvement by shifting the learner’s state. Stability bounds (ii) by 4η α(A). The commitment mixture creates an expected loss of at least λβ from lying. When λβ ≥ 4η α(A), this expected loss dominates the maximal possible discounted continuation benefit, so misreporting is unprofitable at every history. Importantly, this is an “insurance” argument: the commitment mechanism need not run often, but it must run often enough that the expected penalty offsets any bounded manipulation value.
Now consider the abstention deviation, i.e., switching from truthful participation to ai, t = 0 (equivalently, b̃i, t = ⟂). Here the agent does not gain directly from the current mechanism outcome—indeed, they forgo it—so the only reason to abstain is to (a) take the outside option today and (b) improve the future by withholding information. The immediate utility difference between truthful participation and abstention is exactly κi, t. Stability again bounds the total discounted future benefit from manipulating the learner by at most 4η α(A). Thus, if κi, t ≥ 4η α(A), abstention cannot be justified as an investment in future influence: it costs too much today relative to what it can buy in terms of future mechanism shifts.
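To make the one-shot-deviation arithmetic explicit (a restatement of the two arguments above; the only extra ingredient is that the within-round gain under the learning component is nonpositive because each π ∈ Π is one-shot truthful):
$$
\Delta^{\mathrm{misreport}}_{i,t}
\ \le\ \underbrace{(1-\lambda)\cdot 0}_{\text{one-shot truthfulness of }\Pi}
\ -\ \underbrace{\lambda\beta}_{\text{expected penalty}}
\ +\ \underbrace{4\eta\,\alpha(A)}_{\text{future influence}}
\ \le\ 0
\quad\text{whenever}\quad \lambda\beta \ \ge\ 4\eta\,\alpha(A),
$$
$$
\Delta^{\mathrm{abstain}}_{i,t}
\ \le\ \underbrace{-\,\kappa_{i,t}}_{\text{forgone margin}}
\ +\ \underbrace{4\eta\,\alpha(A)}_{\text{future influence}}
\ \le\ 0
\quad\text{whenever}\quad \kappa_{i,t} \ \ge\ 4\eta\,\alpha(A),
$$
where the Δ terms denote the deviating agent’s expected gain relative to following σthr at the given information set.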
The theorem’s inequalities are deliberately “engineering friendly.”
They tell the platform that it can guarantee PR-NIC by ensuring two
slack conditions:
$$
\text{(reporting deterrence)}\quad \lambda \ \ge\
\frac{4\eta\,\alpha(A)}{\beta},
\qquad
\text{(participation deterrence)}\quad \kappa_{i,t}\ \ge\
4\eta\,\alpha(A)\ \ \forall (i,t).
$$
In applications, β is shaped
by how aggressively the commitment mechanism can punish detectable
inconsistencies (or otherwise make misreports unattractive), while κi, t
is shaped by subsidies, service quality, or other participation benefits
relative to the outside option. This makes the policy tradeoff
transparent: if the platform cannot raise κi, t
(e.g., because outside options are high), it must reduce manipulability
by lowering η (more
conservative learning) or accept higher enforcement intensity λ (more commitment). Conversely,
stronger penalty gaps β reduce
the need to rely on frequent commitment, preserving welfare by spending
fewer rounds in enforcement mode.
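The two slack conditions can be packaged into a small planning helper; a sketch (our own function, with illustrative numbers that are not calibrated to any application):

```python
def design_frontier(eta, alpha, beta):
    """Minimal enforcement intensity and participation margin implied by
    λβ ≥ 4ηα(A) and κ_{i,t} ≥ 4ηα(A) (a sketch).

    Returns (lam_min, kappa_min); if lam_min > 1, the penalty gap β is too
    weak (or the learner too reactive) to support PR-NIC at any mixing rate."""
    manipulation_value = 4.0 * eta * alpha
    return manipulation_value / beta, manipulation_value

# Illustrative numbers only: design_frontier(eta=0.01, alpha=5.0, beta=0.5) -> (0.4, 0.2)
```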
Two caveats are worth emphasizing. First, the condition κi, t ≥ 4η α(A) is strong: it requires a uniform cushion above the outside option whenever the agent is available. In markets where outside options fluctuate sharply (e.g., drivers choosing between platforms, sellers multi-homing across marketplaces), such uniform margins may fail precisely when learning is most valuable. Second, weak DP is a modeling commitment: it rules out highly reactive learners that could, in principle, achieve faster adaptation but would also be more easily manipulated. The theorem therefore formalizes a real design frontier: robust incentives require limiting how much any one agent can steer the platform’s learning process.
With PR-NIC established, we can treat the realized stream of effective types θ̃t (including ⟂ for abstentions) as the equilibrium “data generating process” seen by the learner. This is the bridge to the no-regret result in the next section: conditional on equilibrium play, the platform faces an ordinary online learning problem over mechanisms evaluated on realized participation.
Once we have a participation-and-reporting equilibrium, the
platform’s learning problem becomes conceptually simpler, but in a very
specific (and deliberately limited) sense. In each round t, the learner does not observe the
“full” type vector θt; instead it
observes the type profile
θ̃t = (θ̃1, t, …, θ̃n, t), θ̃i, t ∈ Θi ∪ { ⟂ },
where θ̃i, t = ⟂
precisely when agent i
abstains. Under PR-NIC equilibrium play, the platform can treat θ̃1 : T as the
realized data stream that its mechanism sequence is evaluated on: the
learner is still interacting with a strategic environment, but the
incentives guarantee that (i) any agent who is present and finds it
worthwhile to participate does so and reports truthfully, and (ii) the
remaining “missingness” is exactly the equilibrium selection induced by
outside options and the announced lottery.
This motivates a regret notion that is both operational and honest
about what the platform can hope to control without further assumptions:
we compare the platform to the best single-round mechanism in Π evaluated in hindsight on the realized effective-type sequence. Formally, for a
mechanism-learning algorithm M
that generates outcomes st ∼ πt(b̃t),
define
$$
\mathrm{Alg}(M)\ :=\ \mathbb{E}\Big[\sum_{t=1}^T
G_t(\tilde\theta_t,s_t)\Big],
$$
and the (equilibrium) participation-robust regret
$$
\mathrm{Reg}^{\mathrm{part}}_T(M,\Pi)\ :=\
\max_{\pi\in\Pi}\ \sum_{t=1}^T \mathbb{E}_{s\sim
\pi(\tilde\theta_t)}\big[G_t(\tilde\theta_t,s)\big]\ -\ \mathrm{Alg}(M).
$$
Two points are worth emphasizing before we state the bound. First, the
comparator is not evaluated on counterfactual types θt; it is
evaluated on θ̃t, i.e., on the
participants who actually showed up in equilibrium (with nonparticipants
recorded as ⟂). Second, the comparator
does not get to “re-select” who participates: we do not imagine
rerunning history under π and
recomputing who would have abstained. This is precisely what makes the
benchmark feasible without a structural model of participation.
Theorem 2 (informal). Suppose the learning component πtL is selected by the Hedge (exponential weights) learner over Π under full-information feedback, and the platform mixes in the commitment mechanism at the minimal intensity compatible with PR-NIC, announcing πt = (1 − λ) πtL + λ πcom each round. Then, under PR-NIC equilibrium play,
$$
\sup_{A\in\mathcal A_T}\ \mathrm{Reg}^{\mathrm{part}}_T(M,\Pi)
\ \in\
\tilde O\!\Big(\big(\log|\Pi|+\beta^{-1}\big)\sqrt{\alpha_T(A)\,T}\Big).
$$
At first glance, it may seem illegitimate to apply an off-the-shelf
regret analysis once participation is endogenous: after all, the
sequence θ̃t is influenced
by the platform’s choices, since agents decide whether to participate
after seeing the announced lottery. The key is that our benchmark is defined on the very realized sequence that results from the platform’s own interaction with the equilibrium.
Fix any realized history of effective types and realized objective
functions,
(θ̃t, Gt(θ̃t, ⋅))t = 1T.
Conditional on this sequence, Hedge’s usual potential-based analysis
goes through verbatim: it upper-bounds the cumulative gap between the
learner’s realized performance and that of the best fixed expert π ∈ Π on the same per-round
payoff vectors. Importantly, this statement does not require that (θ̃t, Gt)
be exogenous or oblivious; it only uses boundedness of per-round payoffs
and the fact that, after the round’s feedback is revealed, the learner
can compute (in our full-information model) the payoffs of each π ∈ Π on that same realized
θ̃t.
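For concreteness, the Hedge update invoked above can be sketched as follows (our own code; the dictionary representation, function name, and learning-rate handling are illustrative, and the learning rate also governs how strongly qt reacts to any single round’s data):

```python
import math

def hedge_over_mechanisms(Pi, payoff_rounds, lr):
    """Exponential-weights (Hedge) selection of π_t^L over a finite class Π
    under full-information feedback (a sketch).

    payoff_rounds: list over rounds of dicts {π: payoff of π on the realized θ̃_t},
                   available after each round in the full-information model.
    lr:            learning rate.
    Returns the sequence of distributions q_t over Π used before each update."""
    weights = {pi: 1.0 for pi in Pi}
    distributions = []
    for round_payoffs in payoff_rounds:
        total = sum(weights.values())
        distributions.append({pi: w / total for pi, w in weights.items()})
        for pi in Pi:  # multiplicative update on the realized payoff vector
            weights[pi] *= math.exp(lr * round_payoffs[pi])
    return distributions
```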
The only additional moving part is the commitment mixture. Relative to the “pure learning” benchmark, mixing in πcom with probability λ creates an additive performance loss because, in the worst case, πcom is not chosen to optimize Gt and can be strictly dominated in objective value. With Gt ∈ [−1, 1], this mixing cost is at most on the order of λT in cumulative objective. Since PR-NIC requires λ to be large enough that λβ dominates the maximal continuation benefit from manipulation, the regret bound inherits a term scaling like β−1 (after substituting the minimal feasible λ).
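Putting the two pieces together, and writing RTHedge for the learner’s standard full-information regret on the realized sequence, the decomposition just described reads (up to constants):
$$
\mathrm{Reg}^{\mathrm{part}}_T(M,\Pi)\ \le\ R^{\mathrm{Hedge}}_T\ +\ 2\lambda T,
\qquad
\lambda\ \ge\ \frac{4\eta\,\alpha(A)}{\beta},
$$
so that substituting the minimal feasible enforcement intensity makes the mixing cost at most 8η α(A) T/β, which is the source of the β−1 dependence in the stated bound.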
Theorem 2 should be read as a reduction: conditional on equilibrium play (truthful reporting by participants, equilibrium abstention by others), the platform competes with the best fixed member of Π on the observed stream. This is often the right target in applications where the platform’s objective is intrinsically defined on active users—e.g., revenue conditional on bidders who enter an auction, matching quality conditional on workers who log in, or prediction accuracy conditional on users who request recommendations. In such environments, the realized participant set is part of the primitive data the platform must work with; our guarantee says that, within that data, learning is not destroyed by strategic dynamics once incentives are stabilized.
At the same time, we should be explicit about the economic content of this regret notion. Because θ̃t includes ⟂ entries, the best fixed comparator π is itself “penalized” for missing agents in exactly the same way the platform is. Thus the platform is not being asked to solve an impossible task such as achieving the objective value it would have gotten had all agents participated regardless of incentives. Rather, it is being asked to do the best it can given the equilibrium participation process.
Theorem 2 deliberately avoids counterfactual claims of the form: “our algorithm achieves low regret relative to the best fixed mechanism that would have been optimal had we played it from the start.” Such a claim requires specifying how participation would respond to alternative mechanisms throughout time. In our setting, participation is a strategic choice driven by outside options and forward-looking incentives; without additional structure, the mapping from mechanisms to participation is not identified from equilibrium play under one algorithm.
Concretely, even if the platform observes (θ̃t, st, Gt) along the equilibrium path, it generally cannot infer how participation—and hence the effective-type sequence—would have evolved under a different mechanism sequence, nor what the abstaining agents’ types would have contributed to the objective. This is not a technical nuisance; it is a fundamental econometric limitation. The platform’s data are generated under an equilibrium that itself depends on the policy. Without a model that links outside options, beliefs, and participation decisions across policies, counterfactual evaluation is underdetermined.
This is why we frame Theorem 2 as “no-regret on realized participants,” rather than as a welfare-optimality or population-optimality statement. The benchmark is chosen so that it is meaningful given what the platform can observe and control, and so that it is robust to strategic selection.
From a design perspective, Theorem 2 tells us where the hard work has already been done: incentive stabilization (via stability, commitment, and margins) converts a strategic environment into a well-posed online learning problem on the equilibrium data stream. The remaining performance loss decomposes cleanly into (i) the usual learning term (how fast we can compete with the best fixed π ∈ Π on the realized sequence) and (ii) an enforcement term (how costly it is to mix in commitment at the required rate). This decomposition is operational: platforms can often estimate the “enforcement tax” λT directly, while also monitoring whether participation margins are binding (e.g., by tracking entry rates as outside options vary).
Two limitations are immediate. First, our regret bound is stated in a full-information feedback model; with bandit feedback, additional exploration and instrumentation issues arise, and the stability parameters needed for PR-NIC may become more expensive. Second, the theorem does not by itself guarantee that learning improves outcomes when nonparticipants matter normatively or strategically (e.g., market thickness, fairness over all eligible agents, or long-run growth). Addressing that step requires translating participant-conditional performance into population performance under explicit assumptions on how objectives aggregate and how much selection can vary—precisely the selection-bias question we take up next.
The regret notion in Theorem 2 is intentionally participant-conditional: we compete with the best fixed π ∈ Π evaluated on the realized effective types θ̃t, i.e., on those agents who enter in equilibrium (with abstainers recorded as ⟂). This is often exactly the right benchmark when the platform’s objective is intrinsically defined on active users. But in many applications the platform (or a regulator) cares about a broader notion of performance that includes those who are eligible but do not show up: market thickness, unmet demand, social welfare over all potential traders, or group-level coverage constraints.
In these cases, participant-conditional learning guarantees do not automatically translate into population-level welfare guarantees, for the same reason that selection bias confounds standard evaluation: the equilibrium participation process is endogenous and may be correlated with the payoff-relevant types. Without additional structure, we cannot infer what the platform “missed” by observing only participants, nor can we evaluate counterfactual policies that would have induced different entry. What we do, however, is give a simple and operational translation in settings where (i) the objective aggregates additively across agents and (ii) participation is bounded away from zero so that selection cannot be arbitrarily severe.
To make the welfare question explicit, suppose the platform’s
per-round objective is an average (or normalized sum) of per-agent
contributions:
$$
G_t^{\mathrm{pop}}(\theta_t,s)\ :=\ \frac{1}{n}\sum_{i=1}^n
g_{i,t}(\theta_{i,t},s),
\qquad g_{i,t}(\cdot,\cdot)\in[-1,1].
$$
This formulation covers a wide range of objectives: average revenue
contributions, average match quality, average prediction accuracy, or a
normalized welfare index. The boundedness gi, t ∈ [−1, 1]
is not substantive; it simply fixes scale so that additive losses can be
stated cleanly.
When agent i abstains, the
mechanism observes θ̃i, t = ⟂
and, in most platforms, i
receives a default outcome (no trade, no match, no recommendation,
etc.). Accordingly, the platform’s objective can be written as
$$
G_t^{\mathrm{eff}}(\tilde\theta_t,s)\ :=\ \frac{1}{n}\sum_{i=1}^n
g_{i,t}(\tilde\theta_{i,t},s),
$$
where θ̃i, t ∈ Θi ∪ { ⟂ }
and gi, t(⟂,s)
is the per-agent contribution assigned to nonparticipants (often 0, but the argument below only uses
boundedness). Let
$$
p_t\ :=\ \frac{1}{n}\big|\{i:\tilde\theta_{i,t}\neq\perp\}\big|
$$
denote the participation rate at round t.
The central observation is that, under additivity and boundedness, the portion of the objective attributable to missing agents is controlled by 1 − pt. Intuitively, no matter how badly the platform performs on the participant set, it can only lose a bounded amount on each nonparticipant because that agent’s contribution is capped in [−1, 1].
We first state the basic deterministic inequality that underlies the translation.
For any round t, any true
type profile θt, any
effective profile θ̃t obtained by
replacing an arbitrary subset of coordinates by ⟂, and any outcome s ∈ S,
$$
\Big|G_t^{\mathrm{pop}}(\theta_t,s)-G_t^{\mathrm{eff}}(\tilde\theta_t,s)\Big|
\ \le\
\frac{1}{n}\sum_{i:\tilde\theta_{i,t}=\perp}\Big|g_{i,t}(\theta_{i,t},s)-g_{i,t}(\perp,s)\Big|
\ \le\
2(1-p_t).
$$
In particular, if gi, t(⟂,s)
is normalized as a baseline and we only need a one-sided welfare loss
bound, then the discrepancy is O(1 − pt)
per round.
The proof is immediate from additivity and boundedness. Economically, the lemma says that the welfare at stake in nonparticipation scales with the volume of missing agents, as long as each missing agent’s contribution is bounded.
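As a sanity check, the following short sketch (our illustration; the censoring pattern and the per-agent contributions are randomly generated placeholders) verifies the deterministic inequality numerically.

```python
# Illustrative sketch (not from the paper): numerically checks the per-round
# selection-gap inequality |G_pop - G_eff| <= 2 * (1 - p_t) for bounded,
# additive per-agent contributions g_{i,t} in [-1, 1].
import random

def selection_gap_bound_check(n=1000, trials=200, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        g_true = [rng.uniform(-1.0, 1.0) for _ in range(n)]    # g_{i,t}(theta_i, s)
        g_default = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # g_{i,t}(bot, s)
        absent = [rng.random() < rng.random() for _ in range(n)]  # arbitrary censoring
        p_t = 1.0 - sum(absent) / n                              # participation rate
        G_pop = sum(g_true) / n
        G_eff = sum(gd if a else gt for gt, gd, a in zip(g_true, g_default, absent)) / n
        assert abs(G_pop - G_eff) <= 2.0 * (1.0 - p_t) + 1e-12
    return "bound holds on all sampled profiles"

print(selection_gap_bound_check())
```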
We can now combine this per-round selection gap with the participant-robust regret guarantee from Theorem~2. The resulting statement is deliberately modest: it does not claim counterfactual optimality under alternative participation patterns; it only quantifies how far participant-stream learning can be from a population welfare benchmark evaluated along the realized participation path.
Assume $G_t^{\mathrm{pop}}(\theta_t,s)=\frac{1}{n}\sum_i
g_{i,t}(\theta_{i,t},s)$ with gi, t ∈ [−1, 1],
and define $G_t^{\mathrm{eff}}(\tilde\theta_t,s)=\frac{1}{n}\sum_i
g_{i,t}(\tilde\theta_{i,t},s)$ for θ̃i, t ∈ Θi ∪ { ⟂ }.
Suppose that along the equilibrium path induced by our PR-NIC mechanism,
the participation rate satisfies
$$
p_t\ \ge\ \underline p\ >\ 0\qquad \text{for all }t\in[T].
$$
Then any bound on participation-robust regret implies a population
welfare guarantee up to an additive selection term:
$$
\max_{\pi\in\Pi}\sum_{t=1}^T \mathbb{E}_{s\sim
\pi(\tilde\theta_t)}\!\big[G_t^{\mathrm{eff}}(\tilde\theta_t,s)\big]
\ -\
\mathrm{Alg}(M)
\ \le\
\varepsilon_T
\quad\Longrightarrow\quad
\text{(population gap)}\ \le\ \varepsilon_T + O\big((1-\underline
p)T\big),
$$
where the hidden constant is at most 2
under the bound in Lemma~3.
If at least a $\underline p$ fraction of agents participate every round, then competing with the best fixed mechanism on participants is “almost” competing on the full population, because the total welfare mass of the missing agents is uniformly bounded by $(1-\underline p)T$ (up to constants and normalization). Thus, in high-participation regimes, participant-conditional no-regret is already close to a population statement.
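For concreteness, the following minimal sketch (ours; the constant 2 comes from the bound in Lemma~3) computes the resulting population-gap upper bound from a participation-robust regret bound and a participation floor.

```python
# Hedged sketch: the additive translation in Proposition 4. Given a
# participation-robust regret bound eps_T and a uniform participation floor
# p_low, the population welfare gap is at most eps_T + 2 * (1 - p_low) * T.
def population_gap_upper_bound(eps_T: float, p_low: float, T: int) -> float:
    if not (0.0 < p_low <= 1.0):
        raise ValueError("participation floor must lie in (0, 1]")
    return eps_T + 2.0 * (1.0 - p_low) * T

# Example: a sqrt(T)-type learning term plus the selection term.
T = 10_000
print(population_gap_upper_bound(eps_T=2 * T ** 0.5, p_low=0.95, T=T))
```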
Proposition~4 is best understood as a robustness statement rather than an identification result. It does not require that nonparticipants be missing at random, nor does it require any structural model of how types map into participation. Instead, it treats nonparticipation as a worst-case censoring of a bounded additive objective, and it shows that if censoring is limited in volume (high $\underline p$), then its impact is limited in welfare terms.
At the same time, the proposition clarifies why our earlier equilibrium conditions focus on margins: in the absence of a participation lower bound, the platform can learn perfectly on the participant stream while doing arbitrarily poorly in population terms, because the selection term can be of order T. This is not a weakness of the analysis so much as a statement of what is feasible without modeling participation. To go beyond Proposition~4—e.g., to claim low regret relative to the best fixed mechanism evaluated on the true type profiles θt—one must posit how participation would respond to alternative policies, since that benchmark is inherently counterfactual in a strategic environment.
From a platform-design perspective, Proposition~4 highlights that participation is not merely a nuisance but a policy-relevant state variable. There are three common ways platforms effectively increase $\underline p$ (or its group-specific analogues): direct subsidies or rebates for participating; minimum service, payment, or matching guarantees that raise truthful-participation utility; and product improvements that lower the relative value of the outside option. Our framework does not endogenize these levers, but it clarifies their role: they reduce the welfare wedge created by strategic selection that no learning algorithm can “wash out” from participant-only data.
One might hope to deduce a participation lower bound from our PR-NIC conditions. While the participation margin condition κi, t ≥ 4η α(A) rules out abstention by marginal agents who would otherwise prefer to participate, it does not guarantee that many agents have positive margins in the first place. If outside options are high (or types place low value on trade), then equilibrium participation may be genuinely low even absent manipulation. In that regime, Proposition~4 correctly predicts that population-level welfare guarantees require additional economic intervention (e.g., subsidies) rather than purely algorithmic fixes.
Proposition~4 delineates a boundary: participant-conditional no-regret can be lifted to population welfare when selection is quantitatively limited, but it cannot resolve the hard case where participation can collapse or be strategically suppressed. The next section shows that this boundary is not an artifact of our proof techniques: when outside options make participation margins small and agents are sufficiently long-sighted, strategic abstention can destroy learning itself, yielding impossibility results even for simple mechanism classes.
The previous discussion treated nonparticipation primarily as a selection problem: if the platform only learns on (and optimizes for) the participant stream, then population welfare can be worse by an additive selection term. Here we make a sharper point. When outside options are high enough that participation margins can be arbitrarily small, abstention is no longer just a source of selection bias: it becomes an instrument that forward-looking agents can use to manipulate the platform’s learning dynamics. In that regime, it is generically impossible to guarantee both (i) participation-and-reporting incentive compatibility (PR-NIC) and (ii) sublinear regret with respect to even very simple benchmark classes, unless the designer introduces new economic slack (e.g., subsidies that create a participation margin) or stronger observability/verification (so that abstention cannot conceal payoff-relevant information).
Recall the logic behind our sufficient conditions: weak stability (our η-DP-style condition) bounds the total discounted future benefit an agent can obtain by perturbing the learner’s state, while the commitment mechanism and the participation margin κi, t provide countervailing losses that deter such perturbations. The key knife-edge is that abstention is, in our model, operationally equivalent to reporting ⟂: it changes the public history in a way that can influence future mechanism choices, but it is not directly punishable unless abstention carries an opportunity cost. When κi, t can be arbitrarily close to 0, that opportunity cost disappears, and a sufficiently long-sighted agent can rationally “invest” in abstention to steer the platform into favorable future mechanisms.
The impossibility results in this section formalize this intuition. The structure of the lower bound is common in strategic learning problems: a learner that adapts quickly has low statistical regret in truthful environments but becomes manipulable; a learner that commits (or adapts very slowly) is robust to manipulation but necessarily incurs linear regret in adversarial environments. In our setting, the manipulative action is not a misreport—single-round NIC can handle that—but rather an opt-out that censors the data stream.
We present a stylized environment with a single strategic agent
(n = 1) to isolate the issue.
Let the platform’s mechanism class be posted prices,
Π = {πL, πH},
where πp
posts price p ∈ {pL, pH}
with 0 < pL < pH < 1.
In each round t, if the agent
participates (at = 1) and
buys, the platform earns revenue Gt = p;
otherwise it earns 0. The agent’s type
is a value vt ∈ {vL, vH}
with vL < pL < pH < vH,
and if the agent participates under price p, her per-round utility is u(vt, p) = vt − p
if she buys and 0 if she declines. If
she abstains, she receives the outside option ot.
We endow the agent with long-sightedness (captured by α(A), or equivalently by
patient discounting γ(⋅)) and
allow the outside option to be chosen so that her participation margin
can be made arbitrarily small in early rounds:
κt = 𝔼[u(vt, st) ∣ truthful
participation] − ot ≈ 0.
Crucially, when the agent abstains, the platform does not observe
whether the agent would have bought at price pH; the
effective type is θ̃t = ⟂, and the
history contains no demand signal. Thus, abstention functions as a
censoring action that can keep the learner uncertain about which price
is revenue-optimal.
We consider two candidate environments (worlds), 𝒲H and 𝒲L, which differ in the buyer’s value distribution and hence in which posted price is revenue-optimal; the platform does not know which world it faces.
In 𝒲H, the optimal
fixed posted price in Π is
πH
(revenue pH each round if
the agent participates and buys). In 𝒲L, the optimal fixed
posted price is πL (revenue
pL each
round). A standard no-regret learner that observes purchases would
quickly identify the right price. Our point is that if the agent can
cheaply abstain early on, she can prevent the platform from learning
which world it is in while still benefiting later from the platform’s
uncertainty.
In particular, suppose the platform uses any adaptive policy that, in truthful play, would explore and then concentrate probability mass on the empirically better price. In 𝒲H, a forward-looking agent prefers the platform to post the low price pL (since her utility from buying is vH − p), even though the platform’s objective is to post pH. Strategic abstention can make pH look risky to the learner: if the agent abstains (or refuses to buy) whenever pH is posted in early rounds, then the learner’s data makes πH appear low-revenue, pushing it toward πL.
This manipulation is feasible precisely when abstention is approximately costless. If ot is set so that κt ≤ ε for early rounds, then the agent sacrifices at most ε per round of immediate utility by abstaining, while potentially inducing a persistent shift in the platform’s future price distribution.
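To make the steering mechanism concrete, the following toy simulation (our own illustration, not the formal construction behind the lower bound; all parameter values and the greedy learner are assumptions) pits a maximally responsive empirical-revenue price setter against a buyer with value vH who rejects the first high-price offer.

```python
# Toy simulation (ours): a greedy empirical-revenue price setter over
# {p_L, p_H} facing a single buyer with value v_H > p_H. A forward-looking
# buyer who rejects p_H once, early on, steers the platform to the low price
# for the rest of the horizon at an immediate cost of only (v_H - p_H).
# Persistent exploration or commitment rounds (the lambda-mixture in the text)
# would blunt this, at a welfare cost.
def simulate(strategic: bool, T=2000, p_L=0.3, p_H=0.7, v_H=0.9):
    counts = {p_L: 0, p_H: 0}
    mean_rev = {p_L: 0.0, p_H: 0.0}
    platform_revenue = buyer_utility = 0.0
    for t in range(T):
        if counts[p_L] == 0:
            price = p_L                              # one forced sample per price
        elif counts[p_H] == 0:
            price = p_H
        else:
            price = max(mean_rev, key=mean_rev.get)  # fully responsive: follow the leader
        # Truthful play buys whenever v_H >= price; the strategic buyer rejects
        # the first high-price offer to censor the demand signal.
        buy = v_H >= price and not (strategic and price == p_H and counts[p_H] == 0)
        revenue = price if buy else 0.0
        counts[price] += 1
        mean_rev[price] += (revenue - mean_rev[price]) / counts[price]
        platform_revenue += revenue
        buyer_utility += (v_H - price) if buy else 0.0
    return round(platform_revenue, 1), round(buyer_utility, 1)

print("truthful buyer :", simulate(strategic=False))   # platform learns p_H
print("strategic buyer:", simulate(strategic=True))    # platform gets stuck at p_L
```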
We now state the qualitative impossibility in the language of our framework.
Fix any mechanism-learning algorithm M (possibly randomized) that maps public histories to distributions over Π. Suppose there exist infinitely many rounds t in which the agent can choose an outside option ot (or faces one exogenously) such that κt ∈ [0, ε] for arbitrarily small ε > 0, and suppose the agent’s long-sightedness satisfies α(A) = Ω(T) (equivalently, she values future rounds sufficiently relative to the present). Then there exists an adversarial environment over (v1 : T, o1 : T) such that at least one of the following holds: (i) PR-NIC fails, in the sense that some strategy involving strategic abstention strictly improves the agent’s discounted payoff over truthful participation; or (ii) M incurs regret Ω(T) against the best fixed price in Π. If the mechanism is responsive enough to learn (hence potentially no-regret under truthful participation), then it is responsive enough to be steered by low-cost abstention; preventing that steering requires responsiveness to be throttled so aggressively (effectively, commitment) that regret becomes linear in some environments.
The core argument is a two-world indistinguishability construction. Consider any candidate policy M. Either its distribution over prices is essentially insensitive to the early rounds’ data, in which case it cannot distinguish 𝒲H from 𝒲L and must incur linear regret in at least one of them; or it is sensitive to that data, in which case cheap abstention can steer it. The role of abstention is to allow the agent to maintain the indistinguishability region endogenously: by choosing at = 0 when the mechanism would otherwise generate informative signals, the agent can keep the learner’s posterior (or weights, in a Hedge-like update) from concentrating on πH.
The impossibility is not a claim that learning is doomed; it is a claim that learning requires economic slack (a strictly positive participation margin) or commitment and verification (so that absence cannot cheaply censor the signal).
Two limitations are worth emphasizing. First, the construction uses a stark informational asymmetry: the platform learns about demand only through participation. That is a realistic feature in many markets (platforms do not observe latent demand from users who do not open the app), but it is not universal. Second, our lower bound is about worst-case (adversarial) environments; in benign stochastic settings where outside options are stable and κt is typically positive, strategic abstention may be empirically rare. The point of the impossibility is to delineate a design frontier: without some mechanism that makes participation strictly attractive (or makes absence non-informative), we cannot hope for a guarantee that is simultaneously incentive-robust and statistically strong in adversarial regimes.
The impossibility results clarify why the policy levers in the next section are not merely cosmetic. If we want PR-NIC and no-regret in the same theorem, we must decide how to pay for it: by increasing enforcement through λ and the commitment design, by ensuring individual rationality through subsidies that raise κ, or by adopting targeted interventions (audits, verification, group-specific guarantees) that reduce the manipulability created by strategic exit.
The impossibility results force a design lesson that is easy to miss if one focuses only on reporting incentives: in adaptive mechanisms, retention is itself an incentive constraint. When strategic exit is a feasible way to censor the platform’s learning signal, the designer must create either (i) economic slack (a participation margin) or (ii) informational discipline (verification/auditing that makes manipulation costly), or else accept that either PR-NIC or sublinear regret will fail in adversarial regimes. In this section we translate our sufficient conditions into actionable choices of the audit probability λ and the commitment mechanism πcom, and we make explicit the welfare–truthfulness–retention frontier that results.
In our construction the platform plays a convex mixture πt = (1 − λ) πtL + λ πcom, where πtL is the learner’s chosen mechanism and πcom is a fixed “commitment/audit” rule that introduces a penalty gap β > 0 for deviations (misreports) relative to truthful play. The role of λ is not primarily statistical; it is an enforcement intensity. Increasing λ raises the immediate expected cost of deviating (because the deviation is more likely to be “caught” or rendered unprofitable by πcom), but it also reduces the fraction of rounds in which we play the welfare-maximizing learned mechanism. This gives a direct welfare tradeoff: in the simplest bounded-objective analysis, the welfare loss from enforcement is proportional to λT.
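As a concrete rendering of this mixture, the following sketch (ours; the function name and interface are illustrative assumptions) shows the round-level sampling step.

```python
# Minimal sketch (assumptions ours): the enforcement mixture from the text.
# With probability lam the platform plays a fixed, history-independent
# commitment rule pi_com; otherwise it plays the learner's current choice.
import random

def select_mechanism(learner_choice, pi_com, lam: float, rng=random):
    """Return the mechanism actually deployed this round.

    learner_choice : mechanism proposed by the adaptive learner (pi_t^L)
    pi_com         : fixed commitment/audit rule, independent of recent data
    lam            : enforcement intensity in [0, 1]
    """
    return pi_com if rng.random() < lam else learner_choice

# Over T rounds, roughly lam * T rounds are commitment/audit rounds, which is
# the source of the O(lam * T) enforcement loss in the welfare tradeoff.
```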
Our sufficient conditions have a clean comparative-static
implication: if an agent class A is more long-sighted (larger α(A)) or the learner is
more sensitive (larger η),
then enforcement must rise; if the commitment mechanism creates a larger
penalty gap β, enforcement can
fall. Abstracting from constants, the minimal enforcement ensuring
reporting truthfulness scales like λ ≳ ηα/β.
When we incorporate strategic abstention, λ must also be large enough relative to the participation margin $\underline\kappa$—because abstention is not directly punishable, its deterrent comes from the opportunity cost κi, t.
This yields the practical tuning rule (cf. our closed-form
frontier)
$$
\lambda^*(\alpha) \;=\; \max\Big\{\frac{4\eta\alpha}{\beta},\
\frac{4\eta\alpha}{\underline\kappa}\Big\}.
$$
The first term is “truthfulness-driven” enforcement; the second is “retention-driven” enforcement, reflecting the fact that even a perfectly designed penalty for misreports does not discipline data withholding if the agent is indifferent between participating and leaving.
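The tuning rule can be read as a small calculator. The following sketch (ours; the function name and the cap at 1 are our assumptions, while the constants come from the displayed rule) makes the two regimes explicit.

```python
# Hedged sketch of the closed-form tuning rule quoted above:
# lambda* = max(4*eta*alpha/beta, 4*eta*alpha/kappa_low), capped at 1.
def lambda_star(eta: float, alpha: float, beta: float, kappa_low: float) -> float:
    """Minimal enforcement intensity under the sufficient conditions.

    eta       : per-round sensitivity (weak DP parameter) of the learner
    alpha     : long-sightedness bound for the agent class
    beta      : penalty gap created by the commitment mechanism
    kappa_low : lower bound on participation margins kappa_{i,t}
    """
    if min(beta, kappa_low) <= 0:
        return 1.0  # no penalty gap or no margin: full commitment is needed
    return min(1.0, max(4 * eta * alpha / beta, 4 * eta * alpha / kappa_low))

# Example: a moderately fast learner (eta=0.005) facing patient agents
# (alpha=20) with a decent penalty gap (beta=2) but thin margins (kappa=0.5).
print(lambda_star(eta=0.005, alpha=20, beta=2.0, kappa_low=0.5))  # 0.8, retention-driven
```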
Operationally, this suggests that the platform should not treat λ as a purely statistical tuning parameter. It should pick λ from an incentive audit of the deployed system: how quickly the learning algorithm reacts (its η), and how forward-looking users are (their α), determine the minimum enforcement needed to keep the adaptive loop stable.
A subtle but central issue is that πcom can easily “solve” truthfulness while simultaneously breaking participation. A harsh commitment rule can create a large penalty gap β but reduce truthful participants’ utility, thereby shrinking κi, t and pushing the system into the strategic exit regime. This is the mechanism-design analogue of over-policing: enforcement can backfire by making honest participation unattractive.
The practical design requirement is therefore a penalty gap without a truthfulness tax: we want πcom to generate a strong disadvantage for deviations while keeping a reasonable utility floor for truthful play. There are several standard ways to obtain such a penalty gap without materially lowering truthful utility: randomized audits with sanctions that bind only on detected misreports, deposits or bonds forfeited upon verified deviations, and reputation or priority tiers that are lost only after a deviation is caught. What these approaches share is that they enlarge β primarily by increasing the cost of detected deviations, not by reducing the utility of truthful participation. In contexts where monetary transfers are infeasible, analogous designs can be implemented via priority, access, throttling, or other rationed platform resources.
Once we view λ and πcom as policy levers,
the designer’s objective becomes a constrained optimization problem:
maximize welfare subject to equilibrium constraints that bind through
β and κ. A convenient reduced-form way to
express the tradeoff is
$$
\text{Welfare} \;\approx\; \underbrace{\text{(learning performance on
realized participants)}}_{\text{improves as }\eta \text{ rises and
}\lambda \text{ falls}}
\;-\; \underbrace{c_1\,\lambda T}_{\text{direct enforcement loss}}
\;-\; \underbrace{c_2\,\text{(selection loss from
abstention)}}_{\text{falls as }\underline\kappa \text{ rises}}.
$$
The constraints pin down that we cannot simultaneously set λ ≈ 0 (to avoid the enforcement tax) and leave $\underline\kappa\approx 0$ (to avoid paying for retention) when α is large. Put differently: in adversarial environments, someone must pay to keep participation incentive-stable so that the learning system receives informative data. That cost can be paid as (i) enforcement (higher λ), (ii) user subsidies/benefits that raise κ (participation guarantees), or (iii) reduced adaptivity (lower η), which slows learning but also reduces manipulability.
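To illustrate the constrained-optimization reading, the following sketch (entirely illustrative functional forms, not the paper’s model) grid-searches over λ subject to a lower bound lam_min (e.g., the λ* from the tuning rule); with welfare decreasing in λ, the constraint binds.

```python
# Reduced-form sketch (all functional forms are illustrative assumptions):
# choose lambda to maximize welfare subject to the incentive constraint
# lambda >= lam_min implied by the tuning rule.
def reduced_form_welfare(lam, T, learn_perf, c1=1.0):
    # learning value accrues on the (1 - lam) fraction of non-commitment
    # rounds, minus the direct enforcement loss c1 * lam * T
    return (1 - lam) * learn_perf - c1 * lam * T

def best_feasible_lambda(T, learn_perf, lam_min, grid=200):
    candidates = [lam_min + (1 - lam_min) * k / grid for k in range(grid + 1)]
    return max(candidates, key=lambda lam: reduced_form_welfare(lam, T, learn_perf))

# With welfare decreasing in lambda, the incentive constraint binds: lambda = lam_min.
print(best_feasible_lambda(T=10_000, learn_perf=7_000.0, lam_min=0.2))  # -> 0.2
```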
This frontier clarifies a common confusion in applied platform design. One might hope to “fix” strategic behavior by increasing transparency or improving the learner, but a more responsive learner generally increases η and therefore tightens the incentive constraints. In the presence of strategic exit, better prediction can raise the value of manipulation and require additional enforcement or subsidies to remain stable.
Uniform auditing (a single λ for all agents and rounds) is analytically clean but economically blunt. In many applications, participation margins are heterogeneous: some users have high outside options, some are nearly captive; some tasks are easy to reallocate, others are scarce. This heterogeneity naturally motivates targeted, risk-based auditing.
The key observation is that our deterrence conditions are local: they
compare a bound on discounted manipulation gains (scaling with ηα(A)) to an
immediate loss term (either λβ for misreports or κi, t
for abstention). When κi, t
varies systematically, the platform can economize on welfare loss by
concentrating auditing probability where κi, t
is small or where an agent’s influence on learning is large. Concretely,
one can consider agent- or segment-specific mixtures
πt = (1 − λt) πtL + λt πcom, λt = λ(segmentt),
or even λi, t
when the mechanism permits individualized enforcement. The economic
content is “risk-based auditing”: more enforcement where the
temptation to withhold information is greatest.
Targeting, however, introduces two limitations we should acknowledge. First, it can create perceived unfairness or disparate impact, which in turn may change outside options and thus κi, t endogenously. Second, if λi, t is itself predictable, sophisticated agents may sort into segments to reduce scrutiny; in such cases, randomization and coarse segmentation can be preferable to finely tuned but gamable policies. These are not merely normative concerns: they affect equilibrium behavior and therefore the platform’s realized data stream.
In many digital markets the platform can directly influence effective outside options through user experience: response times, baseline recommendations, or minimum payouts. Within our model these are interventions that increase κi, t by raising truthful-participation utility or lowering the value of opting out. The impossibility result tells us that such retention levers are not “growth hacks”; they are incentive instruments that can make learning feasible.
A practical implication is that one can treat $\underline\kappa$ as a design target. If the platform can guarantee a minimum expected gain from participating—for example, via a participation rebate, a minimum payment, or a guaranteed matching probability—then the platform can reduce λ and spend more rounds in welfare-maximizing learned mechanisms. This substitution between λ and $\underline\kappa$ is exactly what our tuning rule captures: enforcement can be replaced by margin, and vice versa.
Several extensions preserve the core message. (i) λt can be used to “front-load” audits early when the learner is most sensitive to marginal data (effectively when η is largest), and relax later. (ii) In partial-information or bandit feedback settings, the statistical value of participation rises, which increases the potential benefit from strategic abstention; this typically makes the retention constraint tighter, strengthening the case for explicit participation guarantees. (iii) With continuous types, one can replace the discrete DP notion with stability under small report perturbations; the same structure emerges: the future benefit from manipulation is controlled by stability, and must be dominated by an immediate cost.
Finally, our framework is intentionally worst-case: it delineates a frontier rather than recommending maximal enforcement. In benign regimes where κi, t is typically large (users strongly prefer participating) or where abstention is not informative, the platform can safely set λ near zero and rely on the learner. The point is that without diagnosing whether the system is near the low-margin region, it is impossible to know whether aggressive adaptation is a feature or a vulnerability.
The next section illustrates these design levers in three canonical applications, emphasizing how one can map abstract quantities like β, η, and κ into measurable primitives in posted pricing, allocation without money, and auction reserve learning with bidder dropout.
We close the design discussion with three canonical application sketches. The common goal is not to force each domain into our exact reporting model, but to show how the same two vulnerabilities—strategic misreporting and strategic abstention (data censorship)—arise whenever an adaptive rule maps today’s observed data into tomorrow’s mechanism. In each sketch we highlight (i) what plays the role of types, reports, and outside options, (ii) what a natural benchmark class Π looks like, and (iii) how the abstract quantities η (learner sensitivity), β (commitment penalty gap), and κ (participation margin) can be related to measurable or designable primitives.
Across applications, it is useful to keep the following reduced-form correspondences in mind: η corresponds to how aggressively the deployed learner reacts to recent data (learning rates, update frequency); β corresponds to the severity and credibility of sanctions or commitment rounds; κ corresponds to the surplus from truthful participation net of the best outside option; and α corresponds to how forward-looking the relevant user population is. With these correspondences, the enforcement frontier λβ ≳ ηα and the retention margin requirement κ ≳ ηα become concrete engineering constraints: given the users’ patience and the learner’s sensitivity, is the audit budget large enough, and is the participation margin wide enough, to keep the adaptive loop stable? We now instantiate this logic.
Consider an online seller (the platform) who posts a price pt each round to a single strategic buyer (or a small set of buyers repeatedly interacting). The buyer’s type is their value θt ∈ [0, v̄], and their action is accept/reject. In the simplest posted-price model there is no explicit report bt; nonetheless, the accept/reject decision is an informational signal the platform uses to update future prices. Opt-out is immediate: the buyer can refuse trade today, and in many settings can also exit the channel entirely (purchase elsewhere), which is naturally modeled as an outside option ot.
A natural benchmark class Π is a finite grid of posted prices, Π = {πp : p ∈ 𝒫}, where πp deterministically posts p. The platform objective might be revenue, Gt(θt, st) = pt ⋅ 1{θt ≥ pt}, or surplus. Learning chooses among prices based on observed accept/reject outcomes, which makes the system especially sensitive early on.
If the buyer is forward-looking, they may reject a price they would
otherwise accept in order to induce lower future prices. This is
precisely our strategic abstention logic: the buyer pays an immediate
cost (foregone surplus) to manipulate the learner. In our notation,
“participation” corresponds to “transacting and revealing a truthful acceptance signal”; “abstention” corresponds to refusing
trade (or exiting), which is informationally equivalent to reporting
⟂ insofar as it censors the platform’s
signal about demand.
The participation margin is
κt = 𝔼[θt − pt ∣ θt ≥ pt, truthful
trade] − ot,
where ot
may be the consumer surplus from an outside seller net of switching
costs. This can be estimated from observed substitution patterns or
randomized coupons that shift ot.
Learner sensitivity η depends on how sharply the price-selection rule responds to one accept/reject history. For example, exponential-weights over prices has a stability bound that increases with its learning rate: faster price adjustments raise η and thereby raise manipulation incentives.
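As an illustration of this sensitivity scaling, the following sketch (ours; the price grid, horizon, and the use of the worst-case log-likelihood ratio as the stability measure are assumptions) censors a single high-price sale and measures how much the induced price distribution moves for different learning rates.

```python
# Illustrative sketch (ours): per-observation sensitivity of exponential
# weights over a price grid, measured as the worst-case log-likelihood ratio
# (the DP-style quantity behind the weak eta-stability condition). Censoring
# a single high-price sale shifts the induced price distribution by at most
# (learning rate) x (reward change), so faster price adjustment means larger
# eta and stronger manipulation incentives.
import math

def hedge_distribution(cum_rewards, lr):
    logits = [lr * r for r in cum_rewards]
    z = max(logits)
    w = [math.exp(x - z) for x in logits]
    s = sum(w)
    return [x / s for x in w]

def sensitivity_to_one_censored_sale(lr, prices=(0.3, 0.5, 0.7), t=50):
    base = [p * t for p in prices]                  # cumulative revenue, all offers accepted
    censored = base[:-1] + [base[-1] - prices[-1]]  # one top-price sale withheld
    q1 = hedge_distribution(base, lr)
    q2 = hedge_distribution(censored, lr)
    return max(abs(math.log(a / b)) for a, b in zip(q1, q2))

for lr in (0.01, 0.1, 1.0):
    print(f"learning rate {lr:>4}: max log-ratio shift = "
          f"{sensitivity_to_one_censored_sale(lr):.4f}  (<= lr * 0.7 = {lr * 0.7:.3f})")
```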
Finally, what is πcom and β in posted pricing? A commitment device here is naturally a data-blind pricing rule: with probability λ, the platform ignores recent acceptance data and posts a price from a rule that is known to be insensitive to any single buyer’s behavior (e.g., a fixed price, or a price drawn from a wide prior). This reduces the benefit of rejecting today to influence tomorrow. In this domain, β should be read less as an explicit “punishment” and more as a cap on the gains from deviation created by commitment: under the commitment rule, rejecting today cannot buy as large a future price reduction, so the deviation’s discounted benefit is capped. The policy lesson is that aggressive dynamic pricing can be self-defeating with repeat customers unless the seller builds in commitment or provides an explicit participation margin (e.g., loyalty benefits) that makes strategic rejection unattractive.
Next consider a platform allocating scarce service capacity (compute slots, moderation effort, delivery windows, or hospital appointments) to agents who submit jobs. Each agent i in round t may have a job with private attributes θi, t (priority, urgency, required resources, completion value). Participation means submitting the job and (possibly) reporting attributes; abstention means withholding the job (doing it internally, delaying submission, or rerouting to another system), yielding outside option oi, t.
A natural mechanism class Π
is a family of priority/queueing rules, such as (i) strict priority by
reported urgency, (ii) randomized priority within classes, or (iii)
threshold admission control. The platform objective Gt could be
throughput, deadline satisfaction, or an additive proxy of user
welfare:
Gt(θ̃t, st) = ∑igi, t(θ̃i, t, st), gi, t ∈ [−1, 1].
Misreporting is direct: users can exaggerate urgency or understate required resources. But even if the mechanism is single-round incentive compatible (or if reports are partially verifiable), withholding is powerful: by not submitting in congested periods, an agent can reshape the data the learner sees about demand, inducing looser admission thresholds or higher future priority weights. This is particularly salient when the platform uses past loads to tune future capacity allocation or to update an admission model.
In non-monetary settings, πcom often takes the form of random audits and deterministic sanctions. With probability λ, the platform audits a submitted job (e.g., by checking metadata, ex post completion traces, or third-party logs). If a report is found inconsistent with observed attributes, the platform enforces a deterministic penalty: deprioritization for L future rounds, temporary throttling, or loss of access tier. The penalty gap β is then computed in continuation-value units: the minimum expected discounted loss from being sanctioned versus being truthful. Operationally, it is set by policy (how long the cooldown is, how severe the throttling is) and can be A/B tested.
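The continuation-value reading of β admits a simple back-of-the-envelope computation. The following sketch (our parameterization; the cooldown/penalty structure and all numbers are assumptions) computes β for an audit-with-cooldown rule and checks the deterrence inequality λβ ≥ 4ηα.

```python
# Hedged sketch (our own parameterization, not the paper's formula): the
# penalty gap beta of an audit-with-cooldown rule, in continuation-value
# units, and the resulting deterrence check lam * beta >= 4 * eta * alpha.
def penalty_gap(delta: float, cooldown: int, gamma: float) -> float:
    """Discounted utility loss from being sanctioned in an audited round:
    `delta` per-round loss (deprioritization/throttling) for `cooldown` rounds."""
    return delta * sum(gamma ** k for k in range(cooldown))

def deters_misreports(lam, delta, cooldown, gamma, eta, alpha) -> bool:
    return lam * penalty_gap(delta, cooldown, gamma) >= 4 * eta * alpha

# Example: 0.2 utility loss per deprioritized round for 20 rounds, gamma=0.95,
# audited with probability 0.1, against eta=0.005 and alpha=10.
print(penalty_gap(delta=0.2, cooldown=20, gamma=0.95))              # ~2.57
print(deters_misreports(0.1, 0.2, 20, 0.95, eta=0.005, alpha=10))   # 0.26 >= 0.2 -> True
```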
The margin κi, t
is the net benefit of submitting truthfully rather than
withholding:
κi, t = 𝔼[ui(θi, t, st) ∣ truthful
submit] − oi, t.
Here oi, t
is the value of handling the job elsewhere or later. Many platforms can
directly raise κ via
service-level guarantees (e.g., a minimum processing probability),
baseline quotas, or “submit-and-hold” features that preserve queue
position while allowing cancellation. Our retention-driven constraint
emphasizes that such product features are not ancillary: they are the
economic slack that prevents demand censorship from derailing
learning.
Finally, consider repeated auctions where the platform learns a reserve price (or other auction parameters) from observed bids, and bidders can choose whether to participate. This setting captures ad auctions, procurement, and many two-sided marketplaces. Let bidder i’s value be θi, t; participation ai, t = 1 means entering the auction and submitting a bid (a report), while ai, t = 0 means dropout with outside option oi, t (e.g., spending budget elsewhere, alternative channels, or simply waiting).
A natural benchmark class Π is a finite set of reserve rules (or scoring rules), e.g. Π = {2nd-price with reserve r : r ∈ ℛ}, each of which is single-round truthful given participation. The platform objective Gt might be revenue or welfare.
Even when truthful bidding is a dominant strategy conditional on entry (as in a second-price auction), entry itself can be strategic in a learning environment. A forward-looking bidder might skip auctions when reserves are high to reduce observed bid distributions and induce the platform to lower reserves. Unlike classic auction analysis, the key intertemporal lever is entry, not shading: dropout censors exactly the data the reserve learner uses.
A commitment mechanism can be implemented as “reserve commitment rounds”: with probability λ, the platform runs an auction with a fixed reserve drawn from a public distribution that does not depend on recent bids, or applies an exogenous reserve floor. Deviations in bidding (when applicable) can be discouraged by bid verification (for procurement, ex post cost audits; for ads, conversion/fraud audits) coupled with sanctions. The penalty gap β again is a continuation-value quantity: the minimum expected loss from being caught deviating in an audited round. In many auction platforms, sanctions are naturally available (account suspension, loss of preferred status), making β plausibly large for professional bidders.
The participation margin is the expected utility gain from entering
and bidding truthfully relative to the outside option:
κi, t = 𝔼[ui(θi, t, st) ∣ ai, t = 1, bi, t = θi, t] − oi, t.
Empirically, oi, t
can be proxied by opportunity cost of budget, alternative impression
value, or estimated surplus in competing channels; κ can be inferred from entry
decisions across reserve regimes. The key warning from our impossibility
result is immediate: if reserves are tuned so tightly that many bidders
are near indifferent (small κ), then dropout becomes a cheap
manipulation instrument and sublinear regret may be unattainable without
either subsidies (raising κ)
or explicit commitment (raising effective β through reduced sensitivity).
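This warning can be operationalized as a margin check. The following sketch (ours; the surplus and outside-option numbers are placeholders) flags bidders for whom dropout is a cheap manipulation instrument under the condition κ ≥ 4ηα.

```python
# Illustrative sketch (ours): checking the participation-margin condition
# kappa_{i,t} >= 4 * eta * alpha for bidders in a reserve-learning auction,
# using estimated entry surplus and a proxy for the outside option.
def participation_margin(expected_surplus_if_truthful: float, outside_option: float) -> float:
    return expected_surplus_if_truthful - outside_option

def dropout_is_cheap(expected_surplus_if_truthful, outside_option, eta, alpha) -> bool:
    """True when the margin is too thin to deter strategic dropout."""
    return participation_margin(expected_surplus_if_truthful, outside_option) < 4 * eta * alpha

# Example: a bidder whose expected per-auction surplus under truthful entry is
# 0.12, with an estimated outside-channel value of 0.10, facing a learner with
# eta = 0.01 and long-sightedness alpha = 5.
kappa = participation_margin(0.12, 0.10)                       # 0.02
print(kappa, dropout_is_cheap(0.12, 0.10, eta=0.01, alpha=5))  # margin 0.02 < 0.2 -> True
```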
Across posted pricing, non-monetary allocation, and reserve-learning auctions, the same design message reappears: adaptive optimization creates an incentive to censor the feedback loop, and the designer must buy stability either by reducing the learner’s sensitivity (η), strengthening enforcement (λβ), or widening participation margins (κ). The practical value of the framework is that each of these objects corresponds to a knob that engineers and policy teams actually control (learning rate, audit/sanction policy, and participation guarantees), and each can be instrumented and estimated from operational data. This is precisely the sense in which retention is not merely a growth metric but a constraint in mechanism learning.
Our central message is that online mechanism learning is governed by a feedback-loop constraint: whenever today’s reports (or participation decisions) influence tomorrow’s mechanism, forward-looking agents can find it profitable to distort that feedback. In our model, two instruments create discipline. First, stability of the learning rule (captured by weak η-DP) limits the maximum discounted gain from manipulating the learner. Second, economic incentives—through a commitment mixture with penalty gap β and through participation margins κ relative to outside options—convert those bounded manipulation gains into unprofitable deviations. The resulting picture is not that truthful learning is “free,” but that it is achievable once the designer pays for it in one of three currencies: lower sensitivity (smaller η), stronger sanctions (larger λβ), or higher retention/participation margins (larger κ). This conclusion is operational: it turns what might look like abstract equilibrium constraints into design knobs that map to learning rates, audit/sanction policies, and product/contract features that raise the value of staying in the system.
That said, the framework is deliberately spare, and several extensions are both technically nontrivial and practically important. We conclude with a set of open questions that, in our view, define a near-term research agenda for mechanism learning under strategic participation.
Our analysis is stated in a full-information objective model: after choosing πt and observing the realized outcome, the platform observes Gt(θ̃t, st) (or enough to update the learner as if it did). In many environments, however, the platform sees only bandit feedback—e.g., realized revenue rather than counterfactual revenue under other mechanisms, or sparse conversion signals rather than complete allocation quality metrics. The first-order difficulty is algorithmic (regret under bandit feedback is larger), but the more subtle difficulty is incentive-related: the platform’s exploration policy becomes part of the manipulable surface.
A natural direction is to replace Hedge-style learners with bandit algorithms (EXP3, Thompson sampling, or contextual bandits) and to ask whether the same stability-based argument goes through. Two issues arise. (i) Bandit learners often use importance-weighted updates, which can be highly sensitive to individual observations; this can inflate η unless one explicitly regularizes or clips updates. (ii) The noise injected by bandit sampling may itself serve as a commitment device, but only if it is exogenous to the agent’s action; if exploration rates adapt to observed participation, strategic abstention can indirectly suppress exploration and thereby change long-run behavior. A concrete open question is thus: can we design bandit learners with provable per-round stability bounds with respect to changing one agent’s effective report between θ and ⟂ (or more generally, changing one agent’s entire contribution to the bandit feedback stream), while retaining near-optimal bandit regret? Put differently, what is the right analogue of “weak η-DP” for partial monitoring, and how does it trade off against sample efficiency?
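One way to make the clipping point concrete is the following sketch (ours, and deliberately simplified relative to standard EXP3: no explicit exploration mixing, gain-based updates): clipping the importance-weighted estimate caps the per-round movement of the log-weights, giving a direct handle on η at the cost of some bias.

```python
# Speculative sketch (ours, not a result from the paper): an EXP3-style update
# over a finite mechanism class with clipped importance weights, so that one
# round's feedback moves the log-weights by at most lr * clip_cap.
import math, random

def exp3_step(log_w, lr, clip_cap, reward_fn, rng=random):
    k = len(log_w)
    z = max(log_w)
    probs = [math.exp(x - z) for x in log_w]
    s = sum(probs)
    probs = [p / s for p in probs]
    arm = rng.choices(range(k), weights=probs)[0]  # deployed mechanism this round
    r = reward_fn(arm)                             # bandit feedback in [0, 1]
    est = min(r / probs[arm], clip_cap)            # clipped importance-weighted estimate
    log_w[arm] += lr * est                         # per-round shift <= lr * clip_cap
    return log_w, arm, r

# Example: 3 candidate mechanisms; clipping trades a little bias for a uniform
# bound on how much any one (possibly strategic) observation can move the
# learner, which is the eta/regret tension discussed above.
rng = random.Random(0)
log_w = [0.0, 0.0, 0.0]
for _ in range(5):
    log_w, arm, r = exp3_step(log_w, lr=0.1, clip_cap=5.0,
                              reward_fn=lambda a: rng.random() * (a + 1) / 3, rng=rng)
print([round(x, 3) for x in log_w])
```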
The discrete type space assumption is a convenient core case, but many economic domains involve continuous values, continuous quality, or high-dimensional attributes. Extending the participation-and-reporting equilibrium logic to continuous Θi raises two separable questions.
First is the mechanism-design question: what does it mean for the benchmark class Π to be single-round NIC when types are continuous and mechanisms may be parameterized families (e.g., reserves in ℝ+, scoring vectors, or menu mechanisms)? One approach is to take Π to be a compact parameter set and require incentive compatibility in the direct-revelation sense for each parameter value. Another is to relax to approximate IC, which is arguably the right object once we discretize for learning anyway.
Second is the learning/stability question: stability in continuous spaces typically depends on Lipschitz properties. If a learner’s update depends smoothly on reported types (rather than via a finite action grid), then changing one report from θ to θ′ might have a bounded influence proportional to ∥θ − θ′∥. In that setting, our “effective report” perturbation θ ↔︎ ⟂ is an extreme change, and it suggests that modeling abstention as a special symbol is not merely notational: it is an analytically meaningful discontinuity. An open question is whether we can build learners and mechanism classes for which (a) misreports have a Lipschitz-bounded effect on future distributions, and (b) abstention can be treated either as a bounded perturbation (e.g., via imputation or explicit missing-data models) or else handled with a separate participation-sensitive stability guarantee. This is where tools from robust statistics and differential privacy for continuous data may interact fruitfully with incentive constraints.
We modeled each agent’s opportunity set Ti as given and allowed ai, t ∈ {0, 1} when present. In many marketplaces, however, “absence” is not exogenous: agents can churn and re-enter, can multi-home, and can strategically time participation based on beliefs about future mechanisms. This introduces state dependence that is not captured by a per-round outside option oi, t alone.
A more realistic model would endogenize a participation state (e.g., active/inactive) with transition probabilities influenced by realized utilities and by platform policies (fees, matching quality, enforcement). The key conceptual question then is: how should we define and enforce participation-robust IC when the deviation is not “skip this round” but “exit for L rounds and return when conditions improve,” possibly under a different identity? Our current margin condition κi, t ≥ 4ηα(A) is an instantaneous inequality; with re-entry, the relevant comparison is a outside option that includes the option value of waiting. This suggests that the right constraint may involve a dynamic programming object—a continuation-value advantage of staying active—and that enforcement may need to be tied to identity persistence (deposits, reputation, or cryptographic linkage) to prevent costless reset. A practical corollary is that platform governance (identity, anti-sybil measures, and re-entry friction) is not orthogonal to learning incentives; it is part of the economic mechanism that determines whether κ is meaningfully positive.
We summarized forward-looking behavior by a class parameter α(A), which is useful for worst-case guarantees but coarse for heterogeneous populations. Empirically, some agents are almost myopic (casual users, one-off buyers), while others are highly strategic and repeat (power sellers, agencies). If we set λ to satisfy the constraints for the most long-sighted agents, we may over-enforce and sacrifice welfare for everyone else; if we tune for the median agent, we risk manipulation by a small strategic minority.
This motivates two related research directions. The first is robustness to heterogeneous sophistication: can we obtain guarantees that degrade gracefully with a quantile of the αi distribution, perhaps bounding manipulation effects by the “effective mass” of strategic agents? The second is personalized enforcement: can a platform segment agents into classes with different enforcement intensity or different commitment rates, without reintroducing new manipulation channels through the segmentation process itself? Here, stability ideas cut both ways: personalization tends to increase sensitivity to an individual’s history (raising η), but it may also allow targeted commitment that is cheaper in welfare terms than uniform commitment.
A recurring theme in applications is that our parameters are measurable or designable. Turning that into a credible empirical program is itself an open problem, with identification challenges that resemble those in dynamic treatment effects and strategic experimentation.
Estimating κi, t requires a model of outside options and counterfactual participation utility. Randomized participation incentives (credits, fee holidays, service guarantees) can identify local slopes of participation with respect to expected surplus, but strategic anticipation complicates interpretation: an agent’s response may reflect both immediate gains and beliefs about future mechanism adaptation. Estimating β requires measuring the expected utility gap induced by sanctions or commitment rounds; this may be feasible via randomized audits or randomized commitment intensity λ, but only if audits are perceived as credible and if agents do not condition entry on audit likelihood in confounding ways.
Estimating η is perhaps the most novel. In principle, η is a property of the learning update map; one can upper bound it analytically from algorithm design (learning rate, clipping, regularization). But the effective sensitivity relevant for incentives is behavioral: it depends on which features of reports actually move the learner in the deployed pipeline. Auditing “algorithmic sensitivity”—how much a single agent’s effective report changes the distribution over future mechanisms—could be measured via shadow experiments or influence-function approximations, connecting our theory to emerging practices in ML governance. Finally, effective α (or discounting) can be inferred from long-run engagement and substitution patterns, but doing so in strategic settings likely requires structural modeling: agents who appear myopic may simply have high outside options, and agents who appear forward-looking may be reacting to perceived algorithmic sensitivity rather than intrinsic patience.
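A shadow-experiment audit of effective sensitivity could look like the following sketch (ours; the black-box update interface and the toy pipeline are assumptions): compare the distribution over next-round mechanisms with and without one agent’s report and record the worst-case log-likelihood ratio.

```python
# Hedged sketch (ours): a shadow-experiment style audit of effective
# sensitivity. `update` is the deployed (black-box) map from a history to a
# distribution over candidate mechanisms; we compare the distribution with and
# without one agent's report, a behavioral proxy for the eta that matters for
# incentives.
import math

def effective_sensitivity(update, history, agent_report):
    q_with = update(history + [agent_report])
    q_without = update(history + [None])  # agent's report replaced by "absent"
    return max(abs(math.log(a / b)) for a, b in zip(q_with, q_without) if a > 0 and b > 0)

# Toy deployed pipeline: softmax over two mechanisms' cumulative scores, where
# each report contributes +1 to the mechanism it favors.
def toy_update(history, lr=0.05):
    scores = [sum(1.0 for h in history if h == m) for m in (0, 1)]
    w = [math.exp(lr * s) for s in scores]
    return [x / sum(w) for x in w]

print(effective_sensitivity(toy_update, history=[0, 1, 0, 0, 1], agent_report=1))  # ~0.025
```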
Our equilibrium conditions are sufficient and intentionally conservative. They treat α(A) as a worst-case bound, and they enforce truthfulness by ensuring manipulation cannot pay even by an ηα amount. In practice, platforms may tolerate small incentives to deviate if they are hard to exploit or if deviations are costly to coordinate. This raises normative questions about the right benchmark: should we aim for exact PR-NIC, approximate PR-NIC, or equilibrium refinements that better capture bounded rationality and limited foresight? Relatedly, when raising κ requires subsidies or costly service guarantees, the platform must decide who pays (the platform, other agents via fees, or society via regulation), which brings distributional concerns to the foreground.
We view these questions as complementary to the technical agenda above. The mechanism-learning problem is not only to minimize regret, but to do so in a strategic ecosystem where participation is itself endogenous and policy-relevant. Our framework isolates the basic tradeoff, and the open problems identify where richer economics and richer learning theory must meet: partial feedback, continuous information, dynamic participation, heterogeneous strategic sophistication, and empirical measurement.