
Truthful and Adaptive Online Mechanism Learning: Strongly Adaptive Regret under Non-Myopic Manipulation


Table of Contents

  1. Introduction and motivation: regime shifts, model updates, and AI-assisted manipulation in 2026 platforms; why static-regret IC learning is insufficient.
  2. Related literature and positioning: (i) weak-DP+commitment NIC learning (Huh–Kandasamy), (ii) adaptive regret/FLH/strong adaptivity (Hazan), (iii) dynamic mechanism design vs. adversarial/ex-post guarantees.
  3. Model: online mechanism learning with time-varying objectives; agent long-sightedness; definition of interval regret (strongly adaptive regret) and NIC in the online game.
  4. A stability notion for strongly-adaptive learners: define a stability-controlled strongly-adaptive Hedge/FLH variant; prove a per-round weak stability (weak-DP-like) bound for each map from history to distributions over Π.
  5. Incentive wrapper with time-varying enforcement: define the lottery mechanism $\pi_t = (1-\lambda_t)\pi_t^{\mathrm{learn}} + \lambda_t\pi_{\mathrm{com}}$; state and prove a dynamic NIC theorem with time-varying $\lambda_t$ and stability parameters.
  6. Strongly adaptive regret bounds: bound regret on every interval I for the learner component; add the commitment distortion term $\sum_{t\in I}\lambda_t$; optimize parameters to obtain $\tilde O((\log|\Pi|+1/\beta)\sqrt{\bar\alpha|I|})$.
  7. Computation: pruning-based implementation (FLH2-style) with O(log T) active learners; discussion of how pruning interacts with stability bounds and incentive constraints.
  8. Extensions and robustness: contextual mechanisms; infinite Π via covering numbers; approximate commitments; public vs. private history; brief discussion of coalition deviations.
  9. Discussion, limitations, and open problems: tightness in ᾱ, impossibility as ᾱ → Θ(T), empirical calibration of ᾱ, and bandit feedback as next step.

Content

1. Introduction and motivation: regime shifts, model updates, and AI-assisted manipulation in 2026 platforms; why static-regret IC learning is insufficient.

Online platforms in 2026 rarely face a stationary environment long enough for “learn once, deploy forever” to be a credible design principle. Market conditions shift (a macro shock changes demand for rides, credit risk, or labor), policies shift (a new transparency rule forces a change in ranking or pricing), and models shift (the platform retrains a recommender, a fraud detector, or a matching model). In such settings, the mechanism the platform implements is not a fixed mapping from reports to outcomes; it is itself updated as data arrive. This creates a basic tension that motivates our framework: we want the platform to adapt quickly to regime changes, but the very act of adapting opens a channel for strategic agents to manipulate the learning process.

To see why this matters, consider the kinds of “objectives” platforms optimize in practice. A marketplace might balance short-run revenue against longer-run retention; an ad platform might target a blend of clicks, conversions, and advertiser satisfaction; a labor platform might combine fulfillment, delay, and fairness constraints; and a content platform might incorporate safety metrics that are revised as threats evolve. Even when the platform’s stated goal is stable, the mapping from actions to outcomes is not: demand patterns, outside options, and agent populations drift, and the platform’s measurement of success can be adversarially noisy. The relevant performance benchmark is therefore not “how well did we do on average over the entire year,” but “how well did we do during each period in which the environment was approximately coherent.” This is the economic motivation for interval-by-interval guarantees: an adaptive platform should be judged on whether it can track changing conditions without paying a prohibitive cost each time conditions change.

The difficulty is that mechanism learning is itself a dynamic game. Agents are not merely participants in a single round; they return, they observe, and they respond to how today’s reports affect tomorrow’s rules. The platform’s learning rule maps histories (reports, realized outcomes, and observed objective values) into a distribution over future mechanisms. As soon as that mapping is nontrivial, a sophisticated agent has a reason to treat current reports as an investment: by shading a report today, the agent may slightly worsen the platform’s immediate objective but steer the learner toward a mechanism that favors the agent in future rounds. In other words, the platform is no longer choosing “the best mechanism in Π” for a fixed world; it is choosing an evolving mechanism sequence, and agents may want to influence the evolution.

This channel for manipulation becomes more salient precisely because modern agents are increasingly “AI-assisted.” A seller on an e-commerce platform can delegate experimentation to a pricing bot; a driver in a gig platform can use an app that predicts surge pricing and recommends when to go offline; an advertiser can run scripts that test how bidding patterns shift allocation rules; and a spammer can use a model to probe a classifier’s decision boundary. These tools reduce the cognitive and operational cost of strategic experimentation. In the repeated setting, even small per-round distortions can compound: if an agent can slightly tilt the platform’s model update, the cumulative benefit may be substantial when discounted patience is high. This is why incentive constraints that are adequate for one-shot interactions are insufficient once learning is endogenous.

At this point it is tempting to fall back on the canonical “static regret” perspective from online learning: compare the platform’s cumulative performance to the best fixed mechanism in hindsight. But static regret implicitly averages over the entire horizon. If the environment contains regime shifts, then the best fixed mechanism is often a poor proxy for what the platform should have done. A simple example is a market that alternates between two demand states, each favoring a different pricing rule. A learner with low static regret might still be slow to adapt after each switch, because static regret does not penalize being systematically late; the learner can “make up” losses in one regime by doing well in the other. Yet from a policy or operational standpoint, this is exactly the failure mode we care about: users experience the platform in real time, and a platform that repeatedly misprices right after a shift is predictably exploitable and predictably inefficient.

This motivates a stronger benchmark: performance on every interval I = [r, s], comparing the platform’s realized objective to the best single mechanism in Π one could have committed to just for that interval. Economically, this corresponds to a platform that is allowed to “re-optimize” at each regime change, but only by selecting from a pre-specified menu of plausible mechanisms. The benchmark is demanding: it asks the platform to be nearly as good as the best mechanism for each local period, without knowing when the periods begin or end. Operationally, this is a natural desideratum when product teams roll out model updates, when regulations impose abrupt constraint changes, or when external shocks shift demand.

However, strengthening the performance benchmark exacerbates the incentive problem. Faster adaptation means the learner becomes more sensitive to recent data, which increases the marginal impact of any single report on future mechanisms. From the agent’s perspective, the return on manipulation rises: if the platform reacts quickly, then a deviation today can meaningfully change allocations tomorrow. Conversely, if we attempt to ensure incentives by making the platform unresponsive—e.g., by using heavy averaging or tiny learning rates—then we can protect against manipulation but lose the ability to track regime shifts. The heart of the design problem is thus a three-way tradeoff: (i) adaptivity to nonstationarity, (ii) robustness to strategic manipulation, and (iii) objective performance.

Our approach starts from the observation that the platform does not need to eliminate all dependence on history; it needs to control the sensitivity of future choices to any single round’s data. This is a notion of stability: if a single report (or a single realized objective value) changes the distribution over future mechanisms only slightly, then the scope for profitable manipulation is limited. In repeated settings, what matters is not just one-step sensitivity, but the discounted sum of downstream effects. The platform therefore benefits from learning rules whose outputs are stable in a weak, multiplicative sense across neighboring histories—precisely because such stability composes over time and provides a handle on the total dynamic gain from deviating once.

Stability alone, however, cannot be the whole story. If the platform were perfectly stable (completely insensitive), it would not learn at all. Moreover, stability bounds the maximum advantage an agent can create through manipulation, but it does not create a direct cost for deviating. In real platforms, incentive compatibility is often maintained not only by careful design of the core allocation rule, but also by enforcement: audits, verification, penalties, escrow, temporary suspensions, or simply a commitment to a conservative fallback policy when manipulation is suspected. Our model abstracts these practices into a “commitment” component that can be invoked with some probability. The economic logic is familiar: if deviations can yield at most a bounded future gain (because the learning rule is stable), and deviations trigger an expected loss with a guaranteed minimum magnitude (because the commitment action imposes a penalty gap), then truth-telling can be sustained as an equilibrium even in a long horizon.

This framing also clarifies a practical design choice. Platforms rarely want to rely entirely on harsh enforcement, because it is costly in user experience and often blunt with respect to welfare. Nor do they want to rely entirely on purely algorithmic adaptation, because that invites gaming. Instead, they mix: mostly learn and optimize, occasionally “commit” or “audit” to make manipulation unprofitable. The key design question becomes quantitative: how much enforcement is needed as a function of (a) how quickly the learner adapts, (b) how patient and forward-looking the agents are, and (c) how severe the enforcement penalty can credibly be. This is the quantitative counterpart to a common product insight: the more the platform iterates on a policy, the more it must invest in anti-gaming measures.

Finally, we acknowledge an important limitation that motivates the assumptions we impose later. If agents are effectively fully patient—willing to sacrifice arbitrarily much today to shape tomorrow—then any learner that meaningfully responds to data becomes a target. In that regime, either the platform commits frequently (and thus sacrifices performance), or it is manipulable. This is not merely a technicality; it reflects a real institutional constraint. Some environments have participants with long horizons and sophisticated tooling (large advertisers, professional sellers, organized fraud rings). In such cases, learning and incentive alignment cannot be treated separately: the platform must design the learning rule and the enforcement policy together, and should not expect “no-regret learning” in the naive sense to survive strategic pressure.

The remainder of the paper formalizes this logic. We specify an online mechanism that learns over a class of single-round mechanisms while maintaining dynamic incentives, and we evaluate it against interval-based benchmarks that capture regime shifts. The central message is not that we can have costless learning with perfect incentives, but that with explicit stability control and occasional commitment, we can make the adaptivity–manipulation tradeoff transparent and tunable—mirroring the operational reality of modern platforms.


2. Related literature and positioning: (i) weak-DP+commitment NIC learning (Huh–Kandasamy), (ii) adaptive regret/FLH/strong adaptivity (Hazan), (iii) dynamic mechanism design vs. adversarial/ex-post guarantees.

A useful way to situate our contribution is that we borrow one key idea from each of three literatures—(i) stability-based incentive arguments for learning in dynamic games, (ii) strongly adaptive (interval) regret in online learning, and (iii) dynamic mechanism design—while changing the performance benchmark and the equilibrium concept to match how modern platforms are evaluated in practice.

First, our incentive logic is closest in spirit to the “weak differential privacy + commitment” approach of Huh and Kandasamy. Their central observation is that, in a repeated interaction, an agent’s incentive to misreport is not only about the immediate allocation, but also about how today’s report perturbs the platform’s future policy. If we can bound this perturbation—in their language, if the platform’s policy is stable with respect to any single-period change in the history—then we can bound the maximum discounted gain from a one-shot deviation. Stability alone does not make truth-telling optimal, but it reduces dynamic manipulation to a bounded “investment opportunity.” They then introduce an explicit enforcement instrument: with some probability, the platform plays a strict commitment mechanism that creates a guaranteed loss from deviating. This yields a clean deterrence inequality of the form “(bounded future gain) ≤ (expected penalty),” which can be made uniform over deviations.

We adopt this logic but push it in two directions that matter for the online-platform setting we have in mind. The first is time variation: rather than a fixed stability parameter and a fixed mixing probability, we allow per-round stability εt (coming from the learning rule) and per-round enforcement λt (coming from the platform’s policy). This is not a cosmetic generalization. When the platform uses a strongly adaptive learner, the sensitivity of the learner is not constant over time: it depends on restarts, pruning, and the “effective window” of the active experts. Operationally, enforcement intensity is also rarely constant—audits or conservative fallbacks are often turned up when the platform is changing policies quickly, and turned down when the system is stable. Our wrapper result is designed to translate an arbitrary weak stability sequence {εt} into an incentive-compatible enforcement schedule {λt} via a simple inequality that depends on the agents’ long-sightedness bound ᾱ and the commitment mechanism’s penalty gap β. The second direction is benchmarking: Huh–Kandasamy’s motivation is to learn while preserving incentives, whereas our motivation is to learn with guarantees that are meaningful under regime shifts. This forces us to marry the stability+commitment incentive wrapper with a learning guarantee that is itself interval-by-interval, rather than horizon-average.

That brings us to the second literature: adaptive regret, strongly adaptive algorithms, and the “learning with restarts” toolkits developed in online learning (e.g., Hazan and coauthors). Strong adaptivity asks for low regret simultaneously on every interval I = [r, s], competing with the best single expert (here: the best single-round mechanism in Π) as if we had been allowed to restart the algorithm at r. A long line of work shows that this can be achieved, up to polylogarithmic factors, by maintaining a collection of experts indexed by start times and mixing among them using meta-algorithms such as Follow-the-Leading-History (FLH) and its refinements. The key conceptual point for us is that these constructions are not merely “faster learning”; they are learning rules with structured dependence on history. The distribution qt at time t can be written as a mixture over a small number of exponentially-weighted components, each of which only “pays attention” to data since its start time. This structure is exactly what we need to connect adaptivity to manipulability.

Our use of this literature is therefore not just to import a regret bound like $\tilde O(\sqrt{|I|\log|\Pi|})$, but to import a stability property that accompanies exponential-weights updates. Exponential weights is well known to be algorithmically stable in a multiplicative sense: changing one loss vector changes the resulting distribution by at most an eε factor, with ε proportional to the learning rate. When we combine many exponential-weights components (as in FLH-style strong adaptivity), we must re-prove a comparable stability statement for the mixture—this is where we rely on standard “privacy of mixtures / log-sum-exp Lipschitzness” arguments familiar from differential privacy and from stability analyses of multiplicative weights. The punchline is that strong adaptivity and weak stability are compatible: we can obtain interval regret guarantees while still keeping εt small enough to be priced into the incentive constraint via λt. Put differently, our mechanism’s learning component is chosen not only because it tracks nonstationarity, but because its sensitivity is explicitly tunable and analyzable.

The third literature is dynamic mechanism design, broadly construed. Classical dynamic mechanism design typically assumes a probabilistic model of types and dynamics (e.g., Markov evolution, independent private values over time), and evaluates mechanisms using Bayesian equilibrium concepts and ex ante performance guarantees. This tradition has produced powerful insights—especially around intertemporal incentives, information rents, and commitment—but its baseline assumptions are not well aligned with a platform that is iterating on a policy under shifting constraints, partially adversarial feedback, and model misspecification. In our setting, the platform’s per-round objective Gt can change over time in a way that is not captured by a stationary stochastic process, and we want guarantees that hold for every realized sequence (or at least uniformly over a rich class of instances), rather than only in expectation under a correct prior.

On the other hand, there is also a growing literature on repeated auctions and dynamic pricing with strategic buyers under weak assumptions, including impossibility results showing that one cannot simultaneously learn efficiently and be robust to manipulation when buyers are very patient. These results are a crucial cautionary backdrop for our assumptions. Our framework builds the limitation into the primitives by parameterizing “long-sightedness” via ᾱ, and by making the enforcement capacity explicit via β. This is not merely a technical device: in practice, a platform’s ability to deter manipulation depends on real institutional levers (verification, penalties, account frictions), and the relevant strategic horizon depends on both discounting and agents’ ability to persist across rounds (e.g., whether they can cheaply rotate identities). Our results should be read as mapping these operational constraints into a quantitative learning–incentives frontier.

Positioned between these strands, our contribution is to provide a single template in which (a) the platform competes against a finite mechanism class Π under a strongly adaptive (interval) regret benchmark, (b) agents play an online reporting game in which the platform’s mechanism selection rule is endogenous, and (c) truth-telling is sustained as an ex post Nash equilibrium by combining weakly stable learning with occasional commitment. Conceptually, we see this as a step toward a “mechanism learning” theory that treats adaptivity as a first-class design goal while keeping incentive constraints explicit and non-asymptotic. At the same time, our approach has clear boundaries: we rely on an enforceable commitment action with a positive penalty gap β, and we require that effective patience (captured by ᾱ) not scale linearly with the horizon if we want sublinear regret. These boundaries are not artifacts; they are the economic content of the manipulation problem, and they guide how we interpret the model in platform settings.

With this positioning in mind, we now formalize the online game, the long-sightedness parameter, and the interval regret benchmark that the mechanism will be designed to satisfy.


3. Model: online mechanism learning with time-varying objectives; agent long-sightedness; definition of interval regret (strongly adaptive regret) and NIC in the online game.

We study a repeated interaction over a finite horizon T between a platform (planner) and a population of n strategic agents. Our goal in this section is to make precise (i) the online game induced by an adaptive mechanism-selection policy, (ii) the sense in which agents may be “long-sighted,” and (iii) the performance benchmark we use for the platform under nonstationary objectives.

Rounds, types, reports, and outcomes. Time is indexed by rounds t ∈ [T]. In each round t, each agent i ∈ [n] either participates with a true type θi, t ∈ Θi or is absent, which we encode as θi, t = ⊥. We write θt = (θ1, t, …, θn, t) for the type profile, and similarly bt = (b1, t, …, bn, t) for the profile of reports, where each report satisfies bi, t ∈ Θi ∪ {⊥}. The platform chooses an (observable) single-round mechanism πt with outcome space S; after observing πt, agents report; then an outcome st ∼ πt(bt) is realized.

Each agent’s per-round payoff is ui(θi, t, st). We assume utilities are bounded (w.l.o.g. in [0, 1]), so that changes in continuation values can be controlled by stability-type arguments later. Agents discount across time via non-increasing discount factors γi(t) ≥ 0, allowing for heterogeneity in patience. When an agent is absent (θi, t = ⊥), we interpret ui(⊥, st) = 0 and γi(t) still contributes to the agent’s intertemporal weighting in a mechanical way; what matters economically is the set of rounds in which the agent can influence outcomes and hence has incentives to manipulate. Let Ti := {t : θi, t ≠ ⊥} be the (instance-dependent) set of rounds in which agent i is present.

The platform objective is time-varying and may be adversarial. In each round t, the platform evaluates performance using a bounded objective Gt(θt, st) ∈ [−1, 1]. We intentionally do not assume stationarity, a correct prior, or any parametric evolution: Gt can change over time and may encode shifting constraints, changing business goals, or policy-driven targets. Formally, one can view $(G_t)_{t\le T}$ as selected by an adversary (possibly adaptive to public history), subject only to boundedness. This is the sense in which our benchmark and guarantees are “online” rather than Bayesian.

Mechanism class and online policy. The platform competes against a finite reference class Π of candidate single-round mechanisms. Each π ∈ Π is assumed to be Nash incentive-compatible in a single round (NIC): holding the mechanism fixed and taking other agents’ reports as given, truthful reporting maximizes an agent’s expected current-round utility. This assumption captures the idea that each candidate policy is “locally well-designed” as a one-shot mechanism (e.g., a DSIC auction format, a truthful matching rule, or a properly designed scoring rule), but it does not address dynamic manipulation that exploits how the platform selects πt over time.

An online mechanism M is a policy sequence that maps public history into a (possibly randomized) choice of πt. A convenient representation is in terms of the public history
$$ h_{<t} \;:=\; (\pi_1, b_1, s_1, G_1;\;\ldots;\;\pi_{t-1}, b_{t-1}, s_{t-1}, G_{t-1}), $$
and a transition rule that selects πt as a (history-dependent) lottery. The key economic feature is that because πt depends on the past reports $b_{<t}$, an agent can potentially trade off current utility against a change in future platform behavior—precisely the “policy manipulation” channel that makes the repeated problem nontrivial even when every π ∈ Π is truthful in isolation.

Strategies and the online incentive constraint. A (pure) reporting strategy for agent i is a sequence $\sigma_i = (\sigma_{i,t})_{t\le T}$ where each $\sigma_{i,t}$ maps the agent’s information at time t into a report. In the baseline model we allow agents to condition on the public history and their current type:
$$ b_{i,t} \;=\; \sigma_{i,t}(h_{<t}, \theta_{i,t}) \;\in\; \Theta_i \cup \{\bot\}. $$
We focus on the truthful strategy $\sigma_i^{\mathrm{tr}}$ defined by $b_{i,t} = \theta_{i,t}$ whenever the agent is present.

Given an instance A (which fixes the realized type sequence $\theta_{\le T}$, discount factors, and objective sequence $G_{\le T}$), the discounted utility of agent i under a strategy profile σ and online mechanism M is
$$ U_i(\sigma; \mathsf M, A) \;=\; \mathbb E\Big[\sum_{t\in T_i}\gamma_i(t)\, u_i(\theta_{i,t}, s_t)\Big], $$
where the expectation is over the randomness in M (mechanism selection) and in πt (outcome selection). Our equilibrium requirement is an ex post Nash condition (sometimes called “online NIC” in this literature): truth-telling must be a best response for each agent, for every realized instance, holding other agents truthful. Concretely, M is online NIC on a class of instances 𝒜 if for all A ∈ 𝒜, all agents i, and all alternative strategies $\sigma_i$,
$$ U_i(\sigma_i^{\mathrm{tr}}, \sigma_{-i}^{\mathrm{tr}}; \mathsf M, A) \;\ge\; U_i(\sigma_i, \sigma_{-i}^{\mathrm{tr}}; \mathsf M, A). $$
This is intentionally a strong requirement: it does not average over a prior, and it allows deviations that are history-dependent and forward-looking. The reason we adopt it is practical as well as theoretical: platforms often need incentive guarantees that are robust to misspecification and to strategic “gaming” that does not resemble i.i.d. behavior.

Long-sightedness as a quantitative bound. Online NIC is only achievable in our setting if we control how valuable future manipulation can be. We do this by restricting attention to instance classes $\mathcal A_T$ with bounded long-sightedness α(A), and we write $\bar\alpha = \sup_{A\in\mathcal A_T}\alpha(A)$. Intuitively, α(A) upper-bounds the largest continuation weight an agent can place on the future relative to the present at any round in which it can deviate.

A convenient (and standard) form of such a parameter is:
$$ \alpha(A) \;:=\; \max_{i}\;\max_{t\in T_i} \frac{\sum_{t'\in T_i:\,t'\ge t}\gamma_i(t')}{\gamma_i(t)}. $$
If γi(t) ≡ 1 and the agent is present in all remaining rounds, then α(A) is on the order of the remaining horizon—capturing near-full patience. Under geometric discounting $\gamma_i(t) = \delta^{t-1}$, one gets α(A) ≤ 1/(1 − δ). If agents participate only sporadically (e.g., each account appears in at most m rounds), then α(A) ≤ m even if δ is close to one. This parameter therefore bundles two operational determinants of manipulability: patience and persistence/recurrence. Later, the enforcement and stability conditions will scale linearly with ᾱ, reflecting the economic fact that when agents place more weight on the future, they are willing to incur larger short-run costs to steer the platform’s learning.
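To make the parameter concrete, here is a minimal computational sketch (illustrative only; the function and the numerical values are not part of the model) that evaluates α(A) for a single agent from the display above and reproduces the geometric-discounting and sporadic-participation bounds just discussed.

```python
# Sketch: evaluate the long-sightedness parameter alpha(A) for one agent,
# following alpha = max_{t in T_i} sum_{t' in T_i, t' >= t} gamma(t') / gamma(t).

def long_sightedness(gamma, present_rounds):
    present = sorted(present_rounds)
    ratios = []
    for t in present:
        tail = sum(gamma(tp) for tp in present if tp >= t)
        ratios.append(tail / gamma(t))
    return max(ratios)

T, delta = 1000, 0.95

# Geometric discounting, present in every round: alpha <= 1/(1 - delta) = 20.
print(long_sightedness(lambda t: delta ** (t - 1), range(1, T + 1)))

# Fully patient agent (gamma = 1) present in only m = 5 rounds: alpha <= m.
print(long_sightedness(lambda t: 1.0, [3, 50, 200, 640, 990]))
```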

Strongly adaptive (interval) regret. We evaluate platform performance by comparing the cumulative objective achieved by M to the best fixed mechanism in Π on every time interval. For an interval I = [r, s] ⊆ [T], define the platform’s expected cumulative objective under truthful reports as
$$ \mathrm{Alg}_I(\theta; \mathsf M, A) \;:=\; \mathbb E\Big[\sum_{t\in I} G_t(\theta_t, s_t)\Big], $$
and the best-in-hindsight benchmark within Π as
$$ \mathrm{Opt}_I(\Pi; A) \;:=\; \max_{\pi\in\Pi}\sum_{t\in I}\mathbb E_{s\sim\pi(\theta_t)}\big[G_t(\theta_t, s)\big]. $$
The interval regret is then
$$ \mathrm{Reg}_I(\theta, \mathsf M, \Pi; A) \;=\; \mathrm{Opt}_I(\Pi; A) \;-\; \mathrm{Alg}_I(\theta; \mathsf M, A). $$
Requiring a small RegI for all intervals I is the strongly adaptive guarantee: the platform must track whichever mechanism π ∈ Π would have performed best on that interval, even if the identity of that best mechanism changes over time. This benchmark is materially different from standard (horizon-average) regret. In platform terms, it corresponds to being competitive not only “on average,” but also during regime shifts—periods in which constraints, demand, or strategic composition changes.

Two remarks clarify how this interacts with incentives. First, AlgI is defined under truthful play, because our target design will ensure truthfulness as an equilibrium property; without an incentive guarantee, any learning benchmark can be vacuous. Second, the interval benchmark is deliberately single-mechanism per interval: we do not compare against an omniscient sequence (πt) that changes every round, because that would be unattainable under adversarial objectives. Instead, the benchmark captures a realistic notion of “best policy for the current regime,” while remaining learnable.
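As a concrete illustration of the benchmark (a hypothetical helper, not part of the formal development), the following sketch computes Reg_I from a matrix of per-round expected objectives $g[t,k] = \mathbb E_{s\sim\pi_k(\theta_t)}[G_t(\theta_t,s)]$ for each candidate mechanism and a vector of the platform’s realized expected objectives under truthful play.

```python
# Sketch: interval regret and strongly adaptive regret from per-round expected objectives.
import numpy as np

def interval_regret(g, alg, r, s):
    """Reg_[r,s]: best fixed mechanism on [r, s] minus the platform's total (0-indexed, inclusive)."""
    opt = g[r:s + 1].sum(axis=0).max()     # Opt_I(Pi): best single pi in Pi for this interval
    return opt - alg[r:s + 1].sum()        # minus Alg_I under truthful play

def strongly_adaptive_regret(g, alg):
    """Maximum of Reg_I over all intervals I (brute force over O(T^2) intervals, for illustration)."""
    T = len(alg)
    return max(interval_regret(g, alg, r, s) for r in range(T) for s in range(r, T))
```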

With these definitions in place—online NIC as our equilibrium requirement, ᾱ as the manipulability parameter, and interval regret as the platform benchmark—we can now turn to the key technical bridge: a stability notion for strongly adaptive learners that simultaneously supports interval-by-interval learning guarantees and bounds the extent to which any single report can perturb the platform’s future mechanism-selection distribution.


4. A stability notion for strongly-adaptive learners: define a stability-controlled strongly-adaptive Hedge/FLH variant; prove a per-round weak stability (weak-DP-like) bound for each map from history to distributions over Π.

A central difficulty in combining online learning with strategic reporting is that standard regret guarantees say nothing about how sensitive the learner’s future behavior is to a single data point. Yet, for incentives, precisely this sensitivity determines how profitable it is for an agent to “invest” in a misreport today in order to steer the platform’s future mechanism choices. We therefore introduce a stability notion tailored to our setting—strong enough to upper-bound manipulation incentives, but weak enough to be achievable by strongly-adaptive learners.

From history to a distribution over mechanisms. Fix a finite mechanism class Π. In every round t, the learning component of the platform outputs a distribution
$$ q_t \;=\; q_t(h_{<t}) \;\in\; \Delta(\Pi), $$
from which it draws $\pi_t^{\mathrm{learn}} \sim q_t$. Under truthful play, the platform can associate to each π ∈ Π a bounded “score” (or negative loss) for that round, which we denote
$$ g_t(\pi) \in [-1, 1]. $$
Concretely, one can take $g_t(\pi) = \mathbb E_{s\sim\pi(\theta_t)}[G_t(\theta_t, s)]$ (full-information feedback), which is well-defined given the reported type profile and the public objective function Gt. The important property for what follows is simply boundedness: changing a single round’s input can change each $g_t(\pi)$ by at most 2.

A weak stability (weak-DP-like) requirement. We measure the sensitivity of qt to perturbations of the past by a multiplicative bound, in the spirit of differential privacy. Let two histories $h_{<t}$ and $h'_{<t}$ be neighbors if they differ in exactly one round τ < t in the publicly observed tuple $(b_\tau, s_\tau, G_\tau)$ (equivalently, in the induced score vector $g_\tau(\cdot)$), and are identical otherwise. We say the learner is $\varepsilon_t$-stable at time t if for every neighboring pair $h_{<t}, h'_{<t}$ and every π ∈ Π,
$$ \frac{q_t(\pi\mid h_{<t})}{q_t(\pi\mid h'_{<t})} \;\le\; e^{\varepsilon_t} \qquad\text{and}\qquad \frac{q_t(\pi\mid h'_{<t})}{q_t(\pi\mid h_{<t})} \;\le\; e^{\varepsilon_t}. $$
Equivalently, $\|\log q_t(\cdot\mid h_{<t}) - \log q_t(\cdot\mid h'_{<t})\|_\infty \le \varepsilon_t$. We emphasize what this does—and does not—mean. It is not a statistical privacy guarantee about hiding an agent’s type (we do not assume a stochastic data-generating process). Rather, it is a worst-case sensitivity bound: a single strategic perturbation in the past cannot change the platform’s future randomized choice of mechanisms by more than a controlled multiplicative factor.

This stability notion is economically interpretable. If qt is εt-stable, then an agent who changes round τ’s report cannot dramatically tilt the odds toward its preferred mechanisms in later rounds; the induced change in continuation values is therefore bounded, once we incorporate discounting. The next section will use exactly this link.

Warm-up: stability of ordinary exponential weights. Consider the standard Hedge update on the full-information scores gt(π). With learning rate η > 0, define weights
$$ w_{t+1}(\pi) \;=\; w_t(\pi)\,\exp\big(\eta\, g_t(\pi)\big),\qquad q_{t+1}(\pi)\;=\;\frac{w_{t+1}(\pi)}{\sum_{\pi'} w_{t+1}(\pi')}. $$
Suppose two histories differ only at one round τ < t, inducing score vectors $g_\tau$ and $g'_\tau$. For any fixed π, the log-weight $\log w_t(\pi)$ differs by at most $\eta\,|g_\tau(\pi) - g'_\tau(\pi)| \le 2\eta$. Normalization does not amplify this too much because the log-partition map $x \mapsto \log \sum_{\pi'} e^{x_{\pi'}}$ is 1-Lipschitz in $\ell_\infty$. Formally, one obtains for all π,
$$ \big|\log q_t(\pi\mid h_{<t}) \;-\; \log q_t(\pi\mid h'_{<t})\big| \;\le\; 4\eta, $$
so the per-round stability parameter is εt = O(η). This is the familiar “exponential mechanism” style bound: because probabilities are exponentials of bounded scores, changing one score vector shifts log-probabilities by at most a constant times η.
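The multiplicative bound is easy to check numerically. The sketch below (illustrative only, with synthetic scores) runs the Hedge map on bounded scores, perturbs a single round, and verifies that the log-probabilities move by at most 4η.

```python
# Sketch: numerical check that one-round perturbations move Hedge log-probabilities by at most 4*eta.
import numpy as np

def hedge_distribution(scores, eta):
    """q proportional to exp(eta * cumulative score); log-probabilities are invariant to the shift."""
    logits = eta * scores.sum(axis=0)
    logits -= logits.max()                      # numerical stabilization only
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(0)
T, K, eta = 200, 8, 0.05
scores = rng.uniform(-1, 1, size=(T, K))        # g_tau(pi) in [-1, 1]

scores_prime = scores.copy()
scores_prime[37] = rng.uniform(-1, 1, size=K)   # change the score vector of a single round tau

q, q_prime = hedge_distribution(scores, eta), hedge_distribution(scores_prime, eta)
print(np.abs(np.log(q) - np.log(q_prime)).max(), "<=", 4 * eta)
```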

If we only cared about static regret on [T], this would essentially settle the learner side. Our problem, however, is strong adaptivity: we want near-optimal performance simultaneously on every interval I = [r, s], which requires a learner that can rapidly “restart” after regime shifts.

Stability under strong adaptivity: a stability-controlled FLH/Hedge variant. Strongly-adaptive algorithms (e.g., FLH/FLH2-style constructions) can be viewed as running many exponential-weights instances in parallel, one for each candidate start time, and combining them. The high-level idea is: an “expert” indexed by r behaves as if the world started at time r, i.e., it runs Hedge on the subsequence {gτ}τ ≥ r. A master algorithm then mixes over these experts to produce qt. This delivers interval regret guarantees because, for any interval I = [r, s], the expert started at r competes well on that interval, and the master does not lose much relative to the best active expert.

To make this computationally and analytically clean, we use a pruned (a.k.a. geometrically covered) version: at each time t, we maintain only a set $\mathcal R_t \subseteq \{1, \ldots, t\}$ of active start times of size O(log T) (for example, a dyadic cover). Each active start time $r \in \mathcal R_t$ runs an internal Hedge distribution $q_t^{(r)} \in \Delta(\Pi)$ using only scores from rounds $r, \ldots, t-1$. The master maintains mixing weights $p_t(r)$ over $r \in \mathcal R_t$, typically updated by another exponential-weights rule based on the realized performance of each internal expert. The resulting learner output is the mixture
$$ q_t(\pi) \;=\; \sum_{r\in\mathcal R_t} p_t(r)\, q_t^{(r)}(\pi). $$
This is the “stability-controlled strongly-adaptive Hedge/FLH” object we will plug into the mechanism.

Why stability survives mixing (and what it costs). The key observation is that the entire construction remains an exponential-family mapping of past bounded scores into probabilities: both the internal qt(r) and the master weights pt(⋅) are obtained via exponentials of cumulative scores. Consequently, changing one round τ’s score vector can only shift the relevant cumulative sums by O(1), hence log-probabilities by O(η). The only new issue is multiplicity: a perturbation at time τ affects not one Hedge instance but every instance whose window includes τ, i.e., all r ≤ τ that are active at time t. Under pruning, the number of such active start times is at most O(log T), which is precisely why pruning is not just a computational convenience but also an incentive-relevant design choice.

One convenient way to formalize the bound is to work directly with unnormalized mixture weights. For each active r, define an unnormalized contribution
$$ \widetilde q_t^{(r)}(\pi) \;\propto\; \rho(r)\,\exp\Big(\eta\sum_{\tau=r}^{t-1} g_\tau(\pi)\Big), $$
where ρ(r) is a fixed prior over start times (e.g., $\rho(r) \propto 1/r^2$), and let $\widetilde q_t(\pi) = \sum_{r\in\mathcal R_t} \widetilde q_t^{(r)}(\pi)$, with $q_t(\pi) = \widetilde q_t(\pi)/\sum_{\pi'} \widetilde q_t(\pi')$. If histories differ only at τ, then for each active r ≤ τ and each π,
$$ \sum_{\ell=r}^{t-1} g_\ell(\pi)\;-\;\sum_{\ell=r}^{t-1} g'_\ell(\pi) \;=\; g_\tau(\pi)-g'_\tau(\pi), $$
so $\widetilde q_t^{(r)}(\pi)$ changes by at most a factor $e^{2\eta}$. For r > τ, there is no change. Aggregating over the at most O(log T) active r’s that satisfy r ≤ τ, we obtain
$$ \big|\log \widetilde q_t(\pi\mid h_{<t}) \;-\; \log \widetilde q_t(\pi\mid h'_{<t})\big| \;\le\; O(\eta\log T), $$
and applying the same log-partition (log-sum-exp) Lipschitz argument as in the warm-up yields the per-round weak stability bound
$$ \big|\log q_t(\pi\mid h_{<t}) \;-\; \log q_t(\pi\mid h'_{<t})\big| \;\le\; \varepsilon_t, \qquad \varepsilon_t = O\big(\eta\,\mathrm{polylog}(T)\big). $$
The precise polylog factor depends on the pruning schedule and whether one uses one or two layers of exponential weights (internal and master). What matters for our mechanism design is the qualitative structure: stability is proportional to the learning rate η, and strong adaptivity adds only a polylogarithmic overhead.
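The same kind of check applies to the pruned mixture in its one-layer form (fixed prior ρ(r) over start times, no separate meta-update). The sketch below is illustrative and uses synthetic scores; the observed shift stays comfortably inside the O(η polylog(T)) envelope claimed above.

```python
# Sketch: one-round perturbation of the pruned, prior-weighted mixture of Hedge components.
import numpy as np

def mixture_distribution(scores, active_starts, eta, t):
    """q_t(pi) proportional to sum over active r of rho(r) * exp(eta * sum_{tau=r}^{t-1} g_tau(pi))."""
    q_tilde = np.zeros(scores.shape[1])
    for r in active_starts:
        rho = 1.0 / r ** 2                                  # fixed prior over start times
        q_tilde += rho * np.exp(eta * scores[r - 1:t - 1].sum(axis=0))
    return q_tilde / q_tilde.sum()

rng = np.random.default_rng(1)
T, K, eta, t = 256, 6, 0.05, 200
scores = rng.uniform(-1, 1, size=(T, K))
active = [t - 2 ** k + 1 for k in range(int(np.log2(t)) + 1)]   # geometric cover, O(log t) starts

scores_prime = scores.copy()
scores_prime[149] = rng.uniform(-1, 1, size=K)                  # perturb round tau = 150 only

q = mixture_distribution(scores, active, eta, t)
q_prime = mixture_distribution(scores_prime, active, eta, t)
print(np.abs(np.log(q) - np.log(q_prime)).max())                # small, of order eta
```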

Interpretation and limitations. This stability guarantee is the technical bridge between learning and incentives. It formalizes a principle with direct policy relevance: if a platform wants to remain responsive to nonstationarity (strong adaptivity), it must run a learner whose update is smooth in the data, so that any single strategic perturbation cannot sharply redirect future decisions. Our bound delivers exactly such smoothness, and it does so in a worst-case, distribution-free way—consistent with our adversarial objectives.

At the same time, the bound makes clear what we are paying for. Stability deteriorates as η grows: faster adaptation increases manipulability. Moreover, the argument relies on bounded scores and a finite Π; extending to large or continuous mechanism classes would require additional structure (e.g., convexity with Lipschitz losses, or explicit regularization) to recover comparable sensitivity control. With this stability notion in hand, we now turn to the second ingredient of our construction: a time-varying incentive wrapper that converts εt-stability into an ex post online NIC guarantee by occasionally mixing in a strict commitment mechanism.


5. Incentive wrapper with time-varying enforcement: define the lottery mechanism $\pi_t = (1-\lambda_t)\pi_t^{\mathrm{learn}} + \lambda_t\pi_{\mathrm{com}}$; state and prove a dynamic NIC theorem with time-varying $\lambda_t$ and stability parameters.

We now turn the stability property into incentives by wrapping the learner in a simple, time-varying enforcement scheme. Economically, the wrapper plays the role of occasional audits or commitment episodes: most of the time the platform uses an adaptive (and hence potentially steerable) rule, but with some probability it switches to a strict mechanism that makes any misreport locally costly by a known margin. Stability then ensures that the only remaining benefit from misreporting—steering the future—stays bounded, so that a sufficiently frequent audit/commitment schedule eliminates profitable manipulation.

Lottery mechanism with time-varying enforcement. Fix (i) a learning component that at each round t outputs a distribution qt ∈ Δ(Π) and draws $\pi_t^{\mathrm{learn}} \sim q_t$, and (ii) a commitment mechanism πcom that is strictly DSIC with penalty gap β > 0: under πcom, any misreport by a present agent reduces that agent’s current-round expected utility by at least β, regardless of the other agents’ reports. The online mechanism M publicly announces in round t the mixture
$$ \pi_t \;=\; (1-\lambda_t)\,\pi_t^{\mathrm{learn}} \;+\; \lambda_t\,\pi_{\mathrm{com}}, \qquad \lambda_t \in [0,1], $$
with λt potentially depending on t (and in principle on public history, though we state a clean sufficient condition using only time indices).
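Operationally, the round-t draw is a two-stage lottery. The sketch below is a minimal illustration (the objects q_t, Pi, and pi_com are assumed to be supplied by the surrounding system; the names are hypothetical).

```python
# Sketch: one round of the incentive wrapper pi_t = (1 - lam_t) * pi_t^learn + lam_t * pi_com.
import numpy as np

def select_mechanism(q_t, Pi, pi_com, lam_t, rng):
    """With probability lam_t play the commitment mechanism; otherwise draw pi_t^learn from q_t."""
    if rng.random() < lam_t:
        return pi_com, "commit"
    idx = rng.choice(len(Pi), p=q_t)
    return Pi[idx], "learn"

# Usage per round: announce (q_t, lam_t) publicly, then execute
#   pi_t, branch = select_mechanism(q_t, Pi, pi_com, lam_t, np.random.default_rng())
# and collect reports under the realized pi_t.
```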

The key design question is how to set λt. Intuitively, higher λt makes manipulation less attractive (because misreports are more likely to be “tested” under πcom), but it also reduces performance because we sometimes forgo the learner’s choice. Our objective here is purely incentive-theoretic: we want a condition on {λt} guaranteeing online NIC (truth-telling is an ex post Nash equilibrium over the entire horizon), given the learner’s εt-stability.

To express the role of patience, we use the standard long-sightedness summary (as in Huh–Kandasamy): for each agent i and round t, define
$$ \alpha_i(t) \;:=\; \sum_{t'=t+1}^{T} \frac{\gamma_i(t')}{\gamma_i(t)} , \qquad \alpha(A)\;:=\;\max_{i}\max_{t}\alpha_i(t), $$
and recall $\bar\alpha := \sup_{A\in\mathcal A_T}\alpha(A)$. This is the maximal discounted weight an agent can place on the future relative to the present at any history, and it is exactly the quantity that scales the value of “investing” in manipulation today.

We also impose the standard boundedness normalization for the incentive bound: assume per-round utilities satisfy ui(θi, t, s) ∈ [−1, 1]. (If utilities are bounded in [−U, U], the same argument goes through with constants scaled by U.)


Theorem (Dynamic NIC with time-varying stability and enforcement)

Suppose:

  1. Single-round NIC of candidates. Every π ∈ Π is (weakly) Nash incentive-compatible for a single round.

  2. Strict commitment. πcom is strictly DSIC with penalty gap β > 0.

  3. Weak stability of the learner. For each t ≥ 2, the map $q_t(\cdot\mid h_{<t})$ is $\varepsilon_t$-stable with respect to changing the publicly observed data from any single past round, in the multiplicative sense defined earlier. Assume εt ≤ 1 (which will hold under our tuning).

Define the forward stability envelope $\widehat\varepsilon_t := \sup_{t' > t} \varepsilon_{t'}$. If the enforcement schedule satisfies, for all t ∈ [T],
$$ \lambda_t \;\ge\; \frac{8\,\bar\alpha}{\beta}\,\widehat\varepsilon_t \qquad\text{and}\qquad \lambda_t\le 1, $$
then truthful reporting is an ex post Nash equilibrium of the resulting online mechanism M for every agent instance A ∈ 𝒜T. Equivalently, M is online NIC.

Remark. In the common case where εt is non-increasing in t (e.g., decreasing learning rates), $\widehat\varepsilon_t = \varepsilon_{t+1} \le \varepsilon_t$, so it is sufficient (up to constants) to set $\lambda_t \gtrsim (\bar\alpha/\beta)\,\varepsilon_t$. We state the envelope form to make the time-varying dependence logically correct even when εt fluctuates.
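A small numerical illustration of the theorem’s condition (the numbers are made up for the sketch and are not a calibration): given a stability sequence, its forward envelope, a patience bound ᾱ, and a penalty gap β, the schedule $\lambda_t = \min\{1, 8\bar\alpha\,\widehat\varepsilon_t/\beta\}$ satisfies the deterrence inequality used in the proof below at every round.

```python
# Sketch: construct the enforcement schedule from the theorem and check the deterrence inequality.
import numpy as np

T = 1000
bar_alpha, beta = 10.0, 0.25                        # e.g. geometric discounting with delta = 0.9
eps = 0.002 / np.sqrt(np.arange(1, T + 1))          # an assumed per-round stability sequence

# Conservative forward envelope (suffix supremum; including the current round only helps).
hat_eps = np.maximum.accumulate(eps[::-1])[::-1]

lam = np.minimum(1.0, 8.0 * bar_alpha * hat_eps / beta)

# Deterrence check from the proof: the expected penalty lam_t * beta must cover the
# maximum discounted future gain 4 * bar_alpha * hat_eps_t (both per unit of gamma_i(t)).
assert np.all(lam * beta >= 4.0 * bar_alpha * hat_eps)
print(lam[0], lam[-1])                               # heavier enforcement early, lighter later
```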


Proof (backward induction with a one-shot deviation bound)

Fix an instance A, fix an agent i, and fix any public history $h_{<t}$ at round t. We show that, conditional on $h_{<t}$ and on other agents reporting truthfully from t onward, agent i’s best response in round t is to report $b_{i,t} = \theta_{i,t}$. Because the horizon is finite, this implies by backward induction that truthful reporting is an ex post Nash equilibrium over the entire game.

Consider an arbitrary deviation bi, t ≠ θi, t at round t, while the agent reverts to truthful reporting thereafter. We compare the agent’s expected discounted utility from round t onward under the truthful report versus the deviating report, holding fixed the other agents’ truthful play.

Step 1: Immediate-round effect is nonpositive and strictly negative under commitment. Conditioned on πt, the deviation affects only the current round’s realized outcome and then the public signals that feed into future learning. In the current round itself, two cases arise: (i) if the learning branch is drawn (probability 1 − λt), the realized mechanism is some π ∈ Π, which is single-round NIC, so the misreport cannot raise the agent’s current-round expected utility; (ii) if the commitment branch is drawn (probability λt), the realized mechanism is πcom, which is strictly DSIC with penalty gap β, so the misreport lowers the agent’s current-round expected utility by at least β.

Taking the mixture, the expected loss at time t from misreporting is therefore at least λtβ. In discounted terms, this contributes a utility difference of at least γi(t)λtβ in favor of truth-telling.

Step 2: Future effect is bounded by stability. The only way the deviation can help is by changing the platform’s future behavior through the history-dependent learner. For any future round $t' > t$, let $h_{<t'}$ denote the public history when agent i reports truthfully in round t, and $h'_{<t'}$ the “neighboring” history that results from changing only agent i’s report at round t (and the induced outcome/objective triple at that round). By construction, the two histories $h_{<t'}$ and $h'_{<t'}$ differ in exactly one past round (namely t). Hence, by $\varepsilon_{t'}$-stability,
$$ \frac{q_{t'}(\pi\mid h_{<t'})}{q_{t'}(\pi\mid h'_{<t'})}\le e^{\varepsilon_{t'}}\qquad\forall \pi\in\Pi, $$
and symmetrically with $h_{<t'}$ and $h'_{<t'}$ swapped.

This multiplicative bound implies a bound on how much the agent’s expected per-round utility can change via the learner’s randomization. Formally, for any bounded function f : Π → [−1, 1],
$$ \Big|\,\mathbb E_{\pi\sim q_{t'}(\cdot\mid h_{<t'})}[f(\pi)] \;-\; \mathbb E_{\pi\sim q_{t'}(\cdot\mid h'_{<t'})}[f(\pi)]\,\Big| \;\le\; 2\,\big(e^{\varepsilon_{t'}}-1\big) \;\le\; 4\,\varepsilon_{t'}, $$
where the last inequality uses $\varepsilon_{t'} \le 1$. We apply this with f(π) equal to the (conditional-on-types and on truthful reporting in round t′) expected utility the agent receives in round t′ when the platform chooses mechanism π in the learning branch. Because utilities are in [−1, 1], this is valid.

Thus, the maximum possible gain at a future round t′ attributable to the deviation at time t is at most $4\varepsilon_{t'}$ in absolute value, and hence at most $4\widehat\varepsilon_t$ uniformly over t′ > t. Discounting and summing,
$$ \text{(max future gain from deviating at \(t\))} \;\le\; \sum_{t'=t+1}^T \gamma_i(t')\cdot 4\widehat\varepsilon_t \;=\; 4\,\gamma_i(t)\,\alpha_i(t)\,\widehat\varepsilon_t \;\le\; 4\,\gamma_i(t)\,\bar\alpha\,\widehat\varepsilon_t. $$

Step 3: Choose λt so punishment dominates temptation. Combining Step 1 and Step 2, the deviation’s total (discounted) net benefit from round t onward is at most
$$ -\,\gamma_i(t)\,\lambda_t\,\beta \;+\; 4\,\gamma_i(t)\,\bar\alpha\,\widehat\varepsilon_t. $$
Therefore, if $\lambda_t\beta \ge 8\,\bar\alpha\,\widehat\varepsilon_t$, this net benefit is nonpositive (and strictly negative whenever $\widehat\varepsilon_t > 0$), so truth-telling is a best response at round t given truthful play in the future. Since this holds at every history and every t, backward induction yields online NIC.

This completes the proof.


Two economic takeaways are worth highlighting before we move on to regret. First, time variation is not cosmetic: when the learner is less stable (e.g., early on, or around restarts), the platform must correspondingly raise λt to keep incentives aligned. Second, the condition scales exactly with the two primitives that encode “how hard incentives are”: ᾱ (patience) inflates the value of steering, while β (penalty strength) reduces the audit frequency needed to deter it. The next section quantifies the performance cost of this enforcement through the additive distortion term $\sum_{t\in I}\lambda_t$ on each interval.


6. Strongly adaptive regret bounds: bound regret on every interval I for the learner component; add the commitment distortion term $\sum_{t\in I}\lambda_t$; optimize parameters to obtain $\tilde O((\log|\Pi|+1/\beta)\sqrt{\bar\alpha|I|})$.

We next quantify the performance cost of enforcing incentives. The wrapper we analyzed above is deliberately simple: in each round we either (i) follow the learner’s recommendation (which is good for adapting to nonstationarity), or (ii) “audit/commit” by running πcom (which is good for incentives). Regret on an interval I = [r, s] therefore decomposes into two conceptually separate terms: (a) the learner’s own strongly adaptive regret relative to the best fixed π ∈ Π on that interval, and (b) an additive commitment distortion term coming from the rounds in which we do not follow the learner.

To formalize this, let
$$ g_t(\pi) \;:=\; \mathbb E_{s\sim\pi(\theta_t)}\big[G_t(\theta_t, s)\big] \;\in\; [-1, 1] $$
denote the platform’s expected single-round objective had it committed to mechanism π at time t under truthful reports. (The bounded range follows from $G_t \in [-1,1]$.) Under truthful play, the learning branch at time t draws $\pi_t^{\mathrm{learn}} \sim q_t$ and achieves expected objective $\mathbb E[g_t(\pi_t^{\mathrm{learn}})]$. The full mechanism draws instead from the mixture $\pi_t = (1-\lambda_t)\pi_t^{\mathrm{learn}} + \lambda_t\pi_{\mathrm{com}}$, so its per-round expected objective is
$$ (1-\lambda_t)\,\mathbb E[g_t(\pi_t^{\mathrm{learn}})] \;+\; \lambda_t\, g_t(\pi_{\mathrm{com}}). $$

Step 1: a clean decomposition of interval regret. Fix an interval I. Let $\pi_I^\star \in \arg\max_{\pi\in\Pi}\sum_{t\in I} g_t(\pi)$ be the best-in-hindsight comparator in Π on I. Then
$$ \mathrm{Reg}_I \;=\; \sum_{t\in I}\Big(g_t(\pi_I^\star) \;-\; \big[(1-\lambda_t)\,\mathbb E[g_t(\pi_t^{\mathrm{learn}})] + \lambda_t\, g_t(\pi_{\mathrm{com}})\big]\Big). $$
Add and subtract $\mathbb E[g_t(\pi_t^{\mathrm{learn}})]$ to obtain
$$ \mathrm{Reg}_I =\underbrace{\sum_{t\in I}\Big(g_t(\pi_I^\star)-\mathbb E[g_t(\pi_t^{\mathrm{learn}})]\Big)}_{\text{learner regret on }I} \;+\; \underbrace{\sum_{t\in I}\lambda_t\Big(\mathbb E[g_t(\pi_t^{\mathrm{learn}})]-g_t(\pi_{\mathrm{com}})\Big)}_{\text{commitment distortion on }I}. $$
The first term is exactly the regret of the (truthful-play) learner relative to the best fixed π ∈ Π on that interval. The second term is the price we pay for occasionally executing πcom instead of the learner’s draw.

Because gt(⋅) ∈ [−1, 1], we always have the crude but robust bound
$$ \big|\,\mathbb E[g_t(\pi_t^{\mathrm{learn}})] - g_t(\pi_{\mathrm{com}})\,\big| \;\le\; 2, \qquad\text{hence}\qquad \mathrm{distortion}_I \;\le\; 2\sum_{t\in I}\lambda_t. $$
(If πcom is itself near-optimal for the objective, this can be much smaller; but we do not assume that, because πcom is chosen for incentives, not for performance.)

Thus, up to constants, the platform’s strongly adaptive regret reduces to
$$ \mathrm{Reg}_I \;\le\; \mathrm{Reg}^{\mathrm{learn}}_I \;+\; O\Big(\sum_{t\in I}\lambda_t\Big), $$
with all strategic complications pushed into how λt must be set to sustain online NIC.
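For concreteness, here is a sketch of the Step-1 decomposition (a hypothetical helper: g is the same per-round expected-objective matrix as before, q holds the learner’s distributions, lam the enforcement probabilities, and g_com the commitment mechanism’s per-round expected objective).

```python
# Sketch: split Reg_I into the learner's interval regret and the commitment distortion.
import numpy as np

def interval_regret_decomposition(g, q, lam, g_com, r, s):
    """Returns (learner regret on I, commitment distortion on I); their sum equals Reg_I."""
    learn = (q[r:s + 1] * g[r:s + 1]).sum(axis=1)              # E[g_t(pi_t^learn)] for each t in I
    learner_regret = g[r:s + 1].sum(axis=0).max() - learn.sum()
    distortion = (lam[r:s + 1] * (learn - g_com[r:s + 1])).sum()
    return learner_regret, distortion
```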

Step 2: the learner’s interval regret under truthful play. Under truthful reporting, the platform observes a sequence of bounded “payoffs” {gt(π)}π ∈ Π and runs a strongly adaptive experts algorithm. Standard constructions (e.g., FLH/FLH2-style mixtures over start times, with pruning) guarantee that for every interval I,
$$ \mathrm{Reg}^{\mathrm{learn}}_I \;=\; \sum_{t\in I}\Big(g_t(\pi_I^\star)-\mathbb E[g_t(\pi_t^{\mathrm{learn}})]\Big) \;\le\; \tilde O\!\left(\sqrt{|I|\log|\Pi|}\right), $$
where $\tilde O(\cdot)$ hides polylog(T) factors coming from restarts/mixing across many potential start times. This is the usual “strong adaptivity” promise: the learner competes on every interval, not only from 1 to T.

Crucially for our incentive wrapper, we do not use an arbitrary strongly adaptive learner; we use one whose updates are stable (weak-DP-like). In exponential-weights terms, stability is controlled by the learning rate: more aggressive updates adapt faster but are easier for agents to steer through today’s report, while smaller learning rates are more stable but pay larger regret constants. Operationally, we can view the learner’s interval regret bound in the common “two-term” form
$$ \mathrm{Reg}^{\mathrm{learn}}_I \;\lesssim\; \frac{\log|\Pi|}{\eta_I} \;+\; \eta_I\,|I|, $$
for an interval-appropriate learning-rate scale ηI, with strong adaptivity providing (via a multi-scale mixture) that, on each interval I, the algorithm competes nearly as well as if it had been run with ηI tuned to |I| from the start. Under the same tuning, the stability parameter behaves as ε ≍ η up to polylog(T) factors (as discussed earlier).

Step 3: using the incentive constraint to upper bound distortion. From the dynamic NIC condition proved above, the enforcement probabilities must satisfy (up to constants)
$$ \lambda_t \;\gtrsim\; \frac{\bar\alpha}{\beta}\,\widehat\varepsilon_t. $$
Intuitively, ε̂t quantifies how much a one-shot misreport at t can perturb future mechanism selection, while β quantifies how much a misreport is punished when the commitment branch is drawn; the factor ᾱ converts per-round perturbations into a bound on discounted future gains.

For regret, this constraint is helpful because it turns the distortion term into a controlled function of the learner’s stability. In the simplest regime where the learner’s effective stability level on an interval of length |I| is εI (again, strong adaptivity supplies an appropriate scale), we have
$$ \sum_{t\in I}\lambda_t \;\lesssim\; \frac{\bar\alpha}{\beta}\,\varepsilon_I\,|I|. $$

Step 4: optimizing stability versus enforcement. We now choose the stability level εI (equivalently, a learning-rate scale) to balance two forces: a smaller εI makes the learner more stable, so the incentive constraint permits a smaller enforcement probability and the distortion term shrinks, but learning slows and the log |Π|/εI term grows; a larger εI adapts faster but forces more frequent commitment through λt ≳ (ᾱ/β)εt, inflating the distortion term.

Plugging the heuristic forms together on an interval I yields the bound
$$ \mathrm{Reg}_I \;\lesssim\; \frac{\log|\Pi|}{\varepsilon_I} \;+\; \underbrace{\varepsilon_I\,|I|}_{\text{statistical/learning term}} \;+\; \underbrace{\frac{\bar\alpha}{\beta}\,\varepsilon_I\,|I|}_{\text{enforcement distortion}} \quad\text{(up to }\mathrm{polylog}(T)\text{ factors)}. $$
When agents are at all forward-looking, the enforcement term is typically the binding linear component, so it is natural to choose εI to balance log |Π|/εI against (ᾱ/β)εI|I|. This gives
$$ \varepsilon_I \;\asymp\; \sqrt{\frac{\beta\,\log|\Pi|}{\bar\alpha\,|I|}} \quad\Longrightarrow\quad \mathrm{Reg}_I \;\lesssim\; \sqrt{\frac{\bar\alpha\,|I|}{\beta}}\;\sqrt{\log|\Pi|}. $$
However, because our enforcement uses a fixed penalty gap β rather than scaling the punishment itself, and because we ultimately want a bound that is uniform over intervals with a single online procedure, it is convenient (and standard in this literature) to adopt a slightly more conservative tuning that treats the enforcement and learning contributions additively. Concretely, setting stability at the scale $\varepsilon_I \asymp 1/\sqrt{\bar\alpha|I|}$ yields
$$ \frac{\log|\Pi|}{\varepsilon_I}\;\asymp\;\log|\Pi|\sqrt{\bar\alpha|I|}, \qquad \frac{\bar\alpha}{\beta}\,\varepsilon_I|I|\;\asymp\;\frac{1}{\beta}\sqrt{\bar\alpha|I|}. $$
Combining with the strongly adaptive guarantee (and absorbing polylog(T) factors from multi-scale mixing and stability envelopes) gives the advertised interval-wise performance:
$$ \mathrm{Reg}_I(\theta,\mathsf M,\Pi;A) \;\le\; \tilde O\!\Big(\big(\log|\Pi|+1/\beta\big)\sqrt{\bar\alpha\,|I|}\Big). $$
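A back-of-the-envelope sketch of the tuning (illustrative numbers only; constants and polylog factors are suppressed) shows the two contributions scaling as advertised.

```python
# Sketch: with eps_I ~ 1/sqrt(bar_alpha*|I|), the learning term scales like log|Pi|*sqrt(bar_alpha*|I|)
# and the enforcement term like (1/beta)*sqrt(bar_alpha*|I|).
import numpy as np

def tuned_terms(interval_len, n_mech, bar_alpha, beta):
    eps_I = 1.0 / np.sqrt(bar_alpha * interval_len)
    learning = np.log(n_mech) / eps_I                        # ~ log|Pi| * sqrt(bar_alpha * |I|)
    enforcement = (bar_alpha / beta) * eps_I * interval_len  # ~ (1/beta) * sqrt(bar_alpha * |I|)
    return learning, enforcement

print(tuned_terms(interval_len=10_000, n_mech=50, bar_alpha=20, beta=0.25))
print(np.sqrt(20 * 10_000) * (np.log(50) + 1 / 0.25))        # the advertised (log|Pi| + 1/beta)*sqrt(...) scale
```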

Two caveats are worth keeping in view. First, the $\tilde O(\cdot)$ hides the additional overhead needed to obtain both strong adaptivity and per-round stability (the latter is what makes the incentive wrapper work); we treat these as polylogarithmic in T, but they are real design constraints in finite samples. Second, the dependence on ᾱ is not an artifact of our proof technique: as ᾱ approaches T (effectively fully patient agents), the required enforcement becomes so frequent that either learning must be frozen (very small ε) or commitment must be used almost always (large λt), and either way linear losses emerge on some instances. In this sense, the regret bound above is best interpreted as a quantitative statement of the economic tradeoff: we can be adaptive and truthful simultaneously precisely when agents are not “too patient” (ᾱ = o(T)) and when the platform has access to a sufficiently sharp enforcement tool (large β).


7. Computation: pruning-based implementation (FLH2-style) with O(log T) active learners; discussion of how pruning interacts with stability bounds and incentive constraints.


The strongly-adaptive guarantee we rely on is conceptually easiest to state as “run an experts algorithm from every possible start time, and mix them.” Taken literally, that prescription is computationally infeasible: at time t we would maintain t active base learners (one initialized at each r ≤ t), each itself maintaining a distribution over Π. The purpose of pruning is to recover essentially the same interval-by-interval guarantee while keeping only O(log T) active learners, so that per-round time and memory are near the cost of a single exponential-weights update.

7.1 FLH2-style mixture: what we implement

We maintain a meta-distribution over a set of start times $\mathcal R_t \subseteq \{1, \ldots, t\}$, where each $r \in \mathcal R_t$ indexes a base exponential-weights learner that has been “tracking” payoffs since time r. Concretely, base learner r maintains weights $\{w^{(r)}_t(\pi)\}_{\pi\in\Pi}$ and induces
$$ q^{(r)}_{t}(\pi)\;=\;\frac{w^{(r)}_{t}(\pi)}{\sum_{\pi'\in\Pi}w^{(r)}_{t}(\pi')}, \qquad w^{(r)}_{t+1}(\pi)\;=\;w^{(r)}_{t}(\pi)\exp\!\big(\eta_{r,t}\,\widehat g_t(\pi)\big), $$
where $\widehat g_t(\pi) \in [-1, 1]$ is the payoff signal used by the learner at time t for mechanism π (under truthful play, one can take $\widehat g_t = g_t$; more generally it is the bounded statistic the platform computes from the observed round-t data). The meta-layer assigns weights $\{W_t(r)\}_{r\in\mathcal R_t}$ and forms
$$ q_t(\pi)\;=\;\sum_{r\in\mathcal R_t} p_t(r)\,q^{(r)}_t(\pi), \qquad p_t(r)\;=\;\frac{W_t(r)}{\sum_{r'\in\mathcal R_t}W_t(r')}. $$
The learner then draws $\pi_t^{\mathrm{learn}} \sim q_t$. In FLH2-type constructions, the meta-weights themselves are updated by an exponential rule based on how well each start-time learner performed on the most recent step, e.g.
$$ W_{t+1}(r) \;=\; W_t(r)\,\exp\!\big(\eta_{\mathrm{meta}}\,\langle q^{(r)}_t, \widehat g_t\rangle\big), $$
together with the injection of a new learner at r = t + 1. The details of the meta-update are not essential for our purposes; what matters is that (i) each component is an exponential-weights map (hence stable), and (ii) the mixture is over a small active set $\mathcal R_t$.
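The following is a compact sketch of the per-round computation described above (not a reference implementation; the class name, the prior ρ(r) ∝ 1/r², and the uniform initialization are illustrative choices). The active start times are assumed to come from an external, time-based schedule such as the geometric cover of Section 7.2.

```python
# Sketch: pruned strongly-adaptive Hedge -- base learners indexed by start time, plus a meta layer.
import numpy as np

class PrunedAdaptiveHedge:
    def __init__(self, n_mech, eta, eta_meta):
        self.K, self.eta, self.eta_meta = n_mech, eta, eta_meta
        self.logw = {}   # start time r -> log base weights over Pi
        self.logW = {}   # start time r -> log meta-weight

    def distribution(self, active_starts):
        # Inject newly active learners and drop pruned ones (time-based schedule only).
        for r in active_starts:
            self.logw.setdefault(r, np.zeros(self.K))    # uniform prior over Pi at birth
            self.logW.setdefault(r, -2.0 * np.log(r))    # prior over start times, rho(r) ~ 1/r^2
        self.logw = {r: v for r, v in self.logw.items() if r in active_starts}
        self.logW = {r: v for r, v in self.logW.items() if r in active_starts}
        # Mixture q_t(pi) = sum_r p_t(r) * q_t^{(r)}(pi).
        meta = np.array([self.logW[r] for r in active_starts])
        p = np.exp(meta - meta.max()); p /= p.sum()
        q = np.zeros(self.K)
        for weight, r in zip(p, active_starts):
            base = np.exp(self.logw[r] - self.logw[r].max()); base /= base.sum()
            q += weight * base
        return q

    def update(self, g_hat):
        # g_hat: bounded payoff vector over Pi observed this round.
        for r in self.logw:
            base = np.exp(self.logw[r] - self.logw[r].max()); base /= base.sum()
            self.logW[r] += self.eta_meta * float(base @ g_hat)  # meta exponential update
            self.logw[r] += self.eta * g_hat                     # base Hedge update
```

Per round, one would call distribution(active_starts(t)) using a schedule such as the cover sketched in the next subsection, draw π_t^learn from the returned q_t, and then call update with the realized bounded payoff vector.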

7.2 Pruning to O(log T) active start times

The pruning rule we advocate is deliberately time-based, not payoff-based. A canonical choice is to let $\mathcal R_t$ be a geometric cover of the past, for example
$$ \mathcal R_t \;=\; \big\{\, t - 2^k + 1 \;:\; k = 0, 1, \ldots, \lfloor \log_2 t\rfloor \,\big\}, $$
so that $|\mathcal R_t| = O(\log t)$. Intuitively, $\mathcal R_t$ keeps a “fresh” learner (start time t), a learner tracking approximately the last 2 rounds, another tracking approximately the last 4 rounds, etc. This structure is exactly what strong adaptivity needs: for any interval I = [r, s], there exists a start time $r' \in \mathcal R_s$ that is not too far from r (within a factor-two type slack), so one of the surviving learners is well-aligned with the interval. FLH2 analyses show that this slack only costs polylogarithmic overhead in the interval regret bound.

Computationally, pruning means we do not keep all historical learners alive; we only maintain those with start times in $\mathcal R_t$. When t increments, $\mathcal R_{t+1}$ is updated deterministically (add the start time t + 1, possibly delete some older start times to preserve the geometric structure). Since each active start time carries an independent weight vector over Π, the per-round update cost becomes
$$ O\big(|\Pi|\cdot|\mathcal R_t|\big) \;=\; O\big(|\Pi|\log T\big), $$
with the same order for memory. This is the sense in which the strongly-adaptive learner is “only a log-factor more expensive” than vanilla Hedge over Π.

A practical detail is initialization: when a new start time r = t is created, we set $w^{(t)}_t(\pi) \equiv 1$ (uniform prior), and similarly assign it a prescribed initial meta-weight. This matches the theoretical constructions and keeps the implementation simple.
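A sketch of the time-based cover from the display above (the exact schedule is a design choice; what matters is that it depends only on t, never on realized data):

```python
# Sketch: geometric cover of start times, R_t = { t - 2^k + 1 : k = 0, ..., floor(log2 t) }.
import math

def active_starts(t):
    return sorted({t - 2 ** k + 1 for k in range(int(math.log2(t)) + 1)})

print(active_starts(10))            # [3, 7, 9, 10]
print(len(active_starts(10 ** 6)))  # 20 active start times, even after a million rounds
```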

7.3 Why pruning is also the clean way to preserve stability

Our incentive wrapper requires a per-round stability parameter εt controlling how much qt can change when a single past report/objective is altered. Exponential weights is stable because it is a smooth function of cumulative payoffs: perturbing one round’s payoff vector by at most 2 (in the sup-norm) changes log-weights by at most O(η), hence changes probabilities by at most a multiplicative factor of exp(O(η)). The key question is how this property behaves under the FLH2 mixture and under pruning.

Two points matter.

  1. Mixtures degrade stability only through the number of components. If each component qt(r) is ε-stable and the meta-weights pt(r) are also produced by an εmeta-stable exponential rule, then qt = ∑rpt(r)qt(r) remains stable with a parameter on the order of ε + εmeta, up to a mild log |ℛt| factor coming from normalizations (a log-sum-exp Lipschitz effect). Thus, keeping |ℛt| = O(log T) is not only computationally convenient; it is exactly what keeps the stability envelope at $\varepsilon_t=O(\eta\,\mathrm{polylog}\,T)$ rather than εt = O(ηT).

  2. Data-dependent pruning can destroy stability, so we avoid it. A tempting engineering move is to prune “bad” start times whose meta-weight becomes small. But hard thresholding is discontinuous: a tiny perturbation to one past payoff can flip whether a learner is kept, causing a potentially large jump in qt and invalidating the weak-DP-like guarantee we need for incentives. In contrast, time-based pruning (geometric cover) is independent of reports/objectives, so it does not introduce any additional sensitivity. Put differently, we can safely analyze stability conditional on the deterministic schedule of active start times, and pruning contributes only via |ℛt|, not via brittle branching.

If one nonetheless wants performance-based pruning in practice, it should be implemented via a soft operation (e.g., continuously shrinking meta-weights, or using hysteresis bands rather than a single cutoff) so that the mapping from histories to qt remains Lipschitz. For our theoretical wrapper, the time-based rule is the cleanest choice.
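To make the distinction concrete, here is an illustrative contrast between a hard threshold and a smooth down-weighting of meta-weights. The threshold and width values are arbitrary placeholders, and neither function is part of the theoretical wrapper, which uses the time-based rule.

```python
import math

def hard_prune(meta_logw: dict, log_threshold: float = -10.0) -> dict:
    """Discontinuous rule: a tiny perturbation of one past payoff can flip whether a
    learner is kept, which is exactly the brittleness the stability argument must avoid."""
    return {r: x for r, x in meta_logw.items() if x >= log_threshold}

def log_sigmoid(z: float) -> float:
    # numerically stable log(1 / (1 + exp(-z)))
    return min(z, 0.0) - math.log1p(math.exp(-abs(z)))

def soft_gate(meta_logw: dict, log_threshold: float = -10.0, width: float = 2.0) -> dict:
    """Lipschitz alternative: multiply each meta-weight by a smooth sigmoid gate
    instead of deleting it, so the map from histories to q_t stays continuous."""
    return {r: x + log_sigmoid((x - log_threshold) / width) for r, x in meta_logw.items()}
```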

7.4 How pruning feeds into the incentive constraint through εt

The enforcement schedule {λt} is pinned down by an inequality of the form
$$ \lambda_t \;\ge\; c\,\frac{\bar\alpha}{\beta}\,\varepsilon_t, $$
for an absolute constant c. Computationally, this creates a design imperative: the platform must be able to upper bound εt online, because λt is part of the publicly announced mechanism at time t.

With pruning, such an upper bound is straightforward. Since |ℛt| ≤ C log T deterministically and since each active component is an exponential-weights map with a known learning-rate scale (hence a known per-round stability), we can set
$$ \varepsilon_t \;\le\; C'\,\eta_t \log T \quad\Rightarrow\quad \lambda_t \;=\; \min\!\left\{1,\; c\,\frac{\bar\alpha}{\beta}\,C'\eta_t\log T\right\}. $$
This choice is conservative but operational: it depends only on public parameters and the platform’s chosen learning-rate schedule, not on realized data. Importantly, it also prevents a subtle failure mode: if λt were allowed to depend on realized (and manipulable) stability diagnostics, agents could attempt to shift the platform into “low enforcement” states. The deterministic pruning schedule helps rule out that channel.
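Operationally, the schedule can be computed from public quantities only; the sketch below assumes the constants c and C′ are supplied by the analysis (their numerical values are not specified here).

```python
import math

def enforcement_probability(eta_t: float, T: int, alpha_bar: float, beta: float,
                            c: float = 1.0, C_prime: float = 1.0) -> float:
    """lambda_t = min{ 1, c * (alpha_bar / beta) * C' * eta_t * log T },
    using the deterministic upper bound eps_t <= C' * eta_t * log T."""
    eps_bound = C_prime * eta_t * math.log(T)
    return min(1.0, c * (alpha_bar / beta) * eps_bound)
```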

We emphasize a limitation that is easy to miss: although pruning keeps εt small, the resulting λt still reflects the economic primitives (ᾱ, β). Thus, even a perfectly engineered implementation cannot bypass the fundamental enforcement–adaptation tradeoff; it can only ensure that computational overhead does not add unnecessary slack to εt, and therefore does not inflate enforcement beyond what incentives truly require.

In sum, pruning is doing three jobs at once: it makes the strongly adaptive learner fast (O(|Π|log T) per round), it keeps the stability parameter within the polylogarithmic regime needed by our NIC wrapper, and it allows the platform to pre-commit to a transparent enforcement schedule that agents cannot endogenously game through the pruning logic itself.


8. Extensions and robustness: contextual mechanisms; infinite Π via covering numbers; approximate commitments; public vs. private history; brief discussion of coalition deviations.

8. Extensions and robustness

Our baseline presentation imposes two simplifications that are convenient for stating the main theorem: (i) the candidate class Π is finite, and (ii) each π ∈ Π is a single-round NIC mechanism in a discrete-type environment. Neither restriction is essential for the economic logic—what matters is that (a) the platform’s adaptive choice rule is stable with respect to small perturbations of history, and (b) there exists some “enforcement” mechanism πcom with a strictly positive penalty gap β. In this section we sketch how the construction and guarantees extend when we relax modeling assumptions that are often crucial in applications.

8.1 Contextual mechanisms (observables beyond bids)

In many platform settings, each round comes with a publicly observable context xt (item characteristics, market conditions, user features) that affects both the objective and the relevant mechanism. A minimal extension is to let each mechanism be a mapping
$$ \pi:\ (x_t,\, b_t)\ \mapsto\ \Delta(S), $$
and likewise write the platform objective as Gt(θt, st; xt). The platform’s learner then chooses a mechanism πtlearn as a function of the realized context xt, while agents best-respond given the announced mapping.

At a conceptual level, nothing changes in the incentive wrapper: manipulation arises because today’s report can shift tomorrow’s distribution over mechanisms. Stability continues to be the right sufficient condition, because it controls how much a single altered report (or objective observation) can shift the future maps qt(⋅ ∣ xt). Technically, we require a stability bound that is uniform over contexts (or holds for the realized context sequence), e.g. for each t, the map from histories to qt(⋅ ∣ xt) ∈ Δ(Π) is εt-stable.

On the learning side, contextuality changes what counts as an “expert.” If Π already indexes contextual mechanisms, we can continue to run exponential weights over Π under full information, using payoff vectors ĝt(π) = 𝔼s ∼ π(xt, θt)[Gt(θt, s; xt)] under truthful play. When the mechanism class has the structure “policy maps contexts to actions,” one can also use standard reductions from contextual learning to experts (e.g. defining experts as policies) at the cost of a log |Π| dependence, where |Π| is now the number of policies. The key point for our purposes is that stability is preserved when the learner is a smooth function of cumulative (context-indexed) payoffs, so the incentive constraint continues to take the same form λt ≳ (ᾱ/β)εt.
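As a small illustration of the contextual payoff signal, the Monte Carlo sketch below estimates ĝt(π) under truthful play. The interface (mechanisms returning an outcome sampler, a callable G_t, the sample count) is a hypothetical convention of this sketch, not an API defined by the paper.

```python
def contextual_payoff_vector(mechanisms, x_t, theta_t, G_t, n_samples: int = 100):
    """Estimate g_hat_t(pi) = E_{s ~ pi(x_t, theta_t)}[ G_t(theta_t, s; x_t) ] for each pi."""
    g_hat = []
    for pi in mechanisms:
        sampler = pi(x_t, theta_t)            # assumed to return an outcome sampler
        vals = [G_t(theta_t, sampler.sample(), x_t) for _ in range(n_samples)]
        g_hat.append(sum(vals) / n_samples)
    return g_hat
```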

8.2 Infinite Π: replacing log |Π| with metric entropy

In applications, Π is rarely finite; it may be a parametric family (reserve prices, scoring rules, matching weights) or even a nonparametric class. The relevant question is not finiteness per se but effective complexity, i.e. how accurately we must approximate Π to compete in objective value.

A standard route is to introduce a payoff-induced pseudometric on mechanisms. For a given realized sequence {(θt, Gt)}t ∈ I, define
$$ d_I(\pi,\pi') \;:=\; \max_{t\in I}\;\Big|\,\mathbb E_{s\sim \pi(\theta_t)}\big[G_t(\theta_t,s)\big] \;-\; \mathbb E_{s\sim \pi'(\theta_t)}\big[G_t(\theta_t,s)\big]\,\Big|. $$
Let 𝒩(ϵ, Π, dI) denote the ϵ-covering number under this metric (or an upper bound uniform over all intervals of a given length). If we run our learner over an ϵ-net Πϵ ⊆ Π, then the regret decomposition becomes “estimation/learning regret” relative to Πϵ plus an “approximation” term of order ϵ|I|. Formally, one expects bounds of the form
$$ \mathrm{Reg}_I(\theta,\mathsf M,\Pi;A)\ \lesssim\ \tilde O\!\Big(\sqrt{|I|\log|\Pi_\epsilon|}\Big)\ +\ \epsilon |I|\ +\ \sum_{t\in I}\lambda_t, $$
and log |Πϵ| can be replaced by log 𝒩(ϵ, Π, ⋅). The incentive wrapper is unchanged except that εt depends on the learning rate and the complexity term log |Πϵ|: making Πϵ finer increases both the learning burden and the enforcement required. Economically, this captures a natural trilemma: richer adaptation (finer mechanism classes) raises the value of manipulation unless enforcement intensity increases commensurately.
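A toy instance: for a one-parameter family of posted prices on [0, 1] whose per-round objective is 1-Lipschitz in the price, a uniform grid of spacing ϵ is an ϵ-net under dI, and the learner is then run over the finite grid. The example and the function name below are ours.

```python
def epsilon_net_prices(eps: float) -> list[float]:
    """Uniform grid over [0, 1] with spacing eps: an eps-net when the objective
    is 1-Lipschitz in the posted price, with |Pi_eps| = O(1/eps)."""
    n = int(1.0 / eps) + 1
    return [min(1.0, i * eps) for i in range(n)]

# log|Pi_eps| = O(log(1/eps)); choosing eps ~ |I|**(-0.5) balances the eps*|I|
# approximation term against the sqrt(|I| * log|Pi_eps|) learning term up to logs.
```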

8.3 Approximate commitments and noisy enforcement

The commitment mechanism πcom is a conceptual device: it is meant to be a policy the platform can invoke to credibly penalize deviations from truth-telling. In practice, the platform may only have access to approximate enforcement. Two examples are worth separating.

First, πcom may be only approximately DSIC, yielding a penalty gap β that is positive but small, or even a mechanism that guarantees a per-deviation loss only up to some slack ξ ≥ 0. In this case, the online incentive condition becomes approximate: we can ensure that truthful reporting is an ϵ-best response over the horizon, where ϵ scales like ∑tγi(t) ξ plus any shortfall of the enforcement term λtβ relative to the manipulability term O(ᾱεt). The design implication is direct: weak enforcement can be compensated by increasing λt, but only up to the point where commitment distortion dominates objective performance.

Second, enforcement may be noisy or probabilistic in a richer way than our simple lottery (e.g. audits that detect misreports with probability p, penalties that depend on realized signals, or “cooldown” policies that restrict future participation). These can often be reduced to our template by defining an effective penalty gap βeff equal to the minimum expected utility loss from deviating, conditional on the enforcement event. The wrapper then uses βeff in place of β. This clarifies what the model is—and is not—saying: we do not require a particular punishment technology, only that deviations have a uniformly bounded downside when enforcement triggers.
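To illustrate the reduction for the audit example, a minimal sketch under stated assumptions (a misreport is detected with probability detect_prob during the enforcement event and then costs at least a known utility loss):

```python
def effective_penalty_gap(detect_prob: float, min_loss_if_detected: float) -> float:
    """beta_eff: minimum expected utility loss from deviating, conditional on the
    enforcement event, for an audit that detects misreports with probability
    detect_prob and then imposes at least min_loss_if_detected."""
    return detect_prob * min_loss_if_detected

# The wrapper then uses beta_eff in place of beta, e.g. lambda_t >= c * (alpha_bar / beta_eff) * eps_t.
```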

8.4 Public versus private history

Our equilibrium condition is an ex-post Nash statement in an environment where agents observe the public history h<t, including past mechanisms and outcomes. This is a conservative informational assumption: if agents observe less, then the set of feasible contingent deviations shrinks, and manipulation is (weakly) harder.

Two robustness points are useful. First, if the platform can keep some randomness private (e.g. internal random seeds, tie-breaking, or partial obfuscation of the learner’s state), then the mapping from reports to future mechanism distributions becomes harder to exploit. Formally, private randomization tends to improve stability “as perceived by agents,” because even if qt changes slightly, agents may be unable to condition their strategy on that change. Second, if only outcomes st are public but the platform’s observed objective Gt (or its payoff statistic ĝt) is not publicly revealed, then agents face a partial-information manipulation problem; in many such cases, the effective long-sightedness bound ᾱ relevant for profitable steering can be lower than in the fully transparent benchmark.

That said, privacy and opacity are not free: reduced transparency can conflict with institutional constraints (audits, explainability, fairness), and it may also degrade the learner’s performance if it limits what signals can be used. Our analysis is best read as providing a worst-case baseline: if we can guarantee NIC under fully public histories, then quieter information regimes only make incentives easier to satisfy.

8.5 Brief remarks on coalition deviations

We have focused on unilateral deviations, yielding an online NIC guarantee (truth-telling forms an ex-post Nash equilibrium). In some environments—especially small markets or settings with repeated interaction among agents—coalitions may coordinate their reports to steer the learner. Extending the wrapper to coalitional incentive compatibility is possible in principle but comes with a predictable cost.

The stability tool we rely on has an analogue for group changes: the group-privacy property of differential privacy implies that if a map is ε-stable with respect to a single individual’s data, it is roughly kε-stable with respect to the joint perturbation of k individuals’ data (up to constants). Translating to our setting, a coalition of size k can increase its potential future gain by a factor proportional to k, so the enforcement schedule must scale accordingly:
$$ \lambda_t\ \gtrsim\ \frac{\bar\alpha}{\beta}\,k\,\varepsilon_t $$
to deter coordinated manipulation in the same worst-case sense. Economically, this is the familiar insight that “collusion is harder”: deterring coordinated strategic behavior requires either stronger punishments (larger β) or more frequent enforcement (larger λt), both of which raise the platform’s efficiency cost. In many platform designs, the practical response is to complement incentive-compatible mechanism selection with institutional anti-collusion tools (identity verification, participation constraints, or randomized audits), which can be interpreted as increasing β for coordinated deviations rather than merely turning up λt.

Taken together, these extensions reinforce the central message of the model: the learning component can be made computationally efficient and statistically adaptive, but incentive robustness is governed by a small set of economic primitives—how far ahead agents look (ᾱ), how sharply enforcement punishes deviations (β), and how sensitive the platform’s policy is to marginal changes in history (stability εt).


9. Discussion, limitations, and open problems: tightness in ᾱ, impossibility as ᾱ → Θ(T), empirical calibration of ᾱ, and bandit feedback as next step.

9. Discussion, limitations, and open problems

Our construction is intentionally modular: a strongly-adaptive learner supplies tracking guarantees against nonstationarity, while a simple “enforcement lottery” converts a stability bound into an online incentive constraint. This separation clarifies what we view as the core economic tradeoff—adaptation versus manipulability—but it also brings into focus several limitations of the theory and a set of natural directions where we do not yet have tight answers.

9.1 Is the $\sqrt{\bar\alpha}$ dependence tight?

The headline guarantee scales as
$$ \mathrm{Reg}_I \;\le\; \tilde O\Big(\big(\log|\Pi|+\beta^{-1}\big)\sqrt{\bar\alpha\,|I|}\Big), $$
so long-sightedness ᾱ enters through a $\sqrt{\bar\alpha}$ factor. At a high level, this scaling comes from balancing two margins. On the one hand, making the learner more stable (smaller εt) reduces the maximum future benefit an agent can obtain by perturbing history, which is on the order of O(ᾱεt) under our stability-to-manipulability reduction. On the other hand, more stability typically requires more conservative learning rates, which worsens tracking regret. The wrapper then forces λt to scale like λt ≳ (ᾱ/β)εt, adding an objective distortion term ∑t ∈ Iλt. Optimizing these competing terms yields the $\sqrt{\bar\alpha}$ dependence.
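To make the balancing explicit, here is a back-of-the-envelope calculation (ours, not a restatement of the formal proof), suppressing constants and polylogarithmic factors and treating εt ≍ η on the relevant interval:
$$ \mathrm{Reg}_I \;\lesssim\; \frac{\log|\Pi|}{\eta} \;+\; \eta\,|I| \;+\; \sum_{t\in I}\lambda_t, \qquad \lambda_t \;\asymp\; \frac{\bar\alpha}{\beta}\,\varepsilon_t \;\asymp\; \frac{\bar\alpha}{\beta}\,\eta. $$
Collecting the terms that grow with η gives η(1 + ᾱ/β)|I|; balancing this against log |Π|/η at $\eta^\star \asymp \sqrt{\log|\Pi|\,/\,\big((1+\bar\alpha/\beta)|I|\big)}$ yields
$$ \mathrm{Reg}_I \;\lesssim\; \sqrt{\log|\Pi|\,\big(1+\tfrac{\bar\alpha}{\beta}\big)\,|I|} \;\le\; \big(\log|\Pi|+\beta^{-1}+1\big)\sqrt{\bar\alpha\,|I|}, $$
using ᾱ ≥ 1 and $\sqrt{xy}\le(x+y)/2$, which recovers the headline rate up to constants and polylogarithmic factors.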

Whether this $\sqrt{\bar\alpha}$ is information-theoretically tight is not fully resolved in our general framework. There are two separate “tightness” questions. First, even holding incentives fixed, strongly adaptive regret for adversarial nonstationarity on every interval is known to require polylog(T) overheads, and $\tilde O(\sqrt{|I|\log|\Pi|})$ is essentially the right benchmark in full information. Second, once we demand online NIC against forward-looking agents, the learner must either (i) become sufficiently insensitive to single-round perturbations, or (ii) be backed by sufficiently frequent or severe enforcement. Our bound is tight in the sense that any improvement would need a different economic lever than “punishment dominates manipulability,” or a sharper connection between stability and the agent’s true dynamic ability to steer the learner.

One plausible route to improving the ᾱ dependence is to replace our worst-case ᾱ by an instance-dependent measure of influence: if an agent’s reports affect only a small portion of the learner’s state, or only weakly affect which mechanisms in Π are competitive, then the effective gain from manipulation can be far smaller than ᾱεt. Capturing this formally would require a refined stability notion that is local to the realized payoff geometry (e.g., margin conditions among experts, or stability in a restricted tangent cone). Economically, this would distinguish “agents who are patient” from “agents who are patient and pivotal,” and only the latter should force aggressive enforcement.

9.2 Why impossibility emerges as ᾱ → Θ(T)

The qualitative impossibility as ᾱ = Θ(T) is not an artifact of our proof technique; it reflects a genuine tension in any history-responsive mechanism. When agents are effectively fully patient, a unit of utility today can be traded for a shift in the platform’s future policy over Θ(T) rounds. If the learner is responsive enough to be useful—meaning that different histories lead to meaningfully different future qt—then there exists a steering strategy: the agent sacrifices a small current payoff to move the learner toward mechanisms that are favorable for her type in the future.

To deter this, the platform has two generic options, neither attractive in the fully patient limit. The first is to drive the policy toward near-history-independence by taking εt extremely small. But “extreme stability” forces learning rates to vanish with T, so adaptation becomes too slow to track nonstationarity, yielding (in the worst case) regret that is linear on some intervals. The second is to keep learning responsive while using enforcement with non-negligible probability—i.e., λt bounded away from zero—so that the expected deviation loss λtβ offsets the Θ(T) gain from manipulation. Yet if λt is constant, then the platform forgoes a constant fraction of rounds to the commitment mechanism, which again implies linear objective loss whenever πcom is suboptimal for the platform’s objective (as it typically is, since it is chosen for incentives, not for performance).

From a mechanism design perspective, this mirrors a familiar message: without either strong commitment power or strong informational frictions, fully forward-looking agents can render “learning-by-doing” incompatible with high welfare. Our model makes the dependence transparent: ᾱ is the bridge between dynamic incentives and the speed at which the platform can safely incorporate feedback. When ᾱ grows proportionally with T, the safe incorporation rate must shrink proportionally, unless enforcement becomes pervasive.

9.3 How should one calibrate ᾱ in practice?

Although ᾱ is defined abstractly (as in Huh–Kandasamy) to upper bound the dynamic leverage an agent obtains from influencing future play, it has an intuitive operational interpretation: it is an “effective horizon” of strategic interaction. In many platform settings, this is driven less by time discounting in the textbook sense and more by return probabilities: churn, session length, contract renewal, and the probability of encountering the mechanism again.

A pragmatic calibration pipeline can proceed in layers:

  1. Behavioral retention model. Estimate a return process for each user segment (or seller, driver, bidder): if the probability an agent is active k rounds ahead is ρ(k), then the discounted weight on future utilities behaves like γ(k) ≈ ρ(k)δ^k for some within-session discount δ. This immediately delivers an upper bound on the “total future weight” that any single deviation can affect. In stationary approximations, one often gets ∑k ≥ 1γ(k) ≈ 𝔼[remaining lifetime], making ᾱ comparable to an expected remaining number of interactions (see the sketch after this list).

  2. Strategic responsiveness checks. Even if agents return, they may not understand or exploit the learning rule. A conservative platform designer may still set ᾱ from the upper tail of estimated lifetimes, but one can complement this with evidence on whether agents condition behavior on future policy. Natural experiments—e.g., sudden changes in transparency of the platform’s learning rule, or exogenous “information shocks” about future pricing—can reveal whether users sacrifice current utility to influence future outcomes.

  3. Robust scheduling. Since ᾱ enters enforcement through λt ≳ (ᾱ/β)εt, misspecifying ᾱ downward is costly for incentives. A robust approach is to choose λt against a high-quantile ᾱ for strategic segments (power sellers, high-frequency bidders) and a lower ᾱ for low-frequency segments, effectively “targeting enforcement” where dynamic manipulation is plausible. This is also where policy considerations enter: targeted enforcement resembles risk-based auditing and may raise fairness concerns if not governed carefully.
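As referenced in item 1, a retention-based calibration can be sketched in a few lines; the churn rate, discount, and horizon below are illustrative numbers, not estimates.

```python
def alpha_bar_from_retention(rho, delta: float, horizon: int) -> float:
    """Approximate alpha_bar ~ sum_{k >= 1} rho(k) * delta**k, the effective
    weight an agent places on future interactions."""
    return sum(rho(k) * delta ** k for k in range(1, horizon + 1))

# Geometric churn: rho(k) = (1 - churn)**k, with a within-session discount delta.
churn, delta = 0.05, 0.99
alpha_bar = alpha_bar_from_retention(lambda k: (1 - churn) ** k, delta, horizon=10_000)
# Here alpha_bar ~ (1 - churn) * delta / (1 - (1 - churn) * delta) ~ 15.8, i.e. an
# effective horizon of roughly sixteen future interactions for this segment.
```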

Empirically, a key limitation is that ᾱ is not purely a preference parameter; it is a joint object capturing patience, return probabilities, and the agent’s ability to forecast and exploit the mechanism-selection dynamics. Translating this into a single bound is necessarily conservative, and we view developing sharper, behaviorally grounded analogues of ᾱ as an important measurement problem.

9.4 Bandit feedback and strategic data: the next step

Our baseline analysis effectively assumes full-information feedback for learning over Π: under truthful play, the platform can evaluate (or unbiasedly estimate) ĝt(π) for every π ∈ Π. Many platforms instead observe only the realized objective Gt(θt, st) for the chosen mechanism, i.e., bandit feedback. Moving from full information to bandit learning is not a minor technicality here, because both pillars of the construction are stressed simultaneously.

On the learning side, strongly adaptive regret under bandit feedback is substantially more delicate; even without strategic agents, one must control exploration costs on every interval. On the incentives side, standard bandit algorithms introduce extra randomness and state dependence (through exploration bonuses or sampling), which can increase the sensitivity of future action distributions to past outcomes—exactly the channel that strategic agents might try to exploit. Thus, the central open problem is to design a learner that is both (i) strongly adaptive under bandit feedback and (ii) provably stable in the weak-DP-like sense needed by the wrapper, with εt small enough to keep λt from becoming large.

There are promising ingredients—differentially private bandits, implicit exploration, and stability of posterior sampling under bounded likelihood ratios—but we do not yet have a clean end-to-end theorem matching the simplicity of the full-information case. More fundamentally, bandit feedback invites a richer strategic issue: agents may manipulate not only the learner’s state through reports, but also the learner’s data through actions that affect realized outcomes and thus the inferred payoffs of mechanisms. Separating “preference misreporting” from “outcome manipulation” is often impossible in applications, and a unified model would likely require new robustness notions beyond our current stability wrapper.

Stepping back, we see the results here as establishing a tractable benchmark: with sufficiently stable adaptation and a credible enforcement backstop, truthful reporting can coexist with strong adaptivity—up to an unavoidable dependence on how far ahead agents look. Understanding how far this benchmark extends under partial feedback, richer strategic channels, and empirically grounded patience models is, in our view, the most productive agenda opened by the framework.