
When Do Synthetic Auction Simulators Support Policy? Validity Bounds for Counterfactual Mechanism Rankings

Table of Contents

  1. 1. Introduction: Why synthetic simulators for policy in auctions; what can go wrong (strategic response, distribution shift); contributions and roadmap.
  2. 2. Setup and objects of interest: environments, mechanisms, bidder-response map, and policy metrics; define what it means to ‘rank mechanisms correctly.’
  3. 3. A clean large-market repeated-auction model: batch opportunities per step, budgets/CPA constraints, multi-slot exposure; clarify which pieces are abstracted and which are required for the theory.
  4. 4. Calibration as a certificate: define payoff-relevant moments and an IPM dℱ; show how moment-matching implies an upper bound on dℱ(P, P̂).
  5. 5. Main theory—continuity and misranking bounds: prove Lipschitz stability of performance under dℱ; combine with finite-sample concentration to get an explicit misranking probability bound; discuss when bounds may be vacuous.
  6. 6. Worked examples (closed-form where possible): (i) single-slot first-price with value-proportional bidding; (ii) myopic pacing; show explicit L and demonstrate the ranking condition |Δ*| > 2Lδ. Flag where numerical computation is needed (GSP multi-slot with complex bidder response).
  7. 7. Empirical demonstration in AuctionNet: construct misspecified simulators (time-of-day shifts, category mix shifts, tail distortions); compare mechanism rankings and estimate empirical Lipschitz constants; evaluate tightness/conservativeness of bounds.
  8. 8. Practical recipe for regulators/platforms: minimal disclosure for calibration, uncertainty sets, and stress tests; reporting standards for simulator-based policy claims.
  9. 9. Limitations and extensions: multi-agent non-convergence, set-valued bidder responses, richer objectives (fairness, privacy); how to strengthen certificates with targeted moments.

Content

1. Introduction: Why synthetic simulators for policy in auctions; what can go wrong (strategic response, distribution shift); contributions and roadmap.

Modern ad platforms routinely contemplate changes to their auction rules: reserve prices, ranking formulas, payment rules, disclosure policies, and budget controls. Each change is, in effect, a policy intervention in a strategic market that operates at enormous scale and high frequency. The practical difficulty is that the object the platform would like to optimize—revenue, welfare, advertiser surplus, or some weighted combination subject to budget and quality constraints—is typically observed only for the mechanism currently deployed. Running online experiments for every candidate change can be costly, slow, and sometimes infeasible when the change affects learning dynamics, contracts, or market thickness. These constraints motivate the pervasive use of synthetic simulators: estimated environments in which we can ``replay’’ counterfactual auctions, forecast outcomes, and rank alternative mechanisms before deployment.

At an intuitive level, simulation promises a controlled laboratory. We fit a generative model of auction opportunities from historical data, and we embed within it a behavioral model of how bidders will respond when the rules change. We then compute the platform’s objective under each candidate mechanism and select the best. This workflow is attractive because it converts an online policy problem into an offline statistical one: if the simulator is accurate, then policy evaluation becomes cheap, repeatable, and safe. Moreover, simulators are already intertwined with engineering practice: conversion models, click models, and budget pacing models are all, in some sense, components of a synthetic environment. The outstanding question is not whether simulation will be used, but rather when and why its recommendations can be trusted.

The core obstacle is that auctions are strategic and endogenous. A mechanism change does not merely alter the mapping from bids to allocations and payments; it alters the bids themselves. When advertisers are budget constrained, optimizing across many opportunities, and responding to feedback, small differences in the environment can lead to large differences in equilibrium or steady-state behavior. Even when advertisers follow relatively simple pacing rules, these rules can contain thresholds that amplify small distributional shifts. In this setting, a simulator can fail in a particularly damaging way: it can misrank two mechanisms. A mechanism that appears superior in simulation may perform worse in the real environment once bidders respond, and vice versa. Because mechanism selection is a discrete decision, even modest evaluation errors can produce large policy regret.

Two distinct sources of error interact to create this risk. First, there is distribution shift: the real stream of auction opportunities is generated by an unknown distribution, while the simulator uses an estimated distribution. The misspecification need not be dramatic to matter. If the simulator slightly misstates the frequency of high-value opportunities, the tail of cost signals, or exposure factors associated with user contexts, then the induced allocation and payments can shift. Second, there is behavioral model error: the simulator must encode how advertisers respond to the mechanism and to the opportunity stream. Even if we posit a reasonable behavioral model, its predictions can be fragile when strategic incentives are steep. Importantly, these two errors compound: the effect of getting the environment wrong is mediated through strategic response, so the bias in simulated performance is not merely a direct statistical bias but can be amplified by strategic behavior.

Our approach is to make this amplification explicit and to provide a minimal set of objects that can be audited and communicated. Rather than treating the simulator as an opaque black box, we ask for two ingredients that are natural in practice. The first is a calibration certificate that quantifies how close the simulator’s distribution is to the real environment, measured only through a class of payoff-relevant tests (for example, matched moments computed from disclosure-limited logs). The second is a stability (Lipschitz) constant that bounds how sensitive the policy objective is to changes in the environment distribution once bidder response is accounted for. Together, these two ingredients yield an interpretable guarantee: if the simulator matches the real environment up to tolerance and the induced outcomes are not too sensitive, then simulated rankings are reliable, and the probability of misranking decays rapidly as we increase the number of simulated episodes.

This emphasis on rankings reflects how simulation is often used. In many engineering contexts, the platform does not require a perfectly calibrated estimate of revenue or welfare in absolute terms; it needs to decide between a small number of candidate mechanisms. Ranking is also the object that is most directly threatened by bias: a constant shift in all performance estimates is harmless, while a sign error in the difference between two mechanisms is costly. Our analysis therefore focuses on the performance gap between mechanisms and on conditions under which the simulator preserves the sign of that gap. This framing also clarifies an operational tradeoff. Calibration effort, measured by how small a distance certificate can be achieved, and computational effort, measured by how many simulation runs are needed, substitute for one another only up to a point: simulation noise can be driven down by more runs, but systematic simulator bias cannot.

Conceptually, we organize simulator validity around a distance between distributions defined by a function class of tests. This choice is deliberate. Auction environments are high dimensional: values, costs, exposure factors, and context all interact with the mechanism. Insisting on a strong notion of distributional closeness (say, total variation over the full primitive vector) is typically unrealistic and unnecessary. What matters for policy is closeness in directions that affect payoffs once bidders respond. In many applications, these directions can be approximated by a collection of moments or summary curves (such as value distributions by segment, predicted conversion rates, or least-winning-cost statistics). This motivates using an integral probability metric tied to a chosen family of test functions, which can be matched from available data and can be reported as a single tolerance level.

The principal contribution is to connect these pieces in a way that yields a bound with clear managerial interpretation. The bound decomposes misranking risk into (i) a bias term driven by simulator distance multiplied by a sensitivity constant capturing strategic amplification, and (ii) a variance term driven by Monte Carlo error that shrinks with the number of simulated episodes. The resulting guarantee highlights a simple ``margin’’ criterion: correct ranking is easy when the true performance gap is large relative to the worst-case simulator bias, and it is fundamentally difficult when the gap is small or when strategic amplification is large. This is not merely a pessimistic message; it provides a diagnostic. When a simulator recommends switching mechanisms, we can ask whether the estimated improvement is robust to plausible misspecification as quantified by the calibration certificate, and whether the behavioral model implies stability or knife-edge sensitivity.

We also acknowledge what our framework does not resolve. A stability constant is, in general, not a primitive; it depends on the mechanism, the environment, and the behavioral response mapping, and it may be large in markets with tight budgets, binding constraints, or discontinuous bidding rules. Likewise, calibration is only as good as the disclosed data and the chosen moment set; a simulator can match a battery of tests and still be wrong in untested directions that matter under a new mechanism. Finally, real platforms face nonstationarity, delayed feedback, and cross-market spillovers that are difficult to capture in an episodic simulator. Our goal is therefore not to claim universal validity, but to provide a disciplined language for expressing what a simulator has been calibrated to match and what kinds of counterfactual claims it can support.

The remainder of the paper formalizes these ideas. We first define the repeated auction environment, the mechanisms under comparison, the bidder-response map, and the policy metrics used for evaluation, culminating in a precise notion of ``ranking mechanisms correctly.’’ We then state continuity and finite-sample results that turn a calibration certificate and a stability condition into a quantitative misranking bound, and we discuss how moment matching can be interpreted as a certificate in an appropriate metric. Throughout, we aim to keep the connection to practice explicit: which statistics can be calibrated from logs, which sensitivity quantities can be stress-tested in simulation, and how to report uncertainty about mechanism rankings in a way that is actionable for platform policy.


2. Setup and objects of interest: environments, mechanisms, bidder-response map, and policy metrics; define what it means to ‘rank mechanisms correctly.’

We describe a minimal repeated-auction environment in which a platform compares two candidate mechanisms using a synthetic simulator. The purpose of the setup is not to model every engineering detail of ad delivery, but to isolate the objects that enter counterfactual evaluation: (i) an opportunity process that generates per-auction primitives, (ii) a mechanism that maps bids into allocations, payments, and disclosures, (iii) a behavioral response that maps the mechanism and environment into a bidder policy profile, and (iv) a policy metric that the platform uses to rank mechanisms.

Time is indexed by decision steps t ∈ {1, …, T}, where each step contains a batch of auction opportunities j ∈ {1, …, mt} (in many applications mt is large and can be treated as constant). Each opportunity is summarized by a primitive random vector
zjt = (vjt, cjt, ejt, ξjt),
drawn from an unknown real distribution P. Here vjt denotes payoff-relevant values (possibly advertiser-specific, with vijt embedded in vjt), cjt denotes cost-relevant signals or features that enter the auction’s pricing or quality adjustments, ejt denotes exposure factors (slot multipliers, viewability, or other scaling of value and payments), and ξjt denotes auxiliary context (query, user, placement, and any disclosed features). We allow zjt to be high-dimensional; the point of the analysis is that the simulator need only be accurate in payoff-relevant directions. The platform has access to a calibrated simulator distribution P̂ over the same primitive space, fitted from disclosure-limited data; the simulator can be sampled to generate synthetic opportunity streams {zjt}.

A mechanism M specifies how bids are converted into outcomes at each opportunity, and what information is revealed to bidders. Concretely, for each opportunity (j, t) and bid profile (b1jt, …, bnjt), the mechanism produces (i) an allocation vector (x1jt, …, xnjt) with feasibility constraints (e.g. ∑i xijt ≤ L for L slots), (ii) a slot assignment ℓijt or, more generally, an exposure factor applied to each allocated advertiser, and (iii) payments pijt = PayM(b1jt, …, bnjt; ξjt). We use M ∈ {M1, M2} for the two candidate designs to be compared, but nothing in the definitions depends on there being only two mechanisms. Importantly for practice, M also includes a disclosure policy—what auction logs, feedback, and signals are made available to advertisers—because bidder learning and pacing depend on what they observe. We abstract disclosure into the behavioral response mapping below, but conceptually it is part of the mechanism.

Advertisers are indexed by i ∈ {1, …, n}. Each advertiser maintains an internal state sit at step t capturing remaining budget, pacing multipliers, learned parameters, and any history required by its strategy. On opportunity (j, t), advertiser i submits a bid
bijt = πi(sit, zjt),
where πi is a policy chosen from a class Πi (for example, equilibrium bidding rules under a particular model, convex pacing policies, or a learning algorithm run to stationarity). A policy profile is π = (π1, …, πn) ∈ Π := Π1 × ⋯ × Πn. We permit budget constraints and related delivery constraints through the dynamics of sit and feasibility requirements on total spend; for instance, a representative constraint is
$$ \sum_{t=1}^T \sum_{j=1}^{m_t} e_{ijt} x_{ijt} p_{ijt} \leq B_i, $$
with possible extensions such as CPA constraints. Rather than fixing a particular behavioral model, we summarize bidder adjustment to the environment and mechanism via a response map
R: (M, P) ↦ π,
which selects the policy profile implemented in the long run when the true opportunity distribution is P and the mechanism is M. In an equilibrium interpretation, R(M, P) selects an equilibrium in Π; in a learning interpretation, R(M, P) selects the stationary policies induced by a specified learning-and-pacing stack. This abstraction is deliberate: it isolates the locus of strategic amplification (small changes in P or M can change R(M, P) sharply), while keeping the evaluation object well-defined.

An episode induces a trajectory τ collecting all realized primitives, allocations, payments, exposures, and state updates across (t, j). From the platform’s perspective, policy evaluation reduces the trajectory to a scalar objective g(τ), assumed bounded as |g(τ)| ≤ B. This boundedness is natural for finite-horizon episodes with capped budgets and limited exposures, and it enables concentration arguments for Monte Carlo simulation. The population performance of mechanism M in environment P is
J(M, P) := 𝔼τ ∼ ℒ(M, P, R)[g(τ)],
where ℒ(M, P, R) denotes the induced distribution over trajectories when opportunities are drawn from P, the mechanism is M, and bidders play π = R(M, P). The choice of g can represent revenue, welfare, constraint violations, or any composite objective. A canonical example is
g(τ) = λ Rev(τ) + (1 − λ) SW(τ) − κ Viol(τ),
with λ ∈ [0, 1] and κ ≥ 0, but the analysis treats g generically.

Given a calibrated simulator P̂, the platform computes counterfactual estimates by simulating N i.i.d. episodes for each mechanism under P̂. For each M, the simulator must also specify bidder behavior; under our abstraction this is π = R(M, P̂), though one can also interpret R as selecting a fixed suite of bidder policies evaluated consistently across mechanisms. Let Ĵ(M, P̂) denote the sample average of g(τ) across the N simulated episodes. The platform’s recommendation is then the mechanism with the larger estimated performance, i.e. it compares Ĵ(M1, P̂) and Ĵ(M2, P̂). This workflow matches common practice: the simulator plays the role of an offline ``testbed’’ in which each mechanism is run repeatedly, producing an empirical distribution of outcomes.
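
As a concrete sketch of this workflow, the Python fragment below estimates Ĵ(M, P̂) by averaging g(τ) over simulated episodes and returns the estimated gap. The `sample_opportunities`, `respond`, and `episode_metric` callables and the `mechanism.run` interface are hypothetical placeholders for P̂, R, g, and the auction engine, not a prescribed implementation.

```python
import numpy as np

def estimate_performance(mechanism, sample_opportunities, respond, episode_metric,
                         n_episodes, rng):
    """Monte Carlo estimate of J(M, P-hat): average the episode metric g(tau)
    over n_episodes simulated episodes, holding the bidder response fixed."""
    policy = respond(mechanism, sample_opportunities)        # pi = R(M, P-hat)
    outcomes = []
    for _ in range(n_episodes):
        opportunities = sample_opportunities(rng)            # one episode's primitive stream
        trajectory = mechanism.run(opportunities, policy)    # allocations, payments, states
        outcomes.append(episode_metric(trajectory))          # g(tau), assumed bounded
    return float(np.mean(outcomes))

def estimated_gap(m1, m2, sample_opportunities, respond, episode_metric,
                  n_episodes=1000, seed=0):
    """Delta-hat = J-hat(M1, P-hat) - J-hat(M2, P-hat); recommend M1 if positive."""
    rng = np.random.default_rng(seed)
    j1 = estimate_performance(m1, sample_opportunities, respond, episode_metric, n_episodes, rng)
    j2 = estimate_performance(m2, sample_opportunities, respond, episode_metric, n_episodes, rng)
    return j1 - j2
```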

Since the platform ultimately selects between mechanisms, the central object is the true performance gap
Δ* := J(M1, P) − J(M2, P),
and its simulated analogue Δ̂ := Ĵ(M1, P̂) − Ĵ(M2, P̂). We say the simulator ranks the mechanisms correctly if sign(Δ̂) = sign(Δ*), with any tie-breaking rule treated as either correct when Δ* = 0 or counted as an error when Δ* ≠ 0 (our bounds will be stated in a way that is robust to ties). This definition emphasizes an operational point: absolute calibration of J(M, P) is secondary to preserving the ordering induced by Δ*. A uniform additive error that shifts both Ĵ(M1, P̂) and Ĵ(M2, P̂) is largely irrelevant for selection; what matters is the accuracy of their difference.

The setup above is intentionally agnostic about the internal form of P, the dimensionality of z, and the details of R. This generality is a strength for interpretation but creates a problem for theory: without additional structure, J(M, P) can be arbitrarily sensitive to small changes in P once R(M, P) is allowed to shift discontinuously. Our main results therefore hinge on two auditable summaries: a calibration certificate quantifying how close P̂ is to P in payoff-relevant directions, and a stability (Lipschitz) condition bounding the induced sensitivity of J to such deviations. To make these conditions concrete, the next section introduces a clean large-market repeated-auction model with batched opportunities, budgets (and optional CPA constraints), and multi-slot exposure, clarifying which ingredients are required for the ranking guarantees and which are abstracted into the response map R and the metric used for calibration.


3. A clean large-market repeated-auction model: batch opportunities per step, budgets/CPA constraints, multi-slot exposure; clarify which pieces are abstracted and which are required for the theory.

We now introduce a ``clean’’ repeated-auction model that will serve two roles. First, it makes concrete what an episode-level trajectory τ contains when there are many auctions per decision step, multiple slots with heterogeneous exposure, and hard delivery constraints such as budgets and (optional) CPA targets. Second, it clarifies what structure is used by the ranking guarantees: the theory will not require an explicit closed-form equilibrium, but it will require (i) bounded episode outcomes and (ii) a stability property linking performance to payoff-relevant deviations in the primitive distribution.

Fix a horizon of T decision steps. In step t there are mt auction opportunities, indexed by j ∈ {1, …, mt}. In the clean model we treat mt ≡ m as constant and large. Each opportunity (j, t) draws a primitive
zjt = (vjt, cjt, ejt, ξjt) ∼ P,
independently across (j, t); we emphasize that this i.i.d. assumption is a modeling convenience rather than a requirement of practice. The point of introducing it is that, with large m, the within-step averages of quantities like spend, exposure, and value concentrate, and many bidder adjustment rules (pacing, bid shading, reserve updates) can be viewed as responding to relatively smooth aggregates. When we later impose a Lipschitz condition for J(M, P), the large-market approximation provides intuition for why such stability can be plausible: a small perturbation of P changes many per-opportunity outcomes only slightly, and these changes average over a large batch.

We allow L ≥ 1 slots per opportunity. The mechanism assigns at most one slot per advertiser and at most one advertiser per slot. Let xijℓt ∈ {0, 1} indicate whether advertiser i is shown in slot ℓ at opportunity (j, t). Feasibility is
$$ \sum_{i=1}^n x_{ij\ell t} \le 1 \quad \forall \ell,\qquad \sum_{\ell=1}^L x_{ij\ell t} \le 1 \quad \forall i. $$
Exposure enters payoffs through an opportunity- and slot-specific multiplier. A convenient parameterization is
eijℓt = γℓ ⋅ qijt,
where γℓ is a deterministic slot effect (e.g. position-based click-through) and qijt is an opportunity-specific quality/viewability component bundled into ejt. This makes explicit that the same allocation decision can generate different economic impact depending on where an ad is shown and on contextual factors. Our analysis does not rely on this factorization; it suffices that exposures eijℓt are part of the primitive vector zjt, hence part of what the simulator must reproduce in payoff-relevant directions.

A mechanism M maps reported bids into an allocation and payment rule, potentially using cost-relevant signals cjt and auxiliary context ξjt. In many ad auctions, cjt summarizes quality scores, predicted CTR/CVR, reserve prices, or other adjustments that affect ranking and pricing. We write the payment charged to advertiser i in slot ℓ as
pijℓt = PayM, ℓ(b1jt, …, bnjt; cjt, ξjt),
and define realized (exposure-weighted) spend in opportunity (j, t) as
$$ \mathrm{Spend}_{ijt} \;=\; \sum_{\ell=1}^L e_{ij\ell t}\, x_{ij\ell t}\, p_{ij\ell t}. $$
This representation accommodates first-price, second-price, generalized second-price with quality, and many hybrid designs; it also accommodates disclosure differences, since what bidders observe affects their subsequent bids. We keep these details encapsulated in M rather than specifying a single parametric auction format, because the ranking guarantees will be stated uniformly over M ∈ {M1, M2}.

Each advertiser i has a budget Bi for the episode. Let vijℓt denote advertiser i’s value per unit exposure when shown in slot ℓ at opportunity (j, t) (embedded in vjt). Then advertiser i’s realized utility in the episode is
$$ U_i(\tau) \;=\; \sum_{t=1}^T \sum_{j=1}^m \sum_{\ell=1}^L e_{ij\ell t}\, x_{ij\ell t}\,\big(v_{ij\ell t}-p_{ij\ell t}\big), \qquad \text{s.t.}\quad \sum_{t,j} \mathrm{Spend}_{ijt} \le B_i. $$
To incorporate CPA-type delivery requirements, we can append a conversion indicator yijt ∈ {0, 1} (or a conversion count) to the trajectory, with conversion probabilities driven by ξjt and the identity/slot shown. A stylized CPA constraint takes the form
$$ \frac{\sum_{t,j,\ell} e_{ij\ell t}\, x_{ij\ell t}\, p_{ij\ell t}}{\sum_{t,j,\ell} e_{ij\ell t}\, x_{ij\ell t}\, y_{ij\ell t} + \varepsilon} \;\le\; \mathrm{CPA}_i, $$
where ε > 0 is a small regularizer to avoid division by zero. In practice, CPA constraints are often enforced via multipliers or dual variables (a ``bid downscaling’’ when spend per conversion exceeds target). For our purposes, the key point is simply that constraints introduce state variables and potential nonsmoothness: when a budget or CPA target binds, small changes in the environment can trigger discrete changes in effective bids. This is precisely why we separate the clean accounting of constraints (what enters τ) from the stability assumption (how sensitively R reacts).

We represent bidder adaptation through a per-step state sit that includes remaining budget Bit (and, if relevant, remaining conversion slack or dual variables), together with any learned parameters. A simple budget update is
$$ B_{i,t+1} \;=\; \bigg[B_{it} - \sum_{j=1}^m \mathrm{Spend}_{ijt}\bigg]_+,\qquad B_{i1}=B_i. $$
A canonical behavioral model in large markets is multiplicative pacing: advertiser i applies a multiplier αit ∈ [0, 1] to a base bid signal (such as value or value-per-click), with αit updated to hit the budget in expectation. For example, one may posit bids of the form
bijt = αit ⋅ βi(zjt),
where βi(⋅) is a base scoring rule (possibly learned) and αit adjusts to control spend. With CPA constraints, one can analogously include a second multiplier for conversion efficiency. We do not need the specific update equations; what matters is that the induced mapping from (M, P) to long-run behavior can often be viewed as the outcome of a well-posed optimization (e.g. convex pacing) or a stable learning process. This motivates treating R(M, P) as a single-valued map selecting the stationary policy profile π.
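
For concreteness, here is a minimal single-advertiser sketch of one decision step under multiplicative pacing; the second-price-style payment and the exponential multiplier update are illustrative assumptions rather than the specific behavioral model the theory commits to.

```python
import numpy as np

def pacing_step(alpha, budget_left, base_bids, market_prices, exposures,
                target_spend, step_size=0.1):
    """One step t for a paced advertiser over a batch of m opportunities.

    alpha          current pacing multiplier alpha_it in [0, 1]
    base_bids      beta_i(z_jt) for each opportunity in the batch, shape (m,)
    market_prices  price to beat for each opportunity (assumed exogenous here)
    exposures      exposure factors e for each opportunity
    target_spend   desired spend for this step (e.g. B_i / T)
    """
    bids = alpha * base_bids                                  # b_ijt = alpha_it * beta_i(z_jt)
    wins = bids >= market_prices                              # win iff the bid clears the price to beat
    spend = float(np.sum(exposures * market_prices * wins))   # second-price-like payment
    spend = min(spend, budget_left)                           # never spend past the remaining budget
    budget_left = max(budget_left - spend, 0.0)               # B_{i,t+1} = [B_it - spend]_+
    # Illustrative update: pace down when overspending the target, up when underspending.
    error = (target_spend - spend) / max(target_spend, 1e-9)
    alpha = float(np.clip(alpha * np.exp(step_size * error), 0.0, 1.0))
    return alpha, budget_left, spend
```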

Given realized allocations and payments, the platform computes an episode metric g(τ) (revenue, welfare, constraint violations, or a composite), and the population performance is J(M, P) = 𝔼[g(τ)] under the trajectory distribution induced by (M, P, R). In this clean model, two features are worth emphasizing. First, the horizon is finite and budgets cap spend, so it is natural to assume |g(τ)| ≤ B for some constant B (possibly depending on (T, m, {Bi})). Second, the multi-slot, batched structure makes explicit that g(τ) is typically a sum (or average) of many bounded per-opportunity contributions. This is what later delivers concentration of the simulation estimates Ĵ(M, P̂) in N episodes, holding the simulator fixed.

The clean model adds accounting detail but also clarifies what we will assume. We do not require: a particular auction format (first price versus second price), a parametric model for values, a closed-form equilibrium, or full observability of P. We also do not require that bidders literally solve the constrained utility maximization written above; bidders may be using heuristics or learning rules. All of that is abstracted into R, and the mechanism-specific disclosure policy is likewise treated as part of the input to R.

What we require for the theory is more modest and, importantly, auditable: (i) boundedness of g(τ) at the episode level; (ii) a notion of closeness between P and P̂ that is tailored to payoff-relevant primitives; and (iii) a stability condition asserting that, for the mechanisms under consideration, J(M, P) does not change too abruptly when P is perturbed in those payoff-relevant directions. The clean model above helps interpret these requirements: exposure factors, quality adjustments, and constraints determine which aspects of z are payoff-relevant, while budgets and pacing illustrate why stability is a substantive assumption rather than a tautology. The next section formalizes ``payoff-relevant closeness’’ via moment-based calibration and an integral probability metric, yielding a certificate δ that can be carried directly into finite-sample ranking guarantees.


4. Calibration as a certificate: define payoff-relevant moments and an IPM dℱ; show how moment-matching implies an upper bound on dℱ(P, P̂).

Our ranking guarantee will treat the simulator not as a black box whose outputs we ``trust,’’ but as an object that comes with an auditable discrepancy bound relative to the real environment. The role of calibration is therefore not merely predictive accuracy in a generic sense; it is to produce a number δ such that
dℱ(P, P̂) ≤ δ,
for a function class ℱ that is explicitly chosen to encode payoff-relevant aspects of the primitives z. This section formalizes (i) how we translate disclosure-limited real-world logs into a collection of moments, (ii) how those moments induce an integral probability metric (IPM), and (iii) why matching the moments implies a certificate on the IPM distance.

In the clean model, primitives z = (v, c, e, ξ) collect all objects that can matter for allocations, payments, and constraints once bidder behavior is fixed. In practice, the analyst rarely observes z directly. Instead, the platform discloses aggregates: marginal distributions of reserve prices or quality scores, exposure or click rates by segment, conditional win rates at a given bid, or spend and conversion summaries by time and category. We capture such disclosure through a vector of moment functions
ϕ(z) = (ϕ1(z), …, ϕK(z)) ∈ ℝK,
chosen so that (a) each ϕk(z) is observable or estimable from the real logs and (b) collectively, ϕ spans the directions in which deviations between P and P̂ are economically meaningful for the mechanisms under evaluation.

Concretely, ϕ can include simple moments such as $\mathbbm{1}\{\xi\in s\}$ for segments s (time-of-day×category bins), exposure factors e by slot, and distributions of cost-relevant signals c. It can also include objects that are directly tied to payments, such as (estimated) least-winning-cost curves: for a bid level b and segment s, define
$$ \phi_{b,s}(z)\;=\;\mathbbm{1}\{\xi\in s\}\cdot \mathbbm{1}\{\mathrm{LWC}(z)\le b\}, $$
where LWC(z) is the random ``price to beat’’ induced by competitors and mechanism rules as a function of the underlying opportunity (or a proxy for it). While LWC(z) may not be literally observed, platforms often disclose sufficient information to estimate it (e.g., win rate as a function of bid), and the simulator can be calibrated to reproduce those curves. The guiding principle is that ϕ should be rich enough that, once bidders respond via R, the induced trajectory distribution is similar in those aspects that drive g(τ).

Given a class ℱ of real-valued functions on primitives, recall the IPM
dℱ(P, Q) := supf ∈ ℱ|𝔼z ∼ P[f(z)] − 𝔼z ∼ Q[f(z)]|.
The choice of ℱ is the bridge between ``what we can check’’ (calibration diagnostics) and ``what we need’’ (a bound that can be propagated through the performance functional). For calibration, a particularly transparent construction is to take ℱ to be the unit ℓ1-ball of linear functionals of ϕ:
ℱϕ := {fw(z) = w⊤ϕ(z) : ∥w∥1 ≤ 1}.
This class is attractive for two reasons. First, it yields a closed-form distance in terms of the moment mismatch. Second, it makes the certificate interpretable: any weighted combination of calibrated moments (with total weight at most one) is matched within δ.

Let μ(P) := 𝔼P[ϕ(z)] ∈ ℝK and similarly μ(P̂). By definition,
$$ d_{\mathcal{F}_\phi}(P, Q) \;=\; \sup_{\|w\|_1 \le 1}\big|w^\top\big(\mu(P)-\mu(Q)\big)\big| \;=\; \big\|\mu(P)-\mu(Q)\big\|_\infty. $$
Therefore, if calibration ensures the coordinate-wise bound
∥μ(P) − μ(P̂)∥∞ ≤ η,
then we immediately obtain the certificate
dℱϕ(P, P̂) ≤ η ≡ δ.
The implication is mechanically simple but conceptually important: calibration can be framed as producing a distance certificate in a metric that is designed around payoff relevance. In particular, we do not claim that P̂ is globally close to P (e.g., in total variation); we only certify closeness along directions spanned by the selected moments.
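
A minimal sketch of this moment-based IPM, computed from samples of primitives; the example moment map with three segment indicators and an exposure column is hypothetical and stands in for whatever ϕ is chosen in practice.

```python
import numpy as np

def moment_ipm(samples_p, samples_q, phi):
    """IPM over the l1-ball of linear moment functionals: the sup-norm of the
    mean-moment mismatch between the two samples."""
    mu_p = np.mean(phi(samples_p), axis=0)   # estimate of mu(P) = E_P[phi(z)]
    mu_q = np.mean(phi(samples_q), axis=0)   # estimate of mu(Q)
    return float(np.max(np.abs(mu_p - mu_q)))

def phi_example(z):
    """Illustrative phi: one-hot segment frequencies plus a bounded exposure moment.
    z is an array of shape (n, 2): column 0 a segment id in {0, 1, 2}, column 1 an
    exposure factor in [0, 1]."""
    segments = np.eye(3)[z[:, 0].astype(int)]
    exposure = z[:, 1:2]
    return np.hstack([segments, exposure])
```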

In a real deployment, μ(P) is unknown and must be estimated from logs. Let μ̂ be an empirical estimate of μ(P) from a real dataset of size Nreal, and let μ(P̂) be computable from the simulator (either analytically or by running a large number of simulator draws). A practical certificate must account for statistical error:
$$ \big\|\mu(P)-\mu(\hat P)\big\|_\infty \;\le\;\underbrace{\big\|\widehat{\mu}-\mu(\hat P)\big\|_\infty}_{\text{empirical mismatch}} +\underbrace{\big\|\mu(P)-\widehat{\mu}\big\|_\infty}_{\text{estimation error}}. $$
The second term can be upper bounded with high probability using standard concentration, provided ϕk(z) are bounded (or sub-Gaussian after appropriate winsorization). For instance, if ϕk(z) ∈ [0, 1] then a union bound with Hoeffding yields: with probability at least 1 − ρ,
$$ \big\|\mu(P)-\widehat{\mu}\big\|_\infty \;\le\;\sqrt{\frac{\log(2K/\rho)}{2N_{\mathrm{real}}}}. $$
Defining
$$ \delta(\rho)\;:=\;\big\|\widehat{\mu}-\mu(\hat P)\big\|_\infty+\sqrt{\frac{\log(2K/\rho)}{2N_{\mathrm{real}}}}, $$
we obtain a high-probability calibration certificate dℱϕ(P, P̂) ≤ δ(ρ). This is the quantity that will later be multiplied by a Lipschitz constant to bound simulator-induced bias in mechanism performance.
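
The certificate δ(ρ) is directly computable from the disclosed quantities; a minimal sketch, assuming each ϕk takes values in [0, 1] as required by the Hoeffding step above.

```python
import numpy as np

def calibration_certificate(mu_hat_real, mu_sim, n_real, rho):
    """delta(rho) = empirical moment mismatch + concentration term, so that the
    moment IPM between P and P-hat is at most delta(rho) with probability 1 - rho
    (assuming every phi_k is bounded in [0, 1])."""
    mu_hat_real = np.asarray(mu_hat_real, dtype=float)  # moments estimated from real logs
    mu_sim = np.asarray(mu_sim, dtype=float)            # moments computed under the simulator
    k = mu_hat_real.size
    empirical_mismatch = float(np.max(np.abs(mu_hat_real - mu_sim)))
    estimation_error = float(np.sqrt(np.log(2 * k / rho) / (2 * n_real)))
    return empirical_mismatch + estimation_error
```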

The definition of ϕ is where economic judgment enters. If ϕ omits a primitive feature that meaningfully shifts bidder incentives or pricing (e.g., the joint distribution of quality scores and slot exposures, or the tail behavior of high values that drive auction competition), then dℱϕ(P, P̂) can be small while mechanism performance differs substantially. Conversely, an overly rich ϕ may be impossible to estimate accurately under disclosure constraints, leading to a large δ and vacuous downstream guarantees. We view this as the central tradeoff: the moment set must be rich enough to capture payoff-relevant variation yet parsimonious enough to be estimated tightly from disclosed data. In applications, we recommend building ϕ from (i) segmentation moments that capture heterogeneity in ξ, (ii) exposure and click-through structure by slot and segment, and (iii) pricing landscape moments (win-rate or least-winning-cost curves) that are directly tied to payments and hence to both bidder utility and platform revenue.

While ℱϕ yields an especially transparent certificate, the same logic extends to other IPMs that may better capture nonlinear payoff dependence. For example, one can take ℱ to be a bounded-Lipschitz class (leading to Wasserstein-type control on z) or an RKHS unit ball (yielding maximum mean discrepancy). These alternatives can be more expressive but typically produce certificates that are harder to audit from limited logs. A pragmatic compromise is to begin with a linear-moment certificate (which is interpretable and implementable) and then perform sensitivity stress tests: perturb P̂ within the calibrated moment tolerances and measure how much Ĵ(M, P̂) varies. Such tests do not replace the formal bound, but they help diagnose whether the chosen ϕ is plausibly aligned with payoff relevance and whether the implied Lipschitz constant is empirically moderate.

Finally, we emphasize a limitation that will matter when interpreting the main theorem. A small δ controls discrepancies only through the lens of the chosen test class ℱϕ. If bidder behavior reacts sharply to features of P that are not captured by ϕ (for instance, rare events that trigger budget exhaustion earlier, or correlations across opportunities that defeat the i.i.d. approximation), then the performance functional need not be stable in dℱϕ. Our approach therefore separates concerns cleanly: calibration provides a verifiable distance bound δ, and the next section will impose (and discuss the plausibility of) a Lipschitz stability condition that converts this bound into a bound on counterfactual performance error and, ultimately, a misranking probability guarantee.


5. Main theory—continuity and misranking bounds: prove Lipschitz stability of performance under dℱ; combine with finite-sample concentration to get an explicit misranking probability bound; discuss when bounds may be vacuous.

We now formalize the logic that turns a calibration certificate into a mechanism-ranking guarantee. Conceptually, the analyst faces two distinct sources of error when using a calibrated simulator to compare mechanisms. The first is simulator bias: even if we could compute J(M, P̂) exactly, it may differ from J(M, P) because P̂ ≠ P. The second is Monte Carlo error: in practice we estimate J(M, P̂) using a finite number of simulated episodes. Our main bound is obtained by (i) propagating the calibration discrepancy through the performance functional via a continuity (Lipschitz) condition, and (ii) controlling Monte Carlo error via concentration for bounded outcomes.

The key assumption is that, for the mechanisms under consideration and the bidder-response mapping R, the induced population performance
J(M, P) = 𝔼τ ∼ ℒ(M, P, R)[g(τ)]
does not change too sharply when the opportunity distribution changes slightly in a payoff-relevant metric. Formally, we assume there exists a function class ℱ and a constant L < ∞ such that, for all distributions P, Q over primitives and each M ∈ {M1, M2},
|J(M, P) − J(M, Q)| ≤ L ⋅ dℱ(P, Q),
where dℱ is the IPM introduced above. The constant L is economically interpretable: it measures how much counterfactual performance can move per unit of payoff-relevant distributional discrepancy, once bidder response is taken into account. When bidder behavior is essentially fixed (or responds smoothly), L is typically moderate; when small changes in primitives flip budget-binding status, change pacing multipliers discontinuously, or move the system across equilibrium-selection boundaries, L can be large.

Given a calibration certificate dℱ(P, P̂) ≤ δ, the Lipschitz condition implies a direct bias bound for each mechanism:
|J(M, P) − J(M, P̂)| ≤ Lδ   for each M ∈ {M1, M2}.
Because ranking depends on a difference of performances, the corresponding bound relating the true gap and its simulator counterpart,
Δ* := J(M1, P) − J(M2, P)   and   Δsim := J(M1, P̂) − J(M2, P̂),
follows by the triangle inequality:
|Δ* − Δsim| ≤ 2Lδ.
This bound makes precise the sense in which calibration is useful for ranking: it creates a bias band of width 2Lδ inside which the simulator may misstate the gap even with infinite simulation.

We next relate Δsim to its empirical estimate computed from N i.i.d. simulated episodes under P̂. Let Ĵ(M, P̂) be the sample average of g(τ) across the N simulator trajectories for mechanism M, and define the estimated gap
Δ̂ := Ĵ(M1, P̂) − Ĵ(M2, P̂).
Under bounded outcomes, |g(τ)| ≤ B, Hoeffding’s inequality implies that each Ĵ(M, P̂) concentrates around J(M, P̂) at rate $O(1/\sqrt{N})$. Applying concentration to both mechanisms and union bounding yields, for any ε > 0,
ℙ(|Δ̂ − Δsim| ≥ ε) ≤ 4 exp(−Nε²/(8B²)).
Combining the bias band with this concentration inequality gives a sign-recovery guarantee. Intuitively, the simulator must (a) be calibrated tightly enough that the bias band 2Lδ is smaller than the true gap, and (b) be run for enough episodes that Monte Carlo noise is unlikely to flip the sign.

Putting the pieces together, we obtain an explicit upper bound on the probability of selecting the wrong mechanism:
ℙ(sign(Δ̂) ≠ sign(Δ*)) ≤ 4 exp(−N(|Δ*| − 2Lδ)+²/(8B²)),
where (x)+ := max {x, 0}. The structure of this bound cleanly separates bias from variance: if |Δ*| ≤ 2Lδ, then even infinite simulation cannot guarantee correct ranking (the bound becomes trivial). If |Δ*| > 2Lδ, then the misranking probability decays exponentially in N with an effective margin |Δ*|−2Lδ.

A direct corollary inverts this bound into a sample-size requirement. Fix a target misranking probability ρ ∈ (0, 1) and suppose |Δ*| > 2Lδ. It suffices to take
N ≥ 8B² log(4/ρ)/(|Δ*| − 2Lδ)².
This requirement clarifies the practical tradeoff: tightening calibration (smaller δ) and improving stability (smaller L) can substitute for computational budget (smaller required N).
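
A minimal sketch of both calculations, using the constants in the bound above (Hoeffding with |g(τ)| ≤ B and a union bound over the two mechanisms); in practice |Δ*|, L, and δ would be replaced by estimated or stress-tested surrogates.

```python
import numpy as np

def misranking_bound(true_gap, lipschitz, delta, n_episodes, outcome_bound):
    """Upper bound 4 exp(-N (|gap| - 2 L delta)_+^2 / (8 B^2)) on the misranking probability."""
    margin = max(abs(true_gap) - 2.0 * lipschitz * delta, 0.0)
    if margin == 0.0:
        return 1.0  # bias band swallows the gap: the guarantee is vacuous
    return float(min(1.0, 4.0 * np.exp(-n_episodes * margin**2 / (8.0 * outcome_bound**2))))

def required_episodes(true_gap, lipschitz, delta, outcome_bound, rho):
    """Smallest N making the bound at most rho, valid when |gap| > 2 L delta."""
    margin = abs(true_gap) - 2.0 * lipschitz * delta
    if margin <= 0.0:
        return float("inf")  # no amount of simulation certifies the ranking
    return int(np.ceil(8.0 * outcome_bound**2 * np.log(4.0 / rho) / margin**2))
```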

The condition |Δ*| > 2Lδ is the central ``ranking margin’’ requirement. It can fail for three economically distinct reasons. First, the mechanisms may truly be close in performance in the real environment (|Δ*| small); this is not a statistical failure but a substantive one—the platform is trying to distinguish near-ties. Second, calibration may be coarse (large δ), often because disclosure limitations restrict the richness of ϕ or because the real dataset used to estimate moments is small. Third, strategic amplification may be severe (large L), meaning that even small deviations in payoff-relevant primitives induce large changes in bidder policies and hence in outcomes.

The third case deserves emphasis: vacuity can arise even if the simulator matches many moments well. In repeated auctions with budgets, small shifts in the tail of values or in exposure factors can change which advertisers become budget-constrained and when, which can sharply alter equilibrium pacing multipliers and thus prices throughout the episode. Similarly, when R selects among multiple equilibria, small perturbations in P can move the selected equilibrium discontinuously. Our framework is explicit about this limitation: the certificate δ is only as useful as the stability constant L is moderate.

In applications, L is rarely available in closed form for realistic multi-slot mechanisms with complex bidder response. We therefore view the stability condition as a working assumption that should be audited. One practical approach is to compute an empirical upper envelope by stress-testing the simulator: generate perturbed distributions P̃ that remain within the calibrated moment tolerance (so that dℱ(P̂, P̃) is controlled), recompute simulated performance, and measure the worst observed slope
$$ \hat L\;:=\;\max_{M\in\{M_1,M_2\}}\;\max_{\tilde P\in\mathcal{U}} \frac{|J(M,\hat P)-J(M,\tilde P)|}{d_{\mathcal{F}}(\hat P,\tilde P)}, $$
for an appropriately chosen neighborhood 𝒰. While such a procedure does not certify a global Lipschitz constant, it can reveal whether the mechanism comparison is locally stable in the directions of uncertainty that calibration leaves unresolved.
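
A minimal sketch of this audit, assuming hypothetical `estimate_J` and `ipm` callables (a high-precision simulated-performance estimator and the calibrated-moment IPM) together with a pre-built list of perturbed environments inside the tolerance.

```python
def empirical_lipschitz(mechanisms, base_env, perturbed_envs, estimate_J, ipm):
    """Worst observed slope of J with respect to the IPM over the perturbation set.
    This is a local audit of sensitivity, not a proof of global stability."""
    slopes = []
    for mech in mechanisms:
        j_base = estimate_J(mech, base_env)
        for env in perturbed_envs:
            dist = ipm(base_env, env)
            if dist <= 0.0:
                continue  # skip degenerate perturbations
            slopes.append(abs(estimate_J(mech, env) - j_base) / dist)
    return max(slopes) if slopes else float("nan")
```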

The next section illustrates these ideas in settings where we can compute L explicitly or nearly so. In a single-slot first-price auction with exogenous proportional bidding, continuity reduces to bounding expectations of a bounded Lipschitz function, making the constant L transparent. We then contrast this with pacing-based response rules, where L reflects how sharply spending constraints bite. These examples make the ranking condition |Δ*| > 2Lδ concrete and clarify where numerical computation becomes unavoidable (e.g., multi-slot GSP with rich bidder adaptation).


6. Worked examples (closed-form where possible): (i) single-slot first-price with value-proportional bidding; (ii) myopic pacing; show explicit L and demonstrate the ranking condition |Δ*| > 2Lδ. Flag where numerical computation is needed (GSP multi-slot with complex bidder response).

We now illustrate how the ranking condition |Δ*| > 2Lδ becomes concrete in stylized environments. The role of these examples is not realism per se; rather, they separate two distinct channels through which L can become large. In the first example bidder behavior is exogenous, so L is essentially a property of a bounded payoff function. In the second example bidder behavior depends on P through a budget-driven pacing rule, and L acquires an additional ``implicit’’ term that can blow up when budgets are close to binding.

Throughout, let $\bar m:=\sum_{t=1}^T m_t$ denote the total number of auction opportunities per episode (treated as fixed for exposition), and assume primitives are bounded: vi ∈ [0, v̄] and exposure factors e ∈ [0, ē]. We take dBL to be the bounded-Lipschitz IPM over per-opportunity primitives z (so that dBL(P, Q) = sup∥f∥BL ≤ 1|𝔼P[f] − 𝔼Q[f]| with ∥f∥BL := sup |f| + Lip(f)). This choice is convenient because it yields immediate continuity once we verify boundedness and Lipschitzness of the per-opportunity payoff map.

Consider a single-slot first-price auction with n bidders and per-opportunity primitive
z = (v1, …, vn, e),
where vi is bidder i’s value and e is an exposure multiplier. Suppose bids are exogenously proportional to value,
bi = αvi,   α ∈ (0, 1],
so there is no strategic dependence of π on P beyond drawing z ∼ P. Let i*(z) ∈ arg maxivi denote the winner. Under first-price, per-opportunity revenue is
rFP(z) = ebi*(z) = eα maxivi,
and per-opportunity welfare is w(z) = emaxivi. If we take episode-level performance to be total revenue, then
J(MFP, P) = m̄ 𝔼z ∼ P[rFP(z)].

To obtain an explicit Lipschitz constant, endow the primitive space with the metric
d(z, z′) := maxi|vi − vi′| + |e − e′|.
Then maxivi is 1-Lipschitz under maxi|vi − vi′|, and multiplication by e adds only bounded linear sensitivity. A simple bound is
|rFP(z) − rFP(z′)| ≤ α ē maxi|vi − vi′| + α v̄ |e − e′| ≤ α(v̄ + ē) d(z, z′).
Moreover 0 ≤ rFP(z) ≤ α v̄ ē. Hence
∥rFP∥BL ≤ α v̄ ē + α(v̄ + ē).
By the defining duality of dBL, we obtain
|J(MFP, P) − J(MFP, Q)| ≤ m̄ ∥rFP∥BL dBL(P, Q),
so one admissible choice is
LFP := m̄ α (v̄ ē + v̄ + ē).
A parallel calculation applies to other single-slot payment rules when bidding remains exogenous. For example, under second-price with the same proportional bids, per-opportunity revenue becomes rSP(z) = eαv(2) where v(2) is the second-highest value. Since v(2) is also 1-Lipschitz in (v1, …, vn) under the metric, continues to hold (up to constants) with v(2) in place of maxivi.

The ranking condition is then explicit. If we compare M1 = FP and M2 = SP and take revenue as the metric, the true gap satisfies
Δ* = J(MFP, P) − J(MSP, P) = m̄ α 𝔼P[e (v(1) − v(2))],
where v(1) is the top order statistic. Our general theory says that, given dBL(P, P̂) ≤ δ, the simulator can invert the ranking only if the true gap is within the bias band:
|Δ*| ≤ 2Lδ,
with L taken as (for instance) max {LFP, LSP}. Thus, when values are diffuse enough that 𝔼[v(1) − v(2)] is materially positive, and calibration is tight, correct ranking is easy; when competition is intense and the top two values are close, the mechanisms are intrinsically hard to distinguish no matter how many simulator runs we perform.
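
A small numerical sketch of this first-price versus second-price comparison under value-proportional bidding; the uniform draws and the constants α, m̄, v̄, ē, and the certificate δ are arbitrary illustrations, not calibrated values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bidders, alpha, m_bar, v_bar, e_bar = 4, 0.8, 1000, 1.0, 1.0

def estimated_true_gap(n_draws=100_000):
    """Monte Carlo estimate of Delta* = m_bar * alpha * E[e (v_(1) - v_(2))]."""
    v = rng.uniform(0.0, v_bar, size=(n_draws, n_bidders))
    e = rng.uniform(0.0, e_bar, size=n_draws)
    top_two = np.sort(v, axis=1)[:, -2:]            # columns: second-highest, highest
    return m_bar * alpha * float(np.mean(e * (top_two[:, 1] - top_two[:, 0])))

L_fp = m_bar * alpha * (v_bar * e_bar + v_bar + e_bar)   # admissible constant from above
delta = 1e-4                                             # hypothetical d_BL certificate
gap = estimated_true_gap()
print(f"gap {gap:.1f}, bias band {2 * L_fp * delta:.2f}, "
      f"ranking certified: {abs(gap) > 2 * L_fp * delta}")
```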

We next introduce a minimal pacing model in which bidder behavior depends on P through a budget target. The purpose is to expose where discontinuities and near-discontinuities enter L.

Consider a single representative advertiser facing an exogenous per-impression ``market price’’ y (interpretable as the highest competing bid) and a value v. The primitive is z = (v, y, e) with v ∈ [0, v̄], y ∈ [0, ȳ], e ∈ [0, ē]. The advertiser uses a pacing multiplier μ ∈ [0, 1] and bids b = μv, winning iff μv ≥ y. If the advertiser wins, suppose payment is the market price y (a second-price-like simplification), so per-opportunity spend is
s(z; μ) = ey1{μv ≥ y}.
Let s̄ := B/m̄ denote the target spend per opportunity implied by the episode budget B. A myopic pacing rule chooses μ = μ(P) to satisfy the stationarity condition
SP(μ) := 𝔼z ∼ P[s(z; μ)] = s̄,
truncated to [0, 1] if needed. Take platform revenue as the performance metric; then per-opportunity revenue is simply r(z; μ) = s(z; μ) and
J(M, P) = m̄ 𝔼z ∼ P[r(z; μ(P))] = m̄ SP(μ(P)) = m̄ s̄ = B,
which is degenerate by construction. To see nontrivial dependence on P, one can instead evaluate welfare (or advertiser value), e.g.
w(z; μ) = ev1{μv ≥ y},   JSW(M, P) = m̄ 𝔼P[w(z; μ(P))].
Here P affects outcomes both directly through the distribution of (v, y, e) and indirectly through μ(P).

The Lipschitz bound decomposes accordingly. For two environments P, Q,
$$ \big|J_{\mathrm{SW}}(M,P)-J_{\mathrm{SW}}(M,Q)\big| \;\le\; \underbrace{\bar m\,\mathbb{E}_{P}\big[\,|w(z;\mu(P))-w(z;\mu(Q))|\,\big]}_{\text{indirect: response shift}} \;+\; \underbrace{\bar m\,\big|\mathbb{E}_{P}[w(z;\mu(Q))]-\mathbb{E}_{Q}[w(z;\mu(Q))]\big|}_{\text{direct: environment shift}}. $$
The second term is ``direct’’ and is controlled by the BL norm of $w(\cdot;\mu)$ (uniformly over $\mu\in[0,1]$). The first term is ``indirect’’ and requires controlling |μ(P) − μ(Q)|, i.e., how sensitive the pacing multiplier is to the environment.

A sufficient condition for stability is that the spend map SP(μ) is uniformly steep: assume SP is differentiable and there exists κ > 0 such that
SP′(μ) ≥ κ   for all μ ∈ [0, 1] (and likewise for SQ).
Then an implicit-function argument yields a Lipschitz bound on the pacing multiplier:
|μ(P) − μ(Q)| ≤ (1/κ) supμ ∈ [0, 1]|SP(μ) − SQ(μ)|.
If calibration controls the right-hand side via an IPM certificate (for instance, by including s(z; μ)-type moments in ϕ on a grid of μ values), then |μ(P) − μ(Q)| ≲ δ/κ, and substituting into the decomposition above produces an overall Lipschitz constant of the schematic form
Lpacing ≤ Cdirect + Cindirect/κ,
for explicit problem-dependent constants Cdirect, Cindirect determined by the boundedness of v, y, e and the Lipschitz properties of the indicator-smoothed class used in dℱ. The economic point is that the indirect term scales like 1/κ: when the spend function is flat, a tiny perturbation in P can cause a large change in μ, which then propagates to welfare and (in richer models) to prices.

A simple closed form makes this tangible. Suppose e ≡ 1, v ≡ v̄ is constant, and y ∼ Unif[0, ȳ]. Then
$$ S_P(\mu) \;=\; \mathbb{E}\big[y\,\mathbf{1}\{y\le \mu\bar v\}\big] \;=\; \int_{0}^{\min\{\mu\bar v,\bar y\}} y\frac{1}{\bar y}\,dy \;=\; \frac{\min\{\mu\bar v,\bar y\}^2}{2\bar y}. $$
For the interior region μv̄ ≤ ȳ, we have SP(μ) = μ²v̄²/(2ȳ), so SP′(μ) = μv̄²/ȳ and κ is proportional to μ. When the budget is tight (small s̄), the implied μ is small, the slope is small, and the bound above predicts a large Lpacing. This is precisely the budget-kink phenomenon: near the ``no-spend’’ region, small distributional changes can flip many marginal wins and losses, forcing large pacing adjustments.
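
A minimal numerical sketch of this closed form: it solves the stationarity condition on the interior region and reports the 1/κ amplification factor as the per-opportunity spend target s̄ shrinks (all constants are illustrative).

```python
import numpy as np

def pacing_multiplier(target_spend, v_bar=1.0, y_bar=1.0):
    """Solve S_P(mu) = (mu * v_bar)^2 / (2 * y_bar) = target_spend on the interior region."""
    mu = np.sqrt(2.0 * y_bar * target_spend) / v_bar
    return float(np.clip(mu, 0.0, 1.0))

def amplification(target_spend, v_bar=1.0, y_bar=1.0):
    """1 / kappa with kappa = S_P'(mu) = mu * v_bar^2 / y_bar; grows as budgets tighten."""
    mu = pacing_multiplier(target_spend, v_bar, y_bar)
    kappa = mu * v_bar**2 / y_bar
    return 1.0 / kappa if kappa > 0 else float("inf")

for s_bar in [0.2, 0.02, 0.002]:   # tighter and tighter per-opportunity budgets
    print(f"s_bar={s_bar:<6} mu={pacing_multiplier(s_bar):.3f} 1/kappa={amplification(s_bar):.1f}")
```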

These examples are deliberately sparse: they suppress position effects, cross-advertiser interactions through budgets, and equilibrium multiplicity. In multi-slot GSP (or variants with reserve prices, quality weights, and disclosure-dependent learning), the analogue of μ(P) becomes a high-dimensional object (pacing multipliers, bid-shading functions, or policy-network parameters), and SP becomes a coupled fixed-point system. In such settings we generally cannot compute L analytically, and even verifying a condition like uniform steepness of the spend map is nontrivial. This is exactly where the stress-testing approach becomes operational: we estimate an empirical upper envelope for the local slope of J with respect to calibrated perturbations of P̂, and then interpret |Δ*| > 2Lδ as a practical margin check rather than a closed-form theorem input. The next section implements this logic in AuctionNet by constructing controlled misspecifications and directly measuring how rankings and empirical Lipschitz constants respond.


7. Empirical demonstration in AuctionNet: construct misspecified simulators (time-of-day shifts, category mix shifts, tail distortions); compare mechanism rankings and estimate empirical Lipschitz constants; evaluate tightness/conservativeness of bounds.

In environments as rich as those targeted by AuctionNet, the difficulty is not writing down a bound of the form |Δ*| > 2Lδ, but making L and δ interpretable when bidder response R is implemented by a coupled learning-and-pacing stack and primitives z = (v, c, e, ξ) are high-dimensional. Our empirical goal is therefore modest and operational: we use AuctionNet to (i) construct controlled misspecifications of a calibrated simulator, (ii) measure when mechanism rankings change under those misspecifications, and (iii) back out an empirical Lipschitz envelope that can be plugged into the margin condition as a stress-test statistic.

Because we cannot observe the true P in a fully realistic deployment, we adopt the standard evaluation trick: we treat a high-fidelity, well-calibrated simulator P0 as the ``ground truth’’ environment and then generate alternative simulators Q that are intentionally misspecified. Concretely, we fix a bidder-response map R (e.g., a budget-paced policy learner run to stationarity under each mechanism) and a set of candidate mechanisms {M1, M2} (for instance, GSP with a reserve versus first-price with a different disclosure rule, or two reserve/quality-weight configurations). For each environment E ∈ {P0, Q} and each mechanism Mk, we estimate
$$ \hat J(M_k,E)\;=\;\frac{1}{N}\sum_{r=1}^N g(\tau_r^{(k,E)}), \qquad \tau_r^{(k,E)}\sim\mathcal{L}(M_k,E,R), $$
and form the estimated gap Δ̂(E) := (M1, E) − (M2, E). This isolates the impact of distributional error in the generator from other confounds.

We focus on three misspecification families that arise naturally in ad auctions and that can be implemented cleanly in AuctionNet by editing the generative process for z.

Let ξ include a discrete time index h ∈ {1, …, 24}, and suppose the baseline is a mixture P0(z) = ∑hπhP0(z ∣ h). A time-of-day shift keeps conditionals fixed but changes mixture weights:
Qtod(z) = ∑hπ̃hP0(z ∣ h),   with ∥π̃ − π∥1 controlled.
Economically, this captures demand shocks, traffic seasonality, or measurement drift that reweights contexts without altering within-context behavior.

Let ξ include a content category c ∈ {1, …, C} (query class, vertical, or page type). We construct
Qcat(z) = ∑cρ̃cP0(z ∣ c),
again with controlled deviations of ρ̃ from the baseline category weights ρ. This misspecification is particularly relevant for policy evaluation because mechanisms often interact with category through reserve rules, quality weights, or heterogeneous bid landscapes.

Finally, we distort the upper tail of payoff-relevant primitives, since revenue and welfare can be disproportionately driven by rare but extreme opportunities. In AuctionNet we implement a monotone tail transform on one-dimensional components (e.g., values or market prices) while leaving the rest of z unchanged. For example, for a value component v ∈ [0, v̄], we use
$$ v'\;=\;T_\gamma(v)\;:=\;\bar v\left(\frac{v}{\bar v}\right)^\gamma, \qquad \gamma\in(0,1)\ \text{(heavier tail)}\ \text{or}\ \gamma>1\ \text{(lighter tail)}. $$
Alternatively, we use a Pareto-like mixture that inflates the top ε quantile. This family is intentionally adversarial: it targets precisely the region where reserve prices, floor rules, and pacing kinks generate nonlinear responses.
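
A minimal sketch of the power-law tail transform on a one-dimensional, bounded value component; the quantile check at the end just illustrates that γ < 1 shifts mass toward the top of the support.

```python
import numpy as np

def tail_transform(values, v_bar, gamma):
    """Monotone transform T_gamma(v) = v_bar * (v / v_bar)^gamma on [0, v_bar].
    gamma < 1 inflates the upper tail; gamma > 1 deflates it."""
    v = np.clip(np.asarray(values, dtype=float), 0.0, v_bar)
    return v_bar * (v / v_bar) ** gamma

rng = np.random.default_rng(1)
v = rng.uniform(0.0, 1.0, size=100_000)
print(np.quantile(v, 0.9), np.quantile(tail_transform(v, 1.0, 0.7), 0.9))
```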

To align the stress test with the formal guarantees, we measure distributional deviation using the same function class used for calibration. In practice we take a moment class ϕ(z) ∈ ℝK (time×category means, quantile summaries of y, win-rate curves, least-winning-cost proxies, and spend curves evaluated on a grid of multipliers), and set
dℱ(P, Q) = sup∥w∥1 ≤ 1|𝔼P[w⊤ϕ(z)] − 𝔼Q[w⊤ϕ(z)]| = ∥𝔼P[ϕ(z)] − 𝔼Q[ϕ(z)]∥∞,
as in Proposition 3. This choice has two advantages in AuctionNet: first, dℱ is easy to compute from simulated draws; second, it keeps the stress test anchored to payoff-relevant discrepancies rather than generic divergences that may be large but innocuous.

When we treat P0 as truth, we can compute the realized deviation δtrue := d(P0, Q) directly. We can also compute a quantity δcert based on the subset of moments that would plausibly be disclosed in a real calibration setting. The comparison δcert versus δtrue is informative: it tells us when limited disclosure systematically under-detects the misspecifications that matter for ranking.

Given a collection of perturbed simulators Q1, …, QL, we estimate an empirical Lipschitz constant by the worst-case slope of J with respect to the chosen IPM:
$$ \widehat L_{\mathrm{emp}} \;:=\; \max_{k\in\{1,2\}} \max_{\ell\in\{1,\dots,L\}} \frac{\left|\hat J(M_k,Q_\ell)-\hat J(M_k,P_0)\right|}{d_{\mathcal{F}}(P_0,Q_\ell)+\varepsilon}, $$
with a small ε > 0 to prevent numerical instability when perturbations are extremely mild. We emphasize what this statistic is (and is not): it is not a global modulus of continuity over all distributions, but rather an upper envelope over a structured neighborhood of misspecifications that we consider plausible. Precisely because strategic response can generate kinks, we find it useful to report L̂emp separately by mechanism and by operating regime (e.g., loose budgets versus near-binding budgets), since the pacing amplification term can vary by orders of magnitude.

Across the time-of-day and category-mixture shifts, mechanism rankings are typically stable unless the baseline gap is already small. Empirically, these shifts tend to move both mechanisms in the same direction, so Δ̂(Q) tracks Δ̂(P0) closely. In contrast, tail distortions are the primary driver of sign changes: inflating the upper tail of v (or the competitive price y) can differentially affect mechanisms with reserves or aggressive price discrimination, and it can induce large changes in learned pacing multipliers. This is exactly the channel highlighted by the 1/κ amplification effect in the pacing discussion: once the response map begins to adjust sharply, the indirect component dominates.

To connect directly to the theory, we evaluate the empirical margin
$$ \widehat{\mathrm{Margin}} \;:=\; |\hat\Delta(P_0)|-2\,\widehat L_{\mathrm{emp}}\,\delta_{\mathrm{cert}}, $$
and check whether it predicts stability. The qualitative pattern is consistent with the main message: when the margin is comfortably positive, observed misrankings are rare even at moderate N; when it is near zero or negative, sign flips become common and are often traced to a small set of tail events that the calibration moments did not constrain tightly.
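
The margin statistic itself is a one-line computation; a minimal sketch with the three inputs reported by the procedures above.

```python
def empirical_margin(gap_baseline, l_emp, delta_cert):
    """|Delta-hat(P0)| - 2 * L-hat_emp * delta_cert: positive values indicate the estimated
    gap survives the certified bias band, values near zero or negative flag fragility."""
    return abs(gap_baseline) - 2.0 * l_emp * delta_cert
```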

Two caveats emerge. First, the concentration part of the misranking bound (Hoeffding with |g(τ)| ≤ B) is generally conservative in AuctionNet, because episode outcomes average across many opportunities and exhibit much smaller variance than worst-case boundedness suggests. In practice, replacing the worst-case B by an empirical variance proxy yields far sharper error bars, though at the cost of additional assumptions.

Second, the bias term 2Lδ can be either conservative or optimistic depending on what δ measures. When δcert is computed from a thin moment set (e.g., only marginal means), it can dramatically understate tail distortions and joint shifts, making the bound falsely reassuring. Conversely, when we include a richer set of ``curve moments’’ (win-rate and spend curves on a grid), δcert becomes more faithful and the resulting 2L̂empδcert tends to be a conservative but informative buffer: it rarely under-predicts instability, and when it over-predicts, the overage is largely explained by local smoothness (the realized slope is smaller than the worst-case slope across perturbations).

These findings set up the practical question we address next: given disclosure and engineering constraints, which minimal set of calibration moments and stress tests should a platform (or regulator) require so that simulator-based policy claims come with a transparent, auditable robustness margin rather than an unqualified point estimate?


8. Practical recipe for regulators/platforms: minimal disclosure for calibration, uncertainty sets, and stress tests; reporting standards for simulator-based policy claims.

The preceding discussion suggests a simple lesson for practice: simulator-based mechanism evaluation is credible only insofar as it is paired with (i) a transparent calibration certificate and (ii) a transparent stability assessment of the mechanism–behavior–environment pipeline. We therefore propose a minimal recipe that a platform can implement (and a regulator can audit) without requiring disclosure of raw user-level logs or proprietary bidder code. The output of the recipe is not merely a point estimate Δ̂, but a robustness report: the estimated gap together with its calibration certificate, stability envelope, and implied ranking margin.

Any evaluation begins by fixing the set of candidate mechanisms {M1, M2}, the evaluation horizon (episode definition), and the performance functional g(τ). In our experience, ambiguity at this stage is the dominant source of non-reproducible claims. We recommend that platforms report the full definition of g (including constraint penalties, budget-violation definitions, and any trimming/winsorization), and report an a priori bound |g(τ)| ≤ B implied by the episode design. When g is a weighted composite (revenue, welfare, violations), the weights should be pre-committed or at least varied in a sensitivity table, since changing them can flip rankings even under perfect calibration.

A regulator typically cannot (and need not) observe the true primitive distribution P. What is needed is a disclosed set of payoff-relevant moments ϕ(z) ∈ ℝ^K and their empirical targets. The key is that the moment set must be rich enough to detect the misspecifications that matter for ranking, while being aggregated enough to protect privacy and proprietary information.

A minimal but operational disclosure package includes:
(i) cell frequencies over a coarse partition (time-of-day, coarse category, device) and conditional means of key primitives (e.g., mean predicted value proxy, mean exposure factor) within each cell;
(ii) quantiles of price-like objects (least-winning-cost proxies, second price, or realized CPC/CPM) by cell, with emphasis on upper quantiles;
(iii) curve moments evaluated on a public grid: win-rate as a function of bid multiplier, spend as a function of bid multiplier (or pacing multiplier), and (if applicable) reserve-hit rates as a function of reserve level;
(iv) budget and pacing summaries: distribution of budget-exhaustion times, and average spend share by period for each budget segment (a schematic rendering of the package is sketched below).
These are precisely the objects that allow a third party to compute a certificate in the IPM induced by $\mathcal{F} = \{\,w^\top \phi : \|w\|_1 \le 1\,\}$, without reconstructing auction-level logs.
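One way to organize the package is sketched below as an illustrative Python schema; the field names are ours rather than a proposed standard, and flattening into a single moment vector assumes that the real-side and simulator-side packages populate the same keys in the same order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class DisclosurePackage:
    """Aggregated calibration targets (illustrative schema, not a standard)."""
    # (i) cell frequencies and conditional means, keyed by (time_of_day, category, device)
    cell_freq: Dict[Tuple[str, str, str], float] = field(default_factory=dict)
    cell_means: Dict[Tuple[str, str, str], Dict[str, float]] = field(default_factory=dict)
    # (ii) per-cell quantiles of price-like objects, e.g. {0.5: ..., 0.9: ..., 0.99: ...}
    price_quantiles: Dict[Tuple[str, str, str], Dict[float, float]] = field(default_factory=dict)
    # (iii) curve moments on a public grid of bid multipliers
    bid_grid: List[float] = field(default_factory=list)
    win_rate_curve: List[float] = field(default_factory=list)
    spend_curve: List[float] = field(default_factory=list)
    # (iv) budget/pacing summaries
    exhaustion_time_hist: Dict[str, List[float]] = field(default_factory=dict)
    spend_share_by_period: Dict[str, List[float]] = field(default_factory=dict)

    def moment_vector(self) -> List[float]:
        """Flatten all disclosed targets into the phi(z) vector used for delta_cert."""
        vec = list(self.cell_freq.values())
        for means in self.cell_means.values():
            vec.extend(means.values())
        for quantiles in self.price_quantiles.values():
            vec.extend(quantiles.values())
        vec.extend(self.win_rate_curve)
        vec.extend(self.spend_curve)
        for hist in self.exhaustion_time_hist.values():
            vec.extend(hist)
        for shares in self.spend_share_by_period.values():
            vec.extend(shares)
        return vec
```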

Given disclosed targets, the platform fits a simulator and reports a certificate
$$ \delta_{\mathrm{cert}} \;:=\; \left\|\widehat{\mathbb{E}}_{P}[\phi(z)]-\mathbb{E}_{\hat P}[\phi(z)]\right\|_\infty, $$
where $\widehat{\mathbb{E}}_{P}$ denotes the empirical moment estimates produced from the real system under the disclosure policy. The regulator should require (a) the complete list of moments, (b) the sample sizes underlying each disclosed moment (or confidence bands), and (c) the calibration procedure used to obtain $\hat P$.
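Given both ordered moment vectors, the certificate itself is a one-line computation; in a real audit the components would typically be standardized so that the sup-norm is comparable across heterogeneous moments (the values below are placeholders).

```python
import numpy as np

def delta_cert(real_moments: np.ndarray, sim_moments: np.ndarray) -> float:
    """Sup-norm discrepancy over the disclosed (standardized) moment vector."""
    return float(np.max(np.abs(real_moments - sim_moments)))

# Hypothetical, standardized moment vectors with identical ordering on both sides.
phi_real = np.array([0.31, 0.22, 0.47, 0.080, 0.012])
phi_sim  = np.array([0.30, 0.24, 0.45, 0.095, 0.010])
print("delta_cert =", delta_cert(phi_real, phi_sim))  # 0.02
```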

Operationally, δcert defines an uncertainty set around the simulator:
$$ \mathcal{U}(\delta_{\mathrm{cert}}) \;:=\; \left\{\,Q \;:\; d_{\mathcal{F}}(Q,\hat P) \le \delta_{\mathrm{cert}}\,\right\}. $$
Even if one ultimately reports a single ranking, the uncertainty set is the correct object for governance: it makes explicit which environments the calibrated simulator cannot rule out, given the limited disclosure moments.

Because the mapping R(M, P) can amplify small distributional shifts, regulators should not accept ``calibrated generator’’ as sufficient. At minimum, the platform should expose a test harness specifying: the bidder-policy suite (learning, pacing, or equilibrium solver), the stationarity criterion, and the initialization distribution. This does not require revealing proprietary bidder code; rather, it requires the platform to commit to a behavioral model class and to report diagnostic plots (e.g., convergence of pacing multipliers, stability of spend shares). The core requirement is reproducibility of the mapping $(M,\hat P)\mapsto \hat\pi$.

Calibration on a finite moment set cannot exclude adversarial errors. We therefore recommend a standard stress-test suite in which the platform perturbs $\hat P$ along economically meaningful directions that have historically caused ranking reversals: mixture reweighting (time-of-day, category), tail inflation/deflation of value or price components, and joint shifts that preserve marginal moments but alter dependence (e.g., stronger correlation between high values and high competition).

For each perturbation $Q_\ell$, the platform reports (a) the realized distance $d_{\mathcal{F}}(\hat P, Q_\ell)$ under the disclosed moment class, and (b) the change in each mechanism’s performance. This supports an empirical stability envelope
$$ \widehat L_{\mathrm{emp}} \;=\; \max_{k\in\{1,2\}} \max_{\ell} \frac{\left|\hat J(M_k,Q_\ell)-\hat J(M_k,\hat P)\right|}{d_{\mathcal{F}}(\hat P,Q_\ell)+\varepsilon}, $$
which is interpretable as a ``worst-case slope’’ within the tested neighborhood. A regulator can treat the stress-test suite as a falsification device: if rankings flip under perturbations with small disclosed distance, then either the disclosed moment set ϕ is missing key moments (certificate too weak) or the mechanism–response map is locally unstable (large L), and either way simulator claims require qualification.
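A minimal sketch of the envelope computation from a stress-test table (the table entries and the ε regularizer are illustrative assumptions):

```python
def empirical_lipschitz(J_hat, d_F, eps=1e-6):
    """
    Worst-case slope over tested perturbations.
    J_hat[k][l] : performance of mechanism k under perturbation l (index 0 = P_hat itself)
    d_F[l]      : disclosed-moment distance d_F(P_hat, Q_l), with d_F[0] = 0
    """
    slopes = []
    for k in range(len(J_hat)):
        for l in range(1, len(d_F)):
            slopes.append(abs(J_hat[k][l] - J_hat[k][0]) / (d_F[l] + eps))
    return max(slopes)

# Hypothetical stress-test table: two mechanisms, three perturbations.
J_hat = [
    [0.62, 0.60, 0.59, 0.53],   # M1 under P_hat, Q_1, Q_2, Q_3
    [0.52, 0.53, 0.50, 0.56],   # M2
]
d_F = [0.0, 0.02, 0.03, 0.05]
print("L_emp =", empirical_lipschitz(J_hat, d_F))  # ~1.8
```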

Platforms should report Monte Carlo uncertainty separately from calibration/robustness uncertainty. Under the boundedness condition |g(τ)| ≤ B, a simple auditable half-width is
$$ \varepsilon_{\mathrm{MC}}(\alpha) \;=\; B\sqrt{\frac{\log(2/\alpha)}{2N}}, $$
applied to each $\hat J(M_k,\hat P)$ (and combined appropriately for gaps). When a platform uses variance-reduction or common random numbers, it should report both the method and the effective variance reduction achieved, since this affects the credibility of small claimed gaps.

The key reporting object is an explicit margin that subtracts both simulator bias and sampling error. One convenient conservative form is
$$ \widehat{\mathrm{Margin}} \;:=\; \left|\hat\Delta(\hat P)\right| \;-\; 2\,\widehat L_{\mathrm{emp}}\,\delta_{\mathrm{cert}} \;-\; \varepsilon_{\mathrm{MC}}^{\Delta}(\alpha), $$
where εMCΔ is an auditable bound for the gap estimator. A platform can then make a tiered claim:
(i) if $\widehat{\mathrm{Margin}}>0$, report ``$M_1$ dominates $M_2$ with calibrated-and-stress-tested margin at level $\alpha$’’; (ii) if $\widehat{\mathrm{Margin}}\approx 0$, report ``inconclusive; ranking sensitive to plausible misspecification’’;
(iii) if $\widehat{\mathrm{Margin}}<0$, report ``simulator does not support a ranking claim under disclosed calibration.’’
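A sketch of the resulting decision rule, assuming the simple Hoeffding-style half-width above and a conservative sum-of-half-widths combination for the gap; all constants are hypothetical.

```python
import numpy as np

def ranking_tier(delta_hat, L_emp, delta_cert, B, N, alpha=0.05, tol=1e-3):
    """Compute the reported margin and map it to one of the three claim tiers."""
    # Simple conservative half-width for the gap: sum of the two per-mechanism
    # Hoeffding half-widths (ignores the union-bound adjustment to alpha).
    eps_mc_gap = 2.0 * B * np.sqrt(np.log(2.0 / alpha) / (2.0 * N))
    margin = abs(delta_hat) - 2.0 * L_emp * delta_cert - eps_mc_gap
    winner = "M1" if delta_hat > 0 else "M2"
    if margin > tol:
        tier = f"{winner} dominates with calibrated-and-stress-tested margin"
    elif margin < -tol:
        tier = "simulator does not support a ranking claim under disclosed calibration"
    else:
        tier = "inconclusive; ranking sensitive to plausible misspecification"
    return margin, tier

print(ranking_tier(delta_hat=0.10, L_emp=1.8, delta_cert=0.01, B=1.0, N=200_000))
```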

This style of reporting is useful precisely because it is auditable: it converts engineering artifacts (moment discrepancies, stress tests, sample sizes) into a single decision-relevant quantity.

We recommend that any simulator-based policy claim be accompanied by a short ``model card’’ containing: (i) the full list of calibration moments and their empirical values; (ii) δcert and how it was computed; (iii) the bidder-response harness and convergence diagnostics; (iv) the stress-test perturbations and their realized distances; (v) $\widehat L_{\mathrm{emp}}$ (possibly by regime, e.g., tight vs. loose budgets); (vi) $\hat J(M_k,\hat P)$, $\hat\Delta(\hat P)$, N, and Monte Carlo uncertainty; (vii) the final margin and the resulting ranking tier; and (viii) a statement of known blind spots (features not captured by ϕ, objectives not included in g). The purpose is not to force a single methodology, but to ensure that two independent auditors can reproduce the reported margin from the disclosed artifacts.
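A compact rendering of such a model card as a structured record; every entry below is hypothetical and included only to fix the shape of the artifact.

```python
model_card = {
    "calibration_moments": {"list": "phi_1 ... phi_K", "values": "disclosed targets"},
    "delta_cert": 0.01,
    "response_harness": {"policy_suite": "pacing + learning", "diagnostics": "convergence plots"},
    "stress_tests": [
        {"name": "tail_inflation", "d_F": 0.05, "delta_J": {"M1": -0.09, "M2": +0.04}},
    ],
    "L_emp": {"overall": 1.8, "tight_budgets": 2.4, "loose_budgets": 1.1},
    "point_estimates": {"J_M1": 0.62, "J_M2": 0.52, "Delta_hat": 0.10, "N": 200_000},
    "monte_carlo": {"alpha": 0.05, "eps_mc_gap": 0.006},
    "margin": 0.058,
    "tier": "M1 dominates with calibrated-and-stress-tested margin",
    "blind_spots": ["dependence moments not disclosed", "fairness objectives not in g"],
}
```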

The recipe is intentionally modular: calibration moments can be disclosed in aggregated form; stress tests can be run entirely within the simulator; and bidder response can be standardized via a harness rather than by revealing bidder internals. Most importantly, the recipe shifts the governance burden from ``trust our simulator’’ to ``verify our certificate and our stability diagnostics,’’ which is a more realistic standard for complex platforms.

These prescriptions also clarify what we should study next. Even with careful stress testing, two difficulties remain: bidder response may fail to converge or may be set-valued, and many policy objectives (fairness, privacy, market power) are not well summarized by the standard moment classes used for revenue/welfare calibration. Addressing these limitations requires extending the certification logic beyond the single-valued, well-behaved response setting.


9. Limitations and extensions: multi-agent non-convergence, set-valued bidder responses, richer objectives (fairness, privacy); how to strengthen certificates with targeted moments.

Our guarantees were deliberately phrased in terms of a response map R(M, P) that is single-valued and ``stable enough’’ to make J(M, P) well defined and Lipschitz in the primitive distribution. This is the right abstraction for many engineering pipelines (fixed pacing rules, a specified learning algorithm run to convergence), but it is also the point at which real multi-agent systems most often violate the assumptions. In large ad markets, bidder algorithms co-adapt, platform policies evolve, and the resulting dynamics need not converge to a unique stationary point. In such environments, a simulator can be calibrated on disclosed moments and still mislead, because the dominant error is not distributional misspecification but the behavior of the response dynamics themselves.

A common failure mode is cycling or slow drift in multi-agent learning: for a fixed $(M,\hat P)$, the iterates of a learning rule do not settle to a stationary policy profile, or they converge on a time scale comparable to the episode horizon. When that occurs, $J(M,\hat P)$ depends on initialization, update schedules, and the stopping rule of the harness. From a governance perspective, this is not a technicality: if two mechanisms induce different degrees of non-stationarity, then the ranking can flip depending on whether we measure ``early training’’ outcomes or long-run outcomes.

One extension is to make the episode explicitly dynamic and interpret J as a time average along the learning trajectory. Let π(s) denote the policy profile at learning iteration s, and let τ(s) denote the trajectory induced at that iteration. We can evaluate
$$ J^{\mathrm{avg}}(M,P) \;:=\; \limsup_{S\to\infty}\frac{1}{S}\sum_{s=1}^S \mathbb{E}\!\left[g(\tau^{(s)})\right], $$
or, more realistically, its finite-S analogue reported together with S. This shifts the burden from equilibrium existence to an ergodic-type requirement: the learning process should admit a stationary distribution (possibly over policies) and mix sufficiently fast. A regulator can then require the platform to report mixing diagnostics (autocorrelations of spend shares, stability of pacing multipliers) and to run the harness long enough that the residual non-stationarity is small relative to the ranking margin. Formally, the continuity and concentration terms remain, but one must add a third error term capturing residual non-stationarity:
$$ \left| \hat\Delta - \Delta^* \right| \;\le\; \underbrace{\left|\hat\Delta-\Delta^{\mathrm{sim}}\right|}_{\text{Monte Carlo}} \;+\; \underbrace{\left|\Delta^{\mathrm{sim}}-\Delta^*\right|}_{\text{calibration}} \;+\; \underbrace{\mathrm{Err}_{\mathrm{dyn}}}_{\text{non-stationarity}}, $$
where Errdyn can be bounded in favorable cases by a mixing-time argument (and otherwise must be reported empirically).
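A minimal sketch of the finite-S analogue together with a crude non-stationarity proxy; the drifting outcome process below is simulated for illustration, whereas a real harness would diagnose its own learning iterates.

```python
import numpy as np

def running_average_J(outcomes):
    """Finite-S analogue of J^avg: mean of E[g(tau^(s))] over learning iterations s."""
    return float(np.mean(outcomes))

def nonstationarity_proxy(outcomes):
    """Crude Err_dyn proxy: gap between first-half and second-half averages."""
    S = len(outcomes)
    return abs(float(np.mean(outcomes[: S // 2])) - float(np.mean(outcomes[S // 2 :])))

# Hypothetical per-iteration outcomes for a slowly drifting learning process.
rng = np.random.default_rng(1)
s = np.arange(5000)
outcomes = 0.55 + 0.05 * np.exp(-s / 1500) + rng.normal(0, 0.01, size=s.size)
print("J_avg (finite S)   =", running_average_J(outcomes))
print("Err_dyn proxy      =", nonstationarity_proxy(outcomes))
```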

Even if an equilibrium exists, it may not be unique. In that case the appropriate response object is a correspondence ℛ(M, P) ⊆ Π, not a function. The performance J(M, P; π) depends on which π ∈ ℛ(M, P) is selected by the market, by the learning dynamics, or by the platform’s own ``default’’ bidder model. This creates an additional wedge between what a simulator shows and what a real market might realize: two simulators calibrated to the same disclosed moments can disagree simply because they implement different equilibrium selections.

A conservative way to restore auditable claims is to move from point evaluation to envelope evaluation. Define
$$ J^{\sup}(M,P) \;:=\; \sup_{\pi \in \mathcal{R}(M,P)} J(M,P;\pi), \qquad J^{\inf}(M,P) \;:=\; \inf_{\pi \in \mathcal{R}(M,P)} J(M,P;\pi). $$
A robust ranking claim of ``M1 dominates M2’’ then requires
$$ J^{\inf}(M_1,P) \;>\; J^{\sup}(M_2,P), $$
which is stronger than comparing point predictions but matches the governance problem: we want dominance that survives plausible equilibrium selection. Under additional regularity (for instance, if the graph of ℛ(M, ⋅) is upper hemicontinuous and J(M, P; π) is continuous in P uniformly over π), the envelope functionals can inherit a Lipschitz-type bound. In practice, rather than proving these properties, the platform can approximate envelope uncertainty by running multiple initializations, update schedules, or equilibrium solvers and reporting the induced range of outcomes as an equilibrium-selection error bar. The margin used for ranking should then subtract this bar in the same way we subtract 2Lδ and Monte Carlo error.
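A sketch of the envelope check over whatever selections are actually explored (seeds, solvers, initializations); the outcome values are placeholders.

```python
def envelope_values(J_by_selection):
    """Return (J_sup, J_inf) over the equilibrium/seed selections actually explored."""
    return max(J_by_selection), min(J_by_selection)

def robust_dominance(J_M1_by_selection, J_M2_by_selection):
    """M1 robustly dominates M2 if its worst selection beats M2's best selection."""
    _, inf_1 = envelope_values(J_M1_by_selection)
    sup_2, _ = envelope_values(J_M2_by_selection)
    return inf_1 > sup_2, inf_1 - sup_2

# Hypothetical outcomes across seeds / solvers / initializations.
J_M1 = [0.61, 0.63, 0.60, 0.62]
J_M2 = [0.52, 0.55, 0.57, 0.54]
print(robust_dominance(J_M1, J_M2))  # (True, 0.03)
```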

A second limitation is that many policy objectives are not well summarized by the calibration moments used for revenue and welfare. Fairness metrics depend on protected attributes and often involve ratios (e.g., exposure parity or conditional conversion rates), which can be unstable when denominators are small. Privacy objectives (e.g., limiting the predictability of user attributes from outcomes, or meeting differential-privacy constraints in reporting) are even farther from standard auction moments. Market power and entry are dynamic and strategic at a horizon beyond the repeated-auction episode.

Our framework can accommodate such objectives in principle by redefining g(τ) and enriching ϕ(z), but the analyst must acknowledge two practical consequences. First, the boundedness and Lipschitz conditions become harder to justify: ratio metrics can have unbounded local sensitivity, and privacy metrics can depend on fine-grained dependence structures not captured by coarse moment classes. Second, disclosure constraints become tighter: group-conditional moments and privacy-relevant dependence moments may themselves be sensitive. A realistic governance approach therefore requires regularized objectives (e.g., clipped ratios, minimum-support constraints, or convex surrogates) and privacy-preserving disclosure (e.g., noisy aggregated moments with known error bars that propagate into δcert).
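As one concrete instance of such regularization, here is a minimal sketch of a clipped, minimum-support exposure-ratio metric; the thresholds and the binary exposure proxy are assumptions, not a recommended standard.

```python
import numpy as np

def clipped_exposure_parity(exposure_a, exposure_b, min_support=100, clip=5.0):
    """
    Regularized exposure-ratio metric: returns None below minimum support,
    and clips the ratio so that local sensitivity stays bounded.
    """
    if len(exposure_a) < min_support or len(exposure_b) < min_support:
        return None  # insufficient support: do not report the metric
    ratio = (np.mean(exposure_a) + 1e-8) / (np.mean(exposure_b) + 1e-8)
    return float(np.clip(ratio, 1.0 / clip, clip))

rng = np.random.default_rng(2)
print(clipped_exposure_parity(rng.binomial(1, 0.30, 500), rng.binomial(1, 0.25, 500)))
```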

Moment matching is only as good as the moments. A natural extension is to choose ϕ not as a ``generic summary of the market,’’ but as a targeted set designed to control the specific ranking functional. One way to formalize this is to view $J(M,P)$ as a functional of $P$ and ask for moments that approximate its local sensitivity. In semiparametric language, if $J$ admits an influence-function representation, then matching moments aligned with that influence function yields the tightest control of $J$ per unit of disclosed information. Concretely, this suggests adding moments such as: (i) high-quantile and tail-shape summaries of value and price proxies (since ranking reversals often come from tails); (ii) dependence moments that couple value proxies with competition proxies (to detect correlation shifts that preserve marginals); and (iii) mechanism-specific ``response moments’’ such as gradients of spend and win-rate with respect to bid multipliers, evaluated in the regimes where pacing binds.

An operational implementation is adversarial: given a candidate simulator $\hat P$, search over a rich class of witness functions $\mathcal{F}_{\mathrm{rich}}$ (e.g., a neural net discriminator constrained to be Lipschitz, or a large dictionary of economically motivated basis functions) to find the $f \in \mathcal{F}_{\mathrm{rich}}$ that maximizes
$$ \left|\widehat{\mathbb{E}}_{P}[f(z)]-\mathbb{E}_{\hat P}[f(z)]\right|. $$
The resulting f can be added to the disclosed moment set for the next audit cycle. This ``moment boosting’’ loop has a clear interpretation: the certificate improves exactly along directions where the simulator is currently least credible, and it does so without requiring raw-log disclosure.
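A minimal sketch of one such loop with a fixed dictionary of witness functions; the dictionary, the one-dimensional value proxy, and the lognormal samples are all illustrative, and in practice the real-side expectations are computed internally by the platform so that only the resulting moment values need to be disclosed.

```python
import numpy as np

def worst_witness(z_real, z_sim, witnesses):
    """
    Adversarial moment search over a fixed dictionary of witness functions:
    return the witness with the largest real-vs-simulated moment discrepancy.
    """
    gaps = {name: abs(float(np.mean(f(z_real))) - float(np.mean(f(z_sim))))
            for name, f in witnesses.items()}
    name = max(gaps, key=gaps.get)
    return name, gaps[name]

# Dictionary mixing means, a fixed-threshold tail exceedance, and a concave transform.
witness_dict = {
    "mean":       lambda z: z,
    "second_mom": lambda z: z ** 2,
    "tail_gt_5":  lambda z: (z > 5.0).astype(float),
    "log1p":      lambda z: np.log1p(z),
}
rng = np.random.default_rng(3)
z_real = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
z_sim  = rng.lognormal(mean=0.0, sigma=0.8, size=100_000)  # simulator with a thinner tail
print(worst_witness(z_real, z_sim, witness_dict))  # the moment to add at the next audit cycle
```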

A final limitation is subtle but important: the primitive distribution P is often observed only through the lens of the deployed mechanism. Auction logs observed under a baseline mechanism M0 reflect selection effects (which opportunities appear, which users are shown ads, which conversions are attributable), so the moments used for calibration may not identify the counterfactual P relevant under M1 or M2. Our recipe implicitly treated P as an environment distribution invariant to the mechanism, which is a good approximation in some settings (e.g., short-run counterfactuals holding demand fixed) but not in others (e.g., reserve changes that alter participation, or disclosure changes that alter bidder learning).

Two remedies are available. The first is experimental: require small randomized perturbations (to reserves, ranking weights, or disclosure) specifically to identify the moments most predictive of counterfactual shifts. The second is robust: expand the uncertainty set to include plausible policy-induced reweightings and participation responses, and report margins that remain valid over that larger set. Both approaches reduce the risk that we ``certify the wrong world.’’

The overarching message is that simulator governance must treat behavior and objectives as first-class citizens alongside calibration. When response is set-valued or non-convergent, the right output is not a single ranking but an envelope-style ranking reflecting equilibrium selection and dynamics. When objectives expand to fairness or privacy, the right certificate is not generic but targeted to the decision at hand, with regularized metrics and privacy-preserving disclosure. These extensions do not eliminate uncertainty; they make it legible, and therefore governable.