Modern ad platforms routinely contemplate changes to their auction rules: reserve prices, ranking formulas, payment rules, disclosure policies, and budget controls. Each change is, in effect, a policy intervention in a strategic market that operates at enormous scale and high frequency. The practical difficulty is that the object the platform would like to optimize—revenue, welfare, advertiser surplus, or some weighted combination subject to budget and quality constraints—is typically observed only for the mechanism currently deployed. Running online experiments for every candidate change can be costly, slow, and sometimes infeasible when the change affects learning dynamics, contracts, or market thickness. These constraints motivate the pervasive use of synthetic simulators: estimated environments in which we can ``replay’’ counterfactual auctions, forecast outcomes, and rank alternative mechanisms before deployment.
At an intuitive level, simulation promises a controlled laboratory. We fit a generative model of auction opportunities from historical data, and we embed within it a behavioral model of how bidders will respond when the rules change. We then compute the platform’s objective under each candidate mechanism and select the best. This workflow is attractive because it converts an online policy problem into an offline statistical one: if the simulator is accurate, then policy evaluation becomes cheap, repeatable, and safe. Moreover, simulators are already intertwined with engineering practice: conversion models, click models, and budget pacing models are all, in some sense, components of a synthetic environment. The outstanding question is not whether simulation will be used, but rather when and why its recommendations can be trusted.
The core obstacle is that auctions are strategic and endogenous. A mechanism change does not merely alter the mapping from bids to allocations and payments; it alters the bids themselves. When advertisers are budget constrained, optimizing across many opportunities, and responding to feedback, small differences in the environment can lead to large differences in equilibrium or steady-state behavior. Even when advertisers follow relatively simple pacing rules, these rules can contain thresholds that amplify small distributional shifts. In this setting, a simulator can fail in a particularly damaging way: it can misrank two mechanisms. A mechanism that appears superior in simulation may perform worse in the real environment once bidders respond, and vice versa. Because mechanism selection is a discrete decision, even modest evaluation errors can produce large policy regret.
Two distinct sources of error interact to create this risk. First, there is environment misspecification: the real stream of auction opportunities is generated by an unknown distribution, while the simulator uses an estimated distribution. The misspecification need not be dramatic to matter. If the simulator slightly misstates the frequency of high-value opportunities, the tail of cost signals, or exposure factors associated with user contexts, then the induced allocation and payments can shift. Second, there is behavioral misspecification: the simulator must encode how advertisers respond to the mechanism and to the opportunity stream. Even if we posit a reasonable behavioral model, its predictions can be fragile when strategic incentives are steep. Importantly, these two errors compound: the effect of getting the environment wrong is mediated through strategic response, so the bias in simulated performance is not merely a direct statistical bias but can be strategically amplified.
Our approach is to make this amplification explicit and to provide a minimal set of objects that can be audited and communicated. Rather than treating the simulator as an opaque black box, we ask for two ingredients that are natural in practice. The first is a calibration certificate that quantifies how close the simulator’s distribution is to the real environment, measured only through a class of payoff-relevant tests (for example, matched moments computed from disclosure-limited logs). The second is a stability condition that bounds how sensitive the policy objective is to changes in the environment distribution once bidder response is accounted for. Together, these two ingredients yield an interpretable guarantee: if the simulator matches the real environment up to tolerance and the induced outcomes are not too sensitive, then simulated rankings are reliable, and the probability of misranking decays rapidly as we increase the number of simulated episodes.
This emphasis on ranking reflects how simulation is often used. In many engineering contexts, the platform does not require a perfectly calibrated estimate of revenue or welfare in absolute terms; it needs to decide between a small number of candidate mechanisms. Ranking is also the object that is most directly threatened by bias: a constant shift in all performance estimates is harmless, while a sign error in the difference between two mechanisms is costly. Our analysis therefore focuses on the performance gap between mechanisms and on conditions under which the simulator preserves the sign of that gap. This framing also clarifies an operational tradeoff. Calibration effort, measured by how small a distance certificate can be achieved, and computational effort, measured by how many simulation runs are needed, substitute for one another only up to a point: simulation noise can be driven down by more runs, but systematic simulator bias cannot.
Conceptually, we organize simulator validity around a distance between distributions defined by a function class of tests. This choice is deliberate. Auction environments are high dimensional: values, costs, exposure factors, and context all interact with the mechanism. Insisting on a strong notion of distributional closeness (say, total variation over the full primitive vector) is typically unrealistic and unnecessary. What matters for policy is closeness in directions that affect payoffs once bidders respond. In many applications, these directions can be approximated by a collection of moments or summary curves (such as value distributions by segment, predicted conversion rates, or least-winning-cost statistics). This motivates using an integral probability metric tied to a chosen family of test functions, which can be matched from available data and can be reported as a single tolerance level.
The principal contribution is to connect these pieces in a way that yields a bound with clear managerial interpretation. The bound decomposes misranking risk into (i) a bias term driven by simulator distance multiplied by a sensitivity constant capturing strategic amplification, and (ii) a variance term driven by Monte Carlo error that shrinks with the number of simulated episodes. The resulting guarantee highlights a simple ``margin’’ criterion: correct ranking is easy when the true performance gap is large relative to the worst-case simulator bias, and it is fundamentally difficult when the gap is small or when strategic amplification is large. This is not merely a pessimistic message; it provides a diagnostic. When a simulator recommends switching mechanisms, we can ask whether the estimated improvement is robust to plausible misspecification as quantified by the calibration certificate, and whether the behavioral model implies stability or knife-edge sensitivity.
We also acknowledge what our framework does not resolve. A stability constant is, in general, not a primitive; it depends on the mechanism, the environment, and the behavioral response mapping, and it may be large in markets with tight budgets, binding constraints, or discontinuous bidding rules. Likewise, calibration is only as good as the disclosed data and the chosen moment set; a simulator can match a battery of tests and still be wrong in untested directions that matter under a new mechanism. Finally, real platforms face nonstationarity, delayed feedback, and cross-market spillovers that are difficult to capture in an episodic simulator. Our goal is therefore not to claim universal validity, but to provide a disciplined language for expressing what a simulator has been calibrated to match and what kinds of counterfactual claims it can support.
The remainder of the paper formalizes these ideas. We first define the repeated auction environment, the mechanisms under comparison, the bidder-response map, and the policy metrics used for evaluation, culminating in a precise notion of ``ranking mechanisms correctly.’’ We then state continuity and finite-sample results that turn a calibration certificate and a stability condition into a quantitative misranking bound, and we discuss how moment matching can be interpreted as a certificate in an appropriate metric. Throughout, we aim to keep the connection to practice explicit: which statistics can be calibrated from logs, which sensitivity quantities can be stress-tested in simulation, and how to report uncertainty about mechanism rankings in a way that is actionable for platform policy.
We describe a minimal repeated-auction environment in which a platform compares two candidate mechanisms using a synthetic simulator. The purpose of the setup is not to model every engineering detail of ad delivery, but to isolate the objects that enter counterfactual evaluation: (i) an opportunity process that generates per-auction primitives, (ii) a mechanism that maps bids into allocations, payments, and disclosures, (iii) a behavioral response that maps the mechanism and environment into a bidder policy profile, and (iv) a policy metric that the platform uses to rank mechanisms.
Time is indexed by decision steps t ∈ {1, …, T}, where each
step contains a batch of auction opportunities j ∈ {1, …, mt}
(in many applications mt is large and
can be treated as constant). Each opportunity is summarized by a
primitive random vector
zjt = (vjt, cjt, ejt, ξjt),
drawn from an unknown real distribution P. Here vjt
denotes payoff-relevant values (possibly advertiser-specific, with vijt
embedded in vjt),
cjt
denotes cost-relevant signals or features that enter the auction’s
pricing or quality adjustments, ejt
denotes exposure factors (slot multipliers, viewability, or other
scaling of value and payments), and ξjt
denotes auxiliary context (query, user, placement, and any disclosed
features). We allow zjt to
be high-dimensional; the point of the analysis is that the simulator
need only be accurate in payoff-relevant directions. The platform has access to a calibrated
simulator distribution P̂ over
the same primitive space, fitted from disclosure-limited data; the
simulator can be sampled to generate synthetic opportunity streams {zjt}.
A mechanism M specifies how bids are converted into outcomes at each opportunity, and what information is revealed to bidders. Concretely, for each opportunity (j, t) and bid profile (b1jt, …, bnjt), the mechanism produces (i) an allocation vector (x1jt, …, xnjt) with feasibility constraints (e.g. ∑ixijt ≤ L for L slots), (ii) a slot assignment ℓijt or, more generally, an exposure factor applied to each allocated advertiser, and (iii) payments pijt = PayM(b1jt, …, bnjt; ξjt). We use M ∈ {M1, M2} for the two candidate designs to be compared, but nothing in the definitions depends on there being only two mechanisms. Importantly for practice, M also includes a disclosure policy—what auction logs, feedback, and signals are made available to advertisers—because bidder learning and pacing depend on what they observe. We abstract disclosure into the behavioral response mapping below, but conceptually it is part of the mechanism.
Advertisers are indexed by i ∈ {1, …, n}. Each
advertiser maintains an internal state sit at
step t capturing remaining
budget, pacing multipliers, learned parameters, and any history required
by its strategy. On opportunity (j, t), advertiser i submits a bid
bijt = πi(sit, zjt),
where πi
is a policy chosen from a class Πi (for example,
equilibrium bidding rules under a particular model, convex pacing
policies, or a learning algorithm run to stationarity). A policy profile
is π = (π1, …, πn) ∈ Π := Π1 × ⋯ × Πn.
We permit budget constraints and related delivery constraints through
the dynamics of sit and
feasibility requirements on total spend; for instance, a representative
constraint is
$$
\sum_{t=1}^T \sum_{j=1}^{m_t} e_{ijt} x_{ijt} p_{ijt} \leq B_i,
$$
with possible extensions such as CPA constraints. Rather than fixing a
particular behavioral model, we summarize bidder adjustment to the
environment and mechanism via a response map
R: (M, P) ↦ π,
which selects the policy profile implemented in the long run when the
true opportunity distribution is P and the mechanism is M. In an equilibrium interpretation,
R(M, P)
selects an equilibrium in Π;
in a learning interpretation, R(M, P) selects
the stationary policies induced by a specified learning-and-pacing
stack. This abstraction is deliberate: it isolates the locus of
strategic amplification (small changes in P or M can change R(M, P) sharply),
while keeping the evaluation object well-defined.
An episode induces a trajectory τ collecting all realized
primitives, allocations, payments, exposures, and state updates across
(t, j). From the
platform’s perspective, policy evaluation reduces the trajectory to a
scalar objective g(τ), assumed bounded as
|g(τ)| ≤ B.
This boundedness is natural for finite-horizon episodes with capped
budgets and limited exposures, and it enables concentration arguments
for Monte Carlo simulation. The population performance of mechanism
M in environment P is
J(M, P) := 𝔼τ ∼ ℒ(M, P, R)[g(τ)],
where ℒ(M, P, R) denotes
the induced distribution over trajectories when opportunities are drawn
from P, the mechanism is M, and bidders play π = R(M, P).
The choice of g can represent
revenue, welfare, constraint violations, or any composite objective. A
canonical example is
g(τ) = λ Rev(τ) + (1 − λ) SW(τ) − κ Viol(τ),
with λ ∈ [0, 1] and κ ≥ 0, but the analysis treats g generically.
Given a calibrated simulator P̂, the platform computes counterfactual estimates by simulating N i.i.d. episodes for each mechanism under P̂. For each M, the simulator must also specify bidder behavior; under our abstraction this is π = R(M, P̂), though one can also interpret R as selecting a fixed suite of bidder policies evaluated consistently across mechanisms. Let Ĵ(M, P̂) denote the sample average of g(τ) across the N simulated episodes. The platform’s recommendation is then the mechanism with the larger estimated performance, i.e. it compares Ĵ(M1, P̂) and Ĵ(M2, P̂). This workflow matches common practice: the simulator plays the role of an offline ``testbed’’ in which each mechanism is run repeatedly, producing an empirical distribution of outcomes.
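As a point of reference, the following is a minimal sketch of this offline evaluation loop in Python; the interfaces (`simulator.run_episode`, `response_map`, `metric_g`) are hypothetical stand-ins for the platform’s own components, not objects defined in this paper.

```python
import numpy as np

def estimate_performance(mechanism, simulator, response_map, metric_g,
                         n_episodes, seed=0):
    """Monte Carlo estimate of J(M, P_hat): the average of g(tau) over
    n_episodes simulated episodes, holding the simulator fixed."""
    rng = np.random.default_rng(seed)
    policy_profile = response_map(mechanism, simulator)  # pi = R(M, P_hat)
    outcomes = []
    for _ in range(n_episodes):
        trajectory = simulator.run_episode(mechanism, policy_profile, rng)
        outcomes.append(metric_g(trajectory))
    return float(np.mean(outcomes))

# The platform's recommendation is the mechanism with the larger estimate:
# j1 = estimate_performance(M1, sim, R, g, n_episodes=10_000)
# j2 = estimate_performance(M2, sim, R, g, n_episodes=10_000)
# recommended = "M1" if j1 >= j2 else "M2"
```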
Since the platform ultimately selects between mechanisms, the central
object is the true performance gap
Δ* := J(M1, P) − J(M2, P),
and its simulated analogue Δ̂ := Ĵ(M1, P̂) − Ĵ(M2, P̂).
We say the simulator ranks the mechanisms correctly if sign(Δ̂) = sign(Δ*),
with any tie-breaking rule treated as either correct when Δ* = 0 or counted as an
error when Δ* ≠ 0
(our bounds will be stated in a way that is robust to ties). This
definition emphasizes an operational point: absolute calibration of
J(M, P) is
secondary to preserving the ordering induced by Δ*. A uniform additive
error that shifts both Ĵ(M1, P̂)
and Ĵ(M2, P̂)
is largely irrelevant for selection; what matters is the accuracy of
their difference.
The setup above is intentionally agnostic about the internal form of P, the dimensionality of z, and the details of R. This generality is a strength for interpretation but creates a problem for theory: without additional structure, J(M, P) can be arbitrarily sensitive to small changes in P once R(M, P) is allowed to shift discontinuously. Our main results therefore hinge on two auditable summaries: a calibration certificate quantifying how close P̂ is to P in payoff-relevant directions, and a stability (Lipschitz) condition bounding the induced sensitivity of J to such deviations. To make these conditions concrete, the next section introduces a clean large-market repeated-auction model with batched opportunities, budgets (and optional CPA constraints), and multi-slot exposure, clarifying which ingredients are required for the ranking guarantees and which are abstracted into the response map R and the metric used for calibration.
We now introduce a ``clean’’ repeated-auction model that will serve two roles. First, it makes concrete what an episode-level trajectory τ contains when there are many auctions per decision step, multiple slots with heterogeneous exposure, and hard delivery constraints such as budgets and (optional) CPA targets. Second, it clarifies what structure is used by the ranking guarantees: the theory will not require an explicit closed-form equilibrium, but it will require (i) bounded episode outcomes and (ii) a stability property linking performance to payoff-relevant deviations in the primitive distribution.
Fix a horizon of T decision
steps. In step t there are
mt auction
opportunities, indexed by j ∈ {1, …, mt}.
In the clean model we treat mt ≡ m
as constant and large. Each opportunity (j, t) draws a
primitive
zjt = (vjt, cjt, ejt, ξjt) ∼ P,
independently across (j, t); we emphasize that
this i.i.d. assumption is a modeling convenience rather than a
requirement of practice. The point of introducing it is that, with large
m, the within-step averages of
quantities like spend, exposure, and value concentrate, and many bidder
adjustment rules (pacing, bid shading, reserve updates) can be viewed as
responding to relatively smooth aggregates. When we later impose a
Lipschitz condition for J(M, P), the
large-market approximation provides intuition for why such stability can
be plausible: a small perturbation of P changes many per-opportunity
outcomes only slightly, and these changes average over a large
batch.
We allow L ≥ 1 slots per
opportunity. The mechanism assigns at most one slot per advertiser and
at most one advertiser per slot. Let xijℓt ∈ {0, 1}
indicate whether advertiser i
is shown in slot ℓ at
opportunity (j, t).
Feasibility is
$$
\sum_{i=1}^n x_{ij\ell t} \le 1 \quad \forall \ell,\qquad
\sum_{\ell=1}^L x_{ij\ell t} \le 1 \quad \forall i.
$$
Exposure enters payoffs through an opportunity- and slot-specific
multiplier. A convenient parameterization is
eijℓt = γℓ ⋅ qijt,
where γℓ
is a deterministic slot effect (e.g. position-based click-through) and
qijt
is an opportunity-specific quality/viewability component bundled into
ejt.
This makes explicit that the same allocation decision can generate
different economic impact depending on where an ad is shown and on
contextual factors. Our analysis does not rely on this factorization; it
suffices that exposures eijℓt
are part of the primitive vector zjt,
hence part of what the simulator must reproduce in payoff-relevant
directions.
A mechanism M maps reported
bids into an allocation and payment rule, potentially using
cost-relevant signals cjt and
auxiliary context ξjt. In
many ad auctions, cjt
summarizes quality scores, predicted CTR/CVR, reserve prices, or other
adjustments that affect ranking and pricing. We write the payment
charged to advertiser i in
slot ℓ as
pijℓt = PayM, ℓ(b1jt, …, bnjt; cjt, ξjt),
and define realized (exposure-weighted) spend in opportunity (j, t) as
$$
\mathrm{Spend}_{ijt} \;=\; \sum_{\ell=1}^L e_{ij\ell t}\, x_{ij\ell t}\,
p_{ij\ell t}.
$$
This representation accommodates first-price, second-price, generalized
second-price with quality, and many hybrid designs; it also accommodates
disclosure differences, since what bidders observe affects their
subsequent bids. We keep these details encapsulated in M rather than specifying a single
parametric auction format, because the ranking guarantees will be stated
uniformly over M ∈ {M1, M2}.
Each advertiser i has a
budget Bi
for the episode. Let vijℓt
denote advertiser i’s value
per unit exposure when shown in slot ℓ at opportunity (j, t) (embedded in vjt).
Then advertiser i’s realized
utility in the episode is
$$
U_i(\tau) \;=\; \sum_{t=1}^T \sum_{j=1}^m \sum_{\ell=1}^L e_{ij\ell t}\,
x_{ij\ell t}\,\big(v_{ij\ell t}-p_{ij\ell t}\big),
\qquad \text{s.t.}\quad \sum_{t,j} \mathrm{Spend}_{ijt} \le B_i.
$$
To incorporate CPA-type delivery requirements, we can append a
conversion indicator yijℓt ∈ {0, 1}
(or a conversion count) to the trajectory, with conversion probabilities
driven by ξjt and
the identity/slot shown. A stylized CPA constraint takes the form
$$
\frac{\sum_{t,j,\ell} e_{ij\ell t}\, x_{ij\ell t}\, p_{ij\ell
t}}{\sum_{t,j,\ell} e_{ij\ell t}\, x_{ij\ell t}\, y_{ij\ell t} +
\varepsilon} \;\le\; \mathrm{CPA}_i,
$$
where ε > 0 is a small
regularizer to avoid division by zero. In practice, CPA constraints are
often enforced via multipliers or dual variables (a ``bid downscaling’’
when spend per conversion exceeds target). For our purposes, the key
point is simply that constraints introduce state variables and potential
nonsmoothness: when a budget or CPA target binds, small changes in the
environment can trigger discrete changes in effective bids. This is
precisely why we separate the clean accounting of constraints (what
enters τ) from the stability
assumption (how sensitively R
reacts).
We represent bidder adaptation through a per-step state sit
that includes remaining budget Bit
(and, if relevant, remaining conversion slack or dual variables),
together with any learned parameters. A simple budget update is
$$
B_{i,t+1} \;=\; \bigg[B_{it} - \sum_{j=1}^m
\mathrm{Spend}_{ijt}\bigg]_+,\qquad B_{i1}=B_i.
$$
A canonical behavioral model in large markets is multiplicative pacing:
advertiser i applies a
multiplier αit ∈ [0, 1]
to a base bid signal (such as value or value-per-click), with αit
updated to hit the budget in expectation. For example, one may posit
bids of the form
bijt = αit ⋅ βi(zjt),
where βi(⋅) is a base
scoring rule (possibly learned) and αit
adjusts to control spend. With CPA constraints, one can analogously
include a second multiplier for conversion efficiency. We do not need the
specific update equations; what matters is that the induced mapping from
(M, P) to long-run
behavior can often be viewed as the outcome of a well-posed optimization
(e.g. convex pacing) or a stable learning process. This motivates
treating R(M, P) as a
single-valued map selecting the stationary policy profile π.
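As an illustration of these dynamics, the sketch below implements one decision step of a stylized multiplicative pacing rule with a second-price-like payment. The specific update (a multiplicative nudge toward a per-step spend target) is an assumption made for the example, not a rule required by the analysis.

```python
import numpy as np

def pacing_step(alpha, remaining_budget, target_spend,
                base_bids, prices_to_beat, exposures, eta=0.05):
    """One step of a stylized multiplicative pacing rule (illustrative only).

    alpha            : pacing multiplier alpha_it in [0, 1]
    remaining_budget : budget left for the episode, B_it
    target_spend     : desired spend for this step
    base_bids        : beta_i(z_jt) for each opportunity in the batch
    prices_to_beat   : highest competing price for each opportunity
    exposures        : exposure factor e for each opportunity
    """
    bids = alpha * base_bids                      # b_ijt = alpha_it * beta_i(z_jt)
    wins = bids >= prices_to_beat                 # win iff the paced bid clears the price
    spend = float(np.sum(exposures * prices_to_beat * wins))
    spend = min(spend, remaining_budget)          # cannot spend past the budget
    remaining_budget = max(remaining_budget - spend, 0.0)  # B_{i,t+1} = [B_it - spend]_+
    # Nudge alpha up when under-delivering and down when over-delivering.
    adjustment = np.exp(eta * (target_spend - spend) / max(target_spend, 1e-9))
    alpha = float(np.clip(alpha * adjustment, 0.0, 1.0))
    return alpha, remaining_budget, spend
```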
Given realized allocations and payments, the platform computes an episode metric g(τ) (revenue, welfare, constraint violations, or a composite), and the population performance is J(M, P) = 𝔼[g(τ)] under the trajectory distribution induced by (M, P, R). In this clean model, two features are worth emphasizing. First, the horizon is finite and budgets cap spend, so it is natural to assume |g(τ)| ≤ B for some constant B (possibly depending on (T, m, {Bi})). Second, the multi-slot, batched structure makes explicit that g(τ) is typically a sum (or average) of many bounded per-opportunity contributions. This is what later delivers concentration of simulation estimates Ĵ(M, P̂) in N episodes, holding the simulator fixed.
The clean model adds accounting detail but also clarifies what we will assume. We do not require: a particular auction format (first price versus second price), a parametric model for values, a closed-form equilibrium, or full observability of P. We also do not require that bidders literally solve the constrained utility maximization written above; bidders may be using heuristics or learning rules. All of that is abstracted into R, and the mechanism-specific disclosure policy is likewise treated as part of the input to R.
What we require for the theory is more modest and, importantly, auditable: (i) boundedness of g(τ) at the episode level; (ii) a notion of closeness between P and P̂ that is tailored to payoff-relevant primitives; and (iii) a stability condition asserting that, for the mechanisms under consideration, J(M, P) does not change too abruptly when P is perturbed in those payoff-relevant directions. The clean model above helps interpret these requirements: exposure factors, quality adjustments, and constraints determine which aspects of z are payoff-relevant, while budgets and pacing illustrate why stability is a substantive assumption rather than a tautology. The next section formalizes ``payoff-relevant closeness’’ via moment-based calibration and an integral probability metric, yielding a certificate δ that can be carried directly into finite-sample ranking guarantees.
Our ranking guarantee will treat the simulator not as a black box
whose outputs we ``trust,’’ but as an object that comes with an
explicit discrepancy bound relative to the real environment. The role of
calibration is therefore not merely predictive accuracy in a generic
sense; it is to produce a number δ such that
dℱ(P, P̂) ≤ δ,
for a function class ℱ that is
explicitly chosen to encode payoff-relevant aspects of primitives z. This section formalizes (i) how
we translate disclosure-limited real-world logs into a collection of
moments, (ii) how those moments induce an integral probability metric
(IPM), and (iii) why matching the moments implies a certificate on the
IPM distance.
In the clean model, primitives z = (v, c, e, ξ)
collect all objects that can matter for allocations, payments, and
constraints once bidder behavior is fixed. In practice, the analyst
rarely observes z directly.
Instead, the platform discloses aggregates: marginal distributions of
reserve prices or quality scores, exposure or click rates by segment,
conditional win rates at a given bid, or spend and conversion summaries
by time and category. We capture such disclosure through a vector of
moment functions
ϕ(z) = (ϕ1(z), …, ϕK(z)) ∈ ℝK,
chosen so that (a) each ϕk(z)
is observable or estimable from the real logs and (b) collectively,
ϕ spans the directions in
which deviations between P and
P̂ are economically meaningful
for the mechanisms under evaluation.
Concretely, ϕ can include
simple moments such as $\mathbbm{1}\{\xi\in
s\}$ for segments s
(time-of-day×category bins), exposure
factors e by slot, and
distributions of cost-relevant signals c. It can also include objects that
are directly tied to payments, such as (estimated) least-winning-cost
curves: for a bid level b and
segment s, define
$$
\phi_{b,s}(z)\;=\;\mathbbm{1}\{\xi\in s\}\cdot
\mathbbm{1}\{\mathrm{LWC}(z)\le b\},
$$
where LWC(z) is the random
``price to beat’’ induced by competitors and mechanism rules as a
function of the underlying opportunity (or a proxy for it). While LWC(z) may not be literally
observed, platforms often disclose sufficient information to estimate it
(e.g., win rate as a function of bid), and the simulator can be
calibrated to reproduce those curves. The guiding principle is that
ϕ should be rich enough that,
once bidders respond via R,
the induced trajectory distribution is similar in those aspects that
drive g(τ).
Given a class ℱ of real-valued
functions on primitives, recall the IPM
dℱ(P, Q) := supf ∈ ℱ|𝔼z ∼ P[f(z)] − 𝔼z ∼ Q[f(z)]|.
The choice of ℱ is the bridge between
``what we can check’’ (calibration diagnostics) and ``what we
need’’ (a bound that can be propagated through the performance
functional). For calibration, a particularly transparent construction is
to take ℱ to be the unit ℓ1-ball of linear
functionals of ϕ:
ℱϕ := {fw(z) = w⊤ϕ(z) : ∥w∥1 ≤ 1}.
This class is attractive for two reasons. First, it yields a closed-form
distance in terms of the moment mismatch. Second, it makes the
certificate interpretable: any weighted combination of calibrated
moments (with total weight at most one) is matched within δ.
Let μ(P) := 𝔼P[ϕ(z)] ∈ ℝK
and similarly μ(P̂).
By definition,
$$
d_{\mathcal{F}_\phi}(P,\hat P)\;=\;\sup_{\|w\|_1\le 1}\big|w^\top\big(\mu(P)-\mu(\hat P)\big)\big|\;=\;\big\|\mu(P)-\mu(\hat P)\big\|_\infty.
$$
Therefore, if calibration ensures the coordinate-wise bound
∥μ(P) − μ(P̂)∥∞ ≤ η,
then we immediately obtain the certificate
dℱϕ(P, P̂) ≤ η ≡ δ.
The implication is mechanically simple but conceptually important:
calibration can be framed as producing a certificate in a metric that is designed
around payoff relevance. In particular, we do not claim that P̂ is globally close to P (e.g., in total variation); we
only certify closeness along directions spanned by the selected
moments.
In a real deployment, μ(P) is unknown and must be
estimated from logs. Let μ̂ be
an empirical estimate of μ(P) from a real dataset of
size Nreal, and let
μ(P̂) be computable
from the simulator (either analytically or by running a large number of
simulator draws). A practical certificate must account for statistical
error:
$$
\big\|\mu(P)-\mu(\hat P)\big\|_\infty
\;\le\;\underbrace{\big\|\widehat{\mu}-\mu(\hat
P)\big\|_\infty}_{\text{empirical mismatch}}
+\underbrace{\big\|\mu(P)-\widehat{\mu}\big\|_\infty}_{\text{estimation
error}}.
$$
The second term can be upper bounded with high probability using
standard concentration, provided ϕk(z)
are bounded (or sub-Gaussian after appropriate winsorization). For
instance, if ϕk(z) ∈ [0, 1]
then a union bound with Hoeffding yields: with probability at least
1 − ρ,
$$
\big\|\mu(P)-\widehat{\mu}\big\|_\infty
\;\le\;\sqrt{\frac{\log(2K/\rho)}{2N_{\mathrm{real}}}}.
$$
Defining
$$
\delta(\rho)\;:=\;\big\|\widehat{\mu}-\mu(\hat
P)\big\|_\infty+\sqrt{\frac{\log(2K/\rho)}{2N_{\mathrm{real}}}},
$$
we obtain a high-probability calibration certificate dℱϕ(P, P̂) ≤ δ(ρ).
This is the quantity that will later be multiplied by a Lipschitz
constant to bound simulator-induced bias in mechanism performance.
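For concreteness, the following sketch computes the certificate δ(ρ) from a vector of empirical real-log moments and the simulator’s moments, under the assumption (as above) that each ϕk takes values in [0, 1]; the function and argument names are ours, not a fixed API.

```python
import numpy as np

def calibration_certificate(mu_hat_real, mu_sim, n_real, rho=0.05):
    """High-probability certificate delta(rho) >= d_{F_phi}(P, P_hat).

    mu_hat_real : empirical moments mu_hat estimated from N_real real-log samples
    mu_sim      : simulator moments mu(P_hat), e.g. from a very large simulated sample
    Assumes each phi_k(z) lies in [0, 1], so Hoeffding plus a union bound applies.
    """
    mu_hat_real = np.asarray(mu_hat_real, dtype=float)
    mu_sim = np.asarray(mu_sim, dtype=float)
    k = mu_hat_real.size
    empirical_mismatch = float(np.max(np.abs(mu_hat_real - mu_sim)))  # ||mu_hat - mu(P_hat)||_inf
    estimation_error = float(np.sqrt(np.log(2 * k / rho) / (2 * n_real)))
    return empirical_mismatch + estimation_error

# Example with K = 3 disclosed moments estimated from 50,000 real-log samples:
# delta = calibration_certificate([0.21, 0.48, 0.07], [0.20, 0.50, 0.05], n_real=50_000)
```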
The definition of ϕ is where economic judgment enters. If ϕ omits a primitive feature that meaningfully shifts bidder incentives or pricing (e.g., the joint distribution of quality scores and slot exposures, or the tail behavior of high values that drive auction competition), then dℱϕ(P, P̂) can be small while mechanism performance differs substantially. Conversely, an overly rich ϕ may be impossible to estimate accurately under disclosure constraints, leading to a large δ and vacuous downstream guarantees. We view this as the central tradeoff: the moment set must be rich enough to capture payoff-relevant deviations yet parsimonious enough to be estimated accurately from disclosed data. In applications, we recommend building ϕ from (i) segmentation moments that capture heterogeneity in ξ, (ii) exposure and click-through structure by slot and segment, and (iii) pricing landscape moments (win-rate or least-winning-cost curves) that are directly tied to payments and hence to both bidder utility and platform revenue.
While ℱϕ yields an especially transparent certificate, the same logic extends to other IPMs that may better capture nonlinear payoff dependence. For example, one can take ℱ to be a bounded-Lipschitz class (leading to Wasserstein-type control on z) or an RKHS unit ball (yielding maximum mean discrepancy). These alternatives can be more expressive but typically produce certificates that are harder to audit from limited logs. A pragmatic compromise is to begin with a linear-moment certificate (which is interpretable and implementable) and then perform sensitivity stress tests: perturb P̂ within the calibrated moment tolerances and measure how much Ĵ(M, P̂) varies. Such tests do not replace the formal bound, but they help diagnose whether the chosen ϕ is plausibly aligned with payoff relevance and whether the implied Lipschitz constant is empirically moderate.
Finally, we emphasize a limitation that will matter when interpreting the main theorem. A small δ controls discrepancies only through the lens of ℱ. If bidder behavior reacts sharply to features of P that are not captured by ϕ (for instance, rare events that trigger budget exhaustion earlier, or correlations across opportunities that defeat the i.i.d. approximation), then the performance functional need not be stable in dℱϕ. Our approach therefore separates concerns cleanly: calibration provides a verifiable distance bound δ, and the next section will impose (and discuss the plausibility of) a Lipschitz stability condition that converts this bound into a bound on counterfactual performance error and, ultimately, a misranking probability guarantee.
We now formalize the logic that turns a calibration certificate into a mechanism-ranking guarantee. Conceptually, the analyst faces two distinct sources of error when using a calibrated simulator to compare mechanisms. The first is simulator bias: even if we could compute J(M, P̂) exactly, it may differ from J(M, P) because P̂ ≠ P. The second is Monte Carlo error: in practice we estimate J(M, P̂) using a finite number of simulated episodes. Our main bound is obtained by (i) propagating the calibration discrepancy through the performance functional via a continuity (Lipschitz) condition, and (ii) controlling Monte Carlo error via concentration for bounded outcomes.
The key assumption is that, for the mechanisms under consideration
and the bidder-response mapping R, the induced population
performance
J(M, P) = 𝔼τ ∼ ℒ(M, P, R)[g(τ)]
does not change too sharply when the opportunity distribution changes
slightly in a payoff-relevant metric. Formally, we assume there exists a
function class ℱ and a constant L < ∞ such that, for all
distributions P, Q
over primitives and each M ∈ {M1, M2},
$$
\big|J(M,P)-J(M,Q)\big|\;\le\;L\, d_{\mathcal{F}}(P,Q),
$$
where dℱ is the IPM
introduced above. The constant L is economically interpretable: it
measures how strongly payoff-relevant distributional error is amplified into performance error once bidders respond. When bidder behavior is essentially fixed (or responds
smoothly), L is typically
moderate; when small changes in primitives flip budget-binding status,
change pacing multipliers discontinuously, or move the system across
equilibrium-selection boundaries, L can be large.
Given a calibration certificate dℱ(P, P̂) ≤ δ,
the Lipschitz condition implies a direct bias bound for each mechanism:
$$
\big|J(M,P)-J(M,\hat P)\big|\;\le\;L\,\delta,\qquad M\in\{M_1,M_2\}.
$$
Because ranking depends on a difference of performances, the corresponding bound
for the true gap
Δ* := J(M1, P) − J(M2, P) and Δsim := J(M1, P̂) − J(M2, P̂)
follows by the triangle inequality:
$$
\big|\Delta^{*}-\Delta_{\mathrm{sim}}\big|\;\le\;2L\,\delta.
$$
This bound makes precise the sense in which calibration is useful for
ranking: it creates a bias band of width 2Lδ inside which the
simulator may misstate the gap even with infinite simulation.
We next relate Δsim to its empirical
estimate computed from N
i.i.d. simulated episodes under P̂. Let Ĵ(M, P̂) be the
sample average of g(τ) across the N simulator trajectories for
mechanism M, and define the
estimated gap
Δ̂ := Ĵ(M1, P̂) − Ĵ(M2, P̂).
Under bounded outcomes, |g(τ)| ≤ B,
Hoeffding’s inequality implies that each Ĵ(M, P̂)
concentrates around J(M, P̂) at rate
$O(1/\sqrt{N})$. Applying concentration
to both mechanisms and union bounding yields, for any ε > 0,
$$
\Pr\big(|\hat\Delta-\Delta_{\mathrm{sim}}|\ge\varepsilon\big)\;\le\;4\exp\!\left(-\frac{N\varepsilon^2}{8B^2}\right).
$$
Combining the bias band with this concentration bound gives a sign-recovery guarantee. Intuitively, the
simulator must (a) be calibrated tightly enough that the bias band 2Lδ is smaller than the
true gap, and (b) be run for enough episodes that Monte Carlo noise is
unlikely to flip the sign.
Putting the pieces together, we obtain an explicit upper bound on the
probability of selecting the wrong mechanism:
$$
\Pr\big(\mathrm{sign}(\hat\Delta)\neq\mathrm{sign}(\Delta^{*})\big)\;\le\;4\exp\!\left(-\frac{N\big(|\Delta^{*}|-2L\delta\big)_+^{2}}{8B^{2}}\right),
$$
where (x)+ := max {x, 0}.
The structure of this bound cleanly separates irreducible simulator bias from reducible Monte Carlo noise: if |Δ*| ≤ 2Lδ,
then even infinite simulation cannot guarantee correct ranking (the
bound becomes trivial). If |Δ*| > 2Lδ,
then the misranking probability decays exponentially in N with an effective margin |Δ*|−2Lδ.
A direct corollary inverts this bound into a sample-size requirement. Fix a
target misranking probability ρ ∈ (0, 1) and suppose |Δ*| > 2Lδ.
It suffices to take
$$
N\;\ge\;\frac{8B^{2}}{\big(|\Delta^{*}|-2L\delta\big)^{2}}\,\log\!\left(\frac{4}{\rho}\right).
$$
This requirement clarifies the practical tradeoff: tightening calibration
(smaller δ) and improving
stability (smaller L) can
substitute for computational budget (smaller required N).
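The bound and its inversion can be evaluated directly; the sketch below does so under the same assumptions (|g(τ)| ≤ B, certificate δ, stability constant L) and with the Hoeffding constants used above.

```python
import numpy as np

def misranking_bound(true_gap, L, delta, B, n_episodes):
    """Upper bound 4*exp(-N*(|gap| - 2*L*delta)_+^2 / (8*B^2)) on the misranking probability."""
    margin = max(abs(true_gap) - 2.0 * L * delta, 0.0)
    if margin == 0.0:
        return 1.0  # inside the bias band the bound is vacuous
    return min(1.0, 4.0 * float(np.exp(-n_episodes * margin**2 / (8.0 * B**2))))

def required_episodes(true_gap, L, delta, B, rho=0.05):
    """Smallest N that guarantees misranking probability at most rho, or None if infeasible."""
    margin = abs(true_gap) - 2.0 * L * delta
    if margin <= 0.0:
        return None  # no amount of simulation can close a gap inside the bias band
    return int(np.ceil(8.0 * B**2 * np.log(4.0 / rho) / margin**2))
```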
The condition |Δ*| > 2Lδ is the central ``ranking margin’’ requirement. It can fail for three economically distinct reasons. First, the mechanisms may truly be close in performance in the real environment (|Δ*| small); this is not a statistical failure but a substantive one—the platform is trying to distinguish near-ties. Second, calibration may be coarse (large δ), often because disclosure limitations restrict the richness of ϕ or because the real dataset used to estimate moments is small. Third, strategic amplification may be severe (large L), meaning that even small deviations in payoff-relevant primitives induce large changes in bidder policies and hence in outcomes.
The third case deserves emphasis: vacuity can arise even if the simulator matches many moments well. In repeated auctions with budgets, small shifts in the tail of values or in exposure factors can change which advertisers become budget-constrained and when, which can sharply alter equilibrium pacing multipliers and thus prices throughout the episode. Similarly, when R selects among multiple equilibria, small perturbations in P can move the selected equilibrium discontinuously. Our framework is explicit about this limitation: the certificate δ is only as useful as the stability constant L is moderate.
In applications, L is
rarely available in closed form for realistic multi-slot mechanisms with
complex bidder response. We therefore view the Lipschitz condition as a working assumption that should be audited.
One practical approach is to compute an empirical upper envelope by
stress-testing the simulator: generate perturbed distributions P̃ that remain within the calibrated
moment tolerance (so that dℱ(P̂, P̃)
is controlled), recompute simulated performance, and measure the worst
observed slope
$$
\hat L\;:=\;\max_{M\in\{M_1,M_2\}}\;\max_{\tilde P\in\mathcal{U}}
\frac{|J(M,\hat P)-J(M,\tilde P)|}{d_{\mathcal{F}}(\hat P,\tilde P)},
$$
for an appropriately chosen neighborhood 𝒰. While such a procedure does not certify a global constant, it can
reveal whether the mechanism comparison is locally stable in the
directions of uncertainty that calibration leaves unresolved.
The next section illustrates these ideas in settings where we can compute L explicitly or nearly so. In a single-slot first-price auction with exogenous proportional bidding, continuity reduces to bounding expectations of a bounded function, making the constant L transparent. We then contrast this with pacing-based response rules, where L reflects how sharply spending constraints bite. These examples make the ranking condition |Δ*| > 2Lδ concrete and clarify where numerical computation becomes unavoidable (e.g., multi-slot GSP with rich bidder adaptation).
We now illustrate how the ranking condition |Δ*| > 2Lδ becomes concrete in stylized environments. The role of these examples is not realism per se; rather, they separate two distinct channels through which L can become large. In the first example bidder behavior is exogenous, so L is essentially a property of a bounded payoff function. In the second example bidder behavior depends on P through a budget-driven pacing rule, and L acquires an additional ``implicit’’ term that can blow up when budgets are close to binding.
Throughout, let $\bar m:=\sum_{t=1}^T m_t$ denote the number of auction opportunities per episode (treated as fixed for exposition), and assume primitives are bounded: vi ∈ [0, v̄] and exposure factors e ∈ [0, ē]. We take dBL to be the bounded-Lipschitz IPM over per-opportunity primitives z (so that dBL(P, Q) = sup∥f∥BL ≤ 1|𝔼P[f] − 𝔼Q[f]| with ∥f∥BL := sup |f|+Lip(f)). This choice is convenient because it yields immediate continuity once we verify boundedness and Lipschitzness of the per-opportunity payoff map.
Consider a single-slot first-price auction with n bidders and per-opportunity
primitive
z = (v1, …, vn, e),
where vi
is bidder i’s value and e is an exposure multiplier. Suppose
bids are exogenously proportional to value,
bi = αvi, α ∈ (0, 1],
so there is no strategic dependence of π on P beyond drawing z ∼ P. Let i*(z) ∈ arg maxivi
denote the winner. Under first-price, per-opportunity revenue is
rFP(z) = e bi*(z) = e α maxivi,
and per-opportunity welfare is w(z) = e maxivi.
If we take episode-level performance to be total revenue, then
J(MFP, P) = m̄ 𝔼z ∼ P[rFP(z)].
To obtain an explicit Lipschitz constant, endow the primitive space
with the metric
d(z, z′) := maxi|vi − vi′| + |e − e′|.
Then maxivi
is 1-Lipschitz under maxi|vi − vi′|,
and multiplication by e adds
only bounded linear sensitivity. A simple bound is
$$
\big|r_{\mathrm{FP}}(z)-r_{\mathrm{FP}}(z')\big|\;\le\;\alpha\,\bar e\,\max_i|v_i-v_i'|+\alpha\,\bar v\,|e-e'|\;\le\;\alpha(\bar e+\bar v)\,d(z,z').
$$
Moreover 0 ≤ rFP(z) ≤ αē v̄.
Hence
∥rFP∥BL ≤ αē v̄ + α(ē + v̄).
By the defining duality of dBL, we obtain
|J(MFP, P) − J(MFP, Q)| ≤ m̄ ∥rFP∥BL dBL(P, Q),
so one admissible choice is
$$
L_{\mathrm{FP}}\;:=\;\bar m\,\|r_{\mathrm{FP}}\|_{\mathrm{BL}}\;\le\;\bar m\,\alpha\big(\bar e\,\bar v+\bar e+\bar v\big).
$$
A parallel calculation applies to other single-slot payment rules when bidding remains exogenous. For example, under second-price with the same proportional bids, per-opportunity revenue becomes rSP(z) = e α v(2) where v(2) is the second-highest value. Since v(2) is also 1-Lipschitz in (v1, …, vn) under the ℓ∞ metric, the same bound continues to hold (up to constants) with v(2) in place of maxivi.
The ranking condition is then explicit. If we compare M1 = FP and M2 = SP and take revenue
as the metric, the true gap satisfies
Δ* = J(MFP, P) − J(MSP, P) = m̄ α 𝔼P[e (v(1) − v(2))],
where v(1) is the
top order statistic. Our general theory says that, given dBL(P, P̂) ≤ δ,
the simulator can invert the ranking only if the true gap is within the
bias band:
|Δ*| ≤ 2L δ,
with L taken as (for instance)
max {LFP, LSP}.
Thus, when values are diffuse enough that 𝔼[v(1) − v(2)]
is materially positive, and calibration is tight, correct ranking is
easy; when competition is intense and the top two values are close, the
mechanisms are intrinsically hard to distinguish no matter how many
simulator runs we perform.
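The gap in this example can also be checked numerically. The sketch below uses toy primitive distributions of our own choosing (values Unif[0, 1], exposure Unif[0.5, 1] independent of values) to estimate Δ* = m̄ α 𝔼[e (v(1) − v(2))] by Monte Carlo.

```python
import numpy as np

def fp_sp_revenue_gap(n_bidders=5, alpha=0.8, m_bar=1_000,
                      n_samples=200_000, seed=0):
    """Monte Carlo estimate of Delta* = m_bar * alpha * E[e * (v_(1) - v_(2))]
    under exogenous proportional bidding b_i = alpha * v_i (toy primitives)."""
    rng = np.random.default_rng(seed)
    v = rng.uniform(0.0, 1.0, size=(n_samples, n_bidders))  # values v_i ~ Unif[0, 1]
    e = rng.uniform(0.5, 1.0, size=n_samples)                # exposure, independent of values
    v_sorted = np.sort(v, axis=1)
    v1, v2 = v_sorted[:, -1], v_sorted[:, -2]                # top two order statistics
    rev_fp = e * alpha * v1                                  # first-price revenue per opportunity
    rev_sp = e * alpha * v2                                  # second-price revenue per opportunity
    return m_bar * float(np.mean(rev_fp - rev_sp))

# The gap shrinks as n_bidders grows, since v_(1) - v_(2) concentrates near zero:
# print(fp_sp_revenue_gap(n_bidders=2), fp_sp_revenue_gap(n_bidders=20))
```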
We next introduce a minimal pacing model in which bidder behavior depends on P through a budget target. The purpose is to expose where discontinuities and near-discontinuities enter L.
Consider a single representative advertiser facing an exogenous
per-impression ``market price’’ y (interpretable as the highest
competing bid) and a value v.
The primitive is z = (v, y, e)
with v ∈ [0, v̄],
y ∈ [0, ȳ], e ∈ [0, ē]. The advertiser
uses a pacing multiplier μ ∈ [0, 1] and bids b = μv, winning
iff μv ≥ y.
If the advertiser wins, suppose payment is the market price y (a second-price-like
simplification), so per-opportunity spend is
s(z; μ) = e y 1{μv ≥ y}.
Let s̄ := B/m̄
denote the target spend per opportunity implied by the episode budget
B. A myopic pacing rule
chooses μ = μ(P) to
satisfy the stationarity condition
$$
S_P(\mu)\;:=\;\mathbb{E}_{z\sim P}\big[s(z;\mu)\big]\;=\;\bar s,
$$
truncated to [0, 1] if needed. Take
platform revenue as the performance metric; then per-opportunity revenue
is simply r(z; μ) = s(z; μ)
and
J(M, P) = m̄ 𝔼z ∼ P[r(z; μ(P))] = m̄ SP(μ(P)) = m̄ s̄,
which is degenerate by construction. To see nontrivial dependence on
P, one can instead evaluate
welfare (or advertiser value), e.g.
w(z; μ) = e v 1{μv ≥ y}, JSW(M, P) = m̄ 𝔼P[w(z; μ(P))].
Here P affects outcomes
directly through the distribution of (v, y, e) and
indirectly through μ(P).
The Lipschitz bound decomposes accordingly. For two environments
P, Q,
$$
\big|J_{\mathrm{SW}}(M,P)-J_{\mathrm{SW}}(M,Q)\big|
\;\le\;
\bar m\,\Big|\mathbb{E}_{P}\big[w(z;\mu(P))-w(z;\mu(Q))\big]\Big|
\;+\;
\bar m\,\Big|\mathbb{E}_{P}\big[w(z;\mu(Q))\big]-\mathbb{E}_{Q}\big[w(z;\mu(Q))\big]\Big|.
$$
The second term is
``direct’’ and is controlled by the BL norm of $w(\cdot;\mu)$ (uniformly over $\mu\in[0,1]$). The first term is ``indirect’’
and requires controlling |μ(P) − μ(Q)|,
i.e., how sensitive the pacing multiplier is to the environment.
A sufficient condition for stability is that the spend map SP(μ)
is uniformly steep: assume SP is
differentiable and there exists κ > 0 such that
$$
S_P'(\mu)\;\ge\;\kappa\qquad\text{for all }\mu\text{ in the relevant range, uniformly over the environments under comparison.}
$$
Then an implicit-function argument yields a Lipschitz bound on the
pacing multiplier:
$$
\big|\mu(P)-\mu(Q)\big|\;\le\;\frac{1}{\kappa}\,\sup_{\mu\in[0,1]}\big|S_P(\mu)-S_Q(\mu)\big|.
$$
If calibration controls the right-hand side via an IPM certificate (for
instance, by including s(z; μ)-type
moments in ϕ on a grid of
μ values), then |μ(P) − μ(Q)| ≲ δ/κ,
and substituting into the decomposition above produces an overall Lipschitz constant of the
schematic form
$$
L_{\mathrm{pacing}}\;\le\;C_{\mathrm{direct}}+\frac{C_{\mathrm{indirect}}}{\kappa},
$$
for explicit problem-dependent constants Cdirect, Cindirect
determined by the boundedness of v, y, e and the
Lipschitz properties of the indicator-smoothed class used in dℱ. The economic point is
that the indirect term scales like 1/κ:
when the spend function is flat, a tiny perturbation in P can cause a large change in μ, which then propagates to welfare
and (in richer models) to prices.
A simple closed form makes this tangible. Suppose e ≡ 1, v ≡ v̄ is constant, and
y ∼ Unif[0, ȳ].
Then
$$
S_P(\mu)
\;=\;
\mathbb{E}\big[y\,\mathbf{1}\{y\le \mu\bar v\}\big]
\;=\;
\int_{0}^{\min\{\mu\bar v,\bar y\}} y\frac{1}{\bar y}\,dy
\;=\;
\frac{\min\{\mu\bar v,\bar y\}^2}{2\bar y}.
$$
For the interior region μv̄ ≤ ȳ, we have
SP′(μ) = μv̄2/ȳ,
so κ is proportional to μ. When the budget is tight (small
s̄), the implied μ is small, the slope is small, and
the schematic bound above predicts a large Lpacing. This is
precisely the budget-kink phenomenon: near the ``no-spend’’ region,
small distributional changes can flip many marginal wins and losses,
forcing large pacing adjustments.
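The closed form can be turned into a small numerical illustration of the 1/κ amplification; the sketch below solves the interior stationarity condition for μ and reports the implied slope, under the same assumptions (e ≡ 1, v ≡ v̄, y ~ Unif[0, ȳ]).

```python
import numpy as np

def pacing_multiplier(s_bar, v_bar, y_bar):
    """Interior solution of S_P(mu) = (mu*v_bar)^2 / (2*y_bar) = s_bar, clipped to [0, 1]."""
    return float(np.clip(np.sqrt(2.0 * y_bar * s_bar) / v_bar, 0.0, 1.0))

def spend_slope(mu, v_bar, y_bar):
    """kappa = S_P'(mu) = mu * v_bar^2 / y_bar on the interior region mu * v_bar <= y_bar."""
    return mu * v_bar**2 / y_bar

# Tighter budgets give a smaller mu, a flatter spend curve, and a larger 1/kappa:
# for s_bar in (0.20, 0.05, 0.01):
#     mu = pacing_multiplier(s_bar, v_bar=1.0, y_bar=1.0)
#     print(f"s_bar={s_bar:.2f}  mu={mu:.3f}  1/kappa={1.0 / spend_slope(mu, 1.0, 1.0):.2f}")
```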
These examples are deliberately sparse: they suppress position effects, cross-advertiser interactions through budgets, and equilibrium multiplicity. In multi-slot GSP (or variants with reserve prices, quality weights, and disclosure-dependent learning), the analogue of μ(P) becomes a high-dimensional object (pacing multipliers, bid-shading functions, or policy-network parameters), and SP becomes a coupled fixed-point system. In such settings we generally cannot compute L analytically, and even verifying a condition like uniform steepness of the spend map is nontrivial. This is exactly where the stress-testing approach becomes operational: we estimate an empirical upper envelope for the local slope of J with respect to calibrated perturbations of P̂, and then interpret |Δ*| > 2Lδ as a practical margin check rather than a closed-form theorem input. The next section implements this logic in AuctionNet by constructing controlled misspecifications and directly measuring how rankings and empirical Lipschitz constants respond.
In environments as rich as those targeted by AuctionNet, the difficulty is not writing down a bound of the form |Δ*| > 2Lδ, but making L and δ interpretable when bidder response R is implemented by a coupled learning-and-pacing stack and primitives z = (v, c, e, ξ) are high-dimensional. Our empirical goal is therefore modest and operational: we use AuctionNet to (i) construct controlled misspecifications of a calibrated simulator, (ii) measure when mechanism rankings change under those misspecifications, and (iii) back out an empirical Lipschitz envelope that can be plugged into the margin condition as a stress-test statistic.
Because we cannot observe the true P in a fully realistic deployment,
we adopt the standard evaluation trick: we treat a high-fidelity,
well-calibrated simulator P0 as the ``ground
truth’’ environment and then generate alternative simulators Q that are intentionally
misspecified. Concretely, we fix a bidder-response map R (e.g., a budget-paced policy
learner run to stationarity under each mechanism) and a set of candidate
mechanisms {M1, M2}
(for instance, GSP with a reserve versus first-price with a different
disclosure rule, or two reserve/quality-weight configurations). For each
environment E ∈ {P0, Q}
and each mechanism Mk, we
estimate
$$
\hat J(M_k,E)\;=\;\frac{1}{N}\sum_{r=1}^N g(\tau_r^{(k,E)}),
\qquad
\tau_r^{(k,E)}\sim\mathcal{L}(M_k,E,R),
$$
and form the estimated gap Δ̂(E) := Ĵ(M1, E) − Ĵ(M2, E).
This isolates the impact of distributional error in the generator from
other confounds.
We focus on three misspecification families that arise naturally in ad auctions and that can be implemented cleanly in AuctionNet by editing the generative process for z.
Let ξ include a discrete
time index h ∈ {1, …, 24}, and
suppose the baseline is a mixture P0(z) = ∑hπhP0(z ∣ h).
A time-of-day shift keeps conditionals fixed but changes mixture
weights:
Qtod(z) = ∑hπh′P0(z ∣ h), with ∥π − π′∥1 controlled.
Economically, this captures demand shocks, traffic seasonality, or
measurement drift that reweights contexts without altering
within-context behavior.
Let ξ include a content
category c ∈ {1, …, C} (query class,
vertical, or page type). We construct
Qcat(z) = ∑cρc′P0(z ∣ c),
again with controlled deviations in ρ′. This misspecification
is particularly relevant for policy evaluation because mechanisms often
interact with category through reserve rules, quality weights, or
heterogeneous bid landscapes.
Finally, we distort the upper tail of payoff-relevant primitives,
since revenue and welfare can be disproportionately driven by rare but
extreme opportunities. In AuctionNet we implement a monotone tail
transform on one-dimensional components (e.g., values or market prices)
while leaving the rest of z
unchanged. For example, for a value component v ∈ [0, v̄], we use
$$
v'\;=\;T_\gamma(v)\;:=\;\bar v\left(\frac{v}{\bar v}\right)^\gamma,
\qquad \gamma\in(0,1)\ \text{(heavier tail)}\ \text{or}\ \gamma>1\
\text{(lighter tail)}.
$$
Alternatively, we use a Pareto-like mixture that inflates the top ε quantile. This family is
intentionally adversarial: it targets precisely the region where reserve
prices, floor rules, and pacing kinks generate nonlinear responses.
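A minimal sketch of the power-law tail transform is given below; applying it to a one-dimensional component of z while leaving the rest unchanged is how we construct this misspecification family (the Pareto-like mixture variant is omitted).

```python
import numpy as np

def tail_transform(v, v_bar, gamma):
    """Monotone tail distortion v' = v_bar * (v / v_bar)**gamma on [0, v_bar].

    gamma < 1 pushes mass toward v_bar (heavier upper tail); gamma > 1 thins the tail.
    """
    v = np.asarray(v, dtype=float)
    return v_bar * (v / v_bar) ** gamma

# Example: a heavier-tailed value stream from a baseline Unif[0, 1] draw.
# rng = np.random.default_rng(0)
# v_heavy = tail_transform(rng.uniform(0.0, 1.0, size=10_000), v_bar=1.0, gamma=0.6)
```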
To align the stress test with the formal guarantees, we measure
distributional deviation using the same function class ℱ used for calibration. In practice we take a
moment class ϕ(z) ∈ ℝK
(time×category means, quantile
summaries of y, win-rate
curves, least-winning-cost proxies, and spend curves evaluated on a grid
of multipliers), and set
dℱ(P, Q) = sup∥w∥1 ≤ 1|𝔼P[w⊤ϕ(z)] − 𝔼Q[w⊤ϕ(z)]| = ∥𝔼P[ϕ(z)] − 𝔼Q[ϕ(z)]∥∞,
as in Proposition 3. This choice has two advantages in AuctionNet:
first, dℱ is easy
to compute from simulated draws; second, it keeps the stress test
anchored to payoff-relevant discrepancies rather than generic divergences that may be
large but innocuous.
When we treat P0 as truth, we can compute the realized deviation δtrue := dℱ(P0, Q) directly. We can also compute a quantity δcert based on the subset of moments that would plausibly be disclosed in a real calibration setting. The comparison δcert versus δtrue is informative: it tells us when limited disclosure systematically under-detects the misspecifications that matter for ranking.
Given a collection of perturbed simulators {Qℓ}, ℓ = 1, …, L,
we estimate an empirical Lipschitz constant by the worst-case slope of
J with respect to the chosen
IPM:
$$
\widehat L_{\mathrm{emp}}
\;:=\;
\max_{k\in\{1,2\}}
\max_{\ell\in\{1,\dots,L\}}
\frac{\left|\hat J(M_k,Q_\ell)-\hat
J(M_k,P_0)\right|}{d_{\mathcal{F}}(P_0,Q_\ell)+\varepsilon},
$$
with a small ε > 0 to
prevent numerical instability when perturbations are extremely mild. We
emphasize what this statistic is (and is not): it is not a global modulus of
continuity over all distributions, but rather an upper envelope over a
structured neighborhood of misspecifications that we consider plausible.
Precisely because strategic response can generate kinks, we find it
useful to report L̂emp separately by
mechanism and by operating regime (e.g., loose budgets versus
near-binding budgets), since the pacing amplification term can vary by
orders of magnitude.
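The empirical envelope itself is a one-line computation once the simulated performances and the moment-based distances are in hand; a minimal sketch (with array shapes of our choosing) follows.

```python
import numpy as np

def empirical_lipschitz(j_perturbed, j_baseline, distances, eps=1e-6):
    """Worst-case slope L_emp over a suite of perturbed simulators.

    j_perturbed : array (n_mechanisms, n_perturbations) of J_hat(M_k, Q_l)
    j_baseline  : array (n_mechanisms,) of J_hat(M_k, P_0)
    distances   : array (n_perturbations,) of d_F(P_0, Q_l)
    """
    j_perturbed = np.asarray(j_perturbed, dtype=float)
    j_baseline = np.asarray(j_baseline, dtype=float)[:, None]
    distances = np.asarray(distances, dtype=float)[None, :]
    slopes = np.abs(j_perturbed - j_baseline) / (distances + eps)
    return float(slopes.max())

# L_emp = empirical_lipschitz(j_perturbed=[[10.2, 9.6], [9.5, 9.9]],
#                             j_baseline=[10.0, 9.7],
#                             distances=[0.03, 0.05])
```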
Across the time-of-day and category-mixture shifts, mechanism rankings are typically stable unless the baseline gap is already small. Empirically, these shifts tend to move both mechanisms in the same direction, so Δ̂(Qℓ) tracks Δ̂(P0) closely. In contrast, tail distortions are the primary driver of sign changes: inflating the upper tail of v (or the competitive price y) can differentially affect mechanisms with reserves or aggressive price discrimination, and it can induce large changes in learned pacing multipliers. This is exactly the channel highlighted by the 1/κ amplification effect in the pacing discussion: once the response map begins to adjust sharply, the indirect component dominates.
To connect directly to the theory, we evaluate the empirical
margin
$$
\widehat{\mathrm{Margin}}
\;:=\;
|\hat\Delta(P_0)|-2\,\widehat L_{\mathrm{emp}}\,\delta_{\mathrm{cert}},
$$
and check whether it predicts stability. The qualitative pattern is
consistent with the main message: when the margin is comfortably
positive, observed misrankings are rare even at moderate N; when it is near zero or negative,
sign flips become common and are often traced to a small set of tail
events that the calibration moments did not constrain tightly.
Two caveats emerge. First, the concentration part of the misranking bound (Hoeffding with |g(τ)| ≤ B) is generally conservative in AuctionNet, because episode outcomes average across many opportunities and exhibit much smaller variance than worst-case boundedness suggests. In practice, replacing the worst-case B by an empirical variance proxy yields far sharper error bars, though at the cost of additional assumptions.
Second, the bias term 2Lδ can be either conservative or optimistic depending on what δ measures. When δcert is computed from a thin moment set (e.g., only marginal means), it can dramatically understate tail distortions and joint shifts, making the bound falsely reassuring. Conversely, when we include a richer set of ``curve moments’’ (win-rate and spend curves on a grid), δcert becomes more faithful and the resulting 2L̂empδcert tends to be a conservative but informative buffer: it rarely under-predicts instability, and when it over-predicts, the overage is largely explained by local smoothness (the realized slope is smaller than the worst-case slope across perturbations).
These findings set up the practical question we address next: given disclosure and engineering constraints, which minimal set of calibration moments and stress tests should a platform (or regulator) require so that simulator-based policy claims come with a transparent, auditable robustness margin rather than an unqualified point estimate?
The preceding discussion suggests a simple lesson for practice: simulator-based mechanism evaluation is credible only insofar as it is paired with (i) a transparent calibration certificate and (ii) a transparent stability assessment of the mechanism–behavior–environment pipeline. We therefore propose a minimal recipe that a platform can implement (and a regulator can audit) without requiring disclosure of raw user-level logs or proprietary bidder code. The output of the recipe is not merely a point estimate Δ̂, but a ranking claim accompanied by an auditable robustness margin.
Any evaluation begins by fixing the set of candidate mechanisms {M1, M2}, the evaluation horizon (episode definition), and the performance functional g(τ). In our experience, ambiguity at this stage is the dominant source of non-reproducible claims. We recommend that platforms report the full definition of g (including constraint penalties, budget-violation definitions, and any trimming/winsorization), and report an a priori bound |g(τ)| ≤ B implied by the episode design. When g is a weighted composite (revenue, welfare, violations), the weights should be pre-committed or at least varied in a sensitivity table, since changing them can flip rankings even under perfect calibration.
A regulator typically cannot (and need not) observe the true primitive distribution P. What is needed is a disclosed set of payoff-relevant moment functions ϕ(z) ∈ ℝK and their empirical targets. The key is that the moment set must be rich enough to detect the misspecifications that matter for ranking, while being aggregated enough to protect privacy and proprietary information.
A minimal but operational disclosure package includes:
(i) segment frequencies (time-of-day, coarse category, device) and conditional means of key
primitives (e.g., mean predicted value proxy, mean exposure factor)
within each cell;
(ii) quantile summaries of price-like objects (least-winning-cost proxies, second price, or
realized CPC/CPM) by cell, with emphasis on upper quantiles;
(iii) curve moments evaluated on a public grid: win-rate as a function of bid
multiplier, spend as a function of bid multiplier (or pacing
multiplier), and (if applicable) reserve-hit rates as a function of
reserve level;
(iv) budget and delivery summaries: distribution of budget exhaustion times, and average spend share
by period for each budget segment.
These are precisely the objects that allow a third party to compute a
certificate in the IPM induced by ℱ = {w⊤ϕ : ∥w∥1 ≤ 1},
without reconstructing auction-level logs.
Given disclosed targets, the platform fits a simulator P̂ and reports a certificate
$$
\delta_{\mathrm{cert}}
\;:=\;
\left\|\widehat{\mathbb{E}}_{P}[\phi(z)]-\mathbb{E}_{\hat
P}[\phi(z)]\right\|_\infty,
$$
where $\widehat{\mathbb{E}}_{P}$
denotes the empirical moment estimates produced from the real system
under the disclosure policy. The regulator should require (a) the
complete list of moments, (b) the sample sizes underlying each disclosed
moment (or confidence bands), and (c) the calibration procedure used to
obtain P̂.
Operationally, δcert defines an uncertainty set around
the simulator:
𝒰(δcert) := {Q: dℱ(Q, P̂) ≤ δcert}.
Even if one ultimately reports a single ranking, the uncertainty set is
the correct object for governance: it makes explicit what the simulator can and cannot
rule out, given the limited disclosure moments.
Because the mapping R(M, P) can amplify small distributional shifts, regulators should not accept ``calibrated generator’’ as sufficient. At minimum, the platform should expose a test harness specifying: the bidder-policy suite (learning, pacing, or equilibrium solver), the stationarity criterion, and the initialization distribution. This does not require revealing proprietary bidder code; rather, it requires the platform to commit to a behavioral model class and to report diagnostic plots (e.g., convergence of pacing multipliers, stability of spend shares). The core requirement is reproducibility of the mapping (M, P̂) ↦ π.
Calibration on a finite moment set cannot exclude adversarial errors. We therefore recommend a standard stress-test suite in which the platform perturbs P̂ along economically meaningful directions that have historically caused ranking reversals: mixture reweighting (time-of-day, category), tail inflation/deflation of value or price components, and joint shifts that preserve marginal moments but alter dependence (e.g., stronger correlation between high values and high competition).
For each perturbation Q_ℓ, the platform reports (a) the realized distance d_ℱ(P̂, Q_ℓ) under the disclosed moment class, and (b) the change in each mechanism’s performance. This supports an empirical stability envelope
$$
\widehat L_{\mathrm{emp}}
\;=\;
\max_{k\in\{1,2\}}
\max_{\ell}
\frac{\left|\hat J(M_k,Q_\ell)-\hat J(M_k,\hat
P)\right|}{d_{\mathcal{F}}(\hat P,Q_\ell)+\varepsilon},
$$
which is interpretable as a ``worst-case slope’’ within the tested
neighborhood. A regulator can treat the stress-test suite as a
falsification device: if rankings flip under perturbations with small
disclosed distance, then either ℱ is
missing key moments (certificate too weak) or the mechanism–response map
is locally unstable (large L),
and either way simulator claims require qualification.
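A minimal sketch of the empirical stability envelope, assuming the stress-test results have been collected as one performance number per mechanism per perturbation (names are illustrative):

```python
import numpy as np

def empirical_stability_envelope(J_hat, J_perturbed, distances, eps=1e-6):
    """Worst-case slope L_emp over stress-test perturbations.

    J_hat:        dict mechanism -> simulated performance under P_hat
    J_perturbed:  dict mechanism -> list of performances under each perturbation Q_l
    distances:    list of disclosed-moment distances d_F(P_hat, Q_l), same order
    """
    slopes = []
    for mech, base in J_hat.items():
        for J_q, d in zip(J_perturbed[mech], distances):
            slopes.append(abs(J_q - base) / (d + eps))
    return max(slopes)

# Hypothetical stress-test outputs for two mechanisms and two perturbations.
L_emp = empirical_stability_envelope(
    J_hat={"M1": 1.00, "M2": 0.95},
    J_perturbed={"M1": [0.98, 1.03], "M2": [0.90, 0.97]},
    distances=[0.02, 0.04],
)
```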
Platforms should report Monte Carlo uncertainty separately from
calibration/robustness uncertainty. Under the boundedness condition
|g(τ)| ≤ B,
a simple auditable half-width is
$$
\varepsilon_{\mathrm{MC}}(\alpha)
\;=\;
B\sqrt{\frac{\log(2/\alpha)}{2N}},
$$
applied to each Ĵ(M_k, P̂) (and combined appropriately for gaps). When a platform uses
variance-reduction or common random numbers, it should report both the
method and the effective variance reduction achieved, since this affects
the credibility of small claimed gaps.
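A minimal sketch of this Hoeffding-style half-width and one conservative way to combine it for the gap estimator (assuming independent episode draws per mechanism; a platform may prefer a tighter variance-based bound):

```python
import math

def mc_half_width(B: float, N: int, alpha: float = 0.05) -> float:
    """Hoeffding half-width for an average of N i.i.d. episode outcomes with |g| <= B."""
    return B * math.sqrt(math.log(2.0 / alpha) / (2.0 * N))

# One conservative combination for the gap J_hat(M1) - J_hat(M2): sum the two
# per-mechanism half-widths (splitting alpha across mechanisms if a joint level
# is required).
eps_gap = mc_half_width(B=1.0, N=50_000, alpha=0.025) \
        + mc_half_width(B=1.0, N=50_000, alpha=0.025)
```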
The key reporting object is an explicit margin that subtracts both
simulator bias and sampling error. One convenient conservative form
is
$$
\widehat{\mathrm{Margin}}
\;:=\;
\left|\hat\Delta(\hat P)\right|
\;-\;
2\,\widehat L_{\mathrm{emp}}\,\delta_{\mathrm{cert}}
\;-\;
\varepsilon_{\mathrm{MC}}^{\Delta}(\alpha),
$$
where ε_MC^Δ(α) is an auditable bound for the gap estimator. A platform can then make a tiered claim:
(i) if $\widehat{\mathrm{Margin}}>0$, report ``$M_1$ dominates $M_2$ with calibrated-and-stress-tested margin at level $\alpha$'';
(ii) if $\widehat{\mathrm{Margin}}\approx 0$, report ``inconclusive; ranking sensitive to plausible misspecification'';
(iii) if $\widehat{\mathrm{Margin}}<0$, report ``simulator does not support a ranking claim under disclosed calibration.''
This style of reporting is useful precisely because it is decision-oriented: it converts engineering artifacts (moment discrepancies, stress tests, sample sizes) into a single decision-relevant quantity.
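A minimal sketch of the margin computation and the tiered claim, assuming L̂_emp, δ_cert, and ε_MC^Δ(α) have been computed as above (the reporting tolerance below is an assumption, not part of the formula):

```python
def ranking_margin(delta_hat: float, L_emp: float, delta_cert: float,
                   eps_mc_gap: float) -> float:
    """Conservative margin: simulated gap minus amplified calibration error and MC error."""
    return abs(delta_hat) - 2.0 * L_emp * delta_cert - eps_mc_gap

def ranking_tier(margin: float, tol: float = 1e-3) -> str:
    """Tiered claim from the sign of the margin; tol is a reporting tolerance (assumed)."""
    if margin > tol:
        return "dominance supported with calibrated-and-stress-tested margin"
    if margin < -tol:
        return "simulator does not support a ranking claim under disclosed calibration"
    return "inconclusive: ranking sensitive to plausible misspecification"

# Hypothetical inputs.
margin = ranking_margin(delta_hat=0.05, L_emp=1.2, delta_cert=0.01, eps_mc_gap=0.01)
tier = ranking_tier(margin)
```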
We recommend that any simulator-based policy claim be accompanied by a short ``model card'' containing: (i) the full list of calibration moments and their empirical values; (ii) δ_cert and how it was computed; (iii) the bidder-response harness and convergence diagnostics; (iv) the stress-test perturbations and their realized distances; (v) L̂_emp (possibly by regime, e.g., tight vs. loose budgets); (vi) Ĵ(M_k, P̂), Δ̂(P̂), N, and Monte Carlo uncertainty; (vii) the final margin and the resulting ranking tier; and (viii) a statement of known blind spots (features not captured by ϕ, objectives not included in g). The purpose is not to force a single methodology, but to ensure that two independent auditors can reproduce the reported margin from the disclosed artifacts.
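As one possible serialization of such a model card (all field names and values are hypothetical), a platform could publish a structured object alongside the prose report:

```python
# Hypothetical model-card payload mirroring items (i)-(viii).
simulator_model_card = {
    "calibration_moments": {"mean_value_proxy": 1.8, "p99_price": 3.25},          # (i)
    "delta_cert": 0.012,                                                           # (ii)
    "bidder_response_harness": "pacing rule X, run to multiplier convergence",     # (iii)
    "stress_tests": [{"name": "tail_inflation", "distance": 0.03,
                      "delta_J_M1": -0.02, "delta_J_M2": -0.01}],                  # (iv)
    "L_emp": 1.4,                                                                  # (v)
    "performance": {"J_M1": 1.00, "J_M2": 0.95, "N": 50_000,
                    "mc_half_width": 0.004},                                       # (vi)
    "margin_and_tier": {"margin": 0.018, "tier": "dominance supported"},           # (vii)
    "known_blind_spots": ["no fairness moments in phi",
                          "long-run entry effects not in g"],                      # (viii)
}
```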
The recipe is intentionally modular: calibration moments can be
disclosed in aggregated form; stress tests can be run entirely within
the simulator; and bidder response can be standardized via a harness
rather than by revealing bidder internals. Most importantly, the recipe
shifts the governance burden from ``trust our simulator'' to ``verify our certificate and our stability diagnostics,'' which is a more realistic standard for complex platforms.
These prescriptions also clarify what we should study next. Even with careful stress testing, two difficulties remain: bidder response may fail to converge or may be set-valued, and many policy objectives (fairness, privacy, market power) are not well summarized by the standard moment classes used for revenue/welfare calibration. Addressing these limitations requires extending the certification logic beyond the single-valued, well-behaved response setting.
Our guarantees were deliberately phrased in terms of a response map R(M, P) that is single-valued and ``stable enough'' to make J(M, P) well defined and Lipschitz in the primitive distribution. This is the right abstraction for many engineering pipelines (fixed pacing rules, a specified learning algorithm run to convergence), but it is also the point at which real multi-agent systems most often violate the assumptions. In large ad markets, bidder algorithms co-adapt, platform policies evolve, and the resulting dynamics need not converge to a unique stationary point. In such environments, a simulator can be calibrated on disclosed moments and still mislead, because the dominant error is not distributional misspecification but the behavior of the response map itself.
A common failure mode is cycling or slow drift in multi-agent learning: for a fixed (M, P̂), the iterates of a learning rule do not settle to a stationary policy profile, or they converge on a time scale comparable to the episode horizon. When that occurs, J(M, P̂) depends on initialization, update schedules, and the stopping rule of the harness. From a governance perspective, this is not a technicality: if two mechanisms induce different degrees of non-stationarity, then the ranking can flip depending on whether we measure ``early training’’ outcomes or long-run outcomes.
One extension is to make the episode explicitly dynamic and interpret J as a time average along the learning trajectory. Let π^(s) denote the policy profile at learning iteration s, and let τ^(s) denote the trajectory induced at that iteration. We can evaluate
$$
J^{\mathrm{avg}}(M,P)
\;:=\;
\limsup_{S\to\infty}\frac{1}{S}\sum_{s=1}^S
\mathbb{E}\!\left[g(\tau^{(s)})\right],
$$
or, more realistically, its finite-S analogue reported together with
S. This shifts the burden from
equilibrium existence to an ergodic-type requirement: the learning
process should admit a stationary distribution (possibly over policies)
and mix sufficiently fast. A regulator can then require the platform to
report mixing diagnostics (autocorrelations of spend shares, stability
of pacing multipliers) and to run the harness long enough that the
residual non-stationarity is small relative to the ranking margin.
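A minimal sketch of the finite-S analogue and one such mixing diagnostic (lag autocorrelation of a reported series such as spend shares or pacing multipliers); the helper names and burn-in convention are assumptions:

```python
import numpy as np

def time_averaged_performance(g_per_iteration, burn_in: int = 0) -> float:
    """Finite-S analogue of J_avg: mean of per-iteration estimates of E[g(tau^(s))]."""
    g = np.asarray(g_per_iteration, dtype=float)[burn_in:]
    return float(g.mean())

def lag_autocorrelation(series, lag: int = 1) -> float:
    """Mixing diagnostic: lag-k autocorrelation of a reported series (e.g., spend shares)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = float(np.dot(x, x))
    return float(np.dot(x[:-lag], x[lag:]) / denom) if denom > 0 else 0.0
```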
Formally, the continuity and concentration terms remain, but one must
add a third error term capturing residual non-stationarity:
$$
\left| \hat\Delta - \Delta^* \right|
\;\le\;
\underbrace{\left|\hat\Delta-\Delta^{\mathrm{sim}}\right|}_{\text{Monte
Carlo}}
\;+\;
\underbrace{\left|\Delta^{\mathrm{sim}}-\Delta^*\right|}_{\text{calibration}}
\;+\;
\underbrace{\mathrm{Err}_{\mathrm{dyn}}}_{\text{non-stationarity}},
$$
where Err_dyn can be bounded in favorable cases by a mixing-time argument (and otherwise must be reported empirically).
Even if an equilibrium exists, it may not be unique. In that case the appropriate response object is a correspondence ℛ(M, P) ⊆ Π, not a function. The performance J(M, P; π) depends on which π ∈ ℛ(M, P) is selected by the market, by the learning dynamics, or by the platform’s own ``default’’ bidder model. This creates an additional wedge between what a simulator shows and what a real market might realize: two simulators calibrated to the same P̂ can disagree simply because they implement different equilibrium selections.
A conservative way to restore auditable claims is to move from point evaluation to envelope evaluation over the equilibrium set. Define
$$
J^{\sup}(M,P) := \sup_{\pi \in \mathcal{R}(M,P)} J(M,P;\pi),
\qquad
J^{\inf}(M,P) := \inf_{\pi \in \mathcal{R}(M,P)} J(M,P;\pi).
$$
A robust ranking claim of ``$M_1$ dominates $M_2$'' then requires $J^{\inf}(M_1, P) > J^{\sup}(M_2, P)$,
which is stronger than comparing point predictions but matches the
governance problem: we want dominance that survives plausible
equilibrium selection. Under additional regularity—for instance, if the
graph of ℛ(M, ⋅) is upper
hemicontinuous and J(M, P; π)
is continuous in P uniformly
over π—the envelope
functionals can inherit a Lipschitz-type bound. In practice, rather than
proving these properties, the platform can approximate envelope
uncertainty by running multiple initializations, update schedules, or
equilibrium solvers and reporting the induced range of outcomes as an empirical selection error bar. The margin used for ranking should then subtract this bar in the same way we subtract 2Lδ and Monte Carlo error.
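A minimal sketch of the envelope computation and the robust dominance check, assuming the harness has already been run under several initializations, schedules, or solvers and one performance estimate has been collected per selection (names are illustrative):

```python
import numpy as np

def envelope_and_selection_bar(J_by_selection):
    """Envelope over selections plus an empirical selection error bar (max - min)."""
    J = np.asarray(J_by_selection, dtype=float)
    return {"J_inf": float(J.min()),
            "J_sup": float(J.max()),
            "selection_bar": float(J.max() - J.min())}

def robust_dominance(J_M1_selections, J_M2_selections) -> bool:
    """M1 robustly dominates M2 if its worst selection beats M2's best selection."""
    e1 = envelope_and_selection_bar(J_M1_selections)
    e2 = envelope_and_selection_bar(J_M2_selections)
    return e1["J_inf"] > e2["J_sup"]

# Hypothetical outcomes across three seeds/solvers per mechanism.
dominates = robust_dominance([0.99, 1.01, 1.00], [0.93, 0.95, 0.96])
```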
A second limitation is that many policy objectives are not well summarized by the calibration moments used for revenue and welfare. Fairness metrics depend on protected attributes and often involve ratios (e.g., exposure parity or conditional conversion rates), which can be unstable when denominators are small. Privacy objectives (e.g., limiting the predictability of user attributes from outcomes, or meeting differential-privacy constraints in reporting) are even farther from standard auction moments. Market power and entry are dynamic and strategic at a horizon beyond the repeated-auction episode.
Our framework can accommodate such objectives in principle by redefining g(τ) and enriching ϕ(z), but the analyst must acknowledge two practical consequences. First, the boundedness and Lipschitz conditions become harder to justify: ratio metrics can have unbounded local sensitivity, and privacy metrics can depend on fine-grained dependence structures not captured by coarse moment classes. Second, disclosure constraints become tighter: group-conditional moments and privacy-relevant dependence moments may themselves be sensitive. A realistic governance approach therefore requires regularized metric definitions (e.g., clipped ratios, minimum-support constraints, or convex surrogates) and privacy-preserving disclosure (e.g., noisy aggregated moments with known error bars that propagate into δ_cert).
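To illustrate the kind of regularized metric we have in mind, the sketch below implements a clipped exposure ratio with a minimum-support rule; the function name, thresholds, and clip level are illustrative assumptions, not a recommended standard:

```python
import numpy as np

def clipped_exposure_parity(exposure_a: float, exposure_b: float,
                            min_support: int = 100, clip: float = 5.0):
    """Regularized exposure-ratio metric.

    Small-denominator instability is handled by a minimum-support rule (return
    None, to be reported as 'insufficient support') and by clipping the ratio
    into [1/clip, clip] so that boundedness assumptions remain defensible.
    """
    if exposure_a < min_support or exposure_b < min_support:
        return None
    ratio = exposure_a / exposure_b
    return float(np.clip(ratio, 1.0 / clip, clip))
```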
Moment matching is only as good as the moments. A natural extension is to choose ϕ not as a generic ``summary of the market,'' but as a targeted set designed to control the specific ranking functional. One way to formalize this is to view $J(M,P)$ as a functional of $P$ and ask for moments that approximate its local sensitivity. In semiparametric language, if $J$ admits an influence-function representation, then matching moments aligned with that influence function yields the tightest control of $J$ per unit of disclosed information. Concretely, this suggests adding moments such as: (i) high-quantile and tail-shape summaries of value and price proxies (since ranking reversals often come from tails); (ii) dependence moments that couple value proxies with competition proxies (to detect correlation shifts that preserve marginals); and (iii) mechanism-specific ``response moments'' such as gradients of spend and win-rate with respect to bid multipliers, evaluated in the regimes where pacing binds.
An operational implementation is adversarial: given a candidate
simulator P̂, search over a
rich class of witness functions ℱ_rich (e.g., a neural net
discriminator constrained to be Lipschitz, or a large dictionary of
economically motivated basis functions) to find f that maximizes
$$
\left|\widehat{\mathbb{E}}_{P}[f(z)]-\mathbb{E}_{\hat P}[f(z)]\right|.
$$
The resulting f can be added
to the disclosed moment set for the next audit cycle. This ``moment
boosting’’ loop has a clear interpretation: the certificate improves
exactly along directions where the simulator is currently least
credible, and it does so without requiring raw-log disclosure.
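As a simplified stand-in for a Lipschitz-constrained discriminator, the sketch below runs the moment-boosting search over a fixed, finite dictionary of economically motivated witnesses; the dictionary entries and thresholds are hypothetical:

```python
import numpy as np

def worst_witness(real_samples, sim_samples, witness_dictionary):
    """Adversarial moment boosting over a finite dictionary of candidate witnesses.

    real_samples, sim_samples: arrays of primitive draws z (scalar here for simplicity).
    witness_dictionary: dict name -> callable f(z); the same fixed f is applied to both.
    Returns the witness with the largest moment discrepancy, to be added to the
    disclosed moment set in the next audit cycle.
    """
    gaps = {name: abs(float(np.mean(f(real_samples)) - np.mean(f(sim_samples))))
            for name, f in witness_dictionary.items()}
    best = max(gaps, key=gaps.get)
    return best, gaps[best]

# Hypothetical, economically motivated witnesses (fixed functions, not fit to the data).
witnesses = {
    "tail_indicator": lambda z: (z > 3.0).astype(float),  # tail-shape check
    "second_moment": lambda z: z ** 2,                    # dispersion check
}
```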
A final limitation is subtle but important: the primitive distribution P is often itself mechanism-dependent. Auction logs observed under a baseline mechanism M0 reflect selection effects (which opportunities appear, which users are shown ads, which conversions are attributable), so the moments used for calibration may not identify the counterfactual P relevant under M1 or M2. Our recipe implicitly treated P as an environment distribution invariant to the mechanism, which is a good approximation in some settings (e.g., short-run counterfactuals holding demand fixed) but not in others (e.g., reserve changes that alter participation, or disclosure changes that alter bidder learning).
Two remedies are available. The first is experimental: require small randomized perturbations (to reserves, ranking weights, or disclosure) specifically to identify the moments most predictive of counterfactual shifts. The second is robust: expand the uncertainty set to include plausible policy-induced reweightings and participation responses, and report margins that remain valid over that larger set. Both approaches reduce the risk that we ``certify the wrong world.’’
The overarching message is that simulator governance must treat behavior and objectives as first-class citizens alongside calibration. When response is set-valued or non-convergent, the right output is not a single ranking but an envelope ranking reflecting equilibrium selection and dynamics. When objectives expand to fairness or privacy, the right certificate is not generic but tailored to the decision at hand, with regularized metrics and privacy-preserving disclosure. These extensions do not eliminate uncertainty; they make it legible, and therefore governable.