Stochastic, adaptive contracting has become an attractive design choice in modern principal–agent settings precisely because it addresses the frictions that make deterministic contracts brittle. When the principal faces heterogeneous agents, shifting environments, and strategic responses, a single fixed schedule of transfers and performance thresholds can be simultaneously too rigid (failing under distribution shift) and too predictable (inviting gaming). By contrast, a stochastic contract—a principal committing to a distribution over feasible contracts and sampling a realized offer each round—can smooth incentives, reduce exploitability, and provide the exploration needed for learning. In practice, the logic is familiar: platforms vary bonus schemes across time and location to manage supply; firms rotate performance metrics to prevent narrow "teaching to the test"; procurement agencies randomize audits and contract terms to deter manipulation; and algorithmic systems tune incentives online to stabilize throughput under uncertain demand.
The same features that make adaptive randomization operationally appealing, however, make it difficult to govern. Two concerns recur in deployments: accountability and statistical validity. First, if the principal may quietly alter contract assignment as information accumulates, then ex post evaluations of fairness or legality can be undermined by plausible deniability: discriminatory treatment can be rationalized as "algorithmic adjustment," and harmful realized outcomes can be dismissed as "bad luck." Second, the data stream produced by an adaptive policy is inherently non-i.i.d.: the distribution of offers, actions, and outcomes shifts endogenously with history. Standard audit tools that treat the log as a static dataset—compute an average, run a regression, report a confidence interval under independence—can be invalidated by the very adaptivity that the system uses to remain profitable or stable.
These tensions have sharpened in 2026-era deployments, where algorithmic contracting is no longer a back-office optimization but a regulated interface between institutions and individuals. In labor platforms, creators’ marketplaces, and "AI-managed" internal work allocation, contracts are often issued continuously, contingent on observable context, and adjusted online. At the same time, governance frameworks increasingly demand that compliance be demonstrable, not merely asserted. Regulators and counterparties ask questions that are operationally concrete: Were payments ever negative (violating limited liability)? Were the announced rules for random assignment followed, or were they overridden in precisely the cases that mattered? Did the realized wealth trajectories induced by the system satisfy a stipulated fairness requirement, not just on average in a training period, but throughout deployment as the system adapted?
We use the term auditable fairness to describe a standard of evidence that is compatible with these realities. Auditable fairness is not merely the existence of a fairness metric, nor the claim that a policy is "fair by design." Rather, it is the ability of an external auditor—with access only to a tamper-evident operational log and to publicly checkable commitments about the principal’s randomization—to (i) compute a well-defined target fairness quantity, and (ii) issue a certificate of compliance (or a proof of violation) whose error probability is explicitly controlled, despite the fact that the policy and the population may evolve endogenously over time. In this sense, the object of interest is as much epistemic as normative: we are not only asking what fairness should mean, but also what forms of fairness can be verified under realistic information constraints.
A central motivation for randomized contracting is stability. When agents anticipate deterministic cutoffs, they may concentrate effort narrowly on measured dimensions, engage in timing games, or coordinate to exploit predictable rules. Randomization weakens such knife-edge incentives and can reduce the returns to manipulation. Moreover, learning-based principals naturally encounter an exploration–exploitation tradeoff: to improve contract design, they must sometimes try alternatives, which necessarily induces variation in treatment. Randomization supplies a disciplined way to introduce such variation without ad hoc discretion. Yet, absent verifiability, randomness can become a cloak rather than a commitment. If the principal can retrospectively claim that a favorable contract draw "just happened" for one group and not another, then the auditability of fairness collapses into a debate over intent. For governance, what matters is whether the process is verifiable: a third party should be able to check that the realized contract was generated from the declared distribution, and that the log of outcomes and payments has not been selectively edited.
This perspective suggests a practical separation between two layers of compliance. Some constraints are pointwise and thus amenable to deterministic certification from the log. Limited liability is the canonical example: if every logged payment is nonnegative, then the constraint is satisfied; if any payment is negative, the offending entry is a direct witness. Other objectives—fairness of cumulative wealth, welfare, or a Rawlsian floor—are statistical in the sense that they involve expectations over stochastic outcomes and strategic behavior. Even if all realized entries are accurate, fairness depends on latent counterfactuals (what would have happened under alternative histories) and on conditional expectations (what was predictable at each point in time). In adaptive environments, it is therefore natural to demand not certainty but high-confidence guarantees: an auditor should be able to say, "with probability at least 1 − δ, the deployment-average fairness exceeded the required threshold," and the statement should remain valid no matter when the audit is run.
This requirement pushes us toward sequential, anytime-valid inference. In the deployments we have in mind, audits are not a one-off event; they are periodic, sometimes triggered by complaints, and sometimes executed automatically. Any method that requires fixing the audit time in advance is easy to game (or simply operationally infeasible), and any method that assumes stationarity is fragile to policy drift. What we want instead are guarantees that hold uniformly over time: at each round, the auditor can update an interval for the target fairness quantity, and the interval remains statistically valid even if the principal adapted based on past data. This is precisely the setting in which martingale methods and confidence sequences are appropriate, because they treat adaptivity as a feature of the filtration rather than as a violation of assumptions.
To make this feasible, we deliberately focus on fairness notions that are functions of cumulative wealth trajectories—quantities that can be constructed from logged outcomes and payments. This is not because we believe wealth is the only morally relevant dimension, but because wealth is often the most directly contractible and verifiable proxy for benefit and burden in economic interactions. A fairness functional F(Wt) can encode inequality aversion (e.g., via 1 − Gini), egalitarian objectives (e.g., negative dispersion), or worst-off protection (e.g., a Rawlsian minimum). The key modeling move is to treat fairness as a stable functional: if wealth changes by a bounded amount in a single round, fairness should not jump arbitrarily. This stability, formalized through Lipschitz-type conditions on F over the relevant bounded domain, is what translates operational boundedness (payments and outcomes cannot explode) into auditability (fairness estimates concentrate over time).
We emphasize two limitations up front. First, any fairness requirement is inherently normative and context-dependent. Auditing can certify compliance with a chosen metric and threshold, but it cannot settle disagreements about what the metric should be, nor can it capture dimensions of harm that are absent from the log (dignity, procedural justice, unsafe work, or coercive outside options). Second, selection and strategic participation matter: if agents can opt out, then observed wealth trajectories reflect both treatment and endogenous composition. Our framework can still produce valid statements about the induced population of participants and, under additional overlap-type conditions, about certain counterfactual reference policies; but it does not magically identify fairness for unobserved counterfactual populations without further assumptions.
With these caveats, our goal is to illuminate a tractable governance path for adaptive contracting systems. The path is conceptually simple: require that (i) operational logs be append-only and tamper-evident, (ii) the principal’s randomization be publicly verifiable (so declared propensities are meaningful commitments), and (iii) fairness metrics be chosen from a class stable enough to admit anytime-valid inference under bounded increments. The resulting compliance regime respects why stochastic adaptivity is economically valuable—it preserves flexibility, exploration, and robustness—while making the resulting distributional consequences contestable. In short, the model is meant to clarify the tradeoff: we can have adaptive, randomized contracts and meaningful oversight, but only if we build systems so that fairness claims are not aspirational statements, but audit-ready objects with explicit error control.
Our framework sits at the intersection of four literatures that are often studied in isolation: classic contract theory (with its emphasis on Limited Liability, Individual Rationality, and Incentive Compatibility), learning-enabled principal–agent design (where contracts are updated online), statistical fairness auditing under adaptivity (where the data-generating process responds to history), and the systems/cryptography tools that make operational records contestable (verifiable randomness and tamper-evident logs). We briefly position our contribution relative to each, emphasizing both what we borrow and what we intentionally do not attempt to solve.
In canonical principal–agent models, Limited Liability (LL), Individual Rationality (IR), and Incentive Compatibility (IC) are treated as equilibrium constraints that shape the set of feasible contracts (e.g., moral hazard with hidden action, adverse selection with hidden type, or both). LL is particularly prominent in applications where transfers cannot be negative due to bankruptcy constraints, legal restrictions, or platform policy; it is also a workhorse assumption that changes optimal sharing rules and can induce distortions such as bunching at zero payments. In standard theory, these constraints are imposed by the modeler and then enforced by design.
We instead treat LL as an auditable property of realized operation: if payments are logged and the log is tamper-evident, then LL becomes a pointwise statement about the observed record. This moves LL from the realm of equilibrium reasoning into the realm of compliance verification. By contrast, IR and IC are less directly auditable in our setting because they depend on private costs, outside options, and beliefs. Even when we observe opt-out actions, we typically cannot infer whether a participating agent was strictly better off than her outside option, nor whether an action was chosen because it was optimal under the contract or because of unobserved shocks. Accordingly, our baseline audit targets focus on distributional properties of outcomes induced by the interaction (cumulative wealth trajectories), rather than attempting to certify IR/IC in a strong structural sense. We view this as a deliberate governance tradeoff: regulators often can and do enforce hard constraints on transfers (e.g., nonnegative pay, wage floors, non-withholding rules), while treating deeper incentive properties as matters for design review, stress testing, or ex ante approval rather than ex post proof.
More broadly, dynamic contracting and relational contract theories highlight that incentives and constraints unfold over time and depend on histories and continuation values. Our model accommodates this operationally—the principal may choose qt as a function of past logs, and agents may respond strategically—but our audit objects are intentionally non-structural: the statistical guarantees target conditional expectations given the filtration, without assuming stationarity or a particular equilibrium selection.
A second relevant stream studies principals who learn contracts online, sometimes via reinforcement learning, contextual bandits, or adaptive experimentation. Here, randomized contracts arise for familiar reasons: exploration to improve performance; robustness to non-stationary demand and heterogeneous agent pools; and reduced manipulability when agents attempt to game deterministic thresholds. Work in this area often emphasizes regret, sample efficiency, and strategic behavior (including information design and mechanism design under learning). A common theme is that the principal wants to retain flexibility to adapt qt based on outcomes, while agents respond to the induced incentives and may anticipate future changes.
Our perspective is complementary: we ask what kinds of guarantees about fairness and legality can be extracted from the same adaptive process. The main conceptual connection is that learning algorithms naturally produce non-i.i.d. logs, which invalidates audit procedures that assume static treatment assignment or fixed sampling plans. Put differently, even if the principal is using learning methods responsibly, the resulting data stream is still adversarial to naive inference. We therefore take adaptivity as a primitive feature rather than a pathology, and we build the audit layer using tools that remain valid under policy drift and strategic response.
We also differ from much of the learning-in-mechanisms literature in our choice of target. Many learning formulations optimize welfare, revenue, or regret relative to a benchmark policy; fairness enters as a constraint or a secondary objective. In our audit framing, fairness is a compliance target whose satisfaction must be certified with explicit error control from the operational log. This shift in objective aligns with regulated deployments, where the question is not only "is the policy optimal?" but "can the operator demonstrate that it stayed within required bounds during deployment?"
A large literature on algorithmic fairness proposes metrics (statistical parity, equalized odds, calibration, individual fairness, welfare-based criteria) and auditing methods to estimate them from data. Much of this work, however, presumes either i.i.d. samples from a fixed distribution or a batch dataset whose sampling mechanism can be treated as exogenous. In adaptive contracting, neither assumption is safe: the policy changes over time, the composition of participating agents may change endogenously through opt-out, and the outcome distribution may shift in response to incentives.
Our approach leverages a line of work in sequential analysis and martingale methods that explicitly treats adaptivity through filtrations. Confidence sequences, e-values, and other anytime-valid constructions are designed to remain correct under optional stopping and continuously monitored testing, which is precisely the operational reality of periodic audits. Conceptually, this is a natural governance fit: regulators rarely commit to a single audit time, and firms rarely can commit to keeping policies fixed until a predetermined evaluation date. The contribution we emphasize is that, when fairness is formulated as a stable functional of bounded wealth trajectories, one can combine bounded-increment assumptions with martingale concentration to obtain certificates for deployment-average fairness targets.
We stress a limitation here. Many fairness notions of interest depend on unobserved counterfactuals (e.g., "would the same individual have received a different contract under a different policy?") or on protected attributes that may not be logged for legal reasons. Our baseline audit targets are therefore intentionally modest: they certify properties of the induced wealth distribution among observed participants, as recorded. Counterfactual auditing is possible only under additional conditions (e.g., overlap/randomization and well-defined reference policies), and even then the target is typically a policy-level counterfactual rather than an individual-level one. This echoes a broader lesson in causal inference: identifiability requires design, not just clever estimation.
Finally, our emphasis on verifiable randomization and append-only logging draws on systems and cryptography ideas that are increasingly central to "algorithmic accountability" in practice. Transparency logs, secure audit trails, and cryptographic commitments are widely used in domains ranging from certificate authorities and supply chains to financial compliance. Verifiable random functions (VRFs) and related primitives provide publicly checkable proofs that a realized random draw was generated from a committed seed, preventing ex post manipulation while keeping the draw unpredictable ex ante.
In randomized contracting, this matters because fairness disputes often hinge on whether "randomness" was genuine or selectively invoked. A principal who can override random draws in edge cases effectively reintroduces discretion while retaining plausible deniability. By requiring that the principal commit to qt(⋅ ∣ s) and produce a verifiable proof for each realized bt, we turn randomization into an auditable commitment. This is not merely a technical embellishment: it changes what can be contested in a regulatory setting. Combined with tamper-evident logs, it allows an auditor to treat propensities as meaningful objects and to apply off-policy or sequential methods without relying on the operator’s goodwill.
At the same time, cryptographic integrity does not solve all governance problems. It does not guarantee that the logged outcome proxy yi, t is an adequate measure of contribution, nor that the state st is recorded without strategic feature engineering, nor that the fairness metric chosen captures all morally relevant harms. Our aim is narrower: to show that, conditional on a well-specified logging and randomization protocol, certain fairness statements become certifiable with explicit error probabilities even under adaptivity.
Taken together, these strands motivate the clean baseline model we introduce next. We specify the operational primitives (contracts, randomization, logging), separate deterministic from statistical compliance objects, and define audit targets that are meaningful under endogenous, non-stationary interaction while remaining verifiable from the log.
We now formalize a clean "one-step" (contextual) contracting model that isolates the objects we will later audit. The intent is not to fully characterize optimal contracts or equilibrium behavior—indeed, we allow the principal to adapt and agents to respond strategically—but rather to pin down (i) what is recorded in the operational log, (ii) what the auditor can verify deterministically versus only statistically, and (iii) which target parameters are meaningful under non-stationary, history-dependent interaction.
Time is indexed by rounds (episodes) t ∈ {1, …, T}. At the start of round t, an observable state or context st is realized (e.g., demand conditions, job attributes, or platform-side features). We treat st as publicly observable to the auditor ex post because it is logged; it may also be observed by agents, depending on the application, but nothing in the audit definitions will rely on agents observing st.
A contract is an element b ∈ ℬ,
where ℬ is a bounded, pre-specified
class of permissible contract terms. We keep ℬ abstract because the audit layer should be
agnostic to the operator’s contract design details. For intuition, one
auditable special case is a linear share plus floor,
pi, t = mt + αt yi, t, (αt, mt) ∈ [0, 1] × [0, mmax],
where yi, t
is a verifiable outcome proxy attributable to agent i in round t and pi, t
is the logged transfer. More generally, bt can include
schedules, thresholds, or discrete menus, as long as realized payments
are unambiguously determined from the logged variables.
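As a concrete illustration of this special case, the following sketch (ours, not part of the formal protocol; the class encoding and names are hypothetical) shows how a linear share-plus-floor contract determines payments mechanically from logged outcomes:

```python
from dataclasses import dataclass

# Hypothetical encoding of the linear share-plus-floor special case:
# p_{i,t} = m_t + alpha_t * y_{i,t}, with (alpha_t, m_t) in [0,1] x [0, m_max].

@dataclass(frozen=True)
class LinearContract:
    m: float      # floor payment, assumed in [0, m_max]
    alpha: float  # output share, assumed in [0, 1]

    def payment(self, y: float) -> float:
        # Realized transfer p_{i,t}, unambiguously determined from the
        # logged outcome proxy y_{i,t}; this is what makes auditing mechanical.
        return self.m + self.alpha * y

b = LinearContract(m=1.0, alpha=0.2)
assert b.payment(y=5.0) == 2.0  # p = 1.0 + 0.2 * 5.0
```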
The principal is allowed to choose contracts adaptively. Formally,
before selecting a realized contract in round t, the principal declares a
conditional distribution (propensity) over contracts,
qt( ⋅ ∣st) ∈ Δ(ℬ),
which may depend on the full past history through the filtration ℱt − 1 generated by the
log up to t − 1. The realized
contract is then drawn as
bt ∼ qt(⋅ ∣ st).
Crucially, in our operational protocol this draw is not a black box: the
principal must produce a publicly checkable proof (via a verifiable
random function, VRF) that the draw bt was generated
from the declared qt(⋅ ∣ st)
using an unpredictable but verifiable source of randomness.
Conceptually, the declaration qt is a commitment that
turns randomization into an auditable choice rather than managerial
discretion. This commitment is what later permits the auditor to treat
propensities as meaningful inputs to statistical procedures (and, in
optional extensions, to off-policy estimators).
There are n agents indexed
by i ∈ {1, …, n}.
After observing the posted contract bt (and possibly
the context st), each agent
chooses an action
ai, t ∈ 𝒜i ∪ {reject},
where the reject/opt-out action captures non-participation. The
environment then generates realized outcomes yi, t,
which we interpret as a measurable contribution (or performance proxy)
attributable to agent i in
round t. We do not assume a
stationary outcome model: the distribution of yi, t
may depend on (st, bt, ai, t),
on past history, and on unobserved shocks. The only structural
requirement is that yi, t
is verifiable (or at least logged in a way that is contestable), so that
payments can be checked against the contract.
Given (bt, st, ai, t, yi, t), the principal makes a realized payment pi, t to each agent. In applications, pi, t might be computed mechanically from bt and yi, t (which makes auditing easiest), or it might include discretionary components; our deterministic compliance checks will be framed in terms of the logged payments regardless.
To connect contracting to distributional compliance, we track wealth
increments for the principal and each agent. Let Δwi, t
denote agent i’s realized
wealth increment at time t,
and Δwp, t
the principal’s. We allow general utility accounting for non-monetary
costs and outside options,
$$
\Delta w_{i,t} :=
\begin{cases}
u_i(p_{i,t},a_{i,t},s_t,y_{i,t}) & \text{if }
a_{i,t}\neq\text{reject},\\
u_i^{\mathrm{out}}(s_t) & \text{if } a_{i,t}=\text{reject},
\end{cases}
\qquad
\Delta w_{p,t} := \sum_{i=1}^n (y_{i,t}-p_{i,t}).
$$
In the baseline audit, the auditor need not observe ui or uiout
directly; what matters is that the log contains a wealth proxy derived from
observable components. In the simplest instantiation, we take the proxy
wealth increment to be monetary (e.g., Δwi, t = pi, t
for participants and 0 for reject), and
treat non-monetary costs as part of the limitation of what can be
certified ex post. We then define cumulative wealth
$$
w_{j,t}:=\sum_{k=1}^t \Delta w_{j,k},
\qquad
W_t := (w_{p,t},w_{1,t},\dots,w_{n,t})\in\mathbb R^{n+1}.
$$
This cumulative wealth vector is the state on which our fairness and
welfare targets will be evaluated. Importantly, Wt is endogenous to the
interaction: it reflects the principal’s adaptive choices and agents’
strategic responses.
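As a concrete illustration (our sketch, using the simplest monetary proxy described above), the bookkeeping reduces to assembling per-round increment vectors and taking running sums:

```python
import numpy as np

def round_increment(p: np.ndarray, y: np.ndarray, rejected: np.ndarray) -> np.ndarray:
    """One round's increment vector (Δw_{p,t}, Δw_{1,t}, ..., Δw_{n,t}) under
    the monetary proxy: Δw_{i,t} = p_{i,t} for participants, 0 on reject, and
    Δw_{p,t} = Σ_i (y_{i,t} - p_{i,t}) over participating agents."""
    agent_dw = np.where(rejected, 0.0, p)
    principal_dw = float(np.sum(np.where(rejected, 0.0, y - p)))
    return np.concatenate(([principal_dw], agent_dw))

def cumulative_wealth(increments: np.ndarray) -> np.ndarray:
    """Stack of per-round increments, shape (T, n+1) -> cumulative wealth W_t."""
    return np.cumsum(increments, axis=0)
```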
The auditor observes an append-only, tamper-evident log of each
round’s transcript
(st, bt, a1 : n, t, y1 : n, t, p1 : n, t),
together with any cryptographic commitments and VRF proofs needed to
verify that bt was sampled
from the declared propensity qt(⋅ ∣ st).
We write ℱt for the
filtration generated by this log up to time t. All statistical validity claims
will be stated conditionally on ℱt − 1, which is the
appropriate way to formalize adaptivity: the principal may choose qt as any
measurable function of ℱt − 1, and agents may
choose actions as functions of current information and the anticipated
continuation, without invalidating martingale-based inference.
To enable anytime-valid concentration under adaptivity, we assume
bounded realized increments. Concretely, for each party j ∈ {p, 1, …, n}
and each round t, we
assume
$$
\Delta w_{j,t}\in[\underline B,\overline B],
$$
where the bounds may be set by design (caps/floors in contracts, limited
exposure, or platform constraints) or by the choice of wealth proxy.
This boundedness is not innocuous, but it aligns with practice:
regulated contracts typically have maximum payouts, and auditing
protocols typically rely on bounded-score proxies. We emphasize that we
do not assume i.i.d. data, stationarity, or a parametric outcome model.
We separate compliance objects into those that are deterministically verifiable from the log and those that require statistical inference.
First, Limited Liability (LL) is an example of a deterministic constraint. When LL takes the form pi, t ≥ 0 for all i, t, it is directly checkable from the recorded transfers. The audit target is therefore not an expectation but a logical statement over the realized transcript: any negative payment constitutes a certifiable violation tied to a specific log entry.
Second, we define welfare-type targets as functionals of the
cumulative wealth vector. For example, total welfare at time t can be written as
Welfare(Wt) := ∑j ∈ {p, 1, …, n}wj, t,
and Rawlsian welfare as Rawls(Wt) := minjwj, t.
These metrics are attractive for auditing because they are computable
from the log once the wealth proxy is specified.
Third, our main target is a fairness functional F : ℝn + 1 → ℝ applied to cumulative wealth, such as 1 − Gini(Wt), a Jain index, the negative variance of wealth, or a minimum-share criterion. The key regularity condition we impose for the baseline theory is Lipschitz stability: F is L-Lipschitz on the feasible wealth domain induced by the bounded increments. This is a deliberate modeling choice reflecting a governance intuition: fairness notions that are excessively sensitive to single-round perturbations are difficult to certify from finite logs.
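To fix ideas, here is a sketch (ours; the formulas are the standard definitions of the indices named above) of candidate welfare and fairness functionals on a wealth vector Wt; the mean-restricted 1 − Gini variant is treated separately later:

```python
import numpy as np

def welfare(W: np.ndarray) -> float:
    """Total welfare: sum of all parties' cumulative wealth."""
    return float(np.sum(W))

def rawls(W: np.ndarray) -> float:
    """Rawlsian welfare: wealth of the worst-off party."""
    return float(np.min(W))

def jain_index(W: np.ndarray) -> float:
    """Jain's fairness index, in (0, 1] for nonnegative wealth vectors."""
    return float(np.sum(W) ** 2 / (len(W) * np.sum(W ** 2)))

def neg_dispersion(W: np.ndarray) -> float:
    """An egalitarian objective: negative variance of wealth."""
    return float(-np.var(W))
```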
Because the process is adaptive, we do not target the empirical
average $\frac{1}{T}\sum_{t=1}^T
F(W_t)$ as if it were an i.i.d. sample mean. Instead, we target
the predictable deployment-average of fairness,
$$
\mu_t \;:=\; \frac{1}{t}\sum_{k=1}^t \mathbb E\!\left[F(W_k)\mid
\mathcal F_{k-1}\right],
$$
and evaluate compliance against a policy threshold τ via μT ≥ τ.
This parameter has two governance-relevant features. First, it is
well-defined under arbitrary policy drift: 𝔼[F(Wk) ∣ ℱk − 1]
is the operator’s predictable fairness at time k, given what was known just before
acting. Second, it aligns with operational auditing, where the question
is whether the system stayed within bounds during deployment, not merely on a
hypothetical stationary distribution.
This clean baseline model now lets us articulate the audit problem precisely: given the log and cryptographic proofs, an auditor should (i) deterministically flag any LL violations, and (ii) produce anytime-valid confidence sequences for welfare/fairness targets that remain correct under the adaptive, strategic interaction described above. We next formalize what the auditor is allowed to assume (and what it explicitly does not assume) in a threat model, and we define (ε, δ)-compliance statements that connect these targets to actionable regulatory conclusions.
Our auditing layer is intended to be robust to precisely those
features that make online contracting operationally attractive and
regulatorily challenging: the principal can adapt the contracting policy
in response to history, agents can respond strategically (including
opting out), and the resulting data stream is neither i.i.d. nor
stationary. For this reason, we separate the
economic degrees of freedom'' of the actors from theintegrity
assumptions’’ that make the log a meaningful evidentiary object. The
threat model clarifies what kinds of behavior we treat as part of the
regulated system (and hence must be handled by the audit), versus what
kinds of behavior would constitute falsification of the audit record
itself.
We allow the principal to be fully strategic subject to the protocol. In particular, the principal may choose qt(⋅ ∣ st) as any ℱt − 1-measurable function, including policies that drift rapidly over time, target particular subpopulations through state dependence, or attempt to "game" the fairness statistic by changing terms of trade across rounds. Likewise, agents may behave strategically and heterogeneously: their actions ai, t can depend on the posted contract, their private information, and expectations about future offers. Importantly, we do not impose equilibrium restrictions (Bayes–Nash, trembling-hand, etc.) because an auditor typically cannot validate such assumptions from an operational log. Finally, we allow the outcome-generating process for yi, t to be non-stationary and history dependent; the only role outcomes play in the baseline audit is through logged observables and boundedness.
This "maximally adaptive" stance is a feature, not a bug: it ensures that the validity of our statistical claims is not contingent on the operator adhering to a fixed policy class, nor on agents conforming to a stable behavioral model. The cost is that we must define our compliance targets in a way that remains well-posed under such adaptivity—hence the emphasis on predictable (conditional) targets such as 𝔼[F(Wt) ∣ ℱt − 1] rather than stationary-population averages.
Against these economic degrees of freedom, we posit a narrow set of integrity assumptions that make auditing feasible.
These assumptions do not require the auditor to trust the principal’s incentives or the agents’ incentives; they require only that the recorded transcript is an accurate, contestable account of what the system actually did, and that declared randomization is binding.
Equally important, we state what the auditor does not assume, since these omissions determine the interpretation (and limitations) of any certificate.
In short, our audit statements are designed to be valid under minimal behavioral assumptions, at the expense of targeting quantities that are meaningfully defined from the logged deployment process.
The log supports two qualitatively different kinds of compliance conclusions.
First, some constraints are pointwise and therefore deterministically auditable. Limited Liability in the form pi, t ≥ 0 is the canonical example: a single negative payment is a concrete violation tied to a specific round and agent, and the auditor can produce the corresponding log entry as evidence. Similar deterministic checks include syntactic validity of contract parameters, adherence to stated caps/floors, and (once specified) integrity checks for missingness.
Second, distributional targets such as welfare and fairness are inherently statistical because we target conditional expectations that reflect the system’s predictable behavior under uncertainty. Here, the right governance question is not whether the realized path happened to look fair (which can be luck or noise), but whether the system’s predictable fairness during deployment met a threshold, as captured by $\mu_T=\frac{1}{T}\sum_{t=1}^T \mathbb E[F(W_t)\mid\mathcal F_{t-1}]$.
We formalize compliance as a statement that blends deterministic constraints with probabilistic guarantees. Fix a horizon T, a fairness threshold τ, and a failure probability δ ∈ (0, 1). An auditing algorithm observes the log sequentially and outputs (i) deterministic flags for any pointwise violations and (ii) an anytime-valid confidence sequence [LCBt, UCBt] for the predictable deployment-average fairness μt.
We say the system is (ε, δ)-compliant with respect to threshold τ over horizon T if (i) no pointwise violation (e.g., a negative payment) appears anywhere in the log, and (ii) with probability at least 1 − δ, the predictable deployment-average fairness satisfies μT ≥ τ − ε. The role of ε ≥ 0 is to accommodate governance-relevant approximations that are not statistical in nature: discretization of a continuous contract space, conservative bounding of wealth proxies, or an explicitly permitted tolerance band around τ. In the baseline development, one can take ε = 0 when the target is exactly μT ≥ τ and the fairness functional is computed exactly from the proxy wealth.
Because μT is not
directly observed, compliance must be certified by an audit rule. Given an
anytime-valid confidence sequence, a natural risk-limiting decision rule
is:
Certify compliance at time
T ⇔ LCBT ≥ τ − ε.
By construction of confidence sequences, this certificate is risk-limiting:
Pr (LCBT ≥ τ − ε and μT < τ − ε) ≤ δ,
even though the underlying data are adaptive and non-stationary.
Symmetrically, one may flag noncompliance if UCBT < τ − ε,
with the analogous error control. When τ − ε lies inside the
interval, the audit is inconclusive; this is not a failure of the method
but an explicit reflection of finite-sample uncertainty.
A key operational feature is that the regulator may audit at unpredictable times, or the platform may need to monitor compliance continuously. For this reason we require time-uniform validity: with probability at least 1 − δ, the confidence statement holds simultaneously for all t ≤ T. This allows audits at stopping times (e.g., "trigger an investigation when complaints arrive") without inflating Type I error. When multiple metrics are audited (e.g., LL and fairness, or multiple fairness functionals), we can allocate failure budgets δ1, …, δM across metrics and apply a union bound so that overall failure probability remains controlled.
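Operationally, the decision rule and the multi-metric budget reduce to a few lines. A minimal sketch (ours), assuming the confidence sequence [LCBt, UCBt] is computed elsewhere:

```python
def audit_decision(lcb_t: float, ucb_t: float, tau: float, eps: float = 0.0) -> str:
    """Risk-limiting audit decision: certify when the lower bound clears
    τ - ε, flag when the upper bound falls below it, and otherwise report
    finite-sample inconclusiveness."""
    threshold = tau - eps
    if lcb_t >= threshold:
        return "certify-compliance"
    if ucb_t < threshold:
        return "flag-noncompliance"
    return "inconclusive"

def split_failure_budget(delta: float, num_metrics: int) -> list:
    """Union bound across M audited metrics: δ_m = δ / M keeps the overall
    failure probability at most δ."""
    return [delta / num_metrics] * num_metrics
```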
Finally, we emphasize an interpretive point that is often blurred in policy discussions. A compliance certificate at level δ is a guarantee about the system’s fairness with respect to the chosen fairness functional and the logged wealth proxy. It does not certify unobserved welfare components (e.g., effort costs), nor does it establish that the principal could not have achieved higher welfare while remaining fair. In that sense, the model illuminates a core tradeoff: by insisting on targets that remain well-defined and auditable under adaptivity and strategic response, we obtain strong error control and contestability, but we necessarily restrict attention to fairness notions that are stable and measurable from the operational record.
The next section makes these integrity conditions concrete by specifying how verifiable randomization and log integrity are implemented, and by listing the deterministic checks (including LL, boundedness, and missingness) that the auditor can perform before running any statistical certification procedure.
Our statistical guarantees in the next section rest on a simple prerequisite: the auditor must be able to treat the transcript as an accurate, contestable record of (i) what the principal declared it would do (the propensity function qt) and (ii) what the principal actually did (the realized draw bt and subsequent payments). This section makes that prerequisite operational by specifying (a) a VRF-based sampling protocol that binds the principal to its declared randomization and (b) a small set of deterministic checks that the auditor can run directly on the log before attempting any statistical certification.
To make propensity compliance verifiable, the contract space ℬ and each distribution qt(⋅ ∣ s) must admit a canonical encoding. For finite ℬ this is straightforward: the principal logs a probability vector {qt(b ∣ st)}b ∈ ℬ in a fixed order with a prescribed numerical precision. For continuous or high-dimensional ℬ, we require a declared sampling scheme with a canonical description (e.g., a parametric family with parameters θt(st) and a fixed base measure), together with an explicit discretization rule if the actual implementation is discretized. The key design principle is that, given the logged fields, the auditor can deterministically (i) reconstruct the declared distribution to within a known tolerance and (ii) reproduce the mapping from a uniform variate to the realized contract.
In practice, we recommend logging both (1) a human-interpretable object (e.g., parameters and family name) and (2) a machine-verifiable digest (e.g., a hash of the canonical serialization). The latter is what we cryptographically bind into the VRF input so that the principal cannot change its story about qt ex post.
Fix a VRF keypair (pk, sk) registered with the auditor prior to deployment. In round t, the principal must commit to the propensity object before the random draw is determined. A clean way to enforce this ordering is to define a round-specific VRF input that includes a commitment to qt.
Concretely, let Entryt − 1 denote the
previous log entry and let ht − 1 := H(Entryt − 1)
be its hash (or the running hash-chain value). Let Enc(qt(⋅ ∣ st))
be the canonical serialization of the declared propensity object at
state st,
and let ct := H(Enc(qt(⋅ ∣ st)))
be its commitment digest. Define the VRF input
xt := H (ht − 1 ∥ t ∥ st ∥ ct),
where ∥ denotes concatenation and H is a collision-resistant hash. The
principal computes the VRF output and proof
(rt, πt) := VRFsk(xt),
and deterministically maps rt to a uniform
variate ut ∈ [0, 1)
(e.g., by interpreting rt as an integer
and normalizing). The realized contract is then computed by a
deterministic sampling map
bt := Sample(qt(⋅ ∣ st), ut),
where Sample is fixed by protocol. For
discrete ℬ, Sample can be implemented via the inverse-CDF
rule on the ordered support. For continuous ℬ, Sample is
the declared procedure (e.g., inverse transform for a univariate family,
or a deterministic pseudocode for multivariate sampling) applied to
ut (and,
if needed, additional variates derived deterministically from rt).
The log entry for round t
contains, at minimum,
(st, Enc(qt(⋅ ∣ st)), ct, xt, rt, πt, bt, a1 : n, t, y1 : n, t, p1 : n, t),
together with the hash-chain value ht = H(ht − 1 ∥ Entryt)
and a digital signature under the principal’s signing key. Given these
fields, the auditor verifies: (i) hash-chain continuity and signature
validity; (ii) ct matches Enc(qt); (iii)
πt is a
valid VRF proof for input xt under pk; and (iv) recomputing ut and Sample(⋅) reproduces the logged bt.
Two operational remarks matter for soundness. First, including ht − 1 and t in xt prevents replay and makes the draw round-specific. Second, including ct in xt prevents the principal from choosing qt after observing the VRF output. In this sense, the VRF is not merely a randomness beacon; it is a commitment device that makes the principal’s declared propensity contestable.
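The auditor-side checks (ii)–(iv) can be made concrete in a few lines. The sketch below is ours and makes assumptions the text leaves open: `vrf_verify` stands in for a real VRF verification primitive (e.g., an ECVRF library), the log entry is a dictionary of the logged fields, and the sampling map is inverse-CDF over a finite ordered support; hash-chain continuity and signature checks (step (i)) are omitted for brevity.

```python
import hashlib

def H(*parts: bytes) -> bytes:
    # Collision-resistant hash with a fixed, canonical concatenation rule.
    return hashlib.sha256(b"||".join(parts)).digest()

def sample_inverse_cdf(probs, support, u: float):
    """Deterministic sampling map Sample(q, u): inverse CDF over an ordered support."""
    cdf = 0.0
    for p, b in zip(probs, support):
        cdf += p
        if u < cdf:
            return b
    return support[-1]

def verify_round(entry: dict, h_prev: bytes, pk, vrf_verify) -> bool:
    # (ii) commitment digest matches the canonical serialization of q_t
    c_t = H(entry["enc_q"])
    if c_t != entry["c_t"]:
        return False
    # Reconstruct the round-specific VRF input x_t = H(h_{t-1} || t || s_t || c_t)
    x_t = H(h_prev, str(entry["t"]).encode(), entry["s_t"], c_t)
    # (iii) VRF proof pi_t is valid for x_t under the registered public key
    if not vrf_verify(pk, x_t, entry["r_t"], entry["pi_t"]):
        return False
    # (iv) recompute u_t from r_t and check Sample(q_t, u_t) reproduces b_t
    u_t = int.from_bytes(entry["r_t"], "big") / 2 ** (8 * len(entry["r_t"]))
    return sample_inverse_cdf(entry["probs"], entry["support"], u_t) == entry["b_t"]
```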
When ℬ is continuous, implementations often discretize either the support or the CDF inversion. Because the auditor must obtain an exact match to accept a draw, we require that discretization be part of the declared sampling map and be deterministically reproducible (including rounding rules). Any approximation error can then be treated as an explicit governance slack ε (as in Section~) or bounded as a deterministic implementation error (e.g., total variation distance between the ideal and implemented sampling rule). The audit criterion is then: the realized bt must equal the output of the declared deterministic procedure applied to the VRF-derived ut.
Binding a draw to a declared propensity is only useful if the surrounding transcript is itself immutable. Accordingly, each round must be recorded in an append-only, tamper-evident structure (hash chain, signed ledger, or an external notarization service). The auditor’s baseline integrity checks are mechanical: verify that each entry is signed, that the hash chain links correctly, and that round indices are contiguous. These checks are not statistical; they are the evidentiary substrate on which statistical claims will later rest.
Before constructing any confidence sequence, the auditor should first run a set of deterministic checks that either (i) produce immediate violations with cryptographic evidence or (ii) certify that the boundedness and measurability conditions needed for martingale tools plausibly hold for the logged proxy.
The role of this section is deliberately narrow: we are not yet claiming that the system is fair, only that the transcript is trustworthy. VRF verification and log integrity make the stochastic elements of deployment contestable; LL, missingness, and boundedness checks ensure that the objects we will feed into martingale machinery are well-defined and satisfy the regularity conditions under which anytime-valid inference is possible. With these prerequisites in place, we can treat {F(Wt)}t ≥ 1 (or suitable increments) as an adapted, bounded process and construct confidence sequences that remain valid under the adaptivity and strategic response emphasized in Section~. The next section develops this statistical machinery.
Having made the transcript trustworthy (the realized contract draw is bound to the declared propensity and the log is tamper-evident), we can treat the deployed system as generating an adapted stochastic process relative to the auditor’s filtration {ℱt}t ≥ 0. The key methodological point is that we do not require independence, stationarity, or a fixed policy: the principal may update qt based on history, agents may respond strategically, and the induced distribution of outcomes may drift arbitrarily. What we do require for anytime-valid inference is that the statistic we audit can be expressed as a bounded adapted process, so that deviations from its conditional expectations form a martingale difference sequence to which time-uniform concentration applies.
Fix any ℱt-measurable scalar
statistic Zt computed from
the log at time t (e.g.,
welfare, Rawlsian minimum wealth, or a fairness functional F(Wt)).
Define the (predictable) conditional mean
mt := 𝔼[Zt ∣ ℱt − 1],
and the deployment-average conditional expectation
$$
\mu_t \;:=\; \frac{1}{t}\sum_{k=1}^t m_k
\;=\; \frac{1}{t}\sum_{k=1}^t \mathbb E[Z_k\mid \mathcal F_{k-1}].
$$
This μt is
the natural audit target in an adaptive environment: it is the average
performance that the deployed process delivers, conditional on what was
known just before each round. Importantly, μt remains
well-defined even when the data stream is non-i.i.d.
Let
Dt := Zt − mt.
Then {Dt}
is a martingale difference sequence: 𝔼[Dt ∣ ℱt − 1] = 0.
Writing partial sums $S_t:=\sum_{k=1}^t
D_k$, we obtain the decomposition
$$
\bar Z_t - \mu_t
\;=\;
\frac{1}{t}\sum_{k=1}^t (Z_k-m_k)
\;=\;
\frac{S_t}{t},
\qquad
\bar Z_t:=\frac{1}{t}\sum_{k=1}^t Z_k.
$$
Thus, any time-uniform bound on St immediately
yields a confidence sequence for μt centered at
the observed average Z̄t, despite
adaptivity.
Martingale concentration requires controlling tail growth of St, which in
turn is achieved by bounding Dt. In our
setting the auditor can verify ex ante that per-round wealth increments
lie in a bounded interval $\Delta
w_{j,t}\in[\underline B,\overline B]$. This implies that
cumulative wealth vectors remain in a known bounded set:
$$
W_t \in [t\underline B,t\overline B]^{n+1}.
$$
Therefore, for many audit-relevant functionals Zt = ϕ(Wt),
the auditor can derive deterministic bounds $Z_t\in[\underline z_t,\overline z_t]$
(possibly depending on t but
known in advance), which imply a bound on Dt.
Concretely, if $Z_t\in[\underline z,\overline z]$ for all t ≤ T, then $D_t\in[\underline z-\overline z,\overline z-\underline z]$, so $|D_t|\le (\overline z-\underline z)$. Even when only time-varying bounds are available, one can work with predictable envelopes $\underline z_t,\overline z_t$ and apply confidence sequence constructions for bounded but non-identically bounded differences; in what follows we emphasize the simpler uniform-bound case, since our wealth bounds imply uniform boundedness over any fixed horizon T.
We use the standard supermartingale method. Suppose Dt is
conditionally sub-Gaussian with scale parameter σ (a condition implied by
boundedness via Hoeffding’s lemma). Then for any fixed λ ∈ ℝ,
$$
M_t(\lambda)
\;:=\;
\exp\!\Big(\lambda S_t - \tfrac{\lambda^2\sigma^2}{2}\,t\Big)
$$
is a nonnegative supermartingale. Ville’s inequality yields
Pr (∃t ≤ T: Mt(λ) ≥ 1/δ) ≤ δ,
which can be rearranged into a time-uniform boundary for St of the
form
$$
S_t \;\le\; \frac{\log(1/\delta)}{\lambda} +
\frac{\lambda\sigma^2}{2}\,t
\qquad \forall t\le T
$$
with probability at least 1 − δ. Optimizing over λ gives a $\sqrt{t}$-type boundary. To avoid committing
to a single horizon T (or to
obtain a bound valid for all t ≥ 1 simultaneously), one uses
either (i) supermartingales that integrate Mt(λ)
over a mixing distribution on λ, or (ii) arguments that union
bound over geometrically increasing epochs. These constructions yield
the familiar anytime-valid scaling
$$
|S_t|
\;\lesssim\;
\sigma \sqrt{t\,\log\!\log t} \;+\; \sigma\sqrt{t\,\log(1/\delta)}
$$
(up to constants), and therefore
$$
|\bar Z_t-\mu_t|
\;=\;
\frac{|S_t|}{t}
\;\lesssim\;
\sigma \sqrt{\frac{\log\!\log t + \log(1/\delta)}{t}}.
$$
We will state explicit finite-sample widths in the next section; here
the point is conceptual: validity is obtained by controlling the running
maximum of a supermartingale, not by repeating fixed-time tests.
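For completeness, the fixed-time optimization over λ is a one-line calculus step; it is this fixed-t bound that the mixture and epoch constructions then extend to hold uniformly over time:
$$
\lambda^\star \;=\; \sqrt{\frac{2\log(1/\delta)}{\sigma^2 t}}
\qquad\Longrightarrow\qquad
\frac{\log(1/\delta)}{\lambda^\star} + \frac{\lambda^\star\sigma^2}{2}\,t
\;=\; \sigma\sqrt{2\,t\log(1/\delta)}.
$$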
Pure bounded-difference (Hoeffding-style) widths can be conservative
when realized variability is small. Martingale confidence sequences
admit variance-sensitive analogues based on the predictable quadratic
variation
$$
V_t \;:=\; \sum_{k=1}^t \mathbb E[D_k^2\mid \mathcal F_{k-1}],
$$
leading to Freedman/Bernstein-type boundaries of the schematic
form
$$
|S_t|
\;\lesssim\;
\sqrt{V_t\,\log(1/\delta)} \;+\; c\,\log(1/\delta),
$$
where c is a bound on |Dt|. While
Vt is not
directly observable, one can upper bound it using deterministic
envelopes, or employ empirical-Bernstein confidence sequences that
replace Vt
with an observable self-normalizer built from the realized Zt’s (at the
cost of slightly larger constants). Operationally, this matters because
fairness and welfare trajectories in many deployments are far less
volatile than worst-case bounds suggest; variance-adaptive sequences
tighten substantially and can certify compliance earlier.
The remaining technical step is to ensure that our audit statistics
are indeed bounded (or have bounded differences) under the logged wealth
proxy. For welfare and Rawlsian objectives this is immediate, since they
are Lipschitz functionals of Wt under
standard norms. For general fairness functionals F(Wt),
we impose L-Lipschitzness on
the relevant bounded domain:
|F(W) − F(W′)| ≤ L ∥W − W′∥ for
all feasible W, W′.
Under this condition and bounded wealth increments, we obtain
deterministic one-step stability:
|Zt − Zt − 1| = |F(Wt) − F(Wt − 1)| ≤ L ∥Wt − Wt − 1∥ = L ∥ΔWt∥.
If we take ∥⋅∥ to be the Euclidean
norm, then $\|\Delta W_t\|\le
\sqrt{n+1}\,\max\{|\underline B|,|\overline B|\}$, hence $|Z_t-Z_{t-1}|\le L\sqrt{n+1}\max\{|\underline
B|,|\overline B|\}$. This stability is what lets us translate the
primitive boundedness of per-round transfers and outcomes into
boundedness (or at least controlled range growth) of fairness statistics
computed on cumulative wealth.
Conceptually, Lipschitzness is the bridge between bounds (the contract cannot move anyone’s wealth too much in one round) and concentration (the fairness metric cannot jump too much in one round). It is also where limitations surface: some desirable disparity measures are not globally Lipschitz without additional domain restrictions (e.g., inequality indices that divide by mean wealth), which is why we will separately impose a mean-wealth lower bound for 1 − Gini in Section~7.
Once we have an anytime-valid confidence sequence [LCBt, UCBt] for μt, the audit decision rule is immediate: we can (i) certify compliance at horizon T whenever LCBT ≥ τ, (ii) flag likely noncompliance when UCBt < τ, and (iii) monitor continuously without inflating type-I error, since the sequence is valid under optional stopping. This operationalizes a regulatory stance that is both strict about evidentiary integrity (deterministic checks) and appropriately cautious about statistical uncertainty (time-uniform inference under adaptivity).
The next section instantiates this template for (a) welfare and welfare increments, (b) the Rawlsian minimum, and (c) Lipschitz fairness metrics, and then treats 1 − Gini by adding the minimal domain restriction needed to restore Lipschitz stability.
In this section we instantiate the generic martingale template with explicit, finite-sample confidence sequence (CS) widths for the three audit objects that recur in policy discussions: (a) total welfare (or welfare increments), (b) Rawlsian protection of the worst-off party, and (c) inequality-style fairness metrics that are stable functionals of cumulative wealth. The common structure is that we choose an adapted statistic Zt whose range is deterministically bounded from the log primitives, and then form time-uniform bounds for the deployment-average conditional mean $\mu_t=\frac{1}{t}\sum_{k=1}^t \mathbb E[Z_k\mid \mathcal F_{k-1}]$.
We begin with a single reusable lemma (stated here as a theorem for convenience) that turns a boundedness certificate into an operational CS.
Assume Zt
is ℱt-measurable
and almost surely bounded as $Z_t\in[\underline z,\overline z]$ for all
t ≥ 1, with range $c:=\overline z-\underline z$. Define $\bar Z_t:=\frac{1}{t}\sum_{k=1}^t Z_k$ and,
for any δ ∈ (0, 1),
$$
\mathrm{rad}_t(\delta)
\;:=\;
c\sqrt{\frac{2}{t}\left(\log\frac{2}{\delta}+\log\!\big(1+\log_2
t\big)\right)}.
$$
Then the interval
CSt(δ) := [Z̄t − radt(δ), Z̄t + radt(δ)]
satisfies
Pr (∀t ≥ 1: μt ∈ CSt(δ)) ≥ 1 − δ,
for any adaptive (non-i.i.d.) data stream consistent with the filtration
{ℱt}.
The important operational point is that the auditor only needs the deterministic range c (derivable from $\underline B,\overline B$ and the functional form of Zt) and the observed running average Z̄t to compute CSt(δ) online. In applications below, we will typically report the compliance condition as LCBT ≥ τ, where LCBt := Z̄t − radt(δ).
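The radius in Theorem~7.1 is directly computable online; the following is a direct transcription (ours) of the stated width and the resulting interval:

```python
import math

def cs_radius(t: int, c: float, delta: float) -> float:
    """rad_t(δ) = c * sqrt((2/t) * (log(2/δ) + log(1 + log2 t))), as in Theorem 7.1."""
    return c * math.sqrt((2.0 / t) * (math.log(2.0 / delta) + math.log(1.0 + math.log2(t))))

def confidence_sequence(z_bar_t: float, t: int, c: float, delta: float):
    """Anytime-valid interval [LCB_t, UCB_t] for μ_t, centered at the running mean."""
    r = cs_radius(t, c, delta)
    return z_bar_t - r, z_bar_t + r

# Example: a [0,1]-valued fairness index (c = 1), δ = 0.05, after 10,000 rounds.
lcb, ucb = confidence_sequence(z_bar_t=0.82, t=10_000, c=1.0, delta=0.05)
```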
A natural welfare proxy in our contracting environment is total
realized wealth change (principal plus agents) per round,
Ztwel := ΔWelfaret := ∑j ∈ {p, 1, …, n}Δwj, t.
Because each $\Delta w_{j,t}\in[\underline
B,\overline B]$ by assumption, we have the deterministic
bound
$$
Z_t^{\mathrm{wel}}
\in
\big[(n+1)\underline B,\ (n+1)\overline B\big],
\qquad
c_{\mathrm{wel}}
=
(n+1)(\overline B-\underline B).
$$
Applying Theorem~7.1 yields an anytime-valid CS for
$$
\mu_t^{\mathrm{wel}}
:=
\frac{1}{t}\sum_{k=1}^t
\mathbb E\!\left[\Delta \mathrm{Welfare}_k\mid \mathcal F_{k-1}\right],
$$
i.e., the deployment-average welfare increment. This target is
particularly well-suited to adaptive deployments: it answers the
regulatory question ``on average, conditional on what the principal knew
when acting, what surplus did the deployed mechanism deliver?’’ without
requiring stationarity.
If a policy requirement is stated instead in terms of cumulative
welfare Welfare(Wt) = ∑jwj, t,
the auditor can translate between increments and levels via
$$
\frac{1}{T}\sum_{t=1}^T \mathrm{Welfare}(W_t)
=
\frac{1}{T}\sum_{t=1}^T \sum_{k=1}^t \Delta \mathrm{Welfare}_k
=
\sum_{k=1}^T \left(1-\frac{k-1}{T}\right)\Delta \mathrm{Welfare}_k,
$$
so that a weighted version of the same CS machinery applies (with
predictable weights); we emphasize increments because they preserve a
uniform range and therefore yield clean $\tilde O(1/\sqrt{t})$ widths.
The Rawlsian metric at time t is the minimum cumulative
wealth,
Rt := minj ∈ {p, 1, …, n}wj, t.
Directly auditing {Rt} as a level
statistic is possible but unattractive because its range grows linearly
in t, which mechanically
widens Hoeffding-style bounds. A simple normalization avoids this
problem. Define the time-normalized worst-off wealth,
$$
Z_t^{\mathrm{raw}}
\;:=\;
\frac{1}{t}\,R_t
=
\frac{1}{t}\min_{j} w_{j,t}.
$$
Since each $w_{j,t}\in[t\underline
B,t\overline B]$, it follows deterministically that $Z_t^{\mathrm{raw}}\in[\underline B,\overline
B]$, hence $c_{\mathrm{raw}}=\overline
B-\underline B$. Theorem~7.1 therefore provides an anytime-valid
CS for
$$
\mu_t^{\mathrm{raw}}
:=
\frac{1}{t}\sum_{k=1}^t
\mathbb E\!\left[\frac{1}{k}\min_j w_{j,k}\ \middle|\ \mathcal
F_{k-1}\right].
$$
This target has a clear interpretation in deployment terms: it averages
(over rounds) the predictable value of the worst-off party’s
wealth-to-date.
A second, sometimes more policy-aligned alternative is to audit the
increment in the Rawlsian minimum:
ΔRt := Rt − Rt − 1.
One can show purely from the wealth-increment bounds that $\Delta R_t\in[\underline B,\overline B]$ for
all t (because the minimum
cannot rise by more than the largest feasible one-step increment, nor
fall by more than the largest feasible one-step decrement). Thus Theorem~7.1 applies again with the
same range $\overline B-\underline B$,
yielding an anytime-valid CS for the predictable ``worst-off growth
rate’’ $\frac{1}{t}\sum_{k=1}^t \mathbb
E[\Delta R_k\mid\mathcal F_{k-1}]$.
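Both Rawlsian statistics are mechanical to compute from the cumulative-wealth matrix; a short sketch (ours), with rows indexed by rounds:

```python
import numpy as np

def rawls_normalized(W: np.ndarray) -> np.ndarray:
    """Z_t^{raw} = (1/t) min_j w_{j,t}; stays in [B_lo, B_hi] by construction."""
    t = np.arange(1, W.shape[0] + 1)
    return W.min(axis=1) / t

def rawls_increments(W: np.ndarray) -> np.ndarray:
    """ΔR_t = R_t - R_{t-1} (with R_0 = 0, since cumulative wealth starts at zero)."""
    R = W.min(axis=1)
    return np.diff(R, prepend=0.0)
```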
Let F : ℝn + 1 → ℝ
be an L-Lipschitz functional
on the feasible wealth domain up to horizon T (for a chosen norm), and set
Ztfair := F(Wt).
When F is itself range-bounded
(as with many normalized indices taking values in [0, 1]), Theorem~7.1 applies immediately with
cfair = 1. More
generally, even when F is not
globally bounded, our primitive wealth increment bounds restrict Wt to the
hyper-rectangle $[t\underline B,t\overline
B]^{n+1}$, and Lipschitzness yields a deterministic range bound
on Ztfair
over t ≤ T:
$$
\sup_{W,W'\in [T\underline B,T\overline B]^{n+1}} |F(W)-F(W')|
\;\le\;
L\cdot \mathrm{diam}\!\left([T\underline B,T\overline B]^{n+1}\right),
$$
where the diameter is explicit under the chosen norm (e.g., under ℓ2 it equals $T(\overline B-\underline B)\sqrt{n+1}$).
Therefore, for a fixed deployment horizon T, the auditor can compute a valid
uniform range cfair(T) and
apply Theorem~7.1 to obtain an anytime-valid CS for
$$
\mu_t^{\mathrm{fair}}
:=
\frac{1}{t}\sum_{k=1}^t \mathbb E\!\left[F(W_k)\mid \mathcal
F_{k-1}\right],
$$
which is exactly the fairness target used in our compliance
definition.
In many deployments, worst-case envelopes based on $\overline B-\underline B$ are conservative.
The auditor can therefore supplement Theorem~7.1 with an
empirical-Bernstein CS whose width depends on realized variability. A
canonical form is
$$
\mathrm{rad}^{\mathrm{EB}}_t(\delta)
\;=\;
\sqrt{\frac{2\widehat V_t\log(C/\delta)}{t^2}}
\;+\;
\frac{K\,c\log(C/\delta)}{t},
$$
where V̂t
is an observable self-normalizer (built from the realized {Zk}), and C, K are absolute constants
determined by the chosen EB construction. This refinement can materially
accelerate certification when fairness (or welfare) is stable.
The index F(W) = 1 − Gini(W) is attractive because it is widely understood by stakeholders, but it is not globally Lipschitz due to the normalization by mean wealth. The minimal fix is to restrict attention to a domain where the mean is bounded away from zero.
Define, for W = (w0, w1, …, wn)
with mean $\bar w:=\frac{1}{n+1}\sum_{j=0}^n
w_j$,
$$
\mathrm{Gini}(W)
:=
\frac{1}{2(n+1)^2\,\bar w}\sum_{j=0}^n\sum_{k=0}^n |w_j-w_k|,
\qquad
F(W):=1-\mathrm{Gini}(W).
$$
Assume the auditor can verify a policy-imposed condition
w̄t ≥ μmin > 0 for
all t ≤ T,
directly from the log-computed wealth vector Wt. Also let
$R_T:=T\max\{|\underline B|,|\overline
B|\}$ so that |wj, t| ≤ RT
for all j and t ≤ T.
On the restricted feasible set
𝒲T(μmin) := {W ∈ [−RT, RT]n + 1: w̄ ≥ μmin},
one can bound the sensitivity of Gini
as follows (under the ℓ1 norm). Let A(W) := ∑j, k|wj − wk|.
Then
|A(W) − A(W′)| ≤ 2(n + 1) ∥W − W′∥1, A(W) ≤ 2RT(n + 1)2,
and
$$
\left|\frac{1}{\bar w}-\frac{1}{\bar w'}\right|
\le
\frac{|\bar w-\bar w'|}{\mu_{\min}^2}
\le
\frac{\|W-W'\|_1}{(n+1)\mu_{\min}^2}.
$$
Combining these inequalities yields a Lipschitz bound on Gini (hence also on 1 − Gini):
$$
|\mathrm{Gini}(W)-\mathrm{Gini}(W')|
\;\le\;
\left(
\frac{1}{(n+1)\mu_{\min}}
+
\frac{R_T}{(n+1)\mu_{\min}^2}
\right)\|W-W'\|_1,
\qquad
W,W'\in\mathcal W_T(\mu_{\min}).
$$
Thus F(W) = 1 − Gini(W)
is Lipschitz on the auditable domain, with an explicit constant that
worsens as μmin ↓ 0
(the precise expression is less important than this comparative static).
Since F(Wt) ∈ [0, 1]
on this domain, the simplest implementation is to take Zt = F(Wt) ∈ [0, 1]
and apply Theorem~7.1 directly with cfair = 1, while treating
the mean lower bound w̄t ≥ μmin
as a separate deterministic precondition: if it fails at any t ≤ T, the auditor flags
the fairness metric as unstable and declines to certify.
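The implementation mirrors this two-step logic: first the deterministic mean-floor precondition, then the bounded statistic. A sketch (ours):

```python
import numpy as np

def gini(W: np.ndarray) -> float:
    """Gini(W) = (1 / (2 (n+1)^2 mean(W))) * Σ_j Σ_k |w_j - w_k|, as defined above."""
    n1 = len(W)  # n + 1 parties
    pairwise = np.abs(W[:, None] - W[None, :]).sum()
    return float(pairwise / (2 * n1 ** 2 * W.mean()))

def one_minus_gini(W: np.ndarray, mu_min: float) -> float:
    """Fairness statistic Z_t = 1 - Gini(W_t), valid only on the restricted
    domain mean(W) >= mu_min; outside it the auditor declines to certify."""
    if W.mean() < mu_min:
        raise ValueError("mean-wealth floor violated: metric unstable; decline to certify")
    return 1.0 - gini(W)
```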
This illustrates a broader lesson for inequality auditing in adaptive mechanisms: seemingly mild normalizations (division by a mean or a baseline) can destroy stability unless the regulator also enforces a domain restriction that keeps the normalization well-behaved.
So far we have treated auditing as an on-policy exercise: we certify properties of the deployed mechanism using only bounded adapted statistics. In many regulatory conversations, however, the counterfactual question is central: what would fairness have been had the principal committed to a different contracting policy? This motivates an optional off-policy (counterfactual) module that leverages the same log, but now uses the principal’s declared propensities to reweight outcomes.
We begin with the contextual bandit case, where each round t consists of observing st, drawing a
contract bt ∼ qt(⋅ ∣ st),
and then observing a bounded auditable statistic Zt (e.g., Zt = F(Wt),
or more conservatively a per-round fairness proxy built from ΔWt).
Fix a policy π*(b ∣ s)
that is not necessarily equal to the deployed sampling rule qt(⋅ ∣ s).
The counterfactual estimand we can identify from logs is the
deployment-average conditional mean fairness under π*,
$$
\mu_t(\pi^*)
\;:=\;
\frac{1}{t}\sum_{k=1}^t
\mathbb E\!\left[
\mathbb E_{b\sim \pi^*(\cdot\mid s_k)}\!\left[\,\mathbb E[Z_k\mid
\mathcal F_{k-1},s_k,b]\,\right]
\ \middle|\ \mathcal F_{k-1}
\right].
$$
This is the natural analog of our on-policy target: it asks, round by
round, what fairness would have been delivered had the principal sampled
contracts according to π* given the realized
state sk,
while holding fixed the environment response mapping from (s, b) into outcomes
(including strategic agent responses).
Identification requires overlap. Specifically, if we impose the auditable condition $q_t(b \mid s) \ge \eta > 0$ for all $(b, s, t)$, then the one-step importance ratio $\pi^*(b_t \mid s_t)/q_t(b_t \mid s_t)$ is bounded above by $1/\eta$ (when $\pi^*$ is supported on $\mathcal B$). This is the precise sense in which
randomization is not merely a design choice but a precondition for
counterfactual contestability: without overlap, the log cannot speak
about contracts that were essentially never tried.
Given verifiable propensities qt(⋅ ∣ st)
(via VRF-checked declarations), the auditor can form the
inverse-propensity-score (IPS) reweighted statistic
$$
\widetilde Z_t^{\mathrm{IPS}}(\pi^*)
\;:=\;
\frac{\pi^*(b_t\mid s_t)}{q_t(b_t\mid s_t)}\, Z_t.
$$
Under overlap and boundedness $Z_t\in[\underline z,\overline z]$, we have
the deterministic range bound
$$
\widetilde Z_t^{\mathrm{IPS}}(\pi^*)
\in
\left[\frac{\pi^*(b_t\mid s_t)}{q_t(b_t\mid s_t)}\underline z,\
\frac{\pi^*(b_t\mid s_t)}{q_t(b_t\mid s_t)}\overline z\right]
\subseteq
\left[\frac{\underline z}{\eta},\ \frac{\overline z}{\eta}\right],
$$
and hence the range scales as $c_{\mathrm{IPS}}\le (\overline z-\underline
z)/\eta$. The IPS average
$$
\overline{\widetilde Z}^{\mathrm{IPS}}_t(\pi^*)
\;:=\;
\frac{1}{t}\sum_{k=1}^t \widetilde Z_k^{\mathrm{IPS}}(\pi^*)
$$
is then an unbiased estimator (in the conditional-on-$\mathcal F_{k-1}$ sense) of the counterfactual mean under $\pi^*$, and we may apply the same martingale CS machinery as in Theorem~7.1 to obtain a time-uniform confidence sequence for $\mu_t(\pi^*)$.
The economic content of the bound is immediate: the variance (and therefore certification time) deteriorates as η ↓ 0. In policy terms, a regulator that wants credible counterfactual auditing must either (i) mandate a minimum exploration rate (a lower bound η), or (ii) restrict permissible counterfactual policies π* to those that do not put mass on rarely-sampled contracts.
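A minimal sketch of the reweighting step, with the overlap floor enforced before any score is trusted (array names are illustrative):

```python
import numpy as np

def ips_scores(z, pi_star, q, eta):
    """One-step IPS scores (pi*(b_t|s_t)/q_t(b_t|s_t)) * Z_t.

    z, pi_star, q are arrays of the realized Z_t and the two propensity
    evaluations at the logged (s_t, b_t); eta is the regulator-imposed
    overlap floor, checked before any reweighting is trusted.
    """
    z, pi_star, q = (np.asarray(a, dtype=float) for a in (z, pi_star, q))
    if np.any(q < eta):
        raise ValueError("overlap violated: some q_t(b_t|s_t) < eta")
    # The ratio is bounded by 1/eta, so the score range scales like
    # (zbar - zlow)/eta and the same martingale CS machinery applies
    # to the running mean of these scores.
    return (pi_star / q) * z
```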
Even with overlap, the IPS estimator can be extremely noisy when
π*/qt
is large on a set of non-negligible probability. This is not a
statistical artifact but a genuine information constraint: if the
deployed mechanism almost never tries some contract, then the log
contains too little information to estimate its consequences
precisely.
A common engineering response is to cap weights, i.e. to replace $\pi^*/q_t$ by $\min\{\pi^*/q_t, M\}$ for some $M$, but this introduces bias. From an auditing perspective, we view such truncation as a policy choice: one can still produce anytime-valid statements, but they must be framed as certification for a modified, truncated estimand, or as conservative bounds that account explicitly for truncation bias (which typically requires additional structure, e.g. outcome monotonicity in $b$).
When outcome modeling is feasible, a doubly robust estimator can
substantially reduce variance while retaining a martingale-valid
analysis, provided the numerical component is handled carefully. Let
$\widehat m_{t-1}(s, b)$ be a regression estimate of $m(s, b) := \mathbb E[Z_t \mid \mathcal F_{t-1}, s_t = s, b_t = b]$ built only from data up to time $t-1$ (so $\widehat m_{t-1}$ is $\mathcal F_{t-1}$-measurable). Define
$$
\widehat m_{t-1}(s, \pi^*) := \sum_{b \in \mathcal B} \pi^*(b \mid s)\, \widehat m_{t-1}(s, b)
$$
(with the obvious integral form for continuous $\mathcal B$). The one-step DR score is
$$
\widetilde Z_t^{\mathrm{DR}}(\pi^*)
\;:=\;
\widehat m_{t-1}(s_t,\pi^*)
\;+\;
\frac{\pi^*(b_t\mid s_t)}{q_t(b_t\mid s_t)}\Big(Z_t-\widehat
m_{t-1}(s_t,b_t)\Big).
$$
If either (i) the propensity weights are correct (ensured here by
VRF-verified sampling and declared qt) or (ii) the
model is correct, the DR average targets μt(π*);
when both are approximately correct, it typically has much smaller
variance than IPS.
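A sketch of the one-step DR score for a finite grid; the callable interfaces for $\pi^*$, $q_t$, and $\widehat m_{t-1}$ are assumptions for illustration:

```python
import numpy as np

def dr_score(z_t, s_t, b_t, pi_star, q_t, m_hat, contract_grid, z_lo, z_hi):
    """One-step doubly robust score for a finite contract grid B.

    pi_star(b, s) and q_t(b, s) return probabilities; m_hat(s, b) is an
    F_{t-1}-measurable regression model (trained on rounds < t).  These
    callables are assumed interfaces, not a fixed API.  Predictions are
    clipped into [z_lo, z_hi] so the DR increment has a known range.
    """
    def m_clip(s, b):
        return float(np.clip(m_hat(s, b), z_lo, z_hi))

    # Direct-method term: model average under the target policy pi*.
    direct = sum(pi_star(b, s_t) * m_clip(s_t, b) for b in contract_grid)
    # Importance-weighted residual correction at the realized (s_t, b_t).
    ratio = pi_star(b_t, s_t) / q_t(b_t, s_t)
    return direct + ratio * (z_t - m_clip(s_t, b_t))
```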
For auditing, the key point is that we can still obtain time-uniform concentration so long as we can bound the DR increments. Under overlap and bounded $Z_t\in[\underline z,\overline z]$, if we additionally ensure $\widehat m_{t-1}(s,b)\in[\underline z,\overline z]$ by construction (e.g. clipping predictions), then $\widetilde Z_t^{\mathrm{DR}}(\pi^*)$ lies in a known interval whose width again scales like $1/\eta$, but with the residual term $Z_t-\widehat m_{t-1}(s_t,b_t)$ often much smaller in practice. We emphasize a limitation: the learning step that produces $\widehat m_{t-1}$ is not itself cryptographically verifiable. What is auditable is the timing discipline (the model must be frozen before observing round $t$) and the subsequent CS computation given the realized residuals. In deployments, this suggests a clean separation: the regulator specifies the permissible modeling class and training protocol, but the final compliance decision is still based on an anytime-valid bound.
In richer environments, st is not
exogenous but evolves with actions, and a ``policy’’ π* specifies a sequence
of contract distributions along a trajectory (an episodic MDP).
Off-policy evaluation then involves products of importance ratios across
steps. In an episode of horizon H, a naïve trajectory-weighted
estimator uses
$$
\rho_{e}
\;:=\;
\prod_{h=1}^H
\frac{\pi^*(b_{e,h}\mid s_{e,h})}{q_{e,h}(b_{e,h}\mid s_{e,h})},
$$
which, under overlap $q \ge \eta$, is bounded by $\eta^{-H}$ and therefore suffers exponential variance blow-up in $H$. This is the dynamic analog of the bandit variance problem, and it is more severe.
A practical (and standard) mitigation is to use per-decision importance sampling (weighting stepwise rewards rather than whole trajectories) and/or sequential doubly robust estimators that combine local models with one-step residual corrections. However, to obtain sharp, anytime-valid guarantees in this setting, additional assumptions are typically required beyond those used for on-policy fairness certification: for example, bounded per-step rewards, bounded importance ratios (or explicit truncation), and some form of mixing/stability to control how model errors propagate through time. Because these conditions are application-dependent, we treat the Markov extension as a modular add-on rather than a baseline guarantee: the log architecture (VRF-verified propensities and append-only integrity) is fully compatible with these estimators, but the statistical validity of counterfactual fairness claims hinges on enforceable overlap and, in many cases, structural constraints on the dynamics.
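For concreteness, a sketch of the per-decision estimator for a single logged episode, with optional clipping that, as noted above, changes the estimand and must be reported as such:

```python
import numpy as np

def per_decision_is(rewards, pi_probs, q_probs, clip=None):
    """Per-decision importance sampling for one logged episode.

    The step-h reward is weighted by the cumulative ratio up to h only,
    instead of the full horizon-H product used by the trajectory
    estimator.  Optional clipping bounds the weights at the cost of an
    explicit bias, i.e. it changes the estimand being certified.
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(pi_probs, dtype=float) / np.asarray(q_probs, dtype=float)
    cum = np.cumprod(ratios)
    if clip is not None:
        cum = np.minimum(cum, clip)
    return float(np.sum(cum * rewards))
```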
Taken together, the counterfactual module clarifies the tradeoff we want the model to illuminate: verifiable randomization makes counterfactual questions identifiable, but only a regulator-imposed overlap regime (or a restriction of the counterfactual class) prevents the resulting audit from becoming statistically powerless.
We complement the theoretical guarantees with a simulation study designed to answer a practical question a regulator would immediately ask: how much data does the audit need before it can decide? Because our confidence sequences are anytime-valid, ``sample size’’ is endogenous: certification can occur early when the data are informative, and may never occur when the system is near the threshold or the fairness functional is highly variable. Simulations let us visualize this behavior through audit power curves and stopping-time distributions, and stress-test robustness to the two features that break classical i.i.d. analysis: policy drift (the principal adapts $q_t$) and strategic responses (agents adapt $a_{i,t}$).
We simulate n agents and a
principal over T rounds. Each
round draws an observable state st ∈ {1, …, S}
(e.g. market conditions) from a Markov chain with moderate persistence;
the auditor observes st in the log.
The contract class is the auditable linear form
$$
p_{i,t} = m_t + \alpha_t y_{i,t},
\qquad
(\alpha_t, m_t) \in [0,1] \times [0, m_{\max}],
$$
shared across agents for simplicity, with $b_t = (\alpha_t, m_t)$.
Outcomes are generated by strategic effort with idiosyncratic
productivity. Concretely, each agent $i$ draws a private cost shock $\theta_{i,t}$ and chooses effort $e_{i,t} \in [0, e_{\max}]$ if participating; output is
$$
y_{i,t} = \beta_i(s_t)\, e_{i,t} + \varepsilon_{i,t},
\qquad
\varepsilon_{i,t} \in [-\sigma, \sigma],
$$
and agent utility is quasi-linear
$$
\Delta w_{i,t} \;=\; p_{i,t} - \tfrac{1}{2}\theta_{i,t} e_{i,t}^2
\quad \text{if participating,}
\qquad
\Delta w_{i,t} \;=\; u_i^{\mathrm{out}}(s_t)
\quad \text{if rejecting.}
$$
Given $(\alpha_t, m_t)$, agents best-respond myopically (a standard reduced form for repeated contracting when the state is observed and the contract is per-round). Rejection occurs when the implied best-response payoff falls below $u_i^{\mathrm{out}}(s_t)$.
This design produces two empirically relevant patterns: (i) raising $\alpha_t$ increases incentives and output, but shifts surplus toward agents; (ii) raising $m_t$ relaxes participation constraints, but can generate inequality if targeted unevenly through state dependence. The auditor does not observe $\theta_{i,t}$ or $\beta_i(\cdot)$; it only observes $(s_t, b_t, a_{1:n,t}, y_{1:n,t}, p_{1:n,t})$.
To induce realistic nonstationarity, we let the principal adapt qt(⋅ ∣ s)
using a simple learning rule that trades off profit and a soft fairness
penalty. In each state s, the
principal maintains weights over a finite grid of contracts ℬ and updates them via exponentiated gradient
on a noisy proxy objective
$$
\widehat J_t(b) \;=\; \sum_{i=1}^n \big(y_{i,t} - p_{i,t}\big) \;-\;
\lambda\, \widehat{\mathrm{Ineq}}_t(b),
$$
where $\widehat{\mathrm{Ineq}}_t$ is computed from recent logged wealth changes (e.g. a sliding-window variance of $\Delta w_{i,t}$). This induces policy drift endogenously: when the environment changes (through $s_t$) or the agent pool is heterogeneous, the principal shifts mass toward contracts that improve its objective. Importantly, the audit does not assume any particular learning algorithm; we use adaptivity here only to test that the confidence sequence maintains coverage under history-dependent $q_t$.
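A minimal sketch of this update (simulation-side machinery only; the audit never relies on knowing it):

```python
import numpy as np

def eg_update(weights, j_hat, lr=0.1):
    """Exponentiated-gradient update of the principal's weights over a
    finite contract grid in one state, given the noisy proxy objective
    values J_hat(b) (profit minus lambda * inequality proxy).  This is
    the simulator's drift generator; the audit itself never assumes it.
    """
    logits = np.log(np.asarray(weights, dtype=float)) \
        + lr * np.asarray(j_hat, dtype=float)
    logits -= logits.max()          # stabilize before exponentiating
    w = np.exp(logits)
    return w / w.sum()              # new q_t(. | s) over the grid B
```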
We focus on wealth-based fairness functionals $F(W_t)$, including (i) $1-\mathrm{Gini}(W_t)$ with an enforced mean-wealth floor $\bar w_t \ge \mu_{\min}$ (implemented in simulation by adding a fixed baseline transfer to all parties), and (ii) the Jain index on agent wealth alone. To align with the theory, we explicitly enforce bounded increments $\Delta w_{j,t}\in[\underline B,\overline B]$ by clipping payments and outcomes in the simulator and by ensuring $m_t \in [0, m_{\max}]$ and $y_{i,t} \in [-y_{\max}, y_{\max}]$. These are not merely technicalities: the empirical width of the confidence sequence is driven by the realized variability, but the validity guarantee hinges on a correct deterministic range bound. We therefore treat ``range calibration’’ as an input to the empirical exercise: the auditor must be conservative about $\underline B,\overline B$ and about the Lipschitz proxy $L$ for the chosen $F$.
For each simulated run, we compute the streaming confidence sequence
for
$$
\mu_t \;=\; \frac{1}{t}\sum_{k=1}^t \mathbb E\!\left[F(W_k)\mid \mathcal
F_{k-1}\right],
$$
using the same martingale CS construction as in our theoretical section
(implemented with a bounded-differences mixture boundary). We then record the first time $t$ at which $\mathrm{LCB}_t \ge \tau$ (certification) and, symmetrically, the first time at which $\mathrm{UCB}_t < \tau$ (early rejection). To interpret these stopping times as ``power,’’ we run paired experiments where we can compute a ground-truth benchmark: since we control the simulator, we approximate $\mu_t$ by Monte Carlo conditioning on $\mathcal F_{t-1}$ (holding the realized history fixed but resampling the one-step noise). This is not available in practice, but it allows us to measure (i) empirical coverage of the CS and (ii) the probability of certification as a function of $T$ and the fairness gap $\mu_T - \tau$.
The central output is an audit power curve plotting $\Pr(\exists\, t \le T : \mathrm{LCB}_t \ge \tau)$ against $T$, stratified by the fairness gap and by the overlap/drift regime. Three patterns are robust across parameterizations. First, when $\mu_T$ exceeds $\tau$ by a comfortable margin, certification occurs quickly and the stopping time concentrates tightly; empirically, the median stopping time scales roughly like $O((\mu_T - \tau)^{-2}\log(1/\delta))$, consistent with concentration intuition. Second, when $\mu_T$ is close to $\tau$, the CS often remains inconclusive over long horizons: the auditor does not falsely certify (by coverage), but it also cannot ``force’’ a decision without more information. This is precisely the operational meaning of an anytime-valid guarantee: it trades premature false assurance for a transparent dependence on data. Third, larger per-round variability (induced either by higher $\sigma$ in outcomes, more heterogeneous $\beta_i$, or more aggressive policy drift) shifts the power curve rightward, increasing the rounds needed for certification.
To test robustness, we vary (i) the learning rate of the principal (faster drift) and (ii) the degree of strategic elasticity (how strongly $e_{i,t}$ responds to $\alpha_t$, and how often agents reject). Classical fixed-policy concentration can fail badly here because the distribution of $F(W_t)$ is nonstationary and endogenous. In contrast, our diagnostics focus on the claim the auditor actually needs: time-uniform coverage. Across drift regimes, we track the event
$$
\mathcal E = \{\exists\, t \le T:\ \mu_t < \mathrm{LCB}_t\},
$$
and estimate $\Pr(\mathcal E)$ over many runs.
Empirically, $\Pr(\mathcal E)$ stays below the nominal $\delta$ (up to Monte Carlo error), even in regimes where the realized fairness trajectory is highly
path-dependent and exhibits long transients. This is the main sense in
which the simulation supports the theory: not that the audit is always
decisive, but that when it is decisive it is not spuriously so.
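The diagnostic reduces to a simple scan once ground-truth paths are available; a sketch, assuming each run supplies the Monte Carlo $\mu_t$ path alongside the audit’s lower boundary:

```python
import numpy as np

def coverage_violation_rate(runs):
    """Estimate Pr(E) = Pr(exists t <= T : mu_t < LCB_t) across runs.

    Each run is a (mu_path, lcb_path) pair: the Monte-Carlo ground-truth
    conditional target (available only in simulation) and the audit's
    lower boundary.  The estimate should stay below the nominal delta.
    """
    hits = sum(
        bool(np.any(np.asarray(mu) < np.asarray(lcb))) for mu, lcb in runs
    )
    return hits / len(runs)
```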
Two limitations are worth making explicit. First, the simulator necessarily hard-codes boundedness and (for Gini) a mean-wealth floor; in deployment these must be justified institutionally (e.g. payment caps, escrow constraints, or explicit baseline compensation). Second, our power curves are conditional on the chosen fairness functional and on conservative bounds. A regulator that insists on a highly sensitive metric, or that cannot credibly bound increments, should expect the audit to require substantially more data or to be frequently inconclusive. We view this not as a weakness but as a policy-relevant output: it makes transparent which fairness notions are certifiable at reasonable sample sizes under adaptive contracting.
These simulation results set up the next empirical module, where we move from fully specified synthetic environments to semi-synthetic platform-style logs, and evaluate not only stopping times but also false positive/false negative rates under controlled injections of unfairness.
Our simulation study isolates the statistical logic of anytime-valid auditing under adaptivity, but it deliberately abstracts from the messy structure of real platform logs: heterogeneous tasks, irregular participation, missingness, and payment rules that are only partially parameterized. We therefore complement it with a semi-synthetic empirical module that uses real logged traces while preserving experimental control over the contracting and fairness properties that the auditor is asked to certify. The goal is operational: quantify (i) how often our audit would certify or reject in finite samples when the underlying system is truly fair or unfair, and (ii) how sensitive these rates are to the range and Lipschitz calibrations that the theory requires.
We start from a platform-style dataset consisting of time-stamped
interactions between a principal (the platform) and a population of
workers. Each interaction includes covariates that we treat as the
observable state st (task
category, predicted duration, location, baseline price, worker tenure,
congestion indicators, etc.) and a realized outcome proxy yi, t
(e.g. completed units, verified quality score, or revenue attributable
to worker i). Because the raw
data typically record realized payments but not counterfactual payments under alternative contracts, we use the dataset only as a source of $(s_t, y_{i,t})$ trajectories and participation patterns, and then regenerate payments under a specified contract class. Concretely, we define a bounded contract grid $\mathcal B$ of auditable linear-share forms
$$
b = (\alpha, m) \in [0,1] \times [0, m_{\max}],
\qquad
p_{i,t}(b) = m + \alpha\, y_{i,t},
$$
together with deterministic payment caps that enforce $p_{i,t} \in [0, p_{\max}]$ and hence bounded wealth increments. Participation is treated as an action $a_{i,t} \in \{\text{accept}, \text{reject}\}$:
in the simplest variant, we take the observed accept/reject decisions as
fixed, while in a richer variant we fit a probabilistic participation
model on historical data and then simulate ai, t
under counterfactual b to
capture selection effects. In either case, the resulting semi-synthetic
log matches the auditor’s interface: each round records $(s_t, b_t, a_{1:n,t}, y_{1:n,t}, p_{1:n,t})$ and a declared propensity vector $q_t(\cdot \mid s_t)$.
A practical nuance is that real systems do not natively provide VRF
proofs. In our semi-synthetic evaluation we therefore treat propensity
integrity in two layers: (i) the statistical layer, where the auditor trusts the logged $q_t$ and uses it for (optional) overlap and counterfactual analyses; and (ii) the cryptographic layer, where we separately evaluate the value of VRF-style commitment by simulating two regimes: a ``committed’’ regime in which $(q_t,b_t)$ are consistent by construction, and an ``opportunistic’’ regime in which we allow adversarial post hoc editing of $q_t$ to illustrate how quickly off-policy estimates can become meaningless without verifiable randomization. This split lets us be explicit about what is statistical and what is cryptographic: the confidence sequence controls false certification only when the logged propensities are the ones actually used, whereas VRF mechanisms are what make that condition institutionally credible.
To evaluate false positive/false negative rates, we need repeated instances where the ground-truth deployment-average fairness μT is above or below the threshold τ by a known margin. We accomplish this by injecting unfairness through mechanisms that mirror plausible platform behaviors while remaining auditable from the log.
The first injection channel is direct payment discrimination. Partition agents into two observed groups $g(i) \in \{0,1\}$ (interpretable as cohorts, regions, or any policy-relevant partition available to the auditor). For an unfairness level $\kappa \ge 0$, define a modified payment rule
$$
p^{(\kappa)}_{i,t}(b) = \min\big\{p_{\max},\ \max\{0,\ m + \alpha y_{i,t} - \kappa\cdot\mathbf 1[g(i)=1]\}\big\},
$$
which preserves boundedness and LL by construction but creates systematic wealth divergence.
The second channel is propensity (allocation) discrimination. Here the principal uses different propensities conditional on group membership or on state variables correlated with group. For instance, the principal may allocate high-$\alpha$ contracts (high incentive/high worker surplus) at lower probability for $g = 1$:
$$
q^{(\kappa)}_t(b \mid s_t, g) \;\propto\; q^{(0)}_t(b \mid s_t)\cdot\exp\big(-\kappa\cdot\mathbf 1[b \in \mathcal B_{\text{high-}\alpha}]\cdot\mathbf 1[g=1]\big),
$$
followed by renormalization. This kind of unfairness is subtle in one-step outcomes but accumulates in wealth, which is precisely why wealth-based $F(W_t)$ is a useful audit target.
The third channel is differential participation (selection). Even when posted contracts are identical, a platform may effectively induce differential rejection (e.g. through frictions, delayed payouts, or information asymmetries). In the variant with simulated participation, we apply a group-dependent outside-option shift $\Delta u^{\mathrm{out}}$ that changes reject rates, which then changes realized wealth distributions through selection. This injection is valuable because it produces unfairness without explicit payment discrimination, stressing that an audit based on $W_t$ must treat rejection as an action that affects welfare and inequality.
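The first two channels are easy to state in code; a sketch with illustrative parameter names, encoding $\mathcal B_{\text{high-}\alpha}$ as a boolean mask over the grid:

```python
import numpy as np

def discriminatory_payment(y, group, alpha, m, kappa, p_max):
    """Channel 1: group-dependent payment penalty kappa for g(i) = 1,
    clipped to [0, p_max] so LL and boundedness hold by construction."""
    raw = m + alpha * np.asarray(y, dtype=float) \
        - kappa * (np.asarray(group) == 1)
    return np.clip(raw, 0.0, p_max)

def tilted_propensities(q0, high_alpha_mask, g, kappa):
    """Channel 2: exponentially down-weight high-alpha contracts for
    group g = 1, then renormalize to a valid distribution over B."""
    q = np.asarray(q0, dtype=float).copy()
    if g == 1:
        q = q * np.where(np.asarray(high_alpha_mask), np.exp(-kappa), 1.0)
    return q / q.sum()
```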
For each unfairness level κ and horizon T, we generate many semi-synthetic deployments by replaying the observed (st, yi, t) trace and sampling contracts bt ∼ qt(⋅ ∣ st) under the specified regime. On each deployment we run the streaming audit and record (i) the first time t at which LCBt ≥ τ (certification), (ii) the first time t at which UCBt < τ (rejection), and (iii) whether neither occurs by T (inconclusive). Because the audit target is the conditional-expectation average $\mu_t=\frac{1}{t}\sum_{k=1}^t\mathbb E[F(W_k)\mid\mathcal F_{k-1}]$, we approximate ground truth in this semi-synthetic setting by resampling only the internal randomization (contract draws and, when applicable, participation draws) while holding the empirical (st, yi, t) path fixed. This mirrors how an auditor would reason about a fixed deployment history with stochastic policy execution.
We summarize performance with two complementary notions of error. The first is anytime false certification, the probability that the audit ever certifies when the true $\mu_t$ is below threshold at some time. Operationally we estimate
$$
\Pr\big(\exists\, t \le T:\ \mathrm{LCB}_t \ge \tau \ \text{and}\ \mu_t < \tau\big),
$$
which should be controlled at approximately $\delta$ when the deterministic bounds are correctly specified. The second is decision error with an indifference band: for a tolerance $\gamma > 0$, we treat $\mu_T \in [\tau-\gamma, \tau+\gamma]$ as ``near-threshold’’ and report false positives $\Pr(\text{certify by } T \mid \mu_T \le \tau - \gamma)$ and false negatives $\Pr(\text{not certify by } T \mid \mu_T \ge \tau + \gamma)$.
This decomposition is important for practice: near threshold,
inconclusiveness is not a failure but a truthful reflection of
insufficient evidence.
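Estimating these rates from repeated deployments then reduces to conditioning on which side of the band the ground truth falls; a sketch under the stated conventions:

```python
import numpy as np

def error_rates(results, tau, gamma):
    """False positive / false negative rates outside the indifference
    band [tau - gamma, tau + gamma].

    results: iterable of (mu_T, certified) pairs, one per deployment,
    where mu_T is the (simulation-only) ground truth and certified is
    whether LCB_t >= tau occurred by horizon T.  Near-threshold runs
    are excluded by construction.
    """
    unfair = [cert for mu, cert in results if mu <= tau - gamma]
    fair = [cert for mu, cert in results if mu >= tau + gamma]
    fp = float(np.mean(unfair)) if unfair else float("nan")
    fn = float(np.mean([not c for c in fair])) if fair else float("nan")
    return fp, fn
```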
Three qualitative results recur across datasets and injection channels. First, under conservative range bounds and correct implementation, false certification is rare and empirically tracks the nominal level: when we set δ = 0.05, the probability of ever certifying an actually unfair deployment remains near or below 5% across a broad range of κ that place μT below τ. Second, false negatives are governed primarily by variance and selection: regimes with high dispersion in yi, t and high reject elasticity produce wide confidence sequences and thus delayed or absent certification even when the system is truly fair. Third, miscalibration of the deterministic bounds is the dominant failure mode: if we intentionally underestimate the per-round range (e.g. by ignoring rare but real outlier outcomes), false certification can rise sharply. This is not surprising mathematically, but the semi-synthetic exercise puts numbers on it and motivates an institutional response: range bounds are not an innocuous modeling choice, they are a compliance-critical engineering parameter.
The semi-synthetic results suggest a concrete checklist for deploying wealth-based fairness audits.
If payments or outcomes are unbounded or subject to rare spikes, the correct response is not to ``hope the tails behave’’ but to impose caps, escrow rules, or throttles that make $\underline B,\overline B$ credible and log-verifiable. The audit’s validity is only as strong as these deterministic constraints.
Choosing F ex post invites both gaming and confusion. For sensitive functionals such as 1 − Gini, enforce a mean-wealth floor (via baseline transfers or restricted reporting windows) so that the Lipschitz proxy is well-defined on the feasible set.
Near threshold, an anytime-valid procedure will often refuse to decide quickly. Reporting should therefore include not only a binary decision but also the terminal interval $[\mathrm{LCB}_T, \mathrm{UCB}_T]$ and, when useful, the implied sample size required to separate $\tau$ from the current estimate at the desired confidence.
If the audit relies on propensities (for overlap diagnostics or counterfactual analyses), then propensity integrity must be enforced with commitments such as VRFs; otherwise, the statistical layer can be satisfied while the system is manipulable. In contrast, LL violations can be certified deterministically from payments alone, so they should be monitored continuously regardless of whether propensities are committed.
Before using an audit for enforcement, we recommend the exact semi-synthetic exercise we report here: take historical logs, inject controlled unfairness at multiple strengths, and verify that the audit (a) does not spuriously certify, and (b) has adequate power at policy-relevant effect sizes.
Taken together, these semi-synthetic exercises bridge the gap between our theoretical guarantees and a regulator’s implementation concerns: they show where the audit is robust (adaptivity, selection, heterogeneity) and where it is brittle (range miscalibration, unverifiable propensities). This sets up the broader question we turn to next: how to design contracting and logging rules so that the set of auditable fairness notions is as large as possible without unduly constraining performance.
Our results can be read as mapping a ``stability–auditability frontier’’ for fairness regulation in online contracting systems. On one axis sits the ambition of the compliance claim—for example, whether one wishes to certify a simple pointwise constraint (LL), a deployment-average scalar of realized welfare, or a distribution-sensitive functional of cumulative wealth such as 1 − Gini or a Rawlsian minimum. On the other axis sits the stability of the statistic being audited, in the precise sense needed for anytime-valid inference under adaptivity: bounded increments and a Lipschitz envelope that controls how much the target can move when the ledger moves by one round. The frontier is sharp: whenever the audited object is sensitive to rare events, to near-zero denominators (as with Gini when mean wealth is small), or to unlogged confounding (as in many counterfactual questions), validity is not merely harder—it can become ill-posed unless the institution redesigns the system to restore stability.
This perspective clarifies why wealth-based fairness is simultaneously attractive and delicate. It is attractive because wealth accumulation is what makes repeated interactions policy-relevant: small per-round disparities that are invisible in myopic outcome metrics can compound into durable inequality. But it is delicate because compounding also amplifies sensitivity. A fairness functional F(Wt) can be stable only on a restricted domain: bounded increments, bounded horizon (or controlled growth), and, for scale-normalized measures, lower bounds on relevant denominators. The practical implication is that auditability is not a purely statistical feature of the world; it is a design choice. A regulator who insists on certifying an unstable metric without imposing stabilizing design constraints is, in effect, demanding a proof without an axiom.
The most direct way to move along the frontier is to impose constraints that are both operational and log-verifiable. Payment caps, floors, escrow rules, and throttles are often discussed as product or risk controls; our contribution is to interpret them as auditability infrastructure. If $\Delta w_{j,t}\in[\underline B,\overline B]$ is violated even rarely, the mathematics of anytime-valid certification deteriorates quickly because the relevant concentration tools depend on worst-case increments. Conversely, when boundedness is enforced by design, the auditor can treat range parameters as institutional facts rather than model estimates. Similarly, for fairness functionals with normalization (e.g., $1-\mathrm{Gini}$), a mean-wealth floor $\bar w_t \ge \mu_{\min}$ is not a technical nuisance but an explicit policy lever: baseline transfers, restricted reporting windows, or a requirement to audit net-of-fixed endowments can keep the system in a domain where $F$ is Lipschitz and thus certifiable.
Any counterfactual or ``policy comparison’’ question requires overlap, and overlap is expensive because it forces the principal to sometimes select contracts that are not myopically optimal. The right goal is therefore not maximal randomization but \emph{minimal randomization sufficient for audit}. Concretely, one can impose a lower bound $q_t(b\mid s)\ge \eta$ on a small, pre-registered exploration set of contracts, where $\eta$ is chosen to meet a target audit precision at horizon $T$. This reframes exploration as compliance infrastructure: the platform purchases ``auditability capital’’ by paying a small efficiency cost. In practice we expect $\eta$ to be heterogeneous across
states—larger where fairness concerns or heterogeneity are greatest,
smaller where behavior is well understood—but the principle remains that
overlap should be treated as a first-class policy variable, not as an
incidental artifact of experimentation. Moreover, because the relevant
guarantees are anytime-valid, this exploration can be scheduled
adaptively (e.g., turned on when intervals widen) without sacrificing
inferential validity, provided the logging and propensity integrity are
maintained.
The statistical layer of any overlap-based audit depends on the integrity of the logged propensities $q_t(\cdot \mid s_t)$. Absent a commitment mechanism, a sophisticated principal can ``explain’’ any realized action ex post by retroactively editing propensities, defeating both diagnostics and counterfactual estimation. The VRF-based construction we study is a clean solution because it enforces a simple sequencing norm: propensities are declared and committed \emph{before} outcomes are realized, and the realized $b_t$ is demonstrably sampled from that declaration. Beyond cryptography, there is a governance lesson: regulators should specify not only what must be logged, but also \emph{when} it must be logged relative to outcome realization. In our setting, timing is part of the definition of manipulability. A practical corollary is that platforms should standardize propensity schemas (state variables included, discretization of $\mathcal B$, and versioning of contract classes) so that auditors can interpret $q_t$ consistently across time and across product changes.
LL illustrates a class of constraints that are deterministically auditable from logs: they admit cryptographic proofs of violation and require no probabilistic calibration. Fairness targets, by contrast, are inherently statistical when defined in expectation under adaptive behavior. Enforcement regimes should respect this difference. We view it as a mistake to treat all compliance checks as if they were of the same epistemic type. A more coherent enforcement design is two-tiered: (i) continuously monitor hard, pointwise constraints (LL, payment caps, declared contract class membership, VRF verification), and (ii) monitor soft, distributional targets with confidence sequences that explicitly allow inconclusive outcomes near threshold. This separation reduces both regulatory risk (by making ``hard’’ violations quickly contestable) and platform risk (by making clear when non-decision is the correct outcome).
Several limitations are substantive rather than cosmetic. First, our fairness guarantees rely on a Lipschitz envelope for $F$ on a bounded domain. Many appealing notions (tail risk measures, quantile-based parity constraints, or metrics that condition on rare subpopulations) are intrinsically non-Lipschitz or effectively unbounded in finite samples; auditing them may require stronger design interventions (e.g., enforced minimum sample sizes per subgroup) or different inferential tools (e.g., robust or trimmed functionals). Second, the log-based approach presumes that outcomes $y_{i,t}$ are verifiable proxies. In many platforms, outcome attribution is noisy, delayed, or manipulable by the principal (e.g., through measurement choices). Without an external measurement channel or a trusted attestation mechanism, any audit can be undermined at the measurement layer even if payments and propensities are committed.
Third, while our model accommodates strategic agent responses in the martingale sense (adaptivity), it does not model collusion, sybil attacks, or coordinated gaming of the fairness functional. A principal who can create or merge ``agents’’ can mechanically alter inequality measures without changing underlying treatment; conversely, agents might coordinate to reshape the wealth distribution. Addressing identity and collusion requires institutional primitives (identity verification, anti-sybil rules) that sit outside our statistical argument but are essential for real deployments.
Several directions are natural. One is to tighten the stability requirements. Lipschitzness is sufficient but not necessary; exploiting structure in F (e.g., smoothness, self-bounding properties) may yield narrower confidence sequences and hence less required randomization. Another is to move from one-step counterfactual auditing to truly dynamic contracting in Markov settings where actions affect future states, participation, and learning. Here, valid off-policy evaluation typically requires either bounded importance weights over trajectories or mixing assumptions that are often empirically contestable; formalizing what can be credibly assumed—and what must be engineered through policy constraints—remains open.
A third direction is mechanism design under audit constraints: if the principal internalizes that only auditable fairness notions will be enforced, how should contracts be chosen to maximize performance subject to (τ, δ)-compliance? This turns the audit from a passive diagnostic into an active design constraint, yielding a new frontier between efficiency and certifiable equity. Finally, privacy and confidentiality matter: logs rich enough to support fairness auditing can reveal sensitive information about workers or trade secrets about contracting. Developing cryptographic or differential-privacy layers that preserve auditability while protecting participants is, in our view, not an optional add-on but a central requirement for deployment.
Taken together, these considerations support a pragmatic message. Fairness auditing is feasible under adaptivity when the institution commits to stable targets, bounded ledgers, and transparent randomization. When these conditions are absent, the right response is not to overfit statistical fixes, but to redesign the system so that the desired fairness claim becomes a property that can, in fact, be audited.
We set out to clarify a simple but often blurred question in online contracting environments: when a regulator demands that a principal be ``fair,’’ what exactly is being demanded in terms of what can be verified from the operational record, and what must instead be inferred under uncertainty? Our central message is that fairness regulation in adaptive, strategically populated systems is not primarily limited by the sophistication of statistical tooling, but by the stability properties of the objects being audited and by the extent to which the platform commits to log- and timing-level primitives that make manipulation contestable.
At the conceptual level, we framed fairness auditing as an exercise in anytime-valid inference under adaptivity. The principal may change contract propensities $q_t(\cdot \mid s_t)$
as it learns or optimizes; agents may respond strategically; and the
resulting data are emphatically non-i.i.d. This adaptivity is not an
inconvenient detail: it is the defining feature of real platforms. The
appropriate target for regulation is therefore not a stationary
population quantity, but a deployment-average quantity indexed by the
information set ℱt − 1 that is available
to the platform when it acts. In our formulation, the compliance claim
concerns the running average of conditional expectations,
$$
\mu_t \;:=\; \frac{1}{t}\sum_{k=1}^t \mathbb E\!\left[F(W_k)\mid
\mathcal F_{k-1}\right],
$$
which is the natural object that remains meaningful when policies drift
and responses are endogenous.
At the technical level, our contribution is to show how this target becomes auditable with anytime-valid inference once two design commitments are in place. The first is \emph{tamper-evident logging}: an append-only record of states, realized contracts, actions, outcomes, and payments. This transforms certain compliance notions from statistical claims into deterministic checks. Limited liability is the canonical example: because payments $p_{i,t}$ are explicitly logged, LL violations are not matters of estimation, but of record. The auditor can either certify compliance by verifying $\min_{i,t} p_{i,t} \ge 0$, or point to the precise offending entry. The second commitment is \emph{verifiable randomization}: the principal must commit to $q_t(\cdot \mid s_t)$ before outcomes are realized and must sample $b_t$ in a publicly verifiable manner (e.g., via a VRF-based construction). This commitment is not merely cryptographic hygiene. It makes randomization a governed object: the platform cannot retroactively rationalize behavior by rewriting propensities, and the auditor can treat the logged propensities as factual inputs rather than as self-reported narratives.
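To illustrate how lightweight the deterministic tier is, a sketch of the LL scan (record schema assumed for illustration):

```python
def check_limited_liability(ledger):
    """Deterministic LL audit: scan payment records for any negative
    entry.  Returns (True, None) on compliance, or (False, record) with
    the first offending entry.  The (t, i, payment) schema is an
    assumption; any append-only log with logged payments works.
    """
    for record in ledger:
        t, i, payment = record
        if payment < 0:
            return False, (t, i, payment)
    return True, None
```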
With these primitives established, the remaining burden is statistical, and here stability is decisive. We emphasized that fairness metrics are auditable only to the extent that they are stable functionals of the ledger. Bounded per-round wealth increments $\Delta w_{j,t}\in[\underline B,\overline B]$ and Lipschitz continuity of the fairness functional $F$ on the relevant domain imply that the adapted process $Z_t := F(W_t)$ cannot jump arbitrarily in a single round. This bounded-difference structure is exactly what is required by modern martingale concentration tools to deliver confidence sequences that remain valid under optional stopping and continuous monitoring. The operational output is an anytime-valid interval $[\mathrm{LCB}_t, \mathrm{UCB}_t]$ for $\mu_t$ such that, with probability at least $1-\delta$, the interval contains the true deployment-average target simultaneously for all $t \le T$. Consequently, a regulator can implement a clean enforcement rule: if $\mathrm{LCB}_T \ge \tau$, then fairness compliance $\mu_T \ge \tau$ holds at level $1-\delta$, even though the platform adapted throughout deployment.
This yields an interpretive lens that we believe is useful beyond the
particular constructions in the paper. Some compliance goals are ``hard’’ in the sense of being directly testable from the ledger (nonnegativity of payments, contract-class membership, or VRF verification). Others are ``soft’’
requiring statistical tolerance near the threshold. Treating these two
classes symmetrically leads to predictable pathologies: either
over-enforcement (punishing noise as if it were misconduct) or
under-enforcement (allowing manipulation by exploiting ambiguity). Our
framework makes the separation explicit and provides a route to
operationalizing soft targets without pretending they are deterministic
facts.
The framework also clarifies the economic meaning of randomization. In many platforms, exploration is defended as a learning device internal to the firm. In regulated environments, exploration has an additional role: it creates overlap and hence makes counterfactual questions identifiable from logs. When overlap is enforced and propensities are verifiable, the auditor can ask not only whether realized fairness met a threshold, but (in restricted settings) what fairness would have been under a fixed reference policy. The important economic point is that overlap is costly: it diverts actions away from myopic profit maximization. Our analysis therefore supports a design principle that is naturally expressed in policy language: require randomization sufficient for audit precision at the relevant horizon, rather than maximal randomization or ad hoc experimentation.
We do not claim that our conditions are innocuous. The need for bounded increments and Lipschitz stability is not an artifact of proof technique; it reflects a genuine impossibility of certifying unstable objects in adversarially adaptive systems. Likewise, our reliance on verifiable outcomes $y_{i,t}$ is a substantive institutional assumption. If the principal can manipulate measurement or attribution, then even perfect payment logs and perfect VRF proofs cannot rescue the audit: the failure occurs at the sensing layer, not at the inference layer. These limitations are valuable because they point regulators and system designers toward the correct locus of intervention. When auditability fails, it is often because the system has not been engineered to support the desired claim.
Looking forward, we see three directions as especially consequential. First, there is room to sharpen the stability–efficiency trade-off: Lipschitzness is a sufficient condition, but not always tight, and exploiting additional structure in fairness metrics may reduce the data requirements for certification. Second, the interaction between audit constraints and optimal contracting remains largely open. If the principal must maintain (τ, δ)-compliance in an anytime sense, what contracts maximize performance subject to that constraint, and how does the answer depend on the volatility bounds and on the fairness functional? Third, privacy and confidentiality must be treated as co-equal design requirements. Logs that make fairness auditable can also expose sensitive worker information or proprietary policy details; resolving this tension will likely require cryptographic commitments, secure aggregation, or carefully designed disclosure regimes that preserve verifiability while limiting leakage.
The broader takeaway is pragmatic. Fairness regulation for online contracting systems can be made operational if it is formulated as a claim about stable, logged quantities; if it distinguishes hard constraints from soft targets; and if it treats randomization, logging, and timing as compliance infrastructure rather than as afterthoughts. Under these conditions, the regulator is not asking for faith, and the platform is not asked to prove the unprovable. Instead, the fairness claim becomes what it ought to be in a high-stakes economic environment: a property that can be contested, audited, and certified under explicit assumptions that are themselves subject to institutional design.