Stochastic, adaptive contracting has become an attractive design choice in modern principal–agent settings precisely because it addresses the frictions that make deterministic contracts brittle. When the principal faces heterogeneous agents, shifting environments, and strategic responses, a single fixed schedule of transfers and performance thresholds can be simultaneously too rigid (failing under distribution shift) and too predictable (inviting gaming). By contrast, a stochastic contract—a principal committing to a distribution over feasible contracts and sampling a realized offer each round—can smooth incentives, reduce exploitability, and provide the exploration needed for learning. In practice, the logic is familiar: platforms vary bonus schemes across time and location to manage supply; firms rotate performance metrics to prevent narrow "teaching to the test"; procurement agencies randomize audits and contract terms to deter manipulation; and algorithmic systems tune incentives online to stabilize throughput under uncertain demand.
The same features that make adaptive randomization operationally appealing, however, make it difficult to govern. Two concerns recur in deployments: accountability and statistical validity. First, if the principal may quietly alter contract assignment as information accumulates, then ex post evaluations of fairness or legality can be undermined by plausible deniability: discriminatory treatment can be rationalized as "algorithmic adjustment," and harmful realized outcomes can be dismissed as "bad luck." Second, the data stream produced by an adaptive policy is inherently non-i.i.d.: the distribution of offers, actions, and outcomes shifts endogenously with history. Standard audit tools that treat the log as a static dataset—compute an average, run a regression, report a confidence interval under independence—can be invalidated by the very adaptivity that the system uses to remain profitable or stable.
These tensions have sharpened in 2026-era deployments, where algorithmic contracting is no longer a back-office optimization but a regulated interface between institutions and individuals. In labor platforms, creators’ marketplaces, and "AI-managed" internal work allocation, contracts are often issued continuously, contingent on observable context, and adjusted online. At the same time, governance frameworks increasingly demand that compliance be demonstrable, not merely asserted. Regulators and counterparties ask questions that are operationally concrete: Were payments ever negative (violating limited liability)? Were the announced rules for random assignment followed, or were they overridden in precisely the cases that mattered? Did the realized wealth trajectories induced by the system satisfy a stipulated fairness requirement, not just on average in a training period, but throughout deployment as the system adapted?
We use the term auditable fairness to describe a standard of evidence that is compatible with these realities. Auditable fairness is not merely the existence of a fairness metric, nor the claim that a policy is "fair by design." Rather, it is the ability of an external auditor—with access only to a tamper-evident operational log and to publicly checkable commitments about the principal’s randomization—to (i) compute a well-defined target fairness quantity, and (ii) issue a certificate of compliance (or a proof of violation) whose error probability is explicitly controlled, despite the fact that the policy and the population may evolve endogenously over time. In this sense, the object of interest is as much epistemic as normative: we are not only asking what fairness should mean, but also what forms of fairness can be verified under realistic information constraints.
A central motivation for randomized contracting is stability. When agents anticipate deterministic cutoffs, they may concentrate effort narrowly on measured dimensions, engage in timing games, or coordinate to exploit predictable rules. Randomization weakens such knife-edge incentives and can reduce the returns to manipulation. Moreover, learning-based principals naturally encounter an exploration–exploitation tradeoff: to improve contract design, they must sometimes try alternatives, which necessarily induces variation in treatment. Randomization supplies a disciplined way to introduce such variation without ad hoc discretion. Yet, absent verifiability, randomness can become a cloak rather than a commitment. If the principal can retrospectively claim that a favorable contract draw "just happened" for one group and not another, then the auditability of fairness collapses into a debate over intent. For governance, what matters is whether the process is verifiable: a third party should be able to check that the realized contract was generated from the declared distribution, and that the log of outcomes and payments has not been selectively edited.
This perspective suggests a practical separation between two layers of compliance. Some constraints are pointwise and thus amenable to deterministic certification from the log. Limited liability is the canonical example: if every logged payment is nonnegative, then the constraint is satisfied; if any payment is negative, the offending entry is a direct witness. Other objectives—fairness of cumulative wealth, welfare, or a Rawlsian floor—are statistical in the sense that they involve expectations over stochastic outcomes and strategic behavior. Even if all realized entries are accurate, fairness depends on latent counterfactuals (what would have happened under alternative histories) and on conditional expectations (what was predictable at each point in time). In adaptive environments, it is therefore natural to demand not certainty but high-confidence guarantees: an auditor should be able to say, "with probability at least 1 − δ, the deployment-average fairness exceeded the required threshold," and the statement should remain valid no matter when the audit is run.
This requirement pushes us toward sequential, anytime-valid inference. In the deployments we have in mind, audits are not a one-off event; they are periodic, sometimes triggered by complaints, and sometimes executed automatically. Any method that requires fixing the audit time in advance is easy to game (or simply operationally infeasible), and any method that assumes stationarity is fragile to policy drift. What we want instead are guarantees that hold uniformly over time: at each round, the auditor can update an interval for the target fairness quantity, and the interval remains statistically valid even if the principal adapted based on past data. This is precisely the setting in which martingale methods and confidence sequences are appropriate, because they treat adaptivity as a feature of the filtration rather than as a violation of assumptions.
To make this feasible, we deliberately focus on fairness notions that are functions of cumulative wealth trajectories—quantities that can be constructed from logged outcomes and payments. This is not because we believe wealth is the only morally relevant dimension, but because wealth is often the most directly contractible and verifiable proxy for benefit and burden in economic interactions. A fairness functional F(Wt) can encode inequality aversion (e.g., via 1 − Gini), egalitarian objectives (e.g., negative dispersion), or worst-off protection (e.g., a Rawlsian minimum). The key modeling move is to treat fairness as a stable functional: if wealth changes by a bounded amount in a single round, fairness should not jump arbitrarily. This stability, formalized through Lipschitz-type conditions on F over the relevant bounded domain, is what translates operational boundedness (payments and outcomes cannot explode) into auditability (fairness estimates concentrate over time).
We emphasize two limitations up front. First, any fairness requirement is inherently normative and context-dependent. Auditing can certify compliance with a chosen metric and threshold, but it cannot settle disagreements about what the metric should be, nor can it capture dimensions of harm that are absent from the log (dignity, procedural justice, unsafe work, or coercive outside options). Second, selection and strategic participation matter: if agents can opt out, then observed wealth trajectories reflect both treatment and endogenous composition. Our framework can still produce valid statements about the induced population of participants and, under additional overlap-type conditions, about certain counterfactual reference policies; but it does not magically identify fairness for unobserved counterfactual populations without further assumptions.
With these caveats, our goal is to illuminate a tractable governance path for adaptive contracting systems. The path is conceptually simple: require that (i) operational logs be append-only and tamper-evident, (ii) the principal’s randomization be publicly verifiable (so declared propensities are meaningful commitments), and (iii) fairness metrics be chosen from a class stable enough to admit anytime-valid inference under bounded increments. The resulting compliance regime respects why stochastic adaptivity is economically valuable—it preserves flexibility, exploration, and robustness—while making the resulting distributional consequences contestable. In short, the model is meant to clarify the tradeoff: we can have adaptive, randomized contracts and meaningful oversight, but only if we build systems so that fairness claims are not aspirational statements, but audit-ready objects with explicit error control.
Our framework sits at the intersection of four literatures that are often studied in isolation: classic contract theory (with its emphasis on Limited Liability, Individual Rationality, and Incentive Compatibility), learning-enabled principal–agent design (where contracts are updated online), statistical fairness auditing under adaptivity (where the data-generating process responds to history), and the systems/cryptography tools that make operational records contestable (verifiable randomness and tamper-evident logs). We briefly position our contribution relative to each, emphasizing both what we borrow and what we intentionally do not attempt to solve.
In canonical principal–agent models, Limited Liability (LL), Individual Rationality (IR), and Incentive Compatibility (IC) are treated as equilibrium constraints that shape the set of feasible contracts (e.g., moral hazard with hidden action, adverse selection with hidden type, or both). LL is particularly prominent in applications where transfers cannot be negative due to bankruptcy constraints, legal restrictions, or platform policy; it is also a workhorse assumption that changes optimal sharing rules and can induce distortions such as bunching at zero payments. In standard theory, these constraints are imposed by the modeler and then enforced by design.
We instead treat LL as an auditable property of realized operation: if payments are logged and the log is tamper-evident, then LL becomes a pointwise statement about the observed record. This moves LL from the realm of equilibrium reasoning into the realm of compliance verification. By contrast, IR and IC are less directly auditable in our setting because they depend on private costs, outside options, and beliefs. Even when we observe opt-out actions, we typically cannot infer whether a participating agent was strictly better off than her outside option, nor whether an action was chosen because it was optimal under the contract or because of unobserved shocks. Accordingly, our baseline audit targets focus on distributional properties of outcomes induced by the interaction (cumulative wealth trajectories), rather than attempting to certify IR/IC in a strong structural sense. We view this as a deliberate governance tradeoff: regulators often can and do enforce hard constraints on transfers (e.g., nonnegative pay, wage floors, non-withholding rules), while treating deeper incentive properties as matters for design review, stress testing, or ex ante approval rather than ex post proof.
More broadly, dynamic contracting and relational contract theories highlight that incentives and constraints unfold over time and depend on histories and continuation values. Our model accommodates this operationally—the principal may choose qt as a function of past logs, and agents may respond strategically—but our audit objects are intentionally non-structural: the statistical guarantees target conditional expectations given the filtration, without assuming stationarity or a particular equilibrium selection.
A second relevant stream studies principals who learn contracts online, sometimes via reinforcement learning, contextual bandits, or adaptive experimentation. Here, randomized contracts arise for familiar reasons: exploration to improve performance; robustness to non-stationary demand and heterogeneous agent pools; and reduced manipulability when agents attempt to game deterministic thresholds. Work in this area often emphasizes regret, sample efficiency, and strategic behavior (including information design and mechanism design under learning). A common theme is that the principal wants to retain flexibility to adapt qt based on outcomes, while agents respond to the induced incentives and may anticipate future changes.
Our perspective is complementary: we ask what kinds of guarantees about fairness and legality can be extracted from the same adaptive process. The main conceptual connection is that learning algorithms naturally produce non-i.i.d. logs, which invalidates audit procedures that assume static treatment assignment or fixed sampling plans. Put differently, even if the principal is using learning methods responsibly, the resulting data stream is still adversarial to naive inference. We therefore take adaptivity as a primitive feature rather than a pathology, and we build the audit layer using tools that remain valid under policy drift and strategic response.
We also differ from much of the learning-in-mechanisms literature in our choice of target. Many learning formulations optimize welfare, revenue, or regret relative to a benchmark policy; fairness enters as a constraint or a secondary objective. In our audit framing, fairness is a compliance target whose satisfaction must be certified with explicit error control from the operational log. This shift in objective aligns with regulated deployments, where the question is not only "is the policy optimal?" but "can the operator demonstrate that it stayed within required bounds during deployment?"
A large literature on algorithmic fairness proposes metrics (statistical parity, equalized odds, calibration, individual fairness, welfare-based criteria) and auditing methods to estimate them from data. Much of this work, however, presumes either i.i.d. samples from a fixed distribution or a batch dataset whose sampling mechanism can be treated as exogenous. In adaptive contracting, neither assumption is safe: the policy changes over time, the composition of participating agents may change endogenously through opt-out, and the outcome distribution may shift in response to incentives.
Our approach leverages a line of work in sequential analysis and martingale methods that explicitly treats adaptivity through filtrations. Confidence sequences, e-values, and other anytime-valid constructions are designed to remain correct under optional stopping and continuously monitored testing, which is precisely the operational reality of periodic audits. Conceptually, this is a natural governance fit: regulators rarely commit to a single audit time, and firms rarely can commit to keeping policies fixed until a predetermined evaluation date. The contribution we emphasize is that, when fairness is formulated as a stable functional of bounded wealth trajectories, one can combine bounded-increment assumptions with martingale concentration to obtain certificates for deployment-average fairness targets.
We stress a limitation here. Many fairness notions of interest depend on unobserved counterfactuals (e.g., "would the same individual have received a different contract under a different policy?") or on protected attributes that may not be logged for legal reasons. Our baseline audit targets are therefore intentionally modest: they certify properties of the induced wealth distribution among observed participants, as recorded. Counterfactual auditing is possible only under additional conditions (e.g., overlap/randomization and well-defined reference policies), and even then the target is typically a policy-level counterfactual rather than an individual-level one. This echoes a broader lesson in causal inference: identifiability requires design, not just clever estimation.
Finally, our emphasis on verifiable randomization and append-only logging draws on systems and cryptography ideas that are increasingly central to "algorithmic accountability" in practice. Transparency logs, secure audit trails, and cryptographic commitments are widely used in domains ranging from certificate authorities and supply chains to financial compliance. Verifiable random functions (VRFs) and related primitives provide publicly checkable proofs that a realized random draw was generated from a committed seed, preventing ex post manipulation while keeping the draw unpredictable ex ante.
In randomized contracting, this matters because fairness disputes often hinge on whether "randomness" was genuine or selectively invoked. A principal who can override random draws in edge cases effectively reintroduces discretion while retaining plausible deniability. By requiring that the principal commit to qt(⋅ ∣ s) and produce a verifiable proof for each realized bt, we turn randomization into an auditable commitment. This is not merely a technical embellishment: it changes what can be contested in a regulatory setting. Combined with tamper-evident logs, it allows an auditor to treat propensities as meaningful objects and to apply off-policy or sequential methods without relying on the operator’s goodwill.
At the same time, cryptographic integrity does not solve all governance problems. It does not guarantee that the logged outcome proxy yi, t is an adequate measure of contribution, nor that the state st is recorded without strategic feature engineering, nor that the fairness metric chosen captures all morally relevant harms. Our aim is narrower: to show that, conditional on a well-specified logging and randomization protocol, certain fairness statements become certifiable with explicit error probabilities even under adaptivity.
Taken together, these strands motivate the clean baseline model we introduce next. We specify the operational primitives (contracts, randomization, logging), separate deterministic from statistical compliance objects, and define audit targets that are meaningful under endogenous, non-stationary interaction while remaining verifiable from the log.
We now formalize a clean "one-step" (contextual) contracting model that isolates the objects we will later audit. The intent is not to fully characterize optimal contracts or equilibrium behavior—indeed, we allow the principal to adapt and agents to respond strategically—but rather to pin down (i) what is recorded in the operational log, (ii) what the auditor can verify deterministically versus only statistically, and (iii) which target parameters are meaningful under non-stationary, history-dependent interaction.
Time is indexed by rounds (episodes) t ∈ {1, …, T}. At the start of round t, an observable state or context st is realized (e.g., demand conditions, job attributes, or platform-side features). We treat st as publicly observable to the auditor ex post because it is logged; it may also be observed by agents, depending on the application, but nothing in the audit definitions will rely on agents observing st.
A contract is an element b ∈ ℬ,
where ℬ is a bounded, pre-specified
class of permissible contract terms. We keep ℬ abstract because the audit layer should be
agnostic to the operator’s contract design details. For intuition, one
auditable special case is a linear share plus floor,
pi, t = mt + αt yi, t, (αt, mt) ∈ [0, 1] × [0, mmax],
where yi, t
is a verifiable outcome proxy attributable to agent i in round t and pi, t
is the logged transfer. More generally, bt can include
schedules, thresholds, or discrete menus, as long as realized payments
are unambiguously determined from the logged variables.
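As a concrete illustration of this special case, the following sketch (ours, not part of the formal protocol; the class encoding and names are hypothetical) shows how a linear share-plus-floor contract determines payments mechanically from logged outcomes:

```python
from dataclasses import dataclass

# Hypothetical encoding of the linear share-plus-floor special case:
# p_{i,t} = m_t + alpha_t * y_{i,t}, with (alpha_t, m_t) in [0,1] x [0, m_max].

@dataclass(frozen=True)
class LinearContract:
    m: float      # floor payment, assumed in [0, m_max]
    alpha: float  # output share, assumed in [0, 1]

    def payment(self, y: float) -> float:
        # Realized transfer p_{i,t}, unambiguously determined from the
        # logged outcome proxy y_{i,t}; this is what makes auditing mechanical.
        return self.m + self.alpha * y

b = LinearContract(m=1.0, alpha=0.2)
assert b.payment(y=5.0) == 2.0  # p = 1.0 + 0.2 * 5.0
```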
The principal is allowed to choose contracts adaptively. Formally,
before selecting a realized contract in round t, the principal declares a
conditional distribution (propensity) over contracts,
qt( ⋅ ∣st) ∈ Δ(ℬ),
which may depend on the full past history through the filtration ℱt − 1 generated by the
log up to t − 1. The realized
contract is then drawn as
bt ∼ qt(⋅ ∣ st).
Crucially, in our operational protocol this draw is not a black box: the
principal must produce a publicly checkable proof (via a verifiable
random function, VRF) that the draw bt was generated
from the declared qt(⋅ ∣ st)
using an unpredictable but verifiable source of randomness.
Conceptually, the declaration qt is a commitment that
turns randomization into an auditable choice rather than managerial
discretion. This commitment is what later permits the auditor to treat
propensities as meaningful inputs to statistical procedures (and, in
optional extensions, to off-policy estimators).
There are n agents indexed
by i ∈ {1, …, n}.
After observing the posted contract bt (and possibly
the context st), each agent
chooses an action
ai, t ∈ 𝒜i ∪ {reject},
where the reject/opt-out action captures non-participation. The
environment then generates realized outcomes yi, t,
which we interpret as a measurable contribution (or performance proxy)
attributable to agent i in
round t. We do not assume a
stationary outcome model: the distribution of yi, t
may depend on (st, bt, ai, t),
on past history, and on unobserved shocks. The only structural
requirement is that yi, t
is verifiable (or at least logged in a way that is contestable), so that
payments can be checked against the contract.
Given (bt, st, ai, t, yi, t), the principal makes a realized payment pi, t to each agent. In applications, pi, t might be computed mechanically from bt and yi, t (which makes auditing easiest), or it might include discretionary components; our deterministic compliance checks will be framed in terms of the logged payments regardless.
To connect contracting to distributional compliance, we track wealth
increments for the principal and each agent. Let Δwi, t
denote agent i’s realized
wealth increment at time t,
and Δwp, t
the principal’s. We allow general utility accounting for non-monetary
costs and outside options,
$$
\Delta w_{i,t} :=
\begin{cases}
u_i(p_{i,t},a_{i,t},s_t,y_{i,t}) & \text{if }
a_{i,t}\neq\text{reject},\\
u_i^{\mathrm{out}}(s_t) & \text{if } a_{i,t}=\text{reject},
\end{cases}
\qquad
\Delta w_{p,t} := \sum_{i=1}^n (y_{i,t}-p_{i,t}).
$$
In the baseline audit, the auditor need not observe ui or uiout
directly; what matters is that the log contains a wealth proxy derived from
observable components. In the simplest instantiation, we take the proxy
wealth increment to be monetary (e.g., Δwi, t = pi, t
for participants and 0 for reject), and
treat non-monetary costs as part of the limitation of what can be
certified ex post. We then define cumulative wealth
$$
w_{j,t}:=\sum_{k=1}^t \Delta w_{j,k},
\qquad
W_t := (w_{p,t},w_{1,t},\dots,w_{n,t})\in\mathbb R^{n+1}.
$$
This cumulative wealth vector is the state on which our fairness and
welfare targets will be evaluated. Importantly, Wt is endogenous to the
interaction: it reflects the principal’s adaptive choices and agents’
strategic responses.
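As a concrete illustration (our sketch, using the simplest monetary proxy described above), the bookkeeping reduces to assembling per-round increment vectors and taking running sums:

```python
import numpy as np

def round_increment(p: np.ndarray, y: np.ndarray, rejected: np.ndarray) -> np.ndarray:
    """One round's increment vector (Δw_{p,t}, Δw_{1,t}, ..., Δw_{n,t}) under
    the monetary proxy: Δw_{i,t} = p_{i,t} for participants, 0 on reject, and
    Δw_{p,t} = Σ_i (y_{i,t} - p_{i,t}) over participating agents."""
    agent_dw = np.where(rejected, 0.0, p)
    principal_dw = float(np.sum(np.where(rejected, 0.0, y - p)))
    return np.concatenate(([principal_dw], agent_dw))

def cumulative_wealth(increments: np.ndarray) -> np.ndarray:
    """Stack of per-round increments, shape (T, n+1) -> cumulative wealth W_t."""
    return np.cumsum(increments, axis=0)
```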
The auditor observes an append-only, tamper-evident log of each
round’s transcript
(st, bt, a1 : n, t, y1 : n, t, p1 : n, t),
together with any cryptographic commitments and VRF proofs needed to
verify that bt was sampled
from the declared propensity qt(⋅ ∣ st).
We write ℱt for the
filtration generated by this log up to time t. All statistical validity claims
will be stated conditionally on ℱt − 1, which is the
appropriate way to formalize adaptivity: the principal may choose qt as any
measurable function of ℱt − 1, and agents may
choose actions as functions of current information and the anticipated
continuation, without invalidating martingale-based inference.
To enable anytime-valid concentration under adaptivity, we assume
bounded realized increments. Concretely, for each party j ∈ {p, 1, …, n}
and each round t, we
assume
$$
\Delta w_{j,t}\in[\underline B,\overline B],
$$
where the bounds may be set by design (caps/floors in contracts, limited
exposure, or platform constraints) or by the choice of wealth proxy.
This boundedness is not innocuous, but it aligns with practice:
regulated contracts typically have maximum payouts, and auditing
protocols typically rely on bounded-score proxies. We emphasize that we
do not assume i.i.d. data, stationarity, or a parametric outcome model.
We separate compliance objects into those that are deterministically verifiable from the log and those that require statistical inference.
First, Limited Liability (LL) is an example of a deterministic constraint. When LL takes the form pi, t ≥ 0 for all i, t, it is directly checkable from the recorded transfers. The audit target is therefore not an expectation but a logical statement over the realized transcript: any negative payment constitutes a certifiable violation tied to a specific log entry.
Second, we define welfare-type targets as functionals of the
cumulative wealth vector. For example, total welfare at time t can be written as
Welfare(Wt) := ∑j ∈ {p, 1, …, n}wj, t,
and Rawlsian welfare as Rawls(Wt) := minjwj, t.
These metrics are attractive for auditing because they are computable
from the log once the wealth proxy is specified.
Third, our main target is a fairness functional F : ℝn + 1 → ℝ applied to cumulative wealth, such as 1 − Gini(Wt), a Jain index, the negative variance of wealth, or a minimum-share criterion. The key regularity condition we impose for the baseline theory is Lipschitz stability: F is L-Lipschitz on the feasible wealth domain induced by the bounded increments. This is a deliberate modeling choice reflecting a governance intuition: fairness notions that are excessively sensitive to single-round perturbations are difficult to certify from finite logs.
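To fix ideas, here is a sketch (ours; the formulas are the standard definitions of the indices named above) of candidate welfare and fairness functionals on a wealth vector Wt; the mean-restricted 1 − Gini variant is treated separately later:

```python
import numpy as np

def welfare(W: np.ndarray) -> float:
    """Total welfare: sum of all parties' cumulative wealth."""
    return float(np.sum(W))

def rawls(W: np.ndarray) -> float:
    """Rawlsian welfare: wealth of the worst-off party."""
    return float(np.min(W))

def jain_index(W: np.ndarray) -> float:
    """Jain's fairness index, in (0, 1] for nonnegative wealth vectors."""
    return float(np.sum(W) ** 2 / (len(W) * np.sum(W ** 2)))

def neg_dispersion(W: np.ndarray) -> float:
    """An egalitarian objective: negative variance of wealth."""
    return float(-np.var(W))
```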
Because the process is adaptive, we do not target the empirical
average $\frac{1}{T}\sum_{t=1}^T
F(W_t)$ as if it were an i.i.d. sample mean. Instead, we target
the predictable deployment-average of fairness,
$$
\mu_t \;:=\; \frac{1}{t}\sum_{k=1}^t \mathbb E\!\left[F(W_k)\mid
\mathcal F_{k-1}\right],
$$
and evaluate compliance against a policy threshold τ via μT ≥ τ.
This parameter has two governance-relevant features. First, it is
well-defined under arbitrary policy drift: 𝔼[F(Wk) ∣ ℱk − 1]
is the operator’s predictable fairness at time k, given what was known just before
acting. Second, it aligns with operational auditing, where the question
is whether the system stayed within bounds during deployment, not merely on a
hypothetical stationary distribution.
This clean baseline model now lets us articulate the audit problem precisely: given the log and cryptographic proofs, an auditor should (i) deterministically flag any LL violations, and (ii) produce anytime-valid confidence sequences for welfare/fairness targets that remain correct under the adaptive, strategic interaction described above. We next formalize what the auditor is allowed to assume (and what it explicitly does not assume) in a threat model, and we define (ε, δ)-compliance statements that connect these targets to actionable regulatory conclusions.
Our auditing layer is intended to be robust to precisely those
features that make online contracting operationally attractive and
regulatorily challenging: the principal can adapt the contracting policy
in response to history, agents can respond strategically (including
opting out), and the resulting data stream is neither i.i.d. nor
stationary. For this reason, we separate the
economic degrees of freedom'' of the actors from theintegrity
assumptions’’ that make the log a meaningful evidentiary object. The
threat model clarifies what kinds of behavior we treat as part of the
regulated system (and hence must be handled by the audit), versus what
kinds of behavior would constitute falsification of the audit record
itself.
We allow the principal to be fully strategic subject to the protocol. In particular, the principal may choose qt(⋅ ∣ st) as any ℱt − 1-measurable function, including policies that drift rapidly over time, target particular subpopulations through state dependence, or attempt to "game" the fairness statistic by changing terms of trade across rounds. Likewise, agents may behave strategically and heterogeneously: their actions ai, t can depend on the posted contract, their private information, and expectations about future offers. Importantly, we do not impose equilibrium restrictions (Bayes–Nash, trembling-hand, etc.) because an auditor typically cannot validate such assumptions from an operational log. Finally, we allow the outcome-generating process for yi, t to be non-stationary and history dependent; the only role outcomes play in the baseline audit is through logged observables and boundedness.
This "maximally adaptive" stance is a feature, not a bug: it ensures that the validity of our statistical claims is not contingent on the operator adhering to a fixed policy class, nor on agents conforming to a stable behavioral model. The cost is that we must define our compliance targets in a way that remains well-posed under such adaptivity—hence the emphasis on predictable (conditional) targets such as 𝔼[F(Wt) ∣ ℱt − 1] rather than stationary-population averages.
Against these economic degrees of freedom, we posit a narrow set of integrity assumptions that make auditing feasible.
These assumptions do not require the auditor to trust the principal’s incentives or the agents’ incentives; they require only that the recorded transcript is an accurate, contestable account of what the system actually did, and that declared randomization is binding.
Equally important, we state what the auditor does not assume, since these omissions determine the interpretation (and limitations) of any certificate.
In short, our audit statements are designed to be valid under minimal behavioral assumptions, at the expense of targeting quantities that are meaningfully defined from the logged deployment process.
The log supports two qualitatively different kinds of compliance conclusions.
First, some constraints are pointwise and therefore deterministically auditable. Limited Liability in the form pi, t ≥ 0 is the canonical example: a single negative payment is a concrete violation tied to a specific round and agent, and the auditor can produce the corresponding log entry as evidence. Similar deterministic checks include syntactic validity of contract parameters, adherence to stated caps/floors, and (once specified) integrity checks for missingness.
Second, distributional targets such as welfare and fairness are inherently statistical because we target conditional expectations that reflect the system’s predictable behavior under uncertainty. Here, the right governance question is not whether the realized path happened to look fair (which can be luck or noise), but whether the system’s predictable fairness during deployment met a threshold, as captured by $\mu_T=\frac{1}{T}\sum_{t=1}^T \mathbb E[F(W_t)\mid\mathcal F_{t-1}]$.
We formalize compliance as a statement that blends deterministic constraints with probabilistic guarantees. Fix a horizon T, a fairness threshold τ, and a failure probability δ ∈ (0, 1). An auditing algorithm observes the log sequentially and outputs (i) deterministic flags for any pointwise violations and (ii) an anytime-valid confidence sequence [LCBt, UCBt] for the predictable deployment-average fairness μt.
We say the system is (ε, δ)-compliant with respect to threshold τ over horizon T if (i) no pointwise violation (e.g., a negative payment) appears anywhere in the log, and (ii) with probability at least 1 − δ, the predictable deployment-average fairness satisfies μT ≥ τ − ε. The role of ε ≥ 0 is to accommodate governance-relevant approximations that are not statistical in nature: discretization of a continuous contract space, conservative bounding of wealth proxies, or an explicitly permitted tolerance band around τ. In the baseline development, one can take ε = 0 when the target is exactly μT ≥ τ and the fairness functional is computed exactly from the proxy wealth.
Because μT is not
directly observed, compliance must be certified by an audit rule. Given an
anytime-valid confidence sequence, a natural risk-limiting decision rule
is:
Certify compliance at time
T ⇔ LCBT ≥ τ − ε.
By construction of confidence sequences, this certificate is risk-limiting:
Pr (LCBT ≥ τ − ε and μT < τ − ε) ≤ δ,
even though the underlying data are adaptive and non-stationary.
Symmetrically, one may flag noncompliance if UCBT < τ − ε,
with the analogous error control. When τ − ε lies inside the
interval, the audit is inconclusive; this is not a failure of the method
but an explicit reflection of finite-sample uncertainty.
A key operational feature is that the regulator may audit at unpredictable times, or the platform may need to monitor compliance continuously. For this reason we require time-uniform validity: with probability at least 1 − δ, the confidence statement holds simultaneously for all t ≤ T. This allows audits at stopping times (e.g., "trigger an investigation when complaints arrive") without inflating Type I error. When multiple metrics are audited (e.g., LL and fairness, or multiple fairness functionals), we can allocate failure budgets δ1, …, δM across metrics and apply a union bound so that overall failure probability remains controlled.
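Operationally, the decision rule and the multi-metric budget reduce to a few lines. A minimal sketch (ours), assuming the confidence sequence [LCBt, UCBt] is computed elsewhere:

```python
def audit_decision(lcb_t: float, ucb_t: float, tau: float, eps: float = 0.0) -> str:
    """Risk-limiting audit decision: certify when the lower bound clears
    τ - ε, flag when the upper bound falls below it, and otherwise report
    finite-sample inconclusiveness."""
    threshold = tau - eps
    if lcb_t >= threshold:
        return "certify-compliance"
    if ucb_t < threshold:
        return "flag-noncompliance"
    return "inconclusive"

def split_failure_budget(delta: float, num_metrics: int) -> list:
    """Union bound across M audited metrics: δ_m = δ / M keeps the overall
    failure probability at most δ."""
    return [delta / num_metrics] * num_metrics
```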
Finally, we emphasize an interpretive point that is often blurred in policy discussions. A compliance certificate at level δ is a guarantee about the system’s fairness with respect to the chosen fairness functional and the logged wealth proxy. It does not certify unobserved welfare components (e.g., effort costs), nor does it establish that the principal could not have achieved higher welfare while remaining fair. In that sense, the model illuminates a core tradeoff: by insisting on targets that remain well-defined and auditable under adaptivity and strategic response, we obtain strong error control and contestability, but we necessarily restrict attention to fairness notions that are stable and measurable from the operational record.
The next section makes these integrity conditions concrete by specifying how verifiable randomization and log integrity are implemented, and by listing the deterministic checks (including LL, boundedness, and missingness) that the auditor can perform before running any statistical certification procedure.
Our statistical guarantees in the next section rest on a simple prerequisite: the auditor must be able to treat the transcript as an accurate, contestable record of (i) what the principal declared it would do (the propensity function qt) and (ii) what the principal actually did (the realized draw bt and subsequent payments). This section makes that prerequisite operational by specifying (a) a VRF-based sampling protocol that binds the principal to its declared randomization and (b) a small set of deterministic checks that the auditor can run directly on the log before attempting any statistical certification.
To make propensity compliance verifiable, the contract space ℬ and each distribution qt(⋅ ∣ s) must admit a canonical encoding. For finite ℬ this is straightforward: the principal logs a probability vector {qt(b ∣ st)}b ∈ ℬ in a fixed order with a prescribed numerical precision. For continuous or high-dimensional ℬ, we require a declared sampling scheme with a canonical description (e.g., a parametric family with parameters θt(st) and a fixed base measure), together with an explicit discretization rule if the actual implementation is discretized. The key design principle is that, given the logged fields, the auditor can deterministically (i) reconstruct the declared distribution to within a known tolerance and (ii) reproduce the mapping from a uniform variate to the realized contract.
In practice, we recommend logging both (1) a human-interpretable object (e.g., parameters and family name) and (2) a machine-verifiable digest (e.g., a hash of the canonical serialization). The latter is what we cryptographically bind into the VRF input so that the principal cannot change its story about qt ex post.
Fix a VRF keypair (pk, sk) registered with the auditor prior to deployment. In round t, the principal must commit to the propensity object before the random draw is determined. A clean way to enforce this ordering is to define a round-specific VRF input that includes a commitment to qt.
Concretely, let Entryt − 1 denote the
previous log entry and let ht − 1 := H(Entryt − 1)
be its hash (or the running hash-chain value). Let Enc(qt(⋅ ∣ st))
be the canonical serialization of the declared propensity object at
state st,
and let ct := H(Enc(qt(⋅ ∣ st)))
be its commitment digest. Define the VRF input
xt := H (ht − 1 ∥ t ∥ st ∥ ct),
where ∥ denotes concatenation and H is a collision-resistant hash. The
principal computes the VRF output and proof
(rt, πt) := VRFsk(xt),
and deterministically maps rt to a uniform
variate ut ∈ [0, 1)
(e.g., by interpreting rt as an integer
and normalizing). The realized contract is then computed by a
deterministic sampling map
bt := Sample(qt(⋅ ∣ st), ut),
where Sample is fixed by protocol. For
discrete ℬ, Sample can be implemented via the inverse-CDF
rule on the ordered support. For continuous ℬ, Sample is
the declared procedure (e.g., inverse transform for a univariate family,
or a deterministic pseudocode for multivariate sampling) applied to
ut (and,
if needed, additional variates derived deterministically from rt).
The log entry for round t
contains, at minimum,
(st, Enc(qt(⋅ ∣ st)), ct, xt, rt, πt, bt, a1 : n, t, y1 : n, t, p1 : n, t),
together with the hash-chain value ht = H(ht − 1 ∥ Entryt)
and a digital signature under the principal’s signing key. Given these
fields, the auditor verifies: (i) hash-chain continuity and signature
validity; (ii) ct matches Enc(qt); (iii)
πt is a
valid VRF proof for input xt under pk; and (iv) recomputing ut and Sample(⋅) reproduces the logged bt.
Two operational remarks matter for soundness. First, including ht − 1 and t in xt prevents replay and makes the draw round-specific. Second, including ct in xt prevents the principal from choosing qt after observing the VRF output. In this sense, the VRF is not merely a randomness beacon; it is a commitment device that makes the principal’s declared propensity contestable.
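The auditor-side checks (ii)–(iv) can be made concrete in a few lines. The sketch below is ours and makes assumptions the text leaves open: `vrf_verify` stands in for a real VRF verification primitive (e.g., an ECVRF library), the log entry is a dictionary of the logged fields, and the sampling map is inverse-CDF over a finite ordered support; hash-chain continuity and signature checks (step (i)) are omitted for brevity.

```python
import hashlib

def H(*parts: bytes) -> bytes:
    # Collision-resistant hash with a fixed, canonical concatenation rule.
    return hashlib.sha256(b"||".join(parts)).digest()

def sample_inverse_cdf(probs, support, u: float):
    """Deterministic sampling map Sample(q, u): inverse CDF over an ordered support."""
    cdf = 0.0
    for p, b in zip(probs, support):
        cdf += p
        if u < cdf:
            return b
    return support[-1]

def verify_round(entry: dict, h_prev: bytes, pk, vrf_verify) -> bool:
    # (ii) commitment digest matches the canonical serialization of q_t
    c_t = H(entry["enc_q"])
    if c_t != entry["c_t"]:
        return False
    # Reconstruct the round-specific VRF input x_t = H(h_{t-1} || t || s_t || c_t)
    x_t = H(h_prev, str(entry["t"]).encode(), entry["s_t"], c_t)
    # (iii) VRF proof pi_t is valid for x_t under the registered public key
    if not vrf_verify(pk, x_t, entry["r_t"], entry["pi_t"]):
        return False
    # (iv) recompute u_t from r_t and check Sample(q_t, u_t) reproduces b_t
    u_t = int.from_bytes(entry["r_t"], "big") / 2 ** (8 * len(entry["r_t"]))
    return sample_inverse_cdf(entry["probs"], entry["support"], u_t) == entry["b_t"]
```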
When ℬ is continuous, implementations often discretize either the support or the CDF inversion. Because the auditor must obtain an exact match to accept a draw, we require that discretization be part of the declared sampling map and be deterministically reproducible (including rounding rules). Any approximation error can then be treated as an explicit governance slack ε (as in Section~) or bounded as a deterministic implementation error (e.g., total variation distance between the ideal and implemented sampling rule). The audit criterion is then: the realized bt must equal the output of the declared deterministic procedure applied to the VRF-derived ut.
Binding a draw to a declared propensity is only useful if the surrounding transcript is itself immutable. Accordingly, each round must be recorded in an append-only, tamper-evident structure (hash chain, signed ledger, or an external notarization service). The auditor’s baseline integrity checks are mechanical: verify that each entry is signed, that the hash chain links correctly, and that round indices are contiguous. These checks are not statistical; they are the evidentiary substrate on which statistical claims will later rest.
Before constructing any confidence sequence, the auditor should first run a set of deterministic checks that either (i) produce immediate violations with cryptographic evidence or (ii) certify that the boundedness and measurability conditions needed for martingale tools plausibly hold for the logged proxy.
The role of this section is deliberately narrow: we are not yet claiming that the system is fair, only that the transcript is trustworthy. VRF verification and log integrity make the stochastic elements of deployment contestable; LL, missingness, and boundedness checks ensure that the objects we will feed into martingale machinery are well-defined and satisfy the regularity conditions under which anytime-valid inference is possible. With these prerequisites in place, we can treat {F(Wt)}t ≥ 1 (or suitable increments) as an adapted, bounded process and construct confidence sequences that remain valid under the adaptivity and strategic response emphasized in Section~. The next section develops this statistical machinery.
Having made the transcript trustworthy (the realized contract draw is bound to the declared propensity and the log is tamper-evident), we can treat the deployed system as generating an adapted stochastic process relative to the auditor’s filtration {ℱt}t ≥ 0. The key methodological point is that we do not require independence, stationarity, or a fixed policy: the principal may update qt based on history, agents may respond strategically, and the induced distribution of outcomes may drift arbitrarily. What we do require for anytime-valid inference is that the statistic we audit can be expressed as a bounded adapted process, so that deviations from its conditional expectations form a martingale difference sequence to which time-uniform concentration applies.
Fix any ℱt-measurable scalar
statistic Zt computed from
the log at time t (e.g.,
welfare, Rawlsian minimum wealth, or a fairness functional F(Wt)).
Define the (predictable) conditional mean
mt := 𝔼[Zt ∣ ℱt − 1],
and the deployment-average conditional expectation
$$
\mu_t \;:=\; \frac{1}{t}\sum_{k=1}^t m_k
\;=\; \frac{1}{t}\sum_{k=1}^t \mathbb E[Z_k\mid \mathcal F_{k-1}].
$$
This μt is
the natural audit target in an adaptive environment: it is the average
performance that the deployed process delivers, conditional on what was
known just before each round. Importantly, μt remains
well-defined even when the data stream is non-i.i.d.
Let
Dt := Zt − mt.
Then {Dt}
is a martingale difference sequence: 𝔼[Dt ∣ ℱt − 1] = 0.
Writing partial sums $S_t:=\sum_{k=1}^t
D_k$, we obtain the decomposition
$$
\bar Z_t - \mu_t
\;=\;
\frac{1}{t}\sum_{k=1}^t (Z_k-m_k)
\;=\;
\frac{S_t}{t},
\qquad
\bar Z_t:=\frac{1}{t}\sum_{k=1}^t Z_k.
$$
Thus, any time-uniform bound on St immediately
yields a confidence sequence for μt centered at
the observed average Z̄t, despite
adaptivity.
Martingale concentration requires controlling tail growth of St, which in
turn is achieved by bounding Dt. In our
setting the auditor can verify ex ante that per-round wealth increments
lie in a bounded interval $\Delta
w_{j,t}\in[\underline B,\overline B]$. This implies that
cumulative wealth vectors remain in a known bounded set:
$$
W_t \in [t\underline B,t\overline B]^{n+1}.
$$
Therefore, for many audit-relevant functionals Zt = ϕ(Wt),
the auditor can derive deterministic bounds $Z_t\in[\underline z_t,\overline z_t]$
(possibly depending on t but
known in advance), which imply a bound on Dt.
Concretely, if $Z_t\in[\underline z,\overline z]$ for all t ≤ T, then $D_t\in[\underline z-\overline z,\overline z-\underline z]$, so $|D_t|\le (\overline z-\underline z)$. Even when only time-varying bounds are available, one can work with predictable envelopes $\underline z_t,\overline z_t$ and apply confidence sequence constructions for bounded but non-identically bounded differences; in what follows we emphasize the simpler uniform-bound case, since our wealth bounds imply uniform boundedness over any fixed horizon T.
We use the standard supermartingale method. Suppose Dt is
conditionally sub-Gaussian with scale parameter σ (a condition implied by
boundedness via Hoeffding’s lemma). Then for any fixed λ ∈ ℝ,
$$
M_t(\lambda)
\;:=\;
\exp\!\Big(\lambda S_t - \tfrac{\lambda^2\sigma^2}{2}\,t\Big)
$$
is a nonnegative supermartingale. Ville’s inequality yields
Pr (∃t ≤ T: Mt(λ) ≥ 1/δ) ≤ δ,
which can be rearranged into a time-uniform boundary for St of the
form
$$
S_t \;\le\; \frac{\log(1/\delta)}{\lambda} +
\frac{\lambda\sigma^2}{2}\,t
\qquad \forall t\le T
$$
with probability at least 1 − δ. Optimizing over λ gives a $\sqrt{t}$-type boundary. To avoid committing
to a single horizon T (or to
obtain a bound valid for all t ≥ 1 simultaneously), one uses
either (i) supermartingales that integrate Mt(λ)
over a mixing distribution on λ, or (ii) arguments that union
bound over geometrically increasing epochs. These constructions yield
the familiar anytime-valid scaling
$$
|S_t|
\;\lesssim\;
\sigma \sqrt{t\,\log\!\log t} \;+\; \sigma\sqrt{t\,\log(1/\delta)}
$$
(up to constants), and therefore
$$
|\bar Z_t-\mu_t|
\;=\;
\frac{|S_t|}{t}
\;\lesssim\;
\sigma \sqrt{\frac{\log\!\log t + \log(1/\delta)}{t}}.
$$
We will state explicit finite-sample widths in the next section; here
the point is conceptual: validity is obtained by controlling the running
maximum of a supermartingale, not by repeating fixed-time tests.
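For completeness, the fixed-time optimization over λ is a one-line calculus step; it is this fixed-t bound that the mixture and epoch constructions then extend to hold uniformly over time:
$$
\lambda^\star \;=\; \sqrt{\frac{2\log(1/\delta)}{\sigma^2 t}}
\qquad\Longrightarrow\qquad
\frac{\log(1/\delta)}{\lambda^\star} + \frac{\lambda^\star\sigma^2}{2}\,t
\;=\; \sigma\sqrt{2\,t\log(1/\delta)}.
$$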
Pure bounded-difference (Hoeffding-style) widths can be conservative
when realized variability is small. Martingale confidence sequences
admit variance-sensitive analogues based on the predictable quadratic
variation
$$
V_t \;:=\; \sum_{k=1}^t \mathbb E[D_k^2\mid \mathcal F_{k-1}],
$$
leading to Freedman/Bernstein-type boundaries of the schematic
form
$$
|S_t|
\;\lesssim\;
\sqrt{V_t\,\log(1/\delta)} \;+\; c\,\log(1/\delta),
$$
where c is a bound on |Dt|. While
Vt is not
directly observable, one can upper bound it using deterministic
envelopes, or employ empirical-Bernstein confidence sequences that
replace Vt
with an observable self-normalizer built from the realized Zt’s (at the
cost of slightly larger constants). Operationally, this matters because
fairness and welfare trajectories in many deployments are far less
volatile than worst-case bounds suggest; variance-adaptive sequences
tighten substantially and can certify compliance earlier.
The remaining technical step is to ensure that our audit statistics
are indeed bounded (or have bounded differences) under the logged wealth
proxy. For welfare and Rawlsian objectives this is immediate, since they
are Lipschitz functionals of Wt under
standard norms. For general fairness functionals F(Wt),
we impose L-Lipschitzness on
the relevant bounded domain:
|F(W) − F(W′)| ≤ L ∥W − W′∥ for
all feasible W, W′.
Under this condition and bounded wealth increments, we obtain
deterministic one-step stability:
|Zt − Zt − 1| = |F(Wt) − F(Wt − 1)| ≤ L ∥Wt − Wt − 1∥ = L ∥ΔWt∥.
If we take ∥⋅∥ to be the Euclidean
norm, then $\|\Delta W_t\|\le
\sqrt{n+1}\,\max\{|\underline B|,|\overline B|\}$, hence $|Z_t-Z_{t-1}|\le L\sqrt{n+1}\max\{|\underline
B|,|\overline B|\}$. This stability is what lets us translate the
primitive boundedness of per-round transfers and outcomes into
boundedness (or at least controlled range growth) of fairness statistics
computed on cumulative wealth.
Conceptually, Lipschitzness is the bridge between bounds (the contract cannot move anyone’s wealth too much in one round) and concentration (the fairness metric cannot jump too much in one round). It is also where limitations surface: some desirable disparity measures are not globally Lipschitz without additional domain restrictions (e.g., inequality indices that divide by mean wealth), which is why we will separately impose a mean-wealth lower bound for 1 − Gini in Section~7.
Once we have an anytime-valid confidence sequence [LCBt, UCBt] for μt, the audit decision rule is immediate: we can (i) certify compliance at horizon T whenever LCBT ≥ τ, (ii) flag likely noncompliance when UCBt < τ, and (iii) monitor continuously without inflating type-I error, since the sequence is valid under optional stopping. This operationalizes a regulatory stance that is both strict about evidentiary integrity (deterministic checks) and appropriately cautious about statistical uncertainty (time-uniform inference under adaptivity).
The next section instantiates this template for (a) welfare and welfare increments, (b) the Rawlsian minimum, and (c) Lipschitz fairness metrics, and then treats 1 − Gini by adding the minimal domain restriction needed to restore Lipschitz stability.
In this section we instantiate the generic martingale template with explicit, finite-sample confidence sequence (CS) widths for the three audit objects that recur in policy discussions: (a) total welfare (or welfare increments), (b) Rawlsian protection of the worst-off party, and (c) inequality-style fairness metrics that are stable functionals of cumulative wealth. The common structure is that we choose an adapted statistic Zt whose range is deterministically bounded from the log primitives, and then form time-uniform bounds for the deployment-average conditional mean $\mu_t=\frac{1}{t}\sum_{k=1}^t \mathbb E[Z_k\mid \mathcal F_{k-1}]$.
We begin with a single reusable lemma (stated here as a theorem for convenience) that turns a boundedness certificate into an operational CS.
Assume Zt
is ℱt-measurable
and almost surely bounded as $Z_t\in[\underline z,\overline z]$ for all
t ≥ 1, with range $c:=\overline z-\underline z$. Define $\bar Z_t:=\frac{1}{t}\sum_{k=1}^t Z_k$ and,
for any δ ∈ (0, 1),
$$
\mathrm{rad}_t(\delta)
\;:=\;
c\sqrt{\frac{2}{t}\left(\log\frac{2}{\delta}+\log\!\big(1+\log_2
t\big)\right)}.
$$
Then the interval
CSt(δ) := [Z̄t − radt(δ), Z̄t + radt(δ)]
satisfies
Pr (∀t ≥ 1: μt ∈ CSt(δ)) ≥ 1 − δ,
for any adaptive (non-i.i.d.) data stream consistent with the filtration
{ℱt}.
The important operational point is that the auditor only needs the deterministic range c (derivable from $\underline B,\overline B$ and the functional form of Zt) and the observed running average Z̄t to compute CSt(δ) online. In applications below, we will typically report the compliance condition as LCBT ≥ τ, where LCBt := Z̄t − radt(δ).
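The radius in Theorem~7.1 is directly computable online; the following is a direct transcription (ours) of the stated width and the resulting interval:

```python
import math

def cs_radius(t: int, c: float, delta: float) -> float:
    """rad_t(δ) = c * sqrt((2/t) * (log(2/δ) + log(1 + log2 t))), as in Theorem 7.1."""
    return c * math.sqrt((2.0 / t) * (math.log(2.0 / delta) + math.log(1.0 + math.log2(t))))

def confidence_sequence(z_bar_t: float, t: int, c: float, delta: float):
    """Anytime-valid interval [LCB_t, UCB_t] for μ_t, centered at the running mean."""
    r = cs_radius(t, c, delta)
    return z_bar_t - r, z_bar_t + r

# Example: a [0,1]-valued fairness index (c = 1), δ = 0.05, after 10,000 rounds.
lcb, ucb = confidence_sequence(z_bar_t=0.82, t=10_000, c=1.0, delta=0.05)
```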
A natural welfare proxy in our contracting environment is total
realized wealth change (principal plus agents) per round,
Ztwel := ΔWelfaret := ∑j ∈ {p, 1, …, n}Δwj, t.
Because each $\Delta w_{j,t}\in[\underline
B,\overline B]$ by assumption, we have the deterministic
bound
$$
Z_t^{\mathrm{wel}}
\in
\big[(n+1)\underline B,\ (n+1)\overline B\big],
\qquad
c_{\mathrm{wel}}
=
(n+1)(\overline B-\underline B).
$$
Applying Theorem~7.1 yields an anytime-valid CS for
$$
\mu_t^{\mathrm{wel}}
:=
\frac{1}{t}\sum_{k=1}^t
\mathbb E\!\left[\Delta \mathrm{Welfare}_k\mid \mathcal F_{k-1}\right],
$$
i.e., the deployment-average welfare increment. This target is
particularly well-suited to adaptive deployments: it answers the
regulatory question ``on average, conditional on what the principal knew
when acting, what surplus did the deployed mechanism deliver?’’ without
requiring stationarity.
If a policy requirement is stated instead in terms of cumulative
welfare Welfare(Wt) = ∑jwj, t,
the auditor can translate between increments and levels via
$$
\frac{1}{T}\sum_{t=1}^T \mathrm{Welfare}(W_t)
=
\frac{1}{T}\sum_{t=1}^T \sum_{k=1}^t \Delta \mathrm{Welfare}_k
=
\sum_{k=1}^T \left(1-\frac{k-1}{T}\right)\Delta \mathrm{Welfare}_k,
$$
so that a weighted version of the same CS machinery applies (with
predictable weights); we emphasize increments because they preserve a
uniform range and therefore yield clean $\tilde O(1/\sqrt{t})$ widths.
The Rawlsian metric at time t is the minimum cumulative
wealth,
Rt := minj ∈ {p, 1, …, n}wj, t.
Directly auditing {Rt} as a level
statistic is possible but unattractive because its range grows linearly
in t, which mechanically
widens Hoeffding-style bounds. A simple normalization avoids this
problem. Define the time-normalized worst-off wealth,
$$
Z_t^{\mathrm{raw}}
\;:=\;
\frac{1}{t}\,R_t
=
\frac{1}{t}\min_{j} w_{j,t}.
$$
Since each $w_{j,t}\in[t\underline
B,t\overline B]$, it follows deterministically that $Z_t^{\mathrm{raw}}\in[\underline B,\overline
B]$, hence $c_{\mathrm{raw}}=\overline
B-\underline B$. Theorem~7.1 therefore provides an anytime-valid
CS for
$$
\mu_t^{\mathrm{raw}}
:=
\frac{1}{t}\sum_{k=1}^t
\mathbb E\!\left[\frac{1}{k}\min_j w_{j,k}\ \middle|\ \mathcal
F_{k-1}\right].
$$
This target has a clear interpretation in deployment terms: it averages
(over rounds) the predictable value of the worst-off party’s
wealth-to-date.
A second, sometimes more policy-aligned alternative is to audit the
increment in the Rawlsian minimum:
ΔRt := Rt − Rt − 1.
One can show purely from the wealth-increment bounds that $\Delta R_t\in[\underline B,\overline B]$ for
all t (because the minimum
cannot rise by more than the largest feasible one-step increment, nor
fall by more than the largest feasible one-step decrement). Thus Theorem~7.1 applies again with the
same range $\overline B-\underline B$,
yielding an anytime-valid CS for the predictable ``worst-off growth
rate’’ $\frac{1}{t}\sum_{k=1}^t \mathbb
E[\Delta R_k\mid\mathcal F_{k-1}]$.
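Both Rawlsian statistics are mechanical to compute from the cumulative-wealth matrix; a short sketch (ours), with rows indexed by rounds:

```python
import numpy as np

def rawls_normalized(W: np.ndarray) -> np.ndarray:
    """Z_t^{raw} = (1/t) min_j w_{j,t}; stays in [B_lo, B_hi] by construction."""
    t = np.arange(1, W.shape[0] + 1)
    return W.min(axis=1) / t

def rawls_increments(W: np.ndarray) -> np.ndarray:
    """ΔR_t = R_t - R_{t-1} (with R_0 = 0, since cumulative wealth starts at zero)."""
    R = W.min(axis=1)
    return np.diff(R, prepend=0.0)
```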
Let F : ℝn + 1 → ℝ
be an L-Lipschitz functional
on the feasible wealth domain up to horizon T (for a chosen norm), and set
Ztfair := F(Wt).
When F is itself range-bounded
(as with many normalized indices taking values in [0, 1]), Theorem~7.1 applies immediately with
cfair = 1. More
generally, even when F is not
globally bounded, our primitive wealth increment bounds restrict Wt to the
hyper-rectangle $[t\underline B,t\overline
B]^{n+1}$, and Lipschitzness yields a deterministic range bound
on Ztfair
over t ≤ T:
$$
\sup_{W,W'\in [T\underline B,T\overline B]^{n+1}} |F(W)-F(W')|
\;\le\;
L\cdot \mathrm{diam}\!\left([T\underline B,T\overline B]^{n+1}\right),
$$
where the diameter is explicit under the chosen norm (e.g., under ℓ2 it equals $T(\overline B-\underline B)\sqrt{n+1}$).
Therefore, for a fixed deployment horizon T, the auditor can compute a valid
uniform range cfair(T) and
apply Theorem~7.1 to obtain an anytime-valid CS for
$$
\mu_t^{\mathrm{fair}}
:=
\frac{1}{t}\sum_{k=1}^t \mathbb E\!\left[F(W_k)\mid \mathcal
F_{k-1}\right],
$$
which is exactly the fairness target used in our compliance
definition.
In many deployments, worst-case envelopes based on $\overline B-\underline B$ are conservative.
The auditor can therefore supplement Theorem~7.1 with an
empirical-Bernstein CS whose width depends on realized variability. A
canonical form is
$$
\mathrm{rad}^{\mathrm{EB}}_t(\delta)
\;=\;
\sqrt{\frac{2\widehat V_t\log(C/\delta)}{t^2}}
\;+\;
\frac{K\,c\log(C/\delta)}{t},
$$
where V̂t
is an observable self-normalizer (built from the realized {Zk}), and C, K are absolute constants
determined by the chosen EB construction. This refinement can materially
accelerate certification when fairness (or welfare) is stable.
The index F(W) = 1 − Gini(W) is attractive because it is widely understood by stakeholders, but it is not globally Lipschitz due to the normalization by mean wealth. The minimal fix is to restrict attention to a domain where the mean is bounded away from zero.
Define, for W = (w0, w1, …, wn)
with mean $\bar w:=\frac{1}{n+1}\sum_{j=0}^n
w_j$,
$$
\mathrm{Gini}(W)
:=
\frac{1}{2(n+1)^2\,\bar w}\sum_{j=0}^n\sum_{k=0}^n |w_j-w_k|,
\qquad
F(W):=1-\mathrm{Gini}(W).
$$
Assume the auditor can verify a policy-imposed condition
w̄t ≥ μmin > 0 for
all t ≤ T,
directly from the log-computed wealth vector Wt. Also let
$R_T:=T\max\{|\underline B|,|\overline
B|\}$ so that |wj, t| ≤ RT
for all j and t ≤ T.
On the restricted feasible set
𝒲T(μmin) := {W ∈ [−RT, RT]n + 1: w̄ ≥ μmin},
one can bound the sensitivity of Gini
as follows (under the ℓ1 norm). Let A(W) := ∑j, k|wj − wk|.
Then
|A(W) − A(W′)| ≤ 2(n + 1) ∥W − W′∥1, A(W) ≤ 2RT(n + 1)2,
and
$$
\left|\frac{1}{\bar w}-\frac{1}{\bar w'}\right|
\le
\frac{|\bar w-\bar w'|}{\mu_{\min}^2}
\le
\frac{\|W-W'\|_1}{(n+1)\mu_{\min}^2}.
$$
Combining these inequalities yields a Lipschitz bound on Gini (hence also on 1 − Gini):
$$
|\mathrm{Gini}(W)-\mathrm{Gini}(W')|
\;\le\;
\left(
\frac{1}{(n+1)\mu_{\min}}
+
\frac{R_T}{(n+1)\mu_{\min}^2}
\right)\|W-W'\|_1,
\qquad
W,W'\in\mathcal W_T(\mu_{\min}).
$$
Thus F(W) = 1 − Gini(W)
is Lipschitz on the auditable domain, with an explicit constant that
worsens as μmin ↓ 0
(the precise expression is less important than this comparative static).
Since F(Wt) ∈ [0, 1]
on this domain, the simplest implementation is to take Zt = F(Wt) ∈ [0, 1]
and apply Theorem~7.1 directly with cfair = 1, while treating
the mean lower bound w̄t ≥ μmin
as a separate deterministic precondition: if it fails at any t ≤ T, the auditor flags
the fairness metric as unstable and declines to certify.
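The implementation mirrors this two-step logic: first the deterministic mean-floor precondition, then the bounded statistic. A sketch (ours):

```python
import numpy as np

def gini(W: np.ndarray) -> float:
    """Gini(W) = (1 / (2 (n+1)^2 mean(W))) * Σ_j Σ_k |w_j - w_k|, as defined above."""
    n1 = len(W)  # n + 1 parties
    pairwise = np.abs(W[:, None] - W[None, :]).sum()
    return float(pairwise / (2 * n1 ** 2 * W.mean()))

def one_minus_gini(W: np.ndarray, mu_min: float) -> float:
    """Fairness statistic Z_t = 1 - Gini(W_t), valid only on the restricted
    domain mean(W) >= mu_min; outside it the auditor declines to certify."""
    if W.mean() < mu_min:
        raise ValueError("mean-wealth floor violated: metric unstable; decline to certify")
    return 1.0 - gini(W)
```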
This illustrates a broader lesson for inequality auditing in adaptive mechanisms: seemingly mild normalizations (division by a mean or a baseline) can destroy stability unless the regulator also enforces a domain restriction that keeps the normalization well-behaved.
So far we have treated auditing as an on-policy exercise: we certify properties of the deployed mechanism using only bounded adapted statistics. In many regulatory conversations, however, the counterfactual question is central: what would fairness have been had the principal committed to a different contracting policy? This motivates an optional off-policy (counterfactual) module that leverages the same log, but now uses the principal’s declared propensities to reweight outcomes.
We begin with the contextual bandit case, where each round t consists of observing st, drawing a
contract bt ∼ qt(⋅ ∣ st),
and then observing a bounded auditable statistic Zt (e.g., Zt = F(Wt),
or more conservatively a per-round fairness proxy built from ΔWt).
Fix a policy π*(b ∣ s)
that is not necessarily equal to the deployed sampling rule qt(⋅ ∣ s).
The counterfactual estimand we can identify from logs is the
deployment-average conditional mean fairness under π*,
$$
\mu_t(\pi^*)
\;:=\;
\frac{1}{t}\sum_{k=1}^t
\mathbb E\!\left[
\mathbb E_{b\sim \pi^*(\cdot\mid s_k)}\!\left[\,\mathbb E[Z_k\mid
\mathcal F_{k-1},s_k,b]\,\right]
\ \middle|\ \mathcal F_{k-1}
\right].
$$
This is the natural analog of our on-policy target: it asks, round by
round, what fairness would have been delivered had the principal sampled
contracts according to π* given the realized
state sk,
while holding fixed the environment response mapping from (s, b) into outcomes
(including strategic agent responses).
Identification requires overlap. Specifically, if we impose the auditable condition $q_t(b \mid s) \ge \eta > 0$ for all $(b, s, t)$, then the one-step importance ratio $\pi^*(b_t \mid s_t)/q_t(b_t \mid s_t)$ is bounded above by $1/\eta$ (when $\pi^*$ is supported on $\mathcal B$). This is the precise sense in which
randomization is not merely a design choice but a precondition for
counterfactual contestability: without overlap, the log cannot speak
about contracts that were essentially never tried.
Given verifiable propensities qt(⋅ ∣ st)
(via VRF-checked declarations), the auditor can form the
inverse-propensity-score (IPS) reweighted statistic
$$
\widetilde Z_t^{\mathrm{IPS}}(\pi^*)
\;:=\;
\frac{\pi^*(b_t\mid s_t)}{q_t(b_t\mid s_t)}\, Z_t.
$$
Under overlap and boundedness $Z_t\in[\underline z,\overline z]$, we have
the deterministic range bound
$$
\widetilde Z_t^{\mathrm{IPS}}(\pi^*)
\in
\left[\frac{\pi^*(b_t\mid s_t)}{q_t(b_t\mid s_t)}\underline z,\
\frac{\pi^*(b_t\mid s_t)}{q_t(b_t\mid s_t)}\overline z\right]
\subseteq
\left[\frac{\underline z}{\eta},\ \frac{\overline z}{\eta}\right],
$$
and hence the range scales as $c_{\mathrm{IPS}}\le (\overline z-\underline
z)/\eta$. The IPS average
$$
\overline{\widetilde Z}^{\mathrm{IPS}}_t(\pi^*)
\;:=\;
\frac{1}{t}\sum_{k=1}^t \widetilde Z_k^{\mathrm{IPS}}(\pi^*)
$$
is then an unbiased estimator (in the conditional-on-$\mathcal F_{k-1}$ sense) of the counterfactual mean under $\pi^*$, and we may apply the same martingale CS machinery as in Theorem~7.1 to obtain a time-uniform confidence sequence for $\mu_t(\pi^*)$.
The economic content of the bound is immediate: the variance (and therefore certification time) deteriorates as η ↓ 0. In policy terms, a regulator that wants credible counterfactual auditing must either (i) mandate a minimum exploration rate (a lower bound η), or (ii) restrict permissible counterfactual policies π* to those that do not put mass on rarely-sampled contracts.
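A minimal sketch of the reweighting step, with the overlap floor enforced before any score is trusted (array names are illustrative):

```python
import numpy as np

def ips_scores(z, pi_star, q, eta):
    """One-step IPS scores (pi*(b_t|s_t)/q_t(b_t|s_t)) * Z_t.

    z, pi_star, q are arrays of the realized Z_t and the two propensity
    evaluations at the logged (s_t, b_t); eta is the regulator-imposed
    overlap floor, checked before any reweighting is trusted.
    """
    z, pi_star, q = (np.asarray(a, dtype=float) for a in (z, pi_star, q))
    if np.any(q < eta):
        raise ValueError("overlap violated: some q_t(b_t|s_t) < eta")
    # The ratio is bounded by 1/eta, so the score range scales like
    # (zbar - zlow)/eta and the same martingale CS machinery applies
    # to the running mean of these scores.
    return (pi_star / q) * z
```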
Even with overlap, the IPS estimator can be extremely noisy when
π*/qt
is large on a set of non-negligible probability. This is not a
statistical artifact but a genuine information constraint: if the
deployed mechanism almost never tries some contract, then the log
contains too little information to estimate its consequences
precisely.
A common engineering response is to cap weights, i.e. to replace $\pi^*/q_t$ by $\min\{\pi^*/q_t, M\}$ for some $M$, but this introduces bias. From an auditing perspective, we view such truncation as a policy choice: one can still produce anytime-valid statements, but they must be framed as certification for a modified, truncated estimand, or as conservative bounds that account explicitly for truncation bias (which typically requires additional structure, e.g. outcome monotonicity in $b$).
When outcome modeling is feasible, a doubly robust estimator can
substantially reduce variance while retaining a martingale-valid
analysis, provided the numerical component is handled carefully. Let
$\widehat m_{t-1}(s, b)$ be a regression estimate of $m(s, b) := \mathbb E[Z_t \mid \mathcal F_{t-1}, s_t = s, b_t = b]$ built only from data up to time $t-1$ (so $\widehat m_{t-1}$ is $\mathcal F_{t-1}$-measurable). Define
$$
\widehat m_{t-1}(s, \pi^*) := \sum_{b \in \mathcal B} \pi^*(b \mid s)\, \widehat m_{t-1}(s, b)
$$
(with the obvious integral form for continuous $\mathcal B$). The one-step DR score is
$$
\widetilde Z_t^{\mathrm{DR}}(\pi^*)
\;:=\;
\widehat m_{t-1}(s_t,\pi^*)
\;+\;
\frac{\pi^*(b_t\mid s_t)}{q_t(b_t\mid s_t)}\Big(Z_t-\widehat
m_{t-1}(s_t,b_t)\Big).
$$
If either (i) the propensity weights are correct (ensured here by
VRF-verified sampling and declared qt) or (ii) the
model is correct, the DR average targets μt(π*);
when both are approximately correct, it typically has much smaller
variance than IPS.
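A sketch of the one-step DR score for a finite grid; the callable interfaces for $\pi^*$, $q_t$, and $\widehat m_{t-1}$ are assumptions for illustration:

```python
import numpy as np

def dr_score(z_t, s_t, b_t, pi_star, q_t, m_hat, contract_grid, z_lo, z_hi):
    """One-step doubly robust score for a finite contract grid B.

    pi_star(b, s) and q_t(b, s) return probabilities; m_hat(s, b) is an
    F_{t-1}-measurable regression model (trained on rounds < t).  These
    callables are assumed interfaces, not a fixed API.  Predictions are
    clipped into [z_lo, z_hi] so the DR increment has a known range.
    """
    def m_clip(s, b):
        return float(np.clip(m_hat(s, b), z_lo, z_hi))

    # Direct-method term: model average under the target policy pi*.
    direct = sum(pi_star(b, s_t) * m_clip(s_t, b) for b in contract_grid)
    # Importance-weighted residual correction at the realized (s_t, b_t).
    ratio = pi_star(b_t, s_t) / q_t(b_t, s_t)
    return direct + ratio * (z_t - m_clip(s_t, b_t))
```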
For auditing, the key point is that we can still obtain time-uniform concentration so long as we can bound the DR increments. Under overlap and bounded $Z_t\in[\underline z,\overline z]$, if we additionally ensure $\widehat m_{t-1}(s,b)\in[\underline z,\overline z]$ by construction (e.g. clipping predictions), then $\widetilde Z_t^{\mathrm{DR}}(\pi^*)$ lies in a known interval whose width again scales like $1/\eta$, but with the residual term $Z_t-\widehat m_{t-1}(s_t,b_t)$ often much smaller in practice. We emphasize a limitation: the learning step that produces $\widehat m_{t-1}$ is not itself cryptographically verifiable. What is auditable is the timing discipline (the model must be frozen before observing round $t$) and the subsequent CS computation given the realized residuals. In deployments, this suggests a clean separation: the regulator specifies the permissible modeling class and training protocol, but the final compliance decision is still based on an anytime-valid bound.
In richer environments, st is not
exogenous but evolves with actions, and a ``policy’’ π* specifies a sequence
of contract distributions along a trajectory (an episodic MDP).
Off-policy evaluation then involves products of importance ratios across
steps. In an episode of horizon H, a naïve trajectory-weighted
estimator uses
$$
\rho_{e}
\;:=\;
\prod_{h=1}^H
\frac{\pi^*(b_{e,h}\mid s_{e,h})}{q_{e,h}(b_{e,h}\mid s_{e,h})},
$$
which, under overlap $q \ge \eta$, is bounded by $\eta^{-H}$ and therefore suffers exponential variance blow-up in $H$. This is the dynamic analog of the bandit variance problem, and it is more severe.
A practical (and standard) mitigation is to use per-decision importance sampling (weighting stepwise rewards rather than whole trajectories) and/or sequential doubly robust estimators that combine local models with one-step residual corrections. However, to obtain sharp, anytime-valid guarantees in this setting, additional assumptions are typically required beyond those used for on-policy fairness certification: for example, bounded per-step rewards, bounded importance ratios (or explicit truncation), and some form of mixing/stability to control how model errors propagate through time. Because these conditions are application-dependent, we treat the Markov extension as a modular add-on rather than a baseline guarantee: the log architecture (VRF-verified propensities and append-only integrity) is fully compatible with these estimators, but the statistical validity of counterfactual fairness claims hinges on enforceable overlap and, in many cases, structural constraints on the dynamics.
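For concreteness, a sketch of the per-decision estimator for a single logged episode, with optional clipping that, as noted above, changes the estimand and must be reported as such:

```python
import numpy as np

def per_decision_is(rewards, pi_probs, q_probs, clip=None):
    """Per-decision importance sampling for one logged episode.

    The step-h reward is weighted by the cumulative ratio up to h only,
    instead of the full horizon-H product used by the trajectory
    estimator.  Optional clipping bounds the weights at the cost of an
    explicit bias, i.e. it changes the estimand being certified.
    """
    rewards = np.asarray(rewards, dtype=float)
    ratios = np.asarray(pi_probs, dtype=float) / np.asarray(q_probs, dtype=float)
    cum = np.cumprod(ratios)
    if clip is not None:
        cum = np.minimum(cum, clip)
    return float(np.sum(cum * rewards))
```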
Taken together, the counterfactual module clarifies the tradeoff we want the model to illuminate: verifiable randomization makes counterfactual questions identifiable, but only a regulator-imposed overlap regime (or a restriction of the counterfactual class) prevents the resulting audit from becoming statistically powerless.
We complement the theoretical guarantees with a simulation study designed to answer a practical question a regulator would immediately ask: how much data does the audit need before it can decide? Because our confidence sequences are anytime-valid, ``sample size’’ is endogenous: certification can occur early when the data are informative, and may never occur when the system is near the threshold or the fairness functional is highly variable. Simulations let us visualize this behavior through audit power curves and stopping-time distributions, and stress-test robustness to the two features that break classical i.i.d. analysis: policy drift (the principal adapts $q_t$) and strategic responses (agents adapt $a_{i,t}$).
We simulate n agents and a
principal over T rounds. Each
round draws an observable state st ∈ {1, …, S}
(e.g. market conditions) from a Markov chain with moderate persistence;
the auditor observes st in the log.
The contract class is the auditable linear form
$$
p_{i,t} = m_t + \alpha_t y_{i,t},
\qquad
(\alpha_t, m_t) \in [0,1] \times [0, m_{\max}],
$$
shared across agents for simplicity, with $b_t = (\alpha_t, m_t)$.
Outcomes are generated by strategic effort with idiosyncratic
productivity. Concretely, each agent $i$ draws a private cost shock $\theta_{i,t}$ and chooses effort $e_{i,t} \in [0, e_{\max}]$ if participating; output is
$$
y_{i,t} = \beta_i(s_t)\, e_{i,t} + \varepsilon_{i,t},
\qquad
\varepsilon_{i,t} \in [-\sigma, \sigma],
$$
and agent utility is quasi-linear
$$
\Delta w_{i,t} \;=\; p_{i,t} - \tfrac{1}{2}\theta_{i,t} e_{i,t}^2
\quad \text{if participating,}
\qquad
\Delta w_{i,t} \;=\; u_i^{\mathrm{out}}(s_t)
\quad \text{if rejecting.}
$$
Given $(\alpha_t, m_t)$, agents best-respond myopically (a standard reduced form for repeated contracting when the state is observed and the contract is per-round). Rejection occurs when the implied best-response payoff falls below $u_i^{\mathrm{out}}(s_t)$.
This design produces two empirically relevant patterns: (i) raising $\alpha_t$ increases incentives and output, but shifts surplus toward agents; (ii) raising $m_t$ relaxes participation constraints, but can generate inequality if targeted unevenly through state dependence. The auditor does not observe $\theta_{i,t}$ or $\beta_i(\cdot)$; it only observes $(s_t, b_t, a_{1:n,t}, y_{1:n,t}, p_{1:n,t})$.
To induce realistic nonstationarity, we let the principal adapt qt(⋅ ∣ s)
using a simple learning rule that trades off profit and a soft fairness
penalty. In each state s, the
principal maintains weights over a finite grid of contracts ℬ and updates them via exponentiated gradient
on a noisy proxy objective
$$
\widehat J_t(b) \;=\; \sum_{i=1}^n \big(y_{i,t} - p_{i,t}\big) \;-\;
\lambda\, \widehat{\mathrm{Ineq}}_t(b),
$$
where $\widehat{\mathrm{Ineq}}_t$ is computed from recent logged wealth changes (e.g. a sliding-window variance of $\Delta w_{i,t}$). This induces policy drift endogenously: when the environment changes (through $s_t$) or the agent pool is heterogeneous, the principal shifts mass toward contracts that improve its objective. Importantly, the audit does not assume any particular learning algorithm; we use adaptivity here only to test that the confidence sequence maintains coverage under history-dependent $q_t$.
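A minimal sketch of this update (simulation-side machinery only; the audit never relies on knowing it):

```python
import numpy as np

def eg_update(weights, j_hat, lr=0.1):
    """Exponentiated-gradient update of the principal's weights over a
    finite contract grid in one state, given the noisy proxy objective
    values J_hat(b) (profit minus lambda * inequality proxy).  This is
    the simulator's drift generator; the audit itself never assumes it.
    """
    logits = np.log(np.asarray(weights, dtype=float)) \
        + lr * np.asarray(j_hat, dtype=float)
    logits -= logits.max()          # stabilize before exponentiating
    w = np.exp(logits)
    return w / w.sum()              # new q_t(. | s) over the grid B
```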
We focus on wealth-based fairness functionals $F(W_t)$, including (i) $1-\mathrm{Gini}(W_t)$ with an enforced mean-wealth floor $\bar w_t \ge \mu_{\min}$ (implemented in simulation by adding a fixed baseline transfer to all parties), and (ii) the Jain index on agent wealth alone. To align with the theory, we explicitly enforce bounded increments $\Delta w_{j,t}\in[\underline B,\overline B]$ by clipping payments and outcomes in the simulator and by ensuring $m_t \in [0, m_{\max}]$ and $y_{i,t} \in [-y_{\max}, y_{\max}]$. These are not merely technicalities: the empirical width of the confidence sequence is driven by the realized variability, but the validity guarantee hinges on a correct deterministic range bound. We therefore treat ``range calibration’’ as an input to the empirical exercise: the auditor must be conservative about $\underline B,\overline B$ and about the Lipschitz proxy $L$ for the chosen $F$.
For each simulated run, we compute the streaming confidence sequence
for
$$
\mu_t \;=\; \frac{1}{t}\sum_{k=1}^t \mathbb E\!\left[F(W_k)\mid \mathcal
F_{k-1}\right],
$$
using the same martingale CS construction as in our theoretical section
(implemented with a bounded-differences mixture boundary). We then record the first time $t$ at which $\mathrm{LCB}_t \ge \tau$ (certification) and, symmetrically, the first time at which $\mathrm{UCB}_t < \tau$ (early rejection). To interpret these stopping times as ``power,’’ we run paired experiments where we can compute a ground-truth benchmark: since we control the simulator, we approximate $\mu_t$ by Monte Carlo conditioning on $\mathcal F_{t-1}$ (holding the realized history fixed but resampling the one-step noise). This is not available in practice, but it allows us to measure (i) empirical coverage of the CS and (ii) the probability of certification as a function of $T$ and the fairness gap $\mu_T - \tau$.
The central output is an audit power curve plotting $\Pr(\exists\, t \le T : \mathrm{LCB}_t \ge \tau)$ against $T$, stratified by the fairness gap and by the overlap/drift regime. Three patterns are robust across parameterizations. First, when $\mu_T$ exceeds $\tau$ by a comfortable margin, certification occurs quickly and the stopping time concentrates tightly; empirically, the median stopping time scales roughly like $O((\mu_T - \tau)^{-2}\log(1/\delta))$, consistent with concentration intuition. Second, when $\mu_T$ is close to $\tau$, the CS often remains inconclusive over long horizons: the auditor does not falsely certify (by coverage), but it also cannot ``force’’ a decision without more information. This is precisely the operational meaning of an anytime-valid guarantee: it trades premature false assurance for a transparent dependence on data. Third, larger per-round variability (induced either by higher $\sigma$ in outcomes, more heterogeneous $\beta_i$, or more aggressive policy drift) shifts the power curve rightward, increasing the rounds needed for certification.
To test robustness, we vary (i) the learning rate of the principal (faster drift) and (ii) the degree of strategic elasticity (how strongly $e_{i,t}$ responds to $\alpha_t$, and how often agents reject). Classical fixed-policy concentration can fail badly here because the distribution of $F(W_t)$ is nonstationary and endogenous. In contrast, our diagnostics focus on the claim the auditor actually needs: time-uniform coverage. Across drift regimes, we track the event
$$
\mathcal E = \{\exists\, t \le T:\ \mu_t < \mathrm{LCB}_t\},
$$
and estimate $\Pr(\mathcal E)$ over many runs.
Empirically, $\Pr(\mathcal E)$ stays below the nominal $\delta$ (up to Monte Carlo error), even in regimes where the realized fairness trajectory is highly
path-dependent and exhibits long transients. This is the main sense in
which the simulation supports the theory: not that the audit is always
decisive, but that when it is decisive it is not spuriously so.
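The diagnostic reduces to a simple scan once ground-truth paths are available; a sketch, assuming each run supplies the Monte Carlo $\mu_t$ path alongside the audit’s lower boundary:

```python
import numpy as np

def coverage_violation_rate(runs):
    """Estimate Pr(E) = Pr(exists t <= T : mu_t < LCB_t) across runs.

    Each run is a (mu_path, lcb_path) pair: the Monte-Carlo ground-truth
    conditional target (available only in simulation) and the audit's
    lower boundary.  The estimate should stay below the nominal delta.
    """
    hits = sum(
        bool(np.any(np.asarray(mu) < np.asarray(lcb))) for mu, lcb in runs
    )
    return hits / len(runs)
```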
Two limitations are worth making explicit. First, the simulator necessarily hard-codes boundedness and (for Gini) a mean-wealth floor; in deployment these must be justified institutionally (e.g. payment caps, escrow constraints, or explicit baseline compensation). Second, our power curves are conditional on the chosen fairness functional and on conservative bounds. A regulator that insists on a highly sensitive metric, or that cannot credibly bound increments, should expect the audit to require substantially more data or to be frequently inconclusive. We view this not as a weakness but as a policy-relevant output: it makes transparent which fairness notions are certifiable at reasonable sample sizes under adaptive contracting.
These simulation results set up the next empirical module, where we move from fully specified synthetic environments to semi-synthetic platform-style logs, and evaluate not only stopping times but also false positive/false negative rates under controlled injections of unfairness.
Our simulation study isolates the statistical logic of anytime-valid auditing under adaptivity, but it deliberately abstracts from the messy structure of real platform logs: heterogeneous tasks, irregular participation, missingness, and payment rules that are only partially parameterized. We therefore complement it with a semi-synthetic empirical module that uses real logged traces while preserving experimental control over the contracting and fairness properties that the auditor is asked to certify. The goal is operational: quantify (i) how often our audit would certify or reject in finite samples when the underlying system is truly fair or unfair, and (ii) how sensitive these rates are to the range and Lipschitz calibrations that the theory requires.
We start from a platform-style dataset consisting of time-stamped
interactions between a principal (the platform) and a population of
workers. Each interaction includes covariates that we treat as the
observable state st (task
category, predicted duration, location, baseline price, worker tenure,
congestion indicators, etc.) and a realized outcome proxy yi, t
(e.g. completed units, verified quality score, or revenue attributable
to worker i). Because the raw
data typically record realized payments but not counterfactual payments under alternative contracts, we use the dataset only as a source of $(s_t, y_{i,t})$ trajectories and participation patterns, and then regenerate payments under a specified contract class. Concretely, we define a bounded contract grid $\mathcal B$ of auditable linear-share forms
$$
b = (\alpha, m) \in [0,1] \times [0, m_{\max}],
\qquad
p_{i,t}(b) = m + \alpha\, y_{i,t},
$$
together with deterministic payment caps that enforce $p_{i,t} \in [0, p_{\max}]$ and hence bounded wealth increments. Participation is treated as an action $a_{i,t} \in \{\text{accept}, \text{reject}\}$:
in the simplest variant, we take the observed accept/reject decisions as
fixed, while in a richer variant we fit a probabilistic participation
model on historical data and then simulate ai, t
under counterfactual b to
capture selection effects. In either case, the resulting semi-synthetic
log matches the auditor’s interface: each round records $(s_t, b_t, a_{1:n,t}, y_{1:n,t}, p_{1:n,t})$ and a declared propensity vector $q_t(\cdot \mid s_t)$.
A practical nuance is that real systems do not natively provide VRF
proofs. In our semi-synthetic evaluation we therefore treat propensity
integrity in two layers: (i) the statistical layer, where the auditor trusts the logged $q_t$ and uses it for (optional) overlap and counterfactual analyses; and (ii) the cryptographic layer, where we separately evaluate the value of VRF-style commitment by simulating two regimes: a ``committed’’ regime in which $(q_t,b_t)$ are consistent by construction, and an ``opportunistic’’ regime in which we allow adversarial post hoc editing of $q_t$ to illustrate how quickly off-policy estimates can become meaningless without verifiable randomization. This split lets us be explicit about what is statistical and what is cryptographic: the confidence sequence controls false certification only when the logged propensities are the ones actually used, whereas VRF mechanisms are what make that condition institutionally credible.
To evaluate false positive/false negative rates, we need repeated instances where the ground-truth deployment-average fairness μT is above or below the threshold τ by a known margin. We accomplish this by injecting unfairness through mechanisms that mirror plausible platform behaviors while remaining auditable from the log.
The first injection channel is direct payment discrimination. Partition agents into two observed groups $g(i) \in \{0,1\}$ (interpretable as cohorts, regions, or any policy-relevant partition available to the auditor). For an unfairness level $\kappa \ge 0$, define a modified payment rule
$$
p^{(\kappa)}_{i,t}(b) = \min\big\{p_{\max},\ \max\{0,\ m + \alpha y_{i,t} - \kappa\cdot\mathbf 1[g(i)=1]\}\big\},
$$
which preserves boundedness and LL by construction but creates systematic wealth divergence.
The second channel is propensity (allocation) discrimination. Here the principal uses different propensities conditional on group membership or on state variables correlated with group. For instance, the principal may allocate high-$\alpha$ contracts (high incentive/high worker surplus) at lower probability for $g = 1$:
$$
q^{(\kappa)}_t(b \mid s_t, g) \;\propto\; q^{(0)}_t(b \mid s_t)\cdot\exp\big(-\kappa\cdot\mathbf 1[b \in \mathcal B_{\text{high-}\alpha}]\cdot\mathbf 1[g=1]\big),
$$
followed by renormalization. This kind of unfairness is subtle in one-step outcomes but accumulates in wealth, which is precisely why wealth-based $F(W_t)$ is a useful audit target.
The third channel is differential participation (selection). Even when posted contracts are identical, a platform may effectively induce differential rejection (e.g. through frictions, delayed payouts, or information asymmetries). In the variant with simulated participation, we apply a group-dependent outside-option shift $\Delta u^{\mathrm{out}}$ that changes reject rates, which then changes realized wealth distributions through selection. This injection is valuable because it produces unfairness without explicit payment discrimination, stressing that an audit based on $W_t$ must treat rejection as an action that affects welfare and inequality.
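The first two channels are easy to state in code; a sketch with illustrative parameter names, encoding $\mathcal B_{\text{high-}\alpha}$ as a boolean mask over the grid:

```python
import numpy as np

def discriminatory_payment(y, group, alpha, m, kappa, p_max):
    """Channel 1: group-dependent payment penalty kappa for g(i) = 1,
    clipped to [0, p_max] so LL and boundedness hold by construction."""
    raw = m + alpha * np.asarray(y, dtype=float) \
        - kappa * (np.asarray(group) == 1)
    return np.clip(raw, 0.0, p_max)

def tilted_propensities(q0, high_alpha_mask, g, kappa):
    """Channel 2: exponentially down-weight high-alpha contracts for
    group g = 1, then renormalize to a valid distribution over B."""
    q = np.asarray(q0, dtype=float).copy()
    if g == 1:
        q = q * np.where(np.asarray(high_alpha_mask), np.exp(-kappa), 1.0)
    return q / q.sum()
```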
For each unfairness level κ and horizon T, we generate many semi-synthetic deployments by replaying the observed (st, yi, t) trace and sampling contracts bt ∼ qt(⋅ ∣ st) under the specified regime. On each deployment we run the streaming audit and record (i) the first time t at which LCBt ≥ τ (certification), (ii) the first time t at which UCBt < τ (rejection), and (iii) whether neither occurs by T (inconclusive). Because the audit target is the conditional-expectation average $\mu_t=\frac{1}{t}\sum_{k=1}^t\mathbb E[F(W_k)\mid\mathcal F_{k-1}]$, we approximate ground truth in this semi-synthetic setting by resampling only the internal randomization (contract draws and, when applicable, participation draws) while holding the empirical (st, yi, t) path fixed. This mirrors how an auditor would reason about a fixed deployment history with stochastic policy execution.
We summarize performance with two complementary notions of error. The first is anytime false certification, the probability that the audit ever certifies when the true $\mu_t$ is below threshold at some time. Operationally we estimate
$$
\Pr\big(\exists\, t \le T:\ \mathrm{LCB}_t \ge \tau \ \text{and}\ \mu_t < \tau\big),
$$
which should be controlled at approximately $\delta$ when the deterministic bounds are correctly specified. The second is decision error with an indifference band: for a tolerance $\gamma > 0$, we treat $\mu_T \in [\tau-\gamma, \tau+\gamma]$ as ``near-threshold’’ and report false positives $\Pr(\text{certify by } T \mid \mu_T \le \tau - \gamma)$ and false negatives $\Pr(\text{not certify by } T \mid \mu_T \ge \tau + \gamma)$.
This decomposition is important for practice: near threshold,
inconclusiveness is not a failure but a truthful reflection of
insufficient evidence.
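Estimating these rates from repeated deployments then reduces to conditioning on which side of the band the ground truth falls; a sketch under the stated conventions:

```python
import numpy as np

def error_rates(results, tau, gamma):
    """False positive / false negative rates outside the indifference
    band [tau - gamma, tau + gamma].

    results: iterable of (mu_T, certified) pairs, one per deployment,
    where mu_T is the (simulation-only) ground truth and certified is
    whether LCB_t >= tau occurred by horizon T.  Near-threshold runs
    are excluded by construction.
    """
    unfair = [cert for mu, cert in results if mu <= tau - gamma]
    fair = [cert for mu, cert in results if mu >= tau + gamma]
    fp = float(np.mean(unfair)) if unfair else float("nan")
    fn = float(np.mean([not c for c in fair])) if fair else float("nan")
    return fp, fn
```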
Three qualitative results recur across datasets and injection channels. First, under conservative range bounds and correct implementation, false certification is rare and empirically tracks the nominal level: when we set δ = 0.05, the probability of ever certifying an actually unfair deployment remains near or below 5% across a broad range of κ that place μT below τ. Second, false negatives are governed primarily by variance and selection: regimes with high dispersion in yi, t and high reject elasticity produce wide confidence sequences and thus delayed or absent certification even when the system is truly fair. Third, miscalibration of the deterministic bounds is the dominant failure mode: if we intentionally underestimate the per-round range (e.g. by ignoring rare but real outlier outcomes), false certification can rise sharply. This is not surprising mathematically, but the semi-synthetic exercise puts numbers on it and motivates an institutional response: range bounds are not an innocuous modeling choice, they are a compliance-critical engineering parameter.
The semi-synthetic results suggest a concrete checklist for deploying wealth-based fairness audits.
If payments or outcomes are unbounded or subject to rare spikes, the correct response is not to ``hope the tails behave’’ but to impose caps, escrow rules, or throttles that make $\underline B,\overline B$ credible and log-verifiable. The audit’s validity is only as strong as these deterministic constraints.
Choosing F ex post invites both gaming and confusion. For sensitive functionals such as 1 − Gini, enforce a mean-wealth floor (via baseline transfers or restricted reporting windows) so that the Lipschitz proxy is well-defined on the feasible set.
Near threshold, an anytime-valid procedure will often refuse to decide quickly. Reporting should therefore include not only a binary decision but also the terminal interval $[\mathrm{LCB}_T, \mathrm{UCB}_T]$ and, when useful, the implied sample size required to separate $\tau$ from the current estimate at the desired confidence.
If the audit relies on propensities (for overlap diagnostics or counterfactual analyses), then propensity integrity must be enforced with commitments such as VRFs; otherwise, the statistical layer can be satisfied while the system is manipulable. In contrast, LL violations can be certified deterministically from payments alone, so they should be monitored continuously regardless of whether propensities are committed.
Before using an audit for enforcement, we recommend the exact semi-synthetic exercise we report here: take historical logs, inject controlled unfairness at multiple strengths, and verify that the audit (a) does not spuriously certify, and (b) has adequate power at policy-relevant effect sizes.
Taken together, these semi-synthetic exercises bridge the gap between our theoretical guarantees and a regulator’s implementation concerns: they show where the audit is robust (adaptivity, selection, heterogeneity) and where it is brittle (range miscalibration, unverifiable propensities). This sets up the broader question we turn to next: how to design contracting and logging rules so that the set of auditable fairness notions is as large as possible without unduly constraining performance.
Our results can be read as mapping a ``stability–auditability frontier’’ for fairness regulation in online contracting systems. On one axis sits the ambition of the compliance claim—for example, whether one wishes to certify a simple pointwise constraint (LL), a deployment-average scalar of realized welfare, or a distribution-sensitive functional of cumulative wealth such as 1 − Gini or a Rawlsian minimum. On the other axis sits the stability of the statistic being audited, in the precise sense needed for anytime-valid inference under adaptivity: bounded increments and a Lipschitz envelope that controls how much the target can move when the ledger moves by one round. The frontier is sharp: whenever the audited object is sensitive to rare events, to near-zero denominators (as with Gini when mean wealth is small), or to unlogged confounding (as in many counterfactual questions), validity is not merely harder—it can become ill-posed unless the institution redesigns the system to restore stability.
This perspective clarifies why wealth-based fairness is simultaneously attractive and delicate. It is attractive because wealth accumulation is what makes repeated interactions policy-relevant: small per-round disparities that are invisible in myopic outcome metrics can compound into durable inequality. But it is delicate because compounding also amplifies sensitivity. A fairness functional F(Wt) can be stable only on a restricted domain: bounded increments, bounded horizon (or controlled growth), and, for scale-normalized measures, lower bounds on relevant denominators. The practical implication is that auditability is not a purely statistical feature of the world; it is a design choice. A regulator who insists on certifying an unstable metric without imposing stabilizing design constraints is, in effect, demanding a proof without an axiom.
The most direct way to move along the frontier is to impose constraints that are both operational and log-verifiable. Payment caps, floors, escrow rules, and throttles are often discussed as product or risk controls; our contribution is to interpret them as auditability infrastructure. If $\Delta w_{j,t}\in[\underline B,\overline B]$ is violated even rarely, the mathematics of anytime-valid certification deteriorates quickly because the relevant concentration tools depend on worst-case increments. Conversely, when boundedness is enforced by design, the auditor can treat range parameters as institutional facts rather than model estimates. Similarly, for fairness functionals with normalization (e.g., $1-\mathrm{Gini}$), a mean-wealth floor $\bar w_t \ge \mu_{\min}$ is not a technical nuisance but an explicit policy lever: baseline transfers, restricted reporting windows, or a requirement to audit net-of-fixed endowments can keep the system in a domain where $F$ is Lipschitz and thus certifiable.
Any counterfactual or ``policy comparison’’ question requires overlap, and overlap is expensive because it forces the principal to sometimes select contracts that are not myopically optimal. The right goal is therefore not maximal randomization but \emph{minimal randomization sufficient for audit}. Concretely, one can impose a lower bound $q_t(b\mid s)\ge \eta$ on a small, pre-registered exploration set of contracts, where $\eta$ is chosen to meet a target audit precision at horizon $T$. This reframes exploration as compliance infrastructure: the platform purchases ``auditability capital’’ by paying a small efficiency cost. In practice we expect $\eta$ to be heterogeneous across
states—larger where fairness concerns or heterogeneity are greatest,
smaller where behavior is well understood—but the principle remains that
overlap should be treated as a first-class policy variable, not as an
incidental artifact of experimentation. Moreover, because the relevant
guarantees are anytime-valid, this exploration can be scheduled
adaptively (e.g., turned on when intervals widen) without sacrificing
inferential validity, provided the logging and propensity integrity are
maintained.
The statistical layer of any overlap-based audit depends on the integrity of the logged propensities $q_t(\cdot \mid s_t)$. Absent a commitment mechanism, a sophisticated principal can ``explain’’ any realized action ex post by retroactively editing propensities, defeating both diagnostics and counterfactual estimation. The VRF-based construction we study is a clean solution because it enforces a simple sequencing norm: propensities are declared and committed \emph{before} outcomes are realized, and the realized $b_t$ is demonstrably sampled from that declaration. Beyond cryptography, there is a governance lesson: regulators should specify not only what must be logged, but also \emph{when} it must be logged relative to outcome realization. In our setting, timing is part of the definition of manipulability. A practical corollary is that platforms should standardize propensity schemas (state variables included, discretization of $\mathcal B$, and versioning of contract classes) so that auditors can interpret $q_t$ consistently across time and across product changes.
LL illustrates a class of constraints that are deterministically auditable from logs: they admit cryptographic proofs of violation and require no probabilistic calibration. Fairness targets, by contrast, are inherently statistical when defined in expectation under adaptive behavior. Enforcement regimes should respect this difference. We view it as a mistake to treat all compliance checks as if they were of the same epistemic type. A more coherent enforcement design is two-tiered: (i) continuously monitor hard, pointwise constraints (LL, payment caps, declared contract class membership, VRF verification), and (ii) monitor soft, distributional targets with confidence sequences that explicitly allow inconclusive outcomes near threshold. This separation reduces both regulatory risk (by making ``hard’’ violations quickly contestable) and platform risk (by making clear when non-decision is the correct outcome).
Several limitations are substantive rather than cosmetic. First, our fairness guarantees rely on a Lipschitz envelope for $F$ on a bounded domain. Many appealing notions (tail risk measures, quantile-based parity constraints, or metrics that condition on rare subpopulations) are intrinsically non-Lipschitz or effectively unbounded in finite samples; auditing them may require stronger design interventions (e.g., enforced minimum sample sizes per subgroup) or different inferential tools (e.g., robust or trimmed functionals). Second, the log-based approach presumes that outcomes $y_{i,t}$ are verifiable proxies. In many platforms, outcome attribution is noisy, delayed, or manipulable by the principal (e.g., through measurement choices). Without an external measurement channel or a trusted attestation mechanism, any audit can be undermined at the measurement layer even if payments and propensities are committed.
Third, while our model accommodates strategic agent responses in the martingale sense (adaptivity), it does not model collusion, sybil attacks, or coordinated gaming of the fairness functional. A principal who can create or merge ``agents’’ can mechanically alter inequality measures without changing underlying treatment; conversely, agents might coordinate to reshape the wealth distribution. Addressing identity and collusion requires institutional primitives (identity verification, anti-sybil rules) that sit outside our statistical argument but are essential for real deployments.
Several directions are natural. One is to tighten the stability requirements. Lipschitzness is sufficient but not necessary; exploiting structure in F (e.g., smoothness, self-bounding properties) may yield narrower confidence sequences and hence less required randomization. Another is to move from one-step counterfactual auditing to truly dynamic contracting in Markov settings where actions affect future states, participation, and learning. Here, valid off-policy evaluation typically requires either bounded importance weights over trajectories or mixing assumptions that are often empirically contestable; formalizing what can be credibly assumed—and what must be engineered through policy constraints—remains open.
A third direction is mechanism design under audit constraints: if the principal internalizes that only auditable fairness notions will be enforced, how should contracts be chosen to maximize performance subject to (τ, δ)-compliance? This turns the audit from a passive diagnostic into an active design constraint, yielding a new frontier between efficiency and certifiable equity. Finally, privacy and confidentiality matter: logs rich enough to support fairness auditing can reveal sensitive information about workers or trade secrets about contracting. Developing cryptographic or differential-privacy layers that preserve auditability while protecting participants is, in our view, not an optional add-on but a central requirement for deployment.
Taken together, these considerations support a pragmatic message. Fairness auditing is feasible under adaptivity when the institution commits to stable targets, bounded ledgers, and transparent randomization. When these conditions are absent, the right response is not to overfit statistical fixes, but to redesign the system so that the desired fairness claim becomes a property that can, in fact, be audited.
We set out to clarify a simple but often blurred question in online contracting environments: when a regulator demands that a principal be ``fair,’’ what exactly is being demanded in terms of what can be verified from the operational record, and what must instead be inferred under uncertainty? Our central message is that fairness regulation in adaptive, strategically populated systems is not primarily limited by the sophistication of statistical tooling, but by the stability properties of the objects being audited and by the extent to which the platform commits to log- and timing-level primitives that make manipulation contestable.
At the conceptual level, we framed fairness auditing as an exercise in anytime-valid inference under adaptivity. The principal may change contract propensities $q_t(\cdot \mid s_t)$
as it learns or optimizes; agents may respond strategically; and the
resulting data are emphatically non-i.i.d. This adaptivity is not an
inconvenient detail: it is the defining feature of real platforms. The
appropriate target for regulation is therefore not a stationary
population quantity, but a deployment-average quantity indexed by the
information set ℱt − 1 that is available
to the platform when it acts. In our formulation, the compliance claim
concerns the running average of conditional expectations,
$$
\mu_t \;:=\; \frac{1}{t}\sum_{k=1}^t \mathbb E\!\left[F(W_k)\mid
\mathcal F_{k-1}\right],
$$
which is the natural object that remains meaningful when policies drift
and responses are endogenous.
At the technical level, our contribution is to show how this target becomes auditable with anytime-valid inference once two design commitments are in place. The first is \emph{tamper-evident logging}: an append-only record of states, realized contracts, actions, outcomes, and payments. This transforms certain compliance notions from statistical claims into deterministic checks. Limited liability is the canonical example: because payments $p_{i,t}$ are explicitly logged, LL violations are not matters of estimation, but of record. The auditor can either certify compliance by verifying $\min_{i,t} p_{i,t} \ge 0$, or point to the precise offending entry. The second commitment is \emph{verifiable randomization}: the principal must commit to $q_t(\cdot \mid s_t)$ before outcomes are realized and must sample $b_t$ in a publicly verifiable manner (e.g., via a VRF-based construction). This commitment is not merely cryptographic hygiene. It makes randomization a governed object: the platform cannot retroactively rationalize behavior by rewriting propensities, and the auditor can treat the logged propensities as factual inputs rather than as self-reported narratives.
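To illustrate how lightweight the deterministic tier is, a sketch of the LL scan (record schema assumed for illustration):

```python
def check_limited_liability(ledger):
    """Deterministic LL audit: scan payment records for any negative
    entry.  Returns (True, None) on compliance, or (False, record) with
    the first offending entry.  The (t, i, payment) schema is an
    assumption; any append-only log with logged payments works.
    """
    for record in ledger:
        t, i, payment = record
        if payment < 0:
            return False, (t, i, payment)
    return True, None
```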
With these primitives established, the remaining burden is statistical, and here stability is decisive. We emphasized that fairness metrics are auditable only to the extent that they are stable functionals of the ledger. Bounded per-round wealth increments $\Delta w_{j,t}\in[\underline B,\overline B]$ and Lipschitz continuity of the fairness functional $F$ on the relevant domain imply that the adapted process $Z_t := F(W_t)$ cannot jump arbitrarily in a single round. This bounded-difference structure is exactly what is required by modern martingale concentration tools to deliver confidence sequences that remain valid under optional stopping and continuous monitoring. The operational output is an anytime-valid interval $[\mathrm{LCB}_t, \mathrm{UCB}_t]$ for $\mu_t$ such that, with probability at least $1-\delta$, the interval contains the true deployment-average target simultaneously for all $t \le T$. Consequently, a regulator can implement a clean enforcement rule: if $\mathrm{LCB}_T \ge \tau$, then fairness compliance $\mu_T \ge \tau$ holds at level $1-\delta$, even though the platform adapted throughout deployment.
This yields an interpretive lens that we believe is useful beyond the
particular constructions in the paper. Some compliance goals are ``hard’’ in the sense of being directly testable from the ledger (nonnegativity of payments, contract-class membership, or VRF verification). Others are ``soft’’
requiring statistical tolerance near the threshold. Treating these two
classes symmetrically leads to predictable pathologies: either
over-enforcement (punishing noise as if it were misconduct) or
under-enforcement (allowing manipulation by exploiting ambiguity). Our
framework makes the separation explicit and provides a route to
operationalizing soft targets without pretending they are deterministic
facts.
The framework also clarifies the economic meaning of randomization. In many platforms, exploration is defended as a learning device internal to the firm. In regulated environments, exploration has an additional role: it creates overlap and hence makes counterfactual questions identifiable from logs. When overlap is enforced and propensities are verifiable, the auditor can ask not only whether realized fairness met a threshold, but (in restricted settings) what fairness would have been under a fixed reference policy. The important economic point is that overlap is costly: it diverts actions away from myopic profit maximization. Our analysis therefore supports a design principle that is naturally expressed in policy language: require randomization sufficient for audit precision at the relevant horizon, rather than maximal randomization or ad hoc experimentation.
We do not claim that our conditions are innocuous. The need for bounded increments and Lipschitz stability is not an artifact of proof technique; it reflects a genuine impossibility of certifying unstable objects in adversarially adaptive systems. Likewise, our reliance on verifiable outcomes $y_{i,t}$ is a substantive institutional assumption. If the principal can manipulate measurement or attribution, then even perfect payment logs and perfect VRF proofs cannot rescue the audit: the failure occurs at the sensing layer, not at the inference layer. These limitations are valuable because they point regulators and system designers toward the correct locus of intervention. When auditability fails, it is often because the system has not been engineered to support the desired claim.
Looking forward, we see three directions as especially consequential. First, there is room to sharpen the stability–efficiency trade-off: Lipschitzness is a sufficient condition, but not always tight, and exploiting additional structure in fairness metrics may reduce the data requirements for certification. Second, the interaction between audit constraints and optimal contracting remains largely open. If the principal must maintain (τ, δ)-compliance in an anytime sense, what contracts maximize performance subject to that constraint, and how does the answer depend on the volatility bounds and on the fairness functional? Third, privacy and confidentiality must be treated as co-equal design requirements. Logs that make fairness auditable can also expose sensitive worker information or proprietary policy details; resolving this tension will likely require cryptographic commitments, secure aggregation, or carefully designed disclosure regimes that preserve verifiability while limiting leakage.
The broader takeaway is pragmatic. Fairness regulation for online contracting systems can be made operational if it is formulated as a claim about stable, logged quantities; if it distinguishes hard constraints from soft targets; and if it treats randomization, logging, and timing as compliance infrastructure rather than as afterthoughts. Under these conditions, the regulator is not asking for faith, and the platform is not asked to prove the unprovable. Instead, the fairness claim becomes what it ought to be in a high-stakes economic environment: a property that can be contested, audited, and certified under explicit assumptions that are themselves subject to institutional design.