Digital sellers in 2026 face a demand environment that is simultaneously data-rich and structurally unstable. Prices can be updated in milliseconds, yet the mapping from a posted price to a purchase event is routinely perturbed by forces outside the seller’s control: rapid shifts in consumer sentiment, platform-wide recommendation changes, advertising auctions that reallocate traffic composition, regulatory interventions that alter disclosure and consent, and even competing algorithms that learn and react. In such settings, the classical premise that ``demand is stationary’’ is less a benign simplification than a brittle point of failure. When the distribution of willingness-to-pay drifts, an algorithm that optimizes against yesterday’s demand may be systematically wrong today, not because it is under-exploring, but because the object it is learning has moved.
This paper studies a particularly stark version of this problem: contextual dynamic pricing with only binary feedback. The seller observes rich covariates (the ``context’’), posts a price, and then learns only whether a purchase occurred. The economic attraction of this formulation is that it corresponds closely to practice (conversion events are far more reliably measured than valuations) while forcing the algorithm to respect informational constraints that matter for policy (privacy-preserving measurement, coarse reporting, and strategic data minimization). The technical difficulty is that binary feedback is low signal-to-noise, and nonstationarity compounds this difficulty by making old data potentially misleading rather than merely uninformative.
A large literature on contextual pricing and bandits delivers regret guarantees under stationarity: the demand curve (or the noise distribution around a parametric signal) is fixed over time. These guarantees can be impressive, but they encode a strong economic assumption: that once we control for observed covariates, the residual heterogeneity in valuations does not drift. In platform markets, that assumption is hard to defend. Moreover, the direction of misspecification is asymmetric. Stationary algorithms naturally pool information over long horizons to reduce variance; when demand drifts, this pooling creates bias, and bias is precisely what generates persistent revenue losses. Put differently, stationarity buys statistical efficiency by spending robustness.
Our starting point is the insight behind the VAPE framework: even with binary feedback, one can exploit structure in valuations to separate what is transferable across contexts from what is not. The context enters through a linear index (the ``expected valuation’’), while the residual uncertainty enters through an additive noise term that determines the purchase probability as a function of the price increment relative to that index. In the stationary model, this decomposition allows the seller to learn the linear index via occasional randomized probing and then learn the increment-demand relationship by pooling data across contexts. The algorithmic message is appealingly economic: we can treat the covariates as shifting the level of valuations, while the shape of residual demand can be learned once and reused.
The contribution of the present work is to show that this logic survives an important form of nonstationarity. We allow the demand curve in increment space to drift over time, subject to a total-variation budget. This is a natural model of gradual or intermittent changes in the distribution of unobserved willingness-to-pay after conditioning on covariates. Crucially, we do not allow the linear valuation parameter to drift; doing so would conflate two distinct phenomena (changes in how observables map into value versus changes in unobservables) and would require substantially different instrumentation. Economically, our assumption corresponds to stable preferences over observed product features and user attributes, but time-varying residual shocks (macroeconomic conditions, attention constraints, platform frictions, and measurement artifacts). This division is not merely convenient: it reflects how practitioners often build pricing systems, with a stable representation layer (features and value model) and a faster-moving calibration layer (conversion model).
Our main conceptual move is to replace global pooling of demand information with time-local pooling. If demand drifts, then the relevant comparison set is not the full history but a recent window (or, in piecewise-stationary regimes, the current segment). Doing so introduces the classic bias–variance tradeoff: shorter windows track changes but suffer higher estimation noise; longer windows reduce noise but average over outdated regimes. We formalize this tradeoff in a way that is compatible with VAPE’s cross-context learning: the seller still leverages many contexts to learn demand as a function of the increment, but only uses observations that are recent enough to be informative about the current demand curve. The resulting algorithmic family—windowed and restart-based variants of VAPE—can be interpreted as implementing an economically meaningful form of ``recency weighting’’ that disciplines how the seller extrapolates from past consumers to present ones.
We summarize our technical contributions as follows. First, we develop windowed and restart-based variants of VAPE that retain cross-context pooling in increment space while discounting stale demand observations. Second, we establish a dynamic-regret guarantee of order Õ (d2/3T2/3 + T2/3VT1/3) against the per-round optimal posted price, which recovers the stationary VAPE rate when VT = 0. Third, for piecewise-stationary demand with at most S change points, a restart-based variant attains Õ (d2/3T2/3 + T2/3S1/3) up to detection costs. Beyond the formal guarantees, the economic interpretation is straightforward. A seller who insists on a single, persistent estimate of demand implicitly assumes stable residual heterogeneity; when that assumption fails, the seller either must accept biased pricing or must discount history. Our algorithms provide a disciplined way to do the latter while retaining the gains from contextual structure. In practice, the framework suggests how a platform might separate long-run estimation of stable value components (feature effects) from short-run calibration of conversion probabilities (demand shocks), and how to tune the aggressiveness of adaptation based on measured instability.
We also acknowledge limitations. Our drift model constrains changes through a bounded-variation budget and requires a mild smoothness (Lipschitz) condition in the price increment; abrupt, adversarially oscillating demand would defeat any method that relies on recent averaging. We maintain a linear valuation structure and bounded noise, both of which may be violated in high-dimensional personalization or in markets with heavy-tailed shocks. Finally, our benchmark is per-round optimal posted pricing for myopic buyers; richer strategic interactions, inventory constraints, and fairness objectives are important but orthogonal extensions. Within these boundaries, the model illuminates the central tradeoff faced by modern pricing systems: exploiting structure to learn efficiently while retaining enough adaptivity to remain robust when the environment drifts.
We study a seller who repeatedly posts prices to a sequence of myopic buyers over rounds t = 1, …, T. The seller observes a vector of covariates (the ``context’’) xt ∈ ℝd before pricing, and then receives only binary feedback indicating whether a purchase occurred. Our emphasis is on two features that are central in modern digital pricing settings: (i) contexts can be highly non-random and may depend on the seller’s past behavior (e.g., due to platform allocation, targeting rules, or competing algorithms), and (ii) the mapping from prices to purchase probabilities may drift over time even after conditioning on the observed context.
Let ℋt − 1 := {(xs, ps, os)}s = 1t − 1 denote the public history. Each round proceeds as follows: (i) the seller observes the context xt; (ii) the seller posts a price pt ∈ [0, By]; (iii) the buyer’s valuation is realized as yt = xt⊤θ + ξt, where θ ∈ ℝd is an unknown but stable parameter and ξt is a residual shock; and (iv) the seller observes only the purchase indicator ot = 1{pt ≤ yt}. We assume |ξt| ≤ Bξ almost surely and 𝔼[ξt ∣ ℋt − 1, xt] = 0, which formalizes bounded, centered residual heterogeneity. Conditional on (ℋt − 1, xt), the shock ξt is drawn from a (possibly time-varying) distribution; across time, we impose conditional independence given the past and contexts to avoid degenerate dependence patterns that would render binary feedback uninformative.
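For concreteness, the following Python sketch simulates one round of this protocol. It is a minimal illustration under assumptions of our own: the dimension, bounds, and the clipped-logistic shock are illustrative choices, and `price_fn` is a hypothetical placeholder for any pricing policy. (The model additionally requires yt ∈ [0, By], which the sketch does not enforce.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative constants (assumptions for this sketch, not part of the model).
d, B_y, B_xi = 5, 1.0, 0.2
theta = rng.normal(size=d)
theta *= 0.4 / np.linalg.norm(theta)             # stable valuation parameter

def draw_context():
    x = rng.normal(size=d)
    return x / max(1.0, np.linalg.norm(x))        # bounded covariates

def draw_shock(t):
    # Bounded residual shock; symmetric clipping keeps the mean at zero.
    scale = 0.05 + 0.03 * np.sin(2 * np.pi * t / 500)
    return float(np.clip(rng.logistic(0.0, scale), -B_xi, B_xi))

def round_step(t, price_fn):
    """One round: observe x_t, post p_t, receive only o_t = 1{p_t <= y_t}."""
    x = draw_context()
    p = float(np.clip(price_fn(t, x), 0.0, B_y))  # posted price in [0, B_y]
    y = x @ theta + draw_shock(t)                 # latent valuation, never observed
    o = int(p <= y)                               # binary purchase feedback
    return x, p, o
```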
A convenient way to describe purchase behavior is through the increment relative to the linear index,
δ := p − x⊤θ.
Define the (time-dependent) demand curve in increment space by
Dt(δ) := ℙ(ξt ≥ δ ∣ ℋt − 1, xt),
which takes values in [0, 1]. Under the threshold-purchase rule, the expected revenue at time t for context x and price p is
πt(x, p) := 𝔼[p 1{p ≤ yt} ∣ ℋt − 1, x] = p ℙ(ξt ≥ p − x⊤θ ∣ ℋt − 1, x) = p Dt(p − x⊤θ).
To obtain meaningful tracking guarantees, we impose mild regularity on
how sharply purchase probabilities can change with the increment: each
Dt is
Lξ-Lipschitz in
δ over the relevant range
(e.g., δ ∈ [−By, By]),
so that small errors in increments translate into controlled errors in
demand. Economically, this excludes knife-edge point-mass behavior and
captures the idea that idiosyncratic heterogeneity smooths demand.
The key departure from stationary models is that the curve δ ↦ Dt(δ)
may change with t. We quantify
the extent of nonstationarity via a total-variation (in sup norm)
budget:
$$
V_T \;:=\; \sum_{t=2}^T \sup_{\delta\in[-B_y,B_y]}
\big|D_t(\delta)-D_{t-1}(\delta)\big|.
$$
This definition is intentionally distribution-free: it does not assume
parametric evolution of ξt, nor does it
tie drift to any particular covariate. Instead, it bounds the cumulative
magnitude of changes in purchase probabilities across all possible
increments. The economic interpretation is that VT measures how
much the demand environment (after controlling for xt⊤θ)
shifts over the selling horizon. A small VT corresponds
to gradual evolution (e.g., slowly changing consumer sentiment), while a
larger VT
allows intermittent disruptions (e.g., platform redesigns, policy
changes, or competitor entry) provided they do not occur too frequently
or too violently.
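To make the definition concrete, the sketch below approximates VT on a finite increment grid for a simulated sequence of demand curves; the logistic demand family and all constants are assumptions of this example, not part of the model.

```python
import numpy as np

def realized_variation(demand_curves, grid):
    """Approximate V_T = sum_t sup_delta |D_t(delta) - D_{t-1}(delta)| on a grid."""
    V = 0.0
    prev = demand_curves[0](grid)
    for D in demand_curves[1:]:
        cur = D(grid)
        V += float(np.max(np.abs(cur - prev)))   # sup norm over the grid
        prev = cur
    return V

# Example: slow sinusoidal location drift in a logistic demand family.
def make_curve(mu, sigma=0.1):
    # P(xi >= delta) for a logistic shock with location mu and scale sigma.
    return lambda delta: 1.0 / (1.0 + np.exp((delta - mu) / sigma))

curves = [make_curve(0.05 * np.sin(2 * np.pi * t / 200)) for t in range(1000)]
grid = np.linspace(-1.0, 1.0, 401)
print(realized_variation(curves, grid))          # small V_T: gradual drift
```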
It is useful to emphasize what is drifting: we keep θ fixed. This is a modeling choice that separates (i) stable valuation components tied to observed features (the representation layer) from (ii) time-varying shocks and frictions not captured by the covariates (the calibration layer). In applications, this corresponds to settings where feature engineering and the mapping from features to baseline value are updated infrequently relative to the short-run movements in conversion behavior. Allowing θ to drift is an important extension, but it changes the identification problem and would require different instrumentation and benchmarks.
Given (xt, Dt) and the parameter θ (which the seller does not know), the per-round optimal posted price is
pt*(xt) ∈ arg maxp ∈ [0, By]πt(xt, p) = arg maxp ∈ [0, By]p Dt(p − xt⊤θ).
We evaluate an online pricing policy {pt}t = 1T via its dynamic regret against this moving benchmark:
$$
R_T^{\mathrm{dyn}}
\;:=\;\sum_{t=1}^T \Big(\pi_t(x_t,p_t^*)-\pi_t(x_t,p_t)\Big).
$$
Dynamic regret is the appropriate objective when the environment
changes: even an omniscient seller who knows θ and all past demand curves would
generally not want to commit to a single time-invariant pricing rule.
The requirement is therefore to track, rather than to converge.
For comparison, one can also define a benchmark that ignores drift,
such as the best fixed pricing rule in hindsight,
$$
R_T^{\mathrm{sta}}
\;:=\;\max_{p(\cdot)} \sum_{t=1}^T \pi_t(x_t,p(x_t)) \;-\; \sum_{t=1}^T
\pi_t(x_t,p_t),
$$
where the maximization ranges over measurable maps p : ℝd → [0, By].
In a drifting environment, RTsta
can be small even when the seller performs poorly relative to pt*,
simply because the best fixed rule is itself a weak comparator. Our
focus on RTdyn
reflects an economic standard: we ask how much revenue is lost relative
to the contemporaneously optimal price that would be chosen if the
seller could instantaneously adapt to the current demand curve.
The quantity VT governs a fundamental tradeoff between statistical efficiency and robustness. Pooling long histories reduces variance in demand estimation under binary feedback, but if Dt changes, the same pooling creates bias by averaging over outdated regimes. A bounded VT formalizes the idea that recent data are informative about the near future, and it permits guarantees that degrade gracefully with instability: when VT = 0 we recover a stationary environment, while larger VT forces any algorithm to discount history more aggressively, increasing uncertainty and hence regret. In this sense, VT plays the role of an economically interpretable ``instability index’’ for residual demand, bridging algorithmic tuning (how much to forget) with market conditions (how quickly consumers and frictions evolve).
Our algorithmic design rests on a simple but powerful decomposition: once we condition on the linear index v := x⊤θ, all remaining price sensitivity can be represented as a one-dimensional function of the increment δ := p − v. This isolates what is transferable across heterogeneous contexts from what must be tracked over time.
Fix any round t and any
context–price pair (x, p). Writing p = v + δ with
v = x⊤θ,
the purchase event {p ≤ yt}
becomes
{v + δ ≤ v + ξt} = {ξt ≥ δ}.
Therefore the expected revenue admits the decomposition
πt(x, x⊤θ + δ) = (x⊤θ + δ) Dt(δ).
This identity is the key structural statement: the context x enters payoffs only through the scalar x⊤θ, while all nonstationarity enters only through the time-varying demand curve δ ↦ Dt(δ).
Economically, we can interpret θ as a stable ``representation
layer’’ mapping features into a baseline willingness-to-pay, while Dt captures
short-run fluctuations in conversion frictions and residual
heterogeneity.
This decomposition is not merely notational. It implies that learning about Dt(δ) at a given increment δ is, in principle, useful across contexts, because it corresponds to the same event {ξt ≥ δ}. This cross-context pooling is what makes it possible to handle adversarial (and high-dimensional) contexts without having to learn a separate demand curve for each region of the covariate space.
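A quick Monte Carlo check of this pooling claim, under an illustrative shared residual distribution of our own choosing: at a fixed increment δ, the pooled purchase frequency matches ℙ(ξ ≥ δ) even though the underlying contexts (and hence posted prices) vary widely.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, delta = 4, 200_000, 0.05
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)

X = 2.0 * rng.normal(size=(n, d))                       # heterogeneous contexts
v = X @ theta                                           # widely varying linear indices
xi = np.clip(rng.normal(0.0, 0.1, size=n), -0.3, 0.3)   # shared residual law

p = v + delta                      # the same increment for every context
o = (p <= v + xi)                  # purchase event reduces to {xi >= delta}
print(o.mean())                    # pooled frequency, independent of contexts
print((xi >= delta).mean())        # direct estimate of P(xi >= delta), for comparison
```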
A central obstacle is that we never observe yt; we only see the binary indicator ot = 1{pt ≤ yt}. Nevertheless, on specially designated rounds, we can convert this one bit into an unbiased proxy for the valuation.
Suppose on a probing round we draw the posted price independently
as
pt ∼ Unif([0, By]),
and post this pt regardless of
the current xt (we can later
restrict probing to particular times, but the identification argument is
per-round). Conditional on yt ∈ [0, By],
we have
$$
\mathbb{E}[o_t\mid y_t]
\;=\;\mathbb{P}(p_t\le y_t\mid y_t)
\;=\;\frac{y_t}{B_y}.
$$
Thus the transformed observation zt := By ot satisfies 𝔼[zt ∣ yt] = yt.
Taking another conditional expectation and using 𝔼[ξt ∣ ℋt − 1, xt] = 0
yields
𝔼[zt ∣ ℋt − 1, xt] = 𝔼[yt ∣ ℋt − 1, xt] = xt⊤θ.
In other words, probing produces an unbiased regression label for the
linear index xt⊤θ
even though we observe only purchases. This remains valid under
time-varying ξt distributions
because the argument does not require stationarity of Dt; it uses only
(i) independent randomization of pt on probing
rounds, (ii) bounded support so that yt ∈ [0, By],
and (iii) mean-zero residuals.
From a practical standpoint, probing is an ``instrumentation’’ device: by injecting controlled random prices occasionally, a platform can recover a signal about willingness-to-pay that would otherwise be censored by strategic pricing. The economic cost is immediate revenue loss on probe rounds (we sometimes post prices far from optimal), but the benefit is that we can estimate θ under worst-case context sequences using standard self-normalized concentration tools for ridge regression. This is precisely the tradeoff a platform faces when deciding how much experimentation to permit in production systems.
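A minimal simulation of this identification argument, with an illustrative value v0 standing in for xt⊤θ and a clipped Gaussian shock (both our assumptions): the mean of the pseudo-label zt = By ot recovers the index even though only binary outcomes are observed.

```python
import numpy as np

rng = np.random.default_rng(2)
n, B_y = 200_000, 1.0
v0 = 0.5                                            # stands in for x_t^T theta

xi = np.clip(rng.normal(0.0, 0.05, n), -0.2, 0.2)   # bounded, mean-zero shocks
y = v0 + xi                                         # latent valuations, inside [0, B_y]
p = rng.uniform(0.0, B_y, n)                        # independent uniform probe prices
o = (p <= y).astype(float)                          # binary purchase outcomes
z = B_y * o                                         # pseudo-label z_t = B_y * o_t
print(z.mean(), v0)                                 # Monte Carlo mean of z ~= v0
```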
Once we have an estimator θ̂t (built from
past probing rounds), we can form a valuation prediction v̂t := xt⊤θ̂t
and restrict our pricing decisions to the form
pt = v̂t + δ,
where δ is chosen from a
discrete set of candidate increments. If v̂t is accurate,
then the realized increment relative to the true vt = xt⊤θ
is close to δ, and the
purchase probability is close to Dt(δ)
by Lipschitzness. This reduces the problem to learning, for each
increment δ, the time-varying
conversion probability and the associated revenue (vt + δ)Dt(δ).
A limitation is worth highlighting: the above separation relies on the assumption that, conditional on xt, the distribution of ξt does not depend on xt in any additional way (beyond the mean shift already captured by xt⊤θ). If residual dispersion or skewness systematically varies with covariates, then there is no single Dt shared across contexts, and the cross-context pooling logic weakens. Our approach is therefore best viewed as a model of settings where features absorb most systematic heterogeneity, and the remaining residual is ``approximately common’’ across segments.
Even with θ known, the
per-round objective is
maxp ∈ [0, By] p Dt(p − xt⊤θ) = maxδ: xt⊤θ + δ ∈ [0, By] (xt⊤θ + δ) Dt(δ).
The feasible set for δ depends
on vt = xt⊤θ
through the constraint vt + δ ∈ [0, By],
i.e.,
δ ∈ [−vt, By − vt].
Since vt
can vary substantially across contexts—and we explicitly allow xt to be
adversarial—the union of these feasible increment intervals over time
can span essentially the full range [−By, By].
Consequently, an algorithm that only learns Dt(δ)
locally around a single neighborhood of increments is brittle: an
adversary can present contexts with very different vt, making
previously learned parts of the curve irrelevant.
This observation motivates our global increment grid. By maintaining estimates for Dt(δ) on a discretization of [−By, By], we ensure that for every context we can approximately evaluate revenues of candidate prices. The Lipschitz property provides the approximation guarantee: if δ lies within ε of a grid point δk, then |Dt(δ) − Dt(δk)| ≤ Lξε, so the discretization error can be controlled uniformly. The economic interpretation is that, to robustly personalize prices across heterogeneous customers, the platform must understand conversion behavior not only at a single ``typical’’ markup, but across a spectrum of markups and discounts relative to predicted value.
Taken together, the revenue decomposition, the probing-based identification of x⊤θ, and the need for global increment coverage lead naturally to a two-timescale learning strategy: we estimate the stable representation θ using occasional randomized probes, and we track the drifting calibration object Dt using time-local aggregation in increment space. The next section operationalizes this logic in the Windowed/Restarted VAPE algorithm.
We now operationalize the two-timescale logic described above. The stable object is the linear index vt = xt⊤θ, which we learn through occasional randomized probing rounds. The drifting object is the increment-demand map δ ↦ Dt(δ), which we track by pooling data across contexts in increment space while discounting stale observations via a sliding window (or, alternatively, explicit restarts).
Fix a grid resolution ε > 0 and define the increment
index set
𝒦 := {k ∈ ℤ: δk := kε ∈ [−By, By]}.
On a non-probing round t,
after forming a valuation prediction v̂t = xt⊤θ̂t,
we restrict attention to prices of the form
pt(k) := v̂t + δk, k ∈ 𝒦t := {k ∈ 𝒦 : pt(k) ∈ [0, By]}.
This construction ensures that, across heterogeneous contexts, we
evaluate a common set of markups/discounts relative to the predicted
baseline. When v̂t is ε-accurate, the realized increment
relative to the true vt differs from
δk by at
most ε, so Lipschitzness of
Dt
converts valuation error into controlled demand error of order Lξε.
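A small sketch of this construction (function names are ours): it builds the global increment grid and, for a given predicted valuation, returns the feasible increments whose implied prices stay in [0, By].

```python
import numpy as np

def increment_grid(B_y, eps):
    """Global increment grid {k * eps : |k * eps| <= B_y}."""
    k_max = int(np.floor(B_y / eps))
    return eps * np.arange(-k_max, k_max + 1)

def feasible_increments(grid, v_hat, B_y):
    """Indices k with p_t(k) = v_hat + delta_k inside the price range [0, B_y]."""
    prices = v_hat + grid
    return np.flatnonzero((prices >= 0.0) & (prices <= B_y))

grid = increment_grid(B_y=1.0, eps=0.05)
print(len(grid), feasible_increments(grid, v_hat=0.7, B_y=1.0)[:5])
```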
For each increment k, we
maintain a sliding-window dataset of the most recent w rounds in which increment k was played (eligibility simply
means we are not probing and the valuation predictor is certified
accurate; we formalize this certification below). Let κs ∈ 𝒦 denote
the increment index actually used at round s (when not probing), and define
windowed counts
$$
N_{t,w}^k\;:=\;\sum_{s=t-w}^{t-1}\mathbf{1}\{\kappa_s=k\}\mathbf{1}\{\text{$s$
eligible}\}.
$$
The natural estimator of Dt(δk)
is the windowed purchase frequency
$$
\hat D_{t,w}(k)\;:=\;
\begin{cases}
\frac{1}{N_{t,w}^k}\sum_{s=t-w}^{t-1}
o_s\,\mathbf{1}\{\kappa_s=k\}\mathbf{1}\{\text{$s$ eligible}\}, &
N_{t,w}^k\ge 1,\\[4pt]
\frac{1}{2}, & N_{t,w}^k=0,
\end{cases}
$$
where the default value when Nt, wk = 0
can be arbitrary in [0, 1] (it is
immediately dominated by wide confidence bounds). The central reason
this estimator is meaningful under adversarial contexts is that
conditioning on the increment k pins down the event {ξs ≥ δk}
up to the valuation approximation error, so samples from different
contexts remain comparable.
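The following Python sketch implements a version of this windowed estimator. The class name, the queue-based bookkeeping, and the default value under an empty window are our own choices; the confidence radius anticipates the c0 log T form introduced in the next display.

```python
from collections import deque
import numpy as np

class WindowedDemand:
    """Sliding-window purchase frequencies, one buffer per increment index k."""

    def __init__(self, n_arms, w, c0, T):
        self.w = w
        self.c0_logT = c0 * np.log(T)
        self.buf = [deque() for _ in range(n_arms)]   # (round, outcome) pairs

    def update(self, t, k, o):
        """Record the binary outcome o of playing increment k at round t."""
        self.buf[k].append((t, o))

    def _trim(self, t, k):
        q = self.buf[k]
        while q and q[0][0] < t - self.w:             # drop data older than w rounds
            q.popleft()

    def estimate(self, t, k):
        """Return (windowed mean, confidence radius) for D_t(delta_k)."""
        self._trim(t, k)
        n = len(self.buf[k])
        if n == 0:
            return 0.5, 1.0                           # arbitrary default, huge radius
        mean = sum(o for _, o in self.buf[k]) / n
        rad = float(np.sqrt(self.c0_logT / n))
        return mean, rad
```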
We convert D̂t, w(k)
into a time-local confidence interval using standard bounded-difference
concentration on windowed averages. A convenient form is
$$
\mathrm{rad}_{t,w}(k)\;:=\;
\sqrt{\frac{c_0\log T}{\max\{1,N_{t,w}^k\}}},
\qquad
\mathrm{UCB}^D_{t,w}(k):=\min\{1,\hat
D_{t,w}(k)+\mathrm{rad}_{t,w}(k)\},
\quad
\mathrm{LCB}^D_{t,w}(k):=\max\{0,\hat
D_{t,w}(k)-\mathrm{rad}_{t,w}(k)\},
$$
for a constant c0
chosen to deliver an overall failure probability at most T−c after a
union bound over t and k.
On a non-probing round, we rank increments by an upper confidence
bound on revenue. Because the instantaneous revenue at increment δ is (vt + δ)Dt(δ)
and we only know v̂t and D̂t, w(k),
we use a robust proxy that accounts for (i) statistical uncertainty in
D̂t, w(k),
(ii) the fact that the true increment differs from δk by at most
ε when |v̂t − vt| ≤ ε,
and (iii) discretization error from optimizing over a grid. One
convenient choice is
$$
\mathrm{UCB}_{t,w}(k)\;:=\;(\hat v_t+\delta_k+\varepsilon)\,\mathrm{UCB}^D_{t,w}(k)\;+\;B_y L_\xi\,\varepsilon,
\qquad
\mathrm{LCB}_{t,w}(k)\;:=\;(\hat v_t+\delta_k-\varepsilon)\,\mathrm{LCB}^D_{t,w}(k)\;-\;B_y L_\xi\,\varepsilon,
$$
where the ±ε terms crudely
propagate index uncertainty into the multiplicative price factor, and
the ByLξε
term controls the demand change induced by the increment mismatch. We
then choose
kt ∈ arg maxk ∈ 𝒦tUCBt, w(k), pt = v̂t + δkt,
observe ot ∈ {0, 1}, and
update only the statistics for kt (dropping
data older than w rounds).
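Putting the pieces together, here is a sketch of the per-round selection rule, reusing the WindowedDemand sketch above (all names are illustrative): it scans the feasible grid and maximizes the robust revenue UCB just described.

```python
import numpy as np

def select_increment(t, v_hat, grid, demand, B_y, eps, L_xi):
    """Pick the increment maximizing the robust revenue UCB over feasible k."""
    best_k, best_val = None, -np.inf
    for k, delta in enumerate(grid):
        price = v_hat + delta
        if price < 0.0 or price > B_y:
            continue                                  # infeasible increment
        D_hat, rad = demand.estimate(t, k)            # windowed mean and radius
        ucb_D = min(1.0, D_hat + rad)
        # (price + eps): propagate index error into the price factor;
        # B_y * L_xi * eps: demand perturbation from the increment mismatch.
        val = (price + eps) * ucb_D + B_y * L_xi * eps
        if val > best_val:
            best_k, best_val = k, val
    return best_k                                     # play p_t = v_hat + grid[best_k]
```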
The algorithm above leaves open how we ensure that non-probing rounds indeed satisfy |v̂t − vt| ≤ ε. We use the same self-normalized control as in stationary VAPE: probe precisely when the design matrix Vt − 1 suggests high uncertainty in direction xt, i.e., when ∥xt∥Vt − 1−1 > μ. Standard ridge concentration then yields an event on which, for all non-probing rounds, the index error is uniformly bounded by a target tolerance that we identify with ε (after absorbing constants and logarithms into the choice of μ and λ). This mechanism is robust to adversarial contexts because the probing decision depends only on xt and past probes, not on any stationarity of ξt.
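A compact sketch of this probing/certification layer, under assumed default values for λ and μ: ridge statistics are updated only on probe rounds, and the probe trigger fires when the self-normalized norm ∥xt∥Vt − 1−1 exceeds μ.

```python
import numpy as np

class RidgeIndex:
    """Ridge estimation of theta from probe rounds, with an uncertainty trigger."""

    def __init__(self, d, lam=1.0, mu=0.1, B_y=1.0):
        self.V = lam * np.eye(d)          # regularized design matrix V_t
        self.b = np.zeros(d)
        self.mu, self.B_y = mu, B_y

    def should_probe(self, x):
        # Probe when directional uncertainty ||x||_{V^{-1}} exceeds mu.
        return float(np.sqrt(x @ np.linalg.solve(self.V, x))) > self.mu

    def update(self, x, o):
        # Only probe rounds feed the regression, via the pseudo-label B_y * o.
        self.V += np.outer(x, x)
        self.b += (self.B_y * o) * x

    def predict(self, x):
        """Valuation prediction v_hat = x^T theta_hat."""
        return float(x @ np.linalg.solve(self.V, self.b))
```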
Windowing is the most direct way to respect the variation budget: old data are down-weighted by being discarded. When drift is abrupt (piecewise stationary), explicit restarts can be preferable. A simple implementation is to run, for each k with enough samples, a two-sample test comparing purchase frequencies in the most recent w/2 eligible plays of k to the preceding w/2; if the absolute difference exceeds a threshold of order $\sqrt{\log T/n}$, we declare a change. In response, we reset the demand statistics (and optionally restart the probing/estimation module, although θ is stable in our model and need not be reset). This restart view connects our method to change-point bandits while preserving the key cross-context pooling in increment space.
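A minimal version of such a two-sample restart test (the threshold constant is an assumption): it compares the purchase frequencies in the two half-windows of a given increment and flags a change when the gap exceeds the stated order.

```python
import numpy as np

def change_detected(recent, earlier, T, thresh_c=2.0):
    """Two-sample test on Bernoulli outcomes for one increment.

    recent, earlier: arrays with the last w/2 and the preceding w/2 outcomes.
    Declares a change when the frequency gap exceeds ~ sqrt(log T / n).
    """
    n = min(len(recent), len(earlier))
    if n == 0:
        return False
    gap = abs(float(np.mean(recent)) - float(np.mean(earlier)))
    return gap > thresh_c * np.sqrt(np.log(T) / n)
```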
Let |𝒦| ≍ By/ε be the number of increments. A literal implementation computes UCBt, w(k) for all feasible k each non-probing round, costing O(|𝒦|) time per round; the ridge update on probe rounds costs O(d2) per probe (or O(d) with recursive least squares updates and maintained inverse). Storage for demand tracking is O(|𝒦|w) if we keep full queues, but can be reduced to O(|𝒦|) by storing only windowed counts and sums together with timestamps (or by using exponential forgetting). In typical regimes where ε is chosen as a negative power of T, |𝒦| is polynomial in T; our regret bounds below remain meaningful precisely in the range where this discretization is statistically warranted.
The key design parameters are the grid spacing ε (bias–complexity tradeoff), the window length w (variance–drift tradeoff), and the probing threshold μ (exploration cost for learning θ). We treat these as functions of (T, d, VT) chosen to balance the corresponding regret contributions; the next section states the resulting rates and shows how the drift budget VT enters only through the choice of time localization (windowing or restarts), while the dimension d enters only through the probing needed to stabilize v̂t.
Our performance criterion is dynamic regret against the per-round optimal posted price,
$$
R_T^{\mathrm{dyn}}
:=\sum_{t=1}^T\Big(\pi_t(x_t,p_t^*)-\pi_t(x_t,p_t)\Big),
\qquad
p_t^*\in\arg\max_{p\in[0,B_y]} p\,D_t\!\big(p-x_t^\top\theta\big),
$$
which is the natural benchmark when demand conditions drift but the
seller is evaluated relative to what could be achieved with
contemporaneous knowledge of Dt.
The following theorem states that Windowed/Restarted VAPE attains sublinear dynamic regret while allowing (i) adversarially chosen contexts, (ii) only binary purchase feedback, and (iii) time-varying demand curves subject to the total-variation budget VT.
The bound makes the algorithmic tradeoffs transparent. Larger windows reduce sampling noise but increase sensitivity to drift; finer grids reduce discretization but require more data per increment and amplify the drift penalty (because drift must be controlled uniformly over a larger effective action set).
Optimizing the right-hand side in (w, ε) yields the stated headline rate.
Two limiting cases are worth highlighting. When drift is mild (VT small), the second term is negligible and the rate is essentially the stationary benchmark driven by learning θ under adversarial contexts. When drift is substantial, the term T2/3VT1/3 dominates and reflects the unavoidable price of tracking a time-varying demand curve with only censored (binary) feedback.
Setting VT = 0 (i.e., Dt ≡ D for all t) collapses the nonstationary component and recovers the stationary performance of VAPE up to log factors.
This corollary clarifies the conceptual separation embedded in the model: the linear index xt⊤θ is stable and is learned at the same cost as in stationary VAPE, whereas all time variation is confined to the one-dimensional demand object δ ↦ Dt(δ).
The d2/3T2/3 term stems from the fact that contexts may be adversarial and we only see binary feedback. Probing ensures sufficient excitation to certify that xt⊤θ̂t is uniformly accurate on non-probing rounds; the elliptical-potential argument then converts this certification into a sublinear exploration cost. Notably, demand drift does not directly interfere with this step: the probing analysis relies on bounded martingale differences and the symmetry/centering used to form unbiased pseudo-labels, not on stationarity of Dt.
The dependence on VT enters only through demand tracking in increment space. The scaling T2/3VT1/3 is the characteristic balance one obtains when a learner must choose a time-localization scale (here a window) to trade off variance ( ∝ w−1/2) against nonstationary bias ( ∝ wVT), with an additional ε-layer because we learn demand on an increment grid rather than on a continuum.
A direct attempt to run a standard contextual bandit or contextual pricing method in our setting faces two obstacles. First, with adversarial contexts, one cannot pool across heterogeneous xt without a structural reduction: the same posted price p corresponds to different effective markups δ = p − xt⊤θ, so purchase outcomes at price p are not comparable across contexts. Second, in a drifting environment, using all historical data (as a stationary algorithm would) induces a systematic bias that can accumulate over time; in the worst case, regret scales polynomially with T because outdated demand estimates persistently mis-rank prices.
Our approach avoids both issues by (i) translating every action into an increment δ relative to a learned baseline xt⊤θ, so that demand data become comparable across contexts, and (ii) localizing demand estimation in time via windows or restarts, so that the effect of drift is paid only through VT rather than through T itself. Relative to stationary VAPE, the only new phenomenon is the drift-induced bias term, and the analysis shows precisely how to tune time localization to keep this term sublinear.
When demand changes are abrupt rather than gradual, restarts can
replace windows. If Dt is piecewise
stationary with at most S
change points, a restart-based variant (with a standard two-sample test
on increment-level purchase frequencies) yields a bound of the
form
RTdyn ≤ Õ (d2/3T2/3) + Õ (T2/3S1/3),
up to detection-delay and false-alarm costs. This aligns with the
economic intuition that a small number of regime shifts should be
cheaper to track than continuous drift with the same horizon, provided
shifts can be detected quickly from binary feedback.
We prove the main theorem by decomposing regret into four components—(i) valuation approximation (learning the stable linear index xt⊤θ), (ii) discretization in increment space, (iii) statistical error in estimating demand from binary outcomes, and (iv) a drift-induced bias term that is specific to the nonstationary environment. The main technical point is that only the last component truly requires nonstationary tools; the valuation-learning step is essentially unchanged from stationary VAPE because θ is time-invariant and the probing construction does not assume a fixed Dt.
Using the structural identity
πt(x, p) = p Dt (p − x⊤θ) and p = x⊤θ + δ,
we rewrite expected revenue as πt(x, x⊤θ + δ) = (x⊤θ + δ)Dt(δ).
This isolates all time variation in the one-dimensional object δ ↦ Dt(δ),
while all context dependence enters through the level vt := xt⊤θ.
The algorithm therefore aims to (a) maintain an accurate estimate v̂t := xt⊤θ̂t
and (b) run a time-local bandit-style procedure over increments δ using windowed data. The proof
mirrors this separation: we bound the regret from errors in v̂t, then treat
the remaining problem as choosing (approximately) optimal increments
under drifting demand.
Fix a round t and let δt*
denote the optimal increment given the true level vt:
δt* ∈ arg maxδ: vt + δ ∈ [0, By](vt + δ)Dt(δ).
The algorithm instead chooses δ̂t based on
v̂t and
windowed estimates of Dt on a grid. We
insert an intermediate ``oracle-on-the-grid’’ benchmark that knows Dt but is
constrained to the discretized increment set {δk = kε}
and uses vt rather than
v̂t.
Concretely, we decompose per-round regret into
$$
\pi_t(x_t,p_t^*)-\pi_t(x_t,p_t)
\;\le\;
\underbrace{\big[\pi_t(v_t,\delta_t^*)-\pi_t(v_t,\delta_{k_t^\star})\big]}_{\text{discretization}}
+\underbrace{\big[\pi_t(v_t,\delta_{k_t^\star})-\pi_t(\hat
v_t,\delta_{k_t^\star})\big]}_{\text{valuation approximation}}
+\underbrace{\big[\pi_t(\hat v_t,\delta_{k_t^\star})-\pi_t(\hat
v_t,\hat\delta_t)\big]}_{\text{demand learning on a grid}},
$$
where δkt⋆
is the best grid increment for the true level vt (and the
appropriate feasibility constraint). The first term is controlled by
Lipschitzness and boundedness, yielding the O(ε) loss per round that
aggregates to Õ(Tε). The second
term is controlled by the uniform approximation guarantee |v̂t − vt| ≤ ε
on non-probing rounds (and a direct bound on probing rounds). The third
term is where we use windowed confidence bounds and where drift
enters.
We create pseudo-labels using randomized probing prices whose
distribution is symmetric and independent of ξt, so that the
corresponding regression noise remains a bounded martingale difference.
The key lemma shows that, after a controlled number of probe rounds,
ridge regression yields a high-probability uniform guarantee of the
form
|xt⊤(θ̂t − θ)| ≤ ε on
all non-probing rounds.
The proof uses standard self-normalized concentration for linear
martingales plus the elliptical potential argument to bound the number
of rounds on which ∥xt∥Vt−1
is large (and thus additional probing is needed). Importantly, this step
does not require stationarity of Dt: the
regression analysis conditions on the past and uses only boundedness and
centering/symmetry to ensure unbiasedness of the constructed labels.
This yields the Õ(d2/3T2/3)
contribution to regret, matching stationary VAPE up to logs and
constants.
For each increment δk, we pool
outcomes across contexts by translating posted prices into increments
using v̂s
at the time of play. Over the last w eligible rounds, we form the
empirical purchase frequency D̂t, w(k)
and then a reward estimate
π̂t, w(k) := (v̂t + δk)D̂t, w(k),
together with time-local upper and lower confidence bounds. The analysis
of D̂t, w(k)
splits into three errors:
$$
\hat D_{t,w}(k) - D_t(\delta_k)
=
\underbrace{\big(\hat D_{t,w}(k)-\mathbb{E}[\hat
D_{t,w}(k)\mid\mathcal{F}]\big)}_{\text{sampling noise}}
+\underbrace{\big(\mathbb{E}[\hat D_{t,w}(k)]-\tilde
D_t(\delta_k)\big)}_{\text{valuation approximation / discretization
bias}}
+\underbrace{\big(\tilde D_t(\delta_k)-D_t(\delta_k)\big)}_{\text{drift
bias}},
$$
where D̃t
denotes the effective average demand curve over the window induced by
using past data. The first term is handled by Hoeffding/Freedman
inequalities for bounded Bernoulli variables (uniformly over k and t via a union bound). The second
term is controlled via Lipschitzness of Ds(⋅) and the
fact that using v̂s instead of
vs
perturbs the realized increment by at most ε, contributing an O(Lξε)
bias. The third term is the genuinely nonstationary term: comparing
Ds(δ)
to Dt(δ)
within the window yields
$$
\sup_\delta |D_s(\delta)-D_t(\delta)|
\;\le\;
\sum_{u=s+1}^t \sup_\delta |D_u(\delta)-D_{u-1}(\delta)|,
$$
so the window-average mismatch is bounded by the local variation budget
$V_{t,w}:=\sum_{u=t-w+1}^t
\sup_\delta|D_u(\delta)-D_{u-1}(\delta)|$, and summing over t leads to the global term O(wVT)
after careful counting.
Given time-local UCB/LCB intervals for πt(v̂t, δk), we analyze the increment-selection rule (optimism or elimination) exactly as in stationary VAPE, except that the confidence radius now includes both a variance term scaling like $\sqrt{1/N_{t,w}^k}$ and an additive bias term scaling like Vt, w. The usual argument shows that, on rounds when a suboptimal increment is chosen, either (i) its confidence interval still overlaps the optimal one (which can happen only a limited number of times before enough samples accrue in the window), or (ii) the choice is explained by drift contamination (charged to Vt, w). Summing these contributions yields a demand-learning term $\tilde O(T\sqrt{1/w})$ and a drift term Õ(wVT/ε) (the factor 1/ε arises because we control regret uniformly over a grid, so we allocate variation budgets across increments at resolution ε).
Finally, we sum the four contributions: probing/valuation-learning regret, discretization loss, windowed statistical error, and drift bias. This produces the bound in the main theorem. Optimizing in (w, ε) then delivers the stated Õ (d2/3T2/3 + T2/3VT1/3) rate, making explicit the economic tradeoff: time localization is valuable only to the extent demand drifts, and its cost is the increased variance from discarding older data.
Our main guarantee is stated in terms of the (a priori unknown) variation budget VT, and the prescribed tuning of the window length w (and grid spacing ε) implicitly trades off statistical precision against nonstationary bias. In practice, the platform rarely has a credible numerical upper bound on VT, and even if it did, the relevant drift may be highly nonuniform over time. A natural extension is therefore to make time localization adaptive.
One simple approach is to maintain a finite set of candidate window
lengths 𝒲 = {w1, …, wM}
(e.g., a dyadic grid from 1 up to T) and run a separate instance of
the increment-selection procedure for each w ∈ 𝒲, each instance producing a
recommendation δ̂t(w)
and an estimated (upper) value for playing it. We then combine these
instances using a bandit-over-bandits meta-algorithm (e.g., exponential weights with importance weighting) that selects which window to follow each round. Because realized revenue is observed regardless of which window is used, the meta-layer can treat each base instance as an ``expert’’ whose advice is a price $p_t^{(w)}=\hat v_t+\hat\delta_t^{(w)}$. Under standard stability conditions (bounded losses/rewards and controlled importance weights), this yields an adaptive guarantee of the form \[ R_T^{\mathrm{dyn}} \;\lesssim\; \min_{w\in\mathcal{W}} \Big\{\tilde O\!\big(d^{2/3}T^{2/3}\big) +\tilde O\!\big(T\sqrt{1/w}\big) +\tilde O\!\big(wV_T/\varepsilon\big) +\tilde O(T\varepsilon)\Big\} \;+\;\tilde O\!\big(\sqrt{T\log M}\big), \] so we pay only a modest overhead for not knowing the right time scale. Economically, this meta-selection corresponds to letting the platform endogenously decide how quickly it ``forgets’’
old demand information: in calm periods it will put weight on longer
windows (low variance), while around shocks it will switch toward
shorter windows (fast tracking). A limitation is computational: running
many windows in parallel multiplies the bookkeeping for {Nt, wk}
and the confidence bounds. In implementations, we can share sufficient
statistics across nested windows or restrict to a small dyadic
family.
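A sketch of this meta-layer in the spirit of exponential weights with importance weighting (the learning rate, normalization, and interface are our assumptions, not a tuned implementation):

```python
import numpy as np

class WindowSelector:
    """Exponential-weights meta-layer over candidate window lengths."""

    def __init__(self, windows, T, B_y, rng):
        self.windows = list(windows)
        self.logw = np.zeros(len(self.windows))
        # Illustrative learning rate for M experts over horizon T.
        self.eta = np.sqrt(np.log(len(self.windows)) / T)
        self.B_y, self.rng = B_y, rng

    def pick(self):
        """Sample which base instance (window length) to follow this round."""
        w = np.exp(self.logw - self.logw.max())
        prob = w / w.sum()
        i = int(self.rng.choice(len(self.windows), p=prob))
        return i, float(prob[i])

    def update(self, i, prob_i, revenue):
        # Importance-weighted reward update for the followed instance only.
        r_hat = (revenue / self.B_y) / prob_i    # unbiased in expectation
        self.logw[i] += self.eta * r_hat
```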
A similar issue arises for discretization. The grid spacing ε governs both approximation error
and how finely we must control drift across increments. When Lξ is large
(steep demand), small ε is
valuable; when drift is severe, coarse discretization can be preferable
because it reduces the effective number of increment ``arms’’ that must be tracked. One can again deploy a multiscale scheme with $\varepsilon\in\mathcal{E}$, each instance maintaining its own increment grid $\{\delta_k\}$, and aggregate using a meta-algorithm. Alternatively, one may refine the grid online: start coarse, and split a bin only when its confidence interval is small relative to its width (a one-dimensional ``zooming’’
heuristic in increment space). While a full analysis would be lengthy,
the key point is that the increment space is one-dimensional and
bounded, so adaptive discretization can be done without compromising the
adversarial-context robustness of the valuation-learning layer.
Windowing is a soft form of discounting; a complementary strategy is hard resets when evidence suggests that Dt has changed materially. This is especially natural in environments with abrupt breaks (policy changes, competitor entry, supply disruptions), where mixing pre- and post-break data inside a window is economically undesirable even if VT is moderate in aggregate.
A restart-based variant proceeds as follows. We maintain the valuation estimator θ̂t continuously (since θ is stable), but we segment time into episodes within which we treat demand as approximately stationary and run a stationary VAPE-style increment learner. To detect changes, we can monitor, for each increment δk, a statistic comparing purchase rates in two adjacent subwindows (CUSUM), or a generalized likelihood ratio (GLR) test for Bernoulli means. Because the algorithm pools data by increment (after translating by v̂s), the relevant inputs are the binary outcomes os associated with (approximate) plays of δk. When the test for any k crosses a threshold, we restart the demand learner (clear demand statistics and confidence sets), while keeping the current θ̂t and continuing probing only as needed to preserve the ε-accurate valuation guarantee.
The benefit of restarts is conceptual and practical: when a break occurs, we prevent contaminated confidence bounds that would otherwise be overconfident due to large sample sizes accumulated under an obsolete regime. The cost is that false alarms create unnecessary resets, and detection delay means we may still incur some regret after a break. Both can be controlled with standard concentration tools, yielding guarantees that scale with the number of true changes and the chosen thresholds.
If demand is piecewise stationary with at most S change points, then the restart
perspective becomes canonical. In that case, we can express regret as
the sum of (i) within-segment learning costs and (ii) switching overhead
from detection and restart. With an oracle that knows the change points,
we would simply run stationary VAPE independently on each segment,
obtaining approximately
$$
\sum_{j=1}^{S+1} \tilde O\!\big(d^{2/3}T_j^{2/3}\big)
\;\le\;
\tilde O\!\big(d^{2/3}T^{2/3}(S+1)^{1/3}\big),
$$
where Tj
is segment length and we used concavity of x2/3. With online
detection, an additional term appears that depends on the detection
delay and false-alarm rate; in regimes where breaks are well-separated
and sufficiently large (in the sense that the change in Dt(δ)
exceeds the confidence radius for some increment δ), one can often sharpen this to
bounds closer to $\tilde
O(d^{2/3}T^{2/3}+\sqrt{ST})$. From an economic standpoint, this
corresponds to markets with stable ``seasons’’ punctuated by regime
changes: the platform benefits from explicitly recognizing regimes
rather than continuously averaging.
Our baseline permits adversarial contexts, which is essential in applications where the platform can influence the context stream (through targeting, ranking, or experimentation) or where contexts correlate with unmodeled state variables. In settings where contexts are stochastic and satisfy favorable coverage conditions (e.g., xt i.i.d. with well-conditioned covariance), the valuation-learning layer can be improved. Specifically, ridge regression under i.i.d. designs typically yields estimation error scaling like $\tilde O(\sqrt{d/t})$, and the number of probe rounds needed to ensure |xt⊤(θ̂t − θ)| ≤ ε can drop substantially relative to the elliptical-potential worst case. This can reduce the d2/3T2/3 component toward more classical $\tilde O(\sqrt{dT})$ behavior, while the nonstationary demand component (windowing or restarting) remains essentially unchanged. The broader message is that our separation between a stable linear index and a drifting one-dimensional demand curve is modular: improvements in either layer (better covariate structure, richer demand smoothness, or stronger feedback) translate directly into improved overall performance without altering the economic logic of increment pooling.
Our theoretical results are intentionally modular—a stable linear index vt = xt⊤θ coupled with a drifting one-dimensional demand curve δ ↦ Dt(δ). An empirical illustration should therefore stress-test both modules separately and jointly, and report diagnostics that map cleanly back to the terms in the regret decomposition (estimation variance, discretization error, and drift bias). We outline a synthetic evaluation protocol that is faithful to the binary-feedback setting and makes drift transparent and controllable.
We fix d, bounds Bx, Bθ, Bξ,
and draw a ground-truth θ
uniformly on the sphere scaled to ∥θ∥2 = Bθ
(or from a Gaussian then rescaled). Contexts xt are generated
under two regimes: (i) a stochastic regime, with xt ∼ 𝒩(0, Σ) truncated to satisfy ∥xt∥2 ≤ Bx and Σ well-conditioned; and (ii) an adversarial regime, where we cycle through nearly collinear directions to create difficult ridge geometry (large ∥xt∥Vt−1), while still respecting ∥xt∥2 ≤ Bx.
In both cases, the seller observes only (xt, pt, ot).
For demand, we specify a family of noise distributions ξt with bounded
support and explicit drift. A convenient choice is a truncated logistic
or truncated Gaussian, parameterized by a location μt and scale
σt, so
that
Dt(δ) = ℙ(ξt ≥ δ) = 1 − Fμt, σt(δ), δ ∈ [−By, By],
and we enforce |ξt| ≤ Bξ
by truncation. This allows us to implement (a) smooth gradual drift
(changing μt slowly), (b)
time-varying steepness (changing σt), and (c)
structured non-monotone changes via mixtures (e.g., switching between
two components to mimic heterogeneity shocks). Because the theory
measures drift via the sup-norm variation budget VT, we record
both the realized VT and the local
window variation Vt, w
to compare against observed performance.
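A generator for this family (a truncated logistic, one of the choices named above; function names and defaults are ours) that also exposes the (μt, σt) path used in the seasonal regime below:

```python
import numpy as np

def truncated_logistic_demand(delta, mu, sigma, B_xi):
    """D(delta) = P(xi >= delta) for a logistic shock truncated to [-B_xi, B_xi]."""
    F = lambda u: 1.0 / (1.0 + np.exp(-(u - mu) / sigma))   # untruncated CDF
    lo, hi = F(-B_xi), F(B_xi)
    c = np.clip(F(np.asarray(delta, dtype=float)), lo, hi)
    return 1.0 - (c - lo) / (hi - lo)       # survival of the truncated law

def seasonal_path(T, mu0=0.0, a=0.1, P=500, sigma0=0.1):
    """Sinusoidal location drift with fixed scale, as in regime (i) below."""
    t = np.arange(T)
    return mu0 + a * np.sin(2 * np.pi * t / P), np.full(T, sigma0)

mus, sigmas = seasonal_path(T=2000)
D_500 = truncated_logistic_demand(0.05, mus[500], sigmas[500], B_xi=0.5)
print(D_500)   # purchase probability at increment 0.05 at round 500
```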
We propose three canonical regimes that correspond to common market
narratives while remaining analytically interpretable.
(i) Seasonal drift: set μt = μ0 + a sin(2πt/P)
with period P and amplitude
a, holding σt fixed. This
creates predictable oscillations in demand at every increment δ and highlights the windowing
tradeoff: longer windows reduce variance but blur seasonal swings.
(ii) Abrupt shocks: set (μt, σt)
constant within segments and apply S abrupt changes at random times.
This isolates the benefit of restarts and change-point detection, and
permits reporting detection delay and false-alarm rates alongside
revenue.
(iii) Mixed drift: combine (i) and (ii), or add a slow random walk in μt with
occasional jumps. This setting is empirically plausible (slow macro
drift with discrete competitive events) and is useful for comparing
adaptive-window meta-selection against pure windowing or pure
restarts.
We evaluate: (a) Windowed VAPE with a grid spacing ε and fixed window w; (b) Restarted VAPE with a CUSUM/GLR detector on selected increments δk; (c) adaptive-window VAPE via a small dyadic family 𝒲 combined by a meta-algorithm; and, as benchmarks, (d) stationary VAPE (effectively w = T) to quantify degradation under drift; (e) a valuation-only policy that uses only v̂t = xt⊤θ̂t and posts pt = v̂t (no increment learning), to show the value of learning Dt; and (f) an oracle that knows (θ, Dt) and posts pt*,
which defines the regret comparator. For fairness, all non-oracle
methods operate under the same price cap [0, By] and the
same probing schedule (or, if probing is endogenous, we report its
realized frequency).
To isolate mechanisms, we include targeted ablations: (1) hold θ known and learn only demand (removing the valuation layer); (2) hold Dt known and learn only θ (removing demand nonstationarity); (3) turn off probing to illustrate failure under adversarial contexts; and (4) vary ε and w to empirically recover the bias–variance–drift tradeoff predicted by the bound terms Tε, $T/\sqrt{w}$ (up to logs), and wVT/ε.
Beyond cumulative revenue, we emphasize time-resolved measures that diagnose when and why regret occurs.
Primary outcomes are: (i) dynamic regret ∑t(πt(xt, pt*) − πt(xt, pt)), computed with Monte Carlo access to πt (possible in simulation), and (ii) the realized-revenue ratio ∑trt/∑trt* to give an economically interpretable scale. We also report: (iii) |δ̂t − δt*|
and |pt − pt*|;
(iv) |xt⊤(θ̂t − θ)|
on non-probing rounds to confirm the intended ε-accuracy; (v) Nt, wk
across increments to show which parts of increment space are being
learned; and (vi) for restart-based methods, detection delay (time between a true break and the first restart) and false-alarm frequency. Finally, since our analysis relies on confidence bounds, we can directly assess coverage: the fraction of times Dt(δk)
lies within the constructed interval, stratified by drift intensity.
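In simulation, the primary outcome (i) can be computed directly because the simulator knows Dt. A sketch of this computation (assuming callables for the true demand curves; all names are ours):

```python
import numpy as np

def dynamic_regret(prices, indices, curves, B_y, n_grid=512):
    """R_T^dyn against the contemporaneous optimum (simulator-side only).

    prices:  posted prices p_t; indices: true v_t = x_t^T theta;
    curves:  callables with curves[t](delta) = D_t(delta).
    """
    p_grid = np.linspace(0.0, B_y, n_grid)
    total = 0.0
    for p, v, D in zip(prices, indices, curves):
        rev = p_grid * D(p_grid - v)        # expected revenue over a price grid
        total += float(rev.max()) - p * float(D(p - v))
    return total
```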
We recommend plotting cumulative regret and instantaneous regret
(per-round gap) against time, annotated with true shock times and with
estimated change points. For seasonality, heat maps of chosen increments
δ̂t over
t visually reveal whether the
learner tracks periodic structure or averages it out. Stress tests
should vary (d, T),
the steepness proxy Lξ (via σt), and the
drift magnitude (via a, jump
sizes, and S), and should
explicitly report the realized VT so that
performance can be normalized by the theoretical drift scale. The goal
is not to ``win’’ a leaderboard, but to validate the economic logic of
increment pooling under drift and to clarify when windowing suffices
versus when explicit restarts are operationally superior.
Our model deliberately separates two economic forces that are often conflated in practice: a relatively stable valuation index vt = xt⊤θ (how observable attributes translate into willingness to pay) and a one-dimensional, potentially drifting demand relationship δ ↦ Dt(δ) (how acceptance responds to a markup/markdown relative to the index). This decomposition is not merely a proof device; it suggests a pragmatic design principle for dynamic pricing systems operating under nonstationarity and limited feedback. In particular, it clarifies what should be learned across contexts (the mapping from x to v) versus what should be tracked (the demand curve over δ), and it makes the costs of each layer legible through quantities like the drift budget VT and window length w.
A central operational challenge is that learning requires
experimentation, yet experimentation is precisely what creates short-run
revenue risk and, in regulated settings, potential consumer harm. Our
setting is informative because it uses only binary feedback ot = 1{pt ≤ yt},
which mirrors many real marketplaces where the seller observes
purchase/no-purchase but not latent willingness to pay. The regret
decomposition that underlies our guarantee can be read as a safety
ledger: discretization and estimation contribute terms scaling like
Tε and (up to
logarithms) $T/\sqrt{w}$, while drift
contributes a term of order wVT/ε.
This immediately yields two safety-relevant prescriptions. First, constrain exploration:
probing prices are not an invitation to explore arbitrarily high
markups, but a targeted way to maintain |xt⊤(θ̂t − θ)| ≤ ε
on non-probing rounds even under adversarial context sequences. In
practice, this suggests implementing explicit guardrails on experimentation, such
as restricting probes to a conservative interval within [0, By],
enforcing per-customer price change limits, and capping the fraction of
traffic assigned to probing when demand is highly sensitive. Second, never stop adapting.
Under sustained drift, the relevant comparator is the per-round optimum
pt*,
not a fixed price, so a policy that ``graduates’’ from exploration to
exploitation can become systematically miscalibrated. Windowing (or
restarts) operationalizes this insight: it effectively commits the
system to forgetting outdated demand information at a controlled rate,
trading variance for responsiveness.
There is also a policy-relevant asymmetry in what can safely be pooled. Estimating θ benefits from pooling across time and contexts because θ is assumed stable; indeed, the probing frequency needed to control valuation error depends on geometry (through ∥xt∥Vt−1) rather than on demand drift. By contrast, pooling demand observations across long horizons can be unsafe under drift: outdated acceptance data can induce persistent overpricing or underpricing, with welfare and revenue consequences. The model therefore supports a two-speed architecture: slow, conservative updates to θ̂ (with explicit uncertainty tracking) coupled with fast, windowed updates to D̂t, w(k).
Because nonstationarity is the norm rather than the exception in many
markets, governance should treat drift as a first-class object to be
monitored. The theoretical quantities VT and Vt, w
suggest concrete diagnostics: we can track, for each increment grid
point δk,
the discrepancy between short-window and long-window demand estimates,
or equivalently the instability of D̂t, w(k)
as w varies over a small
family. Large and persistent discrepancies are an operational proxy for
high local variation and should trigger either a reduction in window
length (to reduce drift bias) or a restart/change-point procedure (to
purge contaminated history). Importantly, governance is not just about
reacting to drift, but about verifying that uncertainty statements
remain calibrated. Since our learning rule relies on time-local
confidence bounds, one can audit coverage online: how often realized purchase
indicators fall within predicted acceptance intervals at chosen δk values.
Systematic under-coverage is a red flag for misspecification (e.g.,
violation of bounded noise, heavy tails, or unmodeled heterogeneity) and
should escalate to a fallback pricing mode.
Governance also includes pre-deployment validation and post-deployment checks that map to our modeling assumptions. Pre-deployment, one should validate that the index structure is approximately linear (or at least stable) over the intended operational range, because the claim that cross-context information is transferable hinges on vt being the right sufficient statistic for context. Post-deployment, we recommend maintaining a ``shadow’’ evaluation that periodically tests counterfactual prices on a small randomized holdout, not to maximize short-run revenue, but to estimate whether the current policy is drifting away from the per-round optimality conditions implied by (v + δ)Dt(δ). This sort of monitoring is particularly important when the platform faces strategic behavior or competitive responses, since such forces can manifest as abrupt changes in Dt even if θ is stable.
From a compliance and consumer-protection perspective, windowing and restarts have an additional advantage: they create a natural data-retention policy. Sliding windows limit the extent to which long-past behavior affects current pricing, which can be aligned with privacy norms and reduce the risk that historical artifacts or temporary shocks permanently distort prices. That said, reduced retention is not a substitute for fairness or non-discrimination constraints; rather, it should be combined with explicit constraints (e.g., bounding price dispersion across protected classes conditional on x) and audits for disparate impact. Our framework can accommodate such constraints at the action-selection stage (restricting feasible prices), but the resulting regret benchmarks and safety guarantees would need to be re-derived for the constrained comparator.
The main economic promise of the increment formulation is cross-context pooling: by learning
Dt(δ),
we reuse information across heterogeneous contexts because only δ = p − v matters
for acceptance. Pooling is justified to the extent that unobserved
heterogeneity enters valuations additively and independently of x after conditioning on the linear
index, so that ℙ(ot = 1 ∣ xt, pt) = Dt(pt − xt⊤θ).
Practically, this is a strong claim, and we should treat it as an
empirical hypothesis rather than a truth. A simple diagnostic is to
check whether acceptance residuals, after mapping prices into increments
using v̂t = xt⊤θ̂t,
appear invariant across different slices of the context space. If the
estimated demand curves differ systematically by context segment even
after increment normalization, then pooling will induce bias that no
choice of w can fix: the model
is misspecified.
When pooling fails, the policy implication is not necessarily to abandon the approach, but to refine the notion of transfer. One option is partial pooling: maintain separate increment-demand estimates Dt(m)(δ) for a small number of market regimes or customer segments (defined by coarse, policy-approved covariates), while keeping a shared θ̂ or a hierarchical prior across segments. Another is to relax linearity of vt via richer representations while preserving the one-dimensional increment structure; in that case, the key governance requirement is to ensure that the representation is stable enough over time to justify transferring demand information. Conversely, if drift is primarily in the mapping from context to valuation (i.e., θ itself changes), then the recommended architecture flips: demand pooling may remain valuable, but one must track θt with explicit nonstationary linear bandit methods, and our guarantees no longer directly apply.
Our analysis abstracts from inventory constraints, repeated-customer
dynamics, and strategic behavior, each of which can create feedback
loops that violate conditional independence or stationarity of θ. Nonetheless, the high-level
lesson is robust: safe and effective dynamic pricing under drift
requires (i) explicit uncertainty tracking, (ii) explicit forgetting
mechanisms, and (iii) careful justification for what is pooled across
contexts. By tying these design choices to interpretable objects—ε as a tolerance for valuation
approximation, w as a
responsiveness knob, and VT as a drift
scale—the model illuminates a concrete tradeoff that pricing teams can
govern rather than merely observe.
This appendix collects the technical components that are intentionally suppressed in the main text in order to keep the economic logic transparent. Our guiding principle is to make each step auditable: (i) where the linear index structure is used, (ii) where boundedness and Lipschitz regularity enter, and (iii) where the drift budget VT is the only nonstationary quantity that matters. To that end, we separate the proofs into modular lemmas (concentration, discretization, and drift control) and then assemble them into the regret bound via a bookkeeping argument that mirrors the algorithmic design.
We provide complete proofs for the structural decomposition, valuation approximation, windowed demand estimation, and the dynamic regret theorem. The proofs are written so that each claim can be traced to a small number of primitives: the binary observation model ot = 1{pt ≤ yt}, the index signal vt = xt⊤θ, and the increment-demand map δ ↦ Dt(δ). In particular, the regret analysis is organized into three parts: the valuation-learning layer, the windowed demand-learning layer, and the final assembly with the tuning of (ε, w). This modular structure also clarifies which assumptions are essential: bounded noise and Lipschitz demand are only invoked to translate valuation error and grid spacing into demand error; the adversarial contexts are handled entirely through self-normalized concentration for the index estimation step.
A technical bottleneck is that we estimate acceptance probabilities using only the last w ``eligible’’ plays of each increment. Because the environment drifts, the relevant target is Dt(δk) at the current time t, not an average over the window. We therefore decompose estimation error into (i) a sampling term and (ii) a drift-induced bias term. The key concentration statement is formulated for a generic windowed Bernoulli process with predictable sampling.
Let 𝒮t, wk
denote the set of times s ∈ {t − w + 1, …, t}
at which increment k is played
(i.e., ps = v̂s + δk
on non-probing rounds), and let Nt, wk := |𝒮t, wk|.
Define the windowed estimator
$$
\hat D_{t,w}(k)\;:=\;\frac{1}{N_{t,w}^k}\sum_{s\in\mathcal{S}_{t,w}^k}
o_s,
\qquad\text{(with the convention $\hat D_{t,w}(k)=\tfrac{1}{2}$ if $N_{t,w}^k=0$, as in the main text).}
$$
The appendix proves the following lemma (the version we use is uniform
over t and k via a union bound).
$$
\bigl|\hat D_{t,w}(k)-D_t(\delta_k)\bigr|
\;\le\;
\sqrt{\frac{2\log(2/\delta)}{N_{t,w}^k}}
\;+\;
L_\xi\,\varepsilon
\;+\;
V_{t,w}.
$$
The proof follows a standard pattern but with two nonstandard wrinkles. First, because 𝒮t, wk is history-dependent, we treat {os − 𝔼[os ∣ ℱs − 1]} as bounded martingale differences and apply Freedman (or Hoeffding) conditional on the sampling rule. Second, to compare 𝔼[os ∣ ℱs − 1] to Dt(δk) we telescope across time using the sup-norm variation budget; this is where Vt, w enters additively and cannot be improved without stronger smoothness assumptions in t.
We include a self-contained tuning guide that starts from an explicit
high-probability regret decomposition and then selects (ε, w) to balance
discretization, estimation variance, and drift bias. The tuning argument
is deliberately constructive: we keep track of how the grid size |K| = Θ(By/ε)
interacts with union bounds over increments, and we spell out the
constraints needed for the elimination procedure to remain well-posed
under windowing (e.g., minimal sample requirements for confidence
intervals to shrink). This yields choices of the form
ε ≍ T−1/3VT1/3, w ≍ T2/3VT−2/3,
up to logarithmic and problem-dependent factors, which recover the
stated dependence T2/3VT1/3
in the nonstationary component. We also discuss two practically relevant
variants: (i) unknown VT, handled by a small grid of candidate window lengths and a meta-selection rule based on empirical instability of demand estimates; and (ii) piecewise-stationary drift, where restarts (triggered by a change-point test)
dominate sliding windows and improve constants when drift is sparse.
Finally, we report supplementary simulations designed to stress-test the modeling choices rather than merely showcase headline regret. The additional experiments include: (a) smoothly drifting demand Dt (e.g., sinusoidal shifts in a logistic family) to visualize the window-length responsiveness tradeoff; (b) piecewise-stationary regimes with abrupt breaks to compare sliding windows versus explicit restarts; (c) adversarial context sequences that amplify the need for probing to stabilize the index estimate; and (d) controlled violations of assumptions (mild heavy tails, non-Lipschitz kinks, or context-dependent residual demand) to diagnose failure modes. Across these settings, we summarize three empirical regularities that align with the theory: shorter windows reduce systematic mispricing after a drift event, finer ε improves performance only until estimation noise dominates, and the probing mechanism primarily affects early-round stability (and then becomes dormant once θ̂ is sufficiently accurate). We view these results as complements to the regret bound: they illustrate how the theoretical knobs (ε, w) translate into operational behavior, and they make clear where the model is predictive versus where it is a convenient approximation.