
Contextual Dynamic Pricing under Demand Drift: Windowed VAPE and Dynamic Regret (T^{2/3} + T^{2/3}V_T^{1/3})


Table of Contents

  1. Introduction (≤900w): Motivation from 2026-era drift; why stationarity is brittle; summarize contributions and how they extend VAPE.
  2. Model and Benchmarks (≤900w): Formal timing, adversarial contexts, drifting demand D_t; define static vs dynamic regret and drift budgets; discuss economic interpretation of V_T.
  3. Structural Decomposition and Identification (≤800w): Re-derive π_t(x,p)=(x^⊤θ+δ)D_t(δ), valuation-estimation trick via randomized prices, and why demand must be learned over a wide increment range.
  4. Algorithm: Windowed/Restarted VAPE (≤900w): Pseudocode; windowed demand estimators per increment; time-local confidence intervals; optional restart via drift tests; computational complexity and parameters.
  5. Main Theorems (≤900w): Dynamic regret bound; stationary corollary; discussion of rates and dependence on d and V_T; comparison to naive methods and to stationary VAPE.
  6. Proof Roadmap (≤900w): Regret decomposition into valuation approximation, discretization, demand estimation, and drift-induced bias; where nonstationary tools enter.
  7. Extensions and Variants (≤900w): Unknown V_T (adaptive window selection), change-point detection variant, piecewise-stationary special case, and (optional) stochastic contexts.
  8. Empirical Illustration Plan (≤700w): Synthetic drift regimes; seasonality and sudden shocks; metrics; ablations (window vs restart).
  9. Discussion and Policy Implications (≤700w): Safe experimentation under drift; monitoring and governance; when pooling is justified.
  10. Appendices (rest): Full proofs; concentration lemmas for sliding windows; parameter tuning; additional experiments.

Content

1. Introduction (≤900w): Motivation from 2026-era drift; why stationarity is brittle; summarize contributions and how they extend VAPE.

Digital sellers in 2026 face a demand environment that is simultaneously data-rich and structurally unstable. Prices can be updated in milliseconds, yet the mapping from a posted price to a purchase event is routinely perturbed by forces outside the seller’s control: rapid shifts in consumer sentiment, platform-wide recommendation changes, advertising auctions that reallocate traffic composition, regulatory interventions that alter disclosure and consent, and even competing algorithms that learn and react. In such settings, the classical premise that “demand is stationary” is less a benign simplification than a brittle point of failure. When the distribution of willingness-to-pay drifts, an algorithm that optimizes against yesterday’s demand may be systematically wrong today, not because it is under-exploring, but because the object it is learning has moved.

This paper studies a particularly stark version of this problem: contextual dynamic pricing with only binary feedback. The seller observes rich covariates (the “context”), posts a price, and then learns only whether a purchase occurred. The economic attraction of this formulation is that it corresponds closely to practice (conversion events are far more reliably measured than valuations) while forcing the algorithm to respect informational constraints that matter for policy (privacy-preserving measurement, coarse reporting, and strategic data minimization). The technical difficulty is that binary feedback is low signal-to-noise, and nonstationarity compounds this difficulty by making old data potentially misleading rather than merely uninformative.

A large literature on contextual pricing and bandits delivers regret guarantees under stationarity: the demand curve (or the noise distribution around a parametric signal) is fixed over time. These guarantees can be impressive, but they encode a strong economic assumption: that once we control for observed covariates, the residual heterogeneity in valuations does not drift. In platform markets, that assumption is hard to defend. Moreover, the direction of misspecification is asymmetric. Stationary algorithms naturally pool information over long horizons to reduce variance; when demand drifts, this pooling creates bias, and bias is precisely what generates persistent revenue losses. Put differently, stationarity buys statistical efficiency by spending robustness.

Our starting point is the insight behind the VAPE framework: even with binary feedback, one can exploit structure in valuations to separate what is transferable across contexts from what is not. The context enters through a linear index (the “expected valuation”), while the residual uncertainty enters through an additive noise term that determines the purchase probability as a function of the price increment relative to that index. In the stationary model, this decomposition allows the seller to learn the linear index via occasional randomized probing and then learn the increment-demand relationship by pooling data across contexts. The algorithmic message is appealingly economic: we can treat the covariates as shifting the level of valuations, while the shape of residual demand can be learned once and reused.

The contribution of the present work is to show that this logic survives an important form of nonstationarity. We allow the demand curve in increment space to drift over time, subject to a total-variation budget. This is a natural model of gradual or intermittent changes in the distribution of unobserved willingness-to-pay after conditioning on covariates. Crucially, we do not allow the linear valuation parameter to drift; doing so would conflate two distinct phenomena (changes in how observables map into value versus changes in unobservables) and would require substantially different instrumentation. Economically, our assumption corresponds to stable preferences over observed product features and user attributes, but time-varying residual shocks (macroeconomic conditions, attention constraints, platform frictions, and measurement artifacts). This division is not merely convenient: it reflects how practitioners often build pricing systems, with a stable representation layer (features and value model) and a faster-moving calibration layer (conversion model).

Our main conceptual move is to replace global pooling of demand information with time-local pooling. If demand drifts, then the relevant comparison set is not the full history but a recent window (or, in piecewise-stationary regimes, the current segment). Doing so introduces the classic bias–variance tradeoff: shorter windows track changes but suffer higher estimation noise; longer windows reduce noise but average over outdated regimes. We formalize this tradeoff in a way that is compatible with VAPE’s cross-context learning: the seller still leverages many contexts to learn demand as a function of the increment, but only uses observations that are recent enough to be informative about the current demand curve. The resulting algorithmic family—windowed and restart-based variants of VAPE—can be interpreted as implementing an economically meaningful form of “recency weighting” that disciplines how the seller extrapolates from past consumers to present ones.

We summarize our technical contributions as follows. First, we formalize contextual pricing with binary feedback when the increment-space demand curve drifts subject to a total-variation budget V_T, with adversarial contexts and a dynamic-regret benchmark. Second, we design windowed and restart-based variants of VAPE that localize demand estimation in time while preserving cross-context pooling in increment space. Third, we establish a dynamic regret bound of order $\tilde O(d^{2/3}T^{2/3} + T^{2/3}V_T^{1/3})$, with a stationary corollary recovering VAPE when V_T = 0 and a restart variant for piecewise-stationary demand whose bound scales with the number of change points S.

Beyond the formal guarantees, the economic interpretation is straightforward. A seller who insists on a single, persistent estimate of demand implicitly assumes stable residual heterogeneity; when that assumption fails, the seller either must accept biased pricing or must discount history. Our algorithms provide a disciplined way to do the latter while retaining the gains from contextual structure. In practice, the framework suggests how a platform might separate long-run estimation of stable value components (feature effects) from short-run calibration of conversion probabilities (demand shocks), and how to tune the aggressiveness of adaptation based on measured instability.

We also acknowledge limitations. Our drift model constrains changes through a bounded-variation budget and requires a mild smoothness (Lipschitz) condition in the price increment; abrupt, adversarially oscillating demand would defeat any method that relies on recent averaging. We maintain a linear valuation structure and bounded noise, both of which may be violated in high-dimensional personalization or in markets with heavy-tailed shocks. Finally, our benchmark is per-round optimal posted pricing for myopic buyers; richer strategic interactions, inventory constraints, and fairness objectives are important but orthogonal extensions. Within these boundaries, the model illuminates the central tradeoff faced by modern pricing systems: exploiting structure to learn efficiently while retaining enough adaptivity to remain robust when the environment drifts.


2. Model and Benchmarks (≤900w): Formal timing, adversarial contexts, drifting demand D_t; define static vs dynamic regret and drift budgets; discuss economic interpretation of V_T.

We study a seller who repeatedly posts prices to a sequence of myopic buyers over rounds t = 1, …, T. The seller observes a vector of covariates (the “context”) xt ∈ ℝd before pricing, and then receives only binary feedback indicating whether a purchase occurred. Our emphasis is on two features that are central in modern digital pricing settings: (i) contexts can be highly non-random and may depend on the seller’s past behavior (e.g., due to platform allocation, targeting rules, or competing algorithms), and (ii) the mapping from prices to purchase probabilities may drift over time even after conditioning on the observed context.

Let ℋt−1 := {(xs, ps, os)}s=1,…,t−1 denote the public history. Each round proceeds as follows:

  1. The seller observes the context xt ∈ ℝd, which may be chosen adversarially as a function of ℋt−1.
  2. The seller posts a price pt ∈ [0, By].
  3. The buyer’s valuation is yt = xt⊤θ + ξt ∈ [0, By], and a purchase occurs if and only if pt ≤ yt.
  4. The seller observes only the binary outcome ot = 1{pt ≤ yt}.

We assume |ξt| ≤ Bξ almost surely and 𝔼[ξt ∣ ℋt − 1, xt] = 0, which formalizes bounded, centered residual heterogeneity. Conditional on (ℋt − 1, xt), the shock ξt is drawn from a (possibly time-varying) distribution; across time, we impose conditional independence given the past and contexts to avoid degenerate dependence patterns that would render binary feedback uninformative.

A convenient way to describe purchase behavior is through the price increment relative to the linear index:
δ := p − x⊤θ.
Define the (time-dependent) demand curve in increment space by
Dt(δ) := ℙ(ξt ≥ δ ∣ ℋt − 1, xt),
which takes values in [0, 1]. Under the threshold-purchase rule, the expected revenue at time t for context x and price p is
πt(x, p) := 𝔼[p 1{p ≤ yt} ∣ ℋt−1, x] = p ℙ(ξt ≥ p − x⊤θ ∣ ℋt−1, x) = p Dt(p − x⊤θ).
To obtain meaningful tracking guarantees, we impose mild regularity on how sharply purchase probabilities can change with the increment: each Dt is Lξ-Lipschitz in δ over the relevant range (e.g., δ ∈ [−By, By]), so that small errors in increments translate into controlled errors in demand. Economically, this excludes knife-edge point-mass behavior and captures the idea that idiosyncratic heterogeneity smooths demand.
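To make the structural formula concrete, here is a minimal numerical sketch (an illustration, not part of the formal model): we assume a residual ξ ~ Unif[−B_ξ, B_ξ], so D(δ) = ℙ(ξ ≥ δ) is explicit and Lipschitz, and evaluate π(x, p) = p D(p − x⊤θ).

```python
import numpy as np

B_XI = 1.0  # assumed bound on the residual shock xi

def demand_uniform(delta):
    """D(delta) = P(xi >= delta) for xi ~ Unif[-B_XI, B_XI];
    Lipschitz in delta with constant 1/(2*B_XI)."""
    return float(np.clip((B_XI - delta) / (2.0 * B_XI), 0.0, 1.0))

def expected_revenue(x, theta, p):
    """pi(x, p) = p * D(p - x^T theta): price times purchase probability."""
    delta = p - float(x @ theta)
    return p * demand_uniform(delta)
```

For example, with x⊤θ = 2 and p = 2 the increment is δ = 0, the purchase probability is 1/2, and the expected revenue is 1.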

The key departure from stationary models is that the curve δ ↦ Dt(δ) may change with t. We quantify the extent of nonstationarity via a total-variation (in sup norm) budget:
$$ V_T \;:=\; \sum_{t=2}^T \sup_{\delta\in[-B_y,B_y]} \big|D_t(\delta)-D_{t-1}(\delta)\big|. $$
This definition is intentionally distribution-free: it does not assume parametric evolution of ξt, nor does it tie drift to any particular covariate. Instead, it bounds the cumulative magnitude of changes in purchase probabilities across all possible increments. The economic interpretation is that VT measures how much the demand environment (after controlling for xtθ) shifts over the selling horizon. A small VT corresponds to gradual evolution (e.g., slowly changing consumer sentiment), while a larger VT allows intermittent disruptions (e.g., platform redesigns, policy changes, or competitor entry) provided they do not occur too frequently or too violently.
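The variation budget is easy to illustrate numerically. In this hypothetical sketch, each demand curve is evaluated on a common increment grid (an array of shape (T, n_grid), an assumption for illustration), and V_T sums the sup-norm changes between consecutive rounds:

```python
import numpy as np

def variation_budget(curves):
    """V_T = sum_t sup_delta |D_t(delta) - D_{t-1}(delta)|, with each curve
    evaluated on a shared grid of increments; `curves` has shape (T, n_grid)."""
    diffs = np.abs(np.diff(curves, axis=0))  # |D_t - D_{t-1}| pointwise on the grid
    return float(diffs.max(axis=1).sum())    # sup over the grid, summed over t
```

A stationary sequence gives V_T = 0, while a single multiplicative shock contributes exactly its sup-norm magnitude.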

It is useful to emphasize what is drifting: we keep θ fixed. This is a modeling choice that separates (i) stable valuation components tied to observed features (the representation layer) from (ii) time-varying shocks and frictions not captured by the covariates (the calibration layer). In applications, this corresponds to settings where feature engineering and the mapping from features to baseline value are updated infrequently relative to the short-run movements in conversion behavior. Allowing θ to drift is an important extension, but it changes the identification problem and would require different instrumentation and benchmarks.

Given (xt, Dt) and the unknown θ, the per-round optimal posted price is
pt*(xt) ∈ arg maxp ∈ [0, By] πt(xt, p) = arg maxp ∈ [0, By] p Dt(p − xt⊤θ).
We evaluate an online pricing policy {pt}t=1,…,T via its dynamic regret against this moving benchmark:
$$ R_T^{\mathrm{dyn}} \;:=\;\sum_{t=1}^T \Big(\pi_t(x_t,p_t^*)-\pi_t(x_t,p_t)\Big). $$
Dynamic regret is the appropriate objective when the environment changes: even an omniscient seller who knows θ and all past demand curves would generally not want to commit to a single time-invariant pricing rule. The requirement is therefore to track, rather than to converge.
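As a sketch of the moving benchmark, the contemporaneously optimal price can be approximated by grid search over [0, By] for any given level v = x⊤θ and (known-to-the-oracle) demand curve; the demand function used in the test below is an illustrative assumption.

```python
import numpy as np

def oracle_price(v, demand, B_y=4.0, n_grid=4001):
    """Per-round benchmark p* in argmax_{p in [0, B_y]} p * D(p - v),
    approximated by exhaustive search on a fine price grid."""
    prices = np.linspace(0.0, B_y, n_grid)
    revenues = prices * np.array([demand(p - v) for p in prices])
    return float(prices[int(np.argmax(revenues))])
```

For D(δ) = clip((1 − δ)/2, 0, 1) and v = 2, the revenue p·D(p − 2) = p(3 − p)/2 on the interior, which peaks at p = 1.5.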

For comparison, one can also define a benchmark that ignores drift, such as the best fixed pricing rule in hindsight,
$$ R_T^{\mathrm{sta}} \;:=\;\max_{p(\cdot)} \sum_{t=1}^T \pi_t(x_t,p(x_t)) \;-\; \sum_{t=1}^T \pi_t(x_t,p_t), $$
where the maximization ranges over measurable maps p : ℝd → [0, By]. In a drifting environment, RTsta can be small even when the seller performs poorly relative to pt*, simply because the best fixed rule is itself a weak comparator. Our focus on RTdyn reflects an economic standard: we ask how much revenue is lost relative to the contemporaneously optimal price that would be chosen if the seller could instantaneously adapt to the current demand curve.

The quantity VT governs a fundamental tradeoff between statistical efficiency and robustness. Pooling long histories reduces variance in demand estimation under binary feedback, but if Dt changes, the same pooling creates bias by averaging over outdated regimes. A bounded VT formalizes the idea that recent data are informative about the near future, and it permits guarantees that degrade gracefully with instability: when VT = 0 we recover a stationary environment, while larger VT forces any algorithm to discount history more aggressively, increasing uncertainty and hence regret. In this sense, VT plays the role of an economically interpretable “instability index” for residual demand, bridging algorithmic tuning (how much to forget) with market conditions (how quickly consumers and frictions evolve).


3. Structural Decomposition and Identification (≤800w): Re-derive π_t(x,p)=(x^⊤θ+δ)D_t(δ), valuation-estimation trick via randomized prices, and why demand must be learned over a wide increment range.

Our algorithmic design rests on a simple but powerful decomposition: once we condition on the linear index v := x⊤θ, all remaining price sensitivity can be represented as a one-dimensional function of the price increment δ := p − v. This isolates what is transferable across heterogeneous contexts from what must be tracked over time.

Fix any round t and any context–price pair (x, p). Writing p = v + δ with v = xθ, the purchase event {p ≤ yt} becomes
{v + δ ≤ v + ξt} = {ξt ≥ δ}.
Therefore the expected revenue admits the decomposition
πt(x, p) = (v + δ) Dt(δ) = (x⊤θ + δ) Dt(δ).
This decomposition is the key structural statement: the context x enters payoffs only through the scalar x⊤θ, while all nonstationarity enters only through the time-varying demand curve δ ↦ Dt(δ). Economically, we can interpret θ as a stable “representation layer” mapping features into a baseline willingness-to-pay, while Dt captures short-run fluctuations in conversion frictions and residual heterogeneity.

This decomposition is not merely notational. It implies that learning about Dt(δ) at a given increment δ is, in principle, useful across contexts, because it corresponds to the same event {ξt ≥ δ}. This cross-context pooling is what makes it possible to handle adversarial (and high-dimensional) contexts without having to learn a separate demand curve for each region of the covariate space.

A central obstacle is that we never observe yt; we only see the binary indicator ot = 1{pt ≤ yt}. Nevertheless, on specially designated rounds, we can convert this one bit into an unbiased proxy for the valuation.

Suppose on a probing round we draw the posted price independently as
pt ∼ Unif([0, By]),
and post this pt regardless of the current xt (we can later restrict probing to particular times, but the identification argument is per-round). Conditional on yt ∈ [0, By], we have
$$ \mathbb{E}[o_t\mid y_t] \;=\;\mathbb{P}(p_t\le y_t\mid y_t) \;=\;\frac{y_t}{B_y}. $$
Thus the transformed observation
zt := By ot
satisfies 𝔼[zt ∣ yt] = yt. Taking another conditional expectation and using 𝔼[ξt ∣ ℋt − 1, xt] = 0 yields
𝔼[zt ∣ ℋt − 1, xt] = 𝔼[yt ∣ ℋt − 1, xt] = xtθ.
In other words, probing produces an unbiased regression label for the linear index xtθ even though we observe only purchases. This remains valid under time-varying ξt distributions because the argument does not require stationarity of Dt; it uses only (i) independent randomization of pt on probing rounds, (ii) bounded support so that yt ∈ [0, By], and (iii) mean-zero residuals.
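The unbiasedness identity 𝔼[By ot ∣ yt] = yt can be checked by simulation. The following sketch (with assumed values By = 4 and y = 2.5) draws uniform probe prices and averages the transformed binary outcomes:

```python
import numpy as np

def probe_mean(y, B_y=4.0, n=200_000, seed=0):
    """Monte Carlo estimate of E[B_y * 1{p <= y} | y] with p ~ Unif[0, B_y];
    the probing argument predicts this equals y itself."""
    rng = np.random.default_rng(seed)
    p = rng.uniform(0.0, B_y, size=n)  # randomized probe prices
    z = B_y * (p <= y)                 # transformed one-bit feedback
    return float(z.mean())
```

With y = 2.5, the empirical mean concentrates around 2.5, matching the claim that one bit per probe round suffices for an unbiased valuation label.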

From a practical standpoint, randomized probing is an “instrumentation” device: by injecting controlled random prices occasionally, a platform can recover a signal about willingness-to-pay that would otherwise be censored by strategic pricing. The economic cost is immediate revenue loss on probe rounds (we sometimes post prices far from optimal), but the benefit is that we can estimate θ under worst-case context sequences using standard self-normalized concentration tools for ridge regression. This is precisely the tradeoff a platform faces when deciding how much experimentation to permit in production systems.

Once we have an estimator θ̂t (built from past probing rounds), we can form a valuation prediction v̂t := xt⊤θ̂t and restrict our pricing decisions to the form
pt = v̂t + δ,
where δ is chosen from a discrete set of candidate increments. If v̂t is accurate, then the realized increment relative to the true vt = xt⊤θ is close to δ, and the purchase probability is close to Dt(δ) by Lipschitzness. This reduces the problem to learning, for each increment δ, the time-varying conversion probability and the associated revenue (vt + δ)Dt(δ).

A limitation is worth highlighting: the above separation relies on the assumption that, conditional on xt, the distribution of ξt does not depend on xt in any additional way (beyond the mean shift already captured by xtθ). If residual dispersion or skewness systematically varies with covariates, then there is no single Dt shared across contexts, and the cross-context pooling logic weakens. Our approach is therefore best viewed as a model of settings where features absorb most systematic heterogeneity, and the remaining residual is ``approximately common’’ across segments.

Even with θ known, the per-round objective is
maxp ∈ [0, By]pDt(p − xtθ) = maxδ: xtθ + δ ∈ [0, By] (xtθ + δ) Dt(δ).
The feasible set for δ depends on vt = xtθ through the constraint vt + δ ∈ [0, By], i.e.,
δ ∈ [−vt, By − vt].
Since vt can vary substantially across contexts—and we explicitly allow xt to be adversarial—the union of these feasible increment intervals over time can span essentially the full range [−By, By]. Consequently, an algorithm that only learns Dt(δ) locally around a single neighborhood of increments is brittle: an adversary can present contexts with very different vt, making previously learned parts of the curve irrelevant.

This observation motivates our global increment grid. By maintaining estimates for Dt(δ) on a discretization of [−By, By], we ensure that for every context we can approximately evaluate revenues of candidate prices. The Lipschitz property provides the approximation guarantee: if δ lies within ε of a grid point δk, then |Dt(δ) − Dt(δk)| ≤ Lξε, so the discretization error can be controlled uniformly. The economic interpretation is that, to robustly personalize prices across heterogeneous customers, the platform must understand conversion behavior not only at a single “typical” markup, but across a spectrum of markups and discounts relative to predicted value.
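A minimal sketch of the global increment grid and its context-dependent feasible subset (function names are illustrative assumptions):

```python
import numpy as np

def increment_grid(B_y, eps):
    """Grid points delta_k = k * eps covering [-B_y, B_y]."""
    k_max = int(np.floor(B_y / eps))
    return np.arange(-k_max, k_max + 1) * eps

def feasible_increments(grid, v, B_y):
    """Feasible increments for a context with level v = x^T theta:
    those delta_k with v + delta_k in [0, B_y]."""
    return grid[(v + grid >= 0.0) & (v + grid <= B_y)]
```

Different contexts select different slices of the same grid, which is exactly why the curve must be learned over the full range [−By, By].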

Taken together, the structural decomposition, the probing-based identification of x⊤θ, and the need for global increment coverage lead naturally to a two-timescale learning strategy: we estimate the stable representation θ using occasional randomized probes, and we track the drifting calibration object Dt using time-local aggregation in increment space. The next section operationalizes this logic in the Windowed/Restarted VAPE algorithm.


4. Algorithm: Windowed/Restarted VAPE (≤900w): Pseudocode; windowed demand estimators per increment; time-local confidence intervals; optional restart via drift tests; computational complexity and parameters.

We now operationalize the two-timescale logic described above. The stable object is the linear index vt = xt⊤θ, which we learn through occasional randomized probing rounds. The drifting object is the increment-demand map δ ↦ Dt(δ), which we track by pooling data across contexts in increment space but discounting stale observations via a sliding window (or, alternatively, explicit restarts).

Fix a grid resolution ε > 0 and define the increment index set
𝒦 := {k ∈ ℤ: δk := kε ∈ [−By, By]}.
On a non-probing round t, after forming a valuation prediction v̂t = xt⊤θ̂t, we restrict attention to prices of the form
pt(k) := v̂t + δk,   k ∈ 𝒦t := {k ∈ 𝒦 : pt(k) ∈ [0, By]}.
This construction ensures that, across heterogeneous contexts, we evaluate a common set of markups/discounts relative to the predicted baseline. When v̂t is ε-accurate, the realized increment relative to the true vt differs from δk by at most ε, so Lipschitzness of Dt converts valuation error into controlled demand error of order Lξε.

For each increment k, we maintain a sliding-window dataset of the most recent w rounds in which increment k was played (eligibility simply means we are not probing and the valuation predictor is certified accurate; we formalize this certification below). Let κs ∈ 𝒦 denote the increment index actually used at round s (when not probing), and define windowed counts
$$ N_{t,w}^k\;:=\;\sum_{s=t-w}^{t-1}\mathbf{1}\{\kappa_s=k\}\mathbf{1}\{\text{$s$ eligible}\}. $$
The natural estimator of Dt(δk) is the windowed purchase frequency
$$ \hat D_{t,w}(k)\;:=\; \begin{cases} \frac{1}{N_{t,w}^k}\sum_{s=t-w}^{t-1} o_s\,\mathbf{1}\{\kappa_s=k\}\mathbf{1}\{\text{$s$ eligible}\}, & N_{t,w}^k\ge 1,\\[4pt] \frac{1}{2}, & N_{t,w}^k=0, \end{cases} $$
where the default value when Nt, wk = 0 can be arbitrary in [0, 1] (it is immediately dominated by wide confidence bounds). The central reason this estimator is meaningful under adversarial contexts is that conditioning on the increment k pins down the event {ξs ≥ δ} up to the valuation approximation error, so samples from different contexts remain comparable.

We convert D̂t,w(k) into a time-local confidence interval using standard bounded-difference concentration on windowed averages. A convenient form is
$$ \mathrm{rad}_{t,w}(k)\;:=\; \sqrt{\frac{c_0\log T}{\max\{1,N_{t,w}^k\}}}, \qquad \mathrm{UCB}^D_{t,w}(k):=\min\{1,\hat D_{t,w}(k)+\mathrm{rad}_{t,w}(k)\}, \quad \mathrm{LCB}^D_{t,w}(k):=\max\{0,\hat D_{t,w}(k)-\mathrm{rad}_{t,w}(k)\}, $$
for a constant c0 chosen to deliver an overall failure probability at most Tc after a union bound over t and k.
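The windowed counts, frequency estimator, and confidence radius above can be sketched with per-increment queues; the constant c0 and the eligibility bookkeeping are simplified assumptions here.

```python
import math
from collections import deque

class WindowedDemand:
    """Sliding-window purchase-frequency tracker over an increment grid."""

    def __init__(self, w, T, c0=2.0):
        self.w, self.T, self.c0 = w, T, c0
        self.obs = {}  # increment index k -> deque of (round, outcome)

    def update(self, k, t, outcome):
        self.obs.setdefault(k, deque()).append((t, outcome))

    def estimate(self, k, t):
        """Return (D_hat, UCB, LCB) using only the last w rounds of plays of k."""
        q = self.obs.get(k, deque())
        while q and q[0][0] < t - self.w:  # discard data older than w rounds
            q.popleft()
        n = len(q)
        d_hat = sum(o for _, o in q) / n if n else 0.5  # default when empty
        rad = math.sqrt(self.c0 * math.log(self.T) / max(1, n))
        return d_hat, min(1.0, d_hat + rad), max(0.0, d_hat - rad)
```

Storing only queues (or counts plus timestamps) keeps memory linear in the grid size, matching the complexity discussion below.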

On a non-probing round, we rank increments by an upper confidence bound on revenue. Because the instantaneous revenue at increment δ is (vt + δ)Dt(δ) and we only know v̂t and D̂t,w(k), we use a robust proxy that accounts for (i) statistical uncertainty in D̂t,w(k), (ii) the fact that the true increment differs from δk by at most ε when |v̂t − vt| ≤ ε, and (iii) discretization error from optimizing over a grid. One convenient choice is
$$ \mathrm{UCB}_{t,w}(k)\;:=\;(\hat v_t+\delta_k+\varepsilon)\,\mathrm{UCB}^D_{t,w}(k)\;+\;B_yL_\xi\varepsilon, $$
where the ε terms crudely propagate index uncertainty into the multiplicative price factor, and the ByLξε term controls the demand change induced by the increment mismatch. We then choose
kt ∈ arg maxk ∈ 𝒦t UCBt,w(k),   pt = v̂t + δkt,
observe ot ∈ {0, 1}, and update only the statistics for kt (dropping data older than w rounds).

The algorithm above leaves open how we ensure that non-probing rounds indeed satisfy |v̂t − vt| ≤ ε. We use the same self-normalized control as in stationary VAPE: probe precisely when the design matrix Vt−1 suggests high uncertainty in direction xt, i.e., when $\|x_t\|_{V_{t-1}^{-1}} := (x_t^\top V_{t-1}^{-1} x_t)^{1/2} > \mu$. Standard ridge concentration then yields an event on which, for all non-probing rounds, the index error is uniformly bounded by a target tolerance that we identify with ε (after absorbing constants and logarithms into the choice of μ and λ). This mechanism is robust to adversarial contexts because the probing decision depends only on xt and past probes, not on any stationarity of ξt.
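The uncertainty-triggered probing rule can be sketched as follows, with λ and μ as assumed tuning constants and a Sherman–Morrison rank-one update of the inverse design matrix:

```python
import numpy as np

def should_probe(x, V_inv, mu):
    """Probe iff the self-normalized norm ||x||_{V^{-1}} exceeds mu."""
    return float(np.sqrt(x @ V_inv @ x)) > mu

def add_probe(V_inv, x):
    """Sherman-Morrison update of V^{-1} after V <- V + x x^T on a probe round."""
    Vx = V_inv @ x
    return V_inv - np.outer(Vx, Vx) / (1.0 + float(x @ Vx))
```

Starting from V = λI, a direction that triggers a probe stops triggering once it has been added to the design, which is the elliptical-potential mechanism that keeps the number of probes sublinear.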

Windowing is the most direct way to respect the variation budget: old data are down-weighted by being discarded. When drift is abrupt (piecewise stationary), explicit restarts can be preferable. A simple implementation is to run, for each k with enough samples, a two-sample test comparing purchase frequencies in the most recent w/2 eligible plays of k to the preceding w/2; if the absolute difference exceeds a threshold of order $\sqrt{\log T/n}$, we declare a change. In response, we reset the demand statistics (and optionally restart the probing/estimation module, although θ is stable in our model and need not be reset). This restart view connects our method to change-point bandits while preserving the key cross-context pooling in increment space.
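The two-sample restart test described above admits a direct sketch (the constant c1 is an assumed tuning parameter):

```python
import math

def drift_detected(outcomes, T, c1=1.0):
    """Compare purchase frequencies in the two most recent half-windows of
    one increment's eligible plays; flag a change when the gap exceeds
    c1 * sqrt(log T / n), with n the half-window sample size."""
    n = len(outcomes) // 2
    if n == 0:
        return False
    recent = sum(outcomes[-n:]) / n
    older = sum(outcomes[-2 * n:-n]) / n
    return abs(recent - older) > c1 * math.sqrt(math.log(T) / n)
```

A sharp regime shift trips the test while a stationary alternating stream does not, which is the behavior the restart variant relies on.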

Let |𝒦| ≍ By/ε be the number of increments. A literal implementation computes UCBt, w(k) for all feasible k each non-probing round, costing O(|𝒦|) time per round; the ridge update on probe rounds costs O(d2) per probe (or O(d) with recursive least squares updates and maintained inverse). Storage for demand tracking is O(|𝒦|w) if we keep full queues, but can be reduced to O(|𝒦|) by storing only windowed counts and sums together with timestamps (or by using exponential forgetting). In typical regimes where ε is chosen as a negative power of T, |𝒦| is polynomial in T; our regret bounds below remain meaningful precisely in the range where this discretization is statistically warranted.

The key design parameters are the grid spacing ε (bias–complexity tradeoff), the window length w (variance–drift tradeoff), and the probing threshold μ (exploration cost for learning θ). We treat these as functions of (T, d, VT) chosen to balance the corresponding regret contributions; the next section states the resulting rates and shows how the drift budget VT enters only through the choice of time localization (windowing or restarts), while the dimension d enters only through the probing needed to stabilize v̂t.


5. Main Theorems (≤900w): Dynamic regret bound; stationary corollary; discussion of rates and dependence on d and V_T; comparison to naive methods and to stationary VAPE.

Our performance criterion is dynamic regret against the per-round optimal posted price,
$$ R_T^{\mathrm{dyn}} :=\sum_{t=1}^T\Big(\pi_t(x_t,p_t^*)-\pi_t(x_t,p_t)\Big), \qquad p_t^*\in\arg\max_{p\in[0,B_y]} p\,D_t\!\big(p-x_t^\top\theta\big), $$
which is the natural benchmark when demand conditions drift but the seller is evaluated relative to what could be achieved with contemporaneous knowledge of Dt.

The following theorem states that Windowed/Restarted VAPE attains sublinear dynamic regret while allowing (i) adversarially chosen contexts, (ii) only binary purchase feedback, and (iii) time-varying demand curves subject to the total-variation budget VT. With the window w and grid resolution ε tuned as functions of (T, d, VT), the dynamic regret satisfies, up to logarithmic factors,
$$ R_T^{\mathrm{dyn}} \;\le\; \tilde O\big(d^{2/3}T^{2/3} \;+\; T^{2/3}V_T^{1/3}\big). $$

The bound makes the algorithmic tradeoffs transparent. Larger windows reduce sampling noise but increase sensitivity to drift; finer grids reduce discretization but require more data per increment and amplify the drift penalty (because drift must be controlled uniformly over a larger effective action set).

Optimizing the right-hand side in (w, ε) yields the stated headline rate.

Two limiting cases are worth highlighting. When drift is mild (VT small), the second term is negligible and the rate is essentially the stationary benchmark driven by learning θ under adversarial contexts. When drift is substantial, the term T2/3VT1/3 dominates and reflects the unavoidable price of tracking a time-varying demand curve with only censored (binary) feedback.

Setting VT = 0 (i.e., Dt ≡ D for all t) collapses the nonstationary component and recovers the stationary performance of VAPE up to log factors.

This corollary clarifies the conceptual separation embedded in the model: the linear index xtθ is stable and is learned at the same cost as in stationary VAPE, whereas all time variation is confined to the one-dimensional demand object δ ↦ Dt(δ).

The $d^{2/3}T^{2/3}$ term stems from the fact that contexts may be adversarial and we only see binary feedback. Probing ensures sufficient excitation to certify that xt⊤θ̂t is uniformly accurate on non-probing rounds; the elliptical-potential argument then converts this certification into a sublinear exploration cost. Notably, demand drift does not directly interfere with this step: the probing analysis relies on bounded martingale differences and the symmetry/centering used to form unbiased pseudo-labels, not on stationarity of Dt.

The dependence on VT enters only through demand tracking in increment space. The scaling $T^{2/3}V_T^{1/3}$ is the characteristic balance one obtains when a learner must choose a time-localization scale (here a window) to trade off variance ($\propto w^{-1/2}$) against nonstationary bias ($\propto wV_T/T$ per round when drift is spread over the horizon), with an additional ε-layer because we learn demand on an increment grid rather than on a continuum.
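A back-of-envelope balance, under the simplifying assumption that the drift budget is spread evenly over the horizon (so a window of length w absorbs per-round bias of order wV_T/T), recovers the stated scaling:

```latex
% Per-round error at window length w (drift spread evenly over T rounds):
%   statistical noise ~ w^{-1/2},   drift bias ~ w V_T / T.
% Balancing the two terms:
%   w^{-1/2} = w V_T / T  =>  w^\star = (T / V_T)^{2/3},
% so the per-round error is (V_T/T)^{1/3}, and over T rounds:
\[
  w^\star = \Big(\tfrac{T}{V_T}\Big)^{2/3},
  \qquad
  T \cdot \big(w^\star\big)^{-1/2} \;=\; T^{2/3}\,V_T^{1/3}.
\]
```

This is a heuristic calibration, not a proof; the formal argument must also track the grid resolution ε and the probing cost in d.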

A direct attempt to run a standard contextual bandit or contextual pricing method in our setting faces two obstacles. First, with adversarial contexts, one cannot pool across heterogeneous xt without a structural reduction: the same posted price p corresponds to different effective markups δ = p − xt⊤θ, so purchase outcomes at price p are not comparable across contexts. Second, in a drifting environment, using all historical data (as a stationary algorithm would) induces a systematic bias that can accumulate over time; in the worst case, regret scales polynomially with T because outdated demand estimates persistently mis-rank prices.

Our approach avoids both issues by (i) translating every action into an increment δ relative to a learned baseline xtθ, so that demand data become comparable across contexts, and (ii) localizing demand estimation in time via windows or restarts, so that the effect of drift is paid only through VT rather than through T itself. Relative to stationary VAPE, the only new phenomenon is the drift-induced bias term, and the analysis shows precisely how to tune time localization to keep this term sublinear.

When demand changes are abrupt rather than gradual, restarts can replace windows. If Dt is piecewise stationary with at most S change points, a restart-based variant (with a standard two-sample test on increment-level purchase frequencies) yields a bound of the form
$$ R_T^{\mathrm{dyn}} \;\le\; \tilde O\big(d^{2/3}T^{2/3}\big) \;+\; \tilde O\big(T^{2/3}S^{1/3}\big), $$
up to detection-delay and false-alarm costs. This aligns with the economic intuition that a small number of regime shifts should be cheaper to track than continuous drift with the same horizon, provided shifts can be detected quickly from binary feedback.


6. Proof Roadmap (≤900w): Regret decomposition into valuation approximation, discretization, demand estimation, and drift-induced bias; where nonstationary tools enter.

We prove the main theorem by decomposing regret into four components—(i) valuation approximation (learning the stable linear index xt⊤θ), (ii) discretization in increment space, (iii) statistical error in estimating demand from binary outcomes, and (iv) a drift-induced bias term that is specific to the nonstationary environment. The main technical point is that only the last component truly requires nonstationary tools; the valuation-learning step is essentially unchanged from stationary VAPE because θ is time-invariant and the probing construction does not assume a fixed Dt.

Using the structural identity
πt(x, p) = pDt (p − xθ)  and  p = xθ + δ,
we rewrite expected revenue as πt(x, xθ + δ) = (xθ + δ)Dt(δ). This isolates all time variation in the one-dimensional object δ ↦ Dt(δ), while all context dependence enters through the level vt := xtθ. The algorithm therefore aims to (a) maintain an accurate estimate v̂t := xtθ̂t and (b) run a time-local bandit-style procedure over increments δ using windowed data. The proof mirrors this separation: we bound the regret from errors in v̂t, then treat the remaining problem as choosing (approximately) optimal increments under drifting demand.
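To make the separation concrete, here is a minimal Python sketch, with a hypothetical logistic curve standing in for Dt, showing that revenue factors as (v + δ)Dt(δ): two contexts with different levels v but the same increment δ face the identical acceptance probability, which is what licenses pooling demand data across contexts.

```python
import math

def demand(delta, mu=0.0, sigma=1.0):
    # Hypothetical logistic demand curve: D(delta) = P(xi >= delta).
    return 1.0 / (1.0 + math.exp((delta - mu) / sigma))

def revenue(v, delta, mu=0.0, sigma=1.0):
    # Expected revenue pi(x, v + delta) = (v + delta) * D(delta):
    # the context enters only through the level v, time only through D.
    return (v + delta) * demand(delta, mu, sigma)

v1, v2, delta = 2.0, 5.0, 0.3
p1, p2 = v1 + delta, v2 + delta            # different posted prices...
assert demand(p1 - v1) == demand(p2 - v2)  # ...identical acceptance odds
```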

Fix a round t and let δt* denote the optimal increment given the true level vt:
$$ \delta_t^* \in \arg\max_{\delta:\; v_t+\delta\in[0,B_y]} \;(v_t+\delta)\,D_t(\delta). $$
The algorithm instead chooses δ̂t based on v̂t and windowed estimates of Dt on a grid. We insert an intermediate ``oracle-on-the-grid'' benchmark that knows Dt but is constrained to the discretized increment set {δk = kε} and uses vt rather than v̂t. Concretely, we decompose per-round regret into
$$ \pi_t(x_t,p_t^*)-\pi_t(x_t,p_t) \;\le\; \underbrace{\big[\pi_t(v_t,\delta_t^*)-\pi_t(v_t,\delta_{k_t^\star})\big]}_{\text{discretization}} +\underbrace{\big[\pi_t(v_t,\delta_{k_t^\star})-\pi_t(\hat v_t,\delta_{k_t^\star})\big]}_{\text{valuation approximation}} +\underbrace{\big[\pi_t(\hat v_t,\delta_{k_t^\star})-\pi_t(\hat v_t,\hat\delta_t)\big]}_{\text{demand learning on a grid}}, $$
where $\delta_{k_t^\star}$ is the best grid increment for the true level vt (and the appropriate feasibility constraint). The first term is controlled by Lipschitzness and boundedness, yielding the O(ε) loss per round that aggregates to Õ(Tε). The second term is controlled by the uniform approximation guarantee |v̂t − vt| ≤ ε on non-probing rounds (and a direct bound on probing rounds). The third term is where we use windowed confidence bounds and where drift enters.

We create pseudo-labels using randomized probing prices whose distribution is symmetric and independent of ξt, so that the corresponding regression noise remains a bounded martingale difference. The key lemma shows that, after a controlled number of probe rounds, ridge regression yields a high-probability uniform guarantee of the form
|xt(θ̂t − θ)| ≤ ε  on all non-probing rounds.
The proof uses standard self-normalized concentration for linear martingales plus the elliptical potential argument to bound the number of rounds on which $\|x_t\|_{V_t^{-1}}$ is large (and thus additional probing is needed). Importantly, this step does not require stationarity of Dt: the regression analysis conditions on the past and uses only boundedness and centering/symmetry to ensure unbiasedness of the constructed labels. This yields the Õ(d2/3T2/3) contribution to regret, matching stationary VAPE up to logs and constants.

For each increment δk, we pool outcomes across contexts by translating posted prices into increments using v̂s at the time of play. Over the last w eligible rounds, we form the empirical purchase frequency D̂t, w(k) and then a reward estimate
π̂t, w(k) := (v̂t + δk)D̂t, w(k),
together with time-local upper and lower confidence bounds. The analysis of D̂t, w(k) splits into three errors:
$$ \hat D_{t,w}(k) - D_t(\delta_k) = \underbrace{\big(\hat D_{t,w}(k)-\mathbb{E}[\hat D_{t,w}(k)\mid\mathcal{F}]\big)}_{\text{sampling noise}} +\underbrace{\big(\mathbb{E}[\hat D_{t,w}(k)\mid\mathcal{F}]-\tilde D_t(\delta_k)\big)}_{\text{valuation approximation / discretization bias}} +\underbrace{\big(\tilde D_t(\delta_k)-D_t(\delta_k)\big)}_{\text{drift bias}}, $$
where D̃t denotes the effective average demand curve over the window induced by using past data. The first term is handled by Hoeffding/Freedman inequalities for bounded Bernoulli variables (uniformly over k and t via a union bound). The second term is controlled via Lipschitzness of Ds(⋅) and the fact that using v̂s instead of vs perturbs the realized increment by at most ε, contributing an O(Lξε) bias. The third term is the genuinely nonstationary term: comparing Ds(δ) to Dt(δ) within the window yields
$$ \sup_\delta |D_s(\delta)-D_t(\delta)| \;\le\; \sum_{u=s+1}^t \sup_\delta |D_u(\delta)-D_{u-1}(\delta)|, $$
so the window-average mismatch is bounded by the local variation budget $V_{t,w}:=\sum_{u=t-w+1}^t \sup_\delta|D_u(\delta)-D_{u-1}(\delta)|$, and summing over t leads to the global term O(wVT) after careful counting.
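The local variation budget Vt, w is straightforward to compute once demand curves are represented on a grid. A small Python sketch, where `curves` is a hypothetical discretized representation of u ↦ Du(⋅), makes the telescoping bound concrete:

```python
def local_variation(curves, t, w):
    # Local variation budget V_{t,w}: sum of sup-norm one-step changes
    # inside the window {t-w+1, ..., t}; `curves[u]` holds D_u on a grid.
    lo = max(1, t - w + 1)
    return sum(
        max(abs(a - b) for a, b in zip(curves[u], curves[u - 1]))
        for u in range(lo, t + 1)
    )

# A stationary stretch contributes zero; a single jump contributes its size.
grid_curves = [[0.75, 0.5, 0.25]] * 5 + [[0.5, 0.25, 0.0]] * 5
assert local_variation(grid_curves, t=4, w=5) == 0.0
assert local_variation(grid_curves, t=9, w=10) == 0.25
```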

Given time-local UCB/LCB intervals for πt(v̂t, δk), we analyze the increment-selection rule (optimism or elimination) exactly as in stationary VAPE, except that the confidence radius now includes both a variance term scaling like $\sqrt{1/N_{t,w}^k}$ and an additive bias term scaling like Vt, w. The usual argument shows that, on rounds when a suboptimal increment is chosen, either (i) its confidence interval still overlaps the optimal one (which can happen only a limited number of times before enough samples accrue in the window), or (ii) the choice is explained by drift contamination (charged to Vt, w). Summing these contributions yields a demand-learning term $\tilde O(T\sqrt{1/w})$ and a drift term Õ(wVT/ε) (the factor 1/ε arises because we control regret uniformly over a grid, so we allocate variation budgets across increments at resolution ε).

Finally, we add: probing/valuation-learning regret, discretization loss, windowed statistical error, and drift bias. This produces the bound in the main theorem. Optimizing in (w, ε) then delivers the stated Õ(d2/3T2/3 + T2/3VT1/3) rate, making explicit the economic tradeoff: time localization is valuable only to the extent demand drifts, and its cost is the increased variance from discarding older data.
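The final optimization over (w, ε) can be illustrated numerically. The Python sketch below uses a hypothetical proxy for the bound's components (constants and logarithms dropped) and brute-forces a coarse grid; the function names and grids are illustrative, not part of the algorithm:

```python
def regret_terms(T, V, d, w, eps):
    # Proxy for the bound's components (constants and logs dropped):
    # valuation learning + discretization + windowed estimation + drift bias.
    return d ** (2 / 3) * T ** (2 / 3) + T * eps + T / w ** 0.5 + w * V / eps

def tune(T, V, d):
    # Brute-force the (w, eps) tradeoff over a coarse illustrative grid.
    best = None
    for w in range(1, T + 1):
        for k in range(1, 40):
            eps = k / 40
            r = regret_terms(T, V, d, w, eps)
            if best is None or r < best[0]:
                best = (r, w, eps)
    return best

# More drift should push the tuned window down (forget faster).
_, w_low_drift, _ = tune(T=2000, V=0.5, d=5)
_, w_high_drift, _ = tune(T=2000, V=8.0, d=5)
assert w_high_drift <= w_low_drift
```

The comparative static visible in the last assertion is exactly the economic tradeoff in the text: a larger drift budget makes time localization more valuable.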


7. Extensions and Variants (≤900w): Unknown V_T (adaptive window selection), change-point detection variant, piecewise-stationary special case, and (optional) stochastic contexts.

Our main guarantee is stated in terms of the (a priori unknown) variation budget VT, and the prescribed tuning of the window length w (and grid spacing ε) implicitly trades off statistical precision against nonstationary bias. In practice, the platform rarely has a credible numerical upper bound on VT, and even if it did, the relevant drift may be highly nonuniform over time. A natural extension is therefore to make time localization adaptive.

One simple approach is to maintain a finite set of candidate window lengths 𝒲 = {w1, …, wM} (e.g., a dyadic grid from 1 up to T) and run a separate instance of the increment-selection procedure for each w ∈ 𝒲, each instance producing a recommendation δ̂t(w) and an estimated (upper) value for playing it. We then combine these instances using a bandit-over-bandits meta-algorithm (e.g., an exponential-weights scheme with importance weighting) that selects which window to follow each round. Because realized revenue is observed regardless of which window is used, the meta-layer can treat each base instance as an ``expert'' whose advice is a price $p_t^{(w)}=\hat v_t+\hat\delta_t^{(w)}$. Under standard stability conditions (bounded losses/rewards and controlled importance weights), this yields an adaptive guarantee of the form
$$ R_T^{\mathrm{dyn}} \;\lesssim\; \min_{w\in\mathcal{W}} \Big\{\tilde O\!\big(d^{2/3}T^{2/3}\big) +\tilde O\!\big(T\sqrt{1/w}\big) +\tilde O\!\big(wV_T/\varepsilon\big) +\tilde O(T\varepsilon)\Big\} \;+\;\tilde O\!\big(\sqrt{T\log M}\big), $$
so we pay only a modest overhead for not knowing the right time scale. Economically, this meta-selection corresponds to letting the platform endogenously decide how quickly it ``forgets'' old demand information: in calm periods it will put weight on longer windows (low variance), while around shocks it will switch toward shorter windows (fast tracking). A limitation is computational: running many windows in parallel multiplies the bookkeeping for {Nt, wk} and the confidence bounds. In implementations, we can share sufficient statistics across nested windows or restrict to a small dyadic family.
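As an illustration of the meta-layer, here is a deterministic exponential-weights aggregator in Python under the simplifying assumption that each base learner's counterfactual revenue is observable every round (full-information feedback, which is stronger than the importance-weighted bandit feedback described above):

```python
import math

def hedge(expert_rewards, eta=1.0):
    # Full-information exponential weights over base learners (a sketch).
    # expert_rewards[t][i] is the revenue expert i's advice would have earned
    # at round t; observing all counterfactual revenues is a simplification.
    n = len(expert_rewards[0])
    log_w = [0.0] * n
    for round_rewards in expert_rewards:
        for i in range(n):
            log_w[i] += eta * round_rewards[i]
    m = max(log_w)  # subtract the max for numerical stability
    total = sum(math.exp(lw - m) for lw in log_w)
    return [math.exp(lw - m) / total for lw in log_w]

# Three candidate windows; the middle one earns the most on average, so the
# meta-layer's weight concentrates on it.
probs = hedge([[0.4, 0.7, 0.5]] * 50)
assert probs[1] == max(probs)
```

In calm periods the reward stream would favor long-window experts and in turbulent periods short-window experts, which is the "endogenous forgetting" described above.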

A similar issue arises for discretization. The grid spacing ε governs both approximation error and how finely we must control drift across increments. When Lξ is large (steep demand), small ε is valuable; when drift is severe, coarse discretization can be preferable because it reduces the effective number of increment ``arms'' that must be tracked. One can again deploy a multiscale scheme with $\varepsilon\in\mathcal{E}$, each instance maintaining its own increment grid $\{\delta_k\}$, and aggregate using a meta-algorithm. Alternatively, one may refine the grid online: start coarse, and split a bin only when its confidence interval is small relative to its width (a one-dimensional ``zooming'' heuristic in increment space). While a full analysis would be lengthy, the key point is that the increment space is one-dimensional and bounded, so adaptive discretization can be done without compromising the adversarial-context robustness of the valuation-learning layer.

Windowing is a soft form of discounting; a complementary strategy is hard resets when evidence suggests that Dt has changed materially. This is especially natural in environments with abrupt breaks (policy changes, competitor entry, supply disruptions), where mixing pre- and post-break data inside a window is economically undesirable even if VT is moderate in aggregate.

A restart-based variant proceeds as follows. We maintain the valuation estimator θ̂t continuously (since θ is stable), but we segment time into episodes within which we treat demand as approximately stationary and run a stationary VAPE-style increment learner. To detect changes, we can monitor, for each increment δk, a statistic comparing purchase rates in two adjacent subwindows (CUSUM), or a generalized likelihood ratio (GLR) test for Bernoulli means. Because the algorithm pools data by increment (after translating by v̂s), the relevant inputs are the binary outcomes os associated with (approximate) plays of δk. When the test for any k crosses a threshold, we restart the demand learner (clear demand statistics and confidence sets), while keeping the current θ̂t and continuing probing only as needed to preserve the ε-accurate valuation guarantee.
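A minimal version of the two-sample detection step can be sketched in Python with a Hoeffding-style threshold in place of a full CUSUM/GLR statistic; the function name and the `alpha` parameter are illustrative assumptions, not the paper's calibrated test:

```python
import math

def break_detected(old, new, alpha=0.05):
    # Two-sample test on increment-level purchase frequencies (a sketch):
    # flag a break when the gap in empirical Bernoulli means exceeds the
    # combined Hoeffding confidence radius. A CUSUM/GLR test is sharper.
    n1, n2 = len(old), len(new)
    if min(n1, n2) == 0:
        return False
    m1, m2 = sum(old) / n1, sum(new) / n2
    radius = (math.sqrt(math.log(2 / alpha) / (2 * n1))
              + math.sqrt(math.log(2 / alpha) / (2 * n2)))
    return abs(m1 - m2) > radius

# Identical acceptance rates: no restart triggered.
assert not break_detected([1, 0, 1, 1] * 25, [1, 1, 0, 1] * 25)
# An obvious regime change: restart triggered.
assert break_detected([1] * 100, [0] * 100)
```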

The benefit of restarts is conceptual and practical: when a break occurs, we prevent contaminated confidence bounds that would otherwise be overconfident due to large sample sizes accumulated under an obsolete regime. The cost is that false alarms create unnecessary resets, and detection delay means we may still incur some regret after a break. Both can be controlled with standard concentration tools, yielding guarantees that scale with the number of true changes and the chosen thresholds.

If demand is piecewise stationary with at most S change points, then the restart perspective becomes canonical. In that case, we can express regret as the sum of (i) within-segment learning costs and (ii) switching overhead from detection and restart. With an oracle that knows the change points, we would simply run stationary VAPE independently on each segment, obtaining approximately
$$ \sum_{j=1}^{S+1} \tilde O\!\big(d^{2/3}T_j^{2/3}\big) \;\le\; \tilde O\!\big(d^{2/3}T^{2/3}(S+1)^{1/3}\big), $$
where Tj is segment length and we used concavity of x2/3. With online detection, an additional term appears that depends on the detection delay and false-alarm rate; in regimes where breaks are well-separated and sufficiently large (in the sense that the change in Dt(δ) exceeds the confidence radius for some increment δ), one can often sharpen this to bounds closer to $\tilde O(d^{2/3}T^{2/3}+\sqrt{ST})$. From an economic standpoint, this corresponds to markets with stable ``seasons’’ punctuated by regime changes: the platform benefits from explicitly recognizing regimes rather than continuously averaging.

Our baseline permits adversarial contexts, which is essential in applications where the platform can influence the context stream (through targeting, ranking, or experimentation) or where contexts correlate with unmodeled state variables. In settings where contexts are stochastic and satisfy favorable coverage conditions (e.g., xt i.i.d. with well-conditioned covariance), the valuation-learning layer can be improved. Specifically, ridge regression under i.i.d. designs typically yields estimation error scaling like $\tilde O(\sqrt{d/t})$, and the number of probe rounds needed to ensure |xt(θ̂t − θ)| ≤ ε can drop substantially relative to the elliptical-potential worst case. This can reduce the d2/3T2/3 component toward more classical $\tilde O(\sqrt{dT})$ behavior, while the nonstationary demand component (windowing or restarting) remains essentially unchanged. The broader message is that our separation between a stable linear index and a drifting one-dimensional demand curve is modular: improvements in either layer (better covariate structure, richer demand smoothness, or stronger feedback) translate directly into improved overall performance without altering the economic logic of increment pooling.


8. Empirical Illustration Plan (≤700w): Synthetic drift regimes; seasonality and sudden shocks; metrics; ablations (window vs restart).

Our theoretical results are intentionally modular—a stable linear index vt = xtθ coupled with a drifting one-dimensional demand curve δ ↦ Dt(δ). An empirical illustration should therefore stress-test both modules separately and jointly, and report diagnostics that map cleanly back to the terms in the regret decomposition (estimation variance, discretization error, and drift bias). We outline a synthetic evaluation protocol that is faithful to the binary-feedback setting and makes drift transparent and controllable.


We fix d, bounds Bx, Bθ, Bξ, and draw a ground-truth θ uniformly on the sphere scaled to ‖θ‖2 = Bθ (or from a Gaussian then rescaled). Contexts xt are generated under two regimes: (i) stochastic, with xt ∼ 𝒩(0, Σ) truncated to satisfy ‖xt‖2 ≤ Bx and Σ well-conditioned; and (ii) adversarial, where we cycle through nearly collinear directions to create difficult ridge geometry (large $\|x_t\|_{V_t^{-1}}$), while still respecting ‖xt‖2 ≤ Bx. In both cases, the seller observes only (xt, pt, ot).

For demand, we specify a family of noise distributions ξt with bounded support and explicit drift. A convenient choice is a truncated logistic or truncated Gaussian, parameterized by a location μt and scale σt, so that
Dt(δ) = ℙ(ξt ≥ δ) = 1 − Fμt, σt(δ),   δ ∈ [−By, By],
and we enforce |ξt| ≤ Bξ by truncation. This allows us to implement (a) smooth gradual drift (changing μt slowly), (b) time-varying steepness (changing σt), and (c) structured non-monotone changes via mixtures (e.g., switching between two components to mimic heterogeneity shocks). Because the theory measures drift via the sup-norm variation budget VT, we record both the realized VT and the local window variation Vt, w to compare against observed performance.
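A Python sketch of the truncated-logistic family and the realized variation budget on a finite increment grid; function names, the truncation bound, and parameter defaults are illustrative:

```python
import math

def D(delta, mu, sigma, B=3.0):
    # Truncated-logistic demand: D_t(delta) = P(xi_t >= delta), |xi_t| <= B.
    F = lambda z: 1.0 / (1.0 + math.exp(-(z - mu) / sigma))
    lo, hi = F(-B), F(B)
    z = min(max(delta, -B), B)
    return 1.0 - (F(z) - lo) / (hi - lo)

def realized_VT(mus, sigmas, grid):
    # Sup-norm variation budget V_T evaluated on a finite increment grid.
    total = 0.0
    for t in range(1, len(mus)):
        total += max(abs(D(g, mus[t], sigmas[t]) - D(g, mus[t - 1], sigmas[t - 1]))
                     for g in grid)
    return total

grid = [k / 10 - 1.0 for k in range(21)]                          # increments in [-1, 1]
mus = [0.2 * math.sin(2 * math.pi * t / 50) for t in range(200)]  # seasonal drift
V = realized_VT(mus, [0.5] * 200, grid)
assert V > 0.0
```

Recording the realized V (and its windowed analogue) alongside revenue is what lets performance be normalized by the theoretical drift scale.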


We propose three canonical regimes that correspond to common market narratives while remaining analytically interpretable.
(i) Seasonality: set μt = μ0 + a sin (2πt/P) with period P and amplitude a, holding σt fixed. This creates predictable oscillations in demand at every increment δ and highlights the windowing tradeoff: longer windows reduce variance but blur seasonal swings.
(ii) Sudden shocks: set (μt, σt) constant within segments and apply S abrupt changes at random times. This isolates the benefit of restarts and change-point detection, and permits reporting detection delay and false-alarm rates alongside revenue.
(iii) Mixed drift: combine (i) and (ii), or add a slow random walk in μt with occasional jumps. This setting is empirically plausible (slow macro drift with discrete competitive events) and is useful for comparing adaptive-window meta-selection against pure windowing or pure restarts.
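The three regimes can be generated in a few lines of Python; all parameter defaults below are illustrative choices, not prescriptions from the theory:

```python
import math
import random

def seasonal(T, mu0=0.0, a=0.3, P=50):
    # Regime (i): smooth periodic drift in the location parameter.
    return [mu0 + a * math.sin(2 * math.pi * t / P) for t in range(T)]

def shocks(T, S=3, jump=0.5, seed=0):
    # Regime (ii): piecewise-constant location with S abrupt breaks.
    rng = random.Random(seed)
    times = sorted(rng.sample(range(1, T), S))
    mu, path = 0.0, []
    for t in range(T):
        if times and t == times[0]:
            mu += rng.choice([-jump, jump])
            times.pop(0)
        path.append(mu)
    return path

def mixed(T, step=0.02, seed=0):
    # Regime (iii): seasonal component plus a slow random walk.
    rng = random.Random(seed)
    walk, w = [], 0.0
    for _ in range(T):
        w += rng.uniform(-step, step)
        walk.append(w)
    return [s + u for s, u in zip(seasonal(T), walk)]

assert len(set(shocks(200))) <= 4  # at most S + 1 distinct demand levels
```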


We evaluate: (a) windowed VAPE with a grid spacing ε and fixed window w; (b) restart-based VAPE with a CUSUM/GLR detector on selected increments δk; (c) adaptive-window selection via a small dyadic family 𝒲 combined by a meta-algorithm; and, as benchmarks, (d) stationary VAPE (effectively w = T) to quantify degradation under drift; (e) a valuation-only baseline that uses only v̂t = xtθ̂t and posts pt = v̂t (no increment learning), to show the value of learning Dt; and (f) an oracle that knows (θ, Dt) and posts pt*, which defines the regret comparator. For fairness, all non-oracle methods operate under the same price cap [0, By] and the same probing schedule (or, if probing is endogenous, we report its realized frequency).

To isolate mechanisms, we include targeted ablations: (1) hold θ known and learn only demand (removing the valuation layer); (2) hold Dt known and learn only θ (removing demand nonstationarity); (3) turn off probing to illustrate failure under adversarial contexts; and (4) vary ε and w to empirically recover the bias–variance–drift tradeoff predicted by the bound terms Tε, $T/\sqrt{w}$ (up to logs), and wVT/ε.


Beyond cumulative revenue, we emphasize time-resolved measures that diagnose where regret occurs.
Primary outcomes are: (i) cumulative dynamic regret ∑t(πt(xt, pt*) − πt(xt, pt)) computed with Monte Carlo access to πt (possible in simulation), and (ii) the revenue ratio ∑trt/∑trt* to give an economically interpretable scale. We also report: (iii) the tracking errors |δ̂t − δt*| and |pt − pt*|; (iv) |xt(θ̂t − θ)| on non-probing rounds to confirm the intended ε-accuracy; (v) the sample counts Nt, wk across increments to show which parts of increment space are being learned; and (vi) for restart-based methods, detection delay (time between a true break and the first restart) and false-alarm rate. Finally, since our analysis relies on confidence bounds, we can directly assess empirical coverage: the fraction of times Dt(δk) lies within the constructed interval, stratified by drift intensity.
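In simulation, where oracle quantities are available, the primary metrics reduce to simple aggregations; a hypothetical Python sketch:

```python
def dynamic_regret(oracle_rev, realized_rev):
    # Metric (i): cumulative gap against the per-round oracle revenue.
    return sum(o - r for o, r in zip(oracle_rev, realized_rev))

def relative_revenue(realized_rev, oracle_rev):
    # Metric (ii): sum_t r_t / sum_t r_t^*, an interpretable revenue share.
    return sum(realized_rev) / sum(oracle_rev)

def coverage(true_vals, intervals):
    # Coverage diagnostic: fraction of rounds with D_t(delta_k) inside its CI.
    hits = sum(1 for v, (lo, hi) in zip(true_vals, intervals) if lo <= v <= hi)
    return hits / len(true_vals)

assert coverage([0.5, 0.9], [(0.4, 0.6), (0.5, 0.8)]) == 0.5
```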


We recommend plotting cumulative regret and instantaneous regret (per-round gap) against time, annotated with true shock times and with estimated change points. For seasonality, heat maps of chosen increments δ̂t over t visually reveal whether the learner tracks periodic structure or averages it out. Stress tests should vary (d, T), the steepness proxy Lξ (via σt), and the drift magnitude (via a, jump sizes, and S), and should explicitly report the realized VT so that performance can be normalized by the theoretical drift scale. The goal is not to ``win’’ a leaderboard, but to validate the economic logic of increment pooling under drift and to clarify when windowing suffices versus when explicit restarts are operationally superior.


9. Discussion and Policy Implications (≤700w): Safe experimentation under drift; monitoring and governance; when pooling is justified.

Our model deliberately separates two economic forces that are often conflated in practice: a relatively stable vt = xtθ (how observable attributes translate into willingness to pay) and a one-dimensional, potentially drifting relationship δ ↦ Dt(δ) (how acceptance responds to a markup/markdown relative to the index). This decomposition is not merely a proof device; it suggests a pragmatic design principle for dynamic pricing systems operating under nonstationarity and limited feedback. In particular, it clarifies what should be learned across contexts (the mapping from x to v) versus what should be tracked (the demand curve over δ), and it makes the costs of each layer legible through quantities like the drift budget VT and window length w.


A central operational challenge is that learning requires experimentation, yet experimentation is precisely what creates short-run revenue risk and, in regulated settings, potential consumer harm. Our setting is informative because it uses only binary feedback ot = 1{pt ≤ yt}, which mirrors many real marketplaces where the seller observes purchase/no-purchase but not latent willingness to pay. The regret decomposition that underlies our guarantee can be read as a safety ledger: discretization and estimation contribute terms scaling like Tε and (up to logarithms) $T/\sqrt{w}$, while drift contributes a term of order wVT/ε. This immediately yields two safety-relevant prescriptions. First, keep experimentation targeted and bounded: probing prices are not an invitation to explore arbitrarily high markups, but a targeted way to maintain |xt(θ̂t − θ)| ≤ ε on non-probing rounds even under adversarial context sequences. In practice, this suggests implementing explicit guardrails on experimentation, such as restricting probes to a conservative interval within [0, By], enforcing per-customer price change limits, and capping the fraction of traffic assigned to probing when demand is highly sensitive. Second, never treat exploration as a phase that ends. Under sustained drift, the relevant comparator is the per-round optimum pt*, not a fixed price, so a policy that ``graduates'' from exploration to exploitation can become systematically miscalibrated. Windowing (or restarts) operationalizes this insight: it effectively commits the system to forgetting outdated demand information at a controlled rate, trading variance for responsiveness.

There is also a policy-relevant asymmetry in what can safely be pooled. Estimating θ benefits from pooling across time and contexts because θ is assumed stable; indeed, the probing frequency needed to control valuation error depends on geometry (through $\|x_t\|_{V_t^{-1}}$) rather than on demand drift. By contrast, pooling demand observations across long horizons can be unsafe under drift: outdated acceptance data can induce persistent overpricing or underpricing, with welfare and revenue consequences. The model therefore supports a two-speed architecture: slow, conservative updates to θ̂ (with explicit uncertainty tracking) coupled with fast, windowed updates to D̂t, w(k).


Because nonstationarity is the norm rather than the exception in many markets, governance should treat drift as a first-class object to be monitored. The theoretical quantities VT and Vt, w suggest concrete diagnostics: we can track, for each increment grid point δk, the discrepancy between short-window and long-window demand estimates, or equivalently the instability of D̂t, w(k) as w varies over a small family. Large and persistent discrepancies are an operational proxy for high local variation and should trigger either a reduction in window length (to reduce drift bias) or a restart/change-point procedure (to purge contaminated history). Importantly, governance is not just about reacting to drift, but about verifying that uncertainty statements remain calibrated. Since our learning rule relies on time-local confidence bounds, one can audit calibration online: how often realized purchase indicators fall within predicted acceptance intervals at chosen δk values. Systematic under-coverage is a red flag for misspecification (e.g., violation of bounded noise, heavy tails, or unmodeled heterogeneity) and should escalate to a fallback pricing mode.

Governance also includes pre-deployment validation and post-deployment checks that map to our modeling assumptions. Pre-deployment, one should validate that the index structure is approximately linear (or at least stable) over the intended operational range, because the claim that cross-context information is transferable hinges on vt being the right sufficient statistic for context. Post-deployment, we recommend maintaining a ``shadow'' evaluation that periodically tests counterfactual prices on a small randomized holdout, not to maximize short-run revenue, but to estimate whether the current policy is drifting away from the per-round optimality conditions implied by (v + δ)Dt(δ). This sort of monitoring is particularly important when the platform faces strategic behavior or competitive responses, since such forces can manifest as abrupt changes in Dt even if θ is stable.

From a compliance and consumer-protection perspective, windowing and restarts have an additional advantage: they create a natural data-retention policy. Sliding windows limit the extent to which long-past behavior affects current pricing, which can be aligned with privacy norms and reduce the risk that historical artifacts or temporary shocks permanently distort prices. That said, reduced retention is not a substitute for fairness or non-discrimination constraints; rather, it should be combined with explicit constraints (e.g., bounding price dispersion across protected classes conditional on x) and audits for disparate impact. Our framework can accommodate such constraints at the action-selection stage (restricting feasible prices), but the resulting regret benchmarks and safety guarantees would need to be re-derived for the constrained comparator.


The main economic promise of the increment formulation is pooling: by learning Dt(δ), we reuse information across heterogeneous contexts because only δ = p − v matters for acceptance. Pooling is justified to the extent that unobserved heterogeneity enters valuations additively and independently of x after conditioning on the linear index, so that ℙ(ot = 1 ∣ xt, pt) = Dt(pt − xtθ). Practically, this is a strong claim, and we should treat it as an empirical hypothesis rather than a truth. A simple diagnostic is to check whether acceptance residuals, after mapping prices into increments using v̂t = xtθ̂t, appear invariant across different slices of the context space. If the estimated demand curves differ systematically by context segment even after increment normalization, then pooling will induce bias that no choice of w can fix: the model is misspecified.

When pooling fails, the policy implication is not necessarily to abandon the approach, but to refine the notion of transfer. One option is segmentation: maintain separate increment-demand estimates Dt(m)(δ) for a small number of market regimes or customer segments (defined by coarse, policy-approved covariates), while keeping a shared θ̂ or a hierarchical prior across segments. Another is to relax linearity of vt via richer representations while preserving the one-dimensional increment structure; in that case, the key governance requirement is to ensure that the representation is stable enough over time to justify transferring demand information. Conversely, if drift is primarily in the mapping from context to valuation (i.e., θ itself changes), then the recommended architecture flips: demand pooling may remain valuable, but one must track θt with explicit nonstationary linear bandit methods, and our guarantees no longer directly apply.


Our analysis abstracts from inventory constraints, repeated-customer dynamics, and strategic behavior, each of which can create feedback loops that violate conditional independence or stationarity of θ. Nonetheless, the high-level lesson is robust: safe and effective dynamic pricing under drift requires (i) explicit uncertainty tracking, (ii) explicit forgetting mechanisms, and (iii) careful justification for what is pooled across contexts. By tying these design choices to interpretable objects—ε as a tolerance for valuation approximation, w as a responsiveness knob, and VT as a drift scale—the model illuminates a concrete tradeoff that pricing teams can govern rather than merely observe.


Appendices (rest): Full proofs; concentration lemmas for sliding windows; parameter tuning; additional experiments.

This appendix collects the technical components that are intentionally suppressed in the main text in order to keep the economic logic transparent. Our guiding principle is to make each step auditable: (i) where the linear index structure is used, (ii) where boundedness and Lipschitz regularity enter, and (iii) where the drift budget VT is the only nonstationary quantity that matters. To that end, we separate the proofs into modular lemmas (concentration, discretization, and drift control) and then assemble them into the regret bound via a bookkeeping argument that mirrors the algorithmic design.

We provide complete proofs for the structural decomposition, valuation approximation, windowed demand estimation, and the dynamic regret theorem. The proofs are written so that each claim can be traced to a small number of primitives: the binary observation model ot = 1{pt ≤ yt}, the index signal vt = xtθ, and the increment-demand map δ ↦ Dt(δ). In particular, the regret analysis is organized into three parts.

This modular structure also clarifies which assumptions are essential: bounded noise and Lipschitz demand are only invoked to translate valuation error and grid spacing into demand error; the adversarial contexts are handled entirely through self-normalized concentration for the index estimation step.

A technical bottleneck is that we estimate acceptance probabilities using only the last w ``eligible’’ plays of each increment. Because the environment drifts, the relevant target is Dt(δk) at the current time t, not an average over the window. We therefore decompose estimation error into (i) a sampling term and (ii) a drift-induced bias term. The key concentration statement is formulated for a generic windowed Bernoulli process with predictable sampling.

Let 𝒮t, wk denote the set of times s ∈ {t − w + 1, …, t} at which increment k is played (i.e., ps = v̂s + δk on non-probing rounds), and let Nt, wk := |𝒮t, wk|. Define the windowed estimator
$$ \hat D_{t,w}(k)\;:=\;\frac{1}{N_{t,w}^k}\sum_{s\in\mathcal{S}_{t,w}^k} o_s, \qquad\text{(with the convention $\hat D_{t,w}(k)=1$ if $N_{t,w}^k=0$).} $$
The appendix proves the following lemma (the version we use is uniform over t and k via a union bound).



For a fixed pair (t, k), with probability at least 1 − δ,
$$ \bigl|\hat D_{t,w}(k)-D_t(\delta_k)\bigr| \;\le\; \sqrt{\frac{2\log(2/\delta)}{N_{t,w}^k}} \;+\; L_\xi\,\varepsilon \;+\; V_{t,w}. $$

The proof follows a standard pattern but with two nonstandard wrinkles. First, because 𝒮t, wk is history-dependent, we treat {os − 𝔼[os ∣ ℱs − 1]} as bounded martingale differences and apply Freedman (or Hoeffding) conditional on the sampling rule. Second, to compare 𝔼[os ∣ ℱs − 1] to Dt(δk) we telescope across time using the sup-norm variation budget; this is where Vt, w enters additively and cannot be improved without stronger smoothness assumptions on the time variation of Dt.
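Operationally, the lemma's statistical component corresponds to a windowed mean with a Hoeffding-type radius. A Python sketch; the drift bias Vt, w and the Lξε term are not identifiable from the outcomes alone and must be budgeted separately by the caller:

```python
import math

def windowed_estimate(outcomes, delta=0.05):
    # Windowed Bernoulli mean with the lemma's Hoeffding-type radius
    # sqrt(2 log(2/delta) / N); drift and discretization bias excluded.
    n = len(outcomes)
    if n == 0:
        return 1.0, 1.0  # optimistic convention for an empty window
    mean = sum(outcomes) / n
    radius = math.sqrt(2 * math.log(2 / delta) / n)
    return mean, radius

mean, radius = windowed_estimate([1, 0, 1, 1, 0, 1, 1, 1])
assert 0.0 <= mean <= 1.0 and radius > 0.0
```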

We include a self-contained tuning guide that starts from an explicit high-probability regret decomposition and then selects (ε, w) to balance discretization, estimation variance, and drift bias. The tuning argument is deliberately constructive: we keep track of how the grid size |K| = Θ(By/ε) interacts with union bounds over increments, and we spell out the constraints needed for the elimination procedure to remain well-posed under windowing (e.g., minimal sample requirements for confidence intervals to shrink). This yields choices of the form
ε ≍ T−1/3VT1/3,   w ≍ T2/3VT−2/3,
up to logarithmic and problem-dependent factors, which recover the stated dependence T2/3VT1/3 in the nonstationary component. We also discuss two practically relevant variants: (i) unknown VT, handled by a small grid of candidate window lengths and a meta-selection rule based on empirical instability of demand estimates; and (ii) piecewise-stationary demand, where restarts (triggered by a change-point test) dominate sliding windows and improve constants when drift is sparse.

Finally, we report supplementary simulations designed to stress-test the modeling choices rather than merely showcase headline regret. The additional experiments include: (a) smoothly drifting demand Dt (e.g., sinusoidal shifts in a logistic family) to visualize the window-length responsiveness tradeoff; (b) piecewise-stationary regimes with abrupt breaks to compare sliding windows versus explicit restarts; (c) adversarial context sequences that amplify the need for probing to stabilize the index estimate; and (d) controlled violations of assumptions (mild heavy tails, non-Lipschitz kinks, or context-dependent residual demand) to diagnose failure modes. Across these settings, we summarize three empirical regularities that align with the theory: shorter windows reduce systematic mispricing after a drift event, finer ε improves performance only until estimation noise dominates, and the probing mechanism primarily affects early-round stability (and then becomes dormant once θ̂ is sufficiently accurate). We view these results as complements to the regret bound: they illustrate how the theoretical knobs (ε, w) translate into operational behavior, and they make clear where the model is predictive versus where it is a convenient approximation.