Firms in retail, travel, and digital marketplaces routinely face a pricing problem that is dynamic for two distinct reasons. First, demand is stochastic and depends on the posted price, so pricing is inherently an exploration–exploitation problem: the seller must learn which prices induce high expected revenue while continuing to earn revenue along the way. Second, and just as importantly in many applications, inventory is finite and must be allocated over a short selling season. When inventory is scarce, a low price early in the season can be privately optimal in a static sense yet socially costly for the seller because it accelerates stockouts and forecloses later high-margin sales; when inventory is abundant, the opposite logic can prevail. The economic object of interest is therefore not merely the best myopic price but a policy that trades off current revenue against the shadow value of remaining inventory.
A natural response is to combine a nonparametric demand learner with a dynamic program that plans over the remaining horizon. In principle, one can place a Gaussian process (GP) prior on the mean demand curve and, after each demand observation, update the posterior and recompute an optimal finite-horizon policy. In practice, this ``full model-based’’ approach quickly becomes computationally prohibitive: exact dynamic programming requires solving a Bellman recursion over all inventory levels and time periods, and it must be repeated as the posterior changes. When the season is long, the inventory is large, or when one needs to consider many candidate prices, the cost of value iteration dominates the online decision loop. Moreover, the data available to the seller are frequently censored by stockouts: we observe sales rather than unconstrained demand, so the likelihood updates required for posterior learning may be nonstandard. These frictions make it difficult to deploy policies with strong theoretical guarantees in operational settings where pricing decisions must be made in milliseconds.
At the other extreme lie policies that prioritize speed and ease of implementation. A common heuristic in industry is to treat the problem as a sequence of static pricing decisions: use Bayesian optimization (BO) or a GP upper confidence bound (GP-UCB) rule to select prices that appear promising for revenue, and then repeat. Such policies can be remarkably effective when inventory is effectively infinite, when holding back inventory has little value, or when the horizon is long enough that the seller can ``average out’’ mistakes. But in genuinely finite-inventory regimes, myopic learning rules can be systematically biased. If a BO-style policy learns that low prices generate high demand, it may concentrate experimentation in low-price regions precisely because those prices produce many observations; yet those same prices can deplete inventory and prevent learning about higher prices that matter most when inventory becomes scarce. In other words, finite inventory couples learning and planning: where we choose to collect data depends on inventory dynamics, and the value of information depends on how inventory will be used later. Ignoring this coupling is not merely a technical omission; it is a behavioral assumption about the seller that is often at odds with the economics of scarcity.
The central gap we address is thus the following. On one side are fast BO-driven heuristics that offer scalability but lack a principled account of intertemporal inventory tradeoffs. On the other side are model-based finite-inventory policies that correctly internalize these tradeoffs but are too slow to recompute online as learning proceeds, especially under nonparametric demand uncertainty. Bridging this gap requires a method that (i) plans over the remaining season in a way that respects inventory constraints, (ii) learns the demand curve flexibly and quantifies uncertainty, and (iii) can be evaluated quickly enough to serve as an online policy, rather than an offline benchmark.
Our approach is to replace full-horizon dynamic programming with a planning objective that is deliberately structured to be inexpensive to evaluate while remaining sensitive to inventory scarcity. Concretely, at each decision epoch we consider a finite lookahead window of length L and evaluate the expected revenue of holding the price constant over that window, accounting for the induced inventory depletion. This rollout-style objective is then combined with GP-based uncertainty quantification via an explicit exploration bonus. The resulting policy retains the statistical intuition of GP-UCB—optimism under uncertainty—but applies it to an objective that already internalizes, albeit approximately, the dynamic opportunity cost of inventory. The key modeling choice is that the demand family is known up to its mean, which lets us translate posterior beliefs about the mean demand function into tractable predictions of expected sales under finite inventory. This is precisely the object that enters revenue and inventory transitions, and it is the primitive that makes finite-inventory learning and planning commensurable.
This design yields two complementary benefits. From a computational standpoint, we avoid the ``curse of inventory dimensionality’’ that burdens value iteration: rather than solving for a value function over all inventory levels, we evaluate the rollout objective only along the realized inventory trajectory, and we do so using expectations of truncated demand that can be computed quickly for common demand families. The computational savings are most pronounced in the empirically relevant regime in which inventory C is large relative to the planning window L; in such cases, the receding-horizon policy can be orders of magnitude faster than recomputing a full dynamic program as learning proceeds. From a decision-theoretic standpoint, the policy behaves qualitatively differently from myopic BO: when inventory is low, the lookahead objective places more weight on conserving stock for future periods, and the exploration bonus is effectively filtered through the scarcity-sensitive revenue criterion.
The resulting theory also clarifies the economic tradeoffs induced by approximate planning. Short lookahead horizons reduce computation but introduce a planning bias: even with perfect knowledge of demand, an L-step rollout need not coincide with the fully optimal finite-horizon policy. Longer horizons reduce this approximation error but increase computational cost and, importantly, amplify sensitivity to demand misestimation because forecast errors affect multiple future steps. Our regret analysis reflects this structure by decomposing performance loss into a term that captures planning bias and a term that captures statistical learning error. This decomposition is useful beyond our specific algorithm: it makes explicit the idea that scalability in finite-inventory pricing is not ``free,’’ but rather is paid for either in computation (if one insists on exact planning) or in approximation (if one restricts planning to a tractable class). By making this tradeoff explicit, we can state conditions under which a modest lookahead horizon delivers near-optimal performance while preserving the runtime needed for practical deployment.
Finally, we emphasize limitations and scope. Receding-horizon constant-price rollouts are a controlled form of approximation; they are not intended to match the full flexibility of an optimal dynamic policy in settings with rapidly changing optimal prices or strong nonstationarities within a season. Likewise, censored demand from stockouts can complicate posterior updates, and while our framework accommodates standard demand families and tractable likelihoods, it may require numerical approximations in more complex environments. These caveats are not incidental: they delineate the boundary between what can be implemented reliably at scale and what remains computationally aspirational.
In sum, the finite-inventory pricing problem forces us to confront learning and planning simultaneously. The contribution of this paper is to show that one can retain nonparametric demand learning and provide meaningful regret and runtime guarantees without solving the full dynamic program at every update, by embedding GP optimism within a scarcity-aware receding-horizon objective. This perspective aligns with practice—where fast replanning is essential—while preserving the economic logic of inventory shadow values that myopic BO-style heuristics neglect.
Our setting sits at the intersection of dynamic pricing, sequential learning, and approximate planning under operational constraints. We briefly review the most relevant strands, emphasizing how our contribution—a GP-based learner embedded in a scarcity-aware receding-horizon objective—differs from both myopic Bayesian optimization heuristics and fully model-based dynamic programs.
A large literature studies dynamic pricing when demand depends on price through an unknown low-dimensional parameter, often assuming linear or generalized linear demand with additive noise, or a parametric family for purchase probabilities. In these models, the seller faces a canonical exploration–exploitation tradeoff, and performance is commonly measured by regret relative to the best price (or best policy) under the true parameter. Parametric structure permits sharp finite-sample guarantees and policies based on optimism or Thompson sampling, and it often supports computationally tractable planning when inventory is absent or effectively infinite. However, the parametric approach can be brittle in retail environments where price response is nonlinear, exhibits local irregularities (e.g., psychological price points), or changes with promotion and visibility. Our framework instead emphasizes flexible (nonparametric) learning of the mean demand curve, while still exploiting the operational structure that demand is ``known up to its mean’’ so that the inventory transition depends on a tractable truncated-sales primitive.
Nonparametric dynamic pricing has been developed using Lipschitz/shape constraints, sieve estimators, kernel methods, and Bayesian nonparametrics. These approaches are attractive because they allow the data to reveal the shape of the price–demand relationship without committing to a specific functional form. At the same time, purely nonparametric methods often pay a statistical price: regret rates deteriorate with dimension, and even in one dimension the constants can be large unless the algorithm is carefully designed. Our use of Gaussian processes (GPs) aligns with this literature by providing a principled prior over smooth functions and a posterior that delivers both a point estimate and an uncertainty measure. The latter is crucial for online decision-making, because uncertainty quantification can be translated into exploration bonuses and ultimately into regret guarantees.
The GP-UCB and Bayesian optimization (BO) literatures provide a standard toolkit for sequentially optimizing an unknown function observed with noise, with prominent guarantees for cumulative regret in bandit settings. In pricing, a common adaptation treats the objective as one-period expected revenue p h(p) and applies GP-UCB (or expected improvement, Thompson sampling, etc.) to choose prices that are optimistic about this static objective. These methods are compelling in practice because they are simple, data-efficient in one dimension, and relatively easy to implement. The limitation for our purposes is conceptual rather than statistical: in finite-inventory problems, the economically relevant objective is not the static revenue curve but the intertemporal allocation of scarce stock. A myopic BO rule can over-explore low prices that generate many demand observations precisely when doing so accelerates stockouts and reduces the option value of saving inventory for later periods. Our algorithm can be viewed as importing the GP-UCB principle of optimism under uncertainty into a proxy objective that already prices in (approximately) the shadow value of inventory.
Dynamic pricing with inventory constraints is a classical topic in revenue management. When demand is known, the optimal policy can often be characterized via dynamic programming, and in special cases by structural monotonicity or threshold properties. With learning, the problem becomes substantially harder because the seller observes sales rather than unconstrained demand when stockouts occur, creating censoring and complicating inference. Several works study learning under stockouts using parametric models, partial monitoring techniques, or explicit censored-likelihood updates. Our approach does not attempt to resolve all censoring complications in full generality; rather, we focus on regimes and demand families where truncated-sales expectations are tractable and where censored feedback can be handled either by mild assumptions (e.g., bounded demand relative to inventory) or by using demand families that admit workable likelihoods. This emphasis reflects an implementation constraint: policies that require sophisticated latent-demand reconstruction may be difficult to deploy at scale.
A distinct line of work addresses the computational burden of exact dynamic programming by restricting the policy class or approximating the value function. Common strategies include approximate value iteration, policy iteration with function approximation, fitted value methods, and rollout algorithms that evaluate a simple base policy over a limited horizon. Receding-horizon control (model predictive control) similarly optimizes over a finite lookahead window and replans as time evolves. These methods are appealing in operational settings because they trade optimality for speed and robustness. Our method fits squarely into this tradition: we replace the full Bellman recursion with an L-step constant-price rollout objective. What is distinctive is the way we integrate this approximation with nonparametric Bayesian learning. The rollout is designed so that its evaluation depends on demand only through expected truncated sales, which can be computed quickly for standard families, making the policy viable as an online control rule rather than a computational benchmark.
In approximate dynamic programming, it is well understood that restricting the lookahead horizon introduces bias: even with a perfectly known model, the restricted planner may choose suboptimal actions because it cannot fully internalize long-run consequences. In our setting, this bias has a natural economic interpretation as the cost of limited foresight in managing scarce inventory. At the same time, using a longer horizon increases both runtime and the sensitivity of the objective to demand misestimation, because errors compound over multiple future steps. While related tradeoffs appear in reinforcement learning (e.g., model-based planning depth versus model error), the finite-inventory structure makes the tradeoff particularly transparent: the relevant state variable is inventory, and the mechanism by which errors propagate is through stock depletion. Our regret decomposition explicitly separates the cost of approximation (planning bias) from the cost of statistical error (learning), thereby clarifying when limited lookahead is an acceptable engineering compromise.
Finite-inventory pricing is also related to bandits with knapsacks (BwK) and more broadly to bandit problems with resource constraints, where each action consumes a budgeted resource and the horizon ends when the budget is exhausted. BwK algorithms provide regret bounds that scale with the resource budget and typically rely on optimism in estimated rewards and consumptions, often via primal–dual or confidence-bound arguments. The analogy is informative: inventory plays the role of a consumable resource, and price choices trade off revenue against consumption (sales). However, our environment differs in two ways that matter for both modeling and computation. First, the control is continuous (price in an interval) and demand is governed by an unknown function of price. Second, the within-season dynamics are explicitly Markovian in inventory and time, so the opportunity cost of inventory depends on remaining time, not just remaining budget. Our receding-horizon objective can be interpreted as a computationally light way to capture this time-dependent shadow value without solving the full dynamic program.
Another related literature studies bandits and online learning with switching costs or constraints on price adjustments, motivated by menu costs, consumer fairness concerns, or platform constraints. Although our baseline model does not impose explicit switching penalties, receding-horizon control has an implicit stabilizing effect because it evaluates prices through their multi-period inventory consequences; in practice this can reduce erratic price changes relative to purely myopic BO rules. Extending our framework to incorporate explicit switching costs is conceptually straightforward (one would augment the state with the previous price and penalize changes), but doing so would introduce additional computational considerations and would sharpen the motivation for approximate planning even further.
Relative to existing work, we aim to occupy a middle ground that is economically meaningful and computationally deployable. Compared with myopic GP-based pricing, our policy internalizes inventory scarcity via a dynamic (albeit approximate) objective. Compared with exact model-based dynamic pricing with nonparametric learning, we avoid repeated value iteration by restricting attention to a rollout class that can be evaluated quickly. Finally, by structuring the analysis around a decomposition into planning bias and learning error, we make explicit the sense in which scalability is purchased either with computation (longer horizons, richer planning) or with approximation (shorter horizons, simpler rollouts), a tradeoff that is central in practice but often implicit in the theory.
We study a seller who repeatedly prices a single product over seasons n = 1, 2, …, N. Each season lasts T selling periods and begins with a fresh inventory endowment C ∈ ℤ+; thus the seller faces an episodic finite-horizon control problem in which learning can accumulate across seasons while inventory resets. Our focus is on the economically salient case in which scarcity is binding within a season—so prices affect not only contemporaneous revenue but also the probability and timing of stock depletion.
Within a fixed season, we index periods by t = 1, …, T. Let st ∈ {0, 1, …, C}
denote the inventory remaining at the start of period t, with s1 = C. In each
period, the seller observes (st, t)
and posts a price pt ∈ [pℓ, ph].
Demand then realizes as an integer-valued random variable D(pt),
after which sales are
qt = min {st, D(pt)}, rt = ptqt,
and the inventory state updates according to st + 1 = st − qt
(with sT + 1
irrelevant). This defines a finite-horizon Markov decision process (MDP)
with state (st, t)
and continuous action pt. The central
modeling feature is that demand is independent across periods
conditional on posted prices, so that the dynamics are driven by the
current price only through the distribution of D(pt).
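To fix ideas, the following minimal sketch simulates one season of these dynamics under a Poisson instance of the demand family; the function names and default arguments are illustrative, not part of the formal model.

```python
import numpy as np

def simulate_season(h, price_policy, C, T, rng=None):
    """Simulate one season of the finite-inventory pricing MDP.

    h            : mean demand curve p -> E[D(p)] (unknown to the seller in the paper).
    price_policy : callable (s, t) -> price in [p_l, p_h].
    C, T         : starting inventory and number of selling periods.
    Returns the observed (price, sales) history and total season revenue.
    """
    rng = rng or np.random.default_rng()
    s, history, revenue = C, [], 0.0
    for t in range(1, T + 1):
        p = price_policy(s, t)
        d = rng.poisson(h(p))          # one instance of the family P_{h(p)}
        q = min(s, d)                  # sales are censored by remaining stock
        revenue += p * q
        history.append((p, q))
        s -= q                         # inventory transition s_{t+1} = s_t - q_t
    return history, revenue
```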
We assume demand is known up to its mean in the following sense. There exists a
one-parameter family of distributions {Pm : m ∈ [0, d̄]}
on {0, 1, …, d̄} such
that
D(p) ∼ Ph(p), h(p) := 𝔼[D(p)],
where h : [pℓ, ph] → [0, d̄]
is unknown to the seller. Examples include Bernoulli demand, truncated
Poisson demand, or a Binomial model with fixed number of trials and
unknown success probability (equivalently, unknown mean). This
restriction is not merely technical: it captures a common
revenue-management practice in which the distributional form of idiosyncratic purchase noise
is treated as stable (e.g., Poisson arrivals), while the mean arrival
rate is the object of estimation and control.
We impose smoothness on h to reflect that consumer response changes gradually with price at the resolution we consider. Concretely, we assume h ∈ Cβ([pℓ, ph]) for some β > 0, and we also assume h lies in the RKHS associated with a squared-exponential kernel. The latter is aligned with our GP learning framework and is best interpreted as a regularity condition enabling uniform error control for nonparametric estimation in one dimension.
The seller observes posted prices and realized sales (pt, qt), but not necessarily unconstrained demand D(pt). When st > 0 and qt < st, we infer that D(pt) = qt. When qt = st, a stockout may have occurred, and the observation is censored: we only learn that D(pt) ≥ st. This is a fundamental difficulty in finite-inventory learning problems, because aggressive low pricing can simultaneously generate many observations and induce censoring by exhausting inventory.
Our baseline analysis emphasizes regimes in which censoring is either limited or tractable. One sufficient condition is d̄ ≤ C, so that single-period demand cannot exhaust inventory from typical states except near the end of a season. Alternatively, for certain families Pm, one can update beliefs using censored likelihoods (e.g., incorporating the event {D(p) ≥ s} when q = s). When neither simplification is appropriate, inference typically becomes numerical; this does not change the economic structure of the problem, but it does affect implementability and motivates computationally light planning rules.
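As one concrete instance of the censored-likelihood route, the sketch below gives the per-observation log-likelihood under a Poisson demand model; it uses an exact term when q < s and the event {D ≥ s} when q = s (the function name is ours).

```python
from scipy.stats import poisson

def censored_loglik(mean, s, q):
    """Log-likelihood of one observation (inventory s, sales q) under Poisson(mean)
    demand with stockout censoring."""
    if q < s:
        # demand observed exactly: D = q
        return poisson.logpmf(q, mean)
    # q == s: possible stockout, we only know D >= s, i.e. log P(D > s - 1)
    return poisson.logsf(s - 1, mean)
```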
A (possibly history-dependent) pricing policy π maps observed histories
(equivalently, the current state and accumulated data) to actions in
[pℓ, ph].
The expected within-season revenue starting from (C, 1) is
$$
V^\pi(C,1)\;=\;\mathbb{E}_\pi\Big[\sum_{t=1}^T p_t q_t\Big],
$$
where the expectation integrates over demand randomness and any
algorithmic randomization. If the true mean function h were known, the seller could
compute the finite-horizon value function via the Bellman
recursion
$$
V^*(s,t)\;=\;\max_{p\in[p_\ell,p_h]}\;\mathbb{E}\Big[\,p\,\min\{s,D(p)\}\;+\;V^*\big(s-\min\{s,D(p)\},\,t+1\big)\Big],
\qquad V^*(s,T+1)=0,
$$
and would achieve V*(C, 1) = supπVπ(C, 1).
This is the natural benchmark for evaluating a learning algorithm
because it accounts for dynamic scarcity rents: a policy is judged not
against the best price, but against the best inventory-aware policy
under the true demand curve.
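For reference, a compact sketch of this clairvoyant benchmark is given below for a Poisson demand family with the tail mass lumped at d̄; the grid, the truncation, and all names are illustrative choices rather than part of the model.

```python
import numpy as np
from scipy.stats import poisson

def value_iteration(h, C, T, price_grid, d_bar):
    """Exact finite-horizon DP benchmark V*(s, t) for a known mean demand curve h.

    Implements V*(s,t) = max_p E[ p*min(s,D(p)) + V*(s - min(s,D(p)), t+1) ],
    with V*(., T+1) = 0.  Runtime is O(C * T * P * d_bar), which is exactly the
    cost that the receding-horizon rule avoids.
    """
    V = np.zeros((C + 1, T + 2))            # V[s, t]; column T+1 stays at zero
    policy = np.zeros((C + 1, T + 1))
    ks = np.arange(d_bar + 1)
    for t in range(T, 0, -1):
        for s in range(C + 1):
            best_val, best_p = 0.0, price_grid[0]
            for p in price_grid:
                pmf = poisson.pmf(ks, h(p))
                pmf[-1] += poisson.sf(d_bar, h(p))     # lump the tail mass at d_bar
                sales = np.minimum(s, ks)
                val = np.sum(pmf * (p * sales + V[s - sales, t + 1]))
                if val > best_val:
                    best_val, best_p = val, p
            V[s, t], policy[s, t] = best_val, best_p
    return V, policy
```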
Across seasons, we measure performance by cumulative regret
$$
\mathrm{Regret}(N)\;=\;N V^*(C,1)\;-\;\sum_{n=1}^N
\mathbb{E}\Big[\sum_{t=1}^T r_t^{(n)}\Big].
$$
This benchmark is deliberately demanding: it compares the learning
policy to a clairvoyant controller who knows h and solves the exact
finite-horizon MDP each season. The decomposition we develop later
separates the computational cost of approximating this benchmark
(planning bias) from the statistical cost of learning h (learning error).
A recurring object is the expected sales under price p when inventory is s,
m(s, p) := 𝔼[min {s, D(p)}].
Under our ``known up to mean’’ assumption, m(s, p) depends on
p only through the scalar mean
h(p). That is, there
exists a known function g such
that
m(s, p) = g(s, h(p)).
This reduction is economically meaningful: the inventory transition and
one-step expected revenue depend on the demand curve primarily through its mean h(p),
truncated by available stock. Technically, it is also what allows us to
avoid summing over all demand realizations when evaluating approximate
planning objectives. We assume g(s, ⋅) is 1-Lipschitz uniformly over s ≤ C, a mild stability
condition satisfied by many standard families and essential for
translating sup-norm estimation error in h into control error in revenue.
For some demand families, m(s, p) can be computed in closed form (or nearly so), which is pivotal for scalable dynamic pricing rules.
If D(p) ∈ {0, 1}
with 𝔼[D(p)] = h(p),
then for any s ≥ 1 we have
min {s, D(p)} = D(p)
and hence
m(s, p) = h(p) (s ≥ 1),
with m(0, p) = 0. In
this case, truncation is essentially irrelevant except at s = 0.
If D(p) ∼ Pois(λ)
with λ = h(p),
then
$$
m(s,p)\;=\;\sum_{k=0}^{s-1} k\,\mathbb{P}(K=k)\;+\;s\,\mathbb{P}(K\ge
s),\qquad K\sim \mathrm{Pois}(\lambda).
$$
Using standard Poisson identities, one can rewrite the partial
expectation as a CDF term:
𝔼[K 1{K ≤ s − 1}] = λ FPois(λ)(s − 2),
with the convention FPois(λ)(−1) = 0.
Therefore,
m(s, p) = λ FPois(λ)(s − 2) + s(1 − FPois(λ)(s − 1)),
which can be evaluated quickly using library CDF routines. If the model
uses a Poisson distribution truncated to {0, …, d̄}, the same expression
applies with normalization (or, equivalently, with truncated CDFs),
still yielding fast evaluation.
These closed forms matter because any scarcity-aware planning objective must repeatedly compute expected truncated sales at many candidate prices and inventory levels. Closed-form or CDF-based expressions reduce this step to constant time per evaluation up to numerical CDF cost.
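A direct implementation of the Poisson expression above, using library CDF routines, could look as follows (the function name is ours; the F(−1) = 0 convention is handled by the CDF call):

```python
from scipy.stats import poisson

def expected_truncated_sales_poisson(s, lam):
    """m(s, p) = E[min(s, D)] for D ~ Pois(lam), via the CDF identity
    m = lam * F(s - 2) + s * (1 - F(s - 1)), with F(-1) := 0."""
    if s <= 0:
        return 0.0
    return lam * poisson.cdf(s - 2, lam) + s * poisson.sf(s - 1, lam)
```

A quick Monte Carlo comparison, e.g. against np.mean(np.minimum(s, rng.poisson(lam, 10**6))), is a useful sanity check on such identities.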
For more general families Pm, or when the mean
parameterization does not yield a convenient CDF identity, m(s, p) may not
admit a simple analytic form. In such cases, computation is still
feasible because D(p)
has bounded support {0, …, d̄}.
One can compute
$$
m(s,p)\;=\;\sum_{k=0}^{\bar d} \min\{s,k\}\,\mathbb{P}_{h(p)}(D=k),
$$
which is O(d̄) per
evaluation, or accelerate it by (i) precomputing g(s, m) on a grid
of m values and interpolating,
(ii) caching probabilities for repeated means, or (iii) using
one-dimensional numerical quadrature when m enters through a smooth link. The
broader point is that the operational structure reduces demand
dependence to a scalar index m = h(p), so even
``numerical’’ evaluation remains one-dimensional and typically stable.
This motivates the algorithmic design choice we make later: rather than
approximating the full value function over (s, t), we target planning
objectives that can be evaluated with repeated calls to m(s, p), thereby
shifting the computational burden onto a primitive that is either closed
form (in favorable families) or amenable to fast numerical
treatment.
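As a sketch of acceleration strategy (i) above, one can tabulate g(s, m) on a grid of mean values once and interpolate at run time; the helper below assumes the family's pmf is available as a function of its mean (all names are illustrative).

```python
import numpy as np

def make_truncated_sales_table(pmf_given_mean, means, inventories, d_bar):
    """Precompute g(s, m) = E_{P_m}[min(s, D)] for a one-parameter family with
    support {0, ..., d_bar}; interpolate in m at run time via np.interp."""
    ks = np.arange(d_bar + 1)
    table = np.empty((len(inventories), len(means)))
    for j, m in enumerate(means):
        pmf = pmf_given_mean(m)                      # length d_bar + 1
        for i, s in enumerate(inventories):
            table[i, j] = np.sum(np.minimum(s, ks) * pmf)   # O(d_bar) per entry
    return table     # row i: g(inventories[i], .) evaluated on the grid of means
```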
The Bellman benchmark in Section 3 makes clear why finite-inventory pricing is computationally demanding: even when the demand curve h(⋅) is known, dynamic optimality requires evaluating continuation values across many inventory states. In practice, sellers often rely on rolling-horizon heuristics that re-optimize locally as inventory evolves, and on statistical surrogates (e.g., parametric demand curves) that make such re-optimization fast. Our algorithm formalizes this logic in a nonparametric learning setting: we learn h(p) using a Gaussian process (GP) and, rather than approximating the full value function, we plan with an L-step rollout objective augmented with an upper-confidence exploration bonus. The resulting policy is simple to implement, naturally inventory-aware, and admits regret and runtime guarantees in the next section.
At the start of season n,
we maintain a GP posterior over the unknown mean function h : [pℓ, ph] → [0, d̄]
based on the accumulated dataset from prior seasons,
𝒟n − 1 = {(pi, qi)}i = 1(n − 1)T,
where qi = min {si, D(pi)}
is realized sales. We write the posterior mean and standard deviation as
μn(⋅) and
σn(⋅).
Because h(p) is a
mean and hence bounded, an implementation may either (i) model an
unconstrained latent function and pass it through a squashing map into
[0, d̄] (e.g., scaled
logistic), or (ii) clip μn(p)
to [0, d̄] when computing the
planning objective. For analytical clarity we treat μn as a direct
estimator of h and use σn as a
pointwise uncertainty proxy.
A subtlety is stockout censoring: when qi < si we observe D(pi) = qi, while when qi = si we only learn D(pi) ≥ si. Our baseline implementation aligns with the regimes discussed in Section 3: either censoring is limited (so treating qi as a noisy proxy for D(pi) is accurate enough), or the demand family admits a censored likelihood so that GP updates can incorporate inequality information. When neither holds, one can still implement the algorithm with approximate inference (e.g., expectation propagation or Laplace approximations for censored observations), but the computational and statistical constants become application-specific.
Fix a season n and period
t, with observed inventory
st. For a
candidate price p, we evaluate
an L-step constant-price
objective under a mean function f (later instantiated as f = μn).
The key design choice is to summarize demand dynamics through the
expected truncated sales primitive m(s, p) = g(s, f(p))
and to propagate inventory forward using this expected drift.
Concretely, letting ℓt := min {L, T − t + 1},
define the deterministic recursion
$$
\tilde s_0 = s_t,\qquad
\tilde q_k(p) = g\big(\tilde s_k,\, f(p)\big),\qquad
\tilde s_{k+1} = \max\{0,\ \tilde s_k - \tilde q_k(p)\},\qquad
k = 0,\dots,\ell_t-1,
$$
and the associated rollout value
$$
J_f^L(s_t,t,p)\;:=\;\sum_{k=0}^{\ell_t-1} p\,\tilde q_k(p).
$$
Economically, JfL
captures the scarcity channel that is absent from myopic pricing: a low
price raises near-term expected sales but accelerates depletion, while a
high price preserves inventory for later periods. Computationally, JfL
is fast because it reduces planning to ℓt repeated
evaluations of g(⋅, ⋅) and
simple arithmetic; it does not require dynamic programming over all
inventory levels.
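A literal transcription of this recursion is short. The sketch below evaluates J_f^L given any truncated-sales primitive g(s, m) and any estimated mean curve f; names are illustrative.

```python
def rollout_value(f, g, s_t, t, p, L, T):
    """L-step constant-price rollout J_f^L(s_t, t, p).

    f : estimated mean demand curve (later the GP posterior mean mu_n).
    g : truncated-sales primitive, g(s, m) = E_{P_m}[min(s, D)].
    """
    ell = min(L, T - t + 1)               # remaining lookahead ell_t
    m = f(p)                              # demand enters only through the scalar mean
    s_tilde, value = float(s_t), 0.0
    for _ in range(ell):
        q = g(s_tilde, m)                 # expected truncated sales this step
        value += p * q
        s_tilde = max(0.0, s_tilde - q)   # deterministic expected-drift update
    return value
```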
The constant-price restriction is deliberate. It matches a common managerial heuristic (choose a price and hold it for a short window), and it isolates the intertemporal tradeoff arising from inventory without introducing a high-dimensional search over price paths. This restriction does introduce approximation error relative to the fully optimal policy, which we quantify later via the planning bias term ΔL.
To learn h while earning
revenue, we adopt an optimistic action rule in the spirit of
upper-confidence bandits and Bayesian optimization. At state (st, t)
in season n, we select
$$
\hat p_{n,t}\;\in\;\arg\max_{p\in\mathcal{P}_{n,t}}\Big\{\,J^L_{\mu_n}(s_t,t,p)\;+\;\kappa_{n,t}\,\sigma_n(p)\,\Big\},
$$
where κn, t > 0
controls exploration and 𝒫n, t ⊆ [pℓ, ph]
is the admissible price set (discussed below). The exploration bonus
depends on σn(p)
rather than on a full uncertainty propagation through the inventory
recursion. This is a pragmatic compromise: it retains the essential
incentive to probe uncertain prices while keeping the scoring function
one-dimensional and cheap to evaluate. Intuitively, if a price p is uncertain, then the
corresponding mean demand h(p) is uncertain, and so
is the truncated sales primitive g(s, h(p))
at each step of the rollout; the bonus is a tractable proxy for this
multi-step uncertainty.
The exploration schedule κn, t can be chosen in standard ways (e.g., increasing slowly with the total number of observations to deliver high-probability confidence bands) or tuned empirically to stabilize early-season experimentation. In applications, we often dampen exploration late within a season because the value of information is lower when little horizon remains or when inventory is nearly depleted; such dampening can be implemented via a decreasing function of t or of st.
Because the action space is one-dimensional, RH-GPBO can optimize over prices either by discretization or by continuous search.
Choose a grid 𝒫 = {p(1), …, p(P)} covering [pℓ, ph] and set 𝒫n, t = 𝒫. Grid search is robust to nonconvexity of the rollout-UCB objective and yields a transparent runtime of O(Pℓt) per decision (plus the cost of evaluating μn and σn). The tradeoff is discretization error, which appears later as Disc(P).
Alternatively, we may treat p ↦ JμnL(st, t, p) + κn, tσn(p) as a black-box one-dimensional function and apply multi-start local optimization. This avoids discretization but can be sensitive to multimodality induced by the inventory recursion (especially for small inventories where truncation creates kinks).
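To make the grid-search variant concrete, a minimal sketch of the selection step is given below. It assumes a fitted regressor exposing predictive means and standard deviations (for example scikit-learn's GaussianProcessRegressor); the helper names and the clipping of the mean are implementation choices of ours.

```python
import numpy as np

def select_price(gp, g, s_t, t, price_grid, L, T, kappa):
    """Rollout-UCB price choice: maximize J_{mu_n}^L(s_t, t, p) + kappa * sigma_n(p)
    over a finite grid of candidate prices."""
    X = np.asarray(price_grid, dtype=float).reshape(-1, 1)
    mu, sigma = gp.predict(X, return_std=True)
    mu = np.clip(mu, 0.0, None)                  # h(p) is a mean demand, hence >= 0
    scores = []
    for p, m, sd in zip(price_grid, mu, sigma):
        ell = min(L, T - t + 1)
        s_tilde, val = float(s_t), 0.0
        for _ in range(ell):                     # constant-price expected-drift rollout
            q = g(s_tilde, m)
            val += p * q
            s_tilde = max(0.0, s_tilde - q)
        scores.append(val + kappa * sd)          # optimism via the posterior std
    return price_grid[int(np.argmax(scores))]
```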
Many sellers face constraints such as limited price revisions or
bounded price changes. RH-GPBO accommodates these by restricting 𝒫n, t. For
example, with a step-size constraint |pt − pt − 1| ≤ δ,
we set
𝒫n, t = [pℓ, ph] ∩ [pn, t − 1 − δ, pn, t − 1 + δ],
and maximize the rollout-UCB score over this interval (or its
discretization). More complex constraints, such as at most K price changes per season, can be
handled by augmenting the state with a counter; we view this as a
straightforward extension that increases planning complexity but
preserves the learning component unchanged.
A naïve GP update over all past observations becomes expensive as n grows. To keep updates tractable, we can reduce the effective training set size to M by bucketing prices and aggregating observations within each bucket. Concretely, partition [pℓ, ph] into M bins and maintain per-bin sufficient statistics (e.g., count, total sales, and optionally an indicator of censoring intensity). We then fit the GP to the M representative price points with aggregated responses (or to inducing points in a sparse GP), yielding an update cost on the order of O(M3) rather than cubic in the raw sample size. This approximation reflects a practical regularity: when h is smooth, very fine distinctions among nearby prices add limited incremental information relative to their computational cost.
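One lightweight implementation of this bucketing idea, using scikit-learn and ignoring censoring and the differing bucket counts (heteroskedasticity) for simplicity, is sketched below; the kernel hyperparameters and the bucket count M are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_bucketed_gp(prices, sales, p_l, p_h, M=30):
    """Fit a GP to M price buckets instead of all raw observations, a cheap
    stand-in for inducing-point methods: the cubic cost becomes O(M^3)."""
    edges = np.linspace(p_l, p_h, M + 1)
    idx = np.clip(np.digitize(prices, edges) - 1, 0, M - 1)
    mids, resp = [], []
    for b in range(M):
        mask = idx == b
        if mask.any():
            mids.append(0.5 * (edges[b] + edges[b + 1]))     # bucket representative
            resp.append(np.mean(np.asarray(sales)[mask]))    # aggregated response
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(np.asarray(mids).reshape(-1, 1), np.asarray(resp))
    return gp
```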
For clarity, the season-by-season procedure is: (i) at the start of season n, update the GP posterior (μn, σn) from the accumulated dataset 𝒟n − 1; (ii) in each period t = 1, …, T, observe (st, t) and select p̂n, t by maximizing the rollout-UCB score JμnL(st, t, p) + κn, tσn(p) over the admissible set 𝒫n, t; (iii) post the price, record realized sales qt = min {st, D(p̂n, t)}, and update inventory st + 1 = st − qt; (iv) at the end of the season, append the observed (pt, qt) pairs to the dataset and proceed to season n + 1.
This algorithm makes explicit the tradeoff we study in the next section: larger L improves within-season planning but raises computation and increases the exposure of the rollout value to demand-estimation error; richer GP updates improve learning but increase per-season overhead.
We now formalize the two claims that motivate RH-GPBO. First, the objective we optimize—the L-step constant-price rollout—is stable to errors in the estimated mean demand curve. Second, this stability converts standard nonparametric learning rates for Gaussian processes into cumulative regret guarantees, with an explicit separation between (i) the approximation error induced by a finite lookahead and constant-price restriction and (ii) the statistical error induced by learning h(⋅). We conclude with runtime and discretization guarantees that clarify when receding-horizon planning is preferable to full dynamic programming.
The rollout objective depends on h only through the scalar mean h(p) at the candidate price, propagated through the truncated-sales primitive g(s, m) = 𝔼Pm[min {s, D}]. Intuitively, if two mean functions f1, f2 are uniformly close, then at any fixed price p their implied expected sales are close in each step of the rollout, and the resulting revenue difference accumulates over at most L periods. The only subtlety is that the inventory trajectory under f1 and f2 can differ, which could in principle amplify differences. Our Lipschitz assumption on g rules out such amplification: changing the mean by ε changes expected truncated sales by at most ε, uniformly over inventories.
The bound is intentionally simple: the dependence on L is linear, reflecting the fact that we sum up to L revenue terms, each of magnitude at most ph times a one-period sales perturbation. This linear dependence is central for design: longer lookahead reduces planning bias (discussed below) but mechanically increases sensitivity to estimation error. We emphasize that Proposition 2 is agnostic to the demand family beyond the Lipschitz property of g, and hence applies equally to Poisson, Bernoulli, and other one-parameter families with well-behaved truncation.
We measure performance relative to the fully optimal finite-horizon policy that knows h and solves the Bellman recursion. Since RH-GPBO optimizes a restricted L-step constant-price objective, there are two distinct reasons it may lose revenue: (i) even under the true h, an L-step constant-price plan may be suboptimal relative to full dynamic programming; (ii) the algorithm does not know h and must learn it from data. Our first step is to separate these effects.
Let πRH, L
denote an oracle receding-horizon policy that, at each state (s, t), chooses a price
maximizing JhL(s, t, p)
under the true mean function. Define the associated approximation
gap
$$
\Delta_L\;:=\;\sup_{s,t}\Big\{\,V^*(s,t)\;-\;\sup_{p\in[p_\ell,p_h]} J^L_h(s,t,p)\,\Big\},
$$
which captures the worst-case loss from replacing the Bellman
continuation value with an L-step constant-price rollout. By
definition, ΔL depends only
on the primitives of the inventory problem and decreases as L grows (weakly, since larger L enlarges the set of feasible
rollouts).
Economically, Proposition 3 says that we pay NΔL for committing to short-horizon constant-price planning, and then an additional learning cost for not knowing which constant-price rollout is best under h. This decomposition is useful because it allows us to study planning (through ΔL) and learning (through posterior contraction) separately, and to understand how changing L shifts weight between them.
To bound the learning term, we relate action suboptimality to errors in the rollout objective induced by estimation error in h. The key is Proposition 2: if μn is close to h uniformly, then for every candidate price p the L-step rollout under μn is close to the rollout under h. In turn, an optimistic (UCB-style) maximization rule ensures that, with high probability, the algorithm does not persistently exclude high-value prices from consideration.
Under the smoothness conditions stated earlier (Hölder regularity and membership in the RKHS of the squared-exponential kernel), standard GP regression results yield a sup-norm contraction rate of the posterior mean.
The scaling highlights a second design tradeoff: the learning penalty is proportional to L. This is not an artifact of the proof technique; it reflects that longer rollouts amplify the effect of mis-estimating demand at a tested price. In environments with abundant historical data (large N) and stable demand, increasing L can be beneficial because ΔL shrinks while learning error is already small. Conversely, in rapidly changing or data-scarce settings, shorter lookahead can dominate by reducing exposure to estimation error.
Putting the pieces together yields the overall regret bound
$$
\mathrm{Regret}(N)\;\le\;N\,\Delta_L\;+\;\tilde O\!\big(p_h\,L\,T\,N^{1-\beta/(2\beta+1)}\big)\;+\;\mathrm{Disc}(P),
$$
where the discretization term appears when prices are optimized over a
grid (discussed next).
A central motivation for RH-GPBO is computational: full value iteration scales linearly in the inventory size C because it computes values for all states (s, t), whereas receding-horizon planning evaluates only the realized inventory trajectory. With a grid of P candidate prices, each decision requires computing JμnL(st, t, p) for each p; using the expected-sales recursion, this costs O(L) per price, and hence O(PL) per period. Over T periods, the within-season planning cost is O(TPL), plus the cost of GP updates.
Two practical implications follow. First, whenever L ≪ C, RH-GPBO can be asymptotically faster than value iteration, even before accounting for the fact that dynamic programming must be rerun as the demand estimate changes over seasons. Second, discretization is a controlled source of error in one dimension: increasing P reduces Disc(P) at a rate that is typically fast enough to make grid search attractive, especially when the rollout-UCB objective is nonconvex and derivative information is unreliable.
We close with a limitation that also points to an opportunity. Our analysis treats the exploration bonus as depending on σn(p) only, rather than propagating uncertainty through the inventory recursion. This choice is computationally attractive, but it may be conservative when the mapping m ↦ g(s, m) is strongly contracting (e.g., when inventory is frequently binding, so additional demand uncertainty does not translate into additional sales). Refining the bonus to reflect state-dependent attenuation is feasible, but would require a more elaborate analysis of uncertainty propagation; we leave this to future work.
Our theoretical bounds make the algorithmic tuning parameters economically interpretable: the lookahead horizon L controls how aggressively we internalize intertemporal scarcity, while the price-search resolution P controls how accurately we optimize the (estimated) rollout objective. Because these parameters affect statistical and computational performance, we view their selection as a design problem rather than a purely technical detail.
The overall guarantee has the schematic form
$$
\mathrm{Regret}(N)\;\lesssim\; N\,\Delta_L \;+\;
\underbrace{(p_h\,T)\,L\,N^{1-\beta/(2\beta+1)}}_{\text{learning term}}
\;+\; \mathrm{Disc}(P),
$$
which isolates a deterministic approximation component NΔL
from a stochastic learning component that scales linearly in L. This decomposition clarifies a
simple comparative static: increasing L weakly decreases ΔL (more
lookahead, less myopia) but increases the sensitivity of the objective
to demand-estimation error (more periods over which a mistaken mean is
propagated through the inventory recursion).
When ΔL
obeys a mild decay condition such as ΔL = O(T/L)—a
natural benchmark when the optimal price path is ``slowly varying’’ or
when the marginal value of inventory changes smoothly in time—the
preceding display suggests an explicit back-of-the-envelope choice.
Treating constants as fixed and writing a := β/(2β + 1),
we minimize
$$
N\cdot \frac{T}{L} \;+\; T\,L\,N^{1-a},
$$
yielding the scaling L⋆ ≍ Na/2
(capped by T). Operationally,
this says that as the seller accumulates seasons of data, it becomes
optimal to plan further ahead: early on, a short horizon hedges against
model error; later, once μn is accurate,
longer lookahead buys real intertemporal value by mitigating scarcity
misallocation.
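For completeness, the first-order condition behind this scaling (treating constants as fixed, as above) is
$$
\frac{d}{dL}\Big(\frac{NT}{L} + T\,L\,N^{1-a}\Big)
\;=\;-\frac{NT}{L^{2}} + T\,N^{1-a} \;=\; 0
\quad\Longrightarrow\quad
L_\star^{2} = N^{a},\qquad L_\star = N^{a/2},
$$
with the cap L⋆ ≤ T applied afterward.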
A second, more structural, comparative static comes from inventory tightness. If inventory is rarely binding (e.g., C large relative to typical total demand over T), then dynamic scarcity considerations are weak and L = 1 is often close to optimal even under full information: the problem approaches a sequence of independent static price choices. Conversely, when stockouts are common (small C or large T), scarcity rents are first-order, and the value of increasing L grows because the opportunity cost of selling a unit today rather than later is high. This logic is consistent with practice: fast-moving, scarce products (limited drops, capacity-constrained services) warrant deeper lookahead than high-capacity commodity settings.
A natural question is why we would ever prefer receding-horizon planning to ``doing the right thing’’ via value iteration on an estimated model. The answer is that with an estimated demand curve, full dynamic programming can magnify approximation error through two channels.
First, depth of error propagation: the Bellman recursion repeatedly applies a nonlinear maximization-and-expectation operator over T steps and across the entire inventory state space {0, 1, …, C}. If the model used inside the Bellman operator is misspecified or noisy, errors enter not only through immediate expected sales but also through continuation values, and can accumulate as we recurse. In contrast, RH-GPBO limits the propagation depth by design: the rollout objective only ``pushes’’ the demand estimate forward for L steps. This truncation is precisely what makes the Lipschitz stability bound linear in L, rather than in T.
Second, off-path evaluation: value iteration computes values for states that will never be visited under the realized trajectory. Under learning, these off-path values are nonetheless computed from the same uncertain model and can influence actions via maximization. The resulting policy may be sensitive to estimation artifacts in regions of the state space that are irrelevant ex post but influential ex ante. Receding-horizon planning largely avoids this issue because it evaluates (and re-evaluates) the objective only at the realized states along the season. Put differently, RH-GPBO is an ``on-policy’’ planner: it spends computation and statistical fidelity where the system actually goes.
These considerations suggest a practical rule: when the demand estimate is noisy, rapidly drifting, or structurally misspecified (e.g., the true arrival process is not in the chosen family Pm), a shorter lookahead can outperform full DP even if full DP is computationally feasible. Short lookahead effectively regularizes the planner by limiting the horizon over which model errors can distort intertemporal tradeoffs. This is the same economic logic behind robust or myopic heuristics in operations settings with unstable forecasts: committing too strongly to a long-run plan can be counterproductive if the plan is built on fragile predictions.
The grid size P determines the fineness of the price search when we optimize p ↦ JμnL(st, t, p) + κn, tσn(p). The key advantage of working in one dimension is that discretization is easy to control and easy to diagnose. If JhL is Lipschitz in p with constant KJ (uniformly over (s, t)), then a uniform grid yields error on the order of (ph − pℓ)KJ/P. This suggests two guidelines.
First, if computation is plentiful, it is almost always worthwhile to make P large enough that Disc(P) is negligible relative to the other terms; otherwise one risks paying regret for a purely numerical reason. Second, because KJ can vary with inventory tightness (scarcity can make revenue more sensitive to price), the ``right’’ P is larger in scarce regimes. In practice, we can implement this by using a coarse grid when st is large and a refined grid when st is small, or by using a multi-resolution scheme that refines around the current maximizer.
From an operational perspective, L and P are constrained by a per-period
planning budget. Since online planning cost scales as O(PL) per decision
(and O(TPL)
per season), one can view PL as the effective
``planning intensity.’’ If the system must reprice quickly (e.g., sub-second latency), we recommend fixing a budget ℬ and choosing (L, P) to satisfy PL ≤ ℬ, then allocating the budget based on the dominant source of error: (i) when data are scarce (early seasons, high posterior uncertainty), prioritize a slightly larger P with a smaller L, since it is more valuable to avoid missing the right current price than to plan far ahead with a noisy model; (ii) when data are abundant (late seasons, small σn), shift budget toward larger L, since the marginal value of deeper planning rises because estimation error is no longer the bottleneck. This ``annealing’’
of lookahead with learning is consistent with the scaling L⋆ ≍ Na/2
and gives an implementable tuning heuristic: increase L only when a suitable summary of
posterior uncertainty (e.g., suppσn(p))
falls below a threshold.
Finally, it is useful to interpret L not merely as a computational knob, but as a policy stance. Small L corresponds to frequent replanning with limited commitment, akin to managerial heuristics that emphasize near-term margins under uncertainty. Large L corresponds to committing to a scarcity-aware plan, akin to revenue-management systems that forecast and protect inventory for later demand. Our results formalize that neither stance is uniformly optimal: the right degree of foresight depends jointly on (i) how binding inventory is, (ii) how stable and learnable demand is, and (iii) the computational constraints of the pricing system. This framing guides the experiments that follow, where we vary (C, T), misspecification, and algorithmic components to show how these tradeoffs materialize in finite samples.
We design experiments to stress the two central tradeoffs highlighted by the theory: (i) intertemporal scarcity versus computational burden (controlled by (C, T) and by the lookahead L), and (ii) statistical learning versus model error (controlled by the smoothness/shape of h and by demand-family misspecification). Our evaluation emphasizes performance—season-level revenue and regret growth—while also reporting wall-clock planning time, since RH-GPBO is motivated as a computationally disciplined alternative to full dynamic programming with a learned model.
To generate meaningful scarcity incentives, we use demand curves for
which the instantaneous expected revenue p h(p) is single-peaked (unimodal) on [pℓ, ph],
so that overly aggressive pricing destroys volume while overly low
pricing wastes scarcity rents. Concretely, we consider smooth unimodal
mean functions such as
$$
h(p) \;=\; \alpha \exp\!\Big(-\tfrac{(p-p_0)^2}{2\ell^2}\Big)
\qquad\text{and}\qquad
h(p) \;=\; \frac{\alpha}{1+\exp(a(p-b))},
$$
with parameters chosen so that total expected demand over a season is
comparable to C in the ``scarce’’ regimes and well below C in the ``abundant’’
regimes. We simulate demand as either (i) Poisson arrivals with D(p) ∼ Pois(h(p))
(optionally truncated at d̄),
or (ii) Bernoulli purchases with D(p) ∼ Bernoulli(h(p)),
which creates a qualitatively different noise structure and a more
severe information bottleneck under censoring. Prices lie in a fixed
interval [pℓ, ph]
and are either optimized over a uniform grid of size P or via a one-dimensional optimizer
applied to the acquisition function p ↦ JμnL(st, t, p) + κn, tσn(p).
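For concreteness, the two mean-demand specifications and the two noise families can be instantiated as follows; the parameter values shown are illustrative rather than those used in the reported experiments.

```python
import numpy as np

def gaussian_bump_mean(p, alpha=8.0, p0=5.0, ell=1.5):
    """Unimodal mean demand h(p) = alpha * exp(-(p - p0)^2 / (2 ell^2))."""
    return alpha * np.exp(-((p - p0) ** 2) / (2.0 * ell ** 2))

def logit_mean(p, alpha=0.9, a=1.2, b=5.0):
    """Logistic mean demand h(p) = alpha / (1 + exp(a (p - b)))."""
    return alpha / (1.0 + np.exp(a * (p - b)))

def sample_demand(mean, family="poisson", rng=None):
    """Draw one demand realization given the mean h(p)."""
    rng = rng or np.random.default_rng()
    if family == "poisson":
        return rng.poisson(mean)
    return rng.binomial(1, min(1.0, mean))   # Bernoulli purchases need h(p) <= 1
```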
A key empirical question is whether the receding-horizon structure is robust when the within-period likelihood is wrong. To test this, we run a misspecified setting in which the demand is Bernoulli with a logit mean h(p), but the algorithm updates the GP using a Poisson likelihood (or, in a simpler variant, using a Gaussian observation model on qt). This isolates the role of the Lipschitz stability property of the rollout objective: even if the observation model is imperfect, the planner only uses the posterior mean and uncertainty as a plug-in surrogate, and the truncation to L steps limits how far model errors propagate through continuation values. In addition, we include a ``hard’’ scarcity case in which stockouts are common (small C relative to total demand), making censoring frequent; this is precisely where full-model methods tend to be brittle because they must infer counterfactual demand in censored regions.
We compare RH-GPBO to four classes of baselines, chosen to reflect standard approaches in the literature. All methods observe only (pt, qt) and inventory, i.e., they do not observe latent demand when qt = st; this keeps the information structure consistent.
We vary (C, T) along two axes: (i) inventory scale (increasing C with fixed T), and (ii) season length (increasing T with fixed C), while keeping expected total demand proportional to C so that scarcity remains economically meaningful. Two patterns emerge. First, per-season planning time for GP-Fin-Model-Based grows approximately linearly in C (and in T), consistent with the O(CTP) complexity of value iteration, while RH-GPBO grows primarily with TPL and is essentially insensitive to C beyond the realized trajectory. Second, the revenue gap between RH-GPBO and full DP is small at moderate L in the scarce regime, while the runtime gap becomes dramatic as C increases. This is exactly the regime where a learned DP is least attractive in practice: it pays the full state-space cost even though only a thin subset of states is visited on-policy.
Holding compute budget fixed, we observe the predicted tension: very small L can be too myopic under scarcity (it over-sells early at low prices), while very large L can be statistically fragile early in learning (it ``trusts’’ an inaccurate μn too far into the future). In cumulative regret, this generates an empirical inverted-U in performance as a function of L at small n, which gradually flattens and shifts toward larger L as seasons accumulate and posterior uncertainty contracts. Under misspecification, the sweet spot typically occurs at a smaller L than under correct specification, reinforcing the interpretation of short lookahead as a form of robustness regularization. Importantly, the best fixed L depends on scarcity: when inventory is abundant, L = 1 or 2 is nearly optimal and additional planning yields little; when inventory is tight, moving from L = 1 to L = 5 or 10 often yields substantial revenue gains.
Although our theory assumes a squared-exponential kernel, we test robustness to kernel mismatch by comparing squared-exponential and Matérn kernels (with several smoothness parameters) while holding hyperparameter tuning fixed. In smooth environments, kernels perform similarly; in rougher environments (e.g., steeper logit slopes), Matérn kernels can reduce early-season bias and improve exploration calibration, but the dominant effect remains the choice of L rather than the exact kernel. This aligns with the conceptual decomposition: kernel choice affects how quickly μn approaches h, while L controls how strongly any remaining error is amplified through planning.
To test end-to-end feasibility, we implement GP updates with an effective training size M via simple bucketing of price observations (or, equivalently, inducing points on a fixed grid). As expected, reducing M lowers cubic update cost and improves wall-clock time, with a mild statistical penalty when buckets are too coarse. The key empirical point is that RH-GPBO remains effective even with aggressive compression because the control objective is one-dimensional in price and re-optimized each period; as long as μn(⋅) captures the coarse location of the revenue peak and σn(⋅) reflects uncertainty at the right scale, the receding-horizon planner can convert partial learning into near-term revenue.
Across scarce Poisson instances with correct specification, RH-GPBO with moderate L matches or slightly underperforms GP-Fin-Model-Based in final-season revenue while using substantially less computation, and it often dominates in early seasons due to reduced error propagation. Against BO-Fin-Heuristic, RH-GPBO yields consistent gains when scarcity is binding, precisely because it internalizes the option value of inventory. Against deep RL methods, two regimes stand out: with extensive offline simulator training, RL can approach strong performance but typically requires orders of magnitude more interactions than the seasonal budgets considered here; in the online-from-scratch regime, RL lags substantially in both revenue and stability. Under misspecification and censoring, RH-GPBO’s advantage over the model-based DP benchmark becomes more pronounced: full DP can overfit artifacts of the misspecified likelihood in off-path states, while RH-GPBO’s limited lookahead makes it less sensitive to these distortions.
Taken together, these experiments support the paper’s organizing claim: receding-horizon Bayesian optimization is not merely a computational shortcut, but a principled way to allocate statistical trust and computation to the economically relevant part of the dynamic pricing problem—the realized trajectory under scarcity.
Our baseline model isolates a clean triangle of forces—scarcity, learning, and computation—by assuming a one-dimensional price action, a stationary mean demand curve h(⋅) across seasons, and within-season independence of demands conditional on posted prices. Many applications violate at least one of these simplifications. In this section we sketch four extensions that are economically central: contextual demand, nonstationarity across seasons, operational constraints that limit price changes, and multi-product coupling. Our goal is not to fully re-prove the theory in each case, but to clarify (i) which components of the analysis are robust ``as is’’ (notably the rollout-based decomposition and the runtime advantage), and (ii) which steps require new ideas (typically those that hinge on one-dimensional GP contraction in sup norm, or on the tractability of the constant-price rollout objective).
In many settings demand depends on observed covariates xn, t
(day-of-week, marketing spend, competitor price signals, macro
indicators). A natural generalization replaces h(p) by h(p, x) with x ∈ 𝒳 observed before choosing pt. Demand
satisfies D(pt, xn, t) ∼ Ph(pt, xn, t),
so the truncated sales expectation becomes m(s, p, x) = g(s, h(p, x)).
RH-GPBO extends mechanically: at state (st, t, xn, t)
we maximize
p ↦ JμnL(st, t, p ; xn, t) + κn, t σn(p, xn, t),
where μn(⋅, ⋅) and
σn(⋅, ⋅)
are the GP posterior mean and standard deviation for the contextual mean
function. The computational story largely survives: for a fixed
candidate p, the rollout still
updates inventory via expected truncated sales, and can be evaluated in
O(L) time given (p, xn, t).
What changes is statistical. The sup-norm contraction rate driving Proposition 4 is intrinsically dimension-dependent: if we treat (p, x) as a (1 + d)-dimensional input, nonparametric learning slows sharply as d grows, and uniform bounds of the form ∥μn − h∥∞ may become too pessimistic to be informative at seasonal sample sizes. Practically, the extension is most compelling when we impose structure: additive kernels (e.g., k((p, x), (p′, x′)) = kp(p, p′) + ∑j = 1d kj(xj, xj′)), low-rank embeddings, or a partially linear form h(p, x) = h0(p) + ⟨θ, ϕ(x)⟩. Under such restrictions, the Lipschitz stability of the rollout objective (our Proposition 2) remains the main bridge from statistical error to control loss; the difference is that one typically bounds an on-path error (only at realized contexts) rather than a worst-case uniform error over 𝒳. A related subtlety is that censoring becomes more severe: stockouts may occur systematically at certain contexts (e.g., weekend peaks), and the GP update must correctly handle missing counterfactual demand. This pushes the analysis toward censored-likelihood GP regression or conservative estimators that remain calibrated under truncation.
Season-to-season drift is the norm in retail and travel: the mapping $p \mapsto h(p)$ changes with consumer tastes, competitor actions, or product aging. A simple model is $h_n(p)$ varying with n, perhaps via a slow drift constraint $\|h_{n+1} - h_n\|_\infty \le \rho$ or a piecewise-stationary change-point process. RH-GPBO can be adapted by replacing the ``all-history’’ GP fit with a forgetting mechanism: a sliding window, exponential discounting, or a state-space GP (e.g., $h_{n+1}(p) = h_n(p) + \varepsilon_n(p)$ with $\varepsilon_n$ drawn from a GP). The control rule is unchanged in form; only the posterior $(\mu_n, \sigma_n)$ reflects the chosen temporal prior.
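As one concrete instance of such a forgetting mechanism, the sketch below implements exponential discounting by inflating the observation-noise variance attached to older seasons’ data, so the GP posterior down-weights stale observations; the discount factor `gamma`, the noise level, and the use of scikit-learn’s per-observation `alpha` are illustrative choices, and a sliding window or state-space GP would be drop-in alternatives.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def fit_discounted_gp(prices, demands, season_idx, current_season,
                      gamma=0.8, base_noise=0.25):
    """Exponential forgetting: an observation from season n receives noise variance
    base_noise / gamma**(current_season - n), so older data carry less weight."""
    age = current_season - np.asarray(season_idx)
    per_obs_noise = base_noise / (gamma ** age)   # older -> larger noise -> less influence
    gp = GaussianProcessRegressor(
        kernel=ConstantKernel(1.0) * RBF(length_scale=1.0),
        alpha=per_obs_noise,                      # per-observation noise variances
        normalize_y=True,
    )
    gp.fit(np.asarray(prices).reshape(-1, 1), np.asarray(demands))
    return gp   # gp.predict(grid, return_std=True) yields (mu_n, sigma_n) on a price grid
```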
The theoretical regret benchmark, however, must be reinterpreted. If the environment changes, regret relative to a fixed $V^*(C, 1)$ is no longer meaningful; one instead considers regret relative to the season-specific oracle $V_n^*(C, 1)$ induced by $h_n$. Our decomposition in Proposition~3 still provides a useful scaffold—a planning term plus a learning term—but the learning term acquires an additional drift component because even a perfect estimator of $h_{n-1}$ can be wrong for season n. Moreover, the approximation bias $\Delta_L$ can improve under drift: shorter lookahead can become an advantage when long-horizon planning relies on a model that will change before the horizon fully plays out (a form of robustness akin to model-predictive control). Formalizing this requires coupling arguments that track both estimation error and temporal variation, and it typically yields bounds scaling with a variation budget such as $\sum_{n=1}^{N-1}\|h_{n+1}-h_n\|_\infty$.
Many firms cannot freely re-optimize prices each period. Constraints take two common forms: (i) a hard limit on the number of price changes per season, or on the magnitude $|p_t - p_{t-1}|$, and (ii) a switching cost $\eta\,|p_t - p_{t-1}|$ (or a fixed menu cost) subtracted from revenue. These constraints introduce a new state variable, typically the previous price (and possibly a remaining-change budget). The fully optimal DP becomes substantially larger: the state space expands from $(s, t)$ to $(s, t, p_{t-1}, b)$.
RH-GPBO can still be deployed by expanding the receding-horizon rollout to include the additional state and restricting the maximization to feasible $p$. For instance, with a Lipschitz price-movement constraint $|p - p_{t-1}| \le \delta$, we choose
$$
p_t \in \arg\max_{p:\,|p - p_{t-1}| \le \delta}\Big\{ J^L_{\mu_n}(s_t, t, p\,;\,p_{t-1}) + \kappa_{n,t}\,\sigma_n(p)\Big\},
$$
where the rollout accounts for switching costs within the L-step objective if present. The runtime advantage can remain strong because we still avoid enumerating all inventory levels; however, the constant-price simplification becomes less natural when frequent price changes are disallowed. One remedy is to use a rollout with a small number of segments (e.g., two prices over L steps), which increases computation but remains far cheaper than full DP when C is large.
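A minimal sketch of this constrained step follows, assuming the rollout value is available as a callable (for instance the constant-price rollout sketched earlier) and charging an optional linear switching cost against the lookahead objective; all names are illustrative.

```python
import numpy as np

def constrained_price(s_t, p_prev, price_grid, rollout_value, sigma_n,
                      kappa, delta, eta=0.0):
    """Receding-horizon step under a price-movement constraint |p - p_prev| <= delta,
    with an optional switching cost eta * |p - p_prev| charged against the objective.
    `rollout_value(s_t, p)` is the L-step constant-price rollout under the GP mean."""
    price_grid = np.asarray(price_grid)
    feasible = price_grid[np.abs(price_grid - p_prev) <= delta]
    if feasible.size == 0:                      # degenerate case: keep the current price
        return p_prev
    scores = [rollout_value(s_t, p) - eta * abs(p - p_prev) + kappa * sigma_n(p)
              for p in feasible]
    return feasible[int(np.argmax(scores))]
```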
On the theory side, Proposition~2 extends if the truncated-sales map remains Lipschitz in the mean and if the rollout value is evaluated over a fixed, low-dimensional policy class (constant price or few segments). What breaks is any argument that relies on the constant-price rollout being a good proxy for the optimal continuation value; $\Delta_L$ can be larger when the true optimal policy would adjust prices but is prevented from doing so by constraints. In that case the right comparator is the optimal constrained policy, and $\Delta_L$ should be redefined accordingly.
The most consequential extension is to multiple products $i = 1, \dots, I$ with inventories $s_t^i$ and prices $p_t^i$. Coupling arises through (a) shared resources (a single capacity constraint, shared labor, or a common inventory pool), (b) demand substitution/complementarity (demand for i depends on the full price vector $p_t$), and (c) joint customer arrivals (one shopper chooses among products). In such models, the state becomes vector-valued and full DP is quickly infeasible.
The receding-horizon philosophy remains attractive precisely because it scales with the trajectory rather than the full state lattice. A direct extension defines a rollout objective over the price vector, e.g.,
$$
J_f^L(s,t,p) \;=\; \mathbb{E}_f\Big[\sum_{k=0}^{L-1}\sum_{i=1}^I
p^i\,\min\{s_k^i, D_k^i(p)\}\Big],
$$
with inventory recursion componentwise (or via a shared constraint). Yet two new difficulties appear. First, the optimization over $p \in [p_\ell, p_h]^I$ is no longer one-dimensional, so the ``P-point grid’’ approach becomes exponentially costly and the acquisition landscape can be highly nonconvex. Second, the learning problem becomes substantially harder: estimating a vector-valued mean function $h(p)$ over $\mathbb{R}^I$ suffers from the curse of dimensionality unless one imposes strong structure (e.g., separability, low-rank demand systems, or parametric cross-effects).
What can be retained is the structure of the regret decomposition: approximation error from a restricted planning class plus learning error controlled by stability of the rollout value. If we can show a multivariate analogue of Proposition~2 (e.g., Lipschitzness of expected truncated sales in the mean vector under a suitable norm), then any bound on $\|\mu_n - h\|$ implies a bound on rollout value error, and hence on the loss from planning with $\mu_n$ rather than h. What likely fails without additional assumptions is a clean, dimension-free posterior contraction rate and a practical global optimizer for the acquisition function. From an economic-design perspective, this suggests a fruitful compromise: keep RH-GPBO as the planner, but feed it low-dimensional controls (e.g., a common multiplier that shifts a base price vector, or a small number of price ``buckets’’), thereby preserving computational tractability while capturing the most important cross-product scarcity tradeoffs.
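To illustrate the low-dimensional-control compromise, the sketch below evaluates a multi-product rollout in which the entire price vector is a scalar multiplier theta times a fixed base price vector, so the acquisition search stays one-dimensional; independent Poisson demands per product and the names `mu_n` and `base_prices` are assumptions made for the example.

```python
import numpy as np
from scipy.stats import poisson

def expected_truncated_sales(s, lam):
    """E[min(s, D)] for D ~ Poisson(lam); s may be fractional along the rollout."""
    if s <= 0:
        return 0.0
    d = np.arange(int(np.floor(s)) + 1)
    return s - np.sum((s - d) * poisson.pmf(d, lam))

def multiplier_rollout(s_t, theta, base_prices, mu_n, L):
    """Multi-product L-step rollout with prices p = theta * base_prices.
    mu_n(i, p) is the assumed posterior mean demand of product i at price vector p."""
    s = np.asarray(s_t, dtype=float).copy()
    p = theta * np.asarray(base_prices)
    value = 0.0
    for _ in range(L):
        for i in range(len(p)):
            sales = expected_truncated_sales(s[i], mu_n(i, p))
            value += p[i] * sales
            s[i] -= sales
    return value

# One-dimensional search over the common multiplier instead of over [p_l, p_h]^I:
# thetas = np.linspace(0.8, 1.2, 41)
# best = thetas[np.argmax([multiplier_rollout(s_t, th, base_prices, mu_n, L=7)
#                          for th in thetas])]
```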
Across these extensions, the modular components are (i) the rollout-based planner, which can incorporate additional state variables and constraints with limited incremental cost, and (ii) the stability logic that converts demand estimation error into bounded planning loss. The non-modular components are those that depend on one-dimensional smooth-function learning (uniform GP contraction) and on constant-price rollouts being a close surrogate for the true continuation value. These limitations are not merely technical: they identify where economic structure—low-dimensional sufficient statistics, slowly varying environments, or restricted adjustment policies—is essential for receding-horizon Bayesian optimization to deliver both statistical and computational leverage.
We study a finite-inventory pricing problem in which the seller must simultaneously (i) allocate scarce units over time, (ii) learn an unknown demand curve from censored sales data, and (iii) do so under tight computational budgets when the inventory state space is large. The economic tension is straightforward: prices are both allocative (they ration inventory intertemporally) and informative (they generate information about demand). What is less straightforward is how to operationalize this tension when the exact dynamic program is too costly to solve repeatedly as beliefs evolve. Our central message is that a receding-horizon viewpoint provides a tractable middle ground: it internalizes scarcity through forward simulation, while avoiding the global backward induction that makes full value iteration expensive in C.
The policy we propose, RH-GPBO, combines two ingredients. First, we plan with an L-step constant-price rollout objective $J_f^L(s,t,p)$, which can be evaluated quickly along the realized inventory trajectory using the expected truncated-sales map $m(s,p) = \mathbb{E}[\min\{s, D(p)\}]$. Second, we learn the mean demand function $h(\cdot)$ using a Gaussian process posterior and choose prices by an optimism principle,
$$
p_t \in \arg\max_{p \in [p_\ell, p_h]}\Big\{ J^L_{\mu_n}(s_t, t, p) + \kappa_{n,t}\,\sigma_n(p)\Big\}.
$$
Economically, the rollout is the mechanism by which the policy ``prices the shadow cost of inventory’’ without explicitly computing a value function for every inventory level; statistically, the exploration bonus is the mechanism by which the policy trades off immediate revenue against reducing uncertainty about h where it matters for pricing.
Two takeaways follow from this separation. The first is computational: evaluating an L-step rollout is linear in L, so a P-candidate search yields per-season planning cost on the order of TPL (plus GP updates). This replaces the CTP dependence of value iteration with an L dependence; the improvement is most pronounced precisely in the regimes that motivate finite-inventory pricing—large C and moderate T—where full DP is burdensome even before one accounts for repeated re-solving under learning. The second is conceptual: the regret analysis becomes modular. We decompose performance loss into an approximation term, $\Delta_L$, induced by restricting attention to the L-step constant-price class, and a learning term, controlled by how well $\mu_n$ approximates h and by the stability of the rollout objective to mean-demand perturbations. This decomposition clarifies the role of L: increasing L reduces approximation bias but increases both computation and the sensitivity of the objective to estimation error (because the rollout aggregates demand effects over more future steps). In practice, this predicts an interior optimum: very short lookahead can be myopic when inventory is tight, while very long lookahead can overfit early-season uncertainty and can be computationally unnecessary.
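For a rough sense of magnitudes, the back-of-the-envelope comparison below uses illustrative values (C = 500, T = 200, P = 30, L = 7) rather than parameters from our experiments; the gap widens further if the DP must be re-solved whenever the posterior changes.

```python
# Illustrative operation counts only (not experiment parameters).
C, T, P, L = 500, 200, 30, 7

full_dp_sweep   = C * T * P   # one backward-induction sweep over (s, t) x candidate prices
rh_gpbo_season  = T * P * L   # one L-step rollout per candidate, per period, per season

print(full_dp_sweep)   # 3,000,000 Bellman evaluations per re-solve
print(rh_gpbo_season)  # 42,000 rollout steps per season (plus GP updates)
```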
At the same time, our analysis highlights three open problems that are not merely technical; they correspond to economically meaningful limits of the current approach.
Our regret bound contains the approximation term $N\Delta_L$, where
$$
\Delta_L := \sup_{s,t}\Big\{ V^*(s,t) - \sup_{p} J_h^L(s,t,p)\Big\}.
$$
This quantity captures how costly it is to compress a genuinely dynamic pricing problem into an L-step constant-price proxy. While it is natural that $\Delta_L$ decreases in L, useful decision guidance requires more: we want conditions under which $\Delta_L$ is small at economically meaningful horizons (say L on the order of a week in retail, rather than $L \approx T$).
One promising direction is to relate $\Delta_L$ to the temporal variability of the optimal price path and to the curvature of the revenue function $p \mapsto p\, m(s,p)$. Intuitively, constant-price planning is accurate when the shadow value of inventory evolves slowly over time or when the optimal policy is nearly stationary across adjacent periods. Formalizing this could proceed by showing that the optimal policy is Lipschitz in the state (s, t) under regularity conditions on h and the demand family, and then bounding the benefit of re-optimizing the price within the L-window relative to holding it constant. Another direction is to develop worst-case lower bounds: even with smooth h, there may exist instances where any constant-price rollout is provably far from optimal unless L grows with T. Such lower bounds would sharpen the economic interpretation of L as a structural restriction rather than a mere computational knob.
Our learning analysis leverages GP regression tools that are most mature under Gaussian noise, and we rely on demand families where mean-based updates are tractable and truncated-sales expectations admit closed forms or fast approximations. In many pricing applications, however, the natural observation model is non-Gaussian: sales are counts, demand can be overdispersed, and stockouts censor the signal exactly when demand is high (and thus most informative about the upper tail). From a Bayesian perspective, the right object is a posterior over h induced by a censored Poisson/negative-binomial (or more general) likelihood, not a Gaussian observation model.
Two challenges arise. The first is computational: GP inference under non-Gaussian likelihoods typically requires Laplace approximations, expectation propagation, or variational methods, and censored observations further complicate the likelihood geometry. The second is theoretical: the exploration term $\kappa_{n,t}\,\sigma_n(p)$ is meaningful only to the extent that $\sigma_n$ is calibrated for the non-Gaussian posterior, and existing sup-norm contraction tools do not directly carry over. A clean path forward is to develop regret bounds that depend on posterior contraction under the correct likelihood, possibly in an on-policy sense that conditions on the realized inventory process and censoring events. More ambitiously, one could seek likelihood-robust bonuses—optimism terms that remain valid under a class of demand distributions sharing the same mean structure—which would align well with the economic motivation for the mean-indexed family $\{P_m\}$.
Many practical pricing problems are constrained beyond a simple interval $[p_\ell, p_h]$: prices may be discrete, changes may be limited by operational rules, promotions may require minimum durations, and firms may face explicit service-level constraints (e.g., controlling stockout probabilities) or implicit constraints (e.g., reputational costs of volatility). Receding-horizon planning is well suited to incorporate such constraints at the simulation layer—one can embed switching costs, adjustment budgets, or feasibility sets directly in the rollout. The more delicate issue is how to perform exploration under constraints in a way that preserves both computational tractability and statistical efficiency.
Standard Bayesian optimization offers constrained acquisition functions, but finite-inventory dynamics create an additional coupling: ``unsafe’’ exploratory prices can irreversibly deplete inventory early, reducing both revenue and future learning opportunities. This suggests that constraint-aware RH-BO should not only enforce pointwise feasibility (e.g., $|p_t - p_{t-1}| \le \delta$), but should also reason about constraints over the lookahead window, such as bounding the probability of hitting s = 0 before a target time. Developing principled bonuses and planning objectives that respect such dynamic constraints, while remaining fast to evaluate, is an open and practically important problem.
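One simple way to operationalize such a lookahead constraint is a Monte Carlo estimate of the stockout probability over the window, used to screen candidate prices before the usual acquisition maximization; the sketch below assumes Poisson demand at the plug-in posterior mean and is meant as an illustration, not the calibrated constraint-aware bonus design left open above.

```python
import numpy as np

def stockout_prob(s_t, p, lam_fn, L, horizon_target, n_sims=500, rng=None):
    """Monte Carlo estimate of P(inventory hits 0 within `horizon_target` steps)
    when price p is held over the L-step lookahead; lam_fn(p) is the plug-in
    posterior mean demand rate (a fully Bayesian version would also sample h)."""
    rng = np.random.default_rng() if rng is None else rng
    steps = min(L, horizon_target)
    draws = rng.poisson(lam_fn(p), size=(n_sims, steps))
    remaining = s_t - np.cumsum(draws, axis=1)
    return float(np.mean((remaining <= 0).any(axis=1)))

# Screen the candidate grid before maximizing the acquisition, e.g. keep only
# prices whose estimated stockout probability over the window is below 5%:
# safe = [p for p in price_grid if stockout_prob(s_t, p, mu_n, L, tau) <= 0.05]
```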
Stepping back, the broader contribution of our approach is to make explicit how three resources—inventory, information, and computation—trade off in finite-horizon pricing. The rollout planner converts computation into a forward-looking shadow value of inventory; the GP layer converts data into a calibrated uncertainty set over demand; and the regret decomposition makes transparent where performance is lost: either because the planning class is restricted ($\Delta_L$) or because the model is not yet learned. This perspective suggests a practical workflow: choose L to match the time scale over which inventory scarcity meaningfully changes, discretize prices only as finely as needed to make Disc(P) negligible relative to learning error, and allocate modeling effort to demand likelihoods and uncertainty quantification precisely when censoring and non-Gaussianity are first-order.
The remaining agenda is clear. If we can (i) characterize when short-horizon rollouts are nearly optimal, (ii) build GP-like uncertainty quantification for censored count data with theoretical guarantees, and (iii) integrate realistic operational constraints into the exploration–planning loop, then receding-horizon Bayesian optimization can become not just a computational convenience, but a principled design paradigm for dynamic pricing under scarcity.