Reference prices have become a central design choice of digital
commerce platforms. By 2026, it is routine for product pages to display
a statistic such as a ``usual price,'' ``typical price,'' or
``recent median,'' often accompanied by a badge (e.g., ``Deal''
or ``Great deal'') when the posted price is sufficiently low relative to
that reference. These signals are not mere descriptive add-ons: they
shape how consumers interpret a given price, they influence which offers
consumers click and buy, and they discipline (or distort) sellers’
incentives when setting prices in the shadow of the platform’s displayed
benchmark. At the same time, the platform’s reference statistic is
itself a function of sellers’ past prices and sales, so reference
display and pricing behavior form a feedback loop that is both
economically meaningful and policy-relevant.
This paper studies that feedback loop through the lens of reference-price memory. A reference statistic is, in effect, a memory device: it compresses a history of past prices into a single number that consumers can easily process at the moment of purchase. Whether that memory is short (dominated by the last price or two) or long (reflecting an extended history) is consequential. With short memory, sellers who raise prices today can quickly ``reset'' the anchor and then discount tomorrow to appear generous; with long memory, such manipulation is harder because the reference remains tied to earlier prices for longer. Conversely, long memory can amplify incentives for intertemporal pricing strategies that are innocuous in a static setting but potent once the future reference depends on today’s price. In the extreme, a seller may rationally choose an initially high price to elevate the future reference, followed by a sequence of markdowns that exploit consumer sensitivity to perceived gains.
Our motivating premise is that platforms, perhaps unintentionally, choose an implicit intertemporal policy when they choose a reference algorithm. In practice, reference statistics are computed by proprietary rules that may resemble an unweighted window average of posted prices, an exponentially smoothed moving average, or a purchase-weighted average that discounts prices that did not transact. These engineering choices map directly into economic primitives: they determine the effective lag weights placed on past prices and thus the horizon over which sellers can influence consumer perceptions. They also interact with deal labels, which discretize the reference comparison into salient categories. Small changes in the platform rule—for instance, extending the window length or switching from posted-price averaging to sales-weighting—can meaningfully change equilibrium prices, price dispersion, and consumer surplus.
We use panel data from an e-commerce platform that displays a reference statistic Rit and, in some contexts, a deal label Lit. The unit of observation is a product (or SKU) i over time t, with observed posted prices pit, impressions mit, and purchases yit. Our empirical strategy leverages two features that are increasingly common in platform settings. First, the reference statistic is displayed and can be treated as an outcome of a reference-formation algorithm applied to past prices (and possibly past sales). Second, the label rule often has a threshold form, e.g., a badge appears when pit ≤ (1 − τ)Rit, creating discontinuities that we can use to separate the effect of the discrete label from smooth variation in the underlying price–reference gap.
We begin with a parsimonious demand specification that captures a
central behavioral regularity: consumers respond differently to being
``below'' versus ``above'' a reference point. Formally, we model
purchase probability (or conversion) as a known parametric function of
the current posted price and the displayed reference statistic, with
piecewise-linear gain/loss terms. This captures the idea that a discount
relative to the reference may increase demand by more than an
equal-sized markup decreases it (or vice versa), and it nests the
possibility of loss aversion, η− > η+.
We emphasize tractability: the specification is simple enough to
estimate on large-scale panel data and to use for counterfactual policy
evaluation, while still encoding the asymmetry that motivates
reference-price displays in the first place.
On the supply side, we interpret sellers as choosing posted prices pit, taking the platform-displayed signals (Rit, Lit) as given at the moment of pricing. This yields a natural baseline equilibrium notion: a myopic best response in each period, with the platform updating the reference statistic mechanically from observed history. We also consider a dynamic benchmark in which sellers internalize the effect of current prices on future references—a channel that becomes salient precisely when the platform’s reference has long memory. The dynamic benchmark is not meant to be a fully general model of seller behavior; rather, it provides a disciplined way to translate estimated demand and reference parameters into predictions about dynamic pricing patterns and the welfare consequences of alternative platform rules.
The main conceptual challenge is identification: we must separately
learn (i) how the platform constructs Rit
from past data, and (ii) how consumers respond to (pit, Rit, Lit),
despite the fact that sellers may set prices strategically in response
to the reference and the label. Our approach is to use the platform’s
own displayed reference as a measurement of the underlying state, and to
exploit policy variation in the mapping from history to display. When
Rit is
observed, a broad class of reference-formation models becomes a
regression problem: if the platform uses a finite-lag kernel on posted
prices,
$$
R_{it}=\sum_{k=1}^K \omega_k p_{i,t-k},
$$
then the weights ω = (ω1, …, ωK)
are identified under standard rank conditions on the lagged price
process. This immediately delivers a notion of ``memory length’’ that is
empirically interpretable (e.g., a half-life under geometric decay) and
that can be compared across candidate models. In particular, exponential
smoothing (ESM) corresponds to geometric weights, ωk ∝ (1 − α)α^(k−1),
whereas ARM-style rules are closer to relatively flat weights over a
longer window. Because these models are nested via restrictions on ω, we can both estimate the implied
memory and test whether ESM-style decay is consistent with observed
references.
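As a concrete sketch of this regression view, the following code simulates a noiseless geometric kernel, recovers the weights by ordinary least squares, and converts them into a half-life. All parameters and the simulated price process are hypothetical; the platform's actual kernel and our estimates are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ESM-style kernel: geometric weights over K lags.
K, alpha = 10, 0.8
omega_true = (1 - alpha) * alpha ** np.arange(K)      # omega_k = (1-alpha) * alpha^(k-1)

# Simulated posted-price path with enough independent variation
# for the rank condition on lagged prices to hold.
T = 500
p = 50 + 0.1 * rng.normal(0, 5, T).cumsum() + rng.normal(0, 2, T)

# Lag matrix: row t holds (p_{t-1}, ..., p_{t-K}); the displayed
# reference is R_t = sum_k omega_k * p_{t-k}.
lags = np.array([[p[t - k] for k in range(1, K + 1)] for t in range(K, T)])
R = lags @ omega_true

# With R observed, the weights are identified by least squares.
omega_hat = np.linalg.lstsq(lags, R, rcond=None)[0]

# Implied memory under geometric decay: half-life = log(1/2) / log(alpha).
alpha_hat = omega_hat[1] / omega_hat[0]
half_life = np.log(0.5) / np.log(alpha_hat)
```

Flat (ARM-like) versus geometric (ESM-like) shapes of `omega_hat` can then be compared directly, since both are restrictions on the same weight vector.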
Demand estimation then uses both continuous and discrete variation in the price–reference relationship. The continuous variation identifies how conversion changes with the gap Rit − pit, while the label threshold supplies quasi-experimental variation: at the cutoff pit = (1 − τ)Rit, the label changes discontinuously even as other determinants of demand vary smoothly in a neighborhood. Under a local smoothness condition on sellers’ pricing noise around the threshold, a regression discontinuity design identifies the causal effect of the label (or, depending on specification, the effect of crossing into the ``gain’’ region) on conversion. This is particularly valuable because it helps separate a purely informational or attention-driven label effect from the continuous reference dependence encoded by (Rit − pit)+ and (pit − Rit)+.
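A stylized simulation illustrates the threshold logic: fit local linear regressions on each side of the cutoff in the running variable s = p − (1 − τ)R and compare intercepts at zero. The data-generating process and every parameter value below are invented for illustration; nothing is estimated from our data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical DGP: running variable s = p - (1 - tau) * R,
# badge displayed when s <= 0.
n = 1_000_000
s = rng.uniform(-1.0, 1.0, n)
L = (s <= 0).astype(float)
psi = 0.30                                   # true label effect on the latent index

# Conversion varies smoothly in s except for the discrete label jump.
q = 1.0 / (1.0 + np.exp(-(-2.0 - 0.8 * s + psi * L)))
y = rng.binomial(1, q)

# Sharp RD: local linear fits within bandwidth h on each side of the
# cutoff; the difference in intercepts at s = 0 estimates the label
# effect on the conversion probability.
h = 0.10
left = (s > -h) & (s <= 0)
right = (s > 0) & (s < h)
fit_left = np.polyfit(s[left], y[left], 1)
fit_right = np.polyfit(s[right], y[right], 1)
jump = np.polyval(fit_left, 0.0) - np.polyval(fit_right, 0.0)
```

Here the true jump in conversion probability is σ(−1.7) − σ(−2.0) ≈ 0.035, and the local linear estimator recovers it up to sampling noise within the bandwidth.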
Our contributions are threefold.
First, we provide an identification framework that treats the platform’s displayed reference statistic as an econometric object rather than an exogenous covariate. In many applied settings, reference prices are treated as either latent constructs or as noisy proxies for consumer beliefs. Here, we leverage the institutional fact that platforms compute Rit by a deterministic or quasi-deterministic algorithm from observed history. This allows us to estimate the reference kernel directly and to interpret it as a design parameter chosen (implicitly or explicitly) by the platform. In doing so, we connect reduced-form patterns in Rit to interpretable notions of memory and manipulability.
Second, we estimate asymmetric gain/loss sensitivities in demand in a manner that is tightly linked to platform mechanics. The piecewise specification is not a generic behavioral embellishment; it maps into the exact comparison consumers are prompted to make by the interface. By combining observed references with label discontinuities and rich controls Xit (e.g., position, shipping, ratings, and seasonality), we can quantify how strongly consumers respond to perceived gains relative to perceived losses, and how much incremental effect the discrete label adds beyond the continuous gap. These estimates are crucial inputs for evaluating counterfactual platform policies, because the welfare impact of memory depends on whether consumers are primarily motivated by avoiding losses, seeking gains, or reacting to badge salience.
Third, we use the estimated model to evaluate counterfactual platform designs. Because the platform can, in principle, change the rule mapping history into Rit (and thus into Lit), the platform effectively chooses a policy that shapes equilibrium behavior. We compare three families of rules that correspond closely to practice: (i) unweighted averaging over posted prices (an ARM-like benchmark), (ii) ESM (a short-memory benchmark), and (iii) sales-weighted averaging, which reduces the influence of prices that did not generate purchases. The last rule is particularly relevant for policy: it can be interpreted as an anti-manipulation design that makes it costly for sellers to inflate the reference via transient, non-transactional price spikes. Using our myopic equilibrium mapping from Rit to best-response prices and our dynamic benchmark mapping from memory to intertemporal incentives, we quantify the predicted changes in price volatility, label incidence, seller revenue, and consumer surplus.
Our analysis is closely related to recent theory on reference-dependent demand and dynamic pricing with long memory, including work that characterizes markdown strategies under an averaging reference mechanism (ARM) in which the reference is a long-run average of past prices. That literature emphasizes a key mechanism: when the reference evolves slowly, early prices have persistent effects on later perceived gains, making two-phase pricing policies (high initial price followed by markdowns) close to optimal under certain conditions. We view our empirical environment as a natural testing ground for these ideas because the platform’s reference statistic operationalizes a consumer-facing state variable that is explicitly computed from past prices. At the same time, platforms sometimes employ ESM-like smoothing that yields short memory, and they can modify the memory length through engineering choices. This motivates our model-selection exercise: using policy changes or A/B tests that shift the reference algorithm gt(⋅), we ask whether consumer behavior is better rationalized by long-memory (ARM-like) or short-memory (ESM-like) reference formation.
Several limitations are worth stating upfront. We model the reference algorithm as a finite-lag kernel (possibly sales-weighted), which is a flexible and interpretable approximation but may not capture all platform features, such as category-level normalization, personalization, or censoring rules when data are sparse. We also take the displayed reference as the salient signal consumers respond to; in reality, consumers may hold additional beliefs formed from external price history or cross-seller comparisons. Moreover, our baseline pricing equilibrium is myopic, which is not intended as a literal description of all sellers but as a tractable benchmark that isolates the role of platform-mediated signals. The dynamic benchmark partially addresses forward-looking behavior, yet it remains stylized relative to the richness of seller strategy in competitive marketplaces. Finally, the causal interpretation of label effects via threshold designs requires smoothness of unobservables at the cutoff; we treat this as an identifying assumption and assess it using standard diagnostics in the empirical sections.
Despite these caveats, the model illuminates a concrete tradeoff faced by platforms: longer memory can make references more stable and less manipulable, potentially protecting consumers from artificial anchors, but it can also intensify intertemporal incentives that encourage front-loaded pricing and subsequent markdowns. Short memory can reduce the reward to early price inflation, but it may produce more volatile references and may weaken the informational content of ``usual price’’ as a summary of typical transaction prices. Sales-weighting offers a distinct lever: it can preserve stability while tying the reference more tightly to realized purchase prices, thereby discouraging manipulation without necessarily shortening memory.
The remainder of the paper proceeds as follows. The next section describes institutional details on how platforms compute reference statistics and deal labels, the structure of the panel data, and the policy variation that generates identifying shifts in reference formation and labeling rules. We then formalize the model, estimate reference kernels and demand parameters, and use the estimated primitives to conduct counterfactual evaluations of alternative reference algorithms under both myopic and dynamic seller behavior.
Our empirical setting is a large e-commerce platform that, for many
product pages, displays a reference-price statistic alongside the
currently posted price and, when certain conditions are met, a salient
deal badge. The reference statistic is
phrased in consumer-facing language (e.g.,
``usual price'' or ``typical price'') and is presented as a
summary of the product’s recent price history. The deal badge (e.g.,
``Deal'' or ``Great deal'') is a discrete label that is
triggered when the current price is sufficiently low relative to the
displayed reference. These interface elements are central to how
consumers form an on-page comparison set and, in turn, to how sellers
anticipate and respond to the platform’s presentation of price
history.
The platform’s ``usual price'' is generated mechanically from historical pricing data. While the exact production code is proprietary, our institutional discussions and internal documentation describe a common structure that is well-approximated by a recursive updating rule: for each product i, the platform maintains a state variable Rit that updates over time using past posted prices and, in some variants, past sales. The key design choice is the memory length—how much weight the algorithm places on recent prices versus more distant history.
Operationally, the platform constructs Rit from a trailing window of past observations at the product level. The engineering implementation includes routine data-quality features that are typical in commerce settings: (i) missing-price days (stockouts or suppressed offers) are skipped rather than imputed; (ii) extreme price outliers can be winsorized or excluded; and (iii) for products with insufficient history (new listings or sparse sales), the platform falls back to a category-level heuristic or withholds the display. These features matter empirically because they imply that the reference statistic is usually, but not always, a deterministic function of the observed price history in our panel. We therefore treat Rit as an observed platform output that is generated by an algorithm with a stable functional form over defined policy regimes, allowing for well-documented exceptions (e.g., cold-start rules).
A crucial detail for interpretation is that the platform computes the ``usual price'' for the featured offer shown to consumers, rather than for an arbitrary seller listing in a multi-seller environment. In practice, consumers interact primarily with the default offer on the product page; the platform’s displayed reference and badge are tied to that offer and are updated using that offer’s price history as observed by the platform. In our data construction, we align the posted price pit with the featured-offer price at time t, so that the triplet (pit, Rit, Lit) reflects exactly what consumers see when they arrive on the page.
Finally, the platform’s update cadence is regular: the reference statistic is recomputed at a fixed interval (daily in our main sample), using all qualifying observations up to t − 1 when displaying Rit at time t. This institutional timing aligns naturally with a model in which consumers observe (pit, Rit, Lit) contemporaneously, while the reference evolves as a lagged function of observed history.
The deal badge is a discrete, salient classification intended to
highlight unusually low current prices. In the regimes we study, the
label rule is threshold-based in the platform’s internal documentation:
the badge appears when the current posted price is sufficiently below
the displayed reference statistic. Letting Lit
denote the indicator for a badge, the core rule is well summarized
by
Lit = 1{pit ≤ (1 − τ)Rit},
for a platform-chosen threshold parameter τ > 0. In some parts of the
platform, multiple badges exist (e.g., ``Deal'' versus ``Great
deal''), corresponding to a small set of ordered thresholds; our
baseline analysis focuses on a binary label, and we treat multi-badge
cases as robustness exercises by collapsing to an indicator for any deal
label or by estimating separate discontinuities at each cutoff when the
data allow.
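The collapsed and ordered versions of the label rule can be written compactly. The cutoff values below are hypothetical placeholders, not the platform's proprietary thresholds:

```python
import numpy as np

# Hypothetical ordered thresholds: a "Deal" badge at 5% below the
# reference, a "Great deal" badge at 15% below (illustrative values).
TAUS = np.array([0.05, 0.15])

def badge_level(p, R, taus=TAUS):
    """0 = no badge, 1 = "Deal", 2 = "Great deal";
    badge k requires p <= (1 - tau_k) * R."""
    discount = 1.0 - p / R                 # relative gap below the reference
    return int(np.searchsorted(taus, discount, side="right"))

def any_deal(p, R):
    """Binary collapse used in the baseline analysis."""
    return badge_level(p, R) > 0
```

Under these assumed cutoffs, a price of 96 against a reference of 100 gets no badge, 94 gets the first badge, and 84 gets the second; the ordered levels collapse to a binary indicator for the baseline specification.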
Importantly, the badge system also includes eligibility screens that are not purely price-based. For example, the platform may require that the offer is in stock, that shipping speed meets a promised standard, or that the seller meets performance criteria. These screens can induce selection into labeled status. We address this in two ways. First, we restrict attention to observations where the offer is eligible by construction (e.g., in-stock, standard shipping available), using the platform’s eligibility flags when available. Second, we control for rich time-varying covariates Xit that proxy for the remaining determinants of eligibility and consumer choice (e.g., shipping promises, ratings, and page position). The threshold structure nonetheless generates economically valuable variation: conditional on eligibility, crossing the cutoff changes the label discretely, creating a sharp change in what consumers see even when the underlying price–reference gap varies smoothly.
Our primary dataset is a panel constructed from the platform’s
page-view and purchase logs, merged with the price and merchandising
feeds that determine what is displayed to consumers. The unit of
observation is a product–time cell (i, t) at a daily
frequency. For each cell we observe:
(i) the posted featured-offer price pit;
(ii) the displayed reference statistic Rit
when shown on the page; (iii) the deal label Lit
(and, when applicable, the label type); (iv) impressions (page views)
mit;
(v) purchases yit;
and (vi) a set of covariates Xit
capturing time-varying product and offer attributes observed by
consumers and/or used by the platform (e.g., shipping speed, rating and
review counts, whether the offer is fulfilled by the platform, and page
placement proxies). We construct the conversion rate as q̂it = yit/mit
and treat (mit, yit)
as the sufficient statistics for demand at the daily level.
The panel is unbalanced for institutional reasons: products enter and exit (new listings, delistings), and reference display can be suppressed under cold-start or insufficient-history rules. We keep the unbalanced structure but impose minimal requirements needed for lag-based analysis: for estimation of a K-lag reference kernel, we require at least K prior days with observed prices for a given product, and we track the exact set of usable lags product-by-product. Missing days do not mechanically break the panel; instead, they reduce the available lag information and trigger the platform’s own fallback rules, which we treat explicitly in sample selection and robustness checks.
We emphasize that our demand outcomes are measured conditional on exposure: impressions mit reflect consumer arrivals to the product page (or featured offer module) and are affected by platform ranking and traffic allocation. Because impressions are themselves endogenous to merchandising and competition, we separate the traffic (impression) margin from the conversion margin. Our baseline demand estimation conditions on mit and models yit as a binomial draw given mit and the on-page signals. In complementary analyses, we also examine whether the badge affects traffic (e.g., by influencing click-through in search modules), but the core identification of reference dependence and label effects operates on the conversion margin.
Marginal costs ci are not directly observed in the transactional data for all third-party sellers. When needed for counterfactual profit calculations, we use either (i) cost proxies available for first-party retail offers, (ii) category-level cost estimates, or (iii) an inferred cost bound based on long-run price floors, and we report sensitivity of welfare calculations to these assumptions. Because the key identification of reference formation and demand responses relies on (pit, Rit, Lit, mit, yit, Xit), the main empirical results do not require exact costs.
A central advantage of the platform environment is that the mapping from history to displayed reference and labels is sometimes shifted by engineering changes or deliberate experimentation. We observe two forms of variation.
First, the platform periodically changes the reference algorithm or its parameters (e.g., the lookback window, smoothing factor, or the inclusion of sales-weighting). These changes occur at known dates and are documented in release notes. When such a change is rolled out globally at time t0, it creates a regime shift in the law of motion for Rit holding fixed the underlying product universe. In the data, these events are visible as systematic changes in how quickly Rit responds to shocks in pit, and they generate valuable identifying variation for distinguishing short-memory (fast adjustment) from long-memory (slow adjustment) reference formation.
Second, the platform runs randomized experiments in which users (or sessions) are assigned to different display rules. In these A/B tests, the platform may vary either (i) the reference statistic algorithm (e.g., exponential smoothing versus a longer windowed average) or (ii) the label rule (e.g., tightening or loosening the threshold τ, or introducing a multi-badge scheme). The experiment assignment is logged at the user-group level. Because our main panel is aggregated to the product–day level, we construct treatment exposure measures by merging assignment shares into (i, t) cells (e.g., the fraction of impressions exposed to the treatment display). This allows us to treat algorithm assignment as an instrument that shifts Rit and/or Lit in a way that is orthogonal to product-specific demand shocks, strengthening identification of the causal demand response to displayed signals.
We apply standard filters to ensure that (pit, Rit, Lit) correspond to a stable display environment. We exclude days with obvious data corruption (e.g., zero price with positive sales), restrict to currency-consistent observations, and remove products with pervasive missingness in key covariates. Because reference display can be suppressed for sparse history, we distinguish between (i) the ``reference-observed’’ sample where Rit is displayed and directly observed, and (ii) the broader sample where only labels and prices are observed. This distinction is important because our identification strategy can proceed either by directly modeling Rit (when observed) or by using the label threshold to back out the relevant variation in the price–reference relationship.
We also harmonize timing: the displayed reference and label are recorded at the time of impression. When aggregating to the daily level, we use impression-weighted averages of the displayed signals, so that Rit and Lit correspond to what a typical consumer saw on day t. This is especially important during experiments, where different users may see different displays within the same day.
Several descriptive patterns, reported in the summary tables and figures in the Appendix, motivate our modeling choices. First, the reference statistic is highly persistent: within a product, Rit moves more smoothly than pit, consistent with an updating rule that averages over time. This persistence varies across policy regimes in ways consistent with platform design changes, suggesting meaningful variation in effective memory.
Second, deal labels occur with nontrivial frequency and are concentrated around plausible cutoffs. In particular, the distribution of the running variable sit = pit − (1 − τ)Rit shows substantial mass near zero, and the probability of receiving a badge shifts sharply at sit = 0. This visual ``first stage’’ supports the institutional claim that the label rule is approximately threshold-based and provides a foundation for regression discontinuity designs that use local variation around the cutoff.
Third, conversion rates vary systematically with both price levels and the price–reference gap. Conditional on product and time controls, conversion is typically higher when pit < Rit than when pit > Rit, and the slope differs across sides of the reference in a manner suggestive of asymmetric gain/loss effects. At the same time, labeled days exhibit additional conversion uplift beyond what would be predicted by the continuous price–reference gap alone, consistent with the badge operating as an attention or salience channel.
Finally, the data reveal economically meaningful co-movement between pit and Rit. This co-movement is expected in equilibrium—sellers respond to the displayed reference and the platform updates the reference from past prices—and it underscores why identification must leverage the platform’s institutional mechanics (observed Rit, known threshold rules, and policy-induced shifts in the reference algorithm) rather than treating the reference as an exogenous covariate.
Taken together, these institutional features and descriptive facts discipline the modeling choices in the next section. The platform’s reference statistic behaves like a memory kernel applied to past prices (and sometimes sales), the deal badge is well approximated by a threshold in the price–reference comparison, and the panel structure provides the ingredients needed to estimate both the reference formation process and consumers’ asymmetric responses to gains and losses relative to the displayed benchmark.
We formalize the platform environment as a dynamic interaction among three sets of agents: the platform, which computes and displays a reference-price statistic and (sometimes) a deal label; sellers, who choose posted prices; and consumers, who decide whether to purchase after observing the on-page signals. The model is intentionally parsimonious: it isolates the two feedback loops that are central in our setting—(i) the platform maps past prices (and possibly past sales) into a displayed benchmark Rit, and (ii) sellers anticipate how today’s price affects both contemporaneous demand and the future path of that benchmark.
The unit of analysis is a product–time cell (i, t) (daily in our main sample). At each t, the platform displays a reference statistic Rit and a deal label Lit on the product page. The seller then posts a price pit. Consumers arrive, generating mit impressions and yit purchases. Finally, the platform updates the next-period reference using a deterministic algorithm applied to lagged outcomes.
From the econometrician’s perspective, the observed dataset is
𝒟 = {pit, Rit, Lit, mit, yit, Xit}i, t,
where Xit
collects observed covariates (e.g., shipping promises, ratings, and time
controls). A key modeling choice is to treat impressions mit as
conditioning information for conversion: we model yit
given mit and
the on-page signals, and we interpret changes in conversion as demand
shifts holding traffic fixed.
The platform’s reference statistic is designed to summarize recent
price history. We capture this with a memory kernel that maps lagged
posted prices into the displayed benchmark. Our most general
specification assumes that for each product i,
$$
R_{it}=\sum_{k=1}^{K}\omega_k\, p_{i,t-k},
$$
where K is the maximum lag
used by the algorithm. This formulation nests a wide range of
moving-average and ``windowed’’ reference rules used in practice: the
platform chooses the shape of the weight vector ω = (ω1, …, ωK),
thereby choosing the effective memory length. Flat weights correspond to
a trailing average; front-loaded weights correspond to short memory.
A commonly used special case is exponential smoothing (ESM), which
can be written recursively as
$$
R_{it}=\alpha\,R_{i,t-1}+(1-\alpha)\,p_{i,t-1},\qquad \alpha\in(0,1).
$$
The ESM rule implies geometrically decaying weights on lagged prices,
with effective memory increasing in α. Empirically, ESM is attractive
because it reduces the dimensionality of the reference algorithm to a
single parameter and yields sharp testable restrictions relative to the general finite-lag kernel.
Conceptually, it also highlights an important economic tradeoff: a
high-α reference is stable
(harder to move with short-lived price changes) but may be slow to
incorporate genuine cost or quality changes; a low-α reference responds quickly but can
be more easily ``steered’’ by transitory price movements.
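A minimal numerical sketch of this steering tradeoff, with invented prices and smoothing parameters: a one-day anchor spike barely moves a high-α (long-memory) reference but sharply moves a low-α (short-memory) one.

```python
import numpy as np

def esm_path(prices, alpha, r0):
    """Reference path under exponential smoothing:
    R_t = alpha * R_{t-1} + (1 - alpha) * p_{t-1}."""
    R = [r0]
    for p in prices[:-1]:
        R.append(alpha * R[-1] + (1 - alpha) * p)
    return np.array(R)

# Flat price path with a single one-day "anchor" spike (illustrative).
prices = np.array([100.0] * 5 + [150.0] + [100.0] * 14)

R_short = esm_path(prices, alpha=0.50, r0=100.0)   # low alpha: short memory
R_long = esm_path(prices, alpha=0.95, r0=100.0)    # high alpha: long memory
```

The day after the spike, the short-memory reference jumps to 125 while the long-memory reference moves only to 102.5; the short-memory jump then decays geometrically, so a seller can ``reset'' a short-memory anchor with a single day of high posted prices.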
In some platform regimes, reference formation uses not only posted
prices but also transaction information, reflecting a desire to anchor
the statistic to prices at which consumers actually purchase. We capture
this through a sales-weighted kernel:
$$
R_{it}=\frac{\sum_{k=1}^{K} y_{i,t-k}\,p_{i,t-k}}{\sum_{k=1}^{K} y_{i,t-k}},
$$
with a fallback rule when the denominator is zero (e.g., carry-forward
of Ri, t − 1
or suppression of the display). The economic motivation is
straightforward: an unweighted rule gives equal influence to posted
prices even when no consumer buys, which can strengthen incentives for
purely cosmetic ``anchor'' prices; in contrast, sales-weighting attenuates the impact of
no-sale price spikes by tying the benchmark to realized transactions.
This distinction will matter for counterfactuals because it changes how
costly it is for sellers to manipulate Rit.
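A minimal sketch of the manipulation channel, with invented numbers: a one-day anchor price that generates no sales moves the unweighted trailing average but leaves the sales-weighted reference untouched.

```python
import numpy as np

def window_ref(prices, sales=None, K=7):
    """Trailing K-day reference: unweighted mean of posted prices,
    or sales-weighted mean when purchase counts are supplied."""
    p = np.asarray(prices[-K:], dtype=float)
    if sales is None:
        return float(p.mean())
    w = np.asarray(sales[-K:], dtype=float)
    if w.sum() == 0:
        return float("nan")    # fallback case: carry forward or suppress display
    return float((w * p).sum() / w.sum())

# A cosmetic anchor price of 200 posted for one day; no units sell that day.
prices = [100, 100, 100, 200, 100, 100, 100]
sales  = [  5,   4,   6,   0,   5,   5,   4]

posted_avg = window_ref(prices)          # pulled up by the no-sale spike
sales_wtd = window_ref(prices, sales)    # anchored to transacted prices
```

The unweighted average rises to about 114.3 while the sales-weighted reference stays at 100, which is exactly the sense in which sales-weighting raises the cost of inflating Rit via non-transactional spikes.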
We emphasize two simplifications. First, we treat the platform algorithm as product-separable and stable within policy regimes. This fits the institutional fact that the reference statistic is computed mechanically from product-level history, though it abstracts from category-level pooling and engineering exceptions (e.g., cold-start heuristics). Second, we treat Rit as predetermined with respect to the seller’s time-t price: Rit is computed from information through t − 1 and displayed before pit is posted. This timing is critical for the seller problem below.
In addition to displaying Rit,
the platform sometimes shows a discrete badge intended to highlight
unusually low prices. We model the label as a deterministic function of
the current price relative to the displayed reference:
$$
L_{it}=\mathbf{1}\{p_{it}\le(1-\tau)\,R_{it}\},
$$
where τ > 0 is a
platform-chosen threshold. This specification captures the institutional
feature that the badge is triggered by crossing a cutoff in the
price–reference comparison. It also makes precise the sense in which the
label discretizes an underlying continuous object: the price gap Rit − pit.
When the platform uses multiple badge intensities, this rule generalizes to
ordered thresholds; in the model, this simply introduces additional
cutoffs in the mapping from (pit, Rit)
to the displayed signal.
Consumers observe the posted price pit and the platform’s displayed signals (Rit, Lit), as well as other observable attributes summarized by Xit. We model purchase as a binary decision at the impression level and aggregate to the product–day level.
Our baseline demand specification is a tractable generalized linear
model for conversion. Let qit
denote the probability that an impression converts into a purchase. We
specify
$$
q_{it}=\sigma\!\left(\delta_i+\lambda_t-\kappa\,p_{it}+\eta_+\,(R_{it}-p_{it})_+-\eta_-\,(p_{it}-R_{it})_++\psi\,L_{it}+X_{it}'\beta\right),
$$
where σ(z) = 1/(1 + e−z)
is the logistic link, (x)+ = max {x, 0},
δi are
product fixed effects, and λt are time
fixed effects. The coefficient κ > 0 captures the standard
negative effect of price holding reference fixed. The parameters (η+, η−)
govern the asymmetry in how consumers respond to being below versus
above the benchmark. The label coefficient ψ captures an additional salience
channel that operates discretely when the badge is present, above and
beyond the continuous gain term.
Given impressions mit,
observed purchases follow a binomial sampling model:
yit ∼ Binomial(mit, qit).
This formulation delivers a likelihood that is well suited to panel
estimation with fixed effects and allows us to map changes in conversion
into economically interpretable shifts in the latent index of the conversion model.
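A minimal sketch of the conversion index may help fix ideas; all parameter values below are illustrative assumptions, not estimates, and the covariate term γ′Xit is folded into the intercept.

```python
import math

def conversion_prob(p, R, label, delta=0.0, lam=0.0, kappa=0.02,
                    eta_plus=0.01, eta_minus=0.05, psi=0.3):
    """Logistic conversion probability with asymmetric gain/loss terms and a
    discrete label shifter (illustrative parameter values; gamma'X folded
    into delta)."""
    gain = max(R - p, 0.0)
    loss = max(p - R, 0.0)
    z = delta + lam - kappa * p + eta_plus * gain - eta_minus * loss + psi * label
    return 1.0 / (1.0 + math.exp(-z))
```

With eta_minus > eta_plus, pricing one unit above the reference costs more conversion than pricing one unit below it gains, which is the asymmetry the specification is designed to capture.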
For some counterfactual exercises, it is convenient to work with a
linear-demand approximation that yields closed-form pricing rules. In
that case we posit
qit = ai − κpit + η+(Rit − pit)+ − η−(pit − Rit)+,
with ai > 0. While this linear approximation
is less well-behaved globally, it clarifies the economic mechanism: a
higher reference raises demand only to the extent that it creates a
perceived gain (or avoids a perceived loss), and the slopes differ on
the two sides.
We model the featured-offer seller for product i as choosing pit to
maximize profit. Let ci denote
marginal cost. Period profit is
πit(pit) = (pit − ci)mitqit,
where expected demand qit is determined by the logit model or its linear approximation. We impose an exogenous
feasible set $p_{it}\in[\underline p,\bar
p]$ capturing platform price floors/ceilings or seller
constraints.
Our baseline equilibrium discipline is best response: at time t, the seller takes Rit and
the platform’s label rule as given and maximizes period-by-period. In
the linear-demand approximation, the best response admits closed form
separately in the gain region (p ≤ R) and the loss region
(p ≥ R). Abstracting
from the label discontinuity for the moment, the interior solutions
are
p⋆(R) = [ai + η+R + (κ + η+)ci]/[2(κ + η+)]  in the gain region (p ≤ R),
and
p⋆(R) = [ai + η−R + (κ + η−)ci]/[2(κ + η−)]  in the loss region (p ≥ R).
These expressions make the comparative statics transparent: when η− > η+
(loss aversion), the seller’s optimal price is more tightly linked to
R in the loss region,
generating an incentive to avoid posting prices above the reference.
Incorporating the deal label Lit
adds a discrete kink around the threshold p = (1 − τ)R when
ψ > 0, potentially
generating bunching just below the cutoff.
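The piecewise logic above can be computed directly: compare the two interior first-order-condition candidates (clipped to their regions) with the kink at R and, when a badge exists, the threshold price. The additive badge lift on linear demand in the sketch below is an illustrative assumption.

```python
def myopic_best_response(a, c, R, kappa, eta_p, eta_m, tau=None, psi=0.0):
    """Maximize (p - c) * q(p) under piecewise-linear demand
    q = a - kappa*p + eta_p*(R - p)+ - eta_m*(p - R)+, with an
    illustrative additive badge lift psi when p <= (1 - tau)*R."""
    def q(p):
        base = a - kappa * p + eta_p * max(R - p, 0.0) - eta_m * max(p - R, 0.0)
        if tau is not None and p <= (1.0 - tau) * R:
            base += psi
        return max(base, 0.0)

    def profit(p):
        return (p - c) * q(p)

    # interior FOC candidates in the gain (p <= R) and loss (p >= R) regions,
    # plus the kink at R and (if a badge exists) the threshold price
    p_gain = (a + eta_p * R + (kappa + eta_p) * c) / (2.0 * (kappa + eta_p))
    p_loss = (a + eta_m * R + (kappa + eta_m) * c) / (2.0 * (kappa + eta_m))
    candidates = [min(p_gain, R), max(p_loss, R), R]
    if tau is not None:
        cut = (1.0 - tau) * R
        p_gain_badge = (a + psi + eta_p * R + (kappa + eta_p) * c) / (2.0 * (kappa + eta_p))
        candidates += [cut, min(p_gain_badge, cut)]
    return max(candidates, key=profit)
```

With strong loss aversion the optimum stays weakly below R, and raising R raises the optimal price, matching the comparative statics in the text.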
Myopic pricing is a useful approximation when sellers either do not
understand the platform’s reference algorithm or face short planning
horizons. To connect the empirics to the strategic logic emphasized in
long-memory reference-price theories, we also consider a dynamic
benchmark in which the seller internalizes that today’s price affects
future references through the platform’s update rule. Let the reference
state evolve as
Ri, t + 1 = g(Rit, pit, history),
where g(⋅) corresponds to the window-average, exponential-smoothing,
or sales-weighted rule under the relevant policy regime. The seller’s value function
satisfies
V(R) = maxp ∈ [p̲, p̄] {(p − c)q(p, R, L(p, R)) + βV(g(R, p, ⋅))},
where β ∈ (0, 1) is the seller’s discount factor and we write L(p, R) to
emphasize that the badge is endogenously determined by the threshold rule given the
seller’s price choice.
This dynamic problem captures the key intertemporal tradeoff: a higher price today raises contemporaneous margin but may raise future R (depending on the memory kernel), which can make future markdowns appear as larger gains and thus increase future demand. Sales-weighting changes this incentive by making the influence of a high price depend on whether consumers actually buy at that price. We do not assume the platform chooses g(⋅) optimally; instead, we treat the algorithm as a policy lever that can vary across regimes or experiments, which will be the basis for our counterfactual comparisons.
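The intertemporal tradeoff can be illustrated with a small value-iteration sketch on a discretized reference grid, using the exponential-smoothing update R′ = αR + (1 − α)p and the linear-demand approximation; all parameter values are illustrative assumptions.

```python
import numpy as np

def solve_dynamic(a=10.0, c=1.0, kappa=1.0, eta_p=0.5, eta_m=3.0,
                  alpha=0.8, beta=0.95, n=81, tol=1e-6):
    """Value iteration over the reference state R with ESM transition
    R' = alpha*R + (1 - alpha)*p and linear gain/loss demand (illustrative)."""
    grid = np.linspace(1.0, 9.0, n)  # shared grid for prices and references
    def demand(p, R):
        return np.maximum(a - kappa * p + eta_p * np.maximum(R - p, 0.0)
                          - eta_m * np.maximum(p - R, 0.0), 0.0)
    V = np.zeros(n)
    while True:
        Vnew = np.empty(n)
        policy = np.empty(n)
        for j, R in enumerate(grid):
            Rnext = alpha * R + (1.0 - alpha) * grid   # one R' per price choice
            idx = np.clip(np.searchsorted(grid, Rnext), 0, n - 1)
            vals = (grid - c) * demand(grid, R) + beta * V[idx]
            k = int(np.argmax(vals))
            Vnew[j], policy[j] = vals[k], grid[k]
        if np.max(np.abs(Vnew - V)) < tol:
            return grid, Vnew, policy
        V = Vnew
```

In this sketch the value function is increasing in the inherited reference: a higher R lets future markdowns appear as larger gains, which is exactly the channel that makes a long-memory kernel strategically relevant.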
An equilibrium in our environment consists of sequences {pit, Rit, Lit} such that (i) the platform’s displayed reference and label satisfy its mechanical rules, (ii) consumers purchase according to the conversion model (in logit or linear form), and (iii) sellers’ prices satisfy either the myopic best-response conditions or the dynamic optimality condition above. In the data, we observe outcomes generated by the platform’s chosen rule and the sellers’ (possibly heterogeneous) behavioral responses; our empirical strategy will therefore focus on identifying the kernel parameters governing Rit and the demand parameters governing responses to (pit, Rit, Lit).
The model abstracts from several features that matter in practice, including multi-seller competition, endogenous page position, inventory constraints, and the platform’s own objective function. We view these omissions as disciplined by two considerations. First, our core objects—the reference formation kernel and the asymmetric gain/loss demand response—are directly tied to the on-page signals the platform displays and are meaningful even in richer competitive environments. Second, the institutional variation we observe (policy shifts and experiments in the display rules) provides quasi-experimental leverage on the causal role of these signals, reducing reliance on fully specifying the broader marketplace. The next section develops the identification strategy implied by these primitives.
Our identification goal is to separately recover (i) the platform-side mapping from past outcomes into the displayed reference statistic and badge, and (ii) the consumer-side demand response to the resulting on-page signals. The central challenge is that prices, references, and labels are jointly endogenous in equilibrium: sellers may choose pit in light of Rit and the label rule, while the platform computes Rit from past prices (and possibly past sales), and both sides may react to unobserved demand shocks. We therefore organize identification around two sources of structure that are present in the institutional setting: (a) the nature of reference formation within policy regimes, and (b) variation induced by A/B tests, discrete policy shifts, and threshold-based badge assignment.
The first step treats the reference statistic as an observed outcome of a deterministic filter applied to lagged product-level histories. When Rit is directly observed on the page, the kernel parameters can be identified with minimal behavioral assumptions: we do not need a model of consumer demand or seller pricing to estimate the mapping from {pi, t − k}k = 1K into Rit, because the platform computes Rit before the seller chooses pit at time t.
Under the K-lag linear filter, identification of ω
is essentially a linear regression problem with product-level panel
structure:
$$
R_{it}=\sum_{k=1}^K \omega_k p_{i,t-k} + u_{it},
$$
where uit
captures measurement error in the scraped reference statistic,
engineering exceptions, or occasional deviations from the documented
rule. Intuitively, if lagged prices exhibit sufficient independent
variation (after controlling for fixed effects and time controls), then
the K weights are pinned down
by how the displayed reference co-moves with each lag. Formally, a rank
condition on the regressor matrix of lagged prices identifies ω up to sampling error, and the
simplex constraints ωk ≥ 0 and ∑kωk = 1
can be imposed to improve precision and interpretability.
Exponential smoothing imposes a sharp set of overidentifying restrictions on the unrestricted kernel: the implied weights satisfy ωk ∝ (1 − α)αk − 1 (up to truncation at K). This yields testable implications that are useful both substantively (short versus long memory) and practically (dimension reduction). We implement this logic by (i) estimating the unrestricted ω under the K-lag specification, (ii) estimating α directly from the smoothing recursion, and (iii) comparing fit via Wald or likelihood-ratio tests of the geometric-decay restrictions. A useful summary object is the implied half-life of the reference memory, log (1/2)/log (α), which provides an interpretable mapping from parameter estimates to the persistence of the on-page benchmark.
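The geometric-decay restrictions and the half-life summary are straightforward to compute; a minimal sketch:

```python
import math

def esm_weights(alpha, K):
    """Implied kernel under exponential smoothing, w_k proportional to
    (1 - alpha) * alpha**(k - 1), truncated at K lags and renormalized."""
    w = [(1.0 - alpha) * alpha ** (k - 1) for k in range(1, K + 1)]
    s = sum(w)
    return [x / s for x in w]

def half_life(alpha):
    """Number of lags until the geometric weight falls to half its initial value."""
    return math.log(0.5) / math.log(alpha)
```

For instance, alpha = 0.5 gives a half-life of one lag (short memory), while alpha near one produces a nearly flat weight profile (long memory).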
When the platform uses transaction information as in the sales-weighted rule, the mapping from histories into Rit becomes nonlinear and potentially undefined when recent sales are zero. Conceptually, identification is still anchored in the same institutional fact: conditional on observed histories, Rit is mechanically determined. Empirically, we exploit two implications. First, within periods with positive weighted sales mass, Rit is a ratio of weighted sums; variation in (yi, t − k pi, t − k, yi, t − k) identifies the effective weighting and the role of sales in attenuating the influence of no-sale price spikes. Second, the ``zero-denominator'' events reveal the fallback rule. Because fallback behavior is often discrete (carry-forward, suppression, or category default), it generates visible patterns in the time series of Rit that can be directly documented and incorporated as part of the platform rule. We treat these patterns as additional moments that discipline the algorithm specification.
A mechanical mapping does not mean a perfectly stable mapping: platforms deploy exceptions (cold-start rules, outlier trimming, category-level overrides) and may change policy without public documentation. We therefore treat kernel estimation as both an estimation and a model-checking exercise. Two validation steps are especially informative. First, we assess stability by re-estimating ω on rolling windows and testing for breaks at known policy-change dates. Second, we assess out-of-sample predictability of Rit from lagged prices: if a candidate kernel is correct, it should forecast the displayed reference well even when the seller’s pricing process changes, because the platform filter is applied to observed history regardless of why prices moved.
Having disciplined the evolution of Rit, we turn to identifying the demand parameters in . The primary endogeneity problem is that posted prices respond to expected demand shocks: sellers may raise prices when demand is high for unobserved reasons, biasing naive estimates of price sensitivity. A second concern is that Rit is a function of past prices and sales, which themselves are outcomes of past demand shocks; even if Rit is predetermined within period t, it may be correlated with the current shock ϵit through persistence. Our strategy combines (i) experimental or policy-induced variation that shifts the displayed signals holding other primitives fixed, and (ii) local quasi-random variation around deterministic thresholds.
Many platforms evaluate UI and pricing-display changes via randomized experiments. When the assignment varies the reference algorithm gt(⋅) or the label rule ht(⋅) across users (or across time in a staggered rollout), the experiment creates exogenous variation in Rit and/or Lit that is orthogonal to seller-level demand shocks. The key exclusion restriction is that the experiment affects purchases only through the displayed signals, not through concurrent changes in product quality or fulfillment. Under this condition, algorithm assignment serves as an instrument for the endogenous component of the on-page signal. Concretely, if two user groups face different kernels ω(A) and ω(B) applied to the same underlying price history, then differences in conversion between groups identify the causal effect of R (and of badge exposure) without requiring us to assume that sellers are not optimizing—indeed, sellers may not even observe user-level assignment, making supply responses limited at the relevant margin.
Policy shifts (non-randomized but discrete changes at known dates) are weaker than A/B tests but still valuable when combined with rich controls and pre-trend diagnostics. The identifying logic is difference-in-differences in which the treatment is the mapping from history to displayed reference, rather than the history itself. Because Rit is a nonlinear summary of past prices, changing gt(⋅) can move Rit even when prices evolve smoothly, providing variation that is not mechanically collinear with pit. We interpret this as a practical route to disentangling reference effects from pure price effects when sellers adjust prices gradually.
Even absent experimentation, deterministic badge assignment generates
a discontinuity that we can exploit. Define the running variable
sit = pit − (1 − τ)Rit,
so that Lit = 1{sit ≤ 0}.
Under the maintained continuity condition that, absent the badge, the
potential conversion rate is smooth in sit at
zero (equivalently, sellers have locally noisy pricing around the cutoff
and there are no other discontinuous UI changes at the same threshold),
the jump in conversion at sit = 0
identifies the causal effect of receiving the badge. This identifies
ψ in the conversion model in a local sense, and it
also helps separate discrete salience effects from the continuous
gain/loss terms.
Beyond the jump, the RD design is informative about the parameters (η+, η−). Because the gain and loss components enter as piecewise-linear functions of Rit − pit, the slope of conversion with respect to p differs on either side of the reference. In practice, we use local polynomial regressions that allow different slopes for sit < 0 and sit > 0, optionally netting out the discrete label jump. The identifying idea is that, in a sufficiently small neighborhood around the cutoff, other covariates vary smoothly, so differences in local derivatives map to differences in implied gain versus loss sensitivity.
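A minimal local-linear implementation of this two-sided design, on simulated data (the jump size, slopes, and bandwidth below are invented for illustration):

```python
import numpy as np

def rd_two_sided(s, y, h):
    """Fit separate linear trends on each side of the cutoff s = 0 within
    bandwidth h. The badge fires for s <= 0, so the jump is the
    below-cutoff intercept minus the above-cutoff intercept."""
    below = (s <= 0) & (s > -h)
    above = (s > 0) & (s < h)
    slope_b, int_b = np.polyfit(s[below], y[below], 1)
    slope_a, int_a = np.polyfit(s[above], y[above], 1)
    return int_b - int_a, slope_b, slope_a
```

In a simulation where conversion is 0.10 plus a 0.05 badge jump and a common slope of −0.02 in s, the estimator recovers both the discontinuity and the side-specific slopes.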
Two assumptions merit emphasis because they can be partially assessed in the data.
First, the RD design requires that sellers cannot perfectly control the running variable at the threshold. In this environment, some bunching just below the cutoff is plausible precisely because the badge is valuable. We treat this as a limitation rather than a failure: bunching is itself evidence of strategic response, but it complicates a naive continuity argument. We address it by (i) using donut-RD specifications that exclude a narrow region around the cutoff where manipulation is most acute, and (ii) reporting McCrary-style density tests and balance tests for predetermined covariates Xit around the threshold. When manipulation is strong, we interpret RD estimates as local effects for the subset of observations not strategically placed, rather than as a universal treatment effect.
Second, the instrumental-variables logic of experiments and policy shifts requires that the algorithm change affects conversion primarily through displayed signals. We probe this by (i) checking for discontinuities in impressions mit at the same time (to the extent the platform might re-rank items when badges change), and (ii) examining placebo outcomes that should not respond to display changes (e.g., lagged conversion, or conversion for products not eligible for the badge). Where impressions respond, we interpret our baseline estimates as conversion-margin effects conditional on traffic and report complementary specifications that model mit jointly.
The conceptual separation we rely on is that platform rules determine the (Rit, Lit) from lagged histories, while demand parameters determine how consumers react to those signals. This separation is cleanest when Rit is observed and generated mechanically from lagged observables, because then the kernel can be estimated without reference to demand, and demand can be estimated conditioning on (or instrumenting with) the resulting signals.
In practice, we adopt a two-layer approach. In the first layer, we estimate and validate the reference rule within policy regimes and construct predicted references R̂it from observed histories. In the second layer, we estimate conversion demand using variation in pit, R̂it, and Lit, leveraging either randomized variation in the display rule or quasi-random local variation at the badge threshold. A useful byproduct is a set of overidentifying restrictions: if a candidate kernel is correct, then replacing Rit by R̂it should not change estimated demand responses materially (up to measurement error), and algorithm-induced shifts in R should predict conversion shifts with the same sign and magnitude as within-regime movements in R.
Finally, these identification steps deliver policy-relevant primitives. The platform kernel governs how persistent an ``anchor’’ price remains visible, while (η+, η−) and ψ govern how strongly consumers react to being framed as receiving a bargain and how costly it is for sellers to price above the benchmark. These objects are exactly what we need for counterfactual evaluation of alternative reference algorithms: changing gt(⋅) changes the evolution of the signal, which feeds back into seller pricing incentives and consumer welfare. The next section describes how we implement these ideas in a likelihood and GMM framework, and how we quantify uncertainty given high-dimensional fixed effects and potentially endogenous pricing.
This section describes how we map the observed panel 𝒟 = {(pit, Rit, Lit, mit, yit, Xit)}i, t into estimates of (i) the reference-formation parameters (kernel weights ω or smoothing parameter α and any sales-weighting components), and (ii) the demand parameters governing conversion responses to price, displayed reference, and deal labels. Because the platform signal (Rit, Lit) is mechanically generated within a policy regime while prices are chosen strategically, we organize estimation to (a) isolate the ``engineering’’ mapping that produces Rit, (b) flexibly absorb product and time heterogeneity, and (c) address residual endogeneity of pit (and, when relevant, the endogeneity of the displayed signals through correlation with persistent shocks).
When the displayed reference is observed and is well-approximated by
a K-lag linear filter of
posted prices, we estimate
$$
R_{it}=\mu_i+\nu_t+\sum_{k=1}^K \omega_k p_{i,t-k}+\xi_{it},
$$
where μi
captures persistent product-level scaling or rounding differences in the
displayed statistic, νt captures
common time-varying engineering adjustments (e.g., category-level
updates rolled out globally), and ξit is
an error term capturing scraping noise and rule exceptions. In practice,
we estimate by least squares with high-dimensional fixed effects and
impose the simplex restrictions ωk ≥ 0 and $\sum_{k=1}^K \omega_k=1$ either (i) directly
via constrained least squares, or (ii) by reparameterizing $\omega_k=\exp(\theta_k)/\sum_{j=1}^K
\exp(\theta_j)$ and minimizing the squared error over θ. The constrained formulation
improves interpretability and stabilizes estimates when lagged prices
are highly collinear.
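A sketch of the softmax-reparameterized constrained fit (without fixed effects, on simulated noise-free data; the optimizer choice is a convenience, not a claim about our production implementation):

```python
import numpy as np
from scipy.optimize import minimize

def fit_simplex_kernel(P, R):
    """Least squares for R_t = sum_k w_k * P[t, k] with w on the simplex,
    via the softmax reparameterization w_k = exp(th_k) / sum_j exp(th_j)."""
    K = P.shape[1]
    def loss(th):
        w = np.exp(th - th.max())
        w = w / w.sum()
        e = R - P @ w
        return float(e @ e)
    res = minimize(loss, np.zeros(K), method="Nelder-Mead",
                   options={"xatol": 1e-10, "fatol": 1e-14, "maxiter": 50000})
    w = np.exp(res.x - res.x.max())
    return w / w.sum()
```

The returned weights are nonnegative and sum to one by construction, which is the interpretability benefit emphasized in the text.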
For exponential smoothing, we estimate α from the recursion
$$
R_{it}=\alpha R_{i,t-1}+(1-\alpha)p_{i,t-1}+\xi_{it},
$$
which can be estimated by least squares (or IV if one is concerned that
ξit is
correlated with Ri, t − 1
via measurement error). We then compare (i) the fit of the smoothing recursion and (ii) the fit
of the unrestricted kernel, and we test the geometric restrictions
implied by ESM using Wald tests on the unrestricted ω̂. For interpretability, we report
the implied half-life log (1/2)/log (α̂) and the effective
memory length 1/(1 − α̂) as
summaries of persistence.
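Because the weights on Ri, t − 1 and pi, t − 1 sum to one, the recursion reduces to a one-parameter regression with a closed-form least-squares estimator; a sketch on simulated data:

```python
import numpy as np

def fit_alpha(R, p):
    """Closed-form LS for alpha in R_t = alpha*R_{t-1} + (1-alpha)*p_{t-1} + xi_t.
    Substituting the constraint: (R_t - p_{t-1}) = alpha*(R_{t-1} - p_{t-1}) + xi_t."""
    d = R[:-1] - p[:-1]
    return float(d @ (R[1:] - p[:-1]) / (d @ d))
```

With alpha = 0.8, the implied half-life is log(1/2)/log(0.8) ≈ 3.1 lags and the effective memory length is 1/(1 − 0.8) = 5.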
When the reference is sales-weighted, the object is nonlinear:
$$
R_{it}=\frac{\sum_{k=1}^K \omega_k\, y_{i,t-k}\, p_{i,t-k}}{\sum_{k=1}^K \omega_k\, y_{i,t-k}}
\quad \text{whenever } \sum_{k=1}^K \omega_k\, y_{i,t-k}>0,
$$
with a fallback rule otherwise (e.g., carry-forward Rit = Ri, t − 1,
suppression, or category default). We estimate (ω, fallback parameters) by
nonlinear least squares or simulated minimum distance: we choose
parameters to minimize the distance between observed Rit and
the right-hand side of the sales-weighted formula on the subset with positive denominators, and we
separately match moments characterizing fallback events (frequency of
missing/suppressed reference, probability of carry-forward, and
transition probabilities across regimes). Because sales can be sparse at
the daily level, we also implement weekly aggregation as a robustness
check, which smooths denominators and reduces the influence of
idiosyncratic zero-sales days.
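A sketch of the sales-weighted statistic with a carry-forward fallback (the fallback choice here is one of the candidate rules to be matched to the data, not an institutional fact):

```python
import numpy as np

def sales_weighted_R(p, y, w, R0):
    """R_t = sum_k w_k*y_{t-k}*p_{t-k} / sum_k w_k*y_{t-k}; when the weighted
    sales mass in the window is zero, carry forward the previous reference
    (assumed fallback rule)."""
    K, T = len(w), len(p)
    R, prev = np.empty(T), R0
    for t in range(T):
        num = sum(w[k - 1] * y[t - k] * p[t - k] for k in range(1, K + 1) if t - k >= 0)
        den = sum(w[k - 1] * y[t - k] for k in range(1, K + 1) if t - k >= 0)
        prev = num / den if den > 0 else prev
        R[t] = prev
    return R
```

Note the attenuation property emphasized in the text: a price spike on a day with zero sales never enters the displayed reference.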
We treat kernel estimation as both parameter recovery and specification checking. Within each policy regime, we re-estimate the kernel on rolling windows to assess stability and to detect undocumented changes. We also evaluate out-of-sample prediction of Rit from lagged prices (and sales, if applicable). A kernel that fits in-sample but fails to forecast the displayed reference out of sample is unlikely to represent the true platform rule, and it would contaminate downstream demand estimation if R̂it is used as a generated regressor.
We estimate the conversion model at the impression level, treating
purchases as binomial outcomes:
$$
y_{it}\sim \mathrm{Binomial}\bigl(m_{it},\, q_{it}(\theta)\bigr),
$$
where σ(z) = 1/(1 + e−z).
The log-likelihood is
$$
\ell(\theta)=\sum_{i,t}\Bigl[y_{it}\log q_{it}(\theta)+(m_{it}-y_{it})\log\bigl(1-q_{it}(\theta)\bigr)\Bigr]
+\text{const},
$$
with parameter vector θ = ({δi}, {λt}, γ, κ, η+, η−, ψ).
In baseline specifications, we include product fixed effects δi to absorb
time-invariant quality and baseline popularity, and time fixed effects
λt to
absorb seasonality and common shocks. We also report parsimonious
alternatives (e.g., category-by-time fixed effects) to assess the
stability of the gain/loss asymmetry estimates under different control
sets.
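The likelihood contribution (dropping the binomial coefficient, which does not depend on θ) is straightforward to evaluate at any candidate linear index:

```python
import numpy as np

def binom_loglik(y, m, z):
    """Binomial-logit log-likelihood at linear index z, up to the
    combinatorial constant: sum y*log(q) + (m - y)*log(1 - q), q = sigma(z)."""
    z = np.asarray(z, dtype=float)
    q = 1.0 / (1.0 + np.exp(-z))
    y = np.asarray(y, dtype=float)
    m = np.asarray(m, dtype=float)
    return float(np.sum(y * np.log(q) + (m - y) * np.log(1.0 - q)))
```

As a sanity check, the likelihood for a single cell is maximized at the empirical logit of the observed conversion rate.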
Because the panel can be large, we estimate the model using standard high-dimensional generalized linear model methods. Concretely, we maximize the likelihood via iteratively reweighted least squares with within-transformations (or alternating updates) for {δi} and {λt}. This procedure is numerically stable even when mit is large because the binomial likelihood aggregates impression-level outcomes without requiring individual-level data. We monitor convergence using both parameter changes and likelihood improvements, and we trim extreme observations with mit below a small threshold in robustness checks to reduce sensitivity to noisy conversion rates.
In settings where we do not directly observe Rit for all products and dates (or when we prefer to impose a parametric rule), we replace Rit by its predicted counterpart R̂it(ω̂) constructed from the kernel estimates. Because R̂it is a generated regressor, naive standard errors for (κ, η+, η−, ψ) may understate uncertainty. We therefore use either (i) a block bootstrap that re-estimates the kernel and demand jointly within each resample, or (ii) a two-step sandwich correction based on the joint score (stacking kernel moments and demand likelihood scores), as described below.
To address residual endogeneity of posted prices, we complement
likelihood estimation with a GMM/IV approach that uses instruments Zit
that shift prices but are plausibly orthogonal to unobserved demand
shocks. A convenient formulation is based on the score conditions of the
binomial logit. Let uit(θ) = yit − mitqit(θ)
denote the binomial ``residual.’’ Under correct specification and
exogeneity, we have
$$
\mathbb{E}\bigl[Z_{it}\, u_{it}(\theta_0)\bigr]=0.
$$
We implement two-stage procedures (control-function or two-stage
residual inclusion) and direct GMM minimization of $\widehat{g}(\theta)=\frac{1}{|\mathcal{D}|}\sum_{i,t}
Z_{it}u_{it}(\theta)$ with optimal weighting in a second
step.
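The moment condition translates into a quadratic-form objective; the sketch below evaluates it at a candidate linear index (the instrument construction and identity weighting are simplifying assumptions for illustration):

```python
import numpy as np

def gmm_objective(z, y, m, Z, W=None):
    """Evaluate g'Wg where g = (1/n) * sum_t Z_t * u_t and
    u_t = y_t - m_t * sigma(z_t) is the binomial residual at index z_t."""
    q = 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))
    u = np.asarray(y, dtype=float) - np.asarray(m, dtype=float) * q
    g = np.asarray(Z, dtype=float).T @ u / len(u)
    W = np.eye(len(g)) if W is None else W
    return float(g @ W @ g)
```

In simulation, the objective is close to zero at the data-generating index and substantially larger at a misspecified one, which is the minimization logic used in the second step.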
In all cases, we report first-stage strength (where applicable), assess sensitivity to alternative instrument sets, and provide overidentification tests when the model is overidentified.
Even with instruments for pit, it is important to guard against the possibility that Rit is correlated with persistent unobservables. Experimental shifts in gt(⋅) are particularly useful here: they generate variation in Rit conditional on a common underlying price history. In specifications that pool across regimes or A/B arms, we include arm-by-time controls and use assignment indicators as instruments for Rit and Lit (or, equivalently, for the gain/loss regressors). This directly targets the elasticity of conversion with respect to displayed signals rather than with respect to mechanically correlated price history.
Conversion outcomes exhibit serial dependence within product (persistent demand conditions, gradual assortment changes) and cross-sectional dependence within date (common demand shocks, platform-wide UI changes). Our baseline inference therefore uses two-way clustered standard errors at the product and time level for the demand estimation, and analogous clustering for kernel regressions. When computational constraints bind, we adopt product-level clustering and verify robustness using block bootstrap over time.
When demand is estimated using predicted references R̂it, we
account for first-stage uncertainty using one of two approaches. The
first is a block bootstrap: we resample products with replacement (preserving their
full time series), re-estimate the kernel (including any regime-specific
parameters), reconstruct R̂it,
and re-estimate demand. The empirical distribution of parameter
estimates yields confidence intervals that incorporate both steps and
preserve within-product serial correlation. The second is a two-step sandwich correction: we define
a combined estimator that solves
∑i, tsitR(ω) = 0, ∑i, tsitD(θ; R̂(ω)) = 0,
where sR
are kernel score/moment conditions and sD are demand
scores or GMM moments. We then compute the standard sandwich variance
using the Jacobian of the stacked system and the clustered covariance of
the stacked scores.
Given potentially many fixed effects, we report both conventional clustered standard errors and bootstrap intervals. For specifications with extreme probabilities (very low or very high conversion), we also verify that results are not driven by quasi-separation by applying weakly informative penalization (e.g., ridge on selected coefficients) as a robustness check; we treat such penalization purely as a computational device and confirm that point estimates are stable.
Although our baseline gain/loss specification is piecewise-linear in (Rit − pit), we probe robustness by allowing richer curvature: we estimate (i) spline versions in the perceived gain/loss gap, (ii) specifications that replace Rit with log Rit and use relative gaps (Rit − pit)/Rit, and (iii) models that include an interaction between Lit and the magnitude of the discount to capture the idea that badges may increase salience more when the discount is large. We report whether the implied asymmetry (η− > η+) persists across these alternatives.
Because the platform may adjust ranking or visibility in response to labels, conversions and impressions can move jointly. Our baseline demand model conditions on mit through the binomial likelihood, but this does not by itself address endogenous traffic. We therefore estimate complementary specifications: (i) models that restrict to stable-position observations (or include fine position fixed effects), (ii) a two-equation system in which log mit is modeled as a function of Lit and Rit with arm/time instruments, and (iii) conversion-per-impression models restricted to narrowly defined UI contexts. These checks help interpret ψ as a conversion-margin effect rather than a composite of conversion and traffic reallocation.
When exploiting badge thresholds, we implement ``donut’’ samples that exclude a narrow band around the cutoff and verify that estimates are stable as the excluded bandwidth varies. Even when our main estimation relies on experiments or instruments rather than RD alone, these threshold-based diagnostics provide a useful check on whether sellers precisely target the label rule in ways that could confound naive comparisons of labeled and unlabeled observations.
The outputs of this section are (i) a validated estimate of the reference-memory process (kernel weights ω̂ or smoothing α̂, and any sales-weighting/fallback parameters), and (ii) demand primitives (κ̂, η̂+, η̂−, ψ̂, γ̂) that translate changes in posted prices and displayed signals into predicted conversion changes. These primitives are the inputs for the empirical comparisons and counterfactual exercises that follow, including out-of-sample validation of conversions and labels and the long- versus short-memory assessments implied by the estimated kernel.
This section reports four sets of findings that summarize what the data say about (i) how the platform constructs the displayed reference statistic, (ii) how consumers respond to the resulting perceived gains and losses, (iii) how these objects vary across product categories, and (iv) whether the estimated model predicts held-out outcomes (references, labels, and conversions) with economically meaningful accuracy. Our organizing theme is that reference-price design is not merely a reporting choice: it creates an evaluative frame in the consumer decision problem and therefore shapes both demand elasticities and sellers’ incentives to manage the displayed signal.
We begin by estimating the finite-lag kernel mapping from lagged posted prices to the displayed reference. The key empirical question is whether the platform’s statistic behaves like a short-memory exponential smoother (large weight on very recent prices) or like a long-memory window (substantial weight on older prices). Intuitively, short memory makes the reference ``track’’ current prices and therefore weakens the scope for intertemporal anchoring, while long memory makes the reference persistent and amplifies the salience of discounts relative to a historically elevated benchmark.
Across product panels with sufficiently rich price variation, the unrestricted K-lag specification produces a weight profile that declines slowly with the lag rather than concentrating primarily on pi, t − 1. In other words, the estimated kernel places nontrivial mass on prices from several weeks (or more) in the past, consistent with an ARM-style statistic rather than a pure ESM. This is visible both in the shape of ω̂ and in the implied effective memory length computed from the weight distribution (e.g., the cumulative mass placed on lags beyond a short window). When we impose the geometric restrictions associated with ESM, the resulting fit deteriorates in precisely the episodes where sellers engage in pronounced temporary price changes (promotions and post-promotion reversions): the observed reference remains elevated relative to current prices for too long to be reconciled with rapid exponential decay, and it does so in a manner that is well-captured by a finite window with slow decay.
Two additional diagnostics reinforce the long-memory interpretation. First, rolling-window estimation within stable policy regimes yields kernels that are broadly stable over time, which is what we would expect if the displayed statistic is generated by a relatively fixed engineering rule. The variation that remains appears concentrated in discrete breaks that align with documented (or otherwise externally detected) platform UI or policy changes. Second, out-of-sample prediction of Rit from lagged prices is materially better under the unrestricted kernel than under ESM in those same stable periods. This matters for downstream demand estimation because any systematic misspecification of Rit translates directly into measurement error in the gain/loss regressors and can attenuate η+ and η−.
When we are able to separately estimate the kernel by policy regime
or experimental arm (e.g., around a known update to the reference
algorithm), we see that the mapping from history to Rit
moves in the expected direction: regimes described by the platform as
more responsive'' correspond to weight shifting toward shorter lags, while regimes described asmore
stable’’ correspond to a flatter profile over a longer window. This
regime-level variation is useful beyond mere description: it creates
plausibly exogenous shifts in the perceived gain/loss term holding the
underlying posted-price path fixed, which is precisely the leverage we
need to distinguish reference effects from generic price effects.
We next turn to the demand estimates from the conversion model. The central question is whether consumers treat a given price gap symmetrically depending on whether pit lies below or above the displayed reference. The maintained specification makes this asymmetry transparent by allowing separate slopes on (Rit − pit)+ and (pit − Rit)+. The empirical pattern is clear: we estimate η− > η+, meaning that being priced above the reference is substantially more damaging to conversion than the corresponding benefit of being equally far below the reference.
We interpret this finding as evidence of reference-dependent preferences and loss aversion in the purchase funnel. Practically, the platform statistic acts as an evaluative frame: conditional on a product’s baseline popularity and common time shocks, consumers penalize ``over-reference’’ pricing sharply. This also helps rationalize the pervasive clustering of observed prices at or just below the reference in raw data: even absent an explicit badge, the reference statistic itself generates a kink in demand. Importantly, this asymmetry is not mechanically driven by the fact that Rit is a function of past prices. In specifications that exploit experimental or policy-driven variation in Rit (through shifts in gt(⋅)), the asymmetry persists, suggesting it reflects consumer response to the displayed signal rather than a reduced-form correlation between current prices and past pricing histories.
Deal labels, when present, provide an additional and policy-relevant layer. In the baseline model that includes a separate label shifter ψLit, we typically find that the badge has an incremental effect above and beyond the continuous gain/loss terms, consistent with a salience channel: the label simplifies the consumer’s inference problem and draws attention to the discount. At the same time, the presence of both the piecewise gap and the label allows us to separate two notions of ``deal effectiveness’’: (i) the marginal value of lowering price relative to the reference, and (ii) the discrete impact of crossing the platform’s threshold and receiving a badge. This distinction is important for platform design because the platform can adjust τ (the threshold) without changing the underlying reference rule, thereby shifting the incidence of labels and the strength of strategic bunching incentives.
We also corroborate the label mechanism with local evidence around the badge cutoff. In RD-style plots (and corresponding local regressions), conversion exhibits a discrete jump when crossing into the labeled region, while smooth covariates show no commensurate discontinuity. We treat this as a diagnostic rather than as the sole identification strategy: it supports the interpretation that Lit changes consumer behavior at the margin, and it reassures us that our parametric specification is not merely capturing unrelated nonlinearities in price sensitivity. We nonetheless acknowledge a key limitation: if sellers precisely manipulate prices to qualify for the badge, local randomization may fail. Our donut and density checks suggest that while some bunching exists, the qualitative evidence of a label-induced discontinuity is robust to excluding a narrow neighborhood around the cutoff.
Both the kernel and the demand response vary systematically across product categories, in ways that align with the economic logic of search costs, price dispersion, and repeat-purchase frequency. On the reference-formation side, some categories exhibit more persistent displayed references (flatter kernels), which can arise either because the platform intentionally prioritizes stability in those categories (to avoid confusing consumers with rapidly changing ``usual’’ prices) or because the underlying price processes are themselves smoother and make alternative kernels observationally similar. Where we observe explicit regime differences by category, we interpret them as design choices that trade off responsiveness against manipulability.
On the demand side, categories that consumers plausibly understand well (high purchase frequency, low differentiation) tend to have larger absolute price sensitivity κ but a weaker incremental role for the reference gap, consistent with consumers relying more on their own internal price knowledge. Conversely, in categories where quality is heterogeneous and search is costly, the displayed reference appears to carry more weight: both η+ and η− are larger in magnitude, and the badge effect ψ is more pronounced. The same logic applies within category to products with sparse review information or newer listings: when intrinsic value is harder to infer, the platform-provided comparison point becomes more influential.
We emphasize that heterogeneity is not merely descriptive; it affects how one should evaluate platform policy. A longer-memory reference in a category with strong loss aversion will tend to increase the prevalence of markdown dynamics (because the reference stays elevated long enough for subsequent discounts to look attractive), while the same long-memory rule in a category with weak reference dependence may have little effect beyond smoothing a displayed statistic. These patterns foreshadow the counterfactual exercises that follow: the welfare consequences of ARM versus ESM or sales-weighted references are not uniform across the catalog, and a ``one-size-fits-all’’ reference algorithm may be suboptimal if the platform can feasibly tailor rules at the category level.
Finally, we evaluate whether the estimated model meaningfully predicts held-out outcomes, which serves two purposes. First, it checks that we have captured the platform’s rule with sufficient fidelity to treat Rit as a credible observed (or generated) regressor. Second, it assesses whether the gain/loss structure improves prediction of conversions relative to conventional price-only models.
For reference prediction, we split the panel into training and test windows within each stable policy regime and forecast Rit using only lagged prices (and, when applicable, lagged sales for the sales-weighted variant). The unrestricted kernel delivers low bias in the level of the reference and tracks turning points well, whereas the ESM specification tends to overreact to recent price changes and then mean-revert too quickly, producing systematic forecast errors after sharp promotions. Importantly, the improvement is not confined to a few extreme events: the kernel outperforms ESM across a wide range of products, suggesting that long-memory is a broad feature rather than an artifact of a small subset of promotional SKUs.
For label prediction, we leverage the mechanical threshold rule Lit = 1{pit ≤ (1 − τ)Rit}. Using observed (or predicted) Rit and the known τ, we can predict label incidence almost perfectly in regimes where the rule is strictly enforced, and deviations (false positives/negatives) are informative about engineering exceptions or UI suppression. This exercise is useful because it identifies parts of the data where labels are not reliably generated by the canonical rule; excluding or separately modeling those observations reduces noise in the estimated ψ.
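Because label incidence is a deterministic function of (pit, Rit, τ) under the canonical rule, the prediction exercise reduces to evaluating the threshold. A minimal sketch (prices, reference, and τ are illustrative); comparing predicted to observed labels then flags the engineering exceptions and UI suppression discussed above:

```python
import numpy as np

def predict_labels(p, R, tau):
    """Mechanical badge rule: label shown iff p <= (1 - tau) * R."""
    return (np.asarray(p) <= (1.0 - tau) * np.asarray(R)).astype(int)

p = np.array([90.0, 96.0, 80.0, 100.0])
R = np.array([100.0, 100.0, 100.0, 100.0])
tau = 0.05                       # 5% discount threshold (illustrative)
L_hat = predict_labels(p, R, tau)

# Deviations from observed labels indicate observations not generated
# by the canonical rule (candidates for exclusion or separate modeling).
L_obs = np.array([1, 0, 1, 1])
mismatch = (L_hat != L_obs)
```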
For conversion prediction, we compare the out-of-sample log loss (and related metrics) of three nested models: (i) a price-only logit with fixed effects, (ii) a symmetric reference-gap model that uses Rit − pit with a single slope, and (iii) the baseline asymmetric gain/loss specification with the badge term. The asymmetric model yields the best predictive performance, particularly in the tails of the price-gap distribution where the kink at pit = Rit matters most. Substantively, this indicates that the gain/loss structure is not simply an overfit to in-sample nonlinearities; it captures stable behavior that generalizes across time and across products.
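The out-of-sample comparison rests on the binary log loss, which can be computed directly from predicted conversion probabilities; a minimal implementation:

```python
import numpy as np

def log_loss(y, q, eps=1e-12):
    """Average binary log loss of predicted conversion probabilities q
    against realized outcomes y; probabilities are clipped for stability."""
    y = np.asarray(y, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(q) + (1.0 - y) * np.log(1.0 - q)))
```

Lower values indicate better calibrated predictions; the nested models (i)-(iii) are compared by evaluating this metric on the held-out window.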
We close with a limitation and a practical takeaway. The limitation is that conversion data alone cannot fully disentangle attention (traffic) reallocation from conversion conditional on exposure when the platform simultaneously changes ranking and labeling. Our robustness checks that control for position and restrict to stable UI contexts mitigate this concern but cannot eliminate it entirely. The practical takeaway is that, despite this caveat, the combination of (i) a long-memory reference statistic and (ii) loss-averse consumer response is strongly supported by the data and is quantitatively important for predicting both the incidence and effectiveness of discounts. These empirical primitives are precisely what we need to discipline the counterfactual platform-design exercises in the next section.
We now use the estimated primitives to evaluate how alternative designs of the platform reference statistic change market outcomes. The object of interest is not the displayed reference Rit per se, but the equilibrium it induces: changing the algorithm gt(⋅) changes consumers’ perceived gains and losses and, through seller best responses, changes the posted price path that future references are computed from. This feedback loop is the core policy tradeoff. A more persistent reference can stabilize what consumers see and make discounts salient, but it can also create incentives to manage the state variable Rit by temporarily raising prices. Conversely, a more responsive reference can reduce intertemporal anchoring and manipulation, but may compress perceived discounts and thereby attenuate the platform’s own promotional instruments.
We consider three canonical algorithms that span the design space discussed in practice. In each case we hold fixed the estimated demand parameters (κ, η+, η−, ψ, γ), product and time shifters (δi, λt), costs ci, and the exposure process (mit, Xit); we then change only the mapping from history to the displayed signals (Rit, Lit).
We define an ``ARM'' reference as a flat (or slowly decaying) $K$-lag kernel on posted prices,
\[
R_{it}^{\mathrm{ARM}}=\sum_{k=1}^K \omega_k^{\mathrm{ARM}} p_{i,t-k},
\]
where $\omega^{\mathrm{ARM}}$ is calibrated to represent a long window (e.g., approximately uniform weights over a fixed horizon) while preserving the same support $K$ as in the status quo. This captures the design choice of making a ``usual price'' persist and thus remain informative about a longer price history.
We define an exponentially smoothed reference,
RitESM = (1 − α)pi, t − 1 + αRi, t − 1,
with α chosen to match a
targeted half-life (or, alternatively, to minimize prediction error for
the observed Rit
within the subset of regimes that appear most responsive). Relative to
ARM, ESM compresses the influence of older prices and makes Rit
track recent postings.
Finally, we consider a sales-weighted statistic that updates based on
transaction prices rather than merely posted prices:
$$
R_{it}^{\mathrm{SW}}=\frac{\sum_{k=1}^K \omega_k\,
y_{i,t-k}p_{i,t-k}}{\sum_{k=1}^K \omega_k\, y_{i,t-k}},
$$
with the standard fallback rule when the denominator is zero (we use the
previous reference, Ri, t − 1,
as a conservative default). This design is motivated by a manipulability
concern: a posted price that generates no demand should carry little
informational content about the prevailing transaction price and
therefore should have limited influence on the future anchor.
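The three algorithms can be summarized in a few lines of code. This sketch assumes, purely for illustration, a uniform kernel for ARM and for the sales-weighted ω, and iterates the ESM recursion over a supplied price history; the platforms' actual rules are proprietary:

```python
import numpy as np

def ref_arm(prices, K):
    """ARM: uniform average over the last K posted prices (flat kernel)."""
    return float(np.mean(prices[-K:]))

def ref_esm(prices, alpha, R0=None):
    """ESM: R_t = (1 - alpha) * p_{t-1} + alpha * R_{t-1},
    iterated over the history, initialized at the first price if R0 is None."""
    R = prices[0] if R0 is None else R0
    for p in prices:
        R = (1.0 - alpha) * p + alpha * R
    return float(R)

def ref_sw(prices, sales, K, R_prev):
    """Sales-weighted: lagged prices weighted by lagged sales (uniform omega here);
    falls back to the previous reference when no transactions occurred."""
    p = np.asarray(prices[-K:], dtype=float)
    y = np.asarray(sales[-K:], dtype=float)
    if y.sum() == 0:
        return float(R_prev)
    return float((y * p).sum() / y.sum())
```

Note how a posted spike to 120 with zero sales moves the ARM reference but leaves the sales-weighted reference untouched, which is exactly the manipulability concern the design targets.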
In all counterfactuals, the deal label is generated by the same
threshold form,
Lit = 1{pit ≤ (1 − τ)Rit},
unless we explicitly vary τ as
an additional policy lever. Thus, when we change g(⋅) we mechanically change label
incidence even holding prices fixed; when we additionally allow sellers
to respond, we may also generate strategic bunching around the
cutoff.
To compute equilibrium pricing under each algorithm we implement two
seller decision rules that bracket plausible behavior. Our baseline is a myopic best response: for each (i, t) the seller chooses
$$
p_{it}\in\arg\max_{p\in[\underline p,\bar p]} \ (p-c_i)\, m_{it}\,
q_{it}(p,R_{it},L_{it},X_{it}),
$$
taking the displayed state (Rit, Lit)
as given and ignoring the impact of pit on
future references. Under the logit demand we solve this one-dimensional
problem by direct search and enforce the price bounds. In addition, we
report a dynamic benchmark that internalizes the law of motion for Rit: we
solve a finite-horizon dynamic program with state variable R (and, for the sales-weighted
design, the sufficient statistics needed to update the weighted
average), using value function iteration on a discretized state grid.
Because the state is scalar under posted-price kernels and ESM, this
benchmark is computationally stable and provides an interpretable upper
bound on the extent of intertemporal reference management that a
forward-looking seller could find profitable.
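A minimal sketch of the myopic rule under the logit demand (all parameter values illustrative) makes the bunching incentive visible: when the badge is valuable, the grid-search optimum sits at or just below the label cutoff (1 − τ)R.

```python
import numpy as np

def myopic_price(c, R, m, kappa, eta_plus, eta_minus, psi, tau, base,
                 p_lo, p_hi, n_grid=501):
    """Myopic best response: maximize (p - c) * m * q(p, R, L) on a price grid,
    taking the displayed reference R as given (no effect on future R)."""
    best_p, best_pi = p_lo, -np.inf
    for p in np.linspace(p_lo, p_hi, n_grid):
        L = 1.0 if p <= (1.0 - tau) * R else 0.0          # threshold badge
        V = (base - kappa * p + eta_plus * max(R - p, 0.0)
             - eta_minus * max(p - R, 0.0) + psi * L)
        q = 1.0 / (1.0 + np.exp(-V))                      # logit conversion
        pi = (p - c) * m * q                              # per-period profit
        if pi > best_pi:
            best_pi, best_p = pi, p
    return best_p, best_pi

# Illustrative parameters: a valuable badge pulls the optimum to the cutoff.
p_star, pi_star = myopic_price(c=50.0, R=100.0, m=1000.0, kappa=0.05,
                               eta_plus=0.10, eta_minus=0.30, psi=2.0,
                               tau=0.05, base=5.0, p_lo=50.0, p_hi=150.0)
```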
Given a candidate algorithm and seller policy, we simulate paths forward in time. We initialize the reference using the observed pre-period history (or the first observed Rit when available) and then iterate: (i) compute pit from the seller rule, (ii) compute Lit from the threshold, (iii) compute expected conversion qit and expected sales 𝔼[yit] = mitqit, and (iv) update Ri, t + 1 using the counterfactual algorithm. We use expected sales in the reference update so that comparisons across algorithms reflect systematic incentives rather than simulation noise; when reporting uncertainty, we add a Monte Carlo layer that draws yit ∼ Binomial(mit, qit) and bootstrap the estimated demand parameters.
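The simulation loop (i)-(iv) can be sketched as follows. For concreteness the sketch uses the ESM reference update and toy demand and pricing rules; the same structure applies to the other algorithms by swapping the update step:

```python
def simulate_path(T, c, m, tau, alpha, demand, price_rule, R0):
    """Forward-simulate (price, reference, label, conversion, expected sales).

    demand(p, R, L) -> conversion probability; price_rule(c, R) -> posted price.
    Uses expected sales m * q so paths reflect systematic incentives, not noise.
    """
    R, path = R0, []
    for t in range(T):
        p = price_rule(c, R)                        # (i)  seller decision
        L = 1.0 if p <= (1.0 - tau) * R else 0.0    # (ii) threshold label
        q = demand(p, R, L)                         # (iii) expected conversion
        Ey = m * q                                  #       expected sales
        path.append((p, R, L, q, Ey))
        R = (1.0 - alpha) * p + alpha * R           # (iv) ESM reference update
    return path

# Toy ingredients (flat demand, cost-plus pricing) chosen only to exercise the loop.
path = simulate_path(T=3, c=50.0, m=100.0, tau=0.05, alpha=0.5,
                     demand=lambda p, R, L: 0.5,
                     price_rule=lambda c, R: c + 10.0,
                     R0=100.0)
```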
We report three families of outcomes aggregated over products and
time. First, seller revenue and gross merchandise value (GMV) coincide
in our setting, GMV = ∑i, tpityit,
and we also report seller profit Π = ∑i, t(pit − ci)yit
to isolate changes driven by margin versus volume. Second, we report
conversion and units, ∑i, tyit
and ∑i, tyit/∑i, tmit,
to clarify whether an algorithm primarily moves prices or primarily
moves quantities through perceived gains and losses. Third, we compute
consumer surplus (CS) under the estimated discrete-choice model. With an
outside option normalized to zero and purchase utility index
Vit(p, R, L, X) = δi + λt + Xit′γ − κp + η+(R − p)+ − η−(p − R)+ + ψL,
the per-impression expected surplus is the log-sum,
$$
\mathrm{cs}_{it}=\frac{1}{\kappa}\log\!\left(1+\exp(V_{it})\right),
$$
and total surplus is CS = ∑i, tmit csit.
While this measure inherits the maintained logit structure (and
therefore should be interpreted as a model-based welfare index), it is
well-suited for counterfactual comparisons because it aggregates exactly
the margin affected by reference design: how the platform changes the
attractiveness of purchase relative to the outside option by shifting
perceived gains/losses and labels.
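The log-sum surplus has a closed form under the binary logit; a numerically stable implementation of the per-impression and aggregate objects:

```python
import numpy as np

def per_impression_cs(V, kappa):
    """Binary-logit log-sum surplus: cs = (1/kappa) * log(1 + exp(V)),
    computed as max(V, 0) + log1p(exp(-|V|)) to avoid overflow."""
    V = np.asarray(V, dtype=float)
    return (np.maximum(V, 0.0) + np.log1p(np.exp(-np.abs(V)))) / kappa

def total_cs(V, m, kappa):
    """CS = sum over (i, t) of m_it * cs_it."""
    return float(np.sum(np.asarray(m) * per_impression_cs(V, kappa)))
```

Division by κ converts the utility-scale index into money-metric units, which is what makes the measure comparable across counterfactual algorithms.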
Two robust patterns emerge. First, moving from long memory (ARM) to
short memory (ESM) mechanically reduces the persistence of perceived
discounts: after a temporary price cut, RitESM
falls quickly, shrinking (Rit − pit)+
and reducing label incidence for a fixed price path. Holding prices
fixed at their observed values, this ``information'' effect tends to lower predicted conversion during promotions (because the same nominal discount appears smaller relative to the reference) and raises predicted conversion in post-promotion periods (because the reference no longer remains artificially high). Second, when we let sellers respond, ESM dampens intertemporal incentives: because today's price has less influence on tomorrow's displayed anchor, we observe less price cycling and a tighter link between $p_{it}$ and contemporaneous demand shifters. In categories where estimated loss aversion is strong (high $\eta_-$), this smoothing effect can increase welfare by reducing the frequency of episodes in which products appear overpriced relative to their reference, even if it also reduces the number of ``dramatic'' discounts.
Under ARM, the same mechanisms run in the opposite direction. Long memory keeps the reference elevated after high-price episodes and therefore sustains the potential benefit from later markdowns: lowering pit below a persistent Rit yields a larger gain term and more frequent labeling. In the myopic equilibrium, this primarily shows up as greater price dispersion and more mass of prices just at or below the deal cutoff (1 − τ)Rit. In the dynamic benchmark, the incentive is stronger: forward-looking sellers find it valuable to occasionally post higher prices to lift future Rit and then harvest the gain region with subsequent markdowns. The quantitative importance of this channel depends on the estimated asymmetry: when η− > η+, the dynamic policy is disciplined by the fact that pricing above the reference is demand-destructive, so profitable ``anchor inflation’’ must be paired with periods in which demand is intrinsically low (low mit or unfavorable Xit) or in which the seller can avoid being far above Rit given its persistence. Thus, ARM is most distortionary precisely in environments where sellers can time high prices into low-attention periods, a pattern that is consistent with the intuition behind intertemporal reference management.
Sales-weighting changes the equilibrium in a distinct way: it targets manipulability rather than memory. For a given weight profile ω, the sales-weighted influence of a posted price on future R is proportional to realized (or expected) sales at that price. As a result, transient price spikes with little demand have little effect on the anchor, and the equilibrium exhibits less incentive to ``post high, sell low.’’ In our simulations this typically reduces extreme reference levels and compresses the distribution of (pit − Rit)+, which is valuable when η− is large because it avoids large over-reference penalties. At the same time, sales-weighting preserves the possibility of persistent references in genuinely stable, high-volume periods, so it can maintain much of the perceived-discount salience of long memory without relying on non-transactional postings.
To clarify what drives welfare differences, we implement a simple decomposition that separates the direct impact of changed displayed signals from the indirect impact through seller repricing. For each counterfactual algorithm 𝒜 ∈ {ARM, ESM, SW} we compute:
First, we hold prices fixed at the observed path {pitobs}
and recompute the counterfactual reference and labels (Rit𝒜, Lit𝒜)
implied by 𝒜. We then recompute
predicted conversions and surplus using the estimated demand
system:
qit𝒜, info = q (pitobs, Rit𝒜, Lit𝒜, Xit).
Differences relative to the status quo isolate how the platform’s choice
of displayed statistics changes consumer behavior holding the supply
side fixed.
We then allow sellers to reoptimize prices under 𝒜 (myopic or dynamic), generating {pit𝒜, *}
and the corresponding (Rit𝒜, *, Lit𝒜, *).
The incremental difference between the full equilibrium counterfactual
and the information-only counterfactual captures the induced repricing
effect:
Δstrat(𝒜) = [W (p𝒜, *, R𝒜, *, L𝒜, *) − W (pobs, R𝒜, L𝒜)],
for each welfare object W ∈ {Π, GMV, CS}.
Economically, this is the portion of the policy effect that arises
because the reference algorithm changes sellers’ incentives to manage
the state.
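The decomposition reduces to three welfare evaluations; a schematic implementation, where W is any scalar welfare object (Π, GMV, or CS) evaluated at a price path and its associated (R, L) signals:

```python
def decompose(W, p_obs, RL_obs, RL_cf_obs, p_cf, RL_cf_eq):
    """Two-step policy decomposition for a welfare object W(p, R, L).

    RL_obs:    status-quo (R, L) at observed prices
    RL_cf_obs: counterfactual algorithm's (R, L) recomputed from observed prices
    RL_cf_eq:  counterfactual algorithm's (R, L) at reoptimized prices p_cf
    """
    W0 = W(p_obs, *RL_obs)          # status quo
    W_info = W(p_obs, *RL_cf_obs)   # information-only: signals change, prices fixed
    W_full = W(p_cf, *RL_cf_eq)     # full equilibrium under the new algorithm
    return W_info - W0, W_full - W_info   # (Delta_info, Delta_strat)
```

The first component is the direct effect of changed displayed statistics; the second is the induced repricing effect.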
This decomposition is informative because it distinguishes two interpretations of a ``better’’ reference design. A design may improve welfare because it provides a more informative and less distortable comparison point (a direct effect on consumer choice given prices), or because it discourages intertemporal manipulation and thereby changes equilibrium pricing (an indirect effect). In our setting the second component is often decisive: algorithms that look similar in the information-only exercise can produce materially different outcomes once sellers internalize how today’s postings shape tomorrow’s anchor and badge eligibility.
Stepping back, the counterfactuals highlight a practical design lesson. Reference statistics and deal labels are often treated as presentation features, but in a marketplace they function as policy instruments that jointly determine (i) the mapping from price histories into salient consumer frames and (ii) the incentives sellers face to manage those frames. Long memory increases the state dependence of demand and thus the return to intertemporal price shaping; sales-weighting reduces the scope for such shaping by tying the state to transactions; short memory reduces it by making the state harder to move in a persistent way. In the next section, we examine how robust these conclusions are to competition, alternative demand forms, and richer heterogeneity in consumer segments and seller behavior.
Our counterfactual exercises rely on a parsimonious set of primitives: a displayed reference statistic that is a deterministic function of past prices (and possibly past sales), a demand system in which perceived gains and losses enter in a piecewise-linear way, and seller pricing rules that are either myopic or forward-looking with respect to the reference state. In this section we probe how sensitive the qualitative conclusions are to three classes of departures that are central in marketplace applications: (i) multi-seller competition for the same consumer, (ii) alternative functional forms for demand and the mapping from conversion to welfare objects, and (iii) heterogeneity and dynamics on the consumer side (segment-specific and potentially state-dependent sensitivity to reference comparisons). Throughout, the goal is not to ``improve fit’’ mechanically, but to assess whether the main design tradeoff—reference persistence versus manipulability—survives once we relax simplifying assumptions.
The baseline environment treats each SKU as having a single price setter. This is appropriate for many private-label products and for listings with a single effective seller, but it abstracts from a common setting in which several sellers simultaneously offer the same product (or close substitutes) and compete for the same impressions. Two issues then arise. First, equilibrium prices respond not only to the displayed reference but also to rivals’ prices. Second, the platform may compute a reference statistic at the product level (shared across sellers) rather than at the seller-offer level, which can amplify externalities: one seller’s posted price can move the reference that affects demand for all sellers.
To address these concerns we implement a multi-seller extension in
which, for a given product i
and time t, a set of sellers
j ∈ {1, …, Ji}
simultaneously choose prices {pijt}.
Consumers choose among offers and the outside option. A tractable
specification that nests our baseline is an offer-level logit with a
shared reference term:
$$
q_{ijt}=\frac{\exp\!\left(V_{ijt}\right)}{1+\sum_{\ell=1}^{J_i}\exp\!\left(V_{i\ell t}\right)},\qquad
V_{ijt}=\delta_{ij}+\lambda_t+X_{ijt}'\gamma-\kappa p_{ijt}+\eta_+(R_{it}-p_{ijt})_+-\eta_-(p_{ijt}-R_{it})_+ + \psi L_{ijt}.
$$
Seller j earns πijt = (pijt − cij) mit qijt,
where mit is
product-level traffic that is endogenously split across offers by the
choice model. We consider two reference constructions: (a) an
offer-level reference Rijt
computed from pij, t − k
(which largely preserves the single-seller logic offer-by-offer), and
(b) a product-level reference Rit
computed from a traffic- or sales-weighted average of offers’
transaction prices, which introduces a within-product externality.
In the myopic case, the first-order condition for each seller resembles the monopoly condition but with an additional diversion term, as in standard differentiated Bertrand. We solve the static Nash equilibrium by iterating best responses on a price grid, and we repeat the counterfactual algorithm changes holding fixed the same estimated demand parameters (or, when feasible, re-estimating demand with offer fixed effects). Two patterns are robust. First, the comparative statics with respect to reference memory persist: longer memory increases the return to being in the gain region, which raises the incentive to place mass near the label cutoff even under competition. Second, sales-weighting continues to reduce the payoff to ``no-sale’’ price spikes, but now with an additional benefit: when R is computed at the product level, sales-weighting reduces the extent to which one seller can impose a higher anchor on rivals. In other words, the externality created by a shared reference is disciplined when only realized transactions meaningfully update the state.
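A sketch of the offer-level logit with a shared product-level reference and the grid best-response iteration we use for the static Nash equilibrium (parameter values illustrative):

```python
import numpy as np

def offer_shares(prices, R, deltas, kappa, eta_plus, eta_minus, psi, tau):
    """Offer-level logit shares with a shared product-level reference R
    and an outside option normalized to zero."""
    p = np.asarray(prices, dtype=float)
    labels = (p <= (1.0 - tau) * R).astype(float)
    V = (deltas - kappa * p + eta_plus * np.maximum(R - p, 0.0)
         - eta_minus * np.maximum(p - R, 0.0) + psi * labels)
    eV = np.exp(V)
    return eV / (1.0 + eV.sum())

def best_response_nash(costs, R, deltas, params, grid, n_iter=200):
    """Gauss-Seidel best-response iteration on a price grid (static Nash)."""
    kappa, eta_plus, eta_minus, psi, tau = params
    p = np.full(len(costs), grid[len(grid) // 2], dtype=float)
    for _ in range(n_iter):
        p_new = p.copy()
        for j, c in enumerate(costs):
            profits = []
            for g in grid:
                trial = p_new.copy()
                trial[j] = g
                s = offer_shares(trial, R, deltas,
                                 kappa, eta_plus, eta_minus, psi, tau)
                profits.append((g - c) * s[j])
            p_new[j] = grid[int(np.argmax(profits))]
        if np.array_equal(p_new, p):
            break
        p = p_new
    return p

# Illustrative symmetric duopoly sharing one product-level reference.
params = (0.05, 0.10, 0.30, 1.0, 0.05)  # (kappa, eta_plus, eta_minus, psi, tau)
grid = np.linspace(50.0, 150.0, 101)
p_star = best_response_nash(costs=[50.0, 50.0], R=100.0,
                            deltas=np.array([2.0, 2.0]), params=params, grid=grid)
```

Because both sellers compare against the same R, one seller's posted price enters rivals' demand through the shared reference, which is the within-product externality discussed above.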
That said, competition changes magnitudes in a predictable direction. Because rivals provide a substitute, any seller who prices above R (entering the loss region) loses share quickly when η− > 0, making ``anchor inflation'' especially unattractive in high-competition products. This moderates the dynamic manipulation channel relative to a monopoly benchmark. Conversely, if the label confers a discrete demand boost (captured by ψLijt) and consumers search at the product level, competition can increase the incentive to price just below the cutoff to win the label and capture demand from rivals. In this sense, competition can amplify the salience of the platform's labeling rule even if it reduces the profitability of long-run reference management.
Our welfare calculations and counterfactual simulations use a logit conversion model primarily for tractability and for its clean mapping to consumer surplus. Two concerns are natural: (i) whether the conclusions hinge on the logit link (rather than, say, a linear probability or count model), and (ii) whether the piecewise-linear gain/loss structure is too restrictive, particularly around p ≈ R where small changes in R can mechanically flip the sign of the reference term.
We therefore re-run the main exercises under three alternative demand specifications.
We estimate a linear model for expected purchases,
𝔼[yit ∣ pit, Rit, Xit] = mit(bi + λt + Xit′γ − apit + η+(Rit − pit)+ − η−(pit − Rit)+),
and truncate implied purchase probabilities to [0, 1] when needed. This specification yields
closed-form best responses in each region and is useful for
stress-testing whether the equilibrium feedback we highlight is an
artifact of nonlinear choice. The qualitative ranking of algorithms is
unchanged: longer memory raises the intertemporal value of elevating
R and increases price
dispersion; sales-weighting attenuates the influence of low-demand price
postings on future R and
reduces volatility; short memory reduces both perceived discount
persistence and the state dependence of optimal pricing. What changes is
the decomposition of effects: with linear demand, the marginal impact of
being in the gain region is constant within the region, so strategic
bunching at the label cutoff is less sharp unless we include a discrete
label shift ψ.
When yit is better interpreted as a count generated by a purchase rate rather than a binomial conversion from fixed impressions, we estimate a Poisson model for 𝔼[yit] with log link and an offset log mit. This preserves the multiplicative structure of demand shocks and is more forgiving about overdispersion. The estimated gain/loss asymmetry is similar, and counterfactual rankings remain stable. In particular, the sales-weighted reference continues to dominate unweighted posted-price references in environments with large transitory demand shocks because it reduces the sensitivity of future R to prices posted in periods that generate few transactions.
To relax the kink at p = R, we replace the
piecewise-linear term with a differentiable approximation, e.g.
$$
\eta\,(R-p)\tanh\!\left(\frac{R-p}{\sigma}\right),
$$
or a spline in (p − R). These alternatives
allow the data to determine whether sensitivity changes sharply at p = R or more gradually.
Empirically, the kink is often present but not infinitely sharp,
consistent with consumer inattention or noisy perception of the
reference. Importantly, the policy implications are not driven by the
kink itself: the main driver is the asymmetry between over-reference and
under-reference pricing, which remains present under smooth approximations.
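To make the smooth relaxation concrete, the following sketch uses a softplus smoothing of the piecewise term (our own stand-in for the tanh or spline alternatives; the smoothing scale σ plays the same role), which converges to the kinked specification as σ → 0 while remaining differentiable at p = R:

```python
import numpy as np

def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    x = np.asarray(x, dtype=float)
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def piecewise_term(g, eta_plus, eta_minus):
    """Asymmetric gain/loss term in the gap g = R - p:
    eta_plus * g_+ - eta_minus * (-g)_+."""
    return eta_plus * np.maximum(g, 0.0) - eta_minus * np.maximum(-g, 0.0)

def smooth_term(g, eta_plus, eta_minus, sigma):
    """Softplus smoothing of the same term; differentiable at g = 0 and
    converging to piecewise_term as sigma -> 0."""
    return (eta_plus * sigma * _softplus(g / sigma)
            - eta_minus * sigma * _softplus(-g / sigma))
```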
Across these alternatives, the main limitation is interpretability of welfare. When we move away from logit, we lose the exact log-sum expression for consumer surplus. For robustness we therefore report (i) compensating-variation style indices under a maintained extreme-value error only for the subset of models that admit it, and (ii) ``behavioral surplus’’ proxies that integrate the predicted change in purchase utility (excluding the idiosyncratic shock) over impressions. The algorithm ranking is stable under both approaches, though levels are model-dependent.
Reference pricing is unlikely to be homogeneous: experienced consumers may treat ``usual price’’ skeptically, while new consumers may rely on it as a heuristic; some categories (e.g., commoditized electronics) may be more reference-sensitive than others (e.g., niche goods). We incorporate heterogeneity in two complementary ways.
First, we estimate segment-specific parameters by interacting the
gain/loss terms and the label with observable segment indicators S ∈ {new, returning} (or device
type, traffic source, and category bins):
Vit = ⋯ + η+, S(Rit − pit)+ − η−, S(pit − Rit)+ + ψSLit.
A consistent pattern is that returning users exhibit smaller ψS and smaller
η+, S,
while η−, S is often
comparable or larger. This is consistent with a behavioral
interpretation in which frequent shoppers discount promotional framing
but still react negatively to offers that look overpriced relative to a
displayed benchmark. In counterfactual simulations, this heterogeneity
matters for distributional effects: designs that generate more frequent ``big
discounts’’ (long memory) shift surplus toward new users, while designs
that reduce over-reference episodes (sales-weighting and short memory)
disproportionately benefit returning users.
Second, we allow for continuous heterogeneity by estimating a random-coefficients version in which (η+, η−, ψ) follow a parametric distribution across users (approximated at the impression level using observed mix-shifters). While we cannot recover full user-level histories in many datasets, a mixed-logit approximation is still informative about the tails: a small mass of highly reference-sensitive consumers can make labels valuable even if average effects are modest. Under this extension, sales-weighting becomes particularly attractive as a design: it reduces the ability to ``manufacture’’ high anchors that disproportionately exploit the most reference-sensitive segment, while preserving genuine informational content from high-volume transaction periods.
A deeper concern is that the displayed reference statistic may not be the only reference consumers use. Consumers may carry an internal reference from their own browsing or prior purchases, and they may update beliefs about what R means when platform policies change. To examine whether our findings are an artifact of treating R as an exogenous shifter of utility, we implement two reduced-form dynamic-consumer checks.
We augment utility with an internal state R̃it
that updates via exponential smoothing of observed prices,
R̃i, t = (1 − ρ)pi, t − 1 + ρR̃i, t − 1,
and allow demand to depend on both Rit and
R̃it
through gain/loss terms. Identification is challenging without panel
user histories, but we can bound the importance of R̃ by exploiting moments that compare
short-run responses around policy shifts. The main finding is that the
displayed R retains
incremental explanatory power even when we allow for a persistent
internal reference, suggesting that platform design changes have
first-order effects on consumer framing rather than merely reflecting
what consumers already know.
We also allow the probability that a consumer attends to the
reference signal to vary with salience, proxied by whether a label is
shown and by the magnitude of the implied discount. A simple
specification is
Vit = ⋯ + Ait[η+(Rit − pit)+ − η−(pit − Rit)+] + ψLit,
where Ait ∈ [0, 1]
is an attention index (logit in observables). This extension tends to amplify
the difference between long- and short-memory references in the
information channel (because persistent references generate more salient
gaps), but it does not overturn the manipulation logic: if anything,
greater attention makes manipulability more consequential and therefore
strengthens the case for sales-weighting or other guardrails.
Finally, we implement several diagnostics that address threats to identification and to the interpretation of the counterfactuals.
We vary K and re-estimate the kernel weights (or α) where applicable, confirming that the estimated ``memory mass’’ is not driven by an arbitrary horizon choice. Model selection metrics typically favor a medium-to-long memory in posted-price references, but the welfare ranking of sales-weighting versus unweighted references is stable across K.
When using threshold-based identification, we check for smoothness of predetermined covariates and for the absence of discontinuities at placebo cutoffs. We also test for bunching at the cutoff to gauge strategic pricing. Importantly, bunching is informative rather than fatal: it signals that the label is valued and that the policy changes incentives, which is precisely the channel our counterfactuals seek to quantify.
Our simulations hold the exposure process (mit, Xit) fixed. To assess whether algorithm changes could materially reallocate traffic across products (e.g., via ranking), we re-weight impressions using simple rules that map predicted conversion to exposure. While this introduces additional assumptions, the direction of the main effects persists: designs that reduce over-reference penalties (when η− is large) tend to increase conversion and therefore receive more exposure under conversion-based ranking, reinforcing their welfare advantage.
Taken together, these extensions suggest that the central message is not a knife-edge consequence of a particular functional form or of a monopoly benchmark. The platform’s reference design affects outcomes through two linked mechanisms—how consumers interpret current prices and how sellers (and competing sellers) adapt their pricing to shape future frames—and the comparative advantage of transaction-based (sales-weighted) updating is especially robust in the presence of strategic behavior. The remaining uncertainty is therefore less about the sign of the key tradeoffs and more about where, in a given marketplace, the economically relevant margin lies: consumer protection against misleading anchors, seller incentives and competition, or the platform’s own use of labels as a promotional tool. In the next section we discuss how these empirical and counterfactual patterns map into governance questions about transparency, enforcement, and the appropriate design of reference-price policies.
Displayed reference prices and deal badges sit at an uncomfortable intersection of information and persuasion. On one hand, a statistic such as "usual price" can summarize a noisy price history into a single, cognitively cheap benchmark, helping consumers infer whether today's offer is unusually good or unusually bad. On the other hand, precisely because many consumers treat the benchmark as diagnostic, the statistic becomes an object of strategic attention for sellers and an object of design choice for the platform. The economic logic we develop throughout the paper clarifies this tension: the same persistence that makes a reference informative (it reflects more than a single day's idiosyncratic price) also makes it manipulable (it can be moved by past pricing actions), and the interaction of that state with asymmetric gain–loss demand generates incentives for intertemporal price shaping.
From a consumer-protection perspective, the key question is not whether a displayed reference is literally computed as described, but whether it is likely to mislead a reasonable consumer about the economic reality it purports to summarize. In our framework, a displayed Rit is best interpreted as a platform-provided signal about the distribution of recent prices or transaction prices for product i. A reference can therefore be misleading in at least three distinct (and empirically separable) ways.
First, the reference can be untruthful: the platform claims R is an "average" or "usual" price, but the algorithm is instead a marketing construct (e.g., based on list prices, MSRP, or posted prices that rarely transact). This is the most direct analogue of classic "former price" deception addressed in many jurisdictions (e.g., requirements that the reference be a genuine, recent selling price rather than an inflated sticker). In our notation, the concern is that Rit is formed from posted prices {pi, t − k} even when sales are zero, so that the statistic is mechanically easy to inflate without corresponding consumer transactions.
Second, the reference can be manipulable even if computed exactly as described, because the design amplifies the impact of strategically chosen, low-volume, or non-transactional prices. This is where the algorithmic nature of the statistic matters. When $R_{it}=\sum_{k=1}^K \omega_k p_{i,t-k}$, the influence of a transient price increase on future frames is (locally) proportional to ω1 regardless of whether anyone bought at the high price. If demand is loss-averse (η− > η+), sellers then face a predictable temptation: push R up in low-demand periods and later "discount" into the gain region. This behavior need not violate any stated computation rule, yet it can produce a sequence in which the reference systematically overstates what consumers typically pay. A consumer-protection standard focused only on the truth of the computation, rather than on its susceptibility to manipulation, would miss this channel.
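A minimal numerical sketch of this channel (our own illustration, with made-up numbers): under an unweighted K-period window, a single-period posted-price spike moves the reference by (spike)/K whether or not anyone transacts at the high price.

```python
import numpy as np

def window_reference(prices, K):
    """Unweighted posted-price reference: mean of the last K posted prices."""
    return float(np.mean(prices[-K:]))

K = 5
history = [10.0] * K                     # stable posted prices
baseline = window_reference(history, K)  # 10.0

history.append(20.0)                     # one-period spike; assume zero sales at 20
inflated = window_reference(history, K)  # 12.0: the spike moved R by (20 - 10) / K
```

Nothing in the update depends on realized sales, which is exactly the "no-sale" anchor-inflation concern.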
Third, reference pricing can be exploitative due to heterogeneity in consumer attention and interpretation. If a subset of consumers heavily weights badges or reference gaps while others ignore them, then a design that is unbiased on average may still exploit the most reference-sensitive consumers. In that case, the relevant harm is distributional and behavioral rather than purely informational. Our demand estimates and counterfactuals speak directly to this possibility: the welfare consequences of reference design depend on the mass of consumers for whom (R − p) meaningfully shifts perceived value.
These distinctions suggest a policy lesson: enforcement that treats reference pricing as a static claim about a single past price is incomplete in digital marketplaces where R is an algorithmic function of a seller’s entire recent pricing process. A natural modernization is to evaluate not only whether R reflects a past price, but whether it reflects a past price history in a way that is robust to strategic posting.
Our analysis is not a blanket indictment of reference statistics.
Instead, it highlights a design tradeoff that platforms can manage. We summarize the manipulability of a reference algorithm by the sensitivity of the next-period reference to today's posted price, holding other primitives fixed:
$$
\frac{\partial R_{i,t+1}}{\partial p_{it}}.
$$
Under unweighted posted-price kernels this sensitivity is mechanically
large and largely independent of realized sales. Under sales-weighting
it is attenuated when yit is
small, thereby tying the evolution of the state to economically
meaningful transactions. This observation leads to practical guardrails
that are implementable even when platforms do not want to abandon
reference pricing altogether.
The most direct guardrail is to compute reference statistics from transaction prices rather than posted prices whenever feasible, i.e., using a sales-weighted rule (with an explicit fallback when the denominator is small). This does not require that the platform disclose individual sales, only that the reference be anchored in realized purchases. In our counterfactuals, this design reduces equilibrium price volatility and mitigates "no-sale" anchor inflation, with especially large gains when η− is sizeable.
Sales-weighting introduces a separate concern: thin products with few transactions can yield noisy or stale references. Platforms can address this by imposing minimum effective sample thresholds (e.g., compute a reference only when ∑kωkyi, t − k exceeds a cutoff) and otherwise display ``insufficient data’’ or a broader-category benchmark. This is a transparency improvement as well: admitting uncertainty can be less misleading than displaying a precise but manipulable number.
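A hedged sketch of such a rule, combining sales-weighting with a minimum-effective-sample fallback; the function name, threshold convention, and `fallback` default are our assumptions, not a platform's production rule.

```python
import numpy as np

def sales_weighted_reference(prices, sales, weights, min_mass, fallback=None):
    """Sales-weighted reference: sum_k w_k*y_k*p_k / sum_k w_k*y_k,
    displayed only when the effective sample mass clears min_mass."""
    p = np.asarray(prices, dtype=float)
    y = np.asarray(sales, dtype=float)
    w = np.asarray(weights, dtype=float)
    mass = float(np.sum(w * y))
    if mass < min_mass:
        # e.g., None -> show "insufficient data" or a category benchmark
        return fallback
    return float(np.sum(w * y * p) / mass)
```

Note how a no-sale spike gets zero weight: posting 20 with zero transactions leaves the reference at the price consumers actually paid.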
More generally, platforms can adopt bounded-influence updating rules, for example by winsorizing transaction prices or limiting the per-period change in R. Such constraints reduce the gains from extreme, temporary prices. In addition, disclosing the time window (or half-life, in the exponential case) can help sophisticated users interpret the statistic appropriately. We emphasize, however, that disclosure is not a substitute for robustness: a disclosed but easily gamed statistic can remain harmful.
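One possible bounded-influence rule, sketched under our own parameter choices: an exponential smoother whose per-period proportional change in R is capped.

```python
def bounded_update(R_prev, p_obs, alpha=0.2, max_change=0.05):
    """Exponentially smoothed reference whose per-period proportional
    change is capped at max_change (bounded influence)."""
    target = (1 - alpha) * R_prev + alpha * p_obs
    cap = max_change * R_prev
    step = max(-cap, min(cap, target - R_prev))
    return R_prev + step

# A 3x price spike moves the reference by at most 5% per period:
# bounded_update(10.0, 30.0) -> 10.5 rather than the uncapped 14.0
```

The cap converts a one-shot extreme price into a slow drift, so moving the state materially requires sustaining the high price, which is costly when demand is loss-averse.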
Deal labels matter because they create discrete incentives around thresholds. If Lit = 1{pit ≤ (1 − τ)Rit}, then the platform effectively chooses a policy parameter τ that trades off promotional intensity against strategic bunching. A high τ reduces label prevalence but can induce sharp clustering just below the cutoff when sellers value the badge. A lower τ increases labeling frequency but can dilute informational content. One pragmatic compromise is multi-tier labeling (e.g., multiple cutoffs) paired with hysteresis or minimum-duration rules, which can reduce churning around a single threshold while still communicating meaningful price differences.
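The hysteresis idea can be sketched as a two-threshold label rule; the specific cutoffs below are illustrative choices of ours, not the platform's.

```python
def label_with_hysteresis(p, R, label_prev, tau_on=0.10, tau_off=0.07):
    """Two-threshold deal label: turns on at a 10% discount, turns off
    only when the discount falls below 7%, damping churn at one cutoff."""
    discount = (R - p) / R
    return discount >= (tau_off if label_prev else tau_on)
```

Because earning the badge requires clearing the higher bar while keeping it requires only the lower one, small price wiggles near a single threshold no longer flip the label every period.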
Digital marketplaces blur the traditional line between a retailer making claims and a neutral venue hosting third-party offers. When the platform computes and displays Rit and Lit, it is not a passive messenger; it is choosing a market-wide framing technology that shapes demand and seller incentives. This creates a governance problem with two parts.
First, platforms must police seller behavior that attempts to exploit the reference system. Even with a robust algorithm, sellers may engage in rapid price cycling, low-inventory "sales" at high prices, or other tactics intended to seed the reference. Monitoring tools naturally follow from our primitives: one can flag SKUs with repeated large increases in pit coinciding with low yit (high posted-price variance but low transaction volume), or with systematic patterns of raising R followed by discounting into the gain region. Such diagnostics are essentially tests for whether posted-price dynamics are disproportionately chosen to move the state rather than to clear demand.
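A minimal version of such a monitoring flag, with illustrative thresholds of our own choosing:

```python
import numpy as np

def flag_anchor_seeding(prices, sales, jump=0.15, low_sales=1):
    """Return the periods where the posted price jumps by more than
    `jump` (proportionally) while transactions stay at or below
    `low_sales` units -- a crude screen for reference seeding."""
    p = np.asarray(prices, dtype=float)
    y = np.asarray(sales, dtype=float)
    rel_change = np.diff(p) / p[:-1]
    return [t + 1 for t in range(len(rel_change))
            if rel_change[t] > jump and y[t + 1] <= low_sales]
```

Flagged periods are not proof of manipulation, but they identify exactly the price moves that shift the state without clearing demand, which is where closer review is warranted.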
Second, the platform must govern itself. A platform that benefits from higher conversion may have incentives to choose a reference algorithm that maximizes the perceived discount wedge, even if it weakens the connection between R and what consumers actually pay. This tension is particularly salient when platforms monetize via advertising, take rates, or paid prominence: the surplus from increased sales need not align with truthful framing. For this reason, a credible governance regime may require either external auditing or internal firewalls between teams optimizing short-run conversion and teams responsible for consumer-protection compliance.
Calls for algorithmic transparency are often vague, and they ignore a genuine tension: full public disclosure of g(⋅) can enable gaming, while opacity can enable deception. Our setting suggests a tiered approach.
Consumers benefit from simple, interpretable disclosures: the reference window (e.g., "based on the last 30 days"), whether the statistic is transaction-based (e.g., "based on prices customers paid"), and whether the reference is product-level or offer-level when multiple sellers exist. These disclosures help align consumer interpretation with the statistic's construction without providing a detailed blueprint for manipulation.
Regulators (or accredited auditors) can require more detailed, confidential documentation: the exact kernel weights ω, the treatment of missing sales, the handling of outliers, and logs that allow reconstruction of Rit for sampled items. In our framework, this is straightforward because reference formation is a deterministic mapping from recorded histories. The same auditability logic applies to label rules ht(⋅) and to any policy changes or experiments that affect what consumers see.
Independent research can play a complementary role, particularly for measuring behavioral impacts and heterogeneity. Platforms can enable this by providing privacy-preserving access to aggregated panels with (p, R, L, m, y, X) and documented policy shifts. Because our identification strategies exploit threshold rules and algorithm changes, they are feasible in aggregated data without user-level histories, reducing privacy risks while still permitting credible evaluation.
We close by acknowledging what our analysis does not settle. First, while we emphasize manipulability and dynamic incentives, the welfare evaluation still relies on a maintained demand structure and on how we map predicted utility changes to consumer surplus. Alternative preference models can shift levels and, in principle, alter the ranking in edge cases where reference signals convey real cost information rather than framing. Second, our counterfactuals abstract from some platform-wide general equilibrium forces, such as endogenous ranking, entry and exit of sellers, and long-run learning by consumers about the meaning of R. These forces could either dampen or amplify harms. Third, what counts as "deceptive" is ultimately a legal and normative question. Our contribution is to supply an economic diagnostic: a reference algorithm that places substantial weight on non-transactional posted prices creates a predictable path by which sellers can generate large perceived gains without corresponding consumer benefit, and that mechanism is testable in data.
Reference-price statistics and deal labels are not innocuous UI elements; they are policy levers that reshape demand and pricing dynamics through a state variable that the platform itself defines. Our empirical framework shows how to recover both the memory embedded in the platform’s reference design and the asymmetry in consumer responses to gains and losses. The resulting counterfactuals clarify a central governance tradeoff. Long-memory references can intensify perceived discounts and thereby raise conversion in the short run, but they also invite intertemporal manipulation and price volatility. Transaction-based (sales-weighted) updating, by contrast, makes the reference more difficult to inflate without real purchases and tends to produce more stable pricing paths, with consumer-surplus gains that are especially salient when consumers penalize over-reference prices.
For policymakers, the implication is that modern consumer protection should treat reference pricing as an algorithmic claim whose credibility depends on design robustness, not merely on literal truth. For platforms, the implication is that responsible governance is feasible: one can retain the informational benefits of reference statistics while reducing manipulability through transaction grounding, bounded influence, and auditable rules. More broadly, the economics here is a reminder that transparency is not only about disclosure; it is about choosing mechanisms whose incentives align with the meaning consumers reasonably infer.