Observational Price Variation in Scanner Data Does Not Reproduce Experimental Price Elasticities
Robert Bray, Robert Evan Sanders & Ioannis Stamatopoulos (2026)
Roemheld: The paper documents a striking divergence, but the experimental estimates face a plausibility challenge — they imply profit opportunities far exceeding observed industry margins. Contribution is valuable, but headline conclusion requires stronger foundations.
Anonymous evaluator: A nice paper with a rich novel experimental dataset. The DID shift is not automatically "observational bias" — it may reflect functional-form restrictiveness rather than bias per se. Worth quantifying how much is explained by curvature.
Roemheld: 70 overall · tier 3 (should be) / 3.5 (predicted)
Anonymous: 60 overall · tier 4 (should be) / 5 (predicted)
Verbatim quote: evaluator's or author's own words
AI-drafted summary: checked and edited by D. Reinstein
The paper
The research
This paper tests whether standard observational demand estimation with rich scanner data recovers the price elasticities revealed by a randomized pricing experiment, and finds that the two diverge substantially. (AI-drafted summary)
The canonical wording for this overview is the evaluation manager's summary on PubPub: unjournal.pubpub.org/pub/evalsumbraybray/. The text below was drafted by AI and edited by D. Reinstein.
Using data from a large grocery retailer's pricing experiment, Robert Bray, Robert Evan Sanders, and Ioannis Stamatopoulos ask whether standard observational demand-estimation methods, applied to rich scanner data, recover the "true" price elasticities revealed by a randomized price experiment. They find that the answer is no. Observational estimates of own-price elasticities differ substantially from the experimental benchmark:
A difference-in-differences analysis finds that observational own-price elasticity estimates are roughly 2 units more negative (more elastic) than the experimental benchmark.
The headline experimental own-price elasticity is approximately −0.34 — substantially less elastic than observational methods suggest.
The gap persists across nine product categories and at the individual-product level.
The pattern holds under multiple robustness specifications addressing salience, stockouts, and aggregation concerns.
The paper title was softened from "cannot reproduce" to "does not reproduce" to better reflect the scope of the evidence.
Why the gap? Leading explanations examined
The paper examines and partially rules out several candidate explanations: temporary-price-reduction (TPR) salience (93% of observational price changes are TPRs), experimental compliance, stockout dynamics, sample selection, basket-level price perception, and functional-form restrictiveness. The full analysis is in the paper; the evaluators focus on which explanations remain live.
What's in the paper
Actual section structure from the openly hosted working-paper version. The evaluated 2026 revision renumbers some sections: the evaluation's references to Sections IV.E/IV.F/IV.I correspond to later Findings subsections.
1. Introduction
2. Related Literature
3. Data and Experiment — 3.1 Data; 3.2 Price Experiments
The paper's findings bear directly on whether scanner-data demand estimates can be trusted for policy analysis — including animal-welfare-motivated policies that hinge on meat / plant-based substitution elasticities. If observational methods systematically differ from true experimental estimates, cost-effectiveness conclusions built on scanner-data elasticities could be unreliable.
The evaluators agree on the significance of this contribution while disagreeing on its current strength. Roemheld presses on the plausibility of the experimental benchmark itself. The anonymous evaluator questions the "bias" framing, suggesting the gap may reflect functional-form restrictiveness rather than observational bias.
If the experimental elasticity estimate is correct, the finding has broad implications for empirical industrial-organization, consumer-demand modeling, and any policy that uses scanner-data elasticities as inputs.
The evaluations
What the evaluators said
Two evaluators with different focal concerns — their views are presented side by side here.
Summary (verbatim)
"This paper compares observational and experimental price elasticity estimates from a large grocery retailer. While the paper documents a striking divergence between the two approaches, I argue that the experimental estimates themselves face a plausibility challenge: they imply profit opportunities far exceeding observed industry margins. I offer five diagnostic suggestions that may help resolve this tension, focusing on experimental compliance, promotional salience, and stockout dynamics. The paper's contribution is valuable, but its headline conclusion requires stronger foundations."
Main claim as identified
Main research claim (as read by E1)
"Standard observational approaches to estimating price elasticity of demand fail to approximate true (experimental) elasticity, even with rich scanner data."
E1's central diagnostic concern
The experimental own-price elasticity of approximately −0.34 means the retailer is pricing in the inelastic region of demand, where a price increase raises revenue while reducing the volume sold. Taken at face value, that implies profit opportunities far above observed grocery margins (for a Kroger-like retailer: 20–25% gross, 2–3% net). This "margin puzzle" is the organizing concern for E1's five suggestions.
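The arithmetic behind this concern can be sketched in a few lines of Python. This is a rough illustration under assumptions that are not in E1's report: quantity responds linearly to the price change, cost of goods sold scales with units sold, all other costs stay fixed, and the baseline margins are the midpoints of the ranges E1 quotes. The function name `price_change_effects` is hypothetical.

```python
def price_change_effects(elasticity, price_change, gross_margin, net_margin):
    """Back-of-the-envelope effects of a uniform price change.

    Baseline revenue is normalised to 1. Illustrative assumptions:
    - quantity responds linearly: %dQ = elasticity * %dP
    - cost of goods sold scales with units sold
    - all other costs stay fixed in absolute terms
    """
    quantity_ratio = 1.0 + elasticity * price_change
    revenue = (1.0 + price_change) * quantity_ratio
    cogs = (1.0 - gross_margin) * quantity_ratio
    other_costs = gross_margin - net_margin  # fixed operating costs
    return {
        "quantity_change": quantity_ratio - 1.0,
        "revenue_change": revenue - 1.0,
        "gross_margin": (revenue - cogs) / revenue,
        "net_margin": (revenue - cogs - other_costs) / revenue,
    }


if __name__ == "__main__":
    # Elasticity as reported in the paper; margins are assumed midpoints
    # of the 20-25% gross and 2-3% net ranges quoted by E1.
    result = price_change_effects(-0.34, 0.20, gross_margin=0.225, net_margin=0.025)
    for name, value in result.items():
        print(f"{name}: {value:+.1%}")
```

Under these assumptions a 20% price increase lowers quantity by about 7% and raises revenue by about 12%, in line with the figures in E1's report. The implied margins are sensitive to the cost assumptions, so they are best read as a ballpark: the net margin lands well above the 2–3% baseline, which is the core of the puzzle.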
Six diagnostic concerns
Each concern is paired with the authors' reply as a margin note inside the full evaluation below (open "Full evaluation"); the author-response overview indexes them.
E1-01 The margin puzzle. An elasticity near −0.34 implies the retailer prices in the inelastic region, inconsistent with profit maximization under standard markup formulas and observed margins.
E1-02 Experimental compliance and price-data integrity. Do transacted prices accurately reflect the experimental assignments?
E1-03 Promotional salience and consumer attention. 93% of observational price changes are temporary price reductions — a very different mechanism from the experiment's price changes.
E1-04 Out-of-stock dynamics. Stockouts could censor observed demand response, biasing elasticity estimates in either direction.
E1-05 Aggregation and sample selection. Test products are higher-selling and non-randomly selected; unweighted averages may not generalize.
E1-06 Basket-level perception and competitive dynamics. Mean-zero perturbations to individual products may leave perceived basket prices approximately unchanged, muting response.
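Concern E1-04 lends itself to a small numerical illustration. The sketch below assumes a hypothetical constant-elasticity demand curve with a true elasticity of −2 and a hypothetical shelf inventory of 130 units; none of these numbers come from the paper. It shows only one direction of the possible bias: when a price cut pushes demand above available stock, recorded sales are capped and the measured elasticity is pulled toward zero.

```python
import math


def demand(price, scale=100.0, elasticity=-2.0):
    """Hypothetical constant-elasticity demand: q = scale * price ** elasticity."""
    return scale * price ** elasticity


def observed_sales(price, inventory):
    """Recorded sales cannot exceed the units on the shelf."""
    return min(demand(price), inventory)


def arc_elasticity(p0, q0, p1, q1):
    """Log-difference (arc) elasticity between two price/quantity points."""
    return (math.log(q1) - math.log(q0)) / (math.log(p1) - math.log(p0))


if __name__ == "__main__":
    base_price, cut_price, inventory = 1.0, 0.8, 130.0  # all hypothetical
    true_e = arc_elasticity(base_price, demand(base_price),
                            cut_price, demand(cut_price))
    censored_e = arc_elasticity(base_price, observed_sales(base_price, inventory),
                                cut_price, observed_sales(cut_price, inventory))
    print(f"elasticity from latent demand:  {true_e:.2f}")
    print(f"elasticity from recorded sales: {censored_e:.2f}")
```

In this toy setup the stockout binds only at the lower price, so the elasticity computed from recorded sales is smaller in magnitude than the one computed from latent demand. Whether stockouts matter in the actual experiment, and in which direction, is an empirical question the sketch does not answer.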
The following is Roemheld's complete report, reproduced verbatim from the published evaluation. Headings, paragraph structure, and bullet points are preserved as submitted. The authors' point-by-point replies (green) appear as margin notes beside the passage each one answers on wide screens, or inline on narrow screens.
Written report
1. Summary
Bray et al. (2024) use both observational and experimental data from a large Midwestern grocery retailer to compare estimates of price elasticity of demand across the two settings. The paper's central finding is that the estimates differ dramatically: observational approaches produce substantially more elastic demand estimates than the randomized experiment does. The authors interpret this as evidence that observational approaches to demand estimation are unreliable.
The paper makes several noteworthy contributions. First, it is rare to see experimental pricing data of this richness and scale made available for academic study. The transparency with which the authors describe their data and methodology is commendable. Second, the paper's honest engagement with the difficulty of reconciling observational and experimental estimates is refreshing and a realistic presentation of applied demand estimation. Third, the paper's general estimation approaches appear defensible.
That said, I have significant reservations about the paper's headline conclusion. As I argue below, the experimental estimates themselves appear implausible when subjected to a simple back-of-the-envelope calculation.
2. The Margin Puzzle
The paper's primary experimental elasticity estimate is approximately −0.34 (p. 9). This estimate remains inelastic even for price changes of +/-20% and over time horizons of several months. On its face, this implies that the grocery retailer is consistently pricing in the inelastic portion of the demand curve – a striking result that deserves more scrutiny than the paper provides.
Consider what an elasticity of −0.34 means in practice. Suppose the retailer raises prices by 20%. At this elasticity, quantity sold falls by roughly 7%. Selling 7% fewer units at 20% higher price yields approximately 12% higher revenue.
Now consider the grocery industry's actual financial structure. Kroger, a large publicly listed Midwestern grocery retailer that matches the paper's description, reports gross margins of 20–25% and net profit margins of 2–3%. A revenue increase of 12%, with modest volume loss, would flow almost entirely to the bottom line. Gross margins would jump to roughly 30–35%, and net profit margins could rise to 13–15%: an order of magnitude above actual industry profitability.
The fact that grocery margins remain thin suggests either that the experimental elasticity is biased toward zero or that the experiment is measuring something other than the demand curve the retailer faces in equilibrium.
3. Diagnostic Suggestions
3.1 Experimental compliance and price data integrity
In applied pricing contexts, experimental prices are frequently not implemented as intended. Sources of error range from technical failures to human overrides of prices that staff perceive as "particularly outrageous."
3.2 Promotional salience and consumer attention
The authors report that 93% of price changes during the observational period stem from temporary price reductions. If the experimental price variation was implemented without promotional signaling, it may not have been salient enough for shoppers to notice.
3.3 Out-of-stock dynamics
The paper does not discuss stockout events, which could especially bias experimental elasticity estimates. Price reductions increase the likelihood of stockouts, which censor the observed demand response.
3.4 Aggregation and sample selection
The process of averaging elasticity estimates across products may introduce bias. The products included in the experiment were a non-random sample of relatively high-selling items.
3.5 Basket-level perception and competitive dynamics
Because the experimental price perturbations were mean-zero across products, a typical customer's total basket price may not have changed meaningfully.
4. Overall Assessment
This paper tackles an important question with an unusually rich dataset and a commendable commitment to transparency. However, I am not yet convinced that the paper establishes its headline claim – that observational approaches to demand estimation are generally unreliable.
5. Pivotal Question
The specifications estimated for Figure 10 would in principle allow quantification of substitution effects, though the paper does not report these estimates. It is important to distinguish between the category-level price elasticity of demand and the more elastic demand curve that a single retailer or brand faces in a competitive market.
Claim identification and assessment
I. Identify the most important and impactful factual claim this research makes
Standard observational approaches to estimating price elasticity of demand fail to approximate true (experimental) elasticity, even with rich scanner data.
II. To what extent do you believe the claim you stated above?
0-10-30 (skeptical due to margin puzzle). The experimental estimates themselves appear implausible when subjected to a simple back-of-the-envelope calculation comparing implied profit opportunities to actual grocery industry margins.
III. Suggested robustness checks
Experimental compliance and price data integrity checks
Promotional salience and consumer attention analysis
Out-of-stock dynamics assessment
Aggregation and sample selection analysis
Basket-level perception and competitive dynamics investigation
IV. Important implication, policy, credibility
If the headline claim is true, observational estimates of price elasticity of demand used for policy analysis would need substantial revision. However, the experimental estimates themselves may not serve as the "gold standard" the paper takes them to be.
See the published version on PubPub: Roemheld's evaluation as published on The Unjournal's platform.
Roemheld noted his per-criterion sub-ratings were "rather positive for all, but somewhat less positive for Claims, strength, and characterization of evidence."
Summary / abstract (verbatim)
"The paper uses data from a rich price experiment and frames it as a validation of observational demand methods. The headline finding — a DID shift of own-price elasticities toward zero — is not automatically 'observational bias,' because the experiment changes the price support: elasticities during treatment are evaluated at different price levels and thus different points on the demand curve. The most natural takeaway is that constant-elasticity log-linear demand is too restrictive here. A useful extension would quantify how much of the shift is explained by curvature using flexible demand estimates."
Main claim as identified
Main research claim (as read by E2)
Observational methods do not reproduce the experimental elasticity even with rich scanner data; E2's emphasis is that this need not be "bias" but may reflect functional-form restrictiveness.
Key public points
Overall positive: "A nice paper that brings in a rich novel experimental dataset to the demand estimation literature… experimental variation across many categories provides a valuable benchmark."
Framing concern (E2-02): A DID effect on estimated elasticities is not automatically "observational bias" — the experiment changes the price support, so elasticities are evaluated at different points on the demand curve.
Clarification needed (E2-01): What is the main estimator — OLS log–log or 2SLS? Page 9 mentions four models; it was unclear to E2 which is used in Figure 2.
Takeaway (E2-03): Log-linear (constant-elasticity) demand is likely too restrictive. The central fact is more about functional-form / model misspecification than "bias" per se.
Suggested extension: Quantify how much of the shift is explained by the price-level channel using flexible demand estimates.
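The framing concern (E2-02) and the functional-form takeaway (E2-03) can be illustrated with a toy example. The sketch below assumes a hypothetical linear demand curve, with no noise and no endogeneity, and fits a constant-elasticity log–log regression separately on a low and a high price range. The demand parameters, price ranges, and function names are all assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def linear_demand(price, intercept=100.0, slope=10.0):
    """Hypothetical linear demand curve: q = intercept - slope * price."""
    return intercept - slope * price


def loglog_elasticity(prices, quantities):
    """OLS slope of log(quantity) on log(price), i.e. the constant-elasticity fit."""
    x = [math.log(p) for p in prices]
    y = [math.log(q) for q in quantities]
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var = sum((xi - x_bar) ** 2 for xi in x)
    return cov / var


def price_grid(low, high, n=50):
    """Evenly spaced prices between low and high (inclusive)."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]


def fitted_elasticity(low, high):
    """Fit the log-log model on noiseless data from one price range."""
    prices = price_grid(low, high)
    return loglog_elasticity(prices, [linear_demand(p) for p in prices])


if __name__ == "__main__":
    print(f"fit on prices 2-3: {fitted_elasticity(2.0, 3.0):.2f}")
    print(f"fit on prices 6-7: {fitted_elasticity(6.0, 7.0):.2f}")
```

Both fits are perfectly identified, yet they return very different elasticities because the log–log model is forced to summarize a curve whose elasticity changes with the price level. This is the sense in which a shift in estimated elasticities across price regimes need not indicate bias; how much of the paper's gap this channel explains is exactly what E2 suggests quantifying.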
The following is the evaluator's complete report, reproduced verbatim. Paragraph structure and bullet points are preserved as submitted. Alongside the relevant passages, the authors' point-by-point replies (green) and the evaluation manager's higher-level notes (olive) appear as margin notes on wide screens, or inline on narrow screens.
Main report
Overall assessment / contribution. This is a nice paper that brings in a rich novel experimental dataset to the demand estimation literature. Most of the literature relies on quasi-experimental variation (which always requires assumptions) to estimate price elasticities. Having access to actual experimental variation across many categories provides a valuable benchmark.
Connection to program evaluation / validation logic. I like the link the authors draw to the program-evaluation literature. Conceptually, it's very interesting to use randomized price variation as an external benchmark to assess the extent to which observational strategies (even sophisticated ones) reproduce the causal effect of price on demand.
Clarification needed: what is the main estimator? I'm confused about what is being reported in the main results. Are the reported elasticities coming from OLS log–log regressions, or from 2SLS? Page 9 mentions that four models are estimated, but then I can't see which one is used in, e.g., Figure 2. The interpretation depends heavily on this, and the paper would benefit from making the baseline estimator and its identifying variation completely explicit up front (and keeping the OLS/IV results clearly separated).
Framing concern: DID on elasticities is not automatically "observational bias." Independently of OLS vs. 2SLS, I find the DID framing confusing. The paper effectively treats the estimated elasticities as the outcome in a difference-in-differences design and concludes that the experiment "causes" an average reduction in the magnitude of own-price elasticities of roughly 2. I don't think that object can be interpreted as "observational bias" without further caveats. The key issue (which the authors hint at) is that the experiment affects the distribution of prices. Elasticities estimated during the experimental period are therefore elasticities at different price levels, and thus at different points on the demand curve, than those prevailing in the pre-period. There is no reason, even under perfect identification, that those elasticities should coincide with pre-period elasticities if elasticities are not constant in prices. Put differently, I understand the authors as suggesting that a DID effect of zero on the elasticity would be the null under "no bias," but that is not generally true once the treatment itself changes the price support over which the elasticity is evaluated.
What I take away instead: functional-form restrictiveness. The interpretation that seems most natural is that the log-linear demand specification is too restrictive in this context. A constant-elasticity log–log model implicitly assumes the elasticity is invariant to the price level, which is exactly what the experiment appears to contradict. The central empirical fact is then less about "bias" per se and more about model misspecification / lack of flexibility: observational estimation under a restrictive functional form can deliver elasticities that fail to transport across price regimes. This, in turn, is a clean motivation for the large literature on flexible / semiparametric / nonparametric demand estimation.
That said, the magnitude is large and worth digging into. The estimated change is substantial, so it may not be fully explained by "different price levels imply different elasticities." I think it would be interesting to explore how much of the change in elasticities can be explained by this channel (maybe using standard approaches to flexible demand estimation). This would give a sharper quantification of the residual change in elasticities that is truly driven by the gap between the experimental and the observational variation.
Answer to Unjournal pivotal question
This paper speaks to a broader question in empirical IO / quantitative marketing: whether the demand-estimation methods we routinely apply to scanner data can be used reliably for policy analysis, including policies motivated by animal welfare (and related concerns such as emissions from meat production). A central policy-relevant object in this setting is substitution between meat and plant-based products. If plant-based options draw substantial demand away from meat when their prices fall—i.e., if the relevant cross-price elasticities (or diversion) are large—then subsidizing plant-based products could plausibly be an effective instrument to improve animal welfare (and reduce emissions). If instead meat and plant-based products are effectively separate markets—in the extreme case, because plant-based demand primarily comes from consumers who were already vegetarian—then such a subsidy would not achieve its intended goal.
At the moment, the paper speaks primarily to own-price elasticities. In future research, it would be very interesting to expand the analysis to cross-price elasticities between meat and plant-based products. It's very likely that a restrictive log-log model that assumes constant cross-price elasticities would also tend to not do well at matching the patterns in the experimental data. This would confirm that specifying a flexible functional form matters for the key questions mentioned above that rely on correctly capturing diversions across products.
Evaluation manager follow-up Q&A
Evaluation manager — question 1
Does the evidence presented in this paper cast substantial doubt on the potential for estimating cross-price U.S. demand for substitutes between plant-based and animal products? Would you be optimistic about those prospects?
Evaluator
"My expectation is that extending the analysis to cross-price elasticities would reveal that a log-log model (assuming constant elasticities) also does not match the experimental patterns of substitution across products well. This would confirm that more flexible models are needed."
Evaluation manager — question 2
Would you tend to favor the evidence from observational studies or experimental studies in this context?
Evaluator
"My main take away from the paper is not that experimental data is necessarily better than observational data (or vice versa). It's that restrictive models (like a log-log model of demand) do not do a good job capturing the true shape of the demand functions."
Overall assessment with 90% credible intervals, 0–100 scale; plus journal-tier assessments. Only these summary ratings are available in this brief.
Note: Full per-criterion ratings (methods, logic, claims, etc.) are on the PubPub evaluation: unjournal.pubpub.org/pub/evalsumbraybray/.
Whisker = 90% credible interval · marker = point estimate
Tier legend (should-be / predicted placement): 0 little value · 1 somewhat valuable · 2 decent field journal · 3 strong field journal · 4 top field journal · 5 A-journal / top journal.
Ratings — overall assessment with 90% credible intervals
Metric | Roemheld (E1) | E1 90% CI | Anonymous (E2) | E2 90% CI
Overall assessment (0–100) | 70 | 40–90 | 60 | 55–70
Journal tier, should be (0–5) | 3 | — | 4 | —
Journal tier, predicted (0–5) | 3.5 | — | 5 | —
Per-criterion sub-ratings are available in the full PubPub evaluation. The two overall assessments differ by 10 points (70 vs 60). Roemheld's credible interval (40–90) is much wider, reflecting greater uncertainty, and fully contains the anonymous evaluator's interval (55–70).
Author response
Authors' response — overview
Robert Bray, Robert Evan Sanders, and Ioannis Stamatopoulos provided a detailed point-by-point response to both evaluators. The manuscript was also substantially revised after the evaluated version.
General statement — Robert Bray, Robert Evan Sanders & Ioannis Stamatopoulos
"Thank you all for your interest in our work… the manuscript has been substantially revised since the version originally evaluated."
The authors' point-by-point replies are shown inline, beside the passage each one answers, inside the two full evaluations above. On a wide screen they appear as margin notes (green-ruled "Authors' reply"; olive-ruled "Manager note") to the right of the relevant text; on a narrow screen they fold in directly under the passage.
Two evaluation-manager (higher-level) notes are anchored beside the E2-02 and E2-03 passages. Concerns are drawn from the published evaluations; replies from the published author response. Full text on PubPub.
Process & status
Transparency & process
Package status
2026 evaluation package. The manuscript was substantially revised after the evaluated version. Revised manuscript on SSRN · 5 Feb 2026
Evaluated version
Roemheld confirmed he evaluated the 5 Feb 2026 SSRN version (abstract_id=4899765).
Why this paper
Relevance to scanner-data demand estimation and to animal-welfare / food-policy interventions that rely on observational elasticity estimates.
Evaluator identities
E1, Lars Roemheld, is identified (signed). E2 is anonymous by choice. Standard Unjournal process.
Conflicts of interest
Standard Unjournal disclosure applies. Evaluators were selected for complementary expertise: E1 brings applied industry pricing experience; E2 brings econometric demand-estimation expertise.
Process and guidelines
unjournal.org — full evaluation guidelines, author response process, and disclosure policy.
Note on the title: The published evaluation summary uses the title "cannot reproduce." The authors subsequently softened the paper title to "does not reproduce" to better reflect the scope of the evidence.