Adjusting for Scale-Use Heterogeneity in Self-Reported Well-Being
Daniel J. Benjamin, Kristen Cooper, Ori Heffetz, Miles S. Kimball & Jiannan Zhou · NBER w31728
Kaiser: A major methodological innovation in adjusting for scale-use differences; the empirical component would benefit from more diverse, reliable samples.
Prati: An extraordinary paper — it approaches a fundamental issue in wellbeing measurement, and does so constructively by suggesting and testing a solution.
Kaiser: 95 overall · tier 4.7
Prati: 95 overall · tier 5.0
Layer 1 · The paper
The research
A framework for correcting the fact that different people use survey scales differently — a long-standing obstacle to using subjective wellbeing data in economics.
The paper tackles a fundamental problem in wellbeing measurement: scale-use heterogeneity, where one person's 7/10 is another's 5/10. This problem has hindered economists' adoption of subjective wellbeing data for decades.
The authors propose correcting it with two parameters, estimated from a small number of extra calibration questions (CQs):
Shifter — where a respondent centers their scale.
Stretcher — how spread out a respondent's scale is.
The correction can change substantive results in applications, e.g. wellbeing comparisons across groups.
Implementable with existing vignette data, or with new data needing only a few extra questions.
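One way to make the two parameters concrete is a linear reporting rule, reported = shifter + stretcher × common-scale value. The sketch below is an illustration under that assumption, not the authors' estimator: it fits a respondent's shifter and stretcher by least squares on hypothetical CQ answers, then inverts the rule to place a wellbeing report on the common scale. The function names, the reference CQ values, and all numbers are invented for the example.

```python
def fit_scale_use(reference, answers):
    """Least-squares fit of answers ~ shifter + stretcher * reference.

    reference: common-scale value of each CQ (assumed known; hypothetical).
    answers:   one respondent's ratings of the same CQs.
    """
    n = len(reference)
    mean_x = sum(reference) / n
    mean_y = sum(answers) / n
    sxx = sum((x - mean_x) ** 2 for x in reference)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reference, answers))
    stretcher = sxy / sxx
    shifter = mean_y - stretcher * mean_x
    return shifter, stretcher


def adjust(report, shifter, stretcher):
    """Invert the assumed linear rule to recover the common-scale value."""
    return (report - shifter) / stretcher


# Hypothetical common-scale values for four CQs.
reference = [2.0, 4.0, 6.0, 8.0]

# Respondent A rates every CQ one point higher; respondent B compresses the scale.
answers_a = [3.0, 5.0, 7.0, 9.0]  # consistent with shifter 1, stretcher 1
answers_b = [3.0, 4.0, 5.0, 6.0]  # consistent with shifter 2, stretcher 0.5

shift_a, stretch_a = fit_scale_use(reference, answers_a)
shift_b, stretch_b = fit_scale_use(reference, answers_b)

# A reports 7 and B reports 5 on the wellbeing question.
adjusted_a = adjust(7.0, shift_a, stretch_a)
adjusted_b = adjust(5.0, shift_b, stretch_b)
print(adjusted_a, adjusted_b)
```

In this toy case both adjusted values come out at 6: the raw reports of 7 and 5 differ only because of how the two respondents use the scale.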
What "scale-use heterogeneity" means
Two people feel equally happy, but report different numbers — and the shifter slides the whole scale.
Illustration: the same underlying feeling maps onto different reported scores. One respondent reports 7; the other reports 5.
Canonical record
This evaluation discusses an interim working-paper version. The authoritative paper lives on NBER as working paper w31728.
The correction is directly relevant to the WELLBY (wellbeing-adjusted life-year) measure used in global-priorities cost-effectiveness analysis. If scale-use heterogeneity systematically distorts comparisons across groups, then cost-effectiveness rankings built on self-reported wellbeing could shift once the correction is applied.
The Unjournal selected this paper partly because of its direct link to our Pivotal Questions project on the WELLBY measure — making it a high-leverage methodological input for cause prioritization and impactful interventions.
Layer 3 · Evaluation
What the evaluators said
Two experts, the same rating criteria, the same claim-identification framework — enabling side-by-side comparison.
Short summary synthesised from Kaiser's one-line verdict and ratings; the full report is the canonical record.
Evaluation
A major methodological innovation. The framework is elegant and the estimation strategy is sound. The empirical component would especially benefit from more diverse and reliable samples, and from direct comparisons against existing scale-correction methods so readers can judge incremental value. Logic and communication could be tightened in places; this dimension is rated lower than the others.
Claim identification
Main research claim (as read)
There is a new method to adjust for scale-use differences, implementable with existing vignette data and with new data needing only a few extra questions.
Belief in claim
“There certainly is a new method. I'd want comparisons with existing methods.”
Suggested robustness checks
See detailed suggestions in the full report.
Kaiser rated six of the eight criteria; real-world relevance and global-priorities relevance were left unrated.
Overall evaluation
This is an extraordinary paper. It is the kind of methodological research one wants to see more often. It approaches a fundamental issue in wellbeing measurement, and does so constructively, by suggesting and testing a potential solution. The contribution is strong in both its theoretical and empirical parts.
The authors offer a deep reflection on the problem of scale-use heterogeneity, connect it with the social-science literature, give a theoretically informed account of how to think about it, and suggest a sound solution for estimation. The empirical effort is impressive too: the working-paper analysis provides a useful proof of concept, supplemented by additional data from a large representative sample (Understanding America Study) in the forthcoming version.
The model is very well thought out. The use of a shifter and a stretcher parameter makes a lot of sense. Some choices might go unnoticed by an unfamiliar reader, but recentring the shifter, conditioning results on a question's “height”, and the distinction between “dimensional scale use” and “general scale use” are actually smart innovations.
More comments about the limits
The paper has limits not because of any fault in methods or reasoning, but because a single study cannot solve all problems of response-scale heterogeneity. This is proper to a research agenda, and the current paper already provides a substantial leap forward.
Adding calibration questions is costly. The evidence rests on a large number of CQs. It is not clear how well the correction performs when only two or three CQs are used, which is the realistic scenario. Even two CQs can be a substantial burden in large surveys given tight space constraints, and could be cognitively demanding. I suspect this is one crucial reason anchoring vignettes have not been implemented at scale in 20 years.
Claim identification
Main research claim (as read)
The authors develop an innovative framework to model and adjust for scale heterogeneity, test it with new calibration questions, and show the adjustment can change results in some applications.
Belief in claim
“Like 90%. They don't have the same data quality as their main dataset, but it's very comprehensive.”
Suggested robustness checks
Unclear how well the correction performs with only two CQs and short SWB scales.
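The concern about few CQs can be illustrated with a toy simulation. This is a sketch under invented assumptions (a linear reporting rule, Gaussian response noise, a least-squares fit, continuous answers), not the paper's estimator or data, and it says nothing about how the authors' method actually performs. The function names and settings are hypothetical.

```python
import random
import statistics


def fit_scale_use(reference, answers):
    """Least-squares fit of answers ~ shifter + stretcher * reference."""
    n = len(reference)
    mean_x = sum(reference) / n
    mean_y = sum(answers) / n
    sxx = sum((x - mean_x) ** 2 for x in reference)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(reference, answers))
    stretcher = sxy / sxx
    return mean_y - stretcher * mean_x, stretcher


def simulate_stretcher_sd(num_cqs, noise_sd=1.0, reps=2000, seed=0):
    """Spread of the estimated stretcher across simulated respondents.

    Each simulated respondent has true shifter 0 and stretcher 1 and answers
    num_cqs CQs spaced evenly over 2..8, with Gaussian response noise.
    All of these settings are illustrative assumptions.
    """
    rng = random.Random(seed)
    step = 6.0 / (num_cqs - 1)
    reference = [2.0 + i * step for i in range(num_cqs)]
    estimates = []
    for _ in range(reps):
        answers = [x + rng.gauss(0.0, noise_sd) for x in reference]
        _, stretcher = fit_scale_use(reference, answers)
        estimates.append(stretcher)
    return statistics.stdev(estimates)


sd_by_count = {k: simulate_stretcher_sd(k) for k in (2, 5, 20)}
for k, sd in sd_by_count.items():
    print(f"{k:>2} CQs: sd of estimated stretcher = {sd:.3f}")
```

Under these assumptions the estimated stretcher is noisier with 2 CQs than with 20, which is the mechanism behind the worry. The sketch ignores bounded and discrete response scales, so it does not address the short-scale part of the question.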
Point estimates with 90% credible intervals, 0–100 scale.
Unjournal editorial note: the evaluators agree most strongly on the overall verdict (both 95). Their widest gap is logic & communication — Kaiser 75 vs Prati 95 — followed by claims & evidence (80 vs 95). This is an editorial annotation derived from the ratings, not evaluator prose.
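The gaps in the editorial note can be reproduced from the point ratings. The short sketch below copies the point estimates from the ratings table and ranks the criteria by the absolute difference between the two evaluators; the variable names are illustrative, and the two criteria Kaiser left unrated are omitted.

```python
# Point ratings as (Kaiser, Prati), copied from the ratings table.
ratings = {
    "Overall assessment": (95, 95),
    "Claims & evidence": (80, 95),
    "Advancing knowledge & practice": (90, 95),
    "Methods": (90, 95),
    "Logic & communication": (75, 95),
    "Open, collaborative, replicable": (85, 95),
}

# Absolute gap per criterion, largest first.
gaps = {name: abs(kaiser - prati) for name, (kaiser, prati) in ratings.items()}
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)

for name, gap in ranked:
    print(f"{name}: {gap}")
```

The ranking starts with logic & communication (20 points) and claims & evidence (15 points) and ends with the overall assessment (0 points), matching the note.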
Kaiser · journal tier 4.7 / 5
Prati · journal tier 5.0 / 5
Tier legend: 0 little value · 1 somewhat valuable · 2 decent field journal · 3 strong field journal · 4 top field journal · 5 A-journal / top journal.
Ratings (0–100 scale) with 90% credible intervals — full appendix
Criterion | Kaiser | Kaiser 90% CI | Prati | Prati 90% CI
Overall assessment | 95 | 80–100 | 95 | 90–100
Claims & evidence | 80 | 70–90 | 95 | 90–100
Advancing knowledge & practice | 90 | 80–100 | 95 | 90–100
Methods | 90 | 80–100 | 95 | 90–100
Logic & communication | 75 | 60–90 | 95 | 89–100
Open, collaborative, replicable | 85 | 70–90 | 95 | 90–100
Real-world relevance | N/A | N/A | 86 | 74–95
Relevance to global priorities | N/A | N/A | 86 | 74–95
Journal-rank tier (0–5): Kaiser 4.7, Prati 5.0. Kaiser did not rate real-world relevance or relevance to global priorities (N/A).
Layer 4 · Author response
The authors reply
In the authors' voice
“The length and thoroughness of the evaluations clearly demonstrate the significant time and intellectual effort the evaluators invested. We are grateful for their insightful and constructive comments. Since the revised paper is still forthcoming, we do not provide a detailed point-by-point response at this stage; we find the suggestions very valuable and will carefully consider them — particularly the question of how few calibration questions can be used — as we finalize the revision. We welcome this public scientific discourse.”
Layer 5 · Process & followups
Transparency & what's next
Why we chose this paper
Prominence, relevance to ongoing practice, and a direct link to our Pivotal Questions project on the WELLBY measure.
Conflicts of interest
Standard Unjournal disclosure applies; evaluators were selected for complementary methodological and applied expertise.
Interim status
The authors made clear this is an interim version; updates are forthcoming. We evaluated it anyway because of its prominence and relevance. Both evaluators were aware of this; Prati's report explicitly considered updates presented in recent seminars.
Planned re-evaluation
A follow-up evaluation is planned once the revised paper — with the Understanding America Study data — is released.