Adjusting for Scale-Use Heterogeneity in Self-Reported Well-Being

Layer 1 · The paper

The research

A framework for correcting differences in how people use survey scales, a long-standing obstacle to using subjective wellbeing data in economics.

The paper tackles a fundamental problem in wellbeing measurement: people with the same underlying wellbeing can report different numbers, so one person's 7/10 is another's 5/10. This scale-use heterogeneity has hindered economists' adoption of subjective wellbeing data for decades.

The authors propose correcting it with two parameters, estimated from a small number of extra calibration questions (CQs):

Shifter — where a respondent centers their scale.
Stretcher — how spread out a respondent's scale is.
The correction can change substantive results in applications, e.g. wellbeing comparisons across groups.
The method can be implemented with existing vignette data, or with new data needing only a few extra questions.
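
To make the two parameters concrete, here is a minimal sketch in Python. It assumes a simple linear form, report = shifter + stretcher × common-scale value, and fits each respondent's parameters by least squares on their calibration-question answers. The linear form, the use of reference CQ ratings as the common scale, and the names fit_scale_use and adjust are illustrative assumptions; this summary does not give the authors' functional form or estimator.

```python
from statistics import mean

def fit_scale_use(own_cq, reference_cq):
    """Least-squares fit of own = shifter + stretcher * reference.

    Hypothetical helper: the linear form is an assumption made for
    illustration, not the estimator described in the paper.
    """
    x_bar, y_bar = mean(reference_cq), mean(own_cq)
    sxx = sum((x - x_bar) ** 2 for x in reference_cq)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(reference_cq, own_cq))
    stretcher = sxy / sxx
    shifter = y_bar - stretcher * x_bar
    return shifter, stretcher

def adjust(report, shifter, stretcher):
    """Map a raw report back onto the common scale by inverting the linear form."""
    return (report - shifter) / stretcher

# Made-up calibration data: reference ratings of four CQs and two respondents.
reference_cq = [2.0, 4.0, 6.0, 8.0]
respondent_a = [4.0, 6.0, 8.0, 10.0]  # rates everything 2 points higher
respondent_b = [1.0, 2.0, 3.0, 4.0]   # uses a compressed scale

shift_a, stretch_a = fit_scale_use(respondent_a, reference_cq)
shift_b, stretch_b = fit_scale_use(respondent_b, reference_cq)

# A raw 7 from respondent A and a raw 2.5 from respondent B land on the
# same common-scale value once each report is adjusted.
adjusted_a = adjust(7.0, shift_a, stretch_a)
adjusted_b = adjust(2.5, shift_b, stretch_b)
```

In this toy data, respondent A has a shifter of 2 and a stretcher of 1, while respondent B has a shifter of 0 and a stretcher of 0.5, so raw reports that look very different can reflect the same underlying level.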

What "scale-use heterogeneity" means

Two people feel equally happy, but report different numbers — and the shifter slides the whole scale.

For example, given the same underlying feeling, one respondent reports 7 while another reports 5.

Canonical record

This evaluation discusses an interim working-paper version. The authoritative paper lives on NBER:

doi.org/10.3386/w31728 (canonical record)

Layer 2 · Implications

Why it matters

Global priorities relevance

The correction is directly relevant to the WELLBY (wellbeing-adjusted life-year) measure used in global-priorities cost-effectiveness analysis. If scale-use heterogeneity systematically distorts comparisons across groups, then cost-effectiveness rankings built on self-reported wellbeing could shift once the correction is applied.

The Unjournal selected this paper partly because of its direct link to our Pivotal Questions project on the WELLBY measure — making it a high-leverage methodological input for cause prioritization and impactful interventions.

Layer 3 · Evaluation

What the evaluators said

Two experts, the same rating criteria, the same claim-identification framework — enabling side-by-side comparison.

Short summary synthesised from Kaiser's one-line verdict and ratings; the full report is the canonical record.

Evaluation

A major methodological innovation. The framework is elegant and the estimation strategy is sound. The empirical component would especially benefit from more diverse and reliable samples, and from direct comparisons against existing scale-correction methods so readers can judge incremental value. Logic and communication could be tightened in places; this dimension is rated lower than the others.

Claim identification

Main research claim (as read): There is a new method to adjust for scale-use differences, implementable with existing vignette data and with new data needing only a few extra questions.
Belief in claim: “There certainly is a new method. I'd want comparisons with existing methods.”
Suggested robustness checks: See detailed suggestions in the full report.

Kaiser rated six of the eight criteria; real-world relevance and global-priorities relevance were left unrated.

Overall evaluation

This is an extraordinary paper. It is the kind of methodological research one wants to see more often. It approaches a fundamental issue in wellbeing measurement, and does so constructively, by suggesting and testing a potential solution. The contribution is strong in both its theoretical and empirical parts.

The authors offer a deep reflection on the problem of scale-use heterogeneity, connect it with the social-science literature, give a theoretically informed account of how to think about it, and suggest a sound solution for estimation. The empirical effort is impressive too: the working-paper analysis provides a useful proof of concept, supplemented by additional data from a large representative sample (Understanding America Study) in the forthcoming version.

The model is very well thought out. The use of a shifter and a stretcher parameter makes a lot of sense. Some choices might go unnoticed by an unfamiliar reader, but recentring the shifter, conditioning results on a question's “height”, and the distinction between “dimensional scale use” and “general scale use” are actually smart innovations.

More comments about the limits

The paper has limits not because of any fault in methods or reasoning, but because a single study cannot solve all problems of response-scale heterogeneity. That is normal for a research agenda, and the current paper already provides a substantial leap forward.

Adding calibration questions is costly. The evidence rests on a large number of CQs. It is not clear how well the correction performs when only two or three CQs are used, which is the realistic scenario. Even two CQs can be a substantial burden in large surveys given tight space constraints, and could be cognitively demanding. I suspect this is one crucial reason anchoring vignettes have not been implemented at scale in 20 years.

Claim identification

Main research claim (as read): The authors develop an innovative framework to model and adjust for scale heterogeneity, test it with new calibration questions, and show the adjustment can change results in some applications.
Belief in claim: “Like 90%. They don't have the same data quality as their main dataset, but it's very comprehensive.”
Suggested robustness checks: Unclear how well the correction performs with only two CQs and short SWB scales.

Prati rated all eight criteria.

Layer 3 · Ratings

Ratings comparison

Point estimates with 90% credible intervals, 0–100 scale.

Unjournal editorial note: the evaluators agree most strongly on the overall verdict (both 95). Their widest gap is logic & communication — Kaiser 75 vs Prati 95 — followed by claims & evidence (80 vs 95). This is an editorial annotation derived from the ratings, not evaluator prose.

Legend: whisker = 90% credible interval; marker = point estimate.

Kaiser · journal tier 4.7 / 5

Prati · journal tier 5.0 / 5

Tier legend: 0 little value · 1 somewhat valuable · 2 decent field journal · 3 strong field journal · 4 top field journal · 5 A-journal / top journal.

Ratings (0–100 scale) with 90% credible intervals — full appendix
Criterion	Kaiser	Kaiser 90% CI	Prati	Prati 90% CI
Overall assessment	95	80–100	95	90–100
Claims & evidence	80	70–90	95	90–100
Advancing knowledge & practice	90	80–100	95	90–100
Methods	90	80–100	95	90–100
Logic & communication	75	60–90	95	89–100
Open, collaborative, replicable	85	70–90	95	90–100
Real-world relevance	N/A	—	86	74–95
Relevance to global priorities	N/A	—	86	74–95

Journal-rank tier (0–5): Kaiser 4.7, Prati 5.0. Kaiser did not rate real-world relevance or relevance to global priorities (N/A).
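
The gaps mentioned in the editorial note can be reproduced from the point estimates in the table. The short Python sketch below takes the absolute difference between the two evaluators on each criterion and ranks the criteria by that gap; criteria Kaiser left unrated are skipped. The variable names are illustrative only, and the figures are copied from the table.

```python
# Point estimates as (Kaiser, Prati); None marks a criterion left unrated.
ratings = {
    "Overall assessment": (95, 95),
    "Claims & evidence": (80, 95),
    "Advancing knowledge & practice": (90, 95),
    "Methods": (90, 95),
    "Logic & communication": (75, 95),
    "Open, collaborative, replicable": (85, 95),
    "Real-world relevance": (None, 86),
    "Relevance to global priorities": (None, 86),
}

# Absolute gap per criterion, keeping only criteria both evaluators rated.
gaps = {
    criterion: abs(kaiser - prati)
    for criterion, (kaiser, prati) in ratings.items()
    if kaiser is not None and prati is not None
}

# Largest disagreement first; sorted() is stable, so ties keep table order.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)

for criterion, gap in ranked:
    print(f"{criterion}: {gap}")
```

The ranking puts logic & communication first (gap of 20) and claims & evidence second (gap of 15), with the overall assessment last (gap of 0), which matches the editorial note.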


Layer 4 · Author response

The authors reply

In the authors' voice

“The length and thoroughness of the evaluations clearly demonstrate the significant time and intellectual effort the evaluators invested. We are grateful for their insightful and constructive comments. Since the revised paper is still forthcoming, we do not provide a detailed point-by-point response at this stage; we find the suggestions very valuable and will carefully consider them — particularly the question of how few calibration questions can be used — as we finalize the revision. We welcome this public scientific discourse.”

Layer 5 · Process & followups

Transparency & what's next

Why we chose this paper: Prominence, relevance to ongoing practice, and a direct link to our Pivotal Questions project on the WELLBY measure.
Conflicts of interest: Standard Unjournal disclosure applies; evaluators were selected for complementary methodological and applied expertise.
Interim status: The authors made clear this is an interim version; updates are forthcoming. We evaluated it anyway because of its prominence and relevance. Both evaluators were aware of this — Prati's report explicitly considered updates presented in recent seminars.
Planned re-evaluation: A follow-up evaluation is planned once the revised paper — with the Understanding America Study data — is released.
Evaluator guidelines & process: unjournal.org
