Unjournal Evaluation · Interim

Adjusting for Scale-Use Heterogeneity in Self-Reported Well-Being

Daniel J. Benjamin, Kristen Cooper, Ori Heffetz, Miles S. Kimball & Jiannan Zhou

NBER working paper · DOI 10.3386/w31728 · Evaluated by Caspar Kaiser and Alberto Prati

Interim evaluation (24 Nov 2025). The authors have made clear this is an interim version; updates are forthcoming. We evaluated it for its prominence, its relevance to ongoing practice, and its link to our Pivotal Questions project on the WELLBY measure. A follow-up is planned once the revised paper is released.

The full evaluation record (both reviews, ratings, author response, and process notes) is below.

The paper · the problem

One person’s “7” is another person’s “5”

Policy increasingly leans on asking people how they feel — how satisfied they are with life, how happy, how anxious. But an answer is only as comparable as the scale it is read off, and people use those scales differently.

That is part of why many economists were long wary of subjective well-being data. Ask two people how satisfied they are with life on a 0–10 scale and they may feel much the same, yet one answers 7 and the other 5 — they center their scales in different places, and spread them differently. This is scale-use heterogeneity, and it can distort comparisons built on self-reports — including the WELLBY measure now used in global-priorities cost-effectiveness analysis.

The paper · the idea

A shifter and a stretcher

The authors’ framework describes each person’s scale with two parameters: a shifter (where you center the scale) and a stretcher (how spread out it is). Both are estimated from a few extra calibration questions (CQs) — no need for full anchoring-vignette batteries.

Correcting for scale use moves raw answers onto a common footing, and it can change substantive results in applications, for example well-being comparisons across groups.
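
As a rough illustration of the shifter-and-stretcher idea, the sketch below treats each respondent's answer as an affine function of a common-scale value (answer = shifter + stretcher × value), fits the two parameters by ordinary least squares on a handful of calibration questions, and then inverts the relation. This is a minimal sketch under stated assumptions, not the authors' estimator: the function names `fit_scale_use` and `to_common_scale`, the four calibration-question values, and both respondents' answers are hypothetical, and the common-scale values are taken as known, whereas in practice they would have to be estimated.

```python
from statistics import mean


def fit_scale_use(cq_responses, cq_common_values):
    """Least-squares fit of: response = shifter + stretcher * common_value."""
    x_bar = mean(cq_common_values)
    y_bar = mean(cq_responses)
    sxx = sum((x - x_bar) ** 2 for x in cq_common_values)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(cq_common_values, cq_responses))
    stretcher = sxy / sxx
    shifter = y_bar - stretcher * x_bar
    return shifter, stretcher


def to_common_scale(raw_answer, shifter, stretcher):
    """Invert the affine relation to place a raw answer on the common scale."""
    return (raw_answer - shifter) / stretcher


# Hypothetical common-scale values of four calibration questions (assumed known).
cq_common = [2.0, 4.0, 6.0, 8.0]

# Hypothetical answers: respondent A rates everything one point higher,
# respondent B compresses the scale toward the middle.
cq_answers_a = [3.0, 5.0, 7.0, 9.0]
cq_answers_b = [3.0, 4.0, 5.0, 6.0]

shifter_a, stretcher_a = fit_scale_use(cq_answers_a, cq_common)
shifter_b, stretcher_b = fit_scale_use(cq_answers_b, cq_common)

# A reports life satisfaction of 7, B reports 5.
adjusted_a = to_common_scale(7.0, shifter_a, stretcher_a)
adjusted_b = to_common_scale(5.0, shifter_b, stretcher_b)

print(f"A: shifter={shifter_a:.2f}, stretcher={stretcher_a:.2f}, adjusted={adjusted_a:.2f}")
print(f"B: shifter={shifter_b:.2f}, stretcher={stretcher_b:.2f}, adjusted={adjusted_b:.2f}")
```

In this constructed example the raw answers 7 and 5 land on the same common-scale value once each respondent's shifter and stretcher are removed, which is the "one person's 7 is another person's 5" situation in miniature.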

The evaluation · the verdict

Two experts, eight criteria

From here the page shifts from the paper's subject to our judgement of it — and there's an irony worth naming. The paper is about how hard it is to compare people's self-reported scales; now two evaluators rate it on a set of 0–100 scales of our own, which carry their own subjective, scale-use element. Read these as considered judgements rather than measurements, and look at the intervals, not just the points.

Both evaluators rated the paper near the top of everything they have seen in this area. Kaiser calls it “a major methodological innovation.” Prati calls it “an extraordinary paper.” Each rating below carries a 90% credible interval.

They diverge most on claims & evidence and logic & communication, where Kaiser is more reserved.

The evaluation · the debate

Where the evaluators push back

Both rate the paper highly; their reservations are practical, and they don't all point the same way.

Prati's main worry is cost. The evidence rests on a large number of calibration questions, and it is unclear how well the correction holds up with only two or three, which is the realistic scenario. Even two can be a heavy burden in a large survey. He suspects this is “one crucial reason anchoring vignettes have not been implemented at scale in 20 years.”
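
The statistical side of this worry can be illustrated with a toy simulation: the fewer calibration questions a respondent answers, the noisier the estimate of that respondent's stretcher tends to be. The sketch below is an assumption-laden illustration and says nothing about how the paper's method actually performs. The true shifter and stretcher, the noise level, the evenly spaced common-scale values, and the function names are all hypothetical, and answers are treated as continuous and unbounded rather than as rounded 0–10 responses.

```python
import random
from statistics import mean

TRUE_SHIFTER, TRUE_STRETCHER = 1.0, 0.8  # hypothetical respondent


def evenly_spaced(n, low=2.0, high=8.0):
    """Assumed common-scale values for n calibration questions (n >= 2)."""
    step = (high - low) / (n - 1)
    return [low + i * step for i in range(n)]


def fit_stretcher(responses, common_values):
    """OLS slope of the responses on the assumed common-scale values."""
    x_bar, y_bar = mean(common_values), mean(responses)
    sxx = sum((x - x_bar) ** 2 for x in common_values)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(common_values, responses))
    return sxy / sxx


def mean_abs_error(n_cqs, n_draws=2000, noise_sd=1.0, seed=0):
    """Average |estimated - true| stretcher across simulated respondents."""
    rng = random.Random(seed)
    common = evenly_spaced(n_cqs)
    errors = []
    for _ in range(n_draws):
        # Each answer is the affine scale-use relation plus independent noise.
        responses = [TRUE_SHIFTER + TRUE_STRETCHER * x + rng.gauss(0.0, noise_sd)
                     for x in common]
        errors.append(abs(fit_stretcher(responses, common) - TRUE_STRETCHER))
    return mean(errors)


errors_by_n = {n: mean_abs_error(n) for n in (2, 5, 20)}
for n, err in errors_by_n.items():
    print(f"{n:>2} calibration questions: mean abs stretcher error = {err:.3f}")
```

How much precision is lost in a real survey depends on the actual noise in responses and on which calibration questions are kept, which is the empirical question Prati raises.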

Kaiser raises several points, mostly about pinning down what the method adds. He wants a direct comparison against existing scale-correction methods (such as HOPIT), so readers can see what the new approach buys; a check that the results hold beyond the US online sample, in other countries and more traditional surveys; an extension to panel / fixed-effects settings, where the field increasingly works; and a more accessible write-up, ideally with an R or Stata package. His lower marks fall on logic & communication and on claims & evidence.
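
On the panel-data point, Kaiser's report notes that fixed effects do nothing about the stretcher. A small worked example can make that concrete: within-person demeaning cancels a time-invariant shifter, but a stretcher still rescales the estimated slope. This is a minimal sketch assuming the affine scale-use relation and a single person observed four times; the covariate values, the true effect of 0.5, and the function names are hypothetical.

```python
from statistics import mean


def within_slope(xs, ys):
    """Fixed-effects (within) slope for one person: demean, then regress."""
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def reported(latent, shifter, stretcher):
    """Map latent well-being to reported answers via the affine relation."""
    return [shifter + stretcher * w for w in latent]


TRUE_EFFECT = 0.5                    # hypothetical effect of the covariate
covariate = [1.0, 2.0, 3.0, 4.0]     # hypothetical values over four waves
latent = [TRUE_EFFECT * x for x in covariate]

# No scale-use distortion: the within slope recovers the true effect.
slope_plain = within_slope(covariate, reported(latent, shifter=0.0, stretcher=1.0))

# Time-invariant shifter only: demeaning removes it.
slope_shifted = within_slope(covariate, reported(latent, shifter=2.0, stretcher=1.0))

# Shifter and stretcher: the slope is scaled by the stretcher.
slope_stretched = within_slope(covariate, reported(latent, shifter=2.0, stretcher=0.5))

print(slope_plain, slope_shifted, slope_stretched)
```

The example covers only the time-invariant case. A shifter that varies over time and is correlated with the covariate, which the report also flags, would not cancel under demeaning.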

This back-and-forth — what the evidence does and doesn't yet support — is the part that's easiest to lose when the reviews sit on separate pages. The full reasoning is in the evaluations below.

Author response

The authors reply

Because the revised paper — with new data from the Understanding America Study — is still forthcoming, the authors do not give a point-by-point response yet. They single out one suggestion to carry into the revision: how few calibration questions can be used.

Followups

What happens next

This is an interim evaluation by design. The scholarly record stays open: when the revised paper is released, The Unjournal plans to revisit it.

Full evaluation record

The paper, in plain language

The paper tackles a fundamental problem in well-being measurement: different people use survey scales differently (one person’s “7/10” sits where another’s “5/10” does). This scale-use heterogeneity has hindered economists’ adoption of subjective well-being data for decades.

The authors propose a framework using a shifter parameter (where you center your scale) and a stretcher parameter (how spread out your scale is), estimated from a small number of extra calibration questions. The correction can change substantive results in applications — for example, well-being comparisons across groups. This is directly relevant to the WELLBY measure used in global-priorities cost-effectiveness analysis.

Ratings — Kaiser vs. Prati (0–100, with 90% CI)

Ratings by Caspar Kaiser and Alberto Prati across eight criteria, with 90% credible intervals on a 0 to 100 scale.
Criterion	Kaiser	Kaiser 90% CI	Prati	Prati 90% CI
Overall assessment	95	80–100	95	90–100
Claims & evidence	80	70–90	95	90–100
Advancing knowledge & practice	90	80–100	95	90–100
Methods	90	80–100	95	90–100
Logic & communication	75	60–90	95	89–100
Open / collaborative / replicable	85	70–90	95	90–100
Real-world relevance	N/A	N/A	86	74–95
Relevance to global priorities	N/A	N/A	86	74–95

Journal-rank tier (0–5): Kaiser 4.7, Prati 5.0. Legend: 0 = little value · 1 = somewhat valuable · 2 = decent field journal · 3 = strong field journal · 4 = top field journal · 5 = A-journal / top journal.
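
The gaps between the two evaluators' midpoint ratings can be tabulated directly from the figures above. The short sketch below does only that arithmetic for the six criteria both evaluators rated. It ignores the credible intervals, and the variable names are illustrative.

```python
# Midpoint ratings (Kaiser, Prati) for the criteria both evaluators rated.
ratings = {
    "Overall assessment": (95, 95),
    "Claims & evidence": (80, 95),
    "Advancing knowledge & practice": (90, 95),
    "Methods": (90, 95),
    "Logic & communication": (75, 95),
    "Open / collaborative / replicable": (85, 95),
}

# Gap = Prati's midpoint minus Kaiser's midpoint.
gaps = {criterion: prati - kaiser
        for criterion, (kaiser, prati) in ratings.items()}

# Largest gaps first.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)

for criterion, gap in ranked:
    print(f"{criterion}: Prati - Kaiser = {gap}")
```

The two largest gaps fall on logic & communication and on claims & evidence, and the overall assessments coincide at the midpoint.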

Claim identification & assessment

Evaluator	Main research claim (as read)	Belief in claim	Suggested robustness checks
Kaiser	There is a new method to adjust for scale-use differences, implementable with existing vignette data and with new data needing only a few extra questions.	“There certainly is a new method. I’d want comparisons with existing methods.”	See suggestions in report.
Prati	The authors develop an innovative framework to model and adjust for scale heterogeneity, test it with new calibration questions, and show the adjustment can change results in some applications.	“Like 90%. They don’t have the same data quality as their main dataset, but it’s very comprehensive.”	Unclear how well the correction performs with only two CQs and short SWB scales.

Full evaluations

Caspar Kaiser

“This is a major methodological innovation in how we can adjust for differences in scale-use. The empirical component would especially benefit from more diverse and reliable samples.”

Overview. An important question in subjective-wellbeing research is whether responses are interpersonally and intra-personally comparable — it matters especially for regression analyses, because if scale-use differences are correlated with covariates of interest, failing to correct for them can bias the estimates. The paper addresses this in two ways: a new framework for thinking about scale-use heterogeneity, and empirical estimates of how much it matters.

On the method. The core idea — asking respondents a set of “calibration questions” whose perception we assume is shared, then attributing the differences in their answers to scale use — is more rigorous than, and arguably rests on weaker assumptions than, earlier vignette-based approaches. It leans on four key assumptions (an affine relation between people's scales, common perception of each calibration question, response consistency across questions, and independent errors). On the appendix evidence I am convinced the affine assumption approximately holds; the harder trade-off is that “objective” calibration questions (say, the darkness of a circle) are more likely to satisfy common perception, while wellbeing-specific ones are more likely to satisfy response consistency — and it is unclear whether one question can satisfy both.

Relation to existing methods. The standard tool here is the HOPIT model. This paper's assumptions are not strictly weaker or stronger than HOPIT's, so it would be valuable to clarify — theoretically and empirically — how the two relate and how different their results actually are.

Accessibility and panel data. The authors are thorough, which costs some accessibility; I would structure the paper around the most important use — correcting regression coefficients — and ship an R or Stata package. I would also push the method toward panel / individual-fixed-effects settings, which the field increasingly relies on: a time-varying shifter correlated with covariates would still bias fixed-effects estimates, and fixed effects do nothing about the stretcher.

Empirical side. The result I find most important is in Table 6: how much scale-use differences affect estimates of the determinants of life satisfaction looks rather small. The authors don't emphasise it, but if it holds up it is good news for the field. The main caveat is the sample — largely US MTurk respondents — so I would want to see whether the findings hold in other countries and in more traditional survey panels.

Conclusion. Notwithstanding these comments, this is one of the most important recent papers on the interpersonal comparability of wellbeing data. It tentatively suggests scale-use differences may be less of a problem than feared — which would make measures like the WELLBY more viable — though there is clearly more work to do, and it may yet overturn that conclusion.

Lightly condensed from Caspar Kaiser's evaluation, in his own words. Full text and references: unjournal.pubpub.org/pub/e1heterogenity.

Alberto Prati

“This is an extraordinary paper. It approaches a fundamental issue in wellbeing measurement, and does so constructively, by suggesting and testing a potential solution.”

Overall evaluation. This is an extraordinary paper. It is the kind of methodological research one wants to see more often. It approaches a fundamental issue in wellbeing measurement, and does so constructively, by suggesting and testing a potential solution. The contribution is strong in both its theoretical and empirical parts. As for the former, the authors offer a deep reflection on the problem of scale-use heterogeneity, connect it with the social-science literature, give a theoretically informed account of how to think about it, and suggest a sound solution for estimation. The empirical effort is impressive too: the working-paper analysis provides a useful proof of concept, supplemented by additional data from a large representative sample (Understanding America Study) in the forthcoming version.

The model is very well thought out. The use of a shifter and a stretcher parameter makes a lot of sense. Some choices might go unnoticed by an unfamiliar reader, but recentring the shifter, conditioning results on a question’s “height”, and the distinction between “dimensional scale use” and “general scale use” are actually smart innovations.

More comments about the limits. The paper has limits not because of any fault in methods or reasoning, but because a single study cannot solve all problems of response-scale heterogeneity. This is proper to a research agenda, and the current paper already provides a substantial leap forward.

1. Adding calibration questions is costly. The evidence is based on a large number of calibrating questions (CQ). It is not entirely clear how well the correction performs when only two or three CQs are used (the realistic scenario). Even two CQs can be a substantial burden in large surveys given tight space constraints, and could be cognitively demanding. I suspect this is one crucial reason anchoring vignettes have not been implemented at scale in 20 years.

Excerpted from Alberto Prati's evaluation, in his own words. Full text and references: unjournal.pubpub.org/pub/e2heterogenity.

Author response

“The length and thoroughness of the evaluations clearly demonstrate the significant time and intellectual effort the evaluators invested. We are grateful for their insightful and constructive comments. Since the revised paper is still forthcoming, we do not provide a detailed point-by-point response at this stage; we find the suggestions very valuable and will carefully consider them — particularly the question of how few calibration questions can be used — as we finalize the revision. We welcome this public scientific discourse.”

— The authors (response on file)

Process notes & followups

Why we chose this paper: Prominence, relevance to ongoing practice, and a direct link to our Pivotal Questions project on the WELLBY measure.
Conflicts of interest: Standard Unjournal disclosure applies; evaluators were selected for complementary methodological and applied expertise.
Status: Interim evaluation. A follow-up evaluation is planned once the revised paper (with the Understanding America Study data) is released.
Guidelines & process: unjournal.org