Adjusting for Scale-Use Heterogeneity in Self-Reported Well-Being

Daniel J. Benjamin · Kristen Cooper · Ori Heffetz · Miles S. Kimball · Jiannan Zhou

NBER working paper · DOI: 10.3386/w31728 · Status: interim version

Different people use survey scales differently — one person's "7 out of 10" may be another's "5 out of 10." This scale-use heterogeneity has, for decades, slowed economists' adoption of subjective well-being data. The paper proposes a framework to correct for it.

A shifter parameter captures where a respondent centres their scale.
A stretcher parameter captures how spread out their scale is.
Both are estimated from a small number of extra calibration questions (CQs).
Applying the correction can change substantive results — e.g. well-being comparisons across groups.
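To make the shifter and stretcher idea concrete, here is a minimal sketch. It is not the authors' estimator. It assumes a simple linear scale-use model in which each respondent's rating equals their shifter plus their stretcher times a common underlying level for each calibration question, and it uses the cross-respondent mean rating of each CQ as a stand-in for that common level. The function names `estimate_scale_use` and `adjust_report` and the toy numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def estimate_scale_use(cq_responses):
    """Estimate a per-respondent shifter and stretcher from CQ ratings.

    cq_responses: array of shape (n_respondents, n_cqs).
    Illustrative assumption: rating_ij = shifter_i + stretcher_i * level_j,
    with level_j proxied by the mean rating of CQ j across respondents.
    """
    r = np.asarray(cq_responses, dtype=float)
    consensus = r.mean(axis=0)                 # proxy for each CQ's common level
    centred = consensus - consensus.mean()
    # Per-respondent least-squares slope of own ratings on the consensus levels.
    own_centred = r - r.mean(axis=1, keepdims=True)
    stretcher = own_centred @ centred / (centred ** 2).sum()
    # Matching least-squares intercept.
    shifter = r.mean(axis=1) - stretcher * consensus.mean()
    return shifter, stretcher


def adjust_report(raw, shifter, stretcher):
    """Map raw self-reports back onto the common scale under the same model."""
    return (np.asarray(raw, dtype=float) - shifter) / stretcher


# Toy, noiseless data: two respondents rating the same three CQs.
cq = np.array([[3.5, 6.0, 8.5],    # rates everything higher and more spread out
               [0.5, 2.0, 3.5]])   # rates everything lower and more compressed
shifter, stretcher = estimate_scale_use(cq)

# Raw well-being reports of 7.25 and 2.75 both map to about 5.0 here.
print(adjust_report([7.25, 2.75], shifter, stretcher))
```

In this toy case the two raw reports differ by 4.5 points, yet the adjusted values coincide, which is the sense in which applying the correction can change group comparisons. With real data the ratings are noisy and bounded, so estimates from only a few CQs would be far less precise than this noiseless example suggests.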

Why it matters — the WELLBY connection

The correction is directly relevant to the WELLBY measure used in global-priorities cost-effectiveness analysis. If scale-use differences bias raw well-being comparisons, then cost-effectiveness rankings built on them may shift. The Unjournal selected the paper partly for its link to our Pivotal Questions project on the WELLBY measure.

The dialogue

Each thread is one topic. Within it: evaluator critique → author response → editor note.

Editor · Evaluation manager · 24 Nov 2025

The authors made clear this is an interim version; updates are forthcoming. We evaluated it anyway because of its prominence, relevance to ongoing practice, and link to our Pivotal Questions project on the WELLBY measure. Both evaluators were aware of this; Prati's report explicitly considered updates presented in recent seminars. We consider this an interim evaluation and aim to follow up with further evaluations and updates when the revised paper is released.

Evaluator · Alberto Prati

The model is very well thought out. The use of a shifter and a stretcher parameter makes a lot of sense. Some choices might go unnoticed by an unfamiliar reader, but recentring the shifter, conditioning results on a question's "height", and the distinction between "dimensional scale use" and "general scale use" are actually smart innovations.

The contribution is strong in both its theoretical and empirical parts. The authors offer a deep reflection on the problem of scale-use heterogeneity, connect it with the social-science literature, give a theoretically informed account of how to think about it, and suggest a sound solution for estimation.

Evaluator · Caspar Kaiser

A major methodological innovation in how we can adjust for differences in scale-use. The framework is elegant and the estimation strategy is sound.

Rating note — Methods: 90 (90% CI 80–100). The framework itself is not in question; the reservations attach to the empirical demonstration, not the theory.

Editor · Manager note

Both evaluators rate Methods highly (Kaiser 90, Prati 95). The disagreement across the package is not about the framework but about how convincingly the interim data demonstrate it — see the Samples and Calibration threads.

Evaluator · Alberto Prati

Adding calibration questions is costly. The evidence is based on a large number of calibrating questions (CQs). It is not entirely clear how well the correction performs when only two or three CQs are used — the realistic scenario.

Even two CQs can be a substantial burden in large surveys given tight space constraints, and could be cognitively demanding. I suspect this is one crucial reason anchoring vignettes have not been implemented at scale in 20 years.

Robustness suggestion: it is unclear how well the correction performs with only two CQs and short SWB scales. This is the single most important practical question for adoption.

Author · Authors' response

We find the suggestions very valuable and will carefully consider them — particularly the question of how few calibration questions can be used — as we finalize the revision.

Editor · Manager note

Both evaluators independently flagged the cost and minimum-count of CQs. The forthcoming revision with the Understanding America Study should let a follow-up evaluation test this directly.

Evaluator · Caspar Kaiser

The empirical component would especially benefit from more diverse and reliable samples, and from direct comparisons against existing scale-correction methods so readers can judge incremental value.

There certainly is a new method — implementable with existing vignette data and, for new data, needing only a few extra questions. I'd want comparisons with existing methods.

Claim as Kaiser read it: there is a new method to adjust for scale-use differences, implementable with existing vignette data and with new data needing only a few extra questions.

Evaluator · Alberto Prati

The empirical effort is impressive: the working-paper analysis provides a useful proof of concept, supplemented by additional data from a large representative sample (Understanding America Study) in the forthcoming version. Belief in the main claim: about 90%. They don't have the same data quality as their main dataset, but it's very comprehensive.

Author · Authors' response

Since the revised paper is still forthcoming, we do not provide a detailed point-by-point response at this stage; we will carefully consider the suggestions as we finalize the revision.

Evaluator · Caspar Kaiser

Logic and communication could be tightened in places — rated lower here than other dimensions (75; 90% CI 60–90).

Evaluator · Alberto Prati

Some choices might go unnoticed by an unfamiliar reader, but the reasoning is transparent and the innovations (recentring the shifter, conditioning on a question's "height") are clearly motivated. Rated 95 (90% CI 89–100).

Editor · Manager note

This is the criterion with the widest gap between evaluators (Kaiser 75 vs Prati 95) — see how their credible intervals overlap in the ratings chart.

Evaluator · Alberto Prati

This is the kind of methodological research one wants to see more often. It approaches a fundamental issue in well-being measurement constructively, by suggesting and testing a potential solution. Real-world relevance and relevance to global priorities both rated 86 (90% CI 74–95).

Editor · Manager note

Kaiser did not rate the two relevance criteria (recorded as not rated, not zero). Prati did; both are tied to the WELLBY use-case that motivated our selection of the paper.

Ratings at a glance

Eight shared criteria, 0–100. Point = midpoint; horizontal line = 90% credible interval.

Journal rank tier (0–5)

0 little value · 1 somewhat valuable · 2 decent field journal · 3 strong field journal · 4 top field journal · 5 A-journal / top journal

Kaiser 4.7 · Prati 5.0

Author response

Consolidated reply from the authors. Topic-specific replies are interleaved in the dialogue threads above.

Author · Benjamin, Cooper, Heffetz, Kimball & Zhou

The length and thoroughness of the evaluations clearly demonstrate the significant time and intellectual effort the evaluators invested. We are grateful for their insightful and constructive comments.

Since the revised paper is still forthcoming, we do not provide a detailed point-by-point response at this stage; we find the suggestions very valuable and will carefully consider them — particularly the question of how few calibration questions can be used — as we finalize the revision. We welcome this public scientific discourse.

Process & follow-ups

Transparency on how this evaluation was run, and what comes next.

Editor · Unjournal process

Why we chose this paper: prominence, relevance to ongoing practice, and a direct link to our Pivotal Questions project on the WELLBY measure.

Conflicts of interest: standard Unjournal disclosure applies; evaluators were selected for complementary methodological and applied expertise.

Status: interim evaluation. A follow-up evaluation is planned once the revised paper — with the Understanding America Study data — is released.

Process & guidelines: unjournal.org