“This is a major methodological innovation in how we can adjust for differences in scale-use. The empirical component would especially benefit from more diverse and reliable samples.”
Read evaluation (condensed from Kaiser's full report)
Overview. An important question in subjective-wellbeing research is whether responses are interpersonally and intrapersonally comparable. This matters especially for regression analyses: if scale-use differences are correlated with covariates of interest, failing to correct for them can bias the estimates. The paper addresses the question in two ways: a new framework for thinking about scale-use heterogeneity, and empirical estimates of how much it matters.
On the method. The core idea is to ask respondents a set of “calibration questions” that they are assumed to perceive in the same way, and then to attribute the differences in their answers to scale use. This is more rigorous than earlier vignette-based approaches and arguably rests on weaker assumptions. It leans on four key assumptions (an affine relation between people's scales, common perception of each calibration question, response consistency across questions, and independent errors). Based on the evidence in the appendix, I am convinced that the affine assumption approximately holds. The harder trade-off is that “objective” calibration questions (say, the darkness of a circle) are more likely to satisfy common perception, while wellbeing-specific ones are more likely to satisfy response consistency, and it is unclear whether a single question can satisfy both.
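To make the affine assumption concrete, here is a minimal simulated sketch, not the authors' estimator. It assumes each respondent's report equals a person-specific shifter plus a person-specific stretcher times a commonly perceived value, and, for simplicity, that the common value of each calibration question is known to the analyst (in practice it would have to be estimated). The function names, sample sizes, and noise levels are all hypothetical.

```python
import numpy as np

def estimate_scale_use(calib, common):
    """Per-respondent OLS of calibration answers on the assumed common values.

    calib  : (n_respondents, n_questions) reported answers
    common : (n_questions,) assumed shared perception of each question
    Returns (shifter, stretcher), each of shape (n_respondents,).
    """
    x = common - common.mean()
    centred = calib - calib.mean(axis=1, keepdims=True)
    stretcher = centred @ x / (x @ x)                       # OLS slope per person
    shifter = calib.mean(axis=1) - stretcher * common.mean()  # OLS intercept
    return shifter, stretcher

def rescale(report, shifter, stretcher):
    """Map a report back onto the common scale by inverting the affine map."""
    return (report - shifter) / stretcher

# Simulated data (entirely hypothetical numbers).
rng = np.random.default_rng(0)
n, q = 500, 8
common = np.linspace(2.0, 8.0, q)
true_shift = rng.normal(0.0, 1.0, n)
true_stretch = rng.uniform(0.6, 1.4, n)
noise = rng.normal(0.0, 0.2, (n, q))
calib = true_shift[:, None] + true_stretch[:, None] * common + noise

shift_hat, stretch_hat = estimate_scale_use(calib, common)

# Wellbeing reports pass through the same affine scale use
# (this is the response-consistency assumption).
wellbeing = rng.normal(5.0, 1.0, n)
report = true_shift + true_stretch * wellbeing
corrected = rescale(report, shift_hat, stretch_hat)

corr_raw = np.corrcoef(report, wellbeing)[0, 1]
corr_corrected = np.corrcoef(corrected, wellbeing)[0, 1]
print(f"raw: {corr_raw:.2f}, corrected: {corr_corrected:.2f}")
```

In this toy setting the corrected reports track the simulated wellbeing much more closely than the raw reports do. The correction is only as good as the assumptions behind it: if respondents perceived the calibration questions differently, or used a different scale for wellbeing questions, the estimated shifter and stretcher would not transfer.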
Relation to existing methods. The standard tool here is the HOPIT model. This paper's assumptions are not strictly weaker or stronger than HOPIT's, so it would be valuable to clarify — theoretically and empirically — how the two relate and how different their results actually are.
Accessibility and panel data. The authors are thorough, which costs some accessibility; I would structure the paper around the most important use — correcting regression coefficients — and ship an R or Stata package. I would also push the method toward panel / individual-fixed-effects settings, which the field increasingly relies on: a time-varying shifter correlated with covariates would still bias fixed-effects estimates, and fixed effects do nothing about the stretcher.
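The fixed-effects point can be illustrated with a small simulation. This is a sketch under assumptions of my own, not an analysis from the paper: reports are taken to be a shifter plus a stretcher times latent wellbeing, the true effect of the covariate is the same for everyone, and the stretcher differs between two hypothetical groups. All parameter values are made up.

```python
import numpy as np

def within_slope(y, x):
    """Fixed-effects (within) slope: demean y and x by person, then pooled OLS."""
    yd = y - y.mean(axis=1, keepdims=True)
    xd = x - x.mean(axis=1, keepdims=True)
    return float((xd * yd).sum() / (xd * xd).sum())

rng = np.random.default_rng(1)
n, t, beta = 2000, 6, 0.5                      # people, waves, true effect
x = rng.normal(size=(n, t))
person_effect = rng.normal(size=(n, 1))
latent = beta * x + person_effect + rng.normal(scale=0.5, size=(n, t))

shifter = rng.normal(size=(n, 1))              # time-invariant, person-specific
group = rng.integers(0, 2, size=n).astype(bool)
stretcher = np.where(group, 1.5, 1.0)[:, None]  # group 1 uses a wider scale

# Case 1: only a time-invariant shifter. Demeaning removes it.
slope_shift_only = within_slope(shifter + latent, x)

# Case 2: stretcher differs by group. The same true effect shows up as
# different within-person slopes in the two groups.
reported = shifter + stretcher * latent
slope_group0 = within_slope(reported[~group], x[~group])
slope_group1 = within_slope(reported[group], x[group])

# Case 3: the shifter drifts with the covariate. Demeaning cannot
# separate this drift from the true effect.
slope_drift = within_slope(shifter + 0.3 * x + latent, x)

print(f"shifter only: {slope_shift_only:.2f}")
print(f"group 0: {slope_group0:.2f}, group 1: {slope_group1:.2f}")
print(f"time-varying shifter: {slope_drift:.2f}")
```

By construction, the first slope should sit near the true effect of 0.5, the group with the larger stretcher should show a slope near 0.75 despite having the same true effect, and the drifting shifter should push the estimate toward 0.8.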
Empirical side. The result I find most important is in Table 6: how much scale-use differences affect estimates of the determinants of life satisfaction looks rather small. The authors don't emphasise it, but if it holds up it is good news for the field. The main caveat is the sample — largely US MTurk respondents — so I would want to see whether the findings hold in other countries and in more traditional survey panels.
Conclusion. Notwithstanding these comments, this is one of the most important recent papers on the interpersonal comparability of wellbeing data. It tentatively suggests that scale-use differences may be less of a problem than feared, which would make measures like the WELLBY more viable. There is clearly more work to do, though, and that work may yet overturn this conclusion.
Lightly condensed from Caspar Kaiser's evaluation, in his own words. Full text and references: unjournal.pubpub.org/pub/e1heterogenity.