The atheoretical invalidity of life satisfaction scales

Taking all things together, what would you say your life satisfaction is right now on a scale of 1–10? Would you say you are happier now than when you were an undergraduate? Has life improved? Remember your answers—they will be important later.

Some friends recently asked me for my opinion on the following paper:

E. Diener, R. Inglehart and L. Tay (2013), ‘Theory and validity of life satisfaction scales’, Social Indicators Research, vol. 112, no. 3, pp. 497–527

Of the hundreds of papers I have read during my PhD, this is arguably the one that shits me off the most. I just re-read it in preparation for writing this and I was amazed at how much rage I had barfed into my margin notes. My views on this paper are summarised in the title of this piece: life satisfaction scales are atheoretical and invalid. I’ll make three main points.

This graphic is actually deeper than most life satisfaction research


The first is that scale data is impossible to interpret without making assumptions, because there is nothing in the data that allows you to determine whether what you are observing is adaptation, rescaling or preference drift. Life satisfaction research makes these assumptions while pretending that it doesn’t. Furthermore, it rejects most attempts to theoretically engage with life satisfaction scales on the grounds that it wants to maintain a ‘content-free’ measure of wellbeing. This would be nice if such a thing were possible.

The second point is that research using life satisfaction scales is miles away from causal analysis. Over the past 30 years it has barely reached a consensus on whether spinal injury causes long-term reductions in life satisfaction, even though the empirical clinical psych literature says explicitly that it does. Life satisfaction scales produce only correlations, and extremely crude ones at that. These are useless for investigating the kind of granular issues that drive life satisfaction changes for the majority of the population. As such, research with scales continues to fiddle about asking questions we already know the answers to. A favourite example is from the forthcoming The Origins of Happiness, which argues that one of the best ways to increase happiness would be to cure more people of depression. That’s a tautology! More importantly, how do you cure people of depression? That would require a causal analysis. Life satisfaction scale research is so superficial as to be glib.

The third point is that in this paper, as in most papers on this topic by this particular clique of researchers, numerous examples are given of massive measurement problems, yet the authors conclude that scales are valid anyway. I’m always struck by this. It’s as if by merely acknowledging that measurement error exists you somehow make it go away. This is doubly taxing to me because the authors never seem to get stuck into tackling these measurement issues with empirical work using something other than scales.

Scale data is impossible to interpret

Scale data researchers are quick to point out that scale responses change following events that we would expect to cause them to change, like getting divorced, and tend to track other indicators of wellbeing: for example, people tend to report low subjective wellbeing (SWB) prior to committing suicide.

I don’t deny this. I have some concerns about the fact that these results are rarely experimental, but with things as dramatic as suicide, I don’t think this matters much. Nor do I deny that life satisfaction (henceforth losat, because that’s what the variable is called in HILDA) tracks some real changes in wellbeing. What I do contest is that we can meaningfully interpret the data in anything but the most dramatic of circumstances, and even there, that we can get an accurate estimate of the size of the change. My concerns on this front are only compounded by empirical estimates of the test-retest coefficient for losat data, which is a mere 0.6 at 2 weeks (Krueger & Schkade 2008)!
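To see what a test-retest coefficient of 0.6 implies, here is a minimal sketch under classical test theory (my toy numbers, not Krueger & Schkade’s data): if the underlying trait really is stable over 2 weeks, a retest correlation of 0.6 means roughly 40% of the variance in responses is transient noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Classical test theory: observed = true + occasion noise, and the
# test-retest correlation equals var(true) / var(observed). Variances
# are chosen here so that noise is 40% of total variance, which is what
# a 2-week retest correlation of 0.6 implies if the trait is stable.
true_wellbeing = rng.normal(0.0, np.sqrt(0.6), n)
wave1 = true_wellbeing + rng.normal(0.0, np.sqrt(0.4), n)
wave2 = true_wellbeing + rng.normal(0.0, np.sqrt(0.4), n)

print(np.corrcoef(wave1, wave2)[0, 1])  # ~0.6: 40% of variance is noise
```

That is a lot of mood-of-the-day contaminating an instrument that is supposed to capture a stable evaluation of one’s life.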

The easiest way to see this is by comparing the adaptation hypothesis to the alternatives: preference drift and rescaling. Adaptation is one of the main ‘findings’ of losat research. It says that people get used to changes in their life circumstances, sometimes quite quickly, and so losat is defined by a fairly stable ‘set-point’ (slightly different to a mean) over time. This set-point is in turn determined mostly by genetic factors, especially whether you are a glass half-full or half-empty type, and whether you are an extrovert or a neurotic introvert. In losat research, people seem to adapt to even really major stuff, notably spinal injury and winning the lottery (cf. Brickman, Coates & Janoff-Bulman 1978).

Rescaling is an alternative hypothesis that explains what we observe in the data. It argues that the qualitative meaning of the points on people’s scales changes over time, such that an 8/10 at time x and an 8/10 at time y mean two different things. So for example, an 8/10 for Stalin before his spinal injury reflects a very different underlying quantity of utility from his 8/10 report after his spinal injury. He is using a different scale in each case (Blanton & Jaccard 2006 is the primary reference for this argument).

If you think about this for a minute, I’ll wager you’ll find rescaling a reasonable hypothesis. Think back to yourself as a 3rd year university student. Imagine that person was asked their life satisfaction on a scale of 1–10. What would you have said? What did you say to the question when I asked it at the very beginning of this article? Is your answer the same today as it would have been back then? And has your life gotten better, in your own estimation? I daresay it has, and yet you give the same answer. What’s up with that? Could it be that the meaning of the points on your scale has changed? My view is that our answers to losat questions are massively conditioned by the ambiguities of our present circumstances, and so the numbers do not statically reflect underlying quantities of utility, but vague, churning intuitions largely informed by present mood. (Some losat psychologists say this is precisely what they are interested in, not underlying utility. The economists don’t have that excuse.)

There is one piece of rather crisp empirical evidence for rescaling. It comes from Stillman et al—pre-press version available here, see pg. 12. They study migrants from Tonga to New Zealand. Migration to New Zealand is by way of a lottery system, with about 10% of applicants getting visas. Stillman et al interview both successful and unsuccessful applicants, yielding groups that are balanced on observable characteristics. It’s not the tightest identification, but it’s pretty damn good—much better than the panel data work on losat change. Migrants from the 2002–2005 cohorts are interviewed in 2005. On average, they have been in New Zealand for 11 months. The migrants give the same losat reports as unsuccessful applicants despite large increases in objective indicators of wellbeing, notably real income. One might conclude that they have perfectly adapted to their new circumstances. Or they might have rescaled, and indeed, that seems to be the case. When asked how they felt before migration, the successful applicants say they were less satisfied than they are now, even though the scale responses of the two groups are the same. Stillman et al elaborate:

The migrants appear to perceive that they are better than they were in Tonga, but rather than that taking the form advancing up a subjective ladder they instead demote their previous position. To the extent that this sort of filtering may occur more widely when frames of reference change, retrospective questions about changes in subjective well-being may be an unreliable guide to actual changes (pg. 12, emphasis added).

Given that this is a randomised situation, meaning that the migrants are essentially the same people as those who stayed except for the fact that they migrated, this suggests that rescaling occurs rapidly. It also suggests that interpersonal and especially intertemporal comparisons between respondents are invalid, because different respondents are using different scales. You can still use scales within one person at one time, though (for a good application of such data, see Frijters & Mujcic 2012).
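To pin down the comparison, here is a minimal sketch with invented numbers (mine, not Stillman et al’s estimates). The lottery losers stand in for what the winners would have reported had they stayed in Tonga:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented illustrative data, NOT Stillman et al's actual estimates.
losers_now   = rng.normal(7.0, 1.5, 200)  # unsuccessful applicants, in Tonga
winners_now  = rng.normal(7.0, 1.5, 200)  # migrants, ~11 months in NZ
winners_then = rng.normal(6.0, 1.5, 200)  # winners' recalled pre-migration losat

# Flat contemporaneous reports look like perfect adaptation...
print(winners_now.mean() - losers_now.mean())   # ~0: no detectable gap

# ...but winners demote their remembered past relative to the
# counterfactual, which is exactly the signature rescaling predicts.
print(winners_then.mean() - losers_now.mean())  # ~-1: perceived improvement
```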

So far I’ve only discussed rescaling and I feel like I’ve already detonated the usefulness of scales, but there’s also preference drift. This is a situation where the numbers you report reflect the same amount of underlying utility, but the things that determine which number you report have changed. The three interpretations (adaptation, rescaling and preference drift) are summarised in the graphs below:
[Three graphs comparing adaptation, rescaling and preference drift]
There is nothing in scale data that you can use to determine whether what you are observing is straightforward change, adaptation, rescaling or preference drift. To my mind, that makes scales uninterpretable, and so I don’t know why we keep expending so many resources on them rather than on other empirical instruments.
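To make the identification problem concrete, here is a toy model (my construction, not from any of the papers above): three different latent stories that generate exactly the same observed reports after a shock, so the data alone cannot tell you which one you are looking at.

```python
import numpy as np

# A shock (say, spinal injury) hits at t=5. Each story pairs a latent
# utility path with a scale shift such that report = utility - shift.
observed = np.array([8, 8, 8, 8, 8, 5, 6, 7, 8, 8], dtype=float)  # all we see
t = np.arange(len(observed))

# Story 1 -- adaptation: the scale is fixed and utility itself dips,
# then returns to its set-point.
utility_adapt = observed.copy()
shift_adapt = np.zeros_like(observed)

# Story 2 -- rescaling: utility drops permanently, but the meaning of
# the scale points drifts down with it, so the reports recover anyway.
utility_rescale = np.where(t >= 5, 5.0, 8.0)
shift_rescale = utility_rescale - observed        # 0, ..., 0, -1, -2, -3, -3

# Story 3 -- preference drift: utility only partly recovers, and the
# standards behind each number drift to cover the remaining gap.
utility_drift = np.array([8, 8, 8, 8, 8, 5, 6, 7, 7, 7], dtype=float)
shift_drift = utility_drift - observed

# All three stories reproduce the observed data exactly:
for u, s in [(utility_adapt, shift_adapt),
             (utility_rescale, shift_rescale),
             (utility_drift, shift_drift)]:
    assert np.allclose(u - s, observed)
print("same data, three different realities")
```

Any pair of utility path and scale shift whose difference equals the observed series fits equally well; nothing in the reports pins down the split.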

Causality

In large part because of the interpretation problems mentioned above, understanding causal change in life satisfaction using scales is pretty damn hard. Consequently, even after more than 30 years of research, the field is still struggling to answer questions we had good answers to before losat research even began. I’ll jump straight to my favourite example, which is the checklist of ‘causes’ of happiness from the forthcoming The Origins of Happiness. Here’s a summary. Spoiler alert: they’re anything but.

The authors emphasise the following causes of happiness: income, education, unemployment, not being a criminal, being partnered, physical health and not being mentally ill. Like, duh. But this is even worse than duh. Income is not a cause of happiness. As economists, they should know that it’s what you buy with money that makes you happy, so income is definitely not an origin of happiness. Similar things could be said about health. That depressed people aren’t happy is a tautology. Moreover, the history of philosophical and psychological inquiry into wellbeing has been substantially about how to overcome mental illness. It’s not meaningful to say, as they do, that policymakers should aim to ‘abolish mental illness’, as though it’s just a matter of clicking your fingers. The causality of these things lies deeper, in issues like meaning, autonomy, authenticity, relationships, etc. This research is similar to (though much worse than) the mid-20th century work on growth: it looks at the proximate causes of a phenomenon rather than the deep determinants, e.g. what causes mental illness in the first place. Meanwhile, they say ‘at last the map of happiness is becoming clearer’ even though their R² is only 0.19!
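For a sense of what an R² of 0.19 buys you, here is a minimal sketch (toy data engineered to match that headline figure, not their actual regression): even at the advertised fit, individual predictions are barely tighter than guessing the mean.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Toy data engineered so the model "explains" 19% of losat variance.
explained = rng.normal(0.0, np.sqrt(0.19), n)  # what the regressors capture
residual = rng.normal(0.0, np.sqrt(0.81), n)   # everything they miss
losat = explained + residual

r2 = 1 - np.var(losat - explained) / np.var(losat)
print(round(r2, 2))                            # ~0.19

# Residual spread vs. total spread: predictions are ~90% as noisy as
# simply guessing the mean, since sqrt(1 - 0.19) is about 0.90.
print(np.std(losat - explained) / np.std(losat))
```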

Ironically, the article I linked to above ends with this imperative: ‘we now need thousands of well-controlled trials of specific policies from which we can obtain estimates of the effects on life satisfaction in the near and long term’. This is exactly what clinical psychology has been doing for 100 years! Yet I have not once seen a reference to clinical psychology texts—to things like self-determination theory, self-discrepancy theory, terror-management theory, etc.—in the economics of happiness literature. The literature has spent 40 years getting itself to the point where it is ready to read what everyone else has written already.

I don’t think life satisfaction scales are going to be very helpful in the causal analysis that needs to be done. We need more granularity, more qualitative information, than what is available in those scales. The clinical and personality psych work on these matters uses far more sophisticated instruments, which have already yielded far more interesting results than anything employed in losat research.

Measurement error exists, therefore everything is fine

The last point I want to bring up is the bizarre tendency among losat researchers to point out all the measurement error in their instrument and then conclude that it is valid anyway. The terrifying thing about this is that such articles have now been published by several generations of researchers in good journals, and people are starting to take the validity of these scales seriously when it has never actually been established.

I’m just going to quote a bunch from the article itself by way of examples:

The information attended to at the time of the survey response—whether it is chronically accessible or situationally primed—can have a substantial influence on reported life satisfaction. Pg. 12
Information can be “primed” or made salient by situational factors occurring before or during the time the life satisfaction question is posed. For example, Oishi, Schimmack and Colcombe (2003) systematically primed “excitement” and “peace” and found that the prime shifted the basis of life satisfaction judgements. Pg. 12 (So which judgement is the true judgement?)
In a well-known study of item-order effects, Strack, Martin and Schwarz (1988) found that life satisfaction showed a much stronger association with dating satisfaction when it was asked second, and a smaller correlation when the dating question came second. Pavot and Diener (1993a) were able to replicate this effect, but it was nonsignificant and in the opposite direction when the multi-item life satisfaction scale was used rather than a single-item scale. The effect was also reversed when subjects had previously conducted a systematic memory search of the most important areas of their lives. Pg. 13 (So which response is the ‘true’ response?)
The fact that item order effects are often small and appear to be eliminated with a single buffer item, as well as when multi-item scales of life satisfaction are used, suggests that these effects, when they do occur, might be due to altering respondents’ interpretation of these questions. Pg. 14
People sometimes report greater SWB when interviewed in a face-to-face survey than in an anonymous interview. Pg. 19
To parse these differences into artifactual response tendencies that are unrelated to true life satisfaction versus real difference in life satisfaction is challenging, in part because response tendencies in part reflect differences in how people feel and think about the world and themselves. Furthermore, cultural differences may in some cases be relevant to policy and in some cases irrelevant. Pg. 22

To me, these all seem like catastrophic problems. There is no way to assess what the true response is if the researcher can manipulate it through priming and question ordering, and if it is contaminated with non-random measurement error pertaining to the respondent’s mood, e.g. arising from whether they missed the bus that morning. But I don’t need to make this point: the authors admit it!

Error of measurement occurs in all fields, including advanced sciences such as particle physics…In fields ranging from physics to epidemiology to biochemistry to economics, measurement error is unavoidable. Thus, scientists must work consistently to reduce measurement error, and also must take such error into consideration in their conclusions. We cannot, however, automatically dismiss findings or fields because of measurement error because science would halt if we did so. Pg. 23
Ummm… okay… So you’re saying that there’s measurement error but we should keep going anyway. What about if we just changed to a different instrument? My view is that we should ditch losat scales and use other instruments, not that we should abandon research into wellbeing. At the very least, losat researchers need to investigate whether rescaling is occurring. I’ve got various other empirical techniques that I think need a run, but I’ll leave them for my PhD.

