Taking all things together, what would you say your life
satisfaction is right now on a scale of 1–10? Would you say you are happier now
than when you were an undergraduate? Has life improved? Remember your answers—they
will be important later.
Some friends recently asked me for my opinion on the
following paper:
E. Diener, R. Inglehart and L. Tay (2013), ‘Theory and
validity of life satisfaction scales’, Social
Indicators Research, vol. 112, no. 3, pp. 497–527
Of the hundreds of papers I have read during my PhD, this
is arguably the one that shits me off the most. I just re-read it in
preparation for writing this and I was amazed at how much rage I had barfed
into my margin notes. My views on this paper are summarised in the title to
this piece: life satisfaction scales are atheoretical and invalid. I’ll make 3
main points.
[Image caption: This graphic is actually deeper than most life satisfaction research]
The first is that scale data is impossible to interpret
without making assumptions, because there is nothing in the data that allows
you to determine whether what you are observing is an adaptation, rescaling or
preference drift. Life satisfaction research makes these assumptions while
pretending that it doesn’t. Furthermore, it rejects most attempts to theoretically
engage with life satisfaction scales on the grounds that it wants to maintain a
‘content-free’ measure of wellbeing. This would be nice if such a thing were
possible.
The second point is that research using life satisfaction
scales is miles away from causal analysis. Over the past 30 years it has barely
gotten to a consensus about whether spinal injury causes long-term reductions
in life satisfaction, even though all the empirical clinical psych literature
says this explicitly. Life satisfaction scales produce only correlations, and
extremely crude ones at that. These are useless for
investigating the kind of granular issues that inform life satisfaction changes
for the majority of the population. As such, research with scales continues to
fiddle about asking questions we already know the answers to. A favourite
example is from the forthcoming The Origins
of Happiness, which argues that one of the best ways to increase happiness
would be to cure more people of depression. That’s a tautology! More importantly,
how do you cure people of depression? That would require a causal analysis. Life satisfaction scale
research is so superficial as to be glib.
The third point is that in this paper, like in most papers
on this topic by this particular clique of researchers, there are numerous
examples given of massive measurement problems, but the authors conclude that
scales are valid anyway. I’m always struck by this. It’s as if by merely
acknowledging that measurement error exists you somehow make it go away. This
is doubly taxing to me because the authors never seem to get stuck into
tackling these measurement issues with empirical work using something other
than scales.
Scale data is impossible to interpret
Scale data researchers are quick to point out that scale
responses change following events that we would expect to cause them to change,
like getting divorced, and tend to track other indicators of wellbeing: for
example, people tend to report low subjective wellbeing (SWB) prior to
committing suicide.
I don’t deny this. I have some concerns about the fact that
these results are rarely experimental, but with things as dramatic as suicide,
I don’t think this matters much. I don’t deny that life satisfaction
(henceforth losat, because that’s what the variable is called in HILDA) tracks
some changes in underlying wellbeing. What I do contest is that we can
meaningfully interpret the data in anything but the most dramatic of
circumstances, or, even there, that we can get an accurate estimate of the
change. My concerns on this front are only compounded by empirical estimates of
the test-retest coefficient for losat data, which is a mere 0.6 at 2 weeks
(Krueger & Schkade 2008)!
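To get a feel for what a test-retest reliability of 0.6 implies, here is a minimal sketch using the classical-test-theory attenuation formula. The 0.6 comes from Krueger & Schkade; everything else here is my own illustration with hypothetical numbers:

```python
import math

def observable_corr(true_corr, rel_x, rel_y):
    """Classical test theory: the correlation observable between two noisy
    measures is the true correlation attenuated by the square root of the
    product of their reliabilities."""
    return true_corr * math.sqrt(rel_x * rel_y)

# A reliability of 0.6 means roughly 40% of the variance in any single
# losat report is measurement noise rather than signal.
noise_share = 1 - 0.6

# Even if losat tracked some outcome PERFECTLY (true correlation of 1.0),
# the correlation we could observe against another measure with the same
# reliability is capped well below 1:
cap = observable_corr(1.0, 0.6, 0.6)
print(round(noise_share, 2), round(cap, 2))  # 0.4 0.6
```

In other words, a large chunk of the week-to-week movement in losat data is noise before any substantive interpretation even begins.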
The easiest way to see this is by comparing the adaptation
hypothesis to the alternatives: preference drift and rescaling. Adaptation is
one of the main ‘findings’ of losat research. It says that people get used to
changes in their life circumstances over time, sometimes quite quickly, and so
losat is defined by a fairly stable ‘set-point’ (slightly different to a mean)
over time. This set-point is in turn determined mostly by genetic factors,
especially whether you are a glass half-full or half-empty type, and whether
you are an extrovert or a neurotic introvert. In losat research, people seem to
adapt to even really major stuff, notably spinal injury and winning the lottery
(cf. Brickman & Coates 1978).
Rescaling is an alternative hypothesis that explains what we
observe in the data. It argues that the qualitative meaning of the points on
people’s scales changes over time, such that an 8/10 at time x and an 8/10 at
time y mean two different things. So for example, an 8/10 for Stalin before his
spinal injury means a very different thing in terms of the underlying quantity
of utility reflected than his 8/10 report after his spinal injury. He is using
different scales in each case (Blanton & Jaccard 2006 is the primary
reference for this argument). If you think about this for a minute, I’ll wager
you think rescaling is a reasonable hypothesis. Think back to yourself as a 3rd
year university student. Imagine that person was asked their life satisfaction
on a scale of 1–10. What would you have said? What did you say to the question
when I asked it at the very beginning of this article? Is your answer the same today
as it would have been back then? And has your life gotten better, in your own
estimation? I daresay it has, and yet you give the same answer. What’s up with
that? Could it be that the meaning of the points on your scale has changed? My
view is that our answers to losat questions are massively conditioned by the ambiguities
of our present circumstances, and so the numbers do not statically reflect underlying
quantities of utility, but vague, churning intuitions largely informed by
present mood. (Some losat psychologists say this is precisely what
they are interested in, not underlying utility. The economists don’t have that
excuse.)
There is one piece of rather crisp empirical evidence for
rescaling. It comes from Stillman et al. (see pg. 12 of the pre-press version).
They study migrants from Tonga to New Zealand. Migration to New
Zealand is by way of a lottery system, with about 10% of applicants getting
visas. Stillman et al interview both successful and unsuccessful migrants, achieving
randomisation on observable characteristics. It’s not tightest identification
but it’s pretty damn good—much better than the panel data work on losat change.
Migrants from the 2002–2005 cohorts are interviewed in 2005. On average, they
have been in New Zealand for 11 months. The migrants give the same losat
reports as unsuccessful applicants despite large increases in objective indicators
of wellbeing, notably real income. One might conclude that they have perfectly
adapted to their new circumstances. Or they might have rescaled, and indeed,
that seems to be the case. When asked how they felt before migration, the successful
applicants say they were less satisfied than they are now, even though the
scale responses between the two groups are the same. Stillman et al elaborate:
The migrants appear to perceive that they are better off than
they were in Tonga, but rather than that taking the form of advancing up a
subjective ladder they instead demote their previous position. To the extent
that this sort of filtering may occur more widely when frames of reference change, retrospective questions about changes
in subjective well-being may be an unreliable guide to actual changes (pg. 12,
emphasis added).
Given that this is a randomised situation, meaning that the
migrants are essentially the same people as those who stayed except for the
fact that they migrated, this suggests that rescaling occurs rapidly. It also suggests that interpersonal and
especially intertemporal comparisons between respondents are invalid, because different
respondents are using different scales. You can still use scale data within
one person at one point in time, though (for a good application of such data,
see Frijters & Mujcic 2012).
So far I’ve only discussed rescaling and I feel like I’ve
already detonated the usefulness of scales, but there’s also preference drift.
This is a situation where the numbers you report reflect the same amount
of underlying utility, but the things that determine which number you give
have changed. The three interpretations (adaptation, rescaling and
preference drift) are summarised in the graphs below:
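To make the identification problem concrete, here is a minimal sketch with made-up numbers: three data-generating processes, one per hypothesis, that produce literally identical scale reports from different underlying utility paths (all levels and shock sizes are hypothetical):

```python
WAVES = range(8)        # survey waves; a good life event hits at wave 3
BASE, SHOCK = 6.0, 2.0  # hypothetical utility levels

def adaptation():
    """Utility genuinely spikes at the event and returns to the set-point."""
    utility = [BASE + (SHOCK if w == 3 else 0.0) for w in WAVES]
    return utility, list(utility)  # honest, fixed scale: report = utility

def rescaling():
    """Utility stays permanently higher, but the respondent relabels the
    scale afterwards ('8 is the new 6'), so reports fall back anyway."""
    utility = [BASE + (SHOCK if w >= 3 else 0.0) for w in WAVES]
    report = [u - (SHOCK if w > 3 else 0.0) for w, u in zip(WAVES, utility)]
    return utility, report

def preference_drift():
    """Utility settles at a partially higher level, but the mapping from
    circumstances to a number drifts, again masking the real gain."""
    utility = [BASE + (SHOCK if w == 3 else (1.0 if w > 3 else 0.0))
               for w in WAVES]
    report = [u - (1.0 if w > 3 else 0.0) for w, u in zip(WAVES, utility)]
    return utility, report

utils, reports = zip(*(f() for f in (adaptation, rescaling, preference_drift)))
# The observed report series are identical under all three hypotheses...
assert all(r == reports[0] for r in reports)
# ...even though the underlying utility series are not.
assert utils[0] != utils[1] and utils[1] != utils[2]
print(reports[0])  # [6.0, 6.0, 6.0, 8.0, 6.0, 6.0, 6.0, 6.0]
```

A researcher who only sees the reports has no way to tell which process generated them; choosing one is an assumption, not a finding.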
There is nothing in scale data that you can use to
determine whether what you are observing is straightforward change, adaptation,
rescaling or preference drift. To my mind, that makes scales uninterpretable
and so I don’t know why we keep expending so many resources on them rather than
other empirical instruments.
Causality
In large part because of the interpretation problems
mentioned above, understanding causal change in life satisfaction using scales
is pretty damn hard. Consequently, even after more than 30 years of research,
losat research is still struggling to answer questions that we had good answers
to before losat research even began. I’ll jump straight to my favourite
example, which is the checklist of ‘causes’ of happiness from the forthcoming The Origins of Happiness. Here’s a summary. Spoiler
alert: they’re anything but.
The authors emphasise the following causes of happiness:
income, education, unemployment, not being a criminal, being partnered,
physical health and not being mentally ill. Like, duh. But this is even worse
than duh. Income is not a cause of
happiness. As economists, they should know that it’s what you buy with money
that makes you happy, so this is definitely not an origin of happiness. Similar things could be said about health.
That depressed people aren’t happy is a tautology. Moreover, the history of
philosophical and psychological inquiry into wellbeing has been substantially
about how to overcome mental illness. It’s not meaningful to say, as they do,
that policymakers should aim to ‘abolish mental illness’, as though it’s just a
matter of clicking your fingers. The causality of these things lies deeper, in
issues like meaning, autonomy, authenticity, relationships etc. This research
is similar to (though much worse than) work on economic growth in the mid-20th
century. It looks at the proximate causes of a phenomenon rather than the
deep determinants, e.g. what causes mental illness. Meanwhile, the authors say ‘at last
the map of happiness is becoming clearer’ even though their R² is only
0.19!
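For a sense of scale, here is a back-of-envelope sketch of what an R² of 0.19 buys you in predictive terms. The 0.19 is theirs; the SD figure is a hypothetical round number of my own, not from the book:

```python
import math

r_squared = 0.19
unexplained = 1 - r_squared  # 81% of the variance in losat is unexplained

# Suppose losat has a cross-sectional SD of about 2 points on the 0-10
# scale (hypothetical). The model's residual SD is then:
sd = 2.0
residual_sd = sd * math.sqrt(unexplained)
print(round(residual_sd, 2))  # 1.8 -- barely better than guessing the mean
```

That is, after fitting every ‘origin of happiness’ they list, the typical prediction error shrinks only from 2 points to about 1.8.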
Ironically, the article I linked to above ends with this
imperative: ‘we now need thousands of well-controlled trials of specific
policies from which we can obtain estimates of the effects on life satisfaction
in the near and long term’. This is exactly what clinical psychology has been
doing for 100 years! Yet I have not once seen a reference to clinical psychology
texts—to things like self-determination theory, self-discrepancy theory,
terror-management theory etc.—in the economics of happiness literature. The
literature has spent 40 years getting itself to the point where it is ready to
read what everyone else has written already.
I don’t think life satisfaction scales are going to be very
helpful in the causal analysis that needs to be done. We need more granularity,
more qualitative information, than what those scales provide. The
clinical and personality psych work on these matters has far more sophisticated
instruments, and these have already yielded far more interesting results than
anything employed in losat research.
Measurement error exists, therefore everything is fine
The last point I want to bring up is the bizarre tendency
among losat researchers to point out all the measurement error in their
instrument, and then conclude that it is valid anyway. The terrifying thing
about this is that these articles have now been published by several
generations of researchers in good journals, and people are now starting to
take the validity of scales as given when it has never actually been established.
I’m just going to quote a bunch from the article itself by
way of examples:
The information attended to at the time of the survey
response—whether it is chronically accessible or situationally primed—can have
a substantial influence on reported life satisfaction. Pg. 12
Information can be “primed” or made salient by situational
factors occurring before or during the time the life satisfaction question is
posed. For example, Oishi, Schimmack and Colcombe (2003) systematically primed “excitement”
and “peace” and found that the prime shifted the basis of life satisfaction
judgements. Pg. 12 (So which judgement is the true judgement?)
In a well-known study of item-order effects, Strack, Martin
and Schwarz (1988) found that life satisfaction showed a much stronger
association with dating satisfaction when it was asked second, and a smaller
correlation when the dating question came second. Pavot and Diener (1993a) were
able to replicate this effect, but it was nonsignificant and in the opposite
direction when the multi-item life satisfaction scale was used rather than a
single-item scale. The effect was also reversed when subjects had previously
conducted a systematic memory search of the most important areas of their lives
Pg. 13 (So which response is the ‘true’ response?)
The fact that item order effects are often small and appear
to be eliminated with a single buffer item, as well as when multi-item scales
of life satisfaction are used, suggests that these effects, when they do occur,
might be due to altering respondents’ interpretation of these questions. Pg. 14
People sometimes report greater SWB when interviewed in a
face-to-face survey than in an anonymous interview. Pg. 19
To parse these differences into artifactual response
tendencies that are unrelated to true life satisfaction versus real difference
in life satisfaction is challenging, in part because response tendencies in
part reflect differences in how people feel and think about the world and
themselves. Furthermore, cultural differences may in some cases be relevant to
policy and in some cases irrelevant. Pg. 22
To me, these all seem like catastrophic problems. There is
no way to assess what the true response is if the researcher can manipulate it through
priming and question ordering, and if it is contaminated with non-random measurement
error pertaining to the respondent’s mood e.g. arising out of whether they
missed the bus that morning. But I don’t need to make this point: the authors
admit it:
Error of measurement occurs in all fields, including
advanced sciences such as particle physics…In fields ranging from physics to
epidemiology to biochemistry to economics, measurement error is unavoidable.
Thus, scientists must work consistently to reduce measurement error, and also
must take such error into consideration in their conclusions. We cannot,
however, automatically dismiss findings or fields because of measurement error
because science would halt if we did so. Pg 23
Ummm… okay… So you’re saying that there’s measurement error
but we should keep going anyway. What if we just changed to a different
instrument? My view is that we should ditch losat scales and use other
instruments, not that we should abandon research into wellbeing. At the very
least, losat researchers need to investigate whether rescaling is occurring. I’ve
got various other empirical techniques that I think need a run, but I’ll leave
them for my PhD.