Source: NYMag, Jan 2017
Almost two decades after its introduction, the implicit association test has failed to deliver on its lofty promises.
… which purports to offer a quick, easy way to measure how implicitly biased individual people are.
Unfortunately, none of that is true.
A pile of scholarly work, some of it published in top psychology journals and most of it ignored by the media, suggests that the IAT falls far short of the quality-control standards normally expected of psychological instruments. The IAT, this research suggests, is a noisy, unreliable measure that correlates far too weakly with any real-world outcomes to be used to predict individuals’ behavior. Even the test’s creators have now admitted as much. The history of the test suggests it was released to the public and excitedly publicized long before it had been fully validated in the rigorous, careful way normally demanded by the field of psychology.
There’s an entire field of psychology, psychometrics, dedicated to the creation and validation of psychological instruments, and instruments are judged based on whether they exceed certain broadly agreed-upon statistical benchmarks.
The most important benchmarks pertain to a test’s reliability — that is, the extent to which the test has a reasonably low amount of measurement error (every test has some) — and to its validity, or the extent to which it is measuring what it claims to be measuring. A good psychological instrument needs both.
Take the concept of test-retest reliability, which measures the extent to which a given instrument will produce similar results if you take it, wait a bit, and then take it again. Different instruments have different test-retest reliabilities.
Test-retest reliability is expressed with a correlation coefficient known as r, which in this context is read on a scale from 0 to 1. To gloss over some of the gory statistical details, r = 1 means that if a given test is administered multiple times to the same group of people, it will rank them in exactly the same order every time.
Hypothetically, if the IAT had a test-retest reliability of r = 1, and you administered the test to ten people over and over and over, they’d be placed in the same order, least to most implicitly biased, every time. At the other end of the spectrum, when r = 0, that means the ranking shifts every time the test is administered, completely at random. The person ranked most biased after the first test would, after the second test, be equally likely to appear in any of the ten available slots. Overall, the closer you get to r = 0, the closer the instrument in question is to, in effect, a random-number generator rather than a remotely useful means of measuring whatever it is you’re trying to measure.
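To make the ranking intuition concrete, here is a minimal simulation sketch. It is not drawn from any IAT study. It assumes a simple textbook model in which each observed score is a stable "true" score plus fresh random noise, and the function names (simulate_two_sessions, pearson_r, ranking) are hypothetical helpers invented for this illustration.

```python
import math
import random


def simulate_two_sessions(reliability, n_people, rng):
    """Simulate two administrations of a test with the given reliability.

    Assumed model (not from the article): each observed score is a stable
    "true" score plus fresh random noise, weighted so that the expected
    correlation between the two sessions equals `reliability`.
    """
    signal = math.sqrt(reliability)
    noise = math.sqrt(1.0 - reliability)
    first, second = [], []
    for _ in range(n_people):
        true_score = rng.gauss(0.0, 1.0)
        first.append(signal * true_score + noise * rng.gauss(0.0, 1.0))
        second.append(signal * true_score + noise * rng.gauss(0.0, 1.0))
    return first, second


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranking(scores):
    """Indices of the test-takers, ordered from lowest to highest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Ten hypothetical test-takers, ranked after each of two sessions.
rng = random.Random(0)
for reliability in (1.0, 0.55, 0.0):
    first, second = simulate_two_sessions(reliability, 10, rng)
    print(f"r = {reliability}: session 1 order {ranking(first)}, "
          f"session 2 order {ranking(second)}")
```

At r = 1 the two orderings match exactly, because the noise term drops out. At lower values the second ordering increasingly scrambles the first.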
What constitutes an acceptable level of test-retest reliability? It depends a lot on context, but, generally speaking, researchers are comfortable if a given instrument hits r = .8 or so. The IAT’s architects have reported that overall, when you lump together the IAT’s many different varieties, from race to disability to gender, it has a test-retest reliability of about r = .55. By the normal standards of psychology, this puts these IATs well below the threshold of being useful in most practical, real-world settings.
The individual results that have been published, though, suggest the race IAT’s test-retest reliability is far too low for it to be safe to use in real-world settings. In a 2007 chapter on the IAT, for example, Kristin Lane, Banaji, Nosek, and Greenwald included a table (Table 3.2) running down the test-retest reliabilities for the race IAT that had been published to that point: r = .32 in a study consisting of four race IAT sessions conducted with two weeks between each; r = .65 in a study in which two tests were conducted 24 hours apart; and r = .39 in a study in which the two tests were conducted during the same session (but in which one used names and the other used pictures).
What all these numbers mean is that there doesn’t appear to be any published evidence that the race IAT has test-retest reliability that is close to acceptable for real-world evaluation. If you take the test today, and then take it again tomorrow — or even in just a few hours — there’s a solid chance you’ll get a very different result. That’s extremely problematic given that in the wild, whether on Project Implicit or in diversity-training sessions, test-takers are administered the test once, given their results, and then told what those results say about them and their propensity to commit biased acts.
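One rough way to put a number on "a solid chance" is sketched below. It rests on assumptions the article does not make: that scores from two sittings follow a bivariate normal distribution with correlation r, and that each person is simply labeled "above" or "below" the group median. Under those assumptions, a standard result for the bivariate normal gives the chance that a person switches sides as 1/2 - arcsin(r)/π. The function name flip_probability is hypothetical, and the median split is only a stand-in for whatever categories a real feedback screen uses.

```python
import math


def flip_probability(r):
    """Chance that a person lands on opposite sides of the median on two sittings.

    Assumption for illustration only: the two scores are bivariate normal
    with correlation r, so P(opposite sides) = 1/2 - arcsin(r) / pi.
    """
    return 0.5 - math.asin(r) / math.pi


# The race IAT test-retest values quoted above from Table 3.2.
for label, r in [("two weeks apart", 0.32),
                 ("24 hours apart", 0.65),
                 ("same session", 0.39)]:
    print(f"{label}: r = {r:.2f}, chance of switching sides = {flip_probability(r):.2f}")
```

Under these assumptions, a reliability of r = .32 implies that roughly four in ten test-takers would land on the other side of the median on a second sitting. Even at r = .65 the figure is a bit over one in four.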
In statistical terms, the architects of the IAT claimed, for a long time, that there is a meaningful correlation between two variables: Someone’s IAT score (call it x) and how implicitly biased they act in intergroup settings (call it y). Generally speaking, researchers measure the extent to which two variables are correlated by examining how much of the variation in one variable, y, is explained by changes in the other, x. The more two variables are correlated in this manner, the more meaningful a connection might exist between them.
When you use meta-analyses to examine the question of whether IAT scores predict discriminatory behavior accurately enough for the test to be useful in real-world settings, the answer is: No. Race IAT scores are weak predictors of discriminatory behavior.
The most IAT-friendly numbers, published in a 2009 meta-analysis lead-authored by Greenwald, which found fairly unimpressive correlations (race IAT scores accounted for about 5.5 percent of the variation in discriminatory behavior in lab settings, and other intergroup IAT scores accounted for about 4 percent of the variance in discriminatory behavior in lab settings), were based on some fairly questionable methodological decisions on the part of the authors.
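For readers who want to translate "percent of variation explained" back into a correlation, the arithmetic is short. The sketch below assumes the percentages quoted above are squared Pearson correlations (r²), which is the usual meaning of the phrase but is an assumption here. The function names are hypothetical.

```python
import math


def r_from_variance_explained(share):
    """Convert a share of variance explained (0 to 1) into the implied |r|.

    Assumes the share is a squared Pearson correlation, r**2.
    """
    return math.sqrt(share)


def variance_explained(r):
    """Share of variance in y accounted for by x at correlation r."""
    return r * r


# Shares quoted above from the 2009 meta-analysis.
for label, share in [("race IAT", 0.055), ("other intergroup IATs", 0.04)]:
    implied_r = r_from_variance_explained(share)
    print(f"{label}: {share:.1%} of variance explained implies r of about {implied_r:.2f}")
```

Read the other way, a correlation has to reach about r = .71 before one variable accounts for even half of the variation in the other.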
The second, more important point to emerge from this years-long meta-analytic melee is that both critics and proponents of the IAT now agree that the statistical evidence is simply too lacking for the test to be used to predict individual behavior.
The psychometric issues with race and ethnicity IATs, Greenwald, Banaji, and Nosek wrote in one of their responses to the Oswald team’s work, “render them problematic to use to classify persons as likely to engage in discrimination.” In that same paper, they noted that “attempts to diagnostically use such measures for individuals risk undesirably high rates of erroneous classifications.” In other words: You can’t use the IAT to tell individuals how likely they are to commit acts of implicit bias.
The point is that the key experts involved in IAT research no longer claim that the IAT can be used to predict individual behavior. In this sense, the IAT has simply failed to deliver on a promise it has been making since its inception — that it can reveal otherwise hidden propensities to commit acts of racial bias. There’s no evidence it can.
The scientific truth is that we don’t know exactly how big a role implicit bias plays in reinforcing the racial hierarchy, relative to countless other factors. We do know that after almost 20 years and millions of dollars’ worth of IAT research, the test has a markedly unimpressive track record relative to the attention and acclaim it has garnered. Leading IAT researchers haven’t produced interventions that can reduce racism or blunt its impact. They haven’t told a clear, credible story of how implicit bias, as measured by the IAT, affects the real world. They have flip-flopped on important, baseline questions about what their test is or isn’t measuring.
And because the IAT and the study of implicit bias have become so tightly coupled, the test’s weaknesses have caused collateral damage to public and academic understanding of the broader concept itself. As Mitchell and Tetlock argue in their book chapter, it is “difficult to find a psychological construct that is so popular yet so misunderstood and lacking in theoretical and practical payoff” as implicit bias. They make a strong case that this is in large part due to problems with the IAT.