Vaughan Bell of Mind Hacks links to a forthcoming Perspectives on Psychological Science article by Edward Vul et al. that is sure to prove a “bombshell” for the field of cognitive neuroscience. Vul’s analysis demonstrates, in rigorous detail, how the too-good-to-be-true results of (mostly) headline studies are produced by complex statistical errors and biases.
Vul’s analysis begins with a lucid layman’s description of how fMRI scans can produce different kinds of images.
“The output of an fMRI experiment typically consists of two types of “3D pictures” (image volumes): “anatomical” (a high resolution scan that shows anatomical structure, not function) and “functional”. Functional image volumes are lower resolution scans showing measurements reflecting, among other things, the amount of deoxygenated hemoglobin in the blood – blood oxygenation level dependent (BOLD) signal. A functional image volume is composed of many measurements of the BOLD signal in small, roughly cube-shaped, regions called “voxels” (‘volumetric pixels’). The number of voxels in the whole image volume depends on the scanner settings, but it typically ranges between 10x64x64 and 30x128x128 voxels. Thus, each functional image contains somewhere between 40,000 and 500,000 voxels, with each of these voxels covering between 1 mm³ (1x1x1 mm) and 125 mm³ (5x5x5 mm) of brain tissue (except for voxels outside of the brain). A new functional image volume is usually acquired every 2 or 3 seconds (TR, or repetition time) during a scan, so one ends up with a timeseries of these functional images.” (6-7)
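To get a feel for the scale involved, the arithmetic in the passage above can be checked in a few lines. This is only a back-of-the-envelope sketch: the grid sizes come from the quote, but the 10-minute scan length and the 2-second TR are my own illustrative assumptions, and the function names are made up for the example.

```python
# Grid sizes quoted above, as (slices, rows, columns).
SMALL_GRID = (10, 64, 64)
LARGE_GRID = (30, 128, 128)


def voxel_count(shape):
    """Multiply the grid dimensions to get the number of voxels."""
    total = 1
    for dim in shape:
        total *= dim
    return total


def volumes_in_scan(scan_seconds, tr_seconds):
    """Number of whole functional volumes acquired in one scan."""
    return scan_seconds // tr_seconds


small = voxel_count(SMALL_GRID)   # 40,960 voxels
large = voxel_count(LARGE_GRID)   # 491,520 voxels

# Hypothetical 10-minute scan with a 2-second TR (not from the paper).
n_volumes = volumes_in_scan(600, 2)   # 300 volumes
measurements = small * n_volumes      # 12,288,000 numbers at the small end

print(small, large, n_volumes, measurements)
```

Even at the low-resolution end, a single short scan under these assumptions yields millions of individual measurements per subject, which is why the question of how voxels get selected matters so much.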
How this enormous set of data is to be interpreted is by no means self-evident, however. To establish a ‘base’ against which experimental results can be measured, a “number of average-brain models exist, the most famous being Talairach (Talairach & Tournoux, 1988) and MNI (Evans et al. 1993)” (7), but even then “fMRI researchers typically focus not on the activation in particular voxels during one task, but rather on a contrast between the activation arising when the person performs one task versus the activation arising when they do another” (7). The contrast in brain activity between ‘reading words’ and ‘looking at nonlinguistic patterns’, a commonly used comparison, is computed by first establishing, “separately for each voxel, the sequence of activation levels measured at that voxel”.
Once these basic steps have been completed, yielding “matrices consisting of tens or hundreds of thousands of numbers indicating activation levels” (7), qualitative summaries must still be obtained if correlations with behavioral measures are to be determined. An “investigator must somehow select a subset of voxels and aggregate measurements across them.” (7-8)
“This can be done in various ways. A subset of voxels in the whole brain image may be selected based on purely anatomical constraints (e.g., all voxels in a region generally agreed to represent the amygdala, or all voxels within a certain radius of some a priori specified brain coordinates). Alternatively, regions can be selected based on “functional constraints”: meaning voxels are selected based on their activity pattern in functional scans. For example, one could select all the voxels for a particular subject that responded more to reading than to non-linguistic stimuli. Finally, voxels could be chosen based on some combination of anatomy and functional response.” (7-8)
Whatever the case may be, “the critical question is: how was this set of voxels selected?” (8) Vul then wrote to each of the research teams in question, inquiring whether the “fMRI signal measure that was correlated across subjects with a behavioral measure represented the average of some number of voxels, or instead, the activity from just one voxel that was deemed most informative (referred to as the peak voxel).” (8) If they used an average of some number of voxels, how was that average calculated? Or, if a peak voxel was used, how was that one voxel determined to be most informative based on its activation? What was the measure of activation? “Was it the difference in activation between two task conditions computed on individual subjects, or was it a measure of how this task contrast correlated with the individual difference measure?” (8) It is telling in itself that the answers to these questions are not always included in the studies themselves, and that respected publications did not demand greater rigor.
Needless to say, a statistical threshold was determined in each case, below which results were discarded as ‘noise’. Of the papers reviewed by Vul, 54% employed the following logic to determine that threshold and establish a correlation between behavior and brain activity:
“(a) From each subject, the researchers obtain a behavioral measure as well as BOLD measures from many voxels. (b) The activity in each voxel is correlated with the behavioral measure of interest across subjects. (c) From this set of correlations, researchers select those voxels that pass a statistical threshold, and (d) aggregate the fMRI signal across those voxels to derive a final measure of the correlation of BOLD signal and the behavioral measure.” (10)
In what might be an important moment in cognitive neuroscience, Vul then explains, with some math that I won’t reproduce, how that particular statistical move is a fallacy that comes with dramatic consequences:
“What are the implications of selecting voxels in this fashion? Such an analysis will inflate observed across-subject correlations, and can even produce significant measures out of pure noise. The problem is illustrated in the simple simulation displayed in Figure 4: (a) investigator computes a separate correlation of the behavioral measure of interest with each of the voxels. Then, (b) those voxels that exhibited a sufficiently high correlation (passing a statistical threshold) are selected. Then an ostensible measure of the ‘true’ correlation is aggregated from the voxels that showed high correlations (e.g., by taking the mean of the voxels over the threshold). With enough voxels, such a biased analysis is guaranteed to produce high correlations even if none are truly present (Figure 4). Moreover, this analysis will produce visually pleasing scattergrams (e.g., Figure 4c) that will provide (quite meaningless) reassurance to the viewer that s/he is looking at a result that is solid, “not driven by outliers”, etc.” (11)
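The logic of that simulation is easy to reproduce. What follows is my own minimal sketch, not the code behind Vul's Figure 4: the number of subjects, the number of voxels, and the threshold are all assumed values chosen for illustration, and both the "behavioral measure" and the "BOLD signal" are pure random noise, so the true correlation is zero by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions, not values taken from the paper.
n_subjects = 16
n_voxels = 10_000
threshold = 0.6

# One behavioral score per subject, and pure-noise "BOLD" per voxel.
behavior = rng.standard_normal(n_subjects)
bold = rng.standard_normal((n_subjects, n_voxels))


def voxelwise_r(x, data):
    """Pearson correlation of x with every column of data."""
    xc = x - x.mean()
    dc = data - data.mean(axis=0)
    return (xc @ dc) / (np.linalg.norm(xc) * np.linalg.norm(dc, axis=0))


# (a)-(b): correlate every voxel with behavior across subjects.
r = voxelwise_r(behavior, bold)

# (c): keep only the voxels that pass the threshold.
selected = r > threshold
n_selected = int(selected.sum())

# (d): average the signal over those voxels and report its correlation.
mean_signal = bold[:, selected].mean(axis=1)
reported_r = np.corrcoef(behavior, mean_signal)[0, 1]

print(f"voxels passing threshold: {n_selected} of {n_voxels}")
print(f"reported correlation from pure noise: {reported_r:.2f}")
```

Because every selected voxel was admitted only for correlating above the threshold with this same behavioral vector, the correlation of their average cannot come out below that threshold. The "finding" is manufactured by the selection step itself.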
Using examples outside of cognitive neuroscience to illustrate his point, Vul goes on to show just how clearly fallacious the “nonindependence error” really is – as when, to take one of his examples, it is applied to the world of market trading and investment advising (the effects of which we are now all too familiar with). The selection of voxels is by definition circular, if subtly so. “This approach amounts to selecting one or more voxels based on a functional analysis, and then reporting the results of the same analysis and functional data from just the selected voxels. This analysis distorts the results by selecting noise exhibiting the effect being searched for” (12).
One can see just how far Vul’s critique could reach. For the half of the studies that reported an average across voxels, “the reported correlation coefficients mean almost nothing, because they are systematically inflated by the biased analysis” (13), a problem that is only “exacerbated in the case of the 38% of our respondents who reported the correlation of the peak voxel (the voxel with the highest observed correlation) rather than the average of all voxels in a cluster passing some threshold.” (13)
Vul’s criticisms do not, however, amount to a wholesale rejection of the field itself. Building on a 2007 study by Kross et al., Vul concludes by sketching out what he calls a ‘functional Region of Interest’ (fROI) method, which will not only provide “an unbiased measure of any relationships between evoked activity and individual differences” but will also avoid making “implausible assumptions about voxelwise correspondence across different individuals’ functional anatomy (Saxe, Brett, & Kanwisher, 2006).” (16-17)
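I won't try to reproduce the fROI procedure here, but the principle that makes a measure "unbiased" in this sense, namely that voxels are chosen on one set of data and measured on another, can be illustrated with a toy example. The sketch below is my own and rests on stated assumptions: two independent pure-noise "runs" per subject (`run_a` and `run_b` are hypothetical names), with subject count, voxel count, and threshold picked purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative assumptions, not values taken from the paper.
n_subjects, n_voxels, threshold = 16, 10_000, 0.6

behavior = rng.standard_normal(n_subjects)
# Two independent noise "runs" per subject: one for selecting voxels,
# one held back for measuring the correlation.
run_a = rng.standard_normal((n_subjects, n_voxels))
run_b = rng.standard_normal((n_subjects, n_voxels))


def voxelwise_r(x, data):
    """Pearson correlation of x with every column of data."""
    xc = x - x.mean()
    dc = data - data.mean(axis=0)
    return (xc @ dc) / (np.linalg.norm(xc) * np.linalg.norm(dc, axis=0))


# Select voxels using run A only.
selected = voxelwise_r(behavior, run_a) > threshold

# Nonindependent: measure on the same data used for selection.
biased_r = np.corrcoef(behavior, run_a[:, selected].mean(axis=1))[0, 1]

# Independent: measure the same voxels on the held-out run.
held_out_r = np.corrcoef(behavior, run_b[:, selected].mean(axis=1))[0, 1]

print(f"same-data correlation: {biased_r:.2f}")
print(f"held-out correlation:  {held_out_r:.2f}")
```

With nothing real to find, the held-out estimate is free to wander around zero the way any small-sample correlation does, while the same-data estimate is pinned above the threshold by construction. That asymmetry is the whole argument for keeping selection and measurement apart.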
In an intellectual climate where specific results can be programmatically guaranteed in advance, Vul’s perspective couldn’t be more refreshing. In place of unquestioned homologies, Vul maintains a rigorous, skeptical attitude. “Although it is possible for voxels registered to the ‘average brain’ to be functionally matched across subjects,” Vul and his team observe, “the variability in anatomical location of well-studied regions even in early visual cortex (V1, MT) and visual cognition (FFA) suggests to us that higher-level functions determining individual differences in personality and emotionality is not likely to be anatomically uniform across individuals (Saxe, Brett, & Kanwisher, 2006).” (17) Which is to say, even imaging based on anatomical, rather than functional, constraints assumes too much – namely, that the locations of cognitive functions in the brain, in the body, are uniform across all individuals.
Vul is everywhere attentive to the many confounding factors besetting the imaging process. Even when an fMRI test is administered correctly, easily overlooked conditions proper to the experiment itself can disturb its findings. All of these concerns must be taken into account, though few are in the studies reviewed.
“For instance, proneness to anxiety may lead people to breathe faster, drink more coffee, or make slightly different choices in which lipids they ingest. All of these are known to have effects on BOLD responses (Weckesser et al, 1999; Mulderink et al., 2002; Noseworthy et al, 2003), and those effects could easily interact slightly with the specific hemodynamic responses of different brain areas. Or perhaps anxious people are more afraid than others of failing to follow task instructions and attend ever so slightly more to the required auditory stream. The weaker the correlation, the greater the number of indirect and uninteresting causal chains that might be accounting for it, and the greater the chance that the effect itself will appear and disappear in different samples in a completely inscrutable fashion (e.g., if the dietary propensities of anxious people in England differ from those of anxious people in Japan).” (20-21)
It would be difficult not to read Vul’s paper as a profound and far-reaching critique of the state of cognitive neuroscience, in terms of both the studies it produces and the internal standards of scholarship by which they are reviewed. If “it is quite possible that a considerable number of relationships reported in this literature are entirely illusory” (22), this can only be the effect of a much deeper problem internal to the discipline itself. “Interestingly,” Vul et al. conclude, “we suspect that the problems brought to light here are ones that most editors and reviewers of studies using purely behavioral measures would usually be quite sensitive to.”
That being said, a more important question remains. If all those studies are indeed fundamentally flawed (and not simply ‘off’) – which is to say, if they lack scientific value – then their force and execution must have been, and must be, cultural and ‘predisposed’. Usually, when a given study or wave of scholarship is debunked or dealt a blow, its effects on the world thus far, not to mention the complex cultural reasons for its half-blind acceptance, disappear from consideration, if only because, in a scientific framework, questions of meaning and ideology have a way of being neutralized by questions of validity.
For instance, Vul, in a brief aside, picks apart the study by Takahashi et al. (2006) that purported to demonstrate how “Men and women show distinct brain activations during imagery of sexual and emotional infidelity”. Now, if this study is as fundamentally flawed as Vul indicates, the question should arise as to what motivated and created, out of thin air and through sophisticated means, a study that assumed the highest legitimacy afforded knowledge today. Hardly a week goes by without some new brain-based study purporting to vindicate the crudest of stereotypes – that women love shopping because they’re “gatherers”, that girls have different kinds of brains and need to be taught separately, that gay men and straight women read maps similarly. The list could go on and on.
The moment of debunking or reassessment should be a beginning, not an end; it is at precisely this point that social scientists, those best equipped to intervene, ought to step in to show how, in addition to a science, cognitive neuroscience can also be an apparatus, an ideology, and a conduit for far-ranging, deep-seated biases. Vul’s study shows this quite clearly, but it also shows just how promising rigorous scientific work can be.
References
Takahashi, H., Matsuura, M., Yahata, N., Koeda, M., Suhara, T., & Okubo, Y. (2006). Men and women show distinct brain activations during imagery of sexual and emotional infidelity. NeuroImage, 32, 1299-1307.
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2008). Voodoo Correlations in Social Neuroscience. Perspectives on Psychological Science, in press.
