Comment on Chow's "Issues in Statistical Inference"
Christopher D. Green
Published in the History & Philosophy of Psychology Bulletin, 14(1), 42-46.
(The missing reference to Chow (1994) has been included in this edition. -cdg-)
© 2002 by Christopher D. Green
Looking at Wilkinson's "Statistical
Methods in Psychology Journals: Guidelines and Explanations" (1999), it is
hard to believe that anyone could object very much. Indeed, as I read it, I was somewhat surprised
at the moderate tone adopted by Wilkinson's APA-appointed "Task Force on
Statistical Inference" and at the modesty of the recommendations
made. In many ways it reads more like a précis of a basic methodology and
statistics textbook than like a tendentious "advocacy document": Don't confuse truly random samples with
'convenience' samples, and report it clearly if you have used the latter. Be careful to distinguish between mere contrast
groups and true control groups. Report reliability and validity figures when using a psychometric
questionnaire. Pay attention to
possible biases introduced by experimenter expectancies, and report the steps
you took to attenuate them. Report the
process that led to your choosing the particular size of your sample, and
attend to the power of your test. Don't
rely exclusively on null hypothesis
tests, and do report the sizes of the significant effects you find. Examine plots of your data and try to include
graphical representations of them in your report, not just numerical statistics. Don't report statistics you don't understand,
and make sure the assumptions of the statistics you employ have been satisfied.
Causation can be a thorny issue; don't be over-awed by the output of
"causal modeling" software.
What is more, the contributors to the
report constitute a veritable "Who's Who" of statistical expertise in
the behavioral sciences: Jacob Cohen,
Robert Abelson, Robert Rosenthal, Lee Cronbach, Paul Meehl, and John Tukey, among others.
My initially sanguine response to the
Wilkinson report, however, was rattled by Chow's strenuous objections to nearly
every aspect of it: Random selection of samples is not necessary. Contrast
groups do not provide adequate experimental control. Experimenter effects are
chimerical. Effect sizes are primarily useful only for buttressing highly
questionable meta-analytic procedures. Null-hypothesis significance testing
remains the foundation of good statistical analysis. Power analysis is
irrelevant to significance testing. As a result, sample size recommendations
generated by power analysis are largely irrelevant as well. In short, Chow
rejects virtually all of the substantive recommendations made by the Task Force
appointed by the APA.
How are psychologists to sort out these
conflicting claims? How could informed
commentators disagree so strenuously on such fundamental aspects of so
well-worn a topic as basic statistical analysis of behavioral scientific data?
In part, I believe that the differences in some areas are not so
great as they at first appear. In part, I believe that Chow
has adopted a radical posture in order to counterbalance what he views as an
incautious assault on traditional statistical methods. In part, as well, I believe Chow has
unfortunately misconstrued the APA Task Force's intent on some points. I cannot comment on all the issues raised by
Chow in the space allotted me here, so I will focus on four areas of
disagreement between him and the Task Force: (1) the (un-)reality of
experimenter effects, (2) the distinction between contrast and control groups,
(3) the value of null-hypothesis significance testing, and (4) the
(un-)importance of power analysis and its sample size recommendations.
1. Experimenter Expectancy Effects
Chow's objection to the Task Force's
warning about experimenter expectancy effects (EEE) is based upon his own
research, published in the German Journal
of Educational Psychology (Chow, 1994).
Although it is somewhat difficult to make out the details of the study
from his necessarily abbreviated presentation in this issue of the Bulletin, Chow (1994) argued that
because the "experimenters" employed in the Rosenthal & Fode (1963) study of EEE did not compare at least two groups
of subjects, they were not experimenters, properly speaking, but mere
"data collectors." In Chow's
modified replication, where, by contrast, each experimenter ran more than one group of
subjects, no significant EEE was detected.
Chow (2002) argues, therefore, that there is no EEE for the Task
Force to warn researchers about. There
are a number of difficulties here.
First, although the phenomenon found by Rosenthal and Fode has come to be called the "experimenter
expectancy effect," the technical definition of "experimenter"
has little to do with the effect's reality. The point was that the expectancies
of "data-collectors" (whether authentic experimenters or not) can
have an adverse impact on the data actually collected. Chow found no such effect and attributed
these findings to his modification of the procedure, but, as we all know, one should
not assert the null merely on the basis of a failure to reject it (note, the
issue here is not whether the null can be true, but rather what conclusion
we should draw when we fail to reject it).
I cannot tell with certainty whether Chow (1996) accepts this form of
inference despite the "conventional wisdom." Be that as it may, most
people are inclined to put more faith in a multiply replicated effect than in a
single failure to replicate in any case.
More important, perhaps, is the question
of what Chow would have us do instead -- utterly ignore the possibility that
the experimenter is subtly passing his or her expectancies on to the subjects?
If so, Chow is, in effect, arguing against (half of) the standard
"double-blind" research design itself, and we have far too much
evidence in a wide array of disciplines that this procedure is a wise
precaution for us to abandon it on the basis of a single study that failed to
replicate a well-established result.
There is also a potential irony lurking
within Chow's study. If
Rosenthal is correct about the EEE, then it may be that Chow's own expectancy that there would be no
EEE biased his "experimenters," leading to the paradoxical dispelling
of the effect.
2. Contrast vs. Control Groups
I believe that Chow simply misconstrued
the intent of the Task Force on the matter of control and contrast groups. He complains that "the Task Force's
recommendation of replacing the control group by the contrast group is an invitation
to weaken the inductive principle that underlies experimental control"
(ms. p. 11), and if they had done so, this might well be true. But a careful
reading of the passage in question shows that this was not the Task Force's
intent at all:
If we can
neither implement randomization nor approach total control of variables that
modify effects (outcomes), then we
should use the term "control group" cautiously. In most of these
cases, it would be better to forgo the term and use "contrast group"
instead. (Wilkinson, 1999, 3rd para. of
"Nonrandom assignment," italics added)
The
aim of the Task Force was not to replace control groups with contrast groups,
but rather to stop researchers from using the
term "control group" when they have in fact used mere contrast
groups.
3. Null Hypothesis Testing
Chow has made something of a name for
himself over the past decade-and-a-half attempting to defend null-hypothesis
significance testing (NHST) against its legions of critics. His campaign reached its zenith with the
publication of Statistical Significance
(1996), and an open debate with many leaders in the field in Behavioral and Brain Sciences
(1998). After reading through much of
this material, I wonder whether more heat has been generated than light. More to the present point, I don't see that
the positions of Chow and the APA Task Force are, in reality, that far apart on
the issue (except that the Task Force sees NHST as being recklessly overused,
and Chow sees the attacks on it as being overkill). Let us consider two passages. Chow (1996) says:
The role of NHSTP[rocedure] is a very
limited, albeit important, one. However, to say that something is not due
to chance is really not saying very much at the theoretical level, particularly
when NHSTP says nothing about whether or not an explanatory hypothesis or an
empirical generalization receives empirical support. (1996, p. 65, italics
added)
Compare
this, then, to what the APA Task Force wrote:
Some had hoped that this task force would
vote to recommend an outright ban on the use of significance tests in
psychology journals. Although this might eliminate some abuses, the committee
thought that there were enough counterexamples (e.g., Abelson,
1997) to justify forbearance. (Wilkinson, 1999, 2nd para. of "Conclusions")
What
exactly is it we are debating again? Both sides seem to agree that NHST is
frequently misused and misinterpreted in the psychological literature as it now
stands. Both sides seem to agree that these abuses should be corrected
posthaste. Both sides seem to agree that there is a legitimate, if limited,
role for NHST. If only the participants in all sharply divided debates in
psychology agreed on so many fundamentals!
What are we going to do in those cases
where significance testing is not appropriate or not sufficient? Frankly, I can
see nothing at all wrong with the Task Force's recommendations that we use
effect size statistics and graphical representations of data (e.g., Geoff
Loftus' "plot-plus-error-bars").
Naturally, these have their limitations, as Chow is quick to point
out. But what doesn't? The goal is to get the most relevant
information to the reader in the most readily apprehensible
format. There is no reason I can see,
nor that Chow presents, that NHST should remain the "default" statistic,
or that effect sizes or graphical displays of data should be banned. As Gigerenzer
(1998, p. 199) has succinctly put it, "we need statistical thinking, not
statistical rituals."
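To make these two recommendations concrete, here is a minimal sketch, using invented data, of what such a report might include: an effect size (Cohen's d) alongside a Loftus-style plot of group means with error bars. It assumes a Python environment with numpy and matplotlib; the group labels and numbers are purely illustrative, not drawn from any study discussed here.

# A minimal sketch, with invented data, of the two recommendations: report an
# effect size (here Cohen's d) and show the data graphically in the spirit of
# Loftus' "plot-plus-error-bars".
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, size=30)   # hypothetical scores, group A
group_b = rng.normal(11.2, 2.0, size=30)   # hypothetical scores, group B

# Cohen's d: standardized mean difference using a pooled standard deviation.
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd
print(f"Cohen's d = {cohens_d:.2f}")

# Group means with standard-error bars: the reader sees the size and the
# precision of the difference at a glance, p value or no p value.
means = [group_a.mean(), group_b.mean()]
sems = [g.std(ddof=1) / np.sqrt(len(g)) for g in (group_a, group_b)]
plt.bar(["A", "B"], means, yerr=sems, capsize=6,
        color="lightgray", edgecolor="black")
plt.ylabel("Mean score (± 1 SE)")
plt.show()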
4. Power Analysis
It is here that Chow is at his most radical.1 Because,
he says, power analysis requires two distributions, whereas NHST requires only
one, the two are not pitched at the same "level of abstraction." The
implication appears to be that, not being at the same "level," they
cannot be directly compared, but Chow provides no argument that this is the
case. He only asks rhetorically if we should be "oblivious" to the
fact. Though I don't think much hangs on the use of the phrase "levels of
abstraction," my own inclination is to say that the problem is not that they are at different
"levels," but rather that the typical NHST graph (of the sampling
distribution under Ho) is incomplete, showing only half the
story. It is not that the power-analytic
graph (of the sampling distribution under Ho and under a predicted or desired form of H1) invokes a
new "level of abstraction" in any substantive sense, but rather that
it completes parts of the picture
left vacant by the NHST graph: just as it is possible for us to falsely reject
Ho, so it is possible for us to falsely fail to reject it. Is there any good logical or scientific reason
for developing an elaborate analysis of the probability of the one error but
not of the other? No. Chow worries that there are indefinitely many H1s
that might be used to generate power numbers. So be it. Contrary to his claim
that power analysis is "mechanical" (Chow, 1996, pp. 122-123), the
scientist must use discretion in selecting pertinent H1s to work
with. Now, if Chow thinks that the style of analysis popularized by Cohen is
suboptimal, he is, of course, at liberty to develop another. But to reject the
analysis and control of Type II error altogether while continuing to insist on
the careful analysis and control of Type I error is difficult to countenance.
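The "two distribution" picture is easy to make concrete. The following is a minimal sketch, not Chow's or Cohen's own procedure, of computing the power of a one-sample, two-tailed t test under one researcher-chosen form of H1 (an assumed standardized effect size d); it assumes a Python environment with scipy, and the function name and example values are mine.

# A sketch of the power-analytic picture: the sampling distribution under Ho
# and under one particular, assumed form of H1.
from scipy import stats

def power_one_sample_t(d, n, alpha=0.05):
    """Approximate power of a two-tailed, one-sample t test.

    d     : assumed standardized effect size under H1, (mu1 - mu0) / sigma
    n     : sample size
    alpha : Type I error rate (the probability NHST already controls)
    """
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # critical value under Ho
    nc = d * n ** 0.5                         # noncentrality parameter under H1
    # Power = P(|t| > t_crit) when the data actually come from H1,
    # i.e., 1 minus the probability of a Type II error.
    beta = stats.nct.cdf(t_crit, df, nc) - stats.nct.cdf(-t_crit, df, nc)
    return 1.0 - beta

# Example: a "medium" assumed effect (d = .5) has only modest power with
# n = 20 but is nearly certain to be detected with n = 80.
for n in (20, 80):
    print(n, round(power_one_sample_t(0.5, n), 2))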
In a related vein, Chow presents the
results of a simulation of significance testing under different sample sizes in
order to refute the oft-made claim that significance is dependent on sample
size. He demonstrates what I think most
people already know -- as one increases sample size most aberrantly large
sample differences will be diluted out. However, what I think most people
actually mean when they claim that significance is dependent on sample size is,
rather, that for a given difference
between sample means, a larger sample generates a smaller standard error,
making the obtained t larger. In
addition, the larger sample, via degrees of freedom, makes the critical t smaller. In this way, a difference
that is not significant under one sample size can be
rendered significant under a larger sample size. This is demonstrated in Table
1.
Table 1. 1-Sample, 2-Way t-Tests for a Range of Mean Differences (M-μ) and Degrees of Freedom

M-μ \ d.f. |    8  |   12  |   16  |   20  |   24  |   28  |   32  |   36  |  …  |   48
-----------|-------|-------|-------|-------|-------|-------|-------|-------|-----|------
 .1        |       |       |       |       |       |       |       |       |  …  | 0.69
 .2        |       |       |       |       |       | 1.058 | 1.13  | 1.20  |  …  | 1.39
 .3        |       |       |       | 1.34  | 1.47  | 1.58  | 1.70  | 1.79  |  …  | 2.08*
 .4        |       |       | 1.60  | 1.79  | 1.96  | 2.11* | 2.26* | 2.40* |  …  | 2.77*
 .5        |       | 1.73  | 2.00  | 2.24* | 2.45* | 2.65* | 2.83* | 3.00* |  …  |
 .6        |       | 2.08  | 2.40* | 2.68* | 2.94* |       |       |       |     |
 .7        | 1.98  | 2.42* | 2.80* |       |       |       |       |       |     |
 .8        | 2.26  | 2.77* |       |       |       |       |       |       |     |
 .9        | 2.55* |       |       |       |       |       |       |       |     |
1.0        | 2.83* |       |       |       |       |       |       |       |     |

(Values marked * are significant differences. s = 1.0 throughout.)
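The calculation behind Table 1 is easy to reproduce. The sketch below is my own illustration, not the author's original computation; it assumes a Python environment with scipy, computes the obtained t as (M-μ)·√(d.f.)/s, which matches the table's entries to within rounding, and pairs it with the two-tailed .05 critical value to show a fixed difference crossing the significance threshold as the sample grows.

# Illustration of the point Table 1 makes: for a fixed mean difference, the
# obtained t grows with sample size while the critical t shrinks, so the same
# difference can move from non-significant to significant.
from scipy import stats

def obtained_and_critical_t(mean_diff, df, s=1.0, alpha=0.05):
    """Obtained and critical t for a one-sample, two-tailed test.

    Following the table, the obtained t is taken as (M - mu) * sqrt(d.f.) / s.
    """
    t_obtained = mean_diff * df ** 0.5 / s
    t_critical = stats.t.ppf(1 - alpha / 2, df)
    return t_obtained, t_critical

# A fixed difference of .4 (s = 1.0), as in the .4 row of Table 1: it first
# exceeds the two-tailed .05 critical value at d.f. = 28.
for df in (16, 20, 24, 28, 32, 36, 48):
    t_obt, t_crit = obtained_and_critical_t(0.4, df)
    print(df, round(t_obt, 2), round(t_crit, 2), t_obt > t_crit)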
To reiterate, however, the much more
important problem is not that of small effects coming up significant, but
rather of important ones (their size notwithstanding) coming up non-significant
because the samples used were too small. If we were to drop Cohen-style power
analysis, what would Chow have us do instead? Ignore Type II errors as we have
for decades, dooming more generations of researchers to committing more decades
of effort and more hundreds of thousands of dollars to conducting more or less
pointless research because the sample sizes are such that they have little
chance from the outset of detecting phenomena that are there to be found? Whatever flaws there might be in Cohen's
analysis of power, his work has done more to alert us to the fact that our
samples should be much larger than they typically are, if we are to regard
ourselves as serious scientists, than anything else that has been said or done
in the three-quarters of a century since Fisher first published Statistical Methods for Research Workers
(1925). Until something better comes along, we should continue to use the best
form of analysis available to us, which is power analysis.
In conclusion, I believe that Wilkinson's
report, though more tentative than it might have been, is a reasoned and
valuable contribution to psychological science. For those who are quite
familiar with the details of statistical methods, it confirms much of what has
been happening in the literature over the past few decades. For those who have
not been keeping abreast of new developments on the statistical scene, it
alerts them in a gentle way that there have been some important changes since
they earned their degrees, and that they should probably read up on these
advances before embarking upon their next research program or teaching their
next statistics course.
Footnote
1. I leave out the question of power being a conditional probability because I
think it is clear that Chow has misinterpreted the sentence of Cohen's that he
quotes (ms. p. 16). Of course Cohen
knew that power is the probability of rejecting Ho given that (a particular form of) H1
is true. If Cohen failed to mention the conditional aspect of this probability
in the passage Chow cites, it is only because it was so obvious to Cohen and
nearly every other power analyst. This was pointed out repeatedly by commentators
on Chow (1998).
References
Abelson, R. P. (1997). On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science, 8, 12-15.
Chow, S. L. (1994). The experimenter's expectancy effect: A meta-experiment. German Journal of Educational Psychology, 8, 89-97.
Chow, S. L. (1996). Statistical significance: Rationale, validity and utility. London: Sage.
Chow, S. L. (1998). Précis of Statistical significance: Rationale, validity and utility [and the ensuing commentary]. Behavioral and Brain Sciences, 21, 169-239.
Chow, S. L. (2002). Issues in statistical inference. History and Philosophy of Psychology Bulletin, vol#, pp.-pp.
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.
Gigerenzer, G. (1998). We need statistical thinking, not statistical rituals. Behavioral and Brain Sciences, 21, 199-200.
Rosenthal, R. (1969). Interpersonal expectation. In R. Rosenthal & R. L. Rosnow (Eds.), Artifacts in behavioral research (pp. 181-277). New York: Academic Press.
Rosenthal, R., & Fode, K. L. (1963). Psychology of the scientist: V. Three experiments in experimenter bias. Psychological Reports, 12, 491-511.