Comment on Chow's "Issues in Statistical Inference"

 

Christopher D. Green

York University


Published in the History & Philosophy of Psychology Bulletin, 14 (1), 42-46.
(The missing reference to Chow (1994) has been included in this edition. -cdg-)

© 2002 by Christopher D. Green


Looking at Wilkinson's "Statistical Methods in Psychology Journals: Guidelines and Explanations" (1999), it is hard to believe that anyone could object very much.  Indeed, as I read it, I was somewhat surprised at the moderate tone adopted by Wilkinson's APA-appointed "Task Force on Statistical Inference" and at the modesty of the recommendations made.  In many ways it reads more like a précis of a basic methodology and statistics textbook than like a tendentious "advocacy document":  Don't confuse truly random samples with 'convenience' samples, and report it clearly if you have used the latter.  Be careful to distinguish between mere contrast groups and true control groups.  Report reliability and validity figures when using a psychometric questionnaire.  Pay attention to possible biases introduced by experimenter expectancies, and report the steps you took to attenuate them.  Report the process that led to your choosing the particular size of your sample, and attend to the power of your test.  Don't rely exclusively on null hypothesis tests, and do report the sizes of the significant effects you find.  Examine plots of your data and try to include graphical representations of them in your report, not just numerical statistics.  Don't report statistics you don't understand, and make sure the assumptions of the statistics you employ have been satisfied. Causation can be a thorny issue; don't be over-awed by the output of "causal modeling" software.

What is more, the contributors to the report constitute a veritable "Who's Who" of statistical expertise in the behavioral sciences:  Jacob Cohen, Robert Abelson, Robert Rosenthal, Lee Cronbach, Paul Meehl, and John Tukey, among others.

My initially sanguine response to the Wilkinson report, however, was rattled by Chow's strenuous objections to nearly every aspect of it: Random selection of samples is not necessary. Contrast groups do not provide adequate experimental control. Experimenter effects are chimerical. Effect sizes are primarily useful only for buttressing highly questionable meta-analytic procedures. Null-hypothesis significance testing remains the foundation of good statistical analysis. Power analysis is irrelevant to significance testing. As a result, sample size recommendations generated by power analysis are largely irrelevant as well. In short, Chow rejects virtually all of the substantive recommendations made by the Task Force appointed by the APA.

How are psychologists to sort out these conflicting claims?  How could informed commentators disagree so strenuously on such fundamental aspects of so well-worn a topic as basic statistical analysis of behavioral scientific data? In part, I believe that the differences in some areas are not so great as they at first appear. In part, I believe that Chow has adopted a radical posture in order to counterbalance what he views as an incautious assault on traditional statistical methods.  In part, as well, I believe Chow has unfortunately misconstrued the APA Task Force's intent on some points.  I cannot comment on all the issues raised by Chow in the space allotted me here, so I will focus on four areas of disagreement between him and the Task Force: (1) the (un-)reality of experimenter effects, (2) the distinction between contrast and control groups, (3) the value of null-hypothesis significance testing, and (4) the (un-)importance of power analysis and its sample size recommendations.

 

1. Experimenter Expectancy Effects

Chow's objection to the Task Force's warning about experimenter expectancy effects (EEE) is based upon his own research, published in the German Journal of Educational Psychology (Chow, 1994).  Although it is somewhat difficult to make out the details of the study from his necessarily abbreviated presentation in this issue of the Bulletin, Chow (1994) argued that because the "experimenters" employed in the Rosenthal & Fode (1963) study of EEE did not compare at least two groups of subjects, they were not experimenters, properly speaking, but mere "data collectors."  In Chow's modified replication, where, by contrast, each experimenter ran more than one group of subjects, no significant EEE was detected.  Chow (2002) argues, therefore, that there is no EEE for the Task Force to warn researchers about.  There are a number of difficulties here.  First, although the phenomenon found by Rosenthal and Fode has come to be called the "experimenter expectancy effect," the technical definition of "experimenter" has little to do with the effect's reality. The point was that the expectancies of "data-collectors" (whether authentic experimenters or not) can have an adverse impact on the data actually collected.  Chow found no such effect and attributed these findings to his modification of the procedure, but, as we all know, one should not assert the null merely on the basis of a failure to reject it (note, the issue here is not whether the null can be true, but rather what conclusion we should draw when we fail to reject it).  I cannot tell with certainty whether Chow (1996) accepts this form of inference despite the "conventional wisdom." Be that as it may, most people are inclined to put more faith in a multiply replicated effect than in a single failure to replicate.

More important, perhaps, is the question of what Chow would have us do instead -- utterly ignore the possibility that the experimenter is subtly passing his or her expectancies on to his or her subjects? If so, Chow is, in effect, arguing against (half of) the standard "double-blind" research design itself, and we have far too much evidence in a wide array of disciplines that this procedure is a wise precaution for us to abandon it on the basis of a single study that failed to replicate a well-established result.

There is also a potential irony lurking within Chow's study.  If Rosenthal is correct about the EEE, then it may be that Chow's own expectancy that there would be no EEE biased his "experimenters," leading to the paradoxical dispelling of the effect.

 

2. Contrast vs. Control Groups

I believe that Chow simply misconstrued the intent of the Task Force on the matter of control and contrast groups.  He complains that "the Task Force's recommendation of replacing the control group by the contrast group is an invitation to weaken the inductive principle that underlies experimental control" (ms. p. 11), and if that had indeed been the Task Force's recommendation, the complaint might well be justified. But a careful reading of the passage in question shows that this was not the Task Force's intent at all:

If we can neither implement randomization nor approach total control of variables that modify effects (outcomes), then we should use the term "control group" cautiously. In most of these cases, it would be better to forgo the term and use "contrast group" instead. (Wilkinson, 1999, 3rd para. of "Nonrandom assignment," italics added)

The aim of the Task Force was not to replace control groups with contrast groups, but rather to stop researchers from using the term "control group" when they have in fact used mere contrast groups.

 

3. Null Hypothesis Testing

Chow has made something of a name for himself over the past decade-and-a-half attempting to defend null-hypothesis significance testing (NHST) against its legions of critics.  His campaign reached its zenith with the publication of Statistical Significance (1996), and an open debate with many leaders in the field in Behavioral and Brain Sciences (1998).  After reading through much of this material, I wonder whether more heat has been generated than light.  More to the present point, I don't see that the positions of Chow and the APA Task Force are, in reality, that far apart on the issue (except that the Task Force sees NHST as being recklessly overused, and Chow sees the attacks on it as being overkill).  Let us consider two passages.  Chow (1996) says:

The role of NHSTP[rocedure] is a very limited, albeit important, one. However, to say that something is not due to chance is really not saying very much at the theoretical level, particularly when NHSTP says nothing about whether or not an explanatory hypothesis or an empirical generalization receives empirical support. (1996, p. 65, italics added)

Compare this, then, to what the APA Task Force wrote:

Some had hoped that this task force would vote to recommend an outright ban on the use of significance tests in psychology journals. Although this might eliminate some abuses, the committee thought that there were enough counterexamples (e.g., Abelson, 1997) to justify forbearance. (Wilkinson, 1999, 2nd para. of "Conclusions")

What exactly is it we are debating again? Both sides seem to agree that NHST is frequently misused and misinterpreted in the psychological literature as it now stands. Both sides seem to agree that these abuses should be corrected posthaste. Both sides seem to agree that there is a legitimate, if limited, role for NHST. If only the participants in all sharply divided debates in psychology agreed on so many fundamentals!

What are we going to do in those cases where significance testing is not appropriate or not sufficient? Frankly, I can see nothing at all wrong with the Task Force's recommendations that we use effect size statistics and graphical representations of data (e.g., Geoff Loftus' "plot-plus-error-bars").  Naturally, these have their limitations, as Chow is quick to point out.  But what doesn't?  The goal is to get the most relevant information to the reader in the most easily apprehended format.  There is no reason I can see, nor that Chow presents, that NHST should remain the "default" statistic, or that effect sizes or graphical displays of data should be banned.  As Gigerenzer (1998, p. 199) has succinctly put it, "we need statistical thinking, not statistical rituals."
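
For readers who want a concrete picture of what such reporting can look like, here is a minimal sketch in Python (the data are simulated and every number in it is an illustrative assumption of mine; the figure is only a rough stand-in for Loftus's proposal, not his exact procedure). It computes Cohen's d for two groups and plots the group means with standard-error bars:

    # A minimal illustration of reporting an effect size and a
    # plot-plus-error-bars-style figure alongside (or instead of) a p value.
    # The data are simulated; all numbers are illustrative assumptions.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    group_a = rng.normal(loc=100, scale=15, size=30)   # simulated scores
    group_b = rng.normal(loc=108, scale=15, size=30)

    # Cohen's d, using a pooled standard deviation
    pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
    d = (group_b.mean() - group_a.mean()) / pooled_sd
    print(f"Cohen's d = {d:.2f}")

    # Group means with +/- 1 standard error bars
    means = [group_a.mean(), group_b.mean()]
    sems = [g.std(ddof=1) / np.sqrt(len(g)) for g in (group_a, group_b)]
    plt.bar(["Group A", "Group B"], means, yerr=sems, capsize=6)
    plt.ylabel("Mean score (+/- 1 SE)")
    plt.show()

The particular library is beside the point; what matters is that the reader can see both the magnitude of the difference and its precision at a glance.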

 

4. Power Analysis

It is here that Chow is at his most radical.1  Because, he says, power analysis requires two distributions, whereas NHST requires only one, the two are not pitched at the same "level of abstraction." The implication appears to be that, not being at the same "level," they cannot be directly compared, but Chow provides no argument that this is the case. He only asks rhetorically if we should be "oblivious" to the fact. Though I don't think much hangs on the use of the phrase "levels of abstraction," my own inclination is to say that the problem is not that they are at different "levels," but rather that the typical NHST graph (of the sampling distribution under Ho) is incomplete, showing only half the story.  It is not that the power-analytic graph (of the sampling distribution under Ho and under a predicted or desired form of H1) invokes a new "level of abstraction" in any substantive sense, but rather that it completes parts of the picture left vacant by the NHST graph: just as it is possible for us to falsely reject Ho, so it is possible for us to falsely fail to reject it. Is there any good logical or scientific reason for developing an elaborate analysis of the probability of the one error but not of the other? No. Chow worries that there are indefinitely many H1s that might be used to generate power numbers. So be it. Contrary to his claim that power analysis is "mechanical" (Chow, 1996, 122-123), the scientist must use discretion in selecting pertinent H1s to work with. Now, if Chow thinks that the style of analysis popularized by Cohen is suboptimal, he is, of course, at liberty to develop another. But to reject the analysis and control of Type II error altogether while continuing to insist on the careful analysis and control of Type I error is difficult to countenance.
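
To make the "two distributions" point concrete, here is a minimal sketch in Python using scipy; the effect size, sample size, and alpha level are illustrative assumptions of mine, not values drawn from Chow or the Task Force. The central t distribution under Ho supplies the critical value (and hence the Type I error rate), while a noncentral t distribution under a particular H1 supplies the Type II error rate and the power:

    # Sketch: the Ho (central t) distribution fixes the Type I error rate;
    # a chosen H1 (noncentral t) distribution fixes the Type II error rate.
    # Effect size d, sample size n, and alpha are illustrative assumptions.
    import numpy as np
    from scipy import stats

    n = 20          # sample size (assumed)
    d = 0.5         # standardized effect under the chosen H1 (assumed)
    alpha = 0.05    # two-tailed Type I error rate
    df = n - 1

    # Critical value from the sampling distribution under Ho
    t_crit = stats.t.ppf(1 - alpha / 2, df)

    # Under this H1, the one-sample t statistic follows a noncentral t distribution
    nc = d * np.sqrt(n)   # noncentrality parameter
    beta = stats.nct.cdf(t_crit, df, nc) - stats.nct.cdf(-t_crit, df, nc)

    print(f"critical t = {t_crit:.2f}")
    print(f"Type II error (beta) = {beta:.2f}; power = {1 - beta:.2f}")

Changing the assumed H1 changes beta but leaves the Type I analysis untouched, which is precisely why selecting pertinent H1s calls for the scientist's discretion rather than a mechanical rule.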

In a related vein, Chow presents the results of a simulation of significance testing under different sample sizes in order to refute the oft-made claim that significance is dependent on sample size.  He demonstrates what I think most people already know -- as one increases sample size, most aberrantly large sample differences will be diluted out. However, what I think most people actually mean when they claim that significance is dependent on sample size is, rather, that for a given difference between sample means, a larger sample generates a smaller standard error, making the obtained t larger. In addition, the larger sample, via its degrees of freedom, makes the critical t smaller. In this way, a difference that is not significant under one sample size can be rendered significant under a larger sample size. This is demonstrated in Table 1, and in the short computational sketch that follows it.

 

Table 1. 1-Sample, 2-Way t-Tests for a Range of Differences and Degrees of Freedom

    M   d.f.=8  d.f.=12  d.f.=16  d.f.=20  d.f.=24  d.f.=28  d.f.=32  d.f.=36  d.f.=48

   .1       --       --       --       --       --       --       --       --     0.69
   .2       --       --       --       --       --    1.058     1.13     1.20     1.39
   .3       --       --       --     1.34     1.47     1.58     1.70     1.79    2.08*
   .4       --       --     1.60     1.79     1.96    2.11*    2.26*    2.40*    2.77*
   .5       --     1.73     2.00    2.24*    2.45*    2.65*    2.83*    3.00*       --
   .6       --     2.08    2.40*    2.68*    2.94*       --       --       --       --
   .7     1.98    2.42*    2.80*       --       --       --       --       --       --
   .8     2.26    2.77*       --       --       --       --       --       --       --
   .9    2.55*       --       --       --       --       --       --       --       --
  1.0    2.83*       --       --       --       --       --       --       --       --

(M = mean difference; an asterisk marks a significant difference; a dash marks a combination for which no value is given; s = 1.0 throughout.)
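
The pattern in Table 1 is easy to reproduce. The short sketch below (Python with scipy; the particular sample sizes and the fixed mean difference are illustrative choices of mine, and its rounding may not match the table exactly) shows a single, fixed difference crossing the significance threshold purely because n grows:

    # With s = 1, the obtained one-sample t is M * sqrt(n), which grows with n,
    # while the two-tailed .05 critical t shrinks as the degrees of freedom rise.
    # The values of M and n below are illustrative choices only.
    import numpy as np
    from scipy import stats

    M, s = 0.4, 1.0                       # fixed difference from the null value
    for n in (9, 17, 25, 33, 49):         # assumed sample sizes
        df = n - 1
        t_obt = M * np.sqrt(n) / s
        t_crit = stats.t.ppf(0.975, df)   # two-tailed .05 critical value
        verdict = "significant" if t_obt > t_crit else "not significant"
        print(f"n = {n:2d}: t = {t_obt:.2f}, critical t = {t_crit:.2f} -> {verdict}")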

 

To reiterate, however, the much more important problem is not that of small effects coming up significant, but rather of important ones (their size notwithstanding) coming up non-significant because the samples used were too small. If we were to drop Cohen-style power analysis, what would Chow have us do instead? Ignore Type II errors as we have for decades, dooming more generations of researchers to committing more decades of effort and more hundreds of thousands of dollars to conducting more or less pointless research because the sample sizes give them little chance from the outset of detecting phenomena that are there to be found?  Whatever flaws there might be in Cohen's analysis of power, his work has done more than anything else said or done in the three-quarters of a century since Fisher first published Statistical Methods for Research Workers (1925) to alert us that, if we are to regard ourselves as serious scientists, our samples should be much larger than they typically are. Until something better comes along, we should continue to use the best form of analysis available to us, which is power analysis.
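
To give the flavor of the sample-size question Cohen's work raises, here is a minimal sketch (Python with scipy; the two-group design, the target effect size, and the target power are illustrative assumptions of mine, not a reconstruction of Cohen's tables) that simply searches for the per-group n needed to reach a chosen level of power:

    # Roughly how many subjects per group does a two-sample t test need in order
    # to have a decent chance of detecting a given effect?  The effect size and
    # target power below are illustrative assumptions.
    import numpy as np
    from scipy import stats

    def power_two_group(d, n_per_group, alpha=0.05):
        """Approximate power of a two-sample t test, via the noncentral t."""
        df = 2 * n_per_group - 2
        nc = d * np.sqrt(n_per_group / 2)        # noncentrality parameter
        t_crit = stats.t.ppf(1 - alpha / 2, df)
        beta = stats.nct.cdf(t_crit, df, nc) - stats.nct.cdf(-t_crit, df, nc)
        return 1 - beta

    d_target, power_target = 0.5, 0.80           # assumed "medium" effect, 80% power
    n = 3
    while power_two_group(d_target, n) < power_target:
        n += 1
    print(f"About n = {n} per group is needed for {power_target:.0%} power at d = {d_target}")

Samples of that size are considerably larger than much of what appears in the literature, which is just the point Cohen kept pressing.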

In conclusion, I believe that Wilkinson's report, though more tentative than it might have been, is a reasoned and valuable contribution to psychological science. For those who are quite familiar with the details of statistical methods, it confirms much of what has been happening in the literature over the past few decades. For those who have not been keeping abreast of new developments on the statistical scene, it alerts them in a gentle way that there have been some important changes since they earned their degrees, and that they should probably read up on these advances before embarking upon their next research program or teaching their next statistics course. 


 

Footnote

1. I leave out the question of power being a conditional probability because I think it is clear that Chow has misinterpreted the sentence of Cohen's that he quotes (ms. p. 16). Of course Cohen knew that power is the probability of rejecting Ho given that (a particular form of) H1 is true. If Cohen failed to mention the conditional aspect of this probability in the passage Chow cites, it is only because it was so obvious to Cohen and nearly every other power analyst. This was pointed out repeatedly by commentators on Chow (1998).


 

References

Abelson, R. P. (1997). On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science, 8, 12-15.

Chow, S. L. (1994). The experimenter's expectancy effect: A meta-experiment. German Journal of Educational Psychology, 8, 89-97.

Chow, S. L. (1996). Statistical significance: Rationale, validity and utility. London: Sage.

Chow, S. L. (1998). Précis of Statistical significance: Rationale, validity and utility [and the ensuing commentary]. Behavioral and Brain Sciences, 21, 169-239.

Chow, S. L. (2002). Issues in statistical inference. History and Philosophy of Psychology Bulletin, vol#, pp.-pp.

Fisher, R. A. (1925). Statistical methods for research workers. London: Oliver and Boyd.

Gigerenzer, G. (1998). We need statistical thinking, not statistical rituals. Behavioral and Brain Sciences, 21, 199-200.

Rosenthal, R., & Fode, K. L. (1963). Psychology of the scientist: V. Three experiments in experimenter bias. Psychological Reports, 12, 491-511.

Rosenthal, R. (1969). Interpersonal expectation. In R. Rosenthal & R. L. Rosnow (Eds.), Artifacts in behavioral research (pp. 181-277). New York: Academic Press.