LYKKEN’S article on STATISTICAL SIGNIFICANCE IN PSYCHOLOGICAL RESEARCH presents an alternative view of significance testing in psychology (especially the “softer areas”, like personality, assessment, and psychotherapy). As a prelude to the argument, Lykken tells the story of a research study by Sapolski that no one believes, even though it got significant results. Sean will tell you about the cloacal frog. You may find the theory hard to believe, but strong psychoanalytic types will find it easier. For example, there is a classic book of case reports by Lindner, called “The 50 Minute Hour”, that includes a case of an eating disorder associated with a desire for pregnancy.

The traditional method of significance testing sets up the NULL HYPOTHESIS, which essentially states that all variables in the research study are UNRELATED (UNCORRELATED). This applies to all variables in a correlation matrix, as well as to the independent and dependent variables in an experimental design. Within this framework, statisticians know that two kinds of errors are possible:

1) The BETA error is an incorrect failure to reject the null hypothesis when the null hypothesis is really false (i.e., the theory being tested is true). This often occurs when the research study does not have enough power (in terms of the number of subjects) to avoid BETA errors. The BETA error is not emphasized in Lykken’s article; instead, Lykken emphasizes the ALPHA error.

2) The ALPHA error is an incorrect rejection of the null hypothesis when it is really true. That is, the statistics will be significant some of the time by chance alone, even if the theory being tested is FALSE (i.e., the null hypothesis is true). Researchers have control over the rate of the ALPHA error and USUALLY set ALPHA = .05, meaning that they are willing to tolerate being wrong (making an ALPHA error) 5 times out of 100. Thus significant results from any one study ought to be somewhat persuasive in convincing the reader that the theory is true (but note that this was not the case with the Sapolski study). Why?
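If you want to see the "5 times out of 100" figure for yourself, here is a minimal simulation sketch. It is my own illustration, not something from Lykken. It assumes two truly unrelated, normally distributed variables, 50 subjects per study, and a two-tailed test of the correlation based on the Fisher z approximation; the function names and the numbers of studies and subjects are made up for the example.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def is_significant(r, n, alpha=0.05):
    """Two-tailed test of r against zero using the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > z_crit


def alpha_error_rate(n_studies=2000, n_subjects=50, alpha=0.05, seed=0):
    """Share of simulated studies that reject a null hypothesis that is TRUE."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # x and y are generated independently, so the null really holds.
        x = [rng.gauss(0, 1) for _ in range(n_subjects)]
        y = [rng.gauss(0, 1) for _ in range(n_subjects)]
        if is_significant(pearson_r(x, y), n_subjects, alpha):
            hits += 1
    return hits / n_studies


rate = alpha_error_rate()
print(f"Share of 'significant' studies with a true null: {rate:.3f}")
```

The printed share should land near .05, which is all that setting ALPHA = .05 promises under the traditional view.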

The alternative view is that the NULL HYPOTHESIS IS NEVER STRICTLY TRUE, especially in areas of soft psychology. That is, the general molar variables found in personality, psychopathology, psychotherapy, psychological assessment, etc. always share about 4-5% of their variance. This translates to correlations of r = .20 to .22. So, if the measures used are reliable and there are enough research participants, one will always find statistical significance--EVEN IF THE THEORY BEING TESTED IS FALSE. This pushes the probability of finding significance for a false theory toward 1.0, even if we set ALPHA at .05. In practice, however, the researcher usually makes directional predictions. If the statistics are always significant, they will fall in the direction of the researcher’s predictions about half the time. So the real ALPHA is closer to .5 in soft psychology, not .05. What does this mean for interpretations of bodies of research literature?
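The arithmetic behind this paragraph can be sketched as follows. This is an illustration under stated assumptions, not Lykken's own calculation: I use the Fisher z normal approximation for a two-tailed test at ALPHA = .05, I assume a true background correlation of .20, and I assume its sign is equally likely to agree or disagree with the researcher's prediction. The function names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def r_from_shared_variance(shared):
    """Correlation implied by a given proportion of shared variance."""
    return math.sqrt(shared)


def power_two_tailed(rho, n, alpha=0.05):
    """Approximate chance of a significant r (either sign) when the true
    correlation is rho and there are n subjects (Fisher z approximation)."""
    delta = math.atanh(rho) * math.sqrt(n - 3)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper = 1 - STD_NORMAL.cdf(z_crit - delta)   # significant and positive
    lower = STD_NORMAL.cdf(-z_crit - delta)      # significant and negative
    return upper + lower


def prob_predicted_direction(rho, n, alpha=0.05):
    """Chance of a significant result in the PREDICTED (positive) direction
    when the true correlation is +rho or -rho with equal probability."""
    delta = math.atanh(rho) * math.sqrt(n - 3)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    p_if_positive = 1 - STD_NORMAL.cdf(z_crit - delta)
    p_if_negative = 1 - STD_NORMAL.cdf(z_crit + delta)
    return 0.5 * (p_if_positive + p_if_negative)


print("4% shared variance -> r =", round(r_from_shared_variance(0.04), 2))
print("5% shared variance -> r =", round(r_from_shared_variance(0.05), 2))
for n in (50, 200, 1000, 5000):
    print(n,
          round(power_two_tailed(0.20, n), 3),
          round(prob_predicted_direction(0.20, n), 3))
```

As the number of participants grows, the first column of probabilities climbs toward 1.0 and the second settles near .5, which is the sense in which the "real ALPHA" is closer to .5 than to .05.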

Example: Hundreds of studies of psychotherapy outcome research have tested the “theory” that therapy is better than no therapy. Sixty percent of these studies show that therapy helps, 10% show that therapy hurts (i.e., is worse than no therapy), and 30% show that therapy makes no difference.

According to the traditional model of hypothesis testing, this is impressive evidence for the effectiveness of therapy, since we would expect only 5% of the studies to show that therapy helps if therapy is in fact ineffective (i.e., the null hypothesis is true). So we would likely reject the null hypothesis in favor of the theory that therapy is beneficial.
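To put a number on "impressive" under the traditional model, here is a small sketch of the binomial reasoning. The figure of 100 studies is hypothetical (the example above says only "hundreds"), and the calculation assumes the studies are independent and that each has exactly a 5% chance of a positive result when therapy is ineffective. The function name is my own.

```python
from math import comb


def binom_tail(k, n, p):
    """Probability of k or more successes in n independent trials,
    each with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_studies = 100    # hypothetical number of outcome studies
k_positive = 60    # 60% of them show that therapy helps
p_chance = 0.05    # chance of a positive result under the traditional null

tail = binom_tail(k_positive, n_studies, p_chance)
print(f"P(at least {k_positive} of {n_studies} positive | p = {p_chance}) = {tail:.3g}")
```

The result is vanishingly small, which is why the traditional model treats this pattern of results as strong grounds for rejecting the null hypothesis. The conclusion depends entirely on taking 5% as the chance baseline.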

According to the alternative model, by chance alone we would expect about 50% of studies to indicate that therapy helps (even if therapy is in fact ineffective). Against that baseline, 60% positive studies is not so impressive. Indeed, if psychotherapy is truly beneficial, then the great majority (90% to 95%) of well-designed studies (i.e., those with sufficient power and reliable measuring devices) should have significant results showing that psychotherapy works. Under this view, the data pattern described above does not provide much support for the effectiveness of therapy.

I do not care which view of hypothesis testing you choose to believe; I only care that you know that both exist. I hold to the alternative view, which gives me one criterion for what I need to see in research studies before I believe that something in psychology is a bona fide phenomenon. Lykken discusses this criterion in terms of replicated studies.

Lykken lists three kinds of replication. In increasing order of importance, they are:

1. LITERAL. The exact duplication of a research study. Only the subjects and the time change.

2. OPERATIONAL. A reader of a research report looks at the Methods section and then constructs a new study that reproduces those methods as closely as possible.

3. CONSTRUCTIVE. The researcher starts with a clear statement of the results of a previous study (e.g., psychiatric patients who see frogs in the Rorschach are more likely to have eating disorders than psychiatric patients who do not see frogs). Then the researcher formulates her/his own methods of sampling, study setting, measurement, and data analysis.

Supporting evidence from a constructive replication provides more substance for the belief in a theory than does an operational replication. A literal replication is nice to have, but it is the weakest type of replication.

I personally do not base my beliefs in psychology upon the results of any one study. What I like to see is the following: constructive replications from different research labs across the country, conducted by researchers who have no allegiance to the theory being tested. Sean may wish to comment on how a researcher’s allegiance to a particular form of therapy affects the results of the research that person conducts.

“The moral of the story is that the finding of statistical significance is perhaps the least important attribute of a good experiment: it is never a sufficient condition for concluding that a theory has been corroborated, that a useful empirical fact has been established with reasonable confidence--or that an experimental report ought to be published.”