**Variance and the Design of Experiments**

**Contents**

Variance

The F Statistic

The Analysis of Variance

Power and Sensitivity

Designing Experiments - Independent Groups

Improving Experimental Designs

Correlated Groups Designs

Repeated Measures and Order Effects

Complete versus Partial Counterbalancing

The Bottom Line

**Variance**

When you look at the data from an experiment, the first thing you may notice is that the numbers are not all the same, even for the same condition or the same subject. There is variability. Usually we want to know __why__ the numbers are different. There will be many reasons, so we want to divide up the variability into portions that can be traced to different sources.

*Variance* is a statistic
that measures variability. Technically, the variance of the numbers is the sum
of squared deviations of each value from the mean value, divided by the sample
size minus one. (Why sample size minus one? Because that's the degrees of freedom.)
But you don't really need to know all this - the calculations can be handled automatically.
What you need to know is __why__ the variance is important, and what it all
means.

Variance is a measure of variability that can be divided into portions. We can compare the sizes of these portions by creating ratios from pairs of those portions (i.e., one variance divided by another). Then we make important statements based on the magnitude of the ratios.

In other words, when we see variability
in the data, we want to know where that variability comes from, and whether something
important has happened. *Variance* is a statistic that allows us to answer
these questions. The method for producing the answers is therefore called the
*Analysis of Variance,* or *ANOVA* for short.

How well do you understand the concept of variance?

| Question | Set 1 | Set 2 |
| --- | --- | --- |
| Why do the numbers in Set 1 have the same variance as the numbers in Set 2? | 1, 2, 3, 4, 5, 6 | 8, 9, 10, 11, 12, 13 |
| Why do the numbers in Set 1 have four times the variance of the numbers in Set 2? | 2, 4, 6, 8, 10, 12 | 1, 2, 3, 4, 5, 6 |
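
The definition given earlier (sum of squared deviations from the mean, divided by the sample size minus one) can be checked against the sets in the questions above. The sketch below uses Python, which these notes do not otherwise use. `sample_variance` is a name chosen for illustration, and `statistics.variance` from the standard library is included only as a cross-check.

```python
from statistics import variance  # sample variance: divides by n - 1


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


# The three distinct sets used in the two questions.
sets = {
    "1 to 6": [1, 2, 3, 4, 5, 6],
    "8 to 13": [8, 9, 10, 11, 12, 13],
    "2 to 12": [2, 4, 6, 8, 10, 12],
}

for label, values in sets.items():
    print(label, sample_variance(values), variance(values))
```

The first two sets both give 3.5 and the third gives 14.0, which confirms the figures in the questions. The reason why is still left for you to work out.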

**The F Statistic**

The analysis of variance is based on something called an F ratio. F stands for Fisher, Sir Ronald Fisher, the statistician who first developed the theory behind the analysis.

The most common use for an F ratio is to
test hypotheses about the effect of an independent variable on a dependent variable.
When we do this, the term in the numerator of the F ratio will be referred to
as a "*treatment variance*", and the term in the denominator will
be referred to as an "*error variance*". The F ratio then tells
us if the treatment variance is large, relative to the error variance.

By
the way, the word "error" is a most unfortunate misnomer. There is nothing
"wrong" about "error" in this context. We try to keep the
error variance small, but we can never make it go away, and it serves a very useful
purpose when it comes to testing hypotheses. A much better term than error variance
would be *nuisance variance*, but at this point we are stuck with the terminology.

To illustrate the use of F ratios, suppose we have run an experiment to compare two treatments, and that we have made six observations under each treatment. The treatments are given to two independent groups. The data might look like this:

| Treatment 1 | Treatment 2 |
| --- | --- |
| 4 | 7 |
| 6 | 5 |
| 8 | 8 |
| 4 | 9 |
| 5 | 7 |
| 3 | 9 |

The mean for Treatment 1 is 5.0, and the mean for Treatment 2 is 7.5. What can we say about the difference between the treatments?

Notice that there are three
kinds of variability in the table. First, we have 12 numbers that are (more or
less) all different. Second, on the average, scores for Treatment 2 are higher
than scores for Treatment 1. Finally, within each group the numbers are different.
So we can calculate three measures of variance. The first, based on all of the
variability, is the "*total variance*". The second, based on overall
group or treatment differences, is the "*treatment variance*".
The third, based on within-group variability, is the so-called "*error
variance*". (Again, there's nothing wrong with it - it's merely a nuisance).

Now suppose there is no real difference between the treatments (i.e., the null hypothesis of zero difference is true). The two group means should be similar, but it is highly unlikely that they would be identical. So there's bound to be some treatment variance. But the treatment variance ought to be approximately equal to the error variance. When we calculate a ratio of the treatment variance to the error variance, the ratio should be approximately 1.0, sometimes a little less, sometimes a little more.

On the other hand, if there is a true difference between the treatments (i.e., if the null hypothesis is false), the treatment variance should be larger than the error variance, and the ratio of the two should be larger than 1. How much larger should we expect it to be? Well, that's where the F statistic comes in. Our expectations will depend on the degrees of freedom, which in turn depend on the number of treatments and the number of observations per treatment. Tables of the F statistic tell us, for various degrees of freedom, what critical values of F we should use to reject the null hypothesis at a given level of alpha.
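
The reasoning in the last two paragraphs can be summarized by a standard result for a balanced independent groups design with fixed treatment effects. The symbols are not used elsewhere in these notes and are introduced only for this sketch: $k$ is the number of treatments, $n$ the number of observations per treatment, $\sigma^2$ the error variance in the population, and $\alpha_j$ the true effect of treatment $j$ (its population mean minus the grand mean).

$$
E(MS_{\text{treatment}}) = \sigma^2 + \frac{n \sum_{j=1}^{k} \alpha_j^2}{k - 1},
\qquad
E(MS_{\text{error}}) = \sigma^2
$$

If the null hypothesis is true, every $\alpha_j$ is zero, both mean squares estimate the same quantity, and their ratio should be close to 1. If it is false, the extra term is positive, and it grows with both the size of the treatment effects and the number of observations per treatment.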

**The Analysis of Variance**

Variance is "stuff". It can be piled up, divided into smaller piles, sorted, and compared. We can also talk about where each pile comes from, and what is responsible for it. You have already met this idea when talking about correlational research. Given a correlation between two variables, the coefficient of determination (or r squared) represents the proportion of variance in one variable that is accounted for or predicted by the other. Now we extend that idea to true experiments.

Recall that in the example above, we divided the total variance into treatment variance and error variance, and that we can calculate an F statistic by comparing the two. The division also lets us find out how much of the total variability is accounted for by the treatment, just as r squared does for a correlation. The tool for doing all this piling, dividing, and comparing is ANOVA.

When we perform an ANOVA,
we usually refer to things called *mean squares*. Don't let this confuse
you. A mean square is essentially the same thing as a variance (i.e., the "mean
squared deviation from the mean").

You'll also see reference to *sums
of squares* (or "sum of squared deviations"). We calculate a mean
square by dividing a sum of squares by its associated degrees of freedom. The
sums of squares in ANOVA turn out to be additive: that is, the total sum of squares
can be divided into parts that add up to the total. It is this property of additivity
that gives variance its "stuff"-like qualities.

For the data
in the table above, we obtain the following ANOVA. It is an independent groups
design, so we divide the variance into a "between-groups" source and
a "within-groups" source. The former is the *systematic variance*,
i.e., the variability in the group means. It's the variance we are most interested
in. The latter is the *error variance*, i.e., the variability that cannot
be explained by systematic differences between the groups. It is an indication
of how much variability we could expect if there were no true differences between
the groups.

| Source | Sum of squares | Mean square | df | F ratio | sig. |
| --- | --- | --- | --- | --- | --- |
| Between-groups | 18.75 | 18.75 | 1 | 6.82 | .026 |
| Within-groups | 27.5 | 2.75 | 10 | | |
| Total | 46.25 | | 11 | | |

We find that the treatment mean square is quite a lot larger than the error mean square, with an F ratio of 6.82. To interpret this F ratio we need to know the degrees of freedom. For the treatment variance it is the number of treatments minus 1, which is 1. For the error variance it is the sum, over the groups, of each sample size minus 1, i.e., (6 - 1) + (6 - 1) = 10. The F ratio turns out to have a significance level of .026, so at the conventional alpha of .05 we would reject the null hypothesis.

Note that the total sum of squares is the sum of the between groups term plus the within groups term. Furthermore, the total degrees of freedom is the sum of the between groups degrees of freedom plus the degrees of freedom within groups. The mean squares, though, are not additive.
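
The arithmetic behind the table can be reproduced in a few lines. The following is a minimal sketch in Python, not the procedure of any particular statistics package. The function and variable names are chosen for illustration, and the significance level is not computed because that requires the F distribution.

```python
# Scores from the two-treatment example above.
treatment_1 = [4, 6, 8, 4, 5, 3]
treatment_2 = [7, 5, 8, 9, 7, 9]


def mean(values):
    return sum(values) / len(values)


def one_way_anova(groups):
    """Partition the total sum of squares for an independent groups design."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)

    # Between-groups: each group mean's deviation from the grand mean,
    # weighted by the group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-groups: each score's deviation from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Total: each score's deviation from the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

    df_between = len(groups) - 1
    df_within = sum(len(g) - 1 for g in groups)

    # A mean square is a sum of squares divided by its degrees of freedom.
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ss_total": ss_total,
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f_ratio": ms_between / ms_within,
        "proportion_explained": ss_between / ss_total,
    }


result = one_way_anova([treatment_1, treatment_2])
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The sums of squares (18.75 + 27.5 = 46.25) and the degrees of freedom (1 + 10 = 11) add up, while the mean squares (18.75 and 2.75) do not. The last entry, about 0.405, is the proportion of the total sum of squares accounted for by the treatment.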

Can you predict the results of an ANOVA? In each of two experiments, two independent groups were compared. The results are shown below. In each case, which do you think is larger, the mean square (or variance) between groups, or the mean square (variance) within groups? Why?

| Experiment | Group 1 | Group 2 |
| --- | --- | --- |
| Experiment 1 | 2, 5, 11, 19, 21, 29 | 3, 5, 8, 22, 22, 30 |
| Experiment 2 | 2, 4, 5, 7, 8, 10 | 13, 14, 16, 18, 19, 20 |

**Power and Sensitivity**

Two concepts related to the analysis of variance are important when we design an experiment. The *power* of an experiment is the probability that we will find a true difference among the treatments when one exists. That is, it is the probability of correctly rejecting a false null hypothesis. *Sensitivity* means essentially the same thing. It is the ability of an experiment to detect small differences among the treatments. Maximizing power and sensitivity is an important step in the planning and design of any experiment. We shall use the notion of variance as a key to understanding how to design an experiment so that the power and sensitivity are maximized.
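
Power, as defined above, can be estimated by simulation: generate many experiments in which the null hypothesis is false by a known amount, and count how often the F ratio exceeds the critical value. The following is a minimal sketch under assumptions that are not part of these notes: two independent groups of six, normally distributed scores with a standard deviation of 1, alpha = .05, and a tabled critical value of 4.96 for F with 1 and 10 degrees of freedom. The function names are illustrative.

```python
import random

N_PER_GROUP = 6
# Critical value of F for alpha = .05 with 1 and 10 degrees of freedom
# (two groups of six), taken from a standard F table.
F_CRITICAL_1_10 = 4.96


def f_ratio(group_a, group_b):
    """F ratio for two independent groups of equal size."""
    n = len(group_a)
    mean_a = sum(group_a) / n
    mean_b = sum(group_b) / n
    grand_mean = (mean_a + mean_b) / 2
    # Between-groups df is 1, so the mean square equals the sum of squares.
    ms_between = n * ((mean_a - grand_mean) ** 2 + (mean_b - grand_mean) ** 2)
    ss_within = (sum((x - mean_a) ** 2 for x in group_a)
                 + sum((x - mean_b) ** 2 for x in group_b))
    ms_within = ss_within / (2 * (n - 1))
    return ms_between / ms_within


def estimate_power(true_difference, n_experiments=2000, seed=1):
    """Proportion of simulated experiments in which the null is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
        group_b = [rng.gauss(true_difference, 1.0) for _ in range(N_PER_GROUP)]
        if f_ratio(group_a, group_b) > F_CRITICAL_1_10:
            rejections += 1
    return rejections / n_experiments


for difference in (0.0, 1.0, 2.0):
    print(difference, estimate_power(difference))
```

With `true_difference` set to 0 the null hypothesis is true, so the proportion of rejections should be close to alpha. Larger true differences should give progressively higher estimates of power.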

**Designing Experiments - Independent Groups**

The "stuff"-like quality of variance gives us a useful tool for thinking about research. Consider a simple independent groups design, where different levels of an independent variable are assigned to independent groups. We can use Figure 1, below, to represent the different piles of variance.

Figure 1. Two examples of variance in an independent groups design

Total variance can be divided into systematic (between-group) variance - green in Figure 1 - and error (within-group) variance - yellow in Figure 1. Remember, it's the sums of squares that are additive, not the variances themselves, but it is still helpful to think in these terms. The relative size of the two piles tells us whether or not there is any systematic (non-random) difference between the groups. Thus, in the upper part of Figure 1 there appears to be no significant difference between the groups (the piles are similar in size), while in the lower part the difference is likely to be significant. Power and sensitivity are greater in the second example.

Now, behind the scenes the picture is somewhat more complicated. Systematic between-group differences can arise for two reasons - the effect of the independent variable itself, and also any confounding that is present. By definition, there is no way to separate the variance due to the independent variable from any variance due to confounding. That's why confounding is the real error in an experiment - if present, it renders the results uninterpretable.

So Figure 2 represents what happens if confounding is present. It shows variance due to the independent variable (green) and any confounding variables (red) mixed together in such a way that they cannot be separated. Thus, while the group means may be significantly different, we cannot conclude that the difference is caused by the independent variable.

Figure 2. Systematic variance includes variance caused by confounding variables

**Improving Experimental Designs**

These diagrams can help us to identify good experiments and poor experiments. A good experiment is one that has no confounding, and small error variance relative to the treatment variance. A poor experiment is one with confounding, and/or large error variance (see Figure 3).

Figure 3. A poor experimental design (top) and a good experimental design (bottom)

Suppose an experimenter wanted to find out the effects of sleep deprivation on mathematical problem solving. He tested one group of students within two hours of their waking from a good night's sleep. He tested a second group after 36 hours of sleep deprivation. He found a significant difference between the two groups in their performance on a math test. Unfortunately, it turned out that most of the subjects in the sleep deprivation group were psychology majors, while most of the subjects in the normal sleep group were science majors. The experimenter repeated the experiment, taking care to randomly assign subjects to the two groups. This time there was no significant difference between the two groups. How would you explain the different results in the two studies, using the concept of variance?

The question is, how do we get from a poor design (upper part of Figure 3) to a good design (lower part of Figure 3)? Remember that while large error variance is merely a nuisance, confounding is fatal, so it is essential that the confounding be removed. The most effective way to do this is normally to use random assignment of subjects to groups, so that all extraneous variables create only random, not systematic, variance. In other words, variance due to extraneous variables becomes part of the error variance.

Figure 4 illustrates what happens when we use random assignment to eliminate confounding. The error variance may increase somewhat, because additional extraneous variables contribute to the error. The systematic variance may actually shrink; whether it does or not depends on several factors, especially any correlation between the confounding variables and the independent variable. But at least now there is no fatal flaw in the design.

Figure 4. The result of using random assignment to eliminate confounding.

There are a number of ways in which we might reduce the error variance, and thereby increase the power of the design. Instead of randomizing all extraneous variables, we might decide to hold some of them constant, especially those that we know contribute large amounts of error variance. The most direct way to shrink the error, though, is to increase the sample size. Strictly speaking, a larger sample does not change the variability among individual scores; it reduces the sampling error of the group means, which is inversely related to the sample size, so a given treatment difference stands out more clearly against the error. Figure 5 illustrates the results of holding some variables constant (middle), and increasing the sample size (bottom).

Figure 5. Reducing error variance by holding variables constant and increasing sample size.

Finally, we might be able to increase the power of the design by fine tuning the independent variable. For example, by choosing more extreme levels for the variable, or making sure that our manipulation is effective, we might be able to increase the systematic variance (Figure 6). We have now achieved the ideal design we sought in Figure 3.

Figure 6. The effect of increasing systematic variance due to the independent variable.

**Correlated Groups Designs**

So far we have considered only independent groups designs. There are two forms of correlated groups designs that offer important advantages over independent groups - the matched groups (or matched subjects) design and the repeated measures (or within-subjects) design.

In
a matched groups design, subjects are divided into *blocks*, where all of
the individuals within a given block are alike in some way. The easiest way to
do this is to rank order subjects on some matching variable, and create the blocks
by taking successive sets of subjects from the rank ordering. Subjects are then
randomly allocated to treatment groups from within each set. The advantage to
this design is that variance due to whatever variable differentiates the blocks
is no longer part of the error term. In the analysis of variance it is extracted
as a separate source of variance.
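
The blocking procedure just described (rank order the subjects, take successive sets as blocks, then allocate at random within each block) can be sketched as follows. The function name, subject labels, and pretest scores are all hypothetical, and the sketch assumes that the number of subjects is a multiple of the number of treatments.

```python
import random


def assign_matched_groups(matching_scores, treatments, seed=None):
    """Return {subject: treatment} using blocks formed from a rank ordering.

    matching_scores: dict mapping a subject label to a matching-variable score.
    treatments: list of treatment labels; each block gets one subject per label.
    """
    rng = random.Random(seed)
    k = len(treatments)
    if len(matching_scores) % k != 0:
        raise ValueError("number of subjects must be a multiple of the number of treatments")

    # Rank order the subjects on the matching variable.
    ranked = sorted(matching_scores, key=matching_scores.get)

    assignment = {}
    # Take successive sets of k subjects from the ranking as blocks.
    for start in range(0, len(ranked), k):
        block = ranked[start:start + k]
        # Randomly allocate the treatments within the block.
        shuffled = list(treatments)
        rng.shuffle(shuffled)
        for subject, treatment in zip(block, shuffled):
            assignment[subject] = treatment
    return assignment


# Hypothetical pretest scores for eight subjects.
pretest = {"s1": 12, "s2": 30, "s3": 18, "s4": 25,
           "s5": 9, "s6": 21, "s7": 27, "s8": 15}
groups = assign_matched_groups(pretest, ["Treatment 1", "Treatment 2"], seed=42)
for subject in sorted(pretest, key=pretest.get):
    print(subject, pretest[subject], groups[subject])
```

Because each block contributes exactly one subject to each treatment, the treatment groups end up closely matched on the pretest, whichever way the random allocation falls.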

Figure 7 shows how matching serves to increase the power of the design by reducing the error. The pile of error variance has been divided into two piles, and the residual error is smaller than it was.

Figure 7. Matched subjects design: Variance due to block differences is removed from the error variance.

Suppose Sam and Virginia each ran an experiment in which the dependent variable was a person's score on a test of state anxiety. They both compared the same two treatments, using a matched subjects design. Sam used as his matching variable a subject's score on a test of intelligence. Virginia used as her matching variable a subject's score on a pretest measure of anxiety. How would the results of Sam and Virginia's analyses be different? Explain the difference in terms of the ideas illustrated in Figure 7.

The extreme form of matching is to use each subject as his or her own control, i.e., to match every individual with themselves. This is what happens in a Repeated Measures design, where each subject is exposed to every treatment condition.

In the analysis of variance for a Repeated Measures design, all individual differences will be extracted as a "Between Subjects" source of variance. This source is usually of no interest in itself, but again it serves to reduce the error variance and thereby increase power.

As you can see in Figure 8, the Subjects variance will usually exceed the Blocks variance in a matched groups design. More of the difference between subjects is extracted in a Repeated Measures design, thus producing an even greater increase in power.

Figure 8. Repeated measures design: Variance due to subject differences is removed from the error variance.
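
To see the extraction of Subjects variance in numbers, the sketch below analyzes one set of scores twice: once as if the two columns came from independent groups, and once as repeated measures on the same six subjects. The scores are invented for this illustration (they are not the data from the earlier example), and they were chosen so that subjects differ a great deal from one another while responding to the treatments in much the same way. The function names are illustrative.

```python
# Hypothetical scores: position i in each list is the same subject.
treatment_1 = [2, 4, 5, 7, 8, 4]
treatment_2 = [4, 7, 7, 10, 10, 7]


def mean(values):
    return sum(values) / len(values)


def partition(conditions):
    """Sums of squares for a one-factor repeated measures layout.

    conditions: equal-length score lists, one per treatment, with the
    subjects in the same order in every list.
    """
    k = len(conditions)
    n = len(conditions[0])
    scores = [x for c in conditions for x in c]
    grand_mean = mean(scores)

    ss_total = sum((x - grand_mean) ** 2 for x in scores)
    ss_treatments = sum(n * (mean(c) - grand_mean) ** 2 for c in conditions)
    subject_means = [mean([c[i] for c in conditions]) for i in range(n)]
    ss_subjects = sum(k * (m - grand_mean) ** 2 for m in subject_means)
    # What is left after removing treatments and subjects is the error.
    ss_error = ss_total - ss_treatments - ss_subjects
    return ss_total, ss_treatments, ss_subjects, ss_error


ss_total, ss_treatments, ss_subjects, ss_error = partition([treatment_1, treatment_2])
n, k = len(treatment_1), 2

# Independent groups analysis: subject differences stay in the error term.
f_independent = (ss_treatments / (k - 1)) / ((ss_total - ss_treatments) / (k * (n - 1)))
# Repeated measures analysis: subject differences are extracted first.
f_repeated = (ss_treatments / (k - 1)) / (ss_error / ((n - 1) * (k - 1)))

print(f"SS subjects = {ss_subjects:.2f}, SS error = {ss_error:.2f}")
print(f"F (independent groups) = {f_independent:.2f}")
print(f"F (repeated measures)  = {f_repeated:.2f}")
```

With these made-up numbers the Subjects sum of squares is 48.75, so the error sum of squares falls from 49.5 to 0.75 and the F ratio rises from about 3.79 to 125. Real data will rarely be this tidy, but the direction of the change is the point.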

Subjects variance in a repeated measures design will usually exceed the Blocks variance in a matched groups design. Why?

**Repeated Measures and Order Effects**

As you are surely aware, there is a problem with repeated measures designs. If each subject is tested more than once, the order in which the treatments are applied becomes a major concern. If the same order is used with every subject, we have a very serious problem: Treatment and Order are confounded. The upper part of Figure 9 illustrates the problem. We no longer have a legitimate test of the Treatment effect, because it is confounded with the Order effect.

Figure 9. Removing confounding due to order effects by using random orders.

The
most common way to control for order effects is to use a **randomized order**,
chosen separately for each subject. As is always the case with randomization,
this eliminates the problem of confounding, but it does so at the cost of an increase
in error variance. This is illustrated in the lower part of Figure 9 (see also
Figure 4). Confounding has been removed, but the error variance is larger.
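
Choosing a randomized order separately for each subject takes only a few lines. This is a minimal sketch; the function name, subject labels, and treatment labels are hypothetical.

```python
import random


def random_orders(subjects, treatments, seed=None):
    """Draw an independent random treatment order for each subject."""
    rng = random.Random(seed)
    # sample() with k equal to the list length returns a random permutation.
    return {subject: rng.sample(treatments, len(treatments)) for subject in subjects}


orders = random_orders(["s1", "s2", "s3", "s4"], ["A", "B", "C"], seed=7)
for subject, order in orders.items():
    print(subject, "".join(order))
```

Nothing here forces the orders to be balanced across subjects, which is why any order effects end up in the error variance.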

This creates a dilemma. The big advantage to a repeated measures design is its greater power. Yet, by controlling for order effects, we reduce that power by adding to the error variance. Usually the gain in power by removing individual differences from the error exceeds the loss of power that results from adding order effects to the error, but this is not guaranteed.

There is one way
to regain the power. We can control order effects by using **counterbalancing**.
In a counterbalanced design we use separate groups of subjects, each group receiving
a different order. If there are two treatments, for example (A and B), Group 1
receives the treatments in the order AB, and Group 2 receives the treatments in
the order BA. Now the variance due to order effects becomes a between-groups source
of variance. It is extracted by the analysis of variance, and is no longer part
of the error variance (Figure 10).

Figure 10. Removing confounding due to order effects by using counterbalancing.

In Figure 10 the "Groups" differ only in the order of treatment. By creating a separate source of variance for groups, the error variance is reduced. For more information on the analysis of counterbalanced designs, see the supplemental notes on counterbalancing.

The lower section of Figure 10 shows four sources of variance. In the analysis of variance the Treatments mean square is compared with the Error mean square. The Subjects and Groups terms are usually ignored. Why would the researcher usually not be interested in the size of the Subjects mean square or the Groups mean square? Under what conditions might they be of interest?

**Complete versus Partial Counterbalancing**

When designing experiments, we never get anything for free. While counterbalancing can preserve the power of a repeated measures design, it does so at a cost. If there are only two treatments, counterbalancing is easy - we use two groups, one with the AB order and the other with the BA order. If there are three treatments, though, complete counterbalancing requires six groups: ABC, ACB, BAC, BCA, CAB, and CBA. If there are four treatments there are 24 possible orders, so to achieve complete counterbalancing we need 24 groups! Clearly this is becoming unfeasible.

There is one
final option. Instead of using complete counterbalancing, we can usually get by
with **partial counterbalancing**. If there are *k* treatments, we need
only *k* groups for a partial counterbalancing. For *k* = 3, for example,
the groups might use the orders ABC, BCA, and CAB. For *k* = 4, we can use
ABCD, BCDA, CDAB, and DABC. These designs are usually referred to as *Latin
Square* designs, which are beyond the scope of this course.
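
The two sets of orders described above are easy to generate. In this minimal sketch, complete counterbalancing lists every permutation of the treatments, and partial counterbalancing rotates the treatment list one step at a time, which reproduces the examples given for *k* = 3 and *k* = 4. The function names are illustrative, and rotation is only one simple way of building such a square.

```python
from itertools import permutations


def complete_counterbalancing(treatments):
    """Every possible order of the treatments (k factorial orders)."""
    return ["".join(order) for order in permutations(treatments)]


def cyclic_partial_counterbalancing(treatments):
    """k orders formed by rotating the treatment list one step at a time."""
    k = len(treatments)
    return ["".join(treatments[(start + i) % k] for i in range(k))
            for start in range(k)]


print(complete_counterbalancing("ABC"))
print(len(complete_counterbalancing("ABCD")))
print(cyclic_partial_counterbalancing("ABC"))
print(cyclic_partial_counterbalancing("ABCD"))
```

In the rotated orders each treatment appears exactly once in each ordinal position across the *k* groups, which is what keeps Treatment from being confounded with position.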

For more on this topic, including a discussion that may be useful to you when you design your own project, see the separate notes on counterbalancing.

**The Bottom Line**

When designing an experiment, you have to consider each independent variable and ask whether it should be manipulated using an Independent Groups, Matched Groups, or Repeated Measures design. Some rules of thumb follow from the previous discussion.

1. If possible, use repeated measures. It's more powerful, and usually saves time and effort.

2. Repeated measures will be impossible if measuring a person once would make it impossible to measure them again. For example, if your independent variable consists of two types of instructions, you probably don't want to use the same test twice with different instructions each time.

3. If you have only two levels for a repeated measures variable, use counterbalancing to control for order effects and preserve power. That requires two independent groups that differ only in the order of treatments.

4. If you have several levels for a repeated measures variable, use partial counterbalancing.

5. If you have many levels for a repeated measures variable, use a random order.

6. If using repeated measures is impossible, consider using a matched subjects design. It can be time consuming, because you need to test all subjects on the matching variable before you can assign any of them to a treatment condition.

7. Classification variables, by definition, must be treated as between groups variables. Of course, they are not independent groups, but the ANOVA proceeds as if they were. One just has to be careful when interpreting the results.

An investigator wants to study cooperative behavior in a task similar to the Prisoners' Dilemma game. Subjects will be told they are playing with a human partner, although the partner is actually a pre-programmed computer. The dependent variable will be the number of trials on which a subject chooses to cooperate. The investigator will examine three variables as possible determinants of the degree of cooperation:
What kind of design would you suggest the investigator use for each of these three variables?