Chapter 7 A model of means (ANOVA)

In this chapter, we will discuss how nominal variables, which often reflect different manipulations within an experiment, can be included in the General Linear Model. This is done by constructing so-called contrast codes, which provide a means to construct metric predictors which encode differences between the levels of the nominal variable (e.g. differences between conditions in an experiment). When a study concerns only a single nominal variable, the interest is generally in determining whether there are differences in the average of the dependent variable between the levels (groups). Traditionally, ANOVA has focused on so-called omnibus tests, which test whether at least one group mean differs from that of another group. Contrast codes allow you to test more informative hypotheses, and we will discuss different ways of constructing these contrast codes, including so-called orthogonal contrast codes. Within the context of a single model, you can only include a limited number of contrast codes, and we end the chapter by considering how you can test more hypotheses, whether these were conceived beforehand (planned comparisons) or afterwards (post-hoc tests).

7.1 Can playing Tetris reduce intrusive memories?

After a traumatic experience, some people experience flashbacks, which are intrusive and involuntary memories that involve vivid imagery related to the traumatic event. These intrusive memories can be highly distressing and are a hallmark of acute stress disorder and posttraumatic stress disorder (PTSD). It has been suggested that recalling traumatic memories under certain conditions can reduce their negative impact. Memory consolidation refers to a collection of neural processes which stabilize a memory trace after it is acquired. According to reconsolidation theory, when a consolidated memory trace is reactivated (remembered), it again becomes malleable and will require restabilization for it to persist. Disruption of this reconsolidation after recall may then be a way to reduce the strength of traumatic memories, or even allow them to be forgotten.

James et al. (2015) conducted a study to investigate this idea. They reasoned that because intrusive memories of trauma are often visual in nature, performing a demanding visuospatial task (e.g. playing the computer game Tetris) after recall could interfere with the reconsolidation process and reduce subsequent intrusions of the traumatic memory. In their Experiment 2, they first created traumatic memories by showing their participants a 12-minute film with graphic scenes depicting death and serious injury (e.g. a van hitting a teenage boy while he was using his mobile phone crossing the road). Participants then went home and recorded the number of intrusive memories of the film during the subsequent 24-hour period (Day 0). The next day, they returned to the lab and were randomly assigned to one of four conditions:

  1. No-task control: participants in this condition (\(n=18\)) completed a 10-minute music filler task, rating excerpts of classical music for pleasantness.
  2. Tetris+Reactivation: participants in this condition (\(n=18\)) were shown a series of images from the scenes in the trauma film to reactivate the memories of the scenes. After this reactivation task, they completed the 10-minute music filler task, and then played the video game Tetris for 12 minutes.
  3. Tetris-Only: participants in this condition (\(n=18\)) performed the music filler task and then played Tetris for 12 minutes, but did not complete the reactivation task.
  4. Reactivation Only: participants in this condition (\(n=18\)) completed the reactivation and music filler task, but did not play Tetris.

All participants then went home and were asked to record the number of intrusive memories they experienced over the next seven days (Day 1 to 7). After this week passed, participants returned to the lab and completed an Intrusion-Provocation Task, in which they were shown blurred images from the trauma film and asked to indicate whether each of these triggered an intrusive memory.

The sample means and standard deviations of the number of intrusions in each condition are:

| Condition | \(\overline{Y}\) | \(S_Y\) |
|---|---|---|
| Control | 5.11 | 4.11 |
| Tetris+Reactivation | 1.89 | 1.70 |
| Tetris-Only | 3.89 | 2.81 |
| Reactivation-Only | 4.83 | 3.24 |

Boxplots for each condition are provided in Figure 7.1.


Figure 7.1: Number of intrusive memories during days 1 to 7 for each condition of Experiment 2 of James et al. (2015).

7.2 Comparing two groups

To start with a relatively straightforward example, let’s first focus only on the data from the Tetris+Reactivation and the Reactivation-Only conditions. We are interested in whether playing Tetris during reactivation reduces the number of memory intrusions on later days, in comparison to when the traumatic memory was reactivated only. To investigate this, we will extend the simple model of Chapter 3. In particular, we will assume that the number of intrusions is Normal-distributed with a mean that depends on the experimental condition participants were assigned to (Tetris+Reactivation or Reactivation-Only). We can write this model in two equivalent ways as: \[Y_i \sim \mathbf{Normal}(\mu_{\text{con}},\sigma)\] and \[\begin{equation} Y_i = \mu_\text{con} + \epsilon_i \quad \quad \epsilon_i \sim \mathbf{Normal}(0,\sigma) \tag{7.1} \end{equation}\] where \(\mu_\text{con}\) is a placeholder for the mean of a particular condition. Specifying that mean explicitly, we can also write the model as

\[Y_{i} = \begin{cases} \mu_\text{t+r} + \epsilon_{i} \hspace{2em} \text{if condition = Tetris+Reactivation} \\ \mu_\text{react} + \epsilon_{i} \hspace{2em} \text{if condition = Reactivation-Only} \end{cases} \quad \quad \epsilon_i \sim \mathbf{Normal}(0,\sigma_\epsilon)\] Note that the model assumes that while the mean can differ between the conditions, deviations of the observations around the mean are assumed to have the same standard deviation \(\sigma_\epsilon\). An example of what the model might look like is given in Figure 7.2.


Figure 7.2: Two Normal density functions with a different mean but identical standard deviation.

Can we estimate such a model, using the tools we have already learned about? Yes! It is actually quite straightforward to construct a linear model to represent the model with means depending on condition. Condition is a nominal variable, which we can’t simply include “as is” in a linear model. Linear models need metric predictors. But we can construct a new predictor \(X\), which has the value 0 for participants in the Tetris+Reactivation condition, and the value 1 for participants in the Reactivation-Only condition. This predictor \(X\), with values 0 and 1 referring to different groups, is commonly referred to as a dummy coding variable. With this dummy predictor, the model \[Y_i = \beta_0 + \beta_1 \times X_i + \epsilon_i \quad \quad \epsilon_i \sim \mathbf{Normal}(0,\sigma)\] is formally equivalent to the model of Equation (7.1). To see why, it is important to realize that linear regression concerns the conditional means of the dependent variable given the values of the predictor variable(s): \[\hat{Y}_i = \mu_{Y|X_i}\]
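
To make this construction concrete, here is a minimal sketch in Python/NumPy. The data are simulated for illustration (only the two condition means and standard deviations are taken from the study); the variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 18

# Simulated intrusion scores (means and SDs from the two conditions)
y_tr = rng.normal(1.89, 1.70, n)   # Tetris+Reactivation
y_re = rng.normal(4.83, 3.24, n)   # Reactivation-Only
intrusions = np.concatenate([y_tr, y_re])

# Dummy predictor: 0 = Tetris+Reactivation, 1 = Reactivation-Only
dummy = np.concatenate([np.zeros(n), np.ones(n)])

# Least-squares fit of the model Y = b0 + b1 * X + error
X = np.column_stack([np.ones(2 * n), dummy])
beta, *_ = np.linalg.lstsq(X, intrusions, rcond=None)

print(beta[0], y_tr.mean())                # intercept = mean of the 0-coded group
print(beta[1], y_re.mean() - y_tr.mean())  # slope = difference between the means
```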

If we give the model enough flexibility (and I will discuss more in detail what that means later), then the model will be able to represent those conditional means accurately. A main thing to focus on for now is that we assigned a different value of the predictor variable \(X\) for the two conditions. This means that the conditional means can differ between the conditions.

Let’s write out the structural part of the model (excluding the error term \(\epsilon_i\)) for participants from each condition. For everyone in the Tetris+Reactivation condition, \(X_i = 0\), so the model predictions for these cases are \[\begin{aligned} \hat{Y}_i &= \mu_{Y|X_i} \\ &=\beta_0 + \beta_1 \times 0 \\ &= \beta_0 \end{aligned}\] For everyone in the Reactivation-Only condition, \(X_i = 1\), so the model predictions for these cases are \[\begin{aligned} \hat{Y}_i &= \mu_{Y|X_i} \\ &= \beta_0 + \beta_1 \times 1 \\ &= \beta_0 + \beta_1 \end{aligned}\]

For cases in the Tetris+Reactivation condition, the model predictions are a constant: \(\beta_0\). If the errors are Normal-distributed, this is effectively the same model as we used in Chapter 3. The only difference is in the label for the constant: in Chapter 3 we labelled this as \(\mu\), and here we label it as \(\beta_0\). Although the labels look different, they both refer to the same thing, namely the (conditional) mean of the dependent variable. If we “relabel” \(\beta_0 = \mu_\text{t+r}\), then we have the model we wanted for this condition.

For cases in the Reactivation-Only condition, the prediction consists of the sum of two constants, \(\beta_0 + \beta_1\). A sum of two constants is itself a constant. Moreover, because we have already “relabelled” \(\beta_0 = \mu_\text{t+r}\), we can also write this sum of constants as \(\mu_\text{react} = \mu_\text{t+r} + \beta_1\). If we then move \(\mu_\text{t+r}\) to the left-hand side of this equation by subtracting \(\mu_\text{t+r}\) from both sides of the equation, we can see that \(\beta_1 = \mu_\text{react} - \mu_\text{t+r}\). In words, the slope of our predictor equals the difference between the mean of the Reactivation-Only condition and the mean of the Tetris+Reactivation condition.

Remember that the linear model represents the conditional mean of the dependent variable for each possible combination of values of the predictor variables. We have just constructed a predictor with a different value for cases in each condition, and we could write the resulting model predictions in terms of two parameters (the intercept and the slope) which we could relate to the means in each condition. In Chapter 3, we showed that the maximum likelihood estimate of the mean \(\mu\) of a Normal-distributed variable is the sample mean \(\overline{Y}\). Considering each group in isolation, the maximum likelihood estimates in each group are then simply the sample means in each group (condition). The model we just constructed is defined over all groups. However, by assigning a different value of the predictor variable \(X\) to each group (and choosing these values wisely), this single model allows you to do the same (and more) than defining a different model for each group. This is the essence of contrast coding: constructing metric predictors which allow you to model different groups in a single linear (regression) model.

In the model just constructed, the maximum likelihood parameter estimates are a function of the sample means in each group. More precisely, the estimate of the intercept equals the mean in the Tetris+Reactivation condition: \[\hat{\beta}_0 = \overline{Y}_\text{t+r}\] The estimate of the slope equals the difference between the sample mean of the Reactivation-Only condition and the sample mean of the Tetris+Reactivation condition: \[\hat{\beta}_1 = \overline{Y}_\text{react} - \overline{Y}_\text{t+r}\]

The average number of intrusive memories in the Tetris+Reactivation condition was \(\overline{Y}_\text{t+r} = 1.89\), and the corresponding mean in the Reactivation-Only condition was \(\overline{Y}_\text{react} = 4.83\). Estimating the regression model provides the following estimates: \[\texttt{intrusions}_i = 1.89 + 2.94 \times \texttt{dummy}_i + \hat{\epsilon}_i \quad \quad \hat{\epsilon}_i \sim \mathbf{Normal}(0, 2.66)\] As you can see, the estimated intercept equals \[\hat{\beta}_0 = \overline{Y}_\text{t+r}\] and the slope of \(\texttt{dummy}\) equals \[\hat{\beta}_1 = \overline{Y}_\text{react} - \overline{Y}_\text{t+r} = 4.83 - 1.89 = 2.94\] The results of the hypothesis tests that the true intercept and slope equal 0 are given in Table 7.1. The test result for the intercept indicates that the average number of intrusions in the Tetris+Reactivation condition is different from zero. More interestingly, the test result for the slope of \(\texttt{dummy}\) indicates that the difference between the Tetris+Reactivation and Reactivation-Only condition is not equal to 0. In other words, it is likely that there is a difference between the conditions in the average number of intrusions. Playing Tetris after memory reactivation seems to reduce the number of subsequent memory intrusions.

Table 7.1: Linear model predicting number of intrusions by a dummy predictor. Values under \(\hat{\beta}\) are parameter estimates. The value of SS in the Error row is the Sum of Squared Error (SSE) of the full model, and in the remaining rows reflect the Sum of Squares Reduced (SSR) between a MODEL R where that parameter is fixed to 0, and the full MODEL G where it is estimated. The values under df are \(\text{df}_2\) for the Error row, and \(\text{df}_1\) in the other rows. Values of \(F\) are the \(F\)-statistic for the null-hypothesis tests where the null hypothesis is that the true value of the parameter is 0, i.e. \(\beta_j = 0\). Values of \(p(\geq F)\) are the corresponding \(p\)-values.
| | \(\hat{\beta}\) | \(\text{SS}\) | \(\text{df}\) | \(F\) | \(p(\geq F)\) |
|---|---|---|---|---|---|
| Intercept | 1.89 | 64.2 | 1 | 9.09 | 0.005 |
| \(\texttt{dummy}\) | 2.94 | 78.0 | 1 | 11.04 | 0.002 |
| Error | | 240.3 | 34 | | |

A test comparing the means of two Normal-distributed variables is also known as an independent samples \(t\)-test. It involves an extension of the one-sample \(t\)-test discussed in Chapter 3 and is based on the sampling distribution of the difference between two sample means. As with any \(\text{df}_1 = 1\) \(F\)-test, taking the square root of the \(F\) statistic for this test gives the value of the \(t\) statistic of the independent samples \(t\)-test. As the mathematical details of the two-sample \(t\)-test don’t provide any new insights, we omit them here. A main thing to realise is that this test will provide identical results to our test of the slope of \(\texttt{dummy}\). By constructing a model where the slope of \(\texttt{dummy}\) is identical to the difference between the means of the Tetris+Reactivation and Reactivation-Only conditions, we have made it possible to test this hypothesis in the context of a linear regression model. Whilst this model is basically a regression model, including new predictors for nominal variables makes the model flexible enough to encompass most of the commonly encountered analyses, which perhaps is why it was rebranded as the General Linear Model.
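
The equivalence is easy to verify numerically. The following sketch (Python with NumPy and SciPy; simulated data again) computes the independent samples \(t\)-test and the model-comparison \(F\)-test of the dummy slope, and confirms that \(t^2 = F\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y1 = rng.normal(1.89, 1.70, 18)   # Tetris+Reactivation (simulated)
y2 = rng.normal(4.83, 3.24, 18)   # Reactivation-Only (simulated)

# Independent samples t-test, assuming equal variances (as the model does)
t, p_t = stats.ttest_ind(y1, y2, equal_var=True)

# F-test of the dummy slope via model comparison
y = np.concatenate([y1, y2])
sse_r = np.sum((y - y.mean())**2)                                  # MODEL R: intercept only
sse_g = np.sum((y1 - y1.mean())**2) + np.sum((y2 - y2.mean())**2)  # MODEL G: two means
F = (sse_r - sse_g) / (sse_g / (len(y) - 2))

print(t**2, F)   # identical up to floating-point error
```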

The dummy coding procedure above can be generalized to the situation in which you want to compare more than two groups. In the Tetris study, there were four conditions. To allow a linear model to represent the means in all four conditions, you need to use more than one dummy-coding predictor. In fact, you would need 3 dummy-coding predictors. In such a dummy-coded model, the intercept represents the mean of one condition, which we can call the reference group. The slope of each predictor represents the difference between the mean of the corresponding other group and the mean of the reference group. Whilst dummy coding is simple and provides interpretable parameters, there are alternative coding schemes that may provide more interesting tests.

7.3 The ANOVA model

Before going into such alternative coding schemes, we will take a slight detour and consider the traditional model for the case of Normal-distributed variables in multiple groups with (potentially) different means, but the same standard deviation. This model is also called the one-way ANOVA model, and can be stated as follows: \[\begin{equation} Y_{j,i} = \mu + \tau_j + \epsilon_{j,i} \quad \quad \epsilon_{j,i} \sim \textbf{Normal}(0, \sigma_\epsilon) \tag{7.2} \end{equation}\] Here, \(Y_{j,i}\) denotes the value of the dependent variable for case \(i = 1,\ldots, n_j\) in group \(j = 1, \ldots, g\). So \(g\) denotes the total number of groups (i.e. \(g=4\) in the Tetris study), and \(n_j\) the number of cases in group \(j\) (i.e. \(n_j = 18\) in the Tetris study for all \(j=1,\ldots,4\)). \(\epsilon_{j,i}\) is the corresponding error term for participant \(i\) in group \(j\). The mean \(\mu\) is the so-called “grand mean”, which is the overall mean of the dependent variable, the mean of any observation that could be produced by the Data Generating Process, regardless of group. It is generally defined as the average of the means in all conditions, i.e. \(\mu = \frac{\sum_{j=1}^g \mu_j}{g}\). This is identical to the mean over all observations if all the conditions have an equal number of observations. The term \(\tau_j\) represents the so-called “treatment effect” of group \(j\), defined as the difference between the mean of group \(j\) and the grand mean: \[\begin{equation} \tau_j = \mu_j - \mu \end{equation}\]

The traditional goal of an ANOVA is to determine whether there is any treatment effect. That is, to test the null-hypothesis \(H_0: \tau_j = 0 \text{ for all } j = 1, \ldots, g\). Note that the hypothesis states that \(\tau_j = 0\) for all groups. We could have also stated this hypothesis as \(H_0: \tau_1 = \tau_2 = \ldots = \tau_g = 0\), if you find that clearer. As the hypothesis states that the difference between the group-specific mean \(\mu_j\) and the grand mean \(\mu\) is 0 for every group, the implication is that \(\mu_j = \mu\) for all groups (i.e. all groups have an identical mean \(\mu\)).

The test statistic for this null-hypothesis is the \(F\) statistic, calculated as a ratio of the (estimated) sample variation of the treatment effects and the (estimated) variance of residual error terms. As we will see, we can perform this hypothesis test with the General Linear Model through an overall model test (comparing a “full” MODEL G to an intercept-only MODEL R). In addition, we can also test more specific hypotheses regarding differences between the group means and the overall mean, or specific hypotheses regarding differences between particular (combinations of) groups. In contrast to traditional ANOVA, the GLM approach also deals naturally with situations in which the groups have unequal sizes (i.e. \(n_j\) differs between the groups). This is not straightforward with the traditional ANOVA test.
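
As a preview of this model-comparison view, here is a minimal sketch (Python with NumPy and SciPy; the function name `omnibus_F` is mine) that computes the omnibus \(F\) statistic by comparing a group-means MODEL G to an intercept-only MODEL R:

```python
import numpy as np
from scipy import stats

def omnibus_F(groups):
    """Omnibus F: compare MODEL G (one mean per group) to MODEL R (grand mean only).
    groups: a list of 1-d arrays, one array of observations per group."""
    y = np.concatenate(groups)
    g, n = len(groups), len(y)
    sse_r = np.sum((y - y.mean())**2)                          # intercept-only model
    sse_g = sum(np.sum((yj - yj.mean())**2) for yj in groups)  # group-means model
    F = ((sse_r - sse_g) / (g - 1)) / (sse_g / (n - g))
    p = stats.f.sf(F, g - 1, n - g)
    return F, p
```

For data supplied as one array per group, this reproduces the result of `scipy.stats.f_oneway`, whether or not the groups have equal sizes.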

7.4 Contrast coding

As in the case of two groups, the approach to testing group differences in the GLM is to construct new predictor variables that represent differences between groups, which we’ll call contrast-coded predictors.

To illustrate the general concept of contrast coding in an intuitive manner, let’s consider a game in which you ask someone to pick a random number between 1 and 8 and your job is to determine the number they picked by asking questions which can be answered by “yes” or “no”. There are different ways in which you can play this game, and some of these are more efficient than others. For instance, you can ask sequential questions about the identity of the numbers such as “Is the number 1?”, “Is the number 2?”, “Is the number 3?”, etc. If they picked the number 1, then you would have needed just a single question, and if the number was 3, you would need three questions. If the number was 8, however, you would not need an additional question. After having asked “Is the number 7?”, either the answer would be “Yes”, in which case you would know the number to be 7, or “No” in which you would know the number to be 8. So the maximum number of questions is one less than the number of possibilities. As contrast coding is essentially asking such questions about group membership, you will always need one contrast-coding variable less than the number of groups.

Going back to the game, to determine any number between 1 and 8, an optimal playing strategy will always consist of 7 possible questions. You would not need to ask all of these questions in each game, but sometimes your opponent would have chosen 7 or 8, in which case you would need to ask all 7 questions. The strategy of asking whether the chosen number is equal to each of a sequence of numbers resembles dummy coding. In this view, the final number 8 is the reference group, and each question such as “Is the number 1?” is implicitly the same as “Is the number 1 and not 8?”.

A different way to play the game is by a strategy which guarantees you to always correctly “guess” the number in three questions. This strategy is depicted in Figure 7.3. In this strategy, each question halves the number of remaining options. If the first question “Is the number larger than 4?” is answered as “yes”, the number can only be 5, 6, 7, or 8. The options 1, 2, 3, or 4, are ruled out. Subsequently asking “Is the number larger than 6?” would reduce the remaining options by half again. If the answer to this question was “yes”, then the number would have to be either 7 or 8. If “no”, the remaining options would be 5 or 6. If the answer was “yes”, then subsequently asking the question “Is the number 7?” would allow you to finish the game (if the answer is “yes”, the number is 7, if “no”, it would have to be 8). Although more efficient in a single game, over all games, this strategy also consists of a total of 7 possible questions: “Is the number larger than 4?”, “Is the number larger than 2?”, “Is the number larger than 6?”, “Is the number 7?”, etc.


Figure 7.3: Questions to guess a random number between 1 and 8

Constructing contrast codes can be seen as a more complicated version of the game above. Suppose that instead of guessing a single number, the other person can assign numbers to groups \(A, B, \ldots, H\), and your job is to determine what all these numbers are. You are given one hint, and then you can ask questions only about the differences between the number(s) assigned to (combinations of) the groups. In the context of data analysis, the numbers are the average value of a dependent variable in those groups. For example, if I’m told the mean in group \(H\) is \(\mu_H = 2\), I can then ask what the difference in the means of group \(H\) and group \(G\) is. If I then learn that this difference is \(\mu_H - \mu_G = 1\), I can infer that the mean in group \(G\) must be \(\mu_G = \mu_H - 1 = 1\). This game may be less fun to play than the earlier version, but it’s more interesting from a data analysis perspective.

Let’s get back to the Tetris study. There were four conditions: the no-task Control condition, Tetris+Reactivation, Tetris-Only, and Reactivation-Only. Remember, your job is not to determine which condition someone was in. This can be answered with a total of two yes-no questions (“Did the condition involve playing Tetris?” and “Did the condition involve memory reactivation?”). Your job is now to determine the average number of intrusions in each condition by asking questions about differences between the averages. To start, I will give you a hint, which is that the average over all conditions (i.e. the grand mean) is \(\hat{\mu} = 3.93\).

One set of questions you could ask is the following:

  1. What is the difference between the mean of the Control condition and the grand mean?
  2. What is the difference between the mean of the Tetris+Reactivation condition and the grand mean?
  3. What is the difference between the mean of the Tetris-Only condition and the grand mean?

The answers to these questions are

  1. \(\hat{\mu}_\text{contr} - \hat{\mu} = 5.11 - 3.93 = 1.18\).
  2. \(\hat{\mu}_\text{t+r} - \hat{\mu} = 1.89 - 3.93 = -2.04\).
  3. \(\hat{\mu}_\text{tetr} - \hat{\mu} = 3.89 - 3.93 = -0.0417\).

Because I have told you the value of \(\hat{\mu}\) already, you can simply add the value of this “hint” to the answer of each question to determine the mean of a condition. You do not need to ask a fourth question (“What is the difference between the mean of the Reactivation-Only condition and the grand mean?”), because with the information provided, you would be able to determine this. Firstly, I should point out that \[\begin{aligned} \hat{\mu} &= \frac{\hat{\mu}_\text{contr} + \hat{\mu}_\text{t+r} + \hat{\mu}_\text{tetr} + \hat{\mu}_\text{react}}{4} \\ &= \frac{\overline{Y}_\text{contr} + \overline{Y}_\text{t+r} + \overline{Y}_\text{tetr} + \overline{Y}_\text{react}}{4} \end{aligned}\] i.e., the estimate of the grand mean is a “mean of means”. With a little algebra, we can then work out that \[\hat{\mu} - \frac{\hat{\mu}_\text{contr} + \hat{\mu}_\text{t+r} + \hat{\mu}_\text{tetr}}{4} = \frac{\hat{\mu}_\text{react}}{4}\] and hence \[\hat{\mu}_\text{react} = 4 \times \hat{\mu} - (\hat{\mu}_\text{contr} + \hat{\mu}_\text{t+r} + \hat{\mu}_\text{tetr})\] The treatment effect can then also be determined as \[\hat{\mu}_\text{react} - \hat{\mu} = -1 \times\left( (\hat{\mu}_\text{contr} - \hat{\mu}) + (\hat{\mu}_\text{t+r} - \hat{\mu}) + (\hat{\mu}_\text{tetr} - \hat{\mu}) \right)\] i.e. as minus one times the sum of the treatment effects of the other conditions. It is precisely because of dependencies like these that, when constructing contrast codes for a nominal variable representing group membership, you need one fewer contrast code than the number of groups (i.e., you need \(g-1\) contrast codes).
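
A quick numeric check of this redundancy, using the (rounded) sample means reported above, in plain Python:

```python
# Sample means from the study: Control, Tetris+Reactivation, Tetris-Only, Reactivation-Only
means = {"contr": 5.11, "t+r": 1.89, "tetr": 3.89, "react": 4.83}
grand = sum(means.values()) / 4                      # grand mean: 3.93

effects = {k: m - grand for k, m in means.items()}   # treatment effects

# The fourth treatment effect is minus the sum of the other three:
print(effects["react"], -(effects["contr"] + effects["t+r"] + effects["tetr"]))
```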

7.4.1 Effect coding

Contrast codes are variables that assign numeric values to groups, such that when we use these values in a linear model, the resulting parameter estimates reflect differences in the means of the groups. They are, in a sense, a computational trick to allow us to model group differences with a linear model. We will denote contrast codes as \(c_j\), where \(j=1,\ldots,g\) is an indicator for groups. The questions about deviations between group means and the grand mean (the treatment effects) correspond to the following three contrast codes \(c_j\):

| | \(c_1\) | \(c_2\) | \(c_3\) |
|---|---|---|---|
| Control | 1 | 0 | 0 |
| Tetris+Reactivation | 0 | 1 | 0 |
| Tetris-Only | 0 | 0 | 1 |
| Reactivation-Only | \(-1\) | \(-1\) | \(-1\) |

Just like for the dummy coding example discussed above, the idea is to construct a new predictor for each of these contrast codes. The contrast codes have values for each group or condition. The corresponding contrast-coded predictor has a value for each case \(i\) in the data, where we give all cases in a condition the corresponding value of \(c_j\) for that condition. For example, the first predictor \(X_1\), which corresponds to the first contrast code \(c_1\), would have the value \(X_{1,i} = 1\) if case \(i\) is in the Control condition, the value \(X_{1,i} = 0\) if case \(i\) is in the Tetris+Reactivation or Tetris-Only condition, and the value \(X_{1,i} = -1\) if case \(i\) is in the Reactivation-Only condition. Similarly, the second predictor \(X_2\), which corresponds to the second contrast code \(c_2\), would have the value \(X_{2,i} = 0\) if case \(i\) is in the Control condition, the value \(X_{2,i} = 1\) if case \(i\) is in the Tetris+Reactivation condition, the value \(X_{2,i} = 0\) if case \(i\) is in the Tetris-Only condition, and the value \(X_{2,i} = -1\) if case \(i\) is in the Reactivation-Only condition. Having defined three contrast-coded predictors, \(X_1\), \(X_2\), and \(X_3\) in this manner, we can then estimate the linear (regression) model \[Y_i = \beta_0 + \beta_1 \times X_{1,i} + \beta_2 \times X_{2,i} + \beta_3 \times X_{3,i} + \epsilon_i \quad \quad \epsilon_i \sim \mathbf{Normal}(0,\sigma_\epsilon)\] This model is treated as any other regression model. So after re-coding a nominal variable “condition” with four levels (Control, Tetris+Reactivation, Tetris-Only, and Reactivation-Only) with three contrast codes \(c_1\), \(c_2\), and \(c_3\), each with a corresponding predictor \(X_1\), \(X_2\), and \(X_3\), we end up with a linear model that is effectively like any other multiple regression model. Estimating the model gives: \[\texttt{intrusions}_i = 3.93 + 1.18 \times X_{1,i} - 2.04 \times X_{2,i} - 0.0417 \times X_{3,i} + \hat{\epsilon}_i \quad \quad \hat{\epsilon}_i \sim \mathbf{Normal}(0, 3.18)\] Figure 7.4 shows how the parameters of the model are related to the average number of intrusions in each condition.
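
In practice, statistical software constructs these predictors for you, but it is instructive to do it by hand once. A minimal sketch in Python/NumPy, with simulated data (only the condition means, the residual standard deviation, and the coding scheme are taken from the text):

```python
import numpy as np

# Effect codes c1, c2, c3 (rows: Control, Tetris+Reactivation,
# Tetris-Only, Reactivation-Only), as in the table above
codes = np.array([
    [ 1,  0,  0],
    [ 0,  1,  0],
    [ 0,  0,  1],
    [-1, -1, -1],
])

rng = np.random.default_rng(3)
condition = np.repeat([0, 1, 2, 3], 18)          # group index per case
true_means = np.array([5.11, 1.89, 3.89, 4.83])  # sample means from the study
y = rng.normal(true_means[condition], 3.18)      # simulated intrusions

# Each case receives the code values of its condition
X = np.column_stack([np.ones(len(y)), codes[condition]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# beta[0] estimates the grand mean; beta[1:4] the treatment effects of the
# Control, Tetris+Reactivation, and Tetris-Only conditions
print(beta)
```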

Figure 7.4: Average number of intrusions in the four conditions in the Tetris study and how they are related to the parameters of the effect-coding model. The intercept \(\hat{\beta}_0 = \hat{\mu}\) is the grand mean (dotted line). The slopes reflect treatment effects, which are deviations between the average intrusions in a condition and the grand mean. The averages of the first three conditions are equal to \(\beta_0 + \beta_j\), the sum of the intercept and the slope of the effect-coding predictor \(X_j\) representing the treatment effect of that condition. The mean of the last condition is the intercept minus the sum of the slopes of the effect-coding predictors.

Test results for the parameters of the model are given in Table 7.2. As can be seen there, the test of the intercept is significant. In this model, the intercept represents the “grand mean” \(\mu\) (the average of the means in each condition). The test indicates that the true value of the grand mean is unlikely to be 0. In addition, the slope of \(X_2\) is significant. This slope is equal to the treatment effect of the Tetris+Reactivation condition (i.e. \(\hat{\beta}_2 = \hat{\mu}_{t+r} - \hat{\mu}\)). The test is a test of the null-hypothesis \(H_0: \beta_2 = 0\), and this test involves a comparison of the models \[\begin{aligned} \text{MODEL G}: && Y_i &= \beta_0 + \beta_1 \times X_{1,i} + \beta_2 \times X_{2,i} + \beta_3 \times X_{3,i} + \epsilon_i \\ \text{MODEL R}: && Y_i &= \beta_0 + \beta_1 \times X_{1,i} + \beta_3 \times X_{3,i} + \epsilon_i \end{aligned}\] The comparison indicates that fixing the slope \(\beta_2 = 0\) in MODEL R results in a substantial increase in the Sum of Squared Error of MODEL R compared to MODEL G. As such, there is evidence that the true value of this slope does not equal 0, and with that, that the true treatment effect of the Tetris+Reactivation condition does not equal 0, i.e. that \(\mu_{t+r} - \mu \neq 0\). Furthermore, the estimate of the treatment effect is \(\hat{\beta}_2 = \hat{\mu}_{t+r} - \hat{\mu} = \overline{Y}_{t+r} - \frac{\overline{Y}_\text{contr} + \overline{Y}_\text{t+r} + \overline{Y}_\text{tetr} + \overline{Y}_\text{react}}{4} = -2.042\), which indicates that the number of intrusions in this condition is lower than the grand mean. As such, we would conclude that playing Tetris after memory reactivation reduces the subsequent memory intrusions. The tests of the slopes of \(X_1\) and \(X_3\) are not significant. We therefore do not have sufficient evidence that the treatment effect of the Control condition, or of the Tetris-Only condition, differs from 0.

Table 7.2: Linear model predicting number of intrusions by three effect-coded predictors. Values under \(\hat{\beta}\) are parameter estimates. The value of SS in the Error row is the Sum of Squared Error (SSE) of the full model, and in the remaining rows reflect the Sum of Squares Reduced (SSR) between a MODEL R where that parameter is fixed to 0, and the full MODEL G where it is estimated. For the Condition row, MODEL R is a model which fixes all parameters (apart from the intercept) to 0. The values under df are \(\text{df}_2\) for the Error row, and \(\text{df}_1\) in the other rows. Values of \(F\) are the \(F\)-statistic for the null-hypothesis tests where the null hypothesis is that the true value of the parameter is 0, i.e. \(\beta_j = 0\). Values of \(p(\geq F)\) are the corresponding \(p\)-values.
| | \(\hat{\beta}\) | \(\text{SS}\) | \(\text{df}\) | \(F\) | \(p(\geq F)\) |
|---|---|---|---|---|---|
| Intercept | 3.931 | 1112.347 | 1 | 110.289 | 0.000 |
| Condition | | 114.819 | 3 | 3.795 | 0.014 |
| \(\quad X_1\) | 1.181 | 33.449 | 1 | 3.316 | 0.073 |
| \(\quad X_2\) | \(-2.042\) | 100.042 | 1 | 9.919 | 0.002 |
| \(\quad X_3\) | \(-0.042\) | 0.042 | 1 | 0.004 | 0.949 |
| Error | | 685.833 | 68 | | |

Table 7.2 also includes a row labelled “Condition”. This is a test of the hypothesis that all of the slopes of the contrast-coded predictors for Condition are equal to 0: \[H_0\!: \beta_1 = \beta_2 = \beta_3 = 0\] This hypothesis test is based on comparing the following two models: \[\begin{aligned} \text{MODEL G}: && Y_i &= \beta_0 + \beta_1 \times X_{1,i} + \beta_2 \times X_{2,i} + \beta_3 \times X_{3,i} + \epsilon_i \\ \text{MODEL R}: && Y_i &= \beta_0 + \epsilon_i \end{aligned}\] i.e., it is a “whole model test”. The result of this so-called omnibus test (a simultaneous test of multiple parameters) is significant, indicating that it is unlikely that the true values of the slopes are all equal to 0. In MODEL R above, there is only a single parameter \(\beta_0\). As such, this model predicts that all conditions have the same mean \(\mu\). This test is thus a test of the null-hypothesis that all treatment effects for Condition are equal to 0: \[H_0\!: (\mu_\text{contr} - \mu) = (\mu_\text{t+r} - \mu) = (\mu_\text{tetr} - \mu) = (\mu_\text{react} - \mu) = 0\] which is also equivalent to the test that the means of all conditions are equal to each other: \[H_0\!: \mu_\text{contr} = \mu_\text{t+r} = \mu_\text{tetr} = \mu_\text{react} \]

The result of the omnibus test indicates that there is at least one treatment effect which is different from 0. Omnibus tests for all slopes reflecting treatment effects are what is traditionally focused on in an ANOVA. But these omnibus tests are not always that informative. We’d generally like to know more than “there is at least one treatment that is likely to have an effect”. It seems inherently more interesting to know which conditions are associated with a treatment effect. This is where the tests of the individual slopes come in handy. Assessing the effect of the three effect-coding predictors, we can conclude that only the treatment effect of the Tetris+Reactivation condition is significant. As such, we only have sufficient evidence that a combination of memory reactivation and playing Tetris changes the number of subsequent memory intrusions.

At this point, I’d like to make some important remarks. Firstly, the absence of sufficient evidence that any of the other treatment effects differs from 0 should not be taken as direct evidence that the true treatment effects equal 0. A non-significant result indicates a lack of evidence against the null-hypothesis, but not an abundance of evidence for the null hypothesis. You can think of this as follows: that a suspect in a murder trial is not able to provide evidence that she is innocent is in itself not sufficient evidence that she is guilty. You might also think of black swans. While the empirical statement “All swans are white” is impossible to prove conclusively without checking the colour of all swans that have and will ever grace this world, finding a single black swan disproves that statement immediately (Popper, 1959). Although significance testing does not provide conclusive evidence for or against the null-hypothesis, the analogy can be described as follows: a significant test is like spotting a swan that is “off-white” enough for you to decide it is not exactly white. But not having spotted such a swan could either mean that such a swan does not exist (the null hypothesis is true), or that you have not searched hard enough (the null hypothesis is false, but your test lacked power).

Secondly, the effect-coding predictors only reflect three out of a total of four treatment effects. The treatment effect of the Reactivation-Only condition follows directly as minus the sum of these three treatment effects. As such, it is redundant. But we don’t have a hypothesis test for this redundant treatment effect. If we had chosen a slightly different coding scheme, where we would have estimated the treatment effects of the Control, Tetris-Only, and Reactivation-Only condition, such that the treatment effect of the Tetris+Reactivation condition is redundant, then none of the significance tests of the slopes of this different model would have been significant. However, the omnibus test would give exactly the same results, thus indicating that at least one treatment effect does not equal 0. With the three treatment effects estimated in this different model, we couldn’t have easily spotted which treatment effect(s) these were. Whilst the estimated slopes of contrast-coded predictors and their tests are often more specific and informative than an omnibus test, because we can only use \(g-1\) of such predictors (one less than the number of groups), we can’t test for all treatment effects in a single model, nor test all hypotheses that we might be interested in. We will come back to this when we discuss “multiple testing” approaches. For now, a main thing to realise is that a significant omnibus test indicates that at least two groups differ in their means. If none of the tests of the slopes of the contrast-coded predictors in a linear model are significant, but you have obtained a significant omnibus test, that indicates that none of the predictors happened to encode that particular difference between groups.

Whilst some authors bemoan the use of omnibus tests, they have a role to play in the inference process, for instance in spotting whether you have missed a potentially important effect. Other authors put too much emphasis on omnibus tests, for instance requiring a significant omnibus test before you might consider tests of the individual slopes that comprise this omnibus test. It is perfectly possible for an omnibus test to be non-significant, whilst a slope for one (or more) contrast-coded predictors is significant. For instance, if you’d conduct an experiment with 10 conditions, and only one has an actual treatment effect, the omnibus test might be non-significant because it effectively assigns that one treatment effect to nine parameters (slopes for nine contrast-coded predictors). Let’s take an extreme example, where only \(X_1\) (the first contrast-coded predictor) reduces the SSE with \(\text{SSR}(X_1) > 0\), whilst the other predictors provide no reduction in the SSE whatsoever (i.e. \(\text{SSR}(X_j) = 0\) for \(j=2, \ldots, 9\)). Then the \(F\) statistic of the omnibus test might be \[\begin{aligned} F &= \frac{\frac{\text{SSE}(R) - \text{SSE}(G)}{\text{npar}(G) - \text{npar}(R)}}{\frac{\text{SSE}(G)}{n-\text{npar}(G)}} \\ &= \frac{\text{SSR}(X_1)/9}{\text{SSE}(G)/(n-10)} \end{aligned}\] whilst the \(F\) statistic for the slope of \(X_1\) would be \[\begin{aligned} F &= \frac{\frac{\text{SSE}(R) - \text{SSE}(G)}{\text{npar}(G) - \text{npar}(R)}}{\frac{\text{SSE}(G)}{n-\text{npar}(G)}} \\ &= \frac{\text{SSR}(X_1)/1}{\text{SSE}(G)/(n-10)} \end{aligned}\] which, with a higher value and smaller \(\text{df}_1\), would be more likely to be significant.
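
Plugging some invented numbers into these two formulas makes the difference concrete (the values of \(\text{SSR}(X_1)\), \(\text{SSE}(G)\) and \(n\) below are hypothetical):

```python
# Toy numbers for the extreme example above (hypothetical values):
ssr_x1, sse_g, n = 50.0, 400.0, 100   # SSR(X1) > 0, SSR(Xj) = 0 for j = 2..9

F_omnibus = (ssr_x1 / 9) / (sse_g / (n - 10))   # one effect spread over df1 = 9
F_slope   = (ssr_x1 / 1) / (sse_g / (n - 10))   # df1 = 1 test of X1 alone

print(F_omnibus, F_slope)   # 1.25 vs 11.25
```

With these numbers, the single-slope \(F\) far exceeds conventional critical values, whilst the omnibus \(F\) does not.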

Thirdly, when you look at the value of \(\text{SSR}(\text{Condition}) = 115\) in Table 7.2, you can see that it is smaller than the sum of the SSR terms corresponding to the three predictors. This indicates that there is redundancy between the predictors \(X_1\), \(X_2\), and \(X_3\) (i.e., some form of multicollinearity). Although Venn diagrams such as Figure 5.10 imply that the SSR of the full model would be larger than the sum of the unique SSR terms attributable to each predictor, the opposite can also be true! This situation is commonly referred to as suppression and is rather difficult to explain here without a lengthy and detailed detour. For now, we will therefore leave it to the interested reader to consult other sources on this (e.g. MacKinnon, Krull, & Lockwood, 2000), and simply note that if the sum of the SSR terms of the predictors does not equal the “whole model” SSR, this indicates redundancy between the predictors.

7.4.2 Orthogonal contrast codes

Whilst redundancy is not necessarily a problem, it is in some sense preferable if all the predictors in a linear model are independent, as we can then neatly divide the total SSR into parts that we can uniquely assign to each predictor. When the number of cases in each group (\(n_j\)) is equal for all groups, there are contrast coding schemes that ensure that the resulting contrast-coded predictors are non-redundant (i.e., independent). Such coding schemes are called orthogonal contrast codes.

There are some benefits to employing orthogonal contrast codes, although these benefits are sometimes overstated. Firstly, using orthogonal contrast codes ensures that the model predictions will exactly equal the (sample) averages in the conditions – as long as no additional predictors are included in the model; we will discuss such additional predictors in the context of Analysis of Covariance (ANCOVA). Whilst redundant coding schemes such as dummy coding or effect coding (amongst others) also ensure this, when you start defining your own contrast coding schemes, it might be difficult to check whether this is the case, and then resorting to orthogonal contrast codes may provide useful guidance. Secondly, as already mentioned, using independent predictors will ensure that the whole model SSR can be divided into SSR terms for the individual predictors. This makes it somewhat less likely that you will miss a difference between the groups in the tests of individual parameters that would be identified in the whole model test. Thirdly, using orthogonal contrast codes provides a general formula for the estimated slopes of contrast-coded predictors in terms of the averages of the groups. If you use a set of orthogonal contrast codes, then the estimated slope of each predictor \(X_j\) corresponding to contrast code \(c_j\) can be expressed as the following function of the values \(c_{j,k}\) of contrast code \(c_j\) for group \(k\) and the sample means \(\overline{Y}_k\) of the dependent variable in group \(k\): \[\begin{equation} \hat{\beta}_j = \frac{\sum_{k=1}^{g} c_{j,k} \overline{Y}_k}{\sum_{k=1}^{g} c_{j,k}^2} \tag{7.3} \end{equation}\] i.e. as the sum of the sample means \(\overline{Y}_k\) multiplied by the group-wise values of contrast code \(c_j\), divided by the sum of \(c_{j,k}^2\), the squared group-wise values of contrast code \(c_j\). Note that this formula does not hold for the effect-coding scheme discussed previously. For instance, if you’d fill in the values of \(c_1\), you’d get \[\begin{aligned} \hat{\beta}_1 &= \frac{1 \times \overline{Y}_\text{contr} + 0 \times \overline{Y}_\text{t+r} + 0 \times \overline{Y}_\text{tetr} - 1 \times \overline{Y}_\text{react}}{(1)^2 + (0)^2 + (0)^2 + (-1)^2} \\ &= \frac{\overline{Y}_\text{contr} - \overline{Y}_\text{react}}{2} \end{aligned}\] which is not the same as the treatment effect the estimated slope actually reflects: \[\hat{\beta}_1 = \overline{Y}_\text{contr} - \frac{\overline{Y}_\text{contr} + \overline{Y}_\text{t+r} + \overline{Y}_\text{tetr} + \overline{Y}_\text{react}}{4}\] Nevertheless, the effect-coding scheme did give us interpretable parameters which reflect treatment effects. Hence, using non-orthogonal contrast codes does not prohibit clearly interpretable parameters. With that said, if you can define the comparisons between groups that are of interest to you in terms of a set of orthogonal contrast codes, that would be preferable. If you can’t, there are ways to deal with that too, and you shouldn’t worry too much.
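
Equation (7.3) is straightforward to express in code. A minimal sketch in Python/NumPy (the function name `contrast_slope` is mine), applied to the effect code \(c_1\) to reproduce the mismatch just described:

```python
import numpy as np

def contrast_slope(c, group_means):
    """Equation (7.3): the slope implied by contrast code c and the group means."""
    c = np.asarray(c, dtype=float)
    m = np.asarray(group_means, dtype=float)
    return np.sum(c * m) / np.sum(c**2)

means = [5.11, 1.89, 3.89, 4.83]   # Control, T+R, Tetris-Only, R-Only

# For the (non-orthogonal) effect code c1 the formula gives (5.11 - 4.83)/2 = 0.14,
# not the treatment effect of 1.18 that the fitted slope actually equals:
print(contrast_slope([1, 0, 0, -1], means))
```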

There are two rules you can follow to create orthogonal contrast codes:

  1. For all contrast codes \(c_j\), \(j = 1, \ldots, g-1\), \(\sum_{k=1}^g c_{j,k} = 0\). The sum of the values of each contrast code \(c_j\) is zero.
  2. For all pairs of contrast codes \(c_j\) and \(c_l\), \(j \neq l\), \(\sum_{k=1}^g c_{j,k} \times c_{l,k} = 0\). The sum of the cross-products of each pair of contrast codes \(c_j\) and \(c_l\) is equal to zero.

In the statements above, \(c_{j,k}\) refers to the value of a contrast code \(c_j\) for group \(k\), and similarly \(c_{l,k}\) refers to the value of a contrast code \(c_l\) for group \(k\), whilst \(g\) refers to the total number of groups. Only the second rule is strictly necessary, but the first one is a useful one to follow as well, as –like in the effect coding scheme– it ensures that the intercept represents the grand mean. If we check the two rules for the effect coding scheme, we can see that the first requirement is adhered to:

\[\begin{aligned} \sum_{k=1}^g c_{1,k} &= 1 + 0 + 0 + (-1) = 0 \\ \sum_{k=1}^g c_{2,k} &= 0 + 1 + 0 + (-1) = 0 \\ \sum_{k=1}^g c_{3,k} &= 0 + 0 + 1 + (-1) = 0 \end{aligned}\] However, the second is not: \[\begin{aligned} \sum_{k=1}^g c_{1,k} \times c_{2,k} &= 1 \times 0 + 0 \times 1 + 0 \times 0 + (-1) \times (-1) = 1 \\ \sum_{k=1}^g c_{1,k} \times c_{3,k} &= 1 \times 0 + 0 \times 0 + 0 \times 1 + (-1) \times (-1) = 1 \\ \sum_{k=1}^g c_{2,k} \times c_{3,k} &= 0 \times 0 + 0 \times 1 + 0 \times 1 + (-1) \times (-1) = 1 \end{aligned}\]
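
Both rules are easy to check mechanically. A small sketch in Python/NumPy (the function name `is_orthogonal_set` is mine), shown here failing on the effect-coding scheme:

```python
import numpy as np

def is_orthogonal_set(codes, tol=1e-12):
    """Rule 1: each code sums to zero over groups. Rule 2: every pairwise
    cross-product of codes sums to zero. codes has shape (groups, codes)."""
    C = np.asarray(codes, dtype=float)
    rule1 = np.all(np.abs(C.sum(axis=0)) < tol)
    gram = C.T @ C                      # off-diagonal entries = pairwise cross-products
    rule2 = np.all(np.abs(gram - np.diag(np.diag(gram))) < tol)
    return bool(rule1 and rule2)

effect_codes = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [-1, -1, -1]])
print(is_orthogonal_set(effect_codes))   # False: rule 1 holds, but rule 2 does not
```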

One set of questions you could ask, which will result in orthogonal contrast codes, is the following:

  1. What is the difference between doing nothing (the Control condition) and doing something (the three remaining conditions)?
  2. What is the difference between playing Tetris (the Tetris-Only condition) and a procedure involving memory reactivation (the Tetris+Reactivation and Reactivation-Only conditions)?
  3. What is the difference between memory reactivation without playing Tetris (the Reactivation-Only condition) and playing Tetris after reactivation (the Tetris+Reactivation condition)?

To define our contrast codes, we need to assign values to the conditions in each contrast code such that the slopes of the predictors based on them provide the answers to these questions. The first question refers to the difference between the Control condition and the three other conditions. It then makes sense to compare the mean of the Control condition to the average of the three remaining conditions combined. In this comparison, we would not distinguish between those three remaining conditions, and hence we should give each of them the same value. In addition, remember that a slope reflects an increase or decrease in the dependent variable for a one-unit increase in the predictor. It then makes sense to let the difference between the value we assign to the Control condition and the values we assign to the three other conditions equal 1 (i.e. a one-unit difference). Finally, we would like to use values such that their sum over the groups equals 0 (the first condition for orthogonal contrast codes). Combining these three ideas, to answer the first question, we could define a contrast code \(c_1\) with value \(\tfrac{3}{4}\) for the Control condition, and the value \(-\tfrac{1}{4}\) for the Tetris+Reactivation, Tetris-Only, and Reactivation-Only condition. These values sum to 0, and the difference between \(\tfrac{3}{4}\) and \(-\tfrac{1}{4}\) equals 1, meaning that a one-unit increase for this contrast code is equal to going from one of the three experimental conditions to the Control condition.

For the second question, we would like to compare the Tetris-Only condition to the Tetris+Reactivation and Reactivation-Only condition combined. In this comparison, we ignore the Control condition. By giving this condition a value of 0 in the contrast code, it will not be part of the comparison. Using similar reasoning as before, we can assign a value of \(\tfrac{2}{3}\) to the Tetris-only condition, and a value \(-\tfrac{1}{3}\) to the Tetris+Reactivation and Reactivation-Only condition. Using the same value for these latter two conditions means that we don’t distinguish between them in the comparison, and compare their combined average to the mean of the Tetris-only condition. The suggested second contrast code \(c_2\) then has values \(0, -\tfrac{1}{3}, \tfrac{2}{3}, -\tfrac{1}{3}\) for the Control, Tetris+Reactivation, Tetris-Only, and Reactivation-Only conditions respectively. Again, the difference between \(\tfrac{2}{3}\) and \(-\tfrac{1}{3}\) equals 1, so that a one-unit increase in this contrast code is equal to going from one of the reactivation conditions to the Tetris-Only condition.

For the third question, finally, we would like to compare the Reactivation-Only condition to the Tetris+Reactivation condition, ignoring the other two conditions. The suggested contrast code \(c_3\) then has values \(0, -\tfrac{1}{2}, 0, \tfrac{1}{2}\) for the four conditions respectively. Here, a one-unit increase is equal to going from the Tetris+Reactivation condition to the Reactivation-Only condition.

The values of the three contrast codes \(c_1\), \(c_2\), and \(c_3\) (for questions 1 to 3 respectively) for each of the four conditions are summarized in Table 7.3 below:

Table 7.3: A set of orthogonal contrast codes.
| | \(c_1\) | \(c_2\) | \(c_3\) |
|---|---|---|---|
| Control | \(\tfrac{3}{4}\) | 0 | 0 |
| Tetris+Reactivation | \(-\tfrac{1}{4}\) | \(-\tfrac{1}{3}\) | \(-\tfrac{1}{2}\) |
| Tetris-Only | \(-\tfrac{1}{4}\) | \(\tfrac{2}{3}\) | 0 |
| Reactivation-Only | \(-\tfrac{1}{4}\) | \(-\tfrac{1}{3}\) | \(\tfrac{1}{2}\) |

Let’s check whether these three contrast codes are orthogonal. The first requirement (values sum to 0) holds: \[\begin{aligned} \sum_{k=1}^g c_{1,k} &= \tfrac{3}{4} - \tfrac{1}{4} - \tfrac{1}{4} - \tfrac{1}{4} = 0 \\ \sum_{k=1}^g c_{2,k} &= 0 - \tfrac{1}{3} + \tfrac{2}{3} - \tfrac{1}{3} = 0 \\ \sum_{k=1}^g c_{3,k} &= 0 - \tfrac{1}{2} + 0 + \tfrac{1}{2} = 0 \end{aligned}\] And indeed the second requirement (pairwise products sum to 0) also holds: \[\begin{aligned} \sum_{k=1}^g c_{1,k} \times c_{2,k} &= \tfrac{3}{4} \times 0 + (- \tfrac{1}{4}) \times (- \tfrac{1}{3}) + (-\tfrac{1}{4}) \times \tfrac{2}{3} + (- \tfrac{1}{4}) \times (- \tfrac{1}{3}) \\ &= \tfrac{1}{12} - \tfrac{2}{12} + \tfrac{1}{12} = 0 \\ \sum_{k=1}^g c_{1,k} \times c_{3,k} &= \tfrac{3}{4} \times 0 + (- \tfrac{1}{4}) \times (- \tfrac{1}{2}) + (-\tfrac{1}{4}) \times 0 + (- \tfrac{1}{4}) \times \tfrac{1}{2} \\ &= \tfrac{1}{8} - \tfrac{1}{8} = 0 \\ \sum_{k=1}^g c_{2,k} \times c_{3,k} &= 0 \times 0 + (-\tfrac{1}{3}) \times (- \tfrac{1}{2}) + \tfrac{2}{3} \times 0 + (-\tfrac{1}{3}) \times \tfrac{1}{2} \\ &= \tfrac{1}{6} - \tfrac{1}{6} = 0 \end{aligned}\]

Because the contrast codes \(c_1\), \(c_2\), and \(c_3\) are orthogonal, we can use Equation (7.3) to determine the estimates of the slopes. The estimated slope of the contrast-coded predictor \(X_1\) which takes its values from \(c_1\) will be: \[ \begin{aligned} \hat{\beta}_1 &= \frac{ \tfrac{3}{4} \times \overline{Y}_\text{contr} - \tfrac{1}{4} \times \overline{Y}_\text{t+r} - \tfrac{1}{4} \times \overline{Y}_\text{tetr} - \tfrac{1}{4} \times \overline{Y}_\text{react} }{ (\tfrac{3}{4})^2 + (-\tfrac{1}{4})^2 + (-\tfrac{1}{4})^2 + (-\tfrac{1}{4})^2 } \\ &= \frac{ \tfrac{3}{4} \times \overline{Y}_\text{contr} - \tfrac{1}{4} \times \overline{Y}_\text{t+r} - \tfrac{1}{4} \times \overline{Y}_\text{tetr} - \tfrac{1}{4} \times \overline{Y}_\text{react} }{ \tfrac{3^2}{4^2} + \tfrac{1^2}{4^2} + \tfrac{1^2}{4^2} + \tfrac{1^2}{4^2} } \end{aligned} \] The denominator evaluates to \(\tfrac{12}{16} = \tfrac{3}{4}\), and dividing both the numerator and denominator by \(\tfrac{3}{4}\) gives \[\hat{\beta}_1 = \overline{Y}_\text{contr} - \frac{\overline{Y}_\text{t+r} + \overline{Y}_\text{tetr} + \overline{Y}_\text{react}}{3}\] i.e. the slope of the first predictor is equal to the difference between the average intrusions in the Control condition, and the average of the average number of intrusions in the three other conditions. When we fill in the actual sample averages, we get: \[ \begin{aligned} \hat{\beta}_1 &= \frac{ \tfrac{3}{4} \times 5.11 - \tfrac{1}{4} \times 1.89 - \tfrac{1}{4} \times 3.89 - \tfrac{1}{4} \times 4.83 }{ \tfrac{3}{4}} \\ &= \frac{1.181}{0.75} = 1.574 \end{aligned} \]

The estimated slope of the contrast-coded predictor \(X_2\) which takes its values from \(c_2\) will be: \[ \begin{aligned} \hat{\beta}_2 &= \frac{ 0 \times \overline{Y}_\text{contr} - \tfrac{1}{3} \times \overline{Y}_\text{t+r} + \tfrac{2}{3} \times \overline{Y}_\text{tetr} - \tfrac{1}{3} \times \overline{Y}_\text{react} }{ (0)^2 + (-\tfrac{1}{3})^2 + (\tfrac{2}{3})^2 + (-\tfrac{1}{3})^2 } \\ &= \frac{ \tfrac{2}{3} \times \overline{Y}_\text{tetr} - \tfrac{1}{3} \times \overline{Y}_\text{t+r} - \tfrac{1}{3} \times \overline{Y}_\text{react} }{ \tfrac{6}{9}} \\ &= \overline{Y}_\text{tetr} - \frac{\overline{Y}_\text{t+r} + \overline{Y}_\text{react}}{2} \end{aligned} \] The slope of the second predictor is thus equal to the difference between the average number of intrusions in the Tetris-Only condition, and the average of the average number of intrusions in the Tetris+Reactivation condition and the Reactivation-Only condition. Filling in the actual sample averages then gives: \[ \begin{aligned} \hat{\beta}_2 &= \frac{ 0 \times 5.11 - \tfrac{1}{3} \times 1.89 + \tfrac{2}{3} \times 3.89 - \tfrac{1}{3} \times 4.83 }{ \tfrac{2}{3} } \\ &= \frac{0.352}{0.667} = 0.528 \end{aligned} \] Finally, the estimated slope of the contrast-coded predictor \(X_3\) which takes its values from \(c_3\) will be: \[ \begin{aligned} \hat{\beta}_3 &= \frac{ 0 \times \overline{Y}_\text{contr} - \tfrac{1}{2} \times \overline{Y}_\text{t+r} + 0 \times \overline{Y}_\text{tetr} + \tfrac{1}{2} \times \overline{Y}_\text{react} }{ (0)^2 + (-\tfrac{1}{2})^2 + (0)^2 + (\tfrac{1}{2})^2 } \\ &= \frac{\tfrac{1}{2} \times \overline{Y}_\text{react} - \tfrac{1}{2} \times \overline{Y}_\text{t+r} }{ \tfrac{2}{4} } \\ &= \overline{Y}_\text{react} - \overline{Y}_\text{t+r} \end{aligned} \] i.e. the difference between the average intrusions in the Reactivation-Only condition and the average intrusions in the Tetris+Reactivation condition. Filling in the actual sample averages gives: \[ \begin{aligned} \hat{\beta}_3 &= \frac{ 0 \times 5.11 - \tfrac{1}{2} \times 1.89 + 0 \times 3.89 + \tfrac{1}{2} \times 4.83 }{\tfrac{1}{2}} \\ &= \frac{1.472}{0.5} = 2.944 \end{aligned} \]

While it is important to understand what each slope reflects in terms of differences between the group means, you would not normally use these equations to actually estimate the slopes. It is much easier to use statistical software for that. Estimating a linear regression model with the three predictors \(X_1\), \(X_2\), and \(X_3\), which take values according to the contrast codes \(c_1\), \(c_2\), and \(c_3\) respectively, gives us the following estimated model:

\[\texttt{intrusions}_i = 3.93 + 1.57 \times X_{1,i} + 0.528 \times X_{2,i} + 2.94 \times X_{3,i} + \hat{\epsilon}_i \quad \quad \hat{\epsilon}_i \sim \mathbf{Normal}(0, 3.18)\] As you can see, the parameter estimates are identical to those worked out above using Equation (7.3) and the sample means. Figure 7.5 shows how the parameters of the model are related to the average number of intrusions in each condition.

Figure 7.5: Average number of intrusions in the four conditions in the Tetris study and how they are related to the parameters of the orthogonal contrast-coding model. The intercept \(\hat{\beta}_0 = \hat{\mu}\) is the grand mean (dotted line). The slopes reflect deviations between the average intrusions in a condition and combinations of other conditions.
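
As a sketch of what estimating this model with software amounts to, the following Python/NumPy snippet (simulated data; only the contrast codes, condition means, and residual standard deviation come from the text) fits the model by least squares and confirms that each slope matches Equation (7.3) applied to the group means of the simulated data:

```python
import numpy as np

# Orthogonal contrast codes of Table 7.3 (rows: Control, Tetris+Reactivation,
# Tetris-Only, Reactivation-Only)
codes = np.array([
    [ 3/4,    0,    0],
    [-1/4, -1/3, -1/2],
    [-1/4,  2/3,    0],
    [-1/4, -1/3,  1/2],
])

rng = np.random.default_rng(4)
condition = np.repeat([0, 1, 2, 3], 18)
y = rng.normal(np.array([5.11, 1.89, 3.89, 4.83])[condition], 3.18)

X = np.column_stack([np.ones(len(y)), codes[condition]])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# With equal group sizes and orthogonal codes, each fitted slope equals
# Equation (7.3) applied to the group means
group_means = np.array([y[condition == j].mean() for j in range(4)])
for j in range(3):
    c = codes[:, j]
    print(beta[j + 1], np.sum(c * group_means) / np.sum(c**2))
```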

Hypothesis tests for all the parameters, as well as the omnibus test for Condition, are provided in Table 7.4. As before, we find a significant overall effect for Condition, which indicates that at least one of the group means differs from another one. The tests for the first two contrasts are not significant. As such, we have no evidence that the Control condition differs from the other three conditions combined, or that the Tetris-Only condition differs from the Tetris+Reactivation or Reactivation-Only condition combined. The test of the third contrast is significant, however, indicating a difference between the Tetris+Reactivation and Reactivation-Only condition. The slope of this comparison is positive, indicating that, as expected, there are more memory intrusions in the Reactivation-Only condition compared to the Tetris+Reactivation condition.

Table 7.4: Linear model predicting number of intrusions by three contrast-coded predictors.

| | \(\hat{\beta}\) | \(\text{SS}\) | \(\text{df}\) | \(F\) | \(p(\geq \lvert F \rvert)\) |
|---|---|---|---|---|---|
| Intercept | 3.931 | 1112.35 | 1 | 110.289 | 0.000 |
| Condition | | 114.82 | 3 | 3.795 | 0.014 |
| \(\quad X_1\) | 1.574 | 33.45 | 1 | 3.316 | 0.073 |
| \(\quad X_2\) | 0.528 | 3.34 | 1 | 0.331 | 0.567 |
| \(\quad X_3\) | 2.944 | 78.03 | 1 | 7.736 | 0.007 |
| Error | | 685.83 | 68 | | |
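In practice, you would obtain estimates and tests like those in Table 7.4 from statistical software. As a minimal sketch of what this can look like in Python with statsmodels (a choice of mine), the code below simulates stand-in data, since the real data are not included here: 18 observations per condition, drawn from Normal distributions with the sample means used above and \(\sigma_\epsilon = 3.18\).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated stand-in for the Tetris data (not the real data!)
rng = np.random.default_rng(2015)
conditions = ["Control", "Tetris+Reactivation", "Tetris-Only", "Reactivation-Only"]
true_means = {"Control": 5.11, "Tetris+Reactivation": 1.89,
              "Tetris-Only": 3.89, "Reactivation-Only": 4.83}
df = pd.DataFrame({"condition": np.repeat(conditions, 18)})
df["intrusions"] = [true_means[c] + rng.normal(0, 3.18) for c in df["condition"]]

# Values of the contrast-coded predictors X1, X2, X3 for each condition
codes = {"Control":             ( 3/4,    0,    0),
         "Tetris+Reactivation": (-1/4, -1/3, -1/2),
         "Tetris-Only":         (-1/4,  2/3,    0),
         "Reactivation-Only":   (-1/4, -1/3,  1/2)}
X = pd.DataFrame([codes[c] for c in df["condition"]], columns=["X1", "X2", "X3"])

# Ordinary least-squares regression with an intercept; for these single-df
# contrasts, the reported t statistics are the square roots of F in Table 7.4
fit = sm.OLS(df["intrusions"], sm.add_constant(X)).fit()
print(fit.summary())
```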

Comparing the results in Table 7.4 to those in Table 7.2, there are a few things to note. Firstly, the result of the omnibus test is exactly the same, whether you use effect coding or orthogonal contrast coding. Because both ways of coding in the end make the same prediction \(\hat{Y}_i\) (namely that \(\hat{Y}_i\) equals the group mean of the condition that case \(i\) belongs to), the model as a whole has the same SSE for both coding schemes. And as the omnibus test involves the same MODEL R (an intercept-only model), the SSR term of the omnibus Condition effect is the same for effect coding and orthogonal contrast coding. Indeed, for any form of contrast coding which results in model predictions that equal the group means, the omnibus test gives exactly the same results! The model with orthogonal contrast coding, however, separates the whole-model SSR term neatly into different SSR terms for the contrast-coded predictors. If you add up the SSR terms for the three predictors (in the column labelled SS), you get exactly the SSR term given for Condition. This shows that when you use orthogonal contrast coding, the variance explained by the model as a whole is separated exactly into three independent parts, each reflecting the unique part of the variance explained by one predictor.

7.4.3 Defining your own (orthogonal) contrasts

Initially, coming up with orthogonal contrast codes (or for that matter, any coding scheme which allows the model predictions to equal the group means exactly) will not be easy, especially for nominal variables with more than 3 levels (i.e. more than 3 groups). With practice, you should become better at this. Ideally, the contrast codes reflect at least some of the theoretically important research questions you want to ask (e.g., relevant comparisons of the effects of experimental manipulations). In an experimental design with \(g\) groups, you need to define \(g-1\) contrast codes which correspond to such questions.

When designing your own contrast codes, you should start with the most important question you want to ask. For instance, you may want to compare the conditions with memory reactivation to those without memory reactivation, because you expect more memory intrusions after memory reactivation than without this reactivation. Your first contrast code would focus on this comparison, and you could define it as \(c_1 = (-\tfrac{1}{2}, \tfrac{1}{2}, -\tfrac{1}{2}, \tfrac{1}{2})\). Note that I’m giving the Control and Tetris-Only condition a negative value of \(-\tfrac{1}{2}\) and the Tetris+Reactivation and Reactivation-Only condition a positive value of \(\tfrac{1}{2}\). In this case, the resulting slope would be positive if the expectation of more memory intrusions after reactivation holds true. For expectations which involve not only a difference, but also a direction of that difference, I find it useful to assign values which are in line with those expectations. Negative slopes of contrast-coded predictors in the model then indicate that my expectations did not hold true.

Having defined this initial contrast, we have two more to go. Perhaps I’m also interested in assessing the effect of Tetris within the conditions that involve memory reactivation, because I expect playing Tetris to reduce the number of memory intrusions. This is a comparison between the Reactivation-Only and Tetris+Reactivation conditions. Because we gave these conditions the same value in the first contrast \(c_1\) (i.e. we did not differentiate between them in the first comparison), this new comparison will be independent of (orthogonal to) the first comparison. We can use the same contrast code for this as the one used earlier, so \(c_2 = (0,-\tfrac{1}{2}, 0, \tfrac{1}{2})\). It is easy to check that this contrast is indeed orthogonal to the first one: \[\sum_{k=1}^g c_{1,k} \times c_{2,k} = -\tfrac{1}{2} \times 0 + \tfrac{1}{2} \times (-\tfrac{1}{2}) + (-\tfrac{1}{2}) \times 0 + \tfrac{1}{2} \times \tfrac{1}{2} = 0\]

We now have one contrast left to specify. Perhaps we don’t really have any more important questions to ask, so this contrast might be arbitrary from a theoretical viewpoint. However, to have a model which is able to fit the group means exactly, we need a contrast code that differentiates between the Control condition and the Tetris-Only condition, as these conditions received the same value on \(c_1\) (both \(-\tfrac{1}{2}\)) and \(c_2\) (both 0). If our last contrast code did not differentiate between these two conditions, there would be no way for the model to make different predictions for them. The model would then involve the rather strong assumption that the means of the Control and Tetris-Only conditions are exactly identical, and I see no reason why such equivalence would necessarily hold. The final contrast code will therefore be \(c_3 = (\tfrac{1}{2}, 0, -\tfrac{1}{2}, 0)\). Note that if I had used different values for the Reactivation-Only and Tetris+Reactivation conditions, for instance a contrast code of \((\tfrac{1}{2}, -\tfrac{1}{2}, -\tfrac{1}{2}, \tfrac{1}{2})\), then this contrast code would correlate with \(c_2\) (or with \(c_1\), for other choices of values). So the final contrast code needs to give the same value (0) to the Reactivation-Only and Tetris+Reactivation conditions for the full set of contrast codes to be orthogonal. The final set of orthogonal contrast codes is then

| | \(c_1\) | \(c_2\) | \(c_3\) |
|---|---|---|---|
| Control | \(-\tfrac{1}{2}\) | 0 | \(\tfrac{1}{2}\) |
| Tetris+Reactivation | \(\tfrac{1}{2}\) | \(-\tfrac{1}{2}\) | 0 |
| Tetris-Only | \(-\tfrac{1}{2}\) | 0 | \(-\tfrac{1}{2}\) |
| Reactivation-Only | \(\tfrac{1}{2}\) | \(\tfrac{1}{2}\) | 0 |

I will leave checking the orthogonality by hand, and estimating and testing the parameters of the resulting model, as an exercise for the reader (a numerical check is sketched below). Note that this strategy of splitting the conditions into halves to compare, and then splitting those halves further, can be applied quite generally; the question strategy in Figure 7.3 is an example of it.
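If you would like to check the orthogonality numerically, a small sketch in Python (the variable names are mine):

```python
import numpy as np
from itertools import combinations

# Contrast codes as columns; rows ordered Control, Tetris+Reactivation,
# Tetris-Only, Reactivation-Only
C = np.array([
    [-1/2,  0.0,  1/2],
    [ 1/2, -1/2,  0.0],
    [-1/2,  0.0, -1/2],
    [ 1/2,  1/2,  0.0],
])

# Orthogonality: every pair of columns should have a zero dot product
for i, j in combinations(range(C.shape[1]), 2):
    print(f"c{i+1} . c{j+1} = {C[:, i] @ C[:, j]}")
```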

7.5 Default orthogonal coding schemes

While orthogonal contrasts provide benefits in terms of straightforward interpretation of the parameters and, in the case of equally sized groups, independent predictors and a neat division of the whole-model \(\text{SSR}\) into separate \(\text{SSR}\) terms for the predictors, orthogonality is not a strict requirement. For instance, effect coding and dummy coding also provide interpretable parameters, although the predictors are not independent. The most important thing is that the contrast codes can answer questions about the Data Generating Process which are of interest to you.

In the absence of any comparisons of interest, you shouldn’t really be conducting hypothesis tests in the first place. That said, there might be times where you would like to resort to a “default” way of orthogonal contrast coding. Such default coding schemes can also provide inspiration for defining your own contrast codes. Two default orthogonal coding schemes are Helmert coding and polynomial contrast coding.

Helmert coding involves a set of contrast codes in which each code compares a group mean to the average of all group means that come before it. You can find these contrasts, for situations with \(g = 2, 3, \ldots, 6\) groups, in Table 7.5. Helmert contrasts always form a set of orthogonal contrasts. And if you reorder the conditions in Table 7.3 as Tetris+Reactivation, Reactivation-Only, Tetris-Only, Control, you should recognize that this was actually an example of Helmert coding for \(g=4\).

Table 7.5: Helmert contrast codes for situations with two to six groups.
| \(g\) | group | \(c_1\) | \(c_2\) | \(c_3\) | \(c_4\) | \(c_5\) |
|---|---|---|---|---|---|---|
| 2 | 1 | -0.5 | | | | |
| 2 | 2 | 0.5 | | | | |
| 3 | 1 | -0.5 | -0.333 | | | |
| 3 | 2 | 0.5 | -0.333 | | | |
| 3 | 3 | 0.0 | 0.667 | | | |
| 4 | 1 | -0.5 | -0.333 | -0.25 | | |
| 4 | 2 | 0.5 | -0.333 | -0.25 | | |
| 4 | 3 | 0.0 | 0.667 | -0.25 | | |
| 4 | 4 | 0.0 | 0.000 | 0.75 | | |
| 5 | 1 | -0.5 | -0.333 | -0.25 | -0.2 | |
| 5 | 2 | 0.5 | -0.333 | -0.25 | -0.2 | |
| 5 | 3 | 0.0 | 0.667 | -0.25 | -0.2 | |
| 5 | 4 | 0.0 | 0.000 | 0.75 | -0.2 | |
| 5 | 5 | 0.0 | 0.000 | 0.00 | 0.8 | |
| 6 | 1 | -0.5 | -0.333 | -0.25 | -0.2 | -0.167 |
| 6 | 2 | 0.5 | -0.333 | -0.25 | -0.2 | -0.167 |
| 6 | 3 | 0.0 | 0.667 | -0.25 | -0.2 | -0.167 |
| 6 | 4 | 0.0 | 0.000 | 0.75 | -0.2 | -0.167 |
| 6 | 5 | 0.0 | 0.000 | 0.00 | 0.8 | -0.167 |
| 6 | 6 | 0.0 | 0.000 | 0.00 | 0.0 | 0.833 |
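The pattern in Table 7.5 is regular enough to generate programmatically: contrast \(j\) assigns \(-\tfrac{1}{j+1}\) to the first \(j\) groups and \(\tfrac{j}{j+1}\) to group \(j+1\). A minimal sketch in Python (the function name is my own):

```python
import numpy as np

def helmert_codes(g):
    """Helmert contrast codes for g groups, matching Table 7.5. Via Equation
    (7.3), the slope of contrast j equals the mean of group j+1 minus the
    average of the means of the j preceding groups."""
    C = np.zeros((g, g - 1))
    for j in range(1, g):
        C[:j, j - 1] = -1 / (j + 1)  # the j preceding groups
        C[j, j - 1] = j / (j + 1)    # group j+1
    return C

print(np.round(helmert_codes(4), 3))  # reproduces the g = 4 block of Table 7.5
```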

Polynomial contrast codes tend to be used when the groups can be ordered in a meaningful way. For instance, a study might involve different age groups, e.g. “18-30 years old”, “31-40 years old”, and “41-50 years old”. Using polynomial contrast coding, you can then assess a linear trend over these age groups with a linear contrast \(c_1 = (-1,0,1)\), as well as whether the average of the middle age group is lower (or higher) than would be expected from that linear trend, with a quadratic contrast \(c_2 = (1,-2,1)\). Examples of polynomial contrast codes for situations with between 2 and 6 groups are provided in Table 7.6.

Table 7.6: Polynomial contrast codes for situations with two to six groups.
| \(g\) | group | \(c_1\) | \(c_2\) | \(c_3\) | \(c_4\) | \(c_5\) |
|---|---|---|---|---|---|---|
| 2 | 1 | -1 | | | | |
| 2 | 2 | 1 | | | | |
| 3 | 1 | -1 | 1 | | | |
| 3 | 2 | 0 | -2 | | | |
| 3 | 3 | 1 | 1 | | | |
| 4 | 1 | -3 | 1 | -1 | | |
| 4 | 2 | -1 | -1 | 3 | | |
| 4 | 3 | 1 | -1 | -3 | | |
| 4 | 4 | 3 | 1 | 1 | | |
| 5 | 1 | -2 | 2 | -1 | 1 | |
| 5 | 2 | -1 | -1 | 2 | -4 | |
| 5 | 3 | 0 | -2 | 0 | 6 | |
| 5 | 4 | 1 | -1 | -2 | -4 | |
| 5 | 5 | 2 | 2 | 1 | 1 | |
| 6 | 1 | -5 | 5 | -5 | 1 | -1 |
| 6 | 2 | -3 | -1 | 7 | -3 | 5 |
| 6 | 3 | -1 | -4 | 4 | 2 | -10 |
| 6 | 4 | 1 | -4 | -4 | 2 | 10 |
| 6 | 5 | 3 | -1 | -7 | -3 | -5 |
| 6 | 6 | 5 | 5 | 5 | 1 | 1 |
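Polynomial contrast codes can be constructed by orthogonalizing successive powers of the (equally spaced) group indices. The sketch below does this in Python via a QR decomposition; the rescaling is cosmetic and reproduces the integer values of Table 7.6, possibly up to a sign flip of a whole column:

```python
import numpy as np

def poly_codes(g):
    """Orthogonal polynomial contrasts for g ordered, equally spaced groups."""
    V = np.vander(np.arange(g), g, increasing=True)  # columns: 1, x, x^2, ...
    Q, _ = np.linalg.qr(V)   # orthogonalize the polynomial terms
    C = Q[:, 1:]             # drop the constant (intercept) column
    # Rescale each column so its smallest nonzero entry is 1 in absolute value
    scale = np.array([np.min(np.abs(c[np.abs(c) > 1e-8])) for c in C.T])
    return np.round(C / scale, 3)

print(poly_codes(3))  # columns proportional to (-1, 0, 1) and (1, -2, 1)
```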

7.6 Effect-size in ANOVA

As contrast-coded predictors are essentially just predictors in a linear model, we can use similar measures of effect-size as in multiple regression (see Section 5.6). So we can measure the size of an effect as the “proportion of variance explained”, or “eta-squared” \[\hat{\eta}^2 = \frac{\text{SS}(\text{effect})}{\text{SS}(\text{total})}\] Here, \(\text{SS}(\text{effect})\) is the Sum of Squares Reduced (SSR) for that effect. This is the reduction in the error when estimating the effect rather than fixing it to an a priori value. \(\text{SS}(\text{total})\) is the Sum of Squared Error of a model with only an intercept, which is a function of the sample variance of the dependent variable: \(\text{SS}(\text{total}) = n S^2_Y\). In the context of multiple regression, this is also called the coefficient of semi-partial determination.

We may also consider the “partial eta-squared”, or coefficient of partial determination: \[\hat{\eta}_p^2 = \frac{\text{SS}(\text{effect})}{\text{SS}(\text{error}) + \text{SS}(\text{effect}) }\]

For both these measures, the effect can be either a single contrast-coded predictor, or a set of these. So we may also consider the effect of Condition. In Table 7.4, we can see that the Sum of Squares Reduced for Condition is \(\text{SS}(\text{Condition}) = 114.82\). The value of \(\text{SS}(\text{total})\) is 800.65, which is equal to the sum of \(\text{SS}(\text{Condition})\) and \(\text{SS}(\text{error})\). We can then compute the eta-squared as \[\hat{\eta}^2(\text{Condition}) = \frac{114.82}{800.65} = 0.143\] The value of \(\text{SS}(\text{error})\) is 685.83, and hence the partial eta-squared is \[\hat{\eta}_p^2(\text{Condition}) = \frac{114.82}{685.83 + 114.82} = 0.143\] In a one-way ANOVA model, the partial and non-partial eta-squared measures are the same for the overall omnibus effect of Condition. That is because in that case, \(\text{SS}(\text{error}) + \text{SS}(\text{Condition}) = \text{SS}(\text{total})\). For the individual contrasts, there is a difference between the partial and non-partial eta-squared measures. For instance, computing the effect-sizes of \(X_1\), which reflects the difference between the Control group and the other conditions, gives: \[\hat{\eta}^2(X_1) = \frac{33.45}{800.65} = 0.042\] and \[\hat{\eta}_p^2(X_1) = \frac{33.45}{685.83 + 33.45} = 0.047\]
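These computations require nothing more than the Sums of Squares in Table 7.4, so they are easy to carry out directly; for instance, in Python:

```python
# Sums of Squares from Table 7.4
ss_condition, ss_error = 114.82, 685.83
ss_total = ss_condition + ss_error  # 800.65, the SSE of an intercept-only model

print(ss_condition / ss_total)                   # eta^2(Condition) = 0.143
print(ss_condition / (ss_error + ss_condition))  # partial eta^2: identical here

ss_x1 = 33.45
print(ss_x1 / ss_total)            # eta^2(X1), approximately 0.042
print(ss_x1 / (ss_error + ss_x1))  # partial eta^2(X1), approximately 0.047
```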

7.7 Assumptions

The assumptions of the ANOVA model are the same as those for a (multiple) regression model (see Section 5.7). Thus, we assume the errors \(\epsilon\) are independent and Normal-distributed, with a mean of 0 and a homoscedastic standard deviation \(\sigma_\epsilon\). As an ANOVA model provides the same prediction for each case in a group/condition, the errors reflect variation in the dependent variable within each group or condition. Another way to state the assumptions of the General Linear Model in the case of an ANOVA model is then that the variance of the dependent variable within each group or condition is the same (see Figure 7.6).


Figure 7.6: An ANOVA model assumes that the variance within each condition or group is identical. The means of the conditions or groups can differ.

We can use similar techniques to assess the assumptions as for multiple regression, such as QQ-plots and histograms of the errors. A predicted-vs-residual plot is less informative here, as there are only as many unique predictions as the number of conditions or groups. Because we have multiple observations within each group, we can test the homoscedasticity assumption. This assumption is also called the homogeneity of variance assumption. A widely-used test is known as Levene’s test (Levene, 1960). This is a test of the null-hypothesis \(H_0: \sigma^2_1 = \sigma^2_2 = \ldots = \sigma^2_g\), where \(\sigma^2_j\) is the variance of \(Y\) in a group \(j\). The test effectively uses absolute deviations between observed values of \(Y\) and the group average \(\overline{Y}_g\) as the dependent variable in an ANOVA model. If one group has a larger variance, then the absolute deviations will also be larger, on average, than those in another group. A significant Levene test indicates that the null-hypothesis of equal variances is likely false.
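Levene’s test is implemented in standard statistical software. As a sketch in Python with SciPy, using simulated stand-in data (the real data are not included here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
# Four simulated groups of 18 observations with a common error sd
groups = [rng.normal(m, 3.18, size=18) for m in (5.11, 1.89, 3.89, 4.83)]

# center="mean" gives Levene's original test (absolute deviations from the
# group means); the default center="median" is the Brown-Forsythe variant
stat, p = stats.levene(*groups, center="mean")
print(f"W = {stat:.3f}, p = {p:.3f}")
```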

As usual, the assumptions of the model allow us to derive the sampling distribution of the test statistic under the null-hypothesis. If the assumptions hold, then we are assured the distribution of the \(F\)-statistic follows an \(F\)-distribution with the given degrees of freedom. Even if the errors are not strictly Normal-distributed, or there are some small differences in the variance between the groups, the \(F\)-statistic may still approximately follow the assumed \(F\)-distribution. The general consensus is that ANOVAs are reasonably robust against deviations from Normal-distributed errors, as long as the error distributions are approximately symmetric, or have the same shape in each group (Howell, 2012). When sample sizes are equal for each group, ANOVAs are also robust to violations of homoscedasticity (Howell, 2012; see Lix, Keselman, & Keselman, 1996; Maxwell, Delaney, & Kelley, 2017). So even if the Levene test is significant, you may still be able to use an ANOVA model.

Regarding potential heteroscedasticity, Maxwell et al. (2017) offer the following practical advice: in a one-way ANOVA model with equal sample sizes, you may stick to the standard \(F\) test if the ratio of the largest sample variance \(S^2_\text{max}\) to the smallest sample variance \(S^2_\text{min}\) is smaller than 4, i.e. \(\frac{S^2_\text{max}}{S^2_\text{min}} < 4\). When sample sizes are unequal, things become a little more complicated. The suggestion is to also take the ratio of the largest group size \(n_\text{max}\) to the smallest group size \(n_\text{min}\) into account, and stick to the standard \(F\) test whenever \[\frac{n_\text{max}}{n_\text{min}} \times \frac{S^2_\text{max}}{S^2_\text{min}} < 4\] When the ratio is larger than 4, you could decide to use an alternative test, such as the Brown-Forsythe or the Welch test (see e.g. Tomarken & Serlin, 1986). When the variance of a dependent variable increases with the mean, it may also be possible to transform the dependent variable using e.g. a logarithmic or square-root transformation.
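This rule of thumb is straightforward to compute from the group sizes and sample variances. A minimal sketch (the function name is my own):

```python
import numpy as np

def f_test_heuristic(groups):
    """Maxwell et al. (2017) heuristic: the standard F test is considered
    safe when (n_max / n_min) * (S2_max / S2_min) < 4."""
    ns = np.array([len(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])
    ratio = (ns.max() / ns.min()) * (variances.max() / variances.min())
    return ratio, ratio < 4

rng = np.random.default_rng(123)
groups = [rng.normal(m, 3.18, size=18) for m in (5.11, 1.89, 3.89, 4.83)]
print(f_test_heuristic(groups))  # (ratio, True) if the standard F test is fine
```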

G. E. Box (1954) showed that in the case of heteroscedasticity (unequal variances), when there are equal sample sizes, the correct critical value for the \(F\)-test is somewhere between the critical value of the \(F\)-statistic with the usual degrees of freedom (i.e. \(\text{df}_1 = \text{npar}(G) - \text{npar}(R)\) and \(\text{df}_2 = n - \text{npar}(G)\)), and a critical value of the \(F\)-statistic with smaller degrees of freedom: \(\text{df}_1 = 1\) and \(\text{df}_2 = \frac{n - \text{npar}(G)}{g}\). In the case of unequal variances, you may then also perform a conservative test, using the critical value for these smaller degrees of freedom. If this provides a significant result, that implies there are differences in group means regardless of whether the variances are equal or not.

7.8 Multiple testing and post-hoc tests

Using contrast codes, you can test a total of \(g-1\) comparisons within a single analysis. At times, this may suffice to test all hypotheses of interest. At other times, you would like to test more hypotheses.

For example, you might want to test for differences in the means of all pairs of conditions. In a design with 4 conditions, there are a total of \(\tfrac{4 \times 3}{2} = 6\) pairwise comparisons between the conditions. Obtaining all these pairwise tests is actually pretty simple: you can just perform multiple analyses with different contrast codes. If you are interested in pairwise differences, then dummy coding is the most obvious choice. In any model with dummy coding, you obtain all pairwise tests comparing each condition to the reference condition. If you perform three analyses, changing the reference condition each time, you would obtain all pairwise tests. Case closed?

Well, maybe not entirely. Although this is a straightforward way to obtain the tests, you will end up performing quite a lot of them, and the number of tests increases rapidly with the number of conditions. In general, with \(g\) conditions there are \(\tfrac{g(g-1)}{2}\) pairwise comparisons: with 5 conditions, you would perform \(\tfrac{5 \times 4}{2} = 10\) tests, and with 6 conditions \(\tfrac{6 \times 5}{2} = 15\) tests. For each test, we allow a Type 1 error rate equal to the significance level \(\alpha\), usually \(\alpha = .05\). Say that we perform 120 tests (the number of pairwise comparisons between 16 conditions). When there are no differences between the conditions, so all the null-hypotheses are true, we would still expect \(.05 \times 120 = 6\) significant test results. In a set of multiple tests, the probability of making at least one Type 1 error is higher than the significance level of each individual test.

The reason for focusing on the number of Type 1 errors (also called false positives) in a set of tests, rather than for each individual test, is related to the distinction between a priori comparisons and post-hoc comparisons. A priori comparisons are comparisons between conditions which reflect hypotheses that researchers planned to test before data collection or statistical analysis. Generally, there are only a relatively small number of such a priori comparisons. Post-hoc comparisons are comparisons which are planned after the data have been collected and an initial statistical analysis has been conducted. For instance, the data might point to an unexpected difference between the Control and Tetris-Only conditions, and the researcher might then conduct a further analysis to test this specific difference. Or a significant omnibus test combined with a non-significant result for an expected difference might inspire you to conduct all pairwise comparisons to see what the significant omnibus test is due to. A danger with such post-hoc tests is that the result that inspired them might itself be a Type 1 error (i.e. a “fluke”). Post-hoc, data-driven hypotheses thus risk confirming noise in the data: the means of two conditions might differ purely due to chance, and testing whether such an unexpected difference is significant may then just confirm chance patterns in the data, rather than reflecting a true aspect of the Data Generating Process. Although there is such a risk in any statistical hypothesis test, whether planned a priori or post-hoc, the fact that post-hoc comparisons are generally tests of unexpected differences gives reason to be more conservative in these latter tests. As a priori comparisons are, by definition, not inspired by the data, there is less of a chance of following a trail of spurious results.

The overall Type 1 error rate in a set of tests is conventionally called the family-wise Type 1 error rate, or \(\alpha_\text{FW}\). This is equal to the probability of making at least one Type 1 error within a set of hypothesis tests. This probability is equal to one minus the probability of making no Type 1 error at all. If all tests are independent, then the probability of making no Type 1 error in a total of \(q\) tests is \((1-\alpha)^q\). Hence, the family-wise error rate is \[\alpha_\text{FW} = 1 - (1-\alpha)^q\] If we use the conventional \(\alpha = .05\) for each test, and \(q = 120\), then the family-wise error rate would be \[\alpha_\text{FW} = 1 - .95^{120} = 0.998\] which shows we are virtually guaranteed to make at least one Type 1 error.

This inflation of Type 1 error can be dealt with, though. One (perhaps too) simple solution is to apply the so-called Bonferroni correction to the significance level of each test, such that the family-wise Type 1 error rate is kept at a desired level. The Bonferroni correction is simply to choose a desired value for \(\alpha_\text{FW}\), e.g. \(\alpha_\text{FW} = .05\), and then to set the significance level of the individual tests to \(\alpha = \frac{\alpha_\text{FW}}{q}\). Doing so, the family-wise error rate never exceeds the desired level. For example: \[\alpha_\text{FW} = 1 - \left(1 - \frac{.05}{120}\right)^{120} = 0.049\] So, following the Bonferroni correction, we would use a significance level of \(.05/120 = 0.000417\) for each test of a pairwise difference. While this correction ensures that the family-wise error rate is kept within bounds, it is also rather conservative. The difference between each pair of conditions would have to be rather large to pass such a high bar. Luckily, clever statisticians have devised alternatives which achieve the same objective (limiting the family-wise error rate) whilst being less conservative. There are a lot of these, and I will only mention a few here.

The Holm correction requires you to perform all tests first. You then order the resulting \(p\)-values \(p_j\), \(j=1, \ldots, q\), from smallest to largest. The smallest \(p\)-value is assigned \(\text{rank}(p_j) = 1\), and the largest \(\text{rank}(p_j) = q\). The Holm procedure is to compare each \(p\)-value to a significance level of \[\alpha(p_j) = \frac{\alpha_\text{FW}}{q - \text{rank}(p_j) + 1}\] For example, if you performed \(q=80\) tests and want to maintain a family-wise significance level of \(\alpha_\text{FW} = .05\), you would compare the 40th smallest \(p\)-value to a significance level of \(.05/(80 - 40 + 1) = 0.00122\). Using the Bonferroni correction, the comparison level would be \(.05/80 = 0.000625\). As soon as you obtain a non-significant result (i.e. \(p_j > \alpha(p_j)\)), you stop testing; all subsequent tests are declared non-significant. Like the Bonferroni correction, this procedure ensures that the family-wise Type 1 error rate is kept within a desired bound \(\alpha_\text{FW}\), but it is more powerful than the Bonferroni correction (i.e. the rate of Type 2 errors is lower). The Holm correction may be less intuitive and more involved to compute than the Bonferroni correction, but because it reaches the same goal with more power, there is little reason to use the Bonferroni correction instead of the Holm correction.
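Both corrections are implemented in standard software, so you do not need to carry out the ordering by hand. For instance, with statsmodels in Python (the \(p\)-values below are made up for illustration):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

pvals = np.array([0.001, 0.012, 0.021, 0.034, 0.210, 0.550])  # hypothetical

for method in ("bonferroni", "holm"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    # p_adj are adjusted p-values, directly comparable to alpha_FW = .05
    print(method, reject, np.round(p_adj, 3))
```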

Another option, relevant within the context of ANOVA, is to use the Scheffé adjusted critical \(F\)-value: \[F_\text{Scheffé} = (g-1)F_{g-1,n-\text{npar}(G);\alpha_\text{FW}}\] where \(F_{g-1,n-\text{npar}(G);\alpha_\text{FW}}\) is the critical value for an \(F\) distribution with \(\text{df}_1 = g-1\) and \(\text{df}_2 = n - \text{npar}(G)\) degrees of freedom (note that \(g\) refers to the number of groups/conditions in the design, not the number of tests you want to perform), and a desired family-wise error rate \(\alpha_\text{FW}\). Just as for the Holm procedure, the Scheffé testing procedure ensures that the family-wise error rate never exceeds the desired level. It can also be more powerful than the Bonferroni correction, depending on the number of hypotheses you want to test. The Scheffé criterion is independent of the number of contrasts you want to test, and is effectively a correction for any number of possible contrasts, whether these be pairwise comparisons between conditions or comparisons involving combinations of conditions. The Holm procedure can also be applied to these more general comparisons. If you have a limited number of hypotheses to test, the Holm procedure will generally be more powerful than the Scheffé procedure.

Finally, if you are just interested in all pairwise comparisons between conditions (whether post-hoc or not), I want to point out the Tukey Honestly Significant Difference (HSD) test. It was specifically designed for this purpose, and will often be more powerful than the previously discussed correction methods.
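For instance, statsmodels implements this test as `pairwise_tukeyhsd`; a sketch on simulated stand-in data (not the real data):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(123)
conditions = np.repeat(["Control", "Tetris+Reactivation",
                        "Tetris-Only", "Reactivation-Only"], 18)
means = {"Control": 5.11, "Tetris+Reactivation": 1.89,
         "Tetris-Only": 3.89, "Reactivation-Only": 4.83}
intrusions = np.array([means[c] + rng.normal(0, 3.18) for c in conditions])

# All pairwise comparisons, keeping the family-wise error rate at .05
print(pairwise_tukeyhsd(intrusions, conditions, alpha=0.05))
```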

We now have to discuss a rather tricky issue. Suppose you are really just interested in testing a small number of interesting comparisons between conditions. Good! That means that you have thought about your study and the information you want to get out of it. The question is now whether you should apply any correction at all. You may have already wondered why, in a linear model with \(m\) predictors and tests of all \(m+1\) parameters (the intercept and the slopes of the \(m\) predictors), we haven’t worried about an inflation of the Type 1 error rate before. Clearly, if we’d included 20 predictors in our model, we’d on average expect one of these tests to be significant, even if none of the predictors have any real effect. That is a very good question, and I don’t have a definite answer to it. The answer really depends on what you think is an allowable error rate, and that is not tied to a single analysis. In your scientific career, you might perform a total of 100,000 hypothesis tests. How many of those would you like to be possible Type 1 errors? Should you apply something like a Bonferroni or Holm correction to all these tests? This is of course a rather silly question to ask, but it highlights that recommendations for multiple testing should be considered carefully. They are recommendations, not strict rules to be followed blindly. In my humble opinion, as long as you have a limited number of a priori comparisons, you don’t necessarily have to apply any correction. And if you want to be careful, apply a Holm correction, as this is more powerful than a Bonferroni correction. When your tests are exploratory (post-hoc and inspired by previous test results), you should be careful, and something like the Scheffé procedure makes sense, because it corrects for all comparisons you might conduct, whether the result of “data-snooping” or otherwise. There is of course still no guarantee that a given test result is not a Type 1 error, but at least you have limited the total number of likely Type 1 errors within the set of possible comparisons.

7.9 In practice

  1. Explore the data. Check the distribution of the dependent variable in each condition/group. Are there outlying or otherwise “strange” observations? If so, you may consider removing these from the dataset. Do the distributions look roughly Normal or are the distributions at least similar over the groups? Calculate the sample variances for each group. Is the largest variance more than 4 times the smallest sample variance? If so, then you may consider an alternative analysis to ANOVA. If you have doubts about the homogeneity of variance, you can perform a Levene test. If this test is significant, you may still perform the ANOVA analysis as usual if the largest variance is less than 4 times larger than the smallest sample variance and the sample sizes in the groups are equal.

  2. Define a useful set of contrast codes for your study. Aim for these codes to represent the most important comparisons between the groups. Estimate the ANOVA model. Then check again for potential issues in the assumptions with e.g. histograms for the residuals and QQ-plots. If there are clear outliers in the data, remove these, and then re-estimate the model.

  3. Consider both the results of the omnibus test (are there any differences between the groups?) and the tests of the individual contrasts (are there specific differences between the groups of interest?). If there are more comparisons of interest than those encoded in the contrast-coded predictors, perform additional follow-up tests. If there are many of these tests, consider correcting for this by using e.g. a Scheffé-adjusted critical value.

  4. Report the results. When reporting the results, make sure that you include all relevant statistics. For example, one way to write the results of the analysis with the contrast codes of Table 7.3, is as follows:

We assessed the effect of memory reactivation and playing Tetris on subsequent memory intrusions with a one-way ANOVA. This showed a significant overall effect of Condition, \(F(3, 68) = 3.79\), \(p = .014\), \(\hat{\eta}^2_p = .143\). A priori contrasts showed that the memory intrusions following a combination of memory reactivation and playing Tetris were significantly lower than those following memory reactivation alone, \(b = 2.94\), 95% CI \([0.83, 5.06]\), \(t(68) = 2.78\), \(p = .007\). However, we found no significant difference between the Control condition and the remaining three conditions, \(b = 1.57\), 95% CI \([-0.15, 3.30]\), \(t(68) = 1.82\), \(p = .073\). Finally, memory reactivation with or without playing Tetris did not significantly increase or reduce the number of intrusive memories compared to just playing Tetris without memory reactivation, \(b = 0.53\), 95% CI \([-1.30, 2.36]\), \(t(68) = 0.58\), \(p = .567\). Post-hoc pairwise comparisons with Tukey’s HSD test showed that the combined Tetris+Reactivation condition differed from the Control condition and the Reactivation-Only condition. There were no other significant pairwise differences.

References

Box, G. E. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, I. Effect of inequality of variance in the one-way classification. The Annals of Mathematical Statistics, 25, 290–302.
Howell, D. C. (2012). Statistical methods for psychology. Cengage Learning.
James, E. L., Bonsall, M. B., Hoppitt, L., Tunbridge, E. M., Geddes, J. R., Milton, A. L., & Holmes, E. A. (2015). Computer game play reduces intrusive memories of experimental trauma via reconsolidation-update mechanisms. Psychological Science. https://doi.org/10.1177/0956797615583071
Levene, H. (1960). Robust tests for equality of variances. In I. Olkin (Ed.), Contributions to probability and statistics. Essays in honor of Harold Hotelling (pp. 279–292). Palo Alto, CA: Stanford University Press.
Lix, L. M., Keselman, J. C., & Keselman, H. J. (1996). Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance F test. Review of Educational Research, 66, 579–619.
MacKinnon, D. P., Krull, J. L., & Lockwood, C. M. (2000). Equivalence of the mediation, confounding and suppression effect. Prevention Science, 1, 173–181.
Maxwell, S. E., Delaney, H. D., & Kelley, K. (2017). Designing experiments and analyzing data: A model comparison perspective. Routledge.
Popper, K. (1959). The logic of scientific discovery. Routledge.
Tomarken, A. J., & Serlin, R. C. (1986). Comparison of ANOVA alternatives under variance heterogeneity and specific noncentrality structures. Psychological Bulletin, 99, 90–99.

  1. Remember that the significance level sets an allowable error rate if the null hypothesis is true, whilst the error rate when the null hypothesis is not true is unknown, but unlikely to be 0.↩︎

  2. You are allowed to multiply or divide the numerator and denominator by the same constant, as this does not change the value of the fraction, i.e. \(\tfrac{a}{b} = \tfrac{a/c}{b/c}\).↩︎