# Chapter 11 Linear mixed-effects models

In this chapter, we will discuss an alternative approach to dealing with the non-independence of errors that can result from repeated-measures of the same individual, or otherwise multiple observations which come from different groupings in the data. Traditionally, such data has been analysed with repeated-measures ANOVA models. Linear mixed-effects models offer an alternative to repeated-measures ANOVA with certain benefits, particularly in dealing with missing values. Once you get your head around the idea of random effects, linear mixed-effects models are a natural extension of the General Linear Model, which can then make them easier to understand than repeated-measures ANOVA models, which rely on complicated model comparison schemes and strong assumptions. Linear mixed-effects models offer a more flexible univariate modelling technique. This flexibility does mean, as we will see, that sometimes difficult choices need to be made regarding the random-effects structure of the model. As always, there are no set rules in defining statistical models, and this should be led by substantive concerns, in addition to issues of reliability and maintaining appropriate error rates. With practice and experience, you should become more confident in making such decisions.

In writing this chapter, I have adapted some sections of Singmann & Kellen (2019). This is a very clear introduction to linear mixed-effects models, that you may wish to consult in addition to this Chapter (you can download it here).

## 11.1 Non-independence in linear models

The General Linear Model we have considered thus far can be stated as follows:

\[Y_i = \beta_0 + \beta_1 \times X_{1,i} + \beta_2 \times X_{2,i} + \ldots + \beta_m \times X_{m,i} + \epsilon_i \quad \quad \quad \epsilon_i \sim \mathbf{Normal}(0,\sigma_\epsilon)\]
The assumptions of this model concern the residuals or errors \(\epsilon_i\). They are assumed to be *independent and identically distributed* (iid), following a Normal distribution with a mean of 0 and a standard deviation \(\sigma_\epsilon\).

The assumption that the errors are independent means that knowing the value of the error for one case \(i\) in the data does not give you any information to determine the value of the error for another case \(j\). More formally, statistical independence is defined in terms of conditional probabilities. Remember when we discussed the rules of probability (Section 2.2.1.1)? In particular, we used independence to state rule 7, which is a specific case of the multiplication rule (rule 6). We can define the distribution of the error for case \(i\) as a conditional probability \(p(\epsilon_i | \epsilon_i, X_{1,i}, \ldots, X_{m,i})\), where we conditionalise on all the predictor values, as well as another error term \(\epsilon_j\). The errors are independent when \[p(\epsilon_i | \epsilon_j, X_{1,i}, \ldots, X_{m,i}) = p(\epsilon_i | X_{1,i}, \ldots, X_{m,i})\] i.e. the distribution of \(\epsilon_i\), conditional upon knowing the value of the predictors and the error \(\epsilon_j\) of another case \(j\), is the same as the distribution conditional upon just knowing the value of the predictors. This means that error \(\epsilon_j\) provides no information about the error \(\epsilon_i\).

When might such independence *not* hold? Let’s for the moment go back to the simplest version of the GLM, the model with just an intercept that was the focus of Chapter 3:
\[\begin{equation}
Y_i = \beta_0 + \epsilon_i
\tag{11.1}
\end{equation}\]
Suppose \(Y\) are again judgements of the height of Mount Everest, but that one of the labs in the ManyLabs study that collected the data, was based in Nepal. We might expect participants in Nepal to have a more accurate idea of the height of Mount Everest than participants in Poland (the country the participants were from in the subset of the data we considered in Chapter 3). As a result, participants in Nepal would likely be less influenced by the low anchor than those with less knowledge of Mount Everest. If \(\beta_0\) represents the mean judgement over all participants, then participants in Nepal would generally provide judgements which are higher than \(\beta_0\) (i.e positive errors), whilst participants in Poland would generally provide judgements that are lower than \(\beta_0\) (i.e. negative errors). If our model does not account for such grouping in the data, then the errors for cases from Nepal would be correlated, because all would tend to be positive. Similarly, the errors for cases from Poland would also be correlated, as all would tend to be negative. What this example points to is that the simple model does not account for the fact that participants from different countries might have a different average judgement. As such, the model is misspecified. We could of course alleviate this problem by including an effect for country, using a contrast-coded predictor \(X_1\) to reflect the difference between participants from Nepal (\(X_1 = \tfrac{1}{2}\)) and Poland (\(X_1 = -\tfrac{1}{2}\)) in an expanded model
\[Y_i = \beta_0 + \beta_1 \times X_{1,i} + \epsilon_i\]
This model would allow for a different mean judgement for cases from Nepal (\(\beta_0 + \tfrac{1}{2} \times \beta_1\)) and cases from Poland (\(\beta_0 - \tfrac{1}{2} \times \beta_1\)). Doing so for all labs in the data quickly becomes cumbersome, however. In this study, there were 31 labs involved, meaning we would need 30 contrast-coded predictors to reflect possible differences between them. Moreover, differences between the labs are not of primary interest here. The question we posed in Chapter 3 was whether people in general would, on average, give an accurate judgement of the hight of Mount Everest, not whether the country they reside in influences this. Thus, country, if it has an effect, can be seen as a *nuisance effect*, something we have to consider for a valid model, but we would otherwise rather ignore. **Linear mixed-effects models** allow you to solve this conundrum gracefully, providing a way for effects to differ between groupings in the data, without dramatically increasing the number of to-be-estimated parameters. They do so by treating such effects as random, rather than fixed, effects. The model then focuses on estimating the distribution of these random effects, rather than estimating each separately.

You can expect violations of the *iid* assumption if data are collected from units of observations that are clustered in groups. A common example of this in psychology is when you repeatedly observe behaviour from the same person. Other examples of this are data from experiments collected in group settings, students within classrooms, or patients within hospitals. In such situations one would expect that observations within each cluster (i.e., a specific group, classroom, or hospital) are more similar to each other than observations across clusters. Unfortunately, compared to violations of other assumptions, such as the normality assumption or the assumption of variance homogeneity, standard statistical procedures are usually not robust to violations of the independence assumption (Charles M. Judd, Westfall, & Kenny, 2012; Kenny & Judd, 1986). They often lead to considerably increased Type I errors (i.e., false positives) and more generally can produce overconfident results (e.g., too narrow standard errors).

## 11.2 Random intercept models

We will now define a model which allows for each grouping (e.g., lab) in the data to have a different mean. To help define the model, let \(Y_{i,j}\) denote the judgement of the height of Mount Everest for participant \(i = 1,\ldots,n_j\) in grouping (lab) \(j = 1, \ldots, n_g\). The double subscript is solely to distinguish between participants and groupings. We still consider a single dependent variable \(Y\). We will allow each group \(j\) to have a different mean judgement. The first step to do so is to rewrite the simple model of Equation (11.1) as \[\begin{equation} Y_{i,j} = \beta_{0,j} + \epsilon_{i,j} \quad \quad \quad \epsilon_{i,j} \sim \mathbf{Normal}(0,\sigma_{\epsilon}) \tag{11.2} \end{equation}\] For each observation \(Y_{i,j}\), we thus assume the intercept \(\beta_{0,j}\) which is different for each group \(j\). In other respects, the model is exactly the same as the one of Equation (11.1), and observations can deviate from the intercept (which represents the mean in group \(j\)) through the error term \(\epsilon_{i,j}\), which is assumed Normal-distributed with a mean of 0 and constant (homogeneous) standard deviation \(\sigma_\epsilon\). Whilst the model effectively states that the structural part \(\beta_{0,j}\) is different for each grouping level \(j\), the groupings are linked because they share the same distribution for the error term.

The second step is to define a model for the grouping-dependent intercepts:
\[\begin{equation}
\beta_{0,j} = \beta_0 + \gamma_{0,j} \quad \quad \quad \gamma_{0,j} \sim \mathbf{Normal}(0,\sigma_{\gamma_0})
\tag{11.3}
\end{equation}\]
Let’s pause for a moment and consider the Equation (11.3) above in detail. You can think of \(\beta_{0,j}\) as the dependent variable here. On the right hand side, there are two terms, an intercept \(\beta_0\), and a residual term \(\gamma_{0,j}\) which is assumed Normal-distributed with a mean of 0 and a constant (homogeneous) standard deviation \(\sigma_{\gamma_0}\). Note that this model is structurally identical to that of Equation (11.1). If you’d replace \(\beta_{0,j}\) by \(Y_i\), \(\gamma_{0,j}\) by \(\epsilon_i\), and \(\sigma_{\gamma_0}\) by \(\sigma_\epsilon\), you’d reproduce Equation (11.1) exactly. We thus have a simple model for the parameter \(\beta_{0,j}\) as an outcome. In such a **parameters as outcomes** formulation of linear mixed-effects models, it is customary to call Equation (11.2) a **level 1** model. It is a model on the level of the observed values \(Y_{i,j}\). Equation (11.3) is then a **level 2** model, a higher-order model which is on the level of the parameters of the model on the level below it (level 2). As we will see later on, further higher-order models can be specified, leading to so-called multilevel models. For present purposes, a level 2 model suffices.

The random-intercepts model defined by equations (11.2) and (11.3) has random elements (\(\gamma_{0,j}\) and \(\epsilon_{i,j}\)), and non-random elements (\(\beta_0\)). Parameter \(\beta_0\) is an (unknown) constant, and in the context of mixed-effects models is called a **fixed effect**. The \(\gamma_{0,j}\) terms are called **random effects**. The reflect deviations between the means in the groupings from the overall mean \(\beta_0\):
\[\gamma_{j,0} = \beta_{0,j} - \beta_0\]
This is similar to how the residuals \(\epsilon_{i,j}\) reflect deviations of observations from the means \(\beta_{0,i}\). In the study, differences between the labs (grouping levels) are not of direct interest, and this is generally the case for random effects. As indicated earlier, they are usually nuisance factors, which we include in a model to make the model valid and precise, but we would otherwise like to ignore them. What is of interest in e.g. determining whether people’s judgements on average are equal to the true height of Mount Everest is the fixed effect \(\beta_0\). In our model with random intercepts \(\beta_{0,j}\), this parameter equals the mean of these intercepts, because we can rewrite Equation (11.3) in an equivalent form as
\[\beta_{0,j} \sim \mathbf{Normal}(\beta_0,\sigma_{\gamma_0})\]
A useful view of the model is then as a hierarchical model, as we have already done in our level-1 and level-2 formulation. This is depicted in Figure 11.1.

## 11.3 Parameter estimation

If we could observe the parameters \(\beta_{0,j}\) directly, model estimation would be easy. We could just use the methods of Chapter 3. But parameters are not observed, and need to be inferred. This is where the “magic” of linear mixed-effects models happens. In estimating a mixed-effects model, the particular values of the random effects \(\gamma_{0,j}\) are not directly the focus. Rather, the objective is to estimate the variance of the random effects, \(\sigma^2_{\gamma_0}\). Once we have estimates of the fixed effects, as well as the random-effect variance \(\sigma^2_\gamma\) and residual error variance \(\sigma^2_\epsilon\), we can obtain predictions of the random effects \(\hat{\gamma}\), which are called **best linear unbiased predictions** (BLUPs). These can be considered a side-product of the estimation, rather than an integral part.

As we have discussed in earlier chapters, maximum likelihood provides biased estimates of variances. As you may recall, the sample variance \(S^2 = \frac{\sum_{i=1}^n (Y_i - \overline{Y})^2}{n}\) is the maximum likelihood estimate of the variance, whilst the unbiased estimator is \(\hat{\sigma^2} = \frac{\sum_{i=1}^n (Y_i - \overline{Y})^2}{n-1}\). In the context of linear mixed-effects models, there is a similar issue that maximum likelihood estimation provides biased estimates of the variances of random effects. Unbiased estimates are obtained with a procedure called **restricted maximum likelihood** (REML). It is beyond the scope of this book to go into the details of this procedure, but it roughly corresponds by first removing any fixed effects from the data, and then estimating the variance components. For most purposes, relying on restricted maximum likelihood estimation is generally recommended. A main exception is when we need to obtain the maximised likelihood to perform a likelihood ratio test (to e.g. test the significance of random effect variances). For these limited cases, you can re-estimate the model with maximum likelihood (ML), solely to obtain the maximised (log) likelihood.

Parameter estimation in mixed-effects models is more complicated than for linear models with only fixed effects. This is especially the case for more complex models with random intercepts and slopes and correlations between these. Care must therefore be taken in defining a model which is complex enough to allow you to capture the dependence between observations, but not so complex to render the estimation of parameters so unreliable, or even impossible, to make the inferences effectively meaningless. We will discuss the specification of the appropriate random effects structure in more detail later on.

## 11.4 Parameter inference

After estimating the parameters of the linear mixed-effects model, the same procedures of model comparisons can be applied to obtain null hypothesis significance tests for the fixed effects. In the same way as for the General Linear Model, this can consist of computing \(F\) statistics from comparisons of the Sum of Squared Errors of a general MODEL G and a restricted MODEL R which sets some of the fixed effects parameters to a priori values, often \(\beta = 0\).

A complication with linear mixed-effects models concerns the appropriate degrees of freedom. In models with only fixed effects, we could use \(\text{df}_1 = \text{npar}(G) - \text{npar}(R)\) and \(\text{df}_2 = n - \text{npar}(G)\). Unfortunately, this is not so for mixed-effects models. The problem is the inclusion of the random effects \(\gamma\). These are not parameters in the sense that the fixed effects \(\beta\) and the variances \(\sigma\) are parameters, they nevertheless are factors which influence the model errors \(\epsilon\). You might think of them as “partial” parameters, which you can assign a weight between 0 and 1, in terms of how they influence the model errors \(\epsilon\). Assigning each random effect \(\gamma_{0,j}\) a weight of 1 would add, in our example above, a total of \(n_g\) parameters to the model (and hence reduce \(\text{df}_2\) by \(n_g\)). But this generally assigns too much importance to the random effects, and the appropriate correction is somewhere between 0 and \(n_g\). Precisely what value the correction to the degrees of freedom should have is unfortunately not known for most models. Different approximations have been proposed. Two main ones are the Satterthwaite (Satterthwaite, 1941) and the Kenward-Roger (Kenward & Roger, 1997) approximation. As the latter is known to provide the best control of Type I errors with the limited sample sizes that are common in psychology studies, this one is generally recommended (Singmann & Kellen, 2019). Luke (2017) provides a further comparison of the approximation methods.

You may also be interested in testing evidence for the existence of random effects. If there are no random effects (i.e. all \(\gamma_{0,j} = 0\)), that is equivalent to setting \(\sigma_{\gamma_0} = 0\). The \(F\) statistic is not suitable for comparing a model where we set the standard deviation of a random effect \(\gamma\) to \(\sigma_{\gamma} = 0\) to a model where we allow it to take any value \(\sigma_{\gamma} \geq 0\) (and hence need to estimate it). For such model comparisons, we can revert back to the general idea of comparing models via the likelihood ratio (e.g. Section 2.5):

\[\begin{aligned} \text{likelihood-ratio} &= \frac{p(Y_1,\ldots,Y_n|\text{MODEL R})}{p(Y_1,\ldots,Y_n|\text{MODEL G})} \\ &= \frac{p(\text{DATA}|\text{MODEL R})}{p(\text{DATA}|\text{MODEL G})} \end{aligned}\] where \(\text{DATA}\) in the second line of this equation is just a short-hand for all observations (e.g. \(\text{DATA} = Y_1,\ldots,Y_n\)). This will make it easier to refer to more complex data structures later on.

If the number of total observations \(n\) is sufficiently large, we can rely on an important and general statistical theorem by Wilks (1938) that shows that as \(n \rightarrow \infty\) (i.e., as the number of observations approaches infinity), then under the null-hypothesis that MODEL R is true:
\[\begin{equation}
-2 \log \left(\text{likelihood-ratio}\right) \sim \mathbf{chi-squared}(\text{npar}(G) - \text{npar}(R))
\end{equation}\]
where the “minus two times log likelihood ratio” is computed as
\[\begin{aligned}
-2 \log \left(\text{likelihood-ratio}\right) &= -2 \log p\text{DATA}|\text{MODEL R}) - (-2 \log p(\text{DATA}|\text{MODEL G}))
\end{aligned}\]
In words, under the null-hypothesis that the restricted model is true, the sampling distribution of minus twice the natural logarithm of the likelihood ratio is distributed as a Chi-squared (\(\chi^2\)) distribution. A Chi-squared distribution has one parameter, the degrees of freedom, which here equals the difference in the number of parameters of the two models. In this test, if there are any unknown parameters in the models, they need to be estimated by maximum likelihood (not by restricted maximum likelihood). This is because the ratio needs to be the ratio between the maximum value of each likelihood. The parameters to count are the fixed effects \(\beta\) and the variances of the random effects \(\sigma^2_\gamma\), as well as any covariances between them (we will discuss this later), and the residual error variance \(\sigma^2_\epsilon\). In this test, you don’t have to worry about adjusting the degrees of freedom for the influence of the random effects \(\gamma\). An important caveat is however that Wilks’ theorem assumes that the restriction of the parameters in MODEL R are in the *interior of the parameter space*. That means that if parameters have upper or lower bounds, the fixed values can not be set at exactly these bounds in MODEL R. For example. a variance can not be negative, so a variance parameter has a lower bound of \(\sigma^2 \geq 0\). Unfortunately, in the model comparison to test for random effects, we need to set the variance at exactly this lower bound in MODEL R. As a result, the Chi-squared distribution is not an accurate approximation to the true sampling distribution. Usually, the tests are too conservative, providing too large \(p\)-values (suggestions are that when testing a single variance of a random effect, they are approximately twice as large as they should be, see e.g. Pinheiro & Bates (2006)).^{27} The general conservatism of the likelihood ratio tests means that if you obtain a significant result, you can be reassured that you have sufficient evidence for an effect. A conservative test procedure however means that the true rate of Type 1 errors will be lower than set by the significance level \(\alpha\). When it is crucial to maintain the rate of Type 1 errors sought for by the significance level, a better approach is to use a **parametric bootstrap** procedure where you simulate a large number of data sets according to the estimated MODEL R, and for each compute the value of the \(-2 \log\left(\text{likelihood-ratio}\right)\) value, which provides an “empirical distribution” of the sampling distribution of this statistic. You can then determine how “unusual” the actual value computed for the true data is in the context of this simulated distribution.

## 11.5 Application of the random-intercepts model

Having spent a bit of time discussing the estimation and testing of effects in linear mixed-effects models in abstract terms, let’s see how we can use a model like this in practice. For this first example, we will consider the data from the anchoring study again, focussing on those (non US or UK based) labs that were able to obtain estimates of the height of Mount Everest in meters, after having provided the participants with a low anchor. The judgements of participants acquired by these different labs, as well as the mean judgements and standard errors of the mean (remember, these standard errors are the standard deviation of the sampling distribution of the mean) are shown in Figure 11.2.

As this figure shows, the mean judgements seem to vary quite a bit between the different labs. As a result, predicting judgements in each group by the mean over all groups will likely introduce dependence between the errors, with errors in groups with a relatively high mean tending to be positive, and errors in groups with a relatively low mean will tend to be negative. We can eliminate such dependence by introducing random intercepts for each group, because the errors will then be deviations of the observations from the group means: \[\epsilon_{i,j} = Y_{i,j} - \beta_{0,j}\] We are still interested in the overall mean judgement, reflected by the fixed intercept \(\beta_0\), which can be interpreted as the average of all group-wise intercepts \(\beta_{0,j}\).

The estimated model (by REML) can be written as \[\begin{aligned} Y_{i,j} &= \beta_0 + \gamma_{0,j} + \epsilon_{i,j} \\ &= 4392.91 + \gamma_{0,j} + \epsilon_{i,j} \\ \gamma_{0,j} &\sim \mathbf{Normal}(0,2223.22) \\ \epsilon_{i,j} &\sim \mathbf{Normal}(0,2618.12) \end{aligned}\]

We can compare this to an estimated model without random intercepts: \[\begin{aligned} Y_{i,j} &= \beta_0 + \epsilon_{i,j} \\ &= 4471.43 + \epsilon_{i,j} \\ \epsilon_{i,j} &\sim \mathbf{Normal}(0,3344.03) \end{aligned}\]

In the latter model, which is similar to the simple model we focused on in Chapter 3, the estimate of the intercept is the sample mean over all the observations. You can see that this differs from the estimated fixed intercept of the random intercepts model. This fixed effect represents the average of the random intercepts \(\beta_{0,j}\). Contrary to what you might have thought, the predicted^{28} random intercepts are not equal to the group-wise sample averages. Table 11.1 shows the sample averages, and the predicted random intercepts, as well as the estimated random effects.

\(\overline{Y}_j\) | \(n_j\) | \(\hat{\beta}_{0}\) | \(\hat{\beta}_{0,j}\) | \(\hat{\gamma}_{0,j}\) | \((\overline{Y}_j - \hat{\beta}_{0})\) |
---|---|---|---|---|---|

2746 | 44 | 4393 | 2797 | -1596 | -1646.6 |

8650 | 2 | 4393 | 6907 | 2514 | 4257.1 |

4488 | 36 | 4393 | 4485 | 92 | 95.5 |

1946 | 48 | 4393 | 2015 | -2378 | -2446.6 |

1432 | 31 | 4393 | 1558 | -2834 | -2961.3 |

4258 | 36 | 4393 | 4263 | -130 | -135.1 |

7325 | 73 | 4393 | 7271 | 2878 | 2932.4 |

3571 | 48 | 4393 | 3594 | -799 | -822.1 |

6703 | 56 | 4393 | 6647 | 2254 | 2310.1 |

As you can see, the random intercepts \(\hat{\beta}_{0,j}\) are closer to the fixed intercept \(\hat{\beta}_0 = 4392.91\) than the sample means \(\overline{Y}_j\). In statistical terms, this is usually called **shrinkage**. Due to the assumption that the random effects are Normal-distributed, the random intercepts are pulled towards the average of that distribution, which is the fixed intercept \(\beta_0\). This is because in a Normal distribution, large deviations from the mean are unlikely. Hence, if the average in a grouping is far from the overall average (the average over all the groupings), that is perhaps due to a random fluke in this particular dataset. Adjusting the estimated mean of the grouping to be closer to those of the other groupings then makes sense: although each grouping can differ, there will be some similarities between the groupings. The relative adjustment depends not only on the difference between the sample mean from the overall mean, but also on the sample size (\(n_j\)) of each grouping level. The smaller the sample size, the more the estimate will be adjusted towards the overall mean, because there is less information in smaller samples to precisely estimate a parameter. Hence, because estimates are more noisy, the true mean can be expected to be further away from the sample mean.

Because the groupings are not completely dissimilar, when estimating the mean for one grouping, we can partly rely on the data from other groupings. This is also called **partial pooling**. It can be contrasted with **complete pooling**, where all the data is treated as if from one single group (i.e. as in the model of Equation (11.1)). Complete pooling is inappropriate here, because when there are effective groupings in the data, this will violate the *iid* assumption which the complete pooling model relies on. The opposite of complete pooling is **no pooling**. This means that you would estimate a model for each grouping separately. If the sample sizes of the groupings are sufficiently large to allow for reliable parameter estimation, this approach is not unreasonable and would not violate the *iid* assumption. However, compared to partial pooling, there are several disadvantages, most notably that there is no easy way to combine the results of the separate analyses to obtain inferences over the whole set of groupings.

Testing whether the fixed intercept \(\beta_0\) equals an a priori value, such as \(\beta_0 = 8848\), can be essentially done in the same way is in Chapter 3. We can estimate a model in which we fix the intercept to that value, and then conducting a model comparison. The results of this comparison, using the Kenward-Roger approximation for the degrees of freedom is \(F(1, 7.832) = 33.32\), \(p <.001\). As such, we reject the null hypothesis that the average judgement is equal to the true height of Mount Everest. Performing the same analysis but with models which don’t include random effects would have given \(F(1, 373) = 640.621\), \(p <.001\). The latter test is not a good reflection of the data, because the *iid* assumption on which it rests is violated here. This leads to a biased and *overconfident* test result (i.e., a too high \(F\) value). You can also see the Kenward-Roger approximation of the degrees of freedom at work in the test for the mixed-effects model. The value for \(\text{df}_1\) is 1 because the model comparison involves one less parameter in MODEL R compared to MODEL G. The difference is in the value of \(\text{df}_2\): Compared to the (wrong) full pooling model, the value of \(\text{df}_2\) is substantially lower (7.832 compared to 373). Therefore, the test with the mixed model has effectively less power, but as the complete pooling model is not valid, the mixed-model analysis is preferred to the latter.

## 11.6 Models with random intercepts and slopes

The model we dealt with so far is the simplest example of a linear mixed-effects model. Let’s now consider a slightly more complex model, where we will now assess the effect of the anchor that was given to participants. The data from the international labs providing judgements in meters are shown in Figure 11.3. As for the low anchor condition, average judgements in the high anchor condition seem to differ between the labs, albeit to a lesser extent.

Accounting for the effect of anchor is done by including a contrast-coded predictor in the model, e.g. a predictor \(X_1\) which has the value \(\tfrac{1}{2}\) for cases in the high anchor condition, and a value \(-\tfrac{1}{2}\) for cases in the low anchor condition. Just adding this predictor to the random intercepts model of Equations (11.2) and (11.3) would mean that we would change Equation (11.2) to \[Y_{i,j} = \beta_{0,j} + \beta_{1} \times X_{1,i,j} + \epsilon_{i,j} \quad \quad \quad \epsilon_{i,j} \sim \mathbf{Normal}(0,\sigma_{\epsilon})\] For this model, we don’t need a further random-effects specification to Equation (11.3). In this new model, as for ANOVA models, the intercepts \(\beta_{0,j}\) reflect a grand mean (average of the averages of the two anchor conditions), whilst the fixed slope \(\beta_1\) reflects the difference between the mean of the high and low anchor condition. As there is only a fixed slope \(\beta_1\), this model assumes that the effect of the anchors is exactly the same for each grouping (lab). That implies that for each grouping, the distance between the mean judgement in the high and low anchor is assumed to be identical. The random intercept would allow the midpoint between the two means to vary over the groupings, such that if one grouping has relatively high judgements in the low anchor condition, it would also have relatively high judgements in the high anchor condition. Looking at Figure 11.3, this may be a too strong assumption. For example, referrer \(\texttt{swpson}\) has a relatively high mean in the low anchor condition, but a relatively low mean in the high anchor condition (and both are suspiciously close to the true height of Mount Everest). Because we have multiple cases in each lab-condition combination, we could allow the effect of anchor to differ between the groupings. This can be achieved by adding a random effect for anchor. Conceptually, that means that we allow the effect of anchor to be moderated by lab. If we allow a different slope \(\beta_{1,j}\) for each grouping \(j\), we can write the model as \[\begin{equation} Y_{i,j} = \beta_{0,j} + \beta_{1,j} \times X_{1,i,j} + \epsilon_{i,j} \quad \quad \quad \epsilon_{i,j} \sim \mathbf{Normal}(0,\sigma_{\epsilon}) \tag{11.4} \end{equation}\] The level-2 models for the intercept and slope are: \[\begin{align} \beta_{0,j} &= \beta_0 + \gamma_{0,j} \quad \quad \quad \gamma_{0,j} \sim \mathbf{Normal}(0,\sigma_{\gamma_0}) \\ \beta_{1,j} &= \beta_1 + \gamma_{1,j} \quad \quad \quad \gamma_{1,j} \sim \mathbf{Normal}(0,\sigma_{\gamma_1}) \tag{11.5} \end{align}\] As for the random intercepts, the random slope \(\beta_{1,j}\) consists of a fixed part \(\beta_1\), which is the average slope over all groupings, and a random part \(\gamma_{1,j}\), which is the deviation of the grouping-wise slope (\(\beta_{1,j}\)) from the average (\(\beta_1\)).

The estimated model (obtained with REML) can be stated as \[\begin{aligned} Y_{i,j} &= \beta_0 + \gamma_{0,j} + (\beta_1 + \gamma_{1,j}) \times X_{1,i,j} + \epsilon_{i,j} \\ &= 7005.51 + \gamma_{0,j} + (5393.84 + \gamma_{1,j}) \times X_{1,i,j} + \epsilon_{i,j} \\ \gamma_{0,j} &\sim \mathbf{Normal}(0,772.69) \\ \gamma_{1,j} &\sim \mathbf{Normal}(0,2778.40) \\ \epsilon_{i,j} &\sim \mathbf{Normal}(0,2565.50) \end{aligned}\] In terms of inference, we are mainly interested in the fixed effect of anchor. A test of the hypothesis that the slope of anchor is \(\beta_1 = 0\), again using the Kenward-Roger approximation for the degrees of freedom, gives us \(F(1, 7.888) = 31.057\), \(p <.001\). Hence, we reject the null hypothesis that the true effect of the anchor equals 0. There clearly is a difference between the low and high anchor condition, such that the judgements in the former are lower on average than in the latter conditions.

At this point, let’s also consider testing whether there is evidence that the effect of the anchor varies over the groupings. In order to obtain this test, we need to use a likelihood ratio test, or a parametric bootstrap. Here, we’ll focus on the former. To test the null-hypothesis \(H_0: \sigma_{\gamma_1} = 0\) with a likelihood-ratio test, we first re-estimate the full MODEL G (which includes random intercepts and slopes) by maximum likelihood. This provides us a value of \[-2 \log p(Y_{i,j}|\text{MODEL G}) = 12108.94\] In MODEL R, we set \(\sigma_{\gamma_1} = 0\), effectively omitting the random slopes \(\gamma_{1,j}\). Otherwise, we keep everything else the same. Estimating this MODEL R gives \[-2 \log p(Y_{i,j}|\text{MODEL R}) = 12243.61\] The likelihood-ratio test statistic is simply the difference between these two values: \[-2 \log (\text{likelihood-ratio}) = 12243.61 - 12108.94 = 134.673\] As MODEL G contains just one more estimated parameter, the degrees of freedom for the test equals \(\text{npar}(G) - \text{npar}(R) = 5 - 4 = 1\). The critical value for a Chi-squared distribution with one degree of freedom and significance level \(\alpha = .05\) is \(\chi^2(1; .05) = 3.841\). Clearly, the value computed for the data is well above the critical value. Hence, we reject the null hypothesis that \(\sigma^2_{\gamma_1} = 0\). So, we have sufficient evidence that the effect of anchor varies over the groupings.

### 11.6.1 Correlation between random effects

In the model above, we assumed the random effects \(\gamma_{0,j}\) and \(\gamma_{1,j}\) are independent and drawn from separate Normal distributions (each with a different standard deviation, \(\sigma_{\gamma_0}\) and \(\sigma_{\gamma_1}\). There are situations where we would however like to allow the random effects to be correlated. In the example above, it might be the case that the effect of the anchor differs between groups who on average give relatively high judgements, compared to groups who tend to give lower judgements. Such a dependence indicates a moderation of the effect of anchor by the average judgement in each group. In linear mixed-effects models, such a moderation can be implemented by allowing for a correlation between random intercepts (e.g. the average values within each lab) and slopes (e.g. the effects of anchor).

In general, linear mixed-effects models assume that the random effects are samples from a **multivariate Normal distribution**. A multivariate Normal distribution is a distribution over *vectors* of values. A multivariate Normal distribution is parametrized by a mean vector \(\boldsymbol{{\mu}}\) and a covariance matrix \(\boldsymbol{{\Sigma}}\). For the two random effects in our model, we can specify the joint distribution as follows:
\[\begin{equation}
\begin{aligned}
\left( \begin{matrix} \gamma_{0,j} \\ \gamma_{1,j} \end{matrix} \right)
&\sim \mathbf{Normal}\left(\boldsymbol{\mu}, \boldsymbol{\Sigma} \right) \\
&\sim \mathbf{Normal}\left( \left[ \begin{matrix} 0 \\ 0 \end{matrix} \right] , \left[ \begin{matrix} \sigma^2_{\gamma_0}&\rho_{\gamma_0,\gamma_1}\sigma_{\gamma_0}\sigma_{\gamma_1} \\ \rho_{\gamma_0,\gamma_1}\sigma_{\gamma_0}\sigma_{\gamma_1}&\sigma^2_{\gamma_1} \end{matrix} \right] \right)
\end{aligned}
\end{equation}\]
So
\[\boldsymbol{\mu} = \left[ \begin{matrix} 0 \\ 0 \end{matrix} \right]\]
and
\[\boldsymbol{\Sigma} = \left[ \begin{matrix} \sigma^2_{\gamma_0}&\rho_{\gamma_0,\gamma_1}\sigma_{\gamma_0}\sigma_{\gamma_1} \\ \rho_{\gamma_0,\gamma_1}\sigma_{\gamma_0}\sigma_{\gamma_1}&\sigma^2_{\gamma_1} \end{matrix} \right]\]
In the covariance matrix \(\boldsymbol{\Sigma}\), the \(\rho_{\gamma_0,\gamma_1}\sigma_{\gamma_0}\sigma_{\gamma_1}\) component in the off-diagonal elements are the covariance between \(\gamma_{0,j}\) and \(\gamma_{1,j}\). A covariance can be written as a product of the correlation \(\rho_{\gamma_0,\gamma_1}\) between \(\gamma_{0,j}\) and \(\gamma_{1,j}\), and the standard deviations \(\sigma_{\gamma_0}\) and \(\sigma_{\gamma_1}\).

Unfortunately, the added complexity of estimating the correlation between the random effects is sometimes too much, resulting in estimation failures. That was the case for the subset of data for the international labs analysed above. By including all the labs in the full study, the model is estimable. In this case, we focus on the estimation in feet, rather than meters. In other respects, the model is the same as before.

The estimated model can be written as \[\begin{aligned} Y_{i,j} &= \beta_0 + \gamma_{0,j} + (\beta_1 + \gamma_{1,j}) \times X_{1,i,j} + \epsilon_{i,j} \\ &= 22840.77 + \gamma_{0,j} + (23578.76 + \gamma_{1,j}) \times X_{1,i,j} + \epsilon_{i,j} \\ \left( \begin{matrix} \gamma_{0,j} \\ \gamma_{1,j} \end{matrix} \right) &\sim \mathbf{Normal}\left( \left[ \begin{matrix} 0 \\ 0 \end{matrix} \right] , \left[ \begin{matrix} \sigma^2_{\gamma_0}&\rho_{\gamma_0,\gamma_1}\sigma_{\gamma_0}\sigma_{\gamma_1} \\ \rho_{\gamma_0,\gamma_1}\sigma_{\gamma_0}\sigma_{\gamma_1}&\sigma^2_{\gamma_1} \end{matrix} \right] \right) \\ & \sim \mathbf{Normal}\left( \left[ \begin{matrix} 0 \\ 0 \end{matrix} \right] , \left[ \begin{matrix} (1571.776)^2 & -0.803 \times 1571.776 \times 6003.001 \\ -0.803 \times 1571.776 \times 6003.001 & (6003.001)^2 \end{matrix} \right] \right) \\ \epsilon_{i,j} &\sim \mathbf{Normal}(0,9374.82) \end{aligned}\] This shows a negative correlation between the random intercepts and the random slopes: $_{_0,_1} = -0.803. This can be interpreted as indicating that the higher the average judgements in a grouping (the intercept), the smaller the difference is between the low and high anchor conditions within that grouping (the slope).

To assess whether there is sufficient evidence to include this correlation between the random intercepts and slopes, we can perform a model comparison, comparing the MODEL G just estimated, to a MODEL R where we fix the correlation to 0. As we are testing for random effects, we will use the likelihood-ratio test. The result of this test is \(\chi^2(1) = 15.781\), \(p < .001\). Hence, we would reject the null-hypothesis that in the Data Generating Process, the correlation between the random effects is 0.

## 11.7 Crossed random effects: dating partners in the speed dating experiment

As a final example of a mixed-effects analysis, we will reconsider the data from the speed-dating experiment of Fisman et al. (2006) we analysed in Chapter 6. In this experiment, groups of participants encountered each other in a number of speed-dates. They then rated each other on a number of attributes (e.g. physical attractiveness, intelligence, etc). Each participant thus rated a number of different other participants, but was also the object of ratings by multiple other participants. In a sense, the design is then a “doubly repeated” design, where the same participant provides multiple ratings, but the same “Item” (i.e. dating partner) is also rated multiple times. For example, in one group, female participants 132, 133, 134, 135, and 136, each went on a four-minute date with male participants 137, 138, 139, 140, and 141. So participant 132 provided ratings for five other participants (the “items”), but was herself also an “item” for five other participants (137, 138, 139, 140, and 141). If, as we did in Chapter 6, we focus on how much participants liked their date, there may be both *person effects* and *item effects*. By person effects we mean individual differences between raters. For instance, some participants might generally like all their dating partners more than expected because they take more of an interest in their partners. By item effects we mean differences between the persons who were rated. For instance, some people may generally receive higher ratings than expected because they have a positive quality that is not captured by the predictors. Although calling a dating partner an item is not very nice, the name “item effect” is pretty standard in the literature, and hence we will use “item” in the remainder for the person being rated, and “person” for the person who provides the rating. Both person and item effects can lead to dependencies in the model errors. Unless you have a full factorial design where each person rates each item, this is difficult to deal with in a repeated-measures ANOVA. But it is pretty straightforward with linear mixed-effects models.

Let’s consider modelling the dependent variable \(\texttt{like}_{i,j}\), which is a liking rating by person \(i\) for item \(j\), as a function of attractiveness, \(\texttt{attr}_{i,j}\), intelligence, \(\texttt{intel}_{i,j}\), and the product-predictor for their interaction, \((\texttt{attr} \times \texttt{intel})_{i,j}\). We will consider the level-1 model
\[\texttt{like}_{i,j} = \beta_{0,i,j} + \beta_{1, i, j} \times \texttt{attr}_{i,j} + \beta_{2, i, j} \times \texttt{intel}_{i,j} + \beta_{3, i, j} \times (\texttt{attr} \times \texttt{intel})_{i,j} + \epsilon_{i,j} \quad \quad \epsilon_{i,j} \sim \mathbf{Normal}(0,\sigma_{\epsilon})\]
Here, we assume that the intercept \(\beta_{0,i,j}\) as well as the slopes \(\beta_{1, i, j}\), \(\beta_{2, i, j}\), and \(\beta_{3, i, j}\), may be different for each combination of person \(i\) and item \(j\). As we have only a single rating of the liking for each person-item combination, this is not a model we would be able to estimate separately for each person-item combination. However, because persons and items are **crossed**, we do have multiple observations for each person and item, which allows us to use partial pooling for both. As level two models, we can use
\[\begin{aligned}
\beta_{0,i,j} &= \beta_0 + \gamma_{P,0,i} + \gamma_{I,0,j} \\
\beta_{1,i,j} &= \beta_1 + \gamma_{P,1,i} + \gamma_{I,1,j} \\
\beta_{2,i,j} &= \beta_2 + \gamma_{P,2,i} + \gamma_{I,2,j} \\
\beta_{3,i,j} &= \beta_3 + \gamma_{P,3,i} + \gamma_{I,3,j}
\end{aligned}\]
So we are using one fixed part for each parameter (e.g. \(\beta_0\)), but now *two* random parts: one dependent on the person \(i\) who provides the rating (e.g. \(\gamma_{P,0,i}\)), and one dependent on the item \(j\) (e.g. \(\gamma_{I,0,j}\)). Whilst the random effects for Person may be correlated, and the random effects for Item may be correlated as well, the random effects for Person cannot be correlated with effects for Item. This results in a so-called **block-diagonal** structure for the random effects:
\[\begin{equation}
\left( \begin{matrix} \gamma_{0,i} \\ \gamma_{P,1,i} \\ \gamma_{P,2,i} \\ \gamma_{P,3,i} \\ \gamma_{I,0,j} \\ \gamma_{I,1,j} \\ \gamma_{I,2,j} \\ \gamma_{I,3,j} \end{matrix} \right)
\sim \mathbf{Normal}\left( \left[ \begin{matrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{matrix} \right] , \left[ \begin{matrix}
\sigma^2_{{P0}} & \sigma_{{P01}} & \sigma_{P02} & \sigma_{P03} & 0 & 0 & 0 & 0 \\
\sigma_{P01} & \sigma^2_{P1} & \sigma_{P12} & \sigma_{P13} & 0 & 0 & 0 & 0\\
\sigma_{P02} & \sigma_{P12} & \sigma^2_{P2} & \sigma_{P23} & 0 & 0 & 0 & 0 \\
\sigma_{{0,3,P}} & \sigma_{P13} & \sigma_{P23} & \sigma^2_{P3} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \sigma^2_{I0} & \sigma_{I01} & \sigma_{I02} & \sigma_{I03} \\
0 & 0 & 0 & 0 & \sigma_{I01} & \sigma^2_{I1} & \sigma_{I12} & \sigma_{I13} \\
0 & 0 & 0 & 0 & \sigma_{I02} & \sigma_{I12} & \sigma^2_{I2} & \sigma_{I23} \\
0 & 0 & 0 & 0 & \sigma_{I03} & \sigma_{I13} & \sigma_{I23} & \sigma^2_{I3}
\end{matrix} \right] \right)
\end{equation}\]
Here, \(\sigma^2_{P0}\) refers to the variance of the random intercept for persons, i.e. the variance of \(\gamma_{P,0,i}\), and \(\sigma^2_{I3}\) to the variance of the random slopes for \((\texttt{attr} \times \texttt{intel})_{ij}\) for items, i.e. the variance of \(\gamma_{I,3,j}\). A term such as \(\sigma_{{P01}}\) stands for the covariance between the random intercept and random slope for \(\texttt{attr}_{ij}\), i.e. the covariance between \(\gamma_{P,0,i}\) and \(\gamma_{P,1,i}\), which can be computed from the correlation and standard deviations as usual:
\[\sigma_{P01} = \rho_{P01} \sigma_{P0} \sigma_{P1}\]
where \(\rho_{P01}\) refers to the correlation between \(\gamma_{P,0,i}\) and \(\gamma_{P,1,i}\).

As the name suggests, a block-diagonal structure implies there are blocks (groups) of parameters which are interrelated, but not related to other parameters outside the group.

The results for the fixed effects of the model, where \(\texttt{attr}\) and \(\texttt{intel}\) were mean-centered, are provided in Table 11.2. Compared to Table 6.3, which effectively considered the same model without random effects, we generally obtain similar results, although the estimates and \(F\)-values do differ due to the partial pooling and shrinkage of the mixed-effects model. As dependency between ratings from the same rater, and for the same ratee (“item”) is rather likely, this new analysis should be preferred to the one in Table 6.3.

\(\hat{\beta}\) | \(\text{df}_1\) | \(\text{df}_2\) | \(F\) | \(P(\geq F)\) | |
---|---|---|---|---|---|

Intercept | 6.224 | 1 | 122.8 | 7783.26 | 0.000 |

\(\texttt{attr}\) | 0.555 | 1 | 120.3 | 263.45 | 0.000 |

\(\texttt{intel}\) | 0.375 | 1 | 100.3 | 86.97 | 0.000 |

\(\texttt{attr}\times\texttt{intel}\) | -0.009 | 1 | 48.3 | 0.42 | 0.521 |

The estimated covariance matrix for the random Person effects is provided in Table 11.3.

\(\gamma_{P,0}\) | \(\gamma_{P,1}\) | \(\gamma_{P,2}\) | \(\gamma_{P,3}\) | |
---|---|---|---|---|

\(\gamma_{P,0}\) | 0.079 | -0.023 | 0.021 | -0.004 |

\(\gamma_{P,1}\) | -0.023 | 0.023 | -0.021 | -0.001 |

\(\gamma_{P,2}\) | 0.021 | -0.021 | 0.023 | 0.002 |

\(\gamma_{P,3}\) | -0.004 | -0.001 | 0.002 | 0.001 |

The estimated covariance matrix for the random Item effects is provided in Table 11.4.

\(\gamma_{I,0}\) | \(\gamma_{I,1}\) | \(\gamma_{I,2}\) | \(\gamma_{PI,3}\) | |
---|---|---|---|---|

\(\gamma_{I,0}\) | 0.285 | -0.051 | 0.013 | -0.008 |

\(\gamma_{I,1}\) | -0.051 | 0.042 | -0.029 | 0.002 |

\(\gamma_{I,2}\) | 0.013 | -0.029 | 0.049 | -0.002 |

\(\gamma_{PI,3}\) | -0.008 | 0.002 | -0.002 | 0.002 |

As you can see, the item variability is generally estimated to be higher than the person variability. This, to a certain extent, indicates that variability between items (the dating partners here) weighs more heavily than differences between the persons (the raters here). As such, whether someone is liked more or less than average may be more due to characteristics of that person than characteristics of the rater.

## 11.8 Choosing the random effects structure

The correct specification of the random-effects structure in the model is very important. Omitting a random effect when there is in fact variability in this effect across the levels of a grouping factor can dramatically increase Type I errors well beyond the desired significance level (Barr, Levy, Scheepers, & Tily, 2013; Charles M. Judd et al., 2012). As such, it has been recommended that one should initially start with the **maximal random effects structure** (Barr et al., 2013; Singmann & Kellen, 2019). The maximal model is the model that includes random effects for all grouping factors in a study. Specifically, for each grouping factor, the maximal model contains random intercepts as well as random slopes for all fixed effects, plus all correlations among the random effects.

By using a maximal structure, it is unlikely that you would miss a potentially important source of variability in the data. Inclusion of random effects which are not really there might, given sufficient data, not be too much of an issue, as the model would estimate the variance of such random effects to be low. However, if the sample size is limited, a common problem is that the maximal model is not fully identified (Bates, Kliegl, Vasishth, & Baayen, 2015), especially for mixed models with complicated random effects structures. An unidentified model means that not all parameters can be estimated properly. This is often evident by the estimated variance-covariance matrix of the random effects being degenerate or singular. In such cases, you may obtain estimated variances of 0, near 0, or even negative estimated variances,^{29} and correlations between random effects of \(\pm 1\). The occurrence of such situations is due to the fact the parameters associated to random effects (e.g., \(\sigma^2_{\gamma_1}\)) are more difficult to estimate than fixed effects (e.g., \(\beta_1\)).

When the maximal structure can not be used because of these estimation issues, it is generally recommended to reduce the complexity of the random effects structure, because degenerate or overparameterised models can reduce the statistical power of any tests conducted with them (Matuschek, Kliegl, Vasishth, Baayen, & Bates, 2017). A first step is generally to remove correlations among random slopes, as these tend to contribute the largest number of random effects parameters that need to be estimated. They are usually also more difficult to estimate than the variance parameters. If a model still shows problems after removing correlations, other random-effects parameters could be removed, starting with the highest-order random effects parameter with the lowest estimated variance. Iterative approaches to obtain the **optimal random effects structure**, rather than the maximal one, are provided by Matuschek et al. (2017) and Bates et al. (2015). Applying these iterative procedures can be complicated. They are also data-driven, and hence the results may be influenced by random noise in the data. Sometimes, it might not be possible to reduce the random-effects structure such that all problematic random effects parameters are removed (e.g., in cases when there is random variability in higher-order effects, but not in lower-order effects). In those cases, one might accept a few problematic or degenerate parameters (e.g., variances of zero). This would provide more conservative tests for the fixed effects than simply removing justifiable random effects and inflating Type I error rates to an unknown degree. It is also advisable to check the robustness of results by comparing the fixed-effects estimates and associated hypothesis tests across all estimated models with a different random effects structure. It is often the case that the testing of fixed effects in highly overparameterised models diverge from the corresponding tests in reduced models. In these cases, the tests for the simpler models may be preferred.

### References

*Journal of Memory and Language*,

*68*, 255–278.

*arXiv:1506.04967 [Stat]*.

*The Quarterly Journal of Economics*,

*121*, 673–697.

*Journal of Personality and Social Psychology*,

*103*, 54–69.

*Psychological Bulletin*,

*99*, 422–431.

*Biometrics*,

*53*, 983–997.

*Behavior Research Methods*,

*49*, 1494–1502.

*Journal of Memory and Language*,

*94*, 305–315.

*Mixed-effects models in s and s-PLUS*. Springer Science & Business Media.

*Psychometrika*,

*6*, 309–316.

*New methods in cognitive psychology*(pp. 4–31). Psychology Press.

*The Annals of Mathematical Statistics*,

*9*, 60–62.

When parameters are set on the bounds of the parameter space, the sampling distribution of the statistic is a

*mixture*of different Chi-squared distributions, which effectively is a weighted sum of chi-squared distributions. Although such mixture distributions are well-defined in principle, it is not straightforward to determine the component weights accurately in practice.↩︎In the context of mixed-effects models, it is common to state that random effects are

*predicted*, rather than*estimated*.↩︎A variance can never be negative, so this is clearly an indication of issues in the estimation!↩︎