Chapter 5 Multiple regression

In this chapter, we extend the simple regression model to include multiple predictor variables. The slopes of the predictors in the resulting multiple regression model reflect the unique effect of each predictor on the dependent variable, after removing the effect of all other predictors from both the dependent and predictor variable. We will look at what that means in some detail. We then go on to discuss parameter estimation and testing.

5.1 Trump, votes, and hate groups (again)

In the previous chapter, we found evidence for a relation between the number of hate groups and votes for Trump. This relation was found with an observational study, not an experimental one. Because the number of hate groups was not randomly assigned to states, there are possible confounding factors that could account for the result. The well-known phrase “correlation does not imply causation” should be taken seriously. For instance, it could be that hate groups are especially prevalent in places with relatively low levels of education. And it might be that those with a relatively low level of education are also more likely to vote for Trump. In other words, education level might be a common cause of both hate groups and Trump votes, whilst there is no direct relation between hate groups and Trump votes. The difference between a direct relation, and a spurious one through a common cause, is depicted in Figure 5.1. You can see more, often rather amusing examples of spurious relations on the spurious correlation website.

Figure 5.1: A direct relation between hate groups and Trump votes vs a spurious relation through a common cause (education).

To assess the relation between hate groups and Trump votes in the possible presence of common causes and other confounds, ideally, we would like to “remove” the effect of such confounds from both the “hate groups” predictor, and the “Trump votes” dependent variable. If, after statistically controlling for confounds in such a way, we still find evidence for a relation between hate groups and Trump votes, that would strengthen our belief that the relation is real, rather than spurious. Multiple regression is a way in which to determine unique effects of predictor variables on the dependent variable. Before we move on to define the multiple regression model, we will work through an indirect way to control for a third variable (e.g. a possible common cause), which will provide insight into what is meant by unique effects in a multiple regression model.

5.1.1 Controlling for education level

We have indicated that, potentially, the level of education might be a common cause for both the prevalence of hate groups, and voter support for Trump. What if we could assess the relation between hate groups and Trump votes after removing the effect of education level from both? One reasonable way to do this, using the tools we already have, is by means of simple regression models. The idea is reasonably straightforward. In a regression model predicting Trump votes from education level, the error terms (residuals) of that model reflect that part of the variation in Trump votes which can not be predicted from education level. Similarly, in a regression model predicting hate groups from education level, the error terms (residuals) of that model reflect the variation in hate groups which can not be predicted from education level. So, if there is still a relation between the residual Trump vote, and the residual hate groups, then the relation can not be due to education level as a common cause.

When we estimate a regression model predicting Trump votes from education level (defined as the percentage of citizens who obtained a Bachelors degree or higher), we obtain the following estimated model \[\texttt{trump_votes}_i = 90.069 - 1.357 \times \texttt{education}_i + \hat{\epsilon}_{\texttt{votes},i}\] and for the hate groups we obtain \[\texttt{hate_groups}_i = 6.146 - 0.105 \times \texttt{education}_i + \hat{\epsilon}_{\texttt{hate},i}\] Note that I’m labelling the error terms for both models with a subscript (either \(\texttt{votes}\) or \(\texttt{hate}\)) to denote these are the residual terms for a model predicting Trump votes and hate groups from education levels respectively. Also, I’m adding the hat above them to indicate that these are not the “true errors” (from a model with the true values of the slope and intercept) but rather estimates of these resulting from using estimated parameters. The residuals, as well as the estimated regression lines, are depicted in Figure 5.2. The slope of \(\texttt{education}_i\) is significant in both models. In the model for Trump votes, \(\hat{\beta}_\texttt{education} = -1.357\), \(t(48) = -7.30\), \(p < .001\). In the model for hate groups, \(\hat{\beta}_\texttt{education} = -0.105\), \(t(48) = -2.11\), \(p = .040\). This thus shows a reliable relation between education level and Trump votes, and between education level and hate groups, which is supportive of the idea of a common cause. However, significance is not a prerequisite to statistically control for a third variable.

Predicting Trump votes and hate groups by linear regression models with education level as predictor. The residuals of each model are depicted as vertical lines from the observations to the regression line.

Figure 5.2: Predicting Trump votes and hate groups by linear regression models with education level as predictor. The residuals of each model are depicted as vertical lines from the observations to the regression line.

The regression line in both models represents the variation in the dependent variable (Trump votes and hate groups) that is dependent on education level. In both models, the error term \(\epsilon\) is assumed to be completely independent from the predictor. Hence, these are the aspects of Trump votes and hate groups that cannot be explained by education level. To assess whether there is a relation between these residual terms, which would indicate a relation between Trump votes and hate groups that is independent from education level, we can estimate another linear regression model \[ \hat{\epsilon}_{\texttt{votes},i} = 0 + 1.314 \times \hat{\epsilon}_{\texttt{hate},i} + \hat{\epsilon}_i \tag{5.1} \] Note that the intercept is equal to 0. This is necessarily the case, because the residuals of estimated regression models always have a mean of 0. If you go back to the estimate of the intercept in Equation (4.4), you will see that if the means are both 0, the result has to be 0 as well: \[\begin{aligned} \hat{\beta}_0 &= \overline{Y} - \hat{\beta}_1 \overline{X} \\ &= 0 - \hat{\beta}_1 \times 0 \\ &= 0 \end{aligned}\]

A scatterplot of the dependent variable (the residual Trump votes \(\hat{\epsilon}_{\texttt{votes},i}\)) and the predictor (the residual hate groups \(\hat{\epsilon}_{\texttt{hate},i}\)), together with the estimated regression line, is provided in Figure 5.3.

Residual Trump votes and hate groups, after statistically removing the effect of education level from both, and the estimated regression line.

Figure 5.3: Residual Trump votes and hate groups, after statistically removing the effect of education level from both, and the estimated regression line.

We can see that there appears to be a positive relation between residual Trump votes and residual hate groups. If we perform a test of the null hypothesis \(H_0: \beta_1 = 0\) in this model, we find that the result is significant, \(t(48) = 2.60\), \(p = .012\). This thus indicates that there is a relation between Trump votes and hate groups that can not be attributed to education level. While the procedure of removing the effect of education level from both Trump votes and hate groups is appropriate, there however is a slight issue with the hypothesis test we just performed. To remove the effect of education level, we first had to estimate two new models. These estimates are noisy, and this noise may affect the estimates of the third model in which we predicted the residual Trump votes from the residual hate groups. The hypothesis test does not take this additional source of noise into account.

It turns out that we do not need to estimate separate models to remove the effect of education level. By including both education level and hate groups as predictors of Trump votes in a multiple regression model, the estimated slope of hate groups will be exactly equal to the one that we just computed from the residuals. Multiple regression models thus concern the unique effects of each predictor on the dependent variable, removing the effect of all other predictors from that relation.

5.2 The multiple regression model

The multiple regression model is a straightforward extension of the simple linear regression model, including more than one predictor \(X\): \[\begin{equation} Y_i = \beta_0 + \beta_1 \times X_{1,i} + \beta_2 \times X_{2,i} + \ldots + \beta_m \times X_{m,i} + \epsilon_i \quad \quad \quad \epsilon_i \sim \mathbf{Normal}(0,\sigma_\epsilon) \tag{5.2} \end{equation}\] Note that we are using \(m\) to reflect the total number of predictors \(X\) in the model. As for the simple regression model, this model consists of a structural part which reflects the conditional mean of the dependent variable \(Y\), conditional upon all predictors: \[\beta_0 + \beta_1 \times X_{1,i} + \beta_2 \times X_{2,i} + \ldots + \beta_k \times X_{m,i} = \mu_{Y|X_1,\ldots,X_m}\] and a random part \(\epsilon_i\).

For a model with \(m=2\) predictors, such as \[ \texttt{trump_votes}_i = \beta_0 + \beta_1 \times \texttt{hate_groups}_i + \beta_2 \times \texttt{education}_i + \epsilon_i \quad \quad \epsilon_i \sim \mathbf{Normal}(0,\sigma_\epsilon) \tag{5.3} \] the data can be represented in a three-dimensional space. In this space, the conditional means \(\mu_{Y | X_{1,i}, X_{2,i}}\) can be represented as a regression plane. A visual representation is given below.

Figure 5.4: Three-dimensional representation of a regression model predicting Trump votes from hate groups and education level. By clicking on the image and moving your mouse, you should be able to rotate the image to explore it further. You can also click on the ‘Play’ button for an animation of such a rotation.

Keeping all predictors constant apart from one, the model becomes a simple linear regression model for the non-constant variable. If you pick a value of “% bachelors degree or higher”, you can think of this as slicing the regression plane at that point. The slice of the plane is a single straight line, which is then a simple regression model. For example, if we focus on states with an education level of 15.4% with a bachelors degree or higher, then the multiple regression model can be written as \[\begin{aligned} \texttt{trump_votes}_i &= \beta_0 + \beta_1 \times \texttt{hate_groups}_i + \beta_2 \times 15.4 + \epsilon_i \\ &= (\beta_0 + \beta_2 \times 15.4) + \beta_1 \times \texttt{hate_groups}_i + \epsilon_i \\ &= \beta_0' + \beta_1 \times \texttt{hate_groups}_i + \epsilon_i \end{aligned}\] which is a simple regression model with a new intercept \(\beta_0' = \beta_0 + \beta_2 \times 15.4\) composed as the sum of the original intercept and 15.4 times the slope of \(\texttt{education_level}\), and the same slope \(\beta_1\) for \(\texttt{hate_groups}\). You can see these simple regression lines, for three different values of education level, in Figure 5.5. Notice that the regression lines for different education levels are parallel, because they have the same slope. However, the regression lines are at different heights, because they have a different intercept.

Percentage of votes for Trump in the 2016 elections for 50 US states and the number of hate groups per 1 million citizens, with regression lines for three different values of education level. The colour of each data point and regression line reflects the value of education level.

Figure 5.5: Percentage of votes for Trump in the 2016 elections for 50 US states and the number of hate groups per 1 million citizens, with regression lines for three different values of education level. The colour of each data point and regression line reflects the value of education level.

5.3 Estimation

The estimate of the intercept is a direct generalization of that for a simple regression model: \[\begin{equation} \hat{\beta}_0 = \overline{Y} - \hat{\beta}_1 \times \overline{X}_1 - \hat{\beta}_2 \times \overline{X}_2 - \ldots - \hat{\beta}_m \times \overline{X}_m \tag{5.4} \end{equation}\] Again, it is an adjustment of the mean of the dependent variable, subtracting means multiplied by slopes of each predictor variable.

The estimates of the slopes are straightforward to compute using matrix algebra.¹¹ However, we won’t cover matrix algebra here, so we will leave the estimation of the slopes to statistical software.

The (unbiased) estimate of the error variance is also a direct generalization of the estimate for a simple regression model: \[\begin{equation} \hat{\sigma}^2_\epsilon = \frac{\sum_{i=1}^n (Y_i - \hat{\mu}_{Y|X_1, \ldots, X_m})^2}{n-\text{npar}(M)} \tag{5.5} \end{equation}\] Here, \(\text{npar}(M)\) is the number of parameters in the model, excluding the error variance (i.e., it is the number of \(\beta_j\) parameter, where \(j=0,\ldots,m\). So \(\text{npar}(M) = m + 1\).

Estimating the model in which we predict Trump votes from hate groups and education level gives the following estimates: \[\texttt{trump_votes}_i = 81.99 + 1.314 \times \texttt{hate_groups}_i - 1.219 \times \texttt{education}_i + \hat{\epsilon}_i \quad \quad \hat{\epsilon}_i \sim \mathbf{Normal}(0, 6.643)\] A main thing to note here is that the estimated slope of \(\texttt{hate_groups}\) is exactly the same as in Equation (5.1) when first computing the residual Trump votes and hate groups! This is because the slopes in a multiple regression model are unique effects of a predictor on the dependent variable, after removing the effect of all other predictors from both that predictor and the dependent variable. So, equivalently, the slope of \(\texttt{education}\) in the model above is identical to the slope we would obtain by first computing the residual Trump votes from \[\texttt{trump_votes}_i = \hat{\beta}_0 + \hat{\beta}_1 \times \texttt{hate_groups}_i + \hat{\epsilon}_{\text{vote},i}\]

and the residual education from \[\texttt{education}_i = \hat{\beta}_0 + \hat{\beta}_1 \times \texttt{hate_groups}_i + \hat{\epsilon}_{\text{education},i}\] and then determining the slope of the residual \(\texttt{education}\) in the model \[\hat{\epsilon}_{\text{vote},i} = \beta_0 + \beta_1 \times \hat{\epsilon}_{\text{education},i} + \epsilon_i\]

5.4 Inference

Testing hypotheses regarding the parameters of multiple regression models is analogous to testing the parameters of simple regression models. So one approach is to look at the sampling distribution of the estimates under the null hypothesis and compute the appropriate \(t\) statistic. In a multiple regression model, the \(t\)-statistic follows a \(t\)-distribution with \(n-(m+1)\) degrees of freedom, where as before \(m\) stands for the number of predictors in the model. As we estimate a slope for each predictor, as well as an additional intercept parameter, the total number of estimated parameters is \(\text{npar} = m + 1\). Hence, the degrees of freedom equals \(n-\text{npar}\).

Testing, for each parameter, the null hypothesis \(H_0: \beta_j = 0\), where \(j = 0,\ldots,2\), we obtain the results given in 5.1.

Table 5.1: Null-hypothesis significance tests using the t statistic.
	\(\hat{\beta}\)	\(\text{SE}(\hat{\beta})\)	\(t\)	\(p(\geq \lvert t \rvert)\)
Intercept	81.99	6.155	13.32	0.000
Hate groups per million	1.31	0.510	2.58	0.013
% bachelors degree or higher	-1.22	0.184	-6.62	0.000

Alternatively, we can compare models in which we either fix the parameter to the value assumed in the null hypothesis or not. As discussed in the previous chapter, this is a more general approach. When comparing two multiple regression models, we use the \(F\)-statistic defined in Equation (4.8): \[F = \frac{\frac{\text{SSE}(R) - \text{SSE}(G)}{\text{npar}(G) - \text{npar}(R)}}{\frac{\text{SSE}(G)}{n-\text{npar}(G)}}\] For instance, to test the null hypothesis \(\beta_1 = 0\) in the model defined in Equation (5.3), we compare this general model \[\text{MODEL G: } \quad \texttt{trump_votes}_i = \beta_0 + \beta_1 \times \texttt{hate_groups}_i + \beta_2 \times \texttt{education}_i + \epsilon_i\]
to a restricted model where we fix \(\beta_1 = 0\): \[\begin{aligned} \text{MODEL R: } \quad \texttt{trump_votes}_i &= \beta_0 + 0 \times \texttt{hate_groups}_i + \beta_2 \times \texttt{education}_i + \epsilon_i \\ &= \beta_0 + \beta_1 \times \texttt{education}_i + \epsilon_i \end{aligned}\]
In other words, we would compare the multiple regression model to a simple regression model where we don’t include \(\texttt{hate_groups}\), as the null hypothesis assumes that this predictor has no unique effect on the dependent variable. If the Sum of Squared Error of MODEL R is not substantially higher than that of MODEL G, then there would be no good evidence that \(\beta_1 \neq 0\). Remember that the Sum of Squared Error of MODEL R can never be lower than that of MODEL G, i.e. \(\text{SSE}(R) \geq \text{SSE}(G)\). This is because estimating the parameters of MODEL G by maximum likelihood is equivalent to minimizing the Sum of Squared Error. If \(\hat{\beta}_1 = 0\) in MODEL G, that would mean that the SSE of both models is exactly the same. If \(\hat{\beta}_1 \neq 0\) in MODEL G, then \(\text{SSE}(R) > \text{SSE}(G)\), because otherwise the estimated MODEL G would not have maximised the likelihood (or equivalently minimised the SSE).

We have already shown results for MODEL G. The Sum of Squared Error for that model is \(\text{SSE}(G) = 2074.202\). Estimating MODEL R gives \[\texttt{trump_votes}_i = 90.07 - 1.357 \times \texttt{education}_i + \hat{\epsilon}_i \quad \quad \hat{\epsilon}_i \sim \mathbf{Normal}(0, 7.022)\] with \(\text{SSE}(R) = 2366.875\). The \(F\) statistic then becomes: \[\begin{align} F &= \frac{\frac{2366.875 - 2074.202}{3 - 2}}{\frac{2074.202}{50 - 3}} \\ &= \frac{292.673}{44.132} \\ &= 6.632 \end{align}\] To determine whether this value is significant, we need to compare it to an \(F\)-distribution with \(\text{df}_1 = 3-2 = 1\) and \(\text{df}_2 = 50 - 3 = 47\) degrees of freedom. For this distribution and a significance level of \(\alpha = .05\), the critical value is 4.047, which is lower than the computed \(F\) value, and hence the result is significant and we reject the null hypothesis.

Similarly, to test the null-hypothesis \(\beta_2 = 0\) in MODEL G as defined above, we compare this model to a new restricted MODEL R where we fix \(\beta_2 = 0\): \[ \begin{aligned} \text{MODEL R: } \quad \texttt{trump_votes}_i &= \beta_0 + \beta_1 \times \texttt{hate_groups}_i + 0 \times \texttt{education}_i + \epsilon_i \\ &= \beta_0 + \beta_1 \times \texttt{hate_groups}_i + \epsilon_i \end{aligned} \]
This is the model we estimated in the previous chapter, where we determined the estimates as \[\texttt{trump_votes}_i = 42.9 + 2.3 \times \texttt{hate_groups}_i + \hat{\epsilon}_i \quad \quad \hat{\epsilon}_i \sim \mathbf{Normal}(0, 9.142)\] with an associated \(\text{SSE}(R) = 4011.398\). The \(F\) statistic is then computed as \[\begin{aligned} F &= \frac{\frac{4011.398 - 2074.202}{3 - 2}}{\frac{2074.202}{50 - 3}} \\ &= \frac{1937.197}{44.132} \\ &= 43.896 \end{aligned}\] The critical value is the same as before, and hence we also reject this null hypothesis.

To test whether the intercept \(\beta_0 = 0\), we would compare MODEL G to yet another MODEL R: \[ \begin{aligned} \text{MODEL R: } \quad \texttt{trump_votes}_i &= 0 + \beta_1 \times \texttt{hate_groups}_i + \beta_2 \times \texttt{education}_i + \epsilon_i \\ &=\beta_1 \times \texttt{hate_groups}_i + \beta_1 \times \texttt{education}_i + \epsilon_i \end{aligned} \]
which is estimated as \[\texttt{trump_votes}_i = 4.776 \times \texttt{hate_groups}_i + 1.13 \times \texttt{education}_i + \hat{\epsilon}_i \quad \quad \hat{\epsilon}_i \sim \mathbf{Normal}(0, 14.36)\] with \(\text{SSE}(R) = 9904.891\). The \(F\) statistic is computed in the same way as before: \[\begin{align} F &= \frac{\frac{9904.891 - 2074.202}{3 - 2}}{\frac{2074.202}{50 - 3}} \\ &= \frac{7830.689}{44.132} \\ &= 177.438 \end{align}\]

The results of all three tests are collected in Table 5.2.

Table 5.2: Null-hypothesis significance tests by model comparisons and the F statistic.
	\(\hat{\beta}\)	\(\text{SS}\)	\(\text{df}\)	\(F\)	\(p(\geq F)\)
Intercept	81.99	7831	1	177.44	0.000
Hate groups per million	1.31	293	1	6.63	0.013
% bachelors degree or higher	-1.22	1937	1	43.90	0.000
Error		2074	47

When performing such model comparisons, it is very important to remember that what we call MODEL G and MODEL R will depend on the null hypothesis tested. Generally, MODEL G will remain the same within the context of an analysis, but MODEL R will vary from test to test.

There is one more comparison that we could perform, namely to compare MODEL G to a MODEL R without any predictors. This is a so-called omnibus test of the null hypothesis about two parameters simultaneously, namely \(H_0: \beta_1 = 0 \text { and } \beta_2 = 0\). This results in another version of the restricted MODEL R: \[\begin{aligned} \text{MODEL R: } \quad \texttt{trump_votes}_i &= \beta_0 + 0 \times \texttt{hate_groups}_i + 0 \times \texttt{education}_i + \epsilon_i \\ &= \beta_0 + \epsilon_i \end{aligned}\]
This is like the alternative model of the one-sample t-test we considered earlier. Here, the model is estimated as \[\texttt{trump_votes}_i = 49.86 + \hat{\epsilon}_i \quad \quad \hat{\epsilon}_i \sim \mathbf{Normal}(0, 10.09)\] with an associated \(\text{SSE}(R) = 4992.178\). The \(F\) statistic is then computed as \[\begin{align} F &= \frac{\frac{4992.178 - 2074.202}{3 - 1}}{\frac{2074.202}{50 - 3}} \\ &= \frac{1458.988}{44.132} \\ &= 33.06 \end{align}\] The critical value is different this time, because the degrees of freedom are now \(\text{df}_1 = 3 - 1 = 2\) and \(\text{df} = 50 - 3 = 47\), and equals 3.195. Clearly, the computed \(F\) value is well above the critical value, hence we reject the null hypothesis (i.e. reject MODEL R in favour of MODEL G).

5.5 Partitioning and explaining variance

Let’s have another look at the formula for the \(F\) statistic:

\[F = \frac{\frac{\text{SSE}(R) - \text{SSE}(G)}{\text{npar}(G) - \text{npar}(R)}}{\frac{\text{SSE}(G)}{n-\text{npar(G)}}}\] The numerator (top part) consists of a difference between two SSE terms, divided by the difference in the number of estimated parameters. Because the comparison is between nested models, and MODEL G contains more estimated parameters than MODEL R, we can view this part as the average reduction of the SSE (per parameter) due to the additional parameters of MODEL G. The numerator of the \(F\)-statistic is also referred to as the Mean Square Reduced (MSR): \[\text{MSR} = \frac{\text{SSE}(R) - \text{SSE}(G)}{\text{npar}(G) - \text{npar}(R)}\] The denominator (bottom part) in the formula of the \(F\)-statistic consists of the SSE of MODEL G divided by \(n-\text{npar}(G)\). This is an unbiased estimate of the error variance of MODEL G, and also referred to as the Mean Squared Error (MSE): \[\text{MSE} = \frac{\text{SSE}(G)}{n-\text{npar(G)}}\] So an alternative definition of the \(F\)-statistic is \[F = \frac{\text{MSR}}{\text{MSE}}\]

The SSE of any linear model \(M\) is related to an error variance, as the SSE is the sum of squared deviations between the observations and the model predictions (\(\hat{Y} = \mu_{Y|X_1,\ldots,X_m}\)): \[\text{SSE}(M) = \sum_{i=1}^n (Y_i - \hat{Y}_{M,i})^2\] Dividing the SSE by \(n - \text{npar}(M)\) would provide us with an unbiased estimate of the error variance of that model. Although this is only done in the denominator of the \(F\) statistic, which considers MODEL G, it should be clear there is a close relation between any SSE term and an error variance.

A useful perspective on unique effects in multiple regression models and GLMs in general is to consider them as partitioning the variance of the dependent variable in parts that are uniquely attributable to the predictors. If you consider the SSE of a very simple MODEL R \[Y_i = \beta_0 + \epsilon_i\] with estimate \(\hat{\beta}_0 = \overline{Y}\), then for this model \[\text{SSE}(R) = \sum_{i=1}^n (Y_i - \hat{\beta}_0)^2 = \sum_{i=1}^n (Y_i - \overline{Y})^2 = n S_Y^2\] So the SSE of a regression model without predictors is equal to \(n\) times the sample variance of the dependent variable. When we add predictors to the model, we know the SSE can only decrease (or remain the same, but that is rather unlikely). The proportion of the SSE that is reduced by adding a predictor to the model can be seen as the proportion of the variance that is “explained” by the predictor. This proportional reduction in error is a useful measure of the strength of the relation between a predictor and the dependent variable. As usual, this strength will have a true value for the Data Generating Process, and a value we can compute for a given limited dataset. We’ll denote the true value as \(\eta_p^2\). The sample estimate is: \[\begin{equation} \hat{\eta}_p^2 = \frac{\text{SSE}(R) - \text{SSE}(G)}{\text{SSE}(R)} \tag{5.6} \end{equation}\] Because \(0 \leq \text{SSE}(G) \leq \text{SSE}(R)\), the value of \(\hat{\eta}_p^2\) is between 0 and 1 (as a proportion should be). Figure 5.6 depicts such “proportions of variance explained” for a situation with two predictors (\(X_1\) and \(X_2\)) of a dependent variable \(Y\).

Partitioning the variance in a multiple regression model. Each circle represents the variance of a variable. Overlapping regions represent shared variability (e.g. covariance) between variables.

Figure 5.6: Partitioning the variance in a multiple regression model. Each circle represents the variance of a variable. Overlapping regions represent shared variability (e.g. covariance) between variables.

In this Figure, the variance of \(Y\) is represented by the circle labelled \(Y\), and the variance of the predictors is represented by the circles labelled \(X_1\) and \(X_2\) respectively. Overlap between the circles represents shared variance (e.g. covariance) between the variables. First, let’s consider a simple regression model where \(X_1\) is the sole predictor of \(Y\). The proportion of variance of \(Y\) that this model accounts for is the total shared area between \(X_1\) and \(Y\), which is the sum of the regions labelled as \(B\) and \(D\). Similarly, if we’d consider a simple regression model where \(X_2\) is the sole predictor of \(Y\), the proportion of variance of \(Y\) that this model accounts for is the total shared area between \(X_2\) and \(Y\), which is the sum of the regions labelled as \(C\) and \(D\). If we consider a multiple regression model with both \(X_1\) and \(X_2\) as predictors of \(Y\), the proportion of variance of \(Y\) that this model accounts for is the total shared area between \(X_1\), \(X_2\) and \(Y\), which is the sum of the regions labelled as \(B\), \(C\), and \(D\). The area labelled as \(A\) is variance of \(Y\) that cannot be accounted for by \(X_1\) or \(X_2\), and is the SSE (the random, unexplainable part) of a multiple regression model which includes both \(X_1\) and \(X_2\) as predictors. When MODEL R is an intercept-only model (i.e. a model without any predictors) and MODEL G a model with predictors, the estimate of the proportion of variance accounted for is often denoted by \(R^2\). In this case, \(R^2\) reflects the proportion of the variance of the dependent variable \(Y\) accounted for, or “explained”, by the model as a whole.

Note that the area labelled as \(D\) is “explained” by both \(X_1\) and \(X_2\) when each is used as a sole predictor of \(Y\). In a multiple regression model however, the effects of predictors are unique effects, meaning that we consider only that part of the variance of \(Y\) which is solely explained by each predictor. For \(X_1\), this unique proportion of the variance is the area labelled as \(B\), and for \(X_2\) it is the area labelled as \(C\). The area labelled as \(D\) is not uniquely attributable to either \(X_1\) or \(X_2\). Although it is not purely random variability of \(Y\) – in principle, it could be explained by either \(X_1\) or \(X_2\) – because it cannot be uniquely attributed to either \(X_1\) or \(X_2\), it will not contribute to a hypothesis test of the effect of \(X_1\) or \(X_2\). We will discuss this more when we consider the practical problem of multicollinearity later on.

5.6 Effect size and the importance of predictors

In a multiple regression model, you may be interested in which predictors are more important, in the sense that they have a stronger relation with the dependent variable and therefore help you to provide better predictions of the dependent variable than other variables. It is important to realise that this question can not be answered by simply comparing the (absolute) value of the slopes of the predictors. Slopes reflect the change in the conditional mean of the dependent variable for every one-unit increase in the predictor. But what a one-unit increase is depends on the scale of the predictor. For example, I might predict people’s weight by their height and daily calorie intake. Whether I measure their height in meters or centimetres should be arbitrary. If I know someone’s height in centimetres, I also know their height in meters, and the choice of scale (whether centimetres or meters) does not change the relation between height and weight in any meaningful way. But the choice of scale clearly changes the (estimated) slope of the height predictor. If people’s average weight increases by 1 gram for every increase in their height by 1 centimetre, the corresponding increase in weight for an increase in height by 1 meter is 100 grams. In other words, the slope of height in centimetres is 1, but the slope for height in meters is 100. But that obviously doesn’t mean that height in meters is a better or more important predictor of weight than height in centimetres!

Even when predictors are measured on the same scale, their relative importance can not be determined by comparing the (absolute) value of their slopes. For example, we could attempt to predict the impact of a negative life event (such as divorce or a bereavement) by the size of someone’s support network, which could consist of the number of people in their immediate family, and the number of friends. Both these predictors (family and friends) are measured on the same scale (number of people), but they may have rather different ranges (e.g., between 0 and 5 for family, and between 0 and 30 for friends). If for example the slope of friends is smaller than the slope of family, that means that every additional family member may soften the negative impact more than every additional friend. However, because the range of friends is larger, the overall effect of friends on the impact of a negative life events may well be larger than that of family.

To resolve such issues of scale and range, we can attempt to standardize all the predictors. Generally, this is done by the \(Z\)-transformation (Equation (3.2)), i.e. by subtracting the mean from all values and then dividing these by the standard deviation \[Z_{X_j} = \frac{X_{j,i} - \overline{X}_j}{S_{X_j}}\] Conventionally, the same \(Z\)-transformation is also applied to the dependent variable. After this standardization of all (predictor and dependent) variables, we obtain the standardised slopes: \[\begin{equation} \hat{\beta}^\text{std}_{X_j} = \hat{\beta}_j \frac{S_{X_j}}{S_Y} \tag{5.7} \end{equation}\] These standardized slopes¹² reflect the increase or decrease of the dependent variable (in units of standard deviation \(S_Y\)) for every increase in the predictor by one standard deviation \(S_{X_j}\). Although this makes the slopes more comparable, some issues with range may still remain. For one thing, as we are basing the standardisation on sample estimates of the standard deviations, the result will depend on the particulars of the dataset. If, by chance, a dataset includes only people with between 2 and 6 friends, the standardised slope of friends may be much higher than if the dataset included people with between 0 and 20 friends, even if the relation between friends and life-event impact is the same in both datasets.

Determination of the relative importance of predictors is tricky. There is no measure which resolves all issues. But if you take note of potential issues with artificially restricted ranges of predictors, measures of relative importance are not completely useless. They can inform you of the relative size of the effect a predictor has on the dependent variable. When considering whether there is evidence that a predictor has any effect on the dependent variable (i.e., when testing the null-hypothesis \(H_0\): \(\beta_j = 0\)), it is often a good idea to also consider a measure of the effect size. For large datasets, the power of tests to detect any effect of a predictor is generally large. That means that even if a predictor has a very small effect on the dependent variable (i.e. it only changes the conditional mean by a very small fraction), you would still be able to reliably detect such a small effect and reject the null hypothesis. But that would not imply that a predictor is “practically significant”, in the sense that e.g. persuading someone to make one additional friend might help soften the blow of a negative life event in a subjectively perceivable manner.

5.6.1 \(R^2\) changes and the coefficient of (semi-)partial determination

The standardised slopes can be viewed as measures of effect size. But it is more common to consider effect size in terms of “proportion of variance explained”. Personally, I like these measures because they have a clear and bounded scale between 0 (no effect) and 1 (a predictor is perfect in accounting for the dependent variable). Let’s consider how we can use this measure to reflect the effect size for the different predictors in a multiple regression model. To frame our discussion, we will consider a situation where the dependent variable \(Y\) can be modelled by two predictors, \(X_1\) and \(X_2\). In this situation, we can define four different models:

\[\begin{aligned} \text{MODEL R0}: && Y_i &= \beta_0 + \epsilon_i \\ \text{MODEL R1}: && Y_i &= \beta_0 + \beta_1 X_{1i} + \epsilon_i \\ \text{MODEL R2}: && Y_i &= \beta_0 + \beta_1 X_{2i} + \epsilon_i \\ \text{MODEL G}: && Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i \end{aligned}\]

Referring back to Figure 5.6, the Sums of Squared Error of each model can be defined as

\[\begin{aligned} \text{SSE}(R0) &= A + B + C + D = S^2_Y \\ \text{SSE}(R1) &= A + C \\ \text{SSE}(R2) &= A + B \\ \text{SSE}(G) &= A \end{aligned}\]

Recall that the \(R^2\) measure reflects the proportion of variance “explained” by all the predictors in a model, and is the \(\hat{\eta}^2\) of a model compared to an intercept-only model (i.e. MODEL R0). For each of the models apart from R0, we can compute the \(R^2\) measure as

\[\begin{aligned} R^2_{R1} &= \frac{\text{SSE}(R0) - \text{SSE}(R1)}{\text{SSE(R0)}} = \frac{B + D}{A + B + C + D} \\ R^2_{R2} &= \frac{\text{SSE}(R0) - \text{SSE}(R2)}{\text{SSE(R0)}} = \frac{C + D}{A + B + C + D} \\ R^2_{G} &= \frac{\text{SSE}(R0) - \text{SSE}(G)}{\text{SSE(R0)}} = \frac{B + C + D}{A + B + C + D} \end{aligned}\]

We can define a measure of effect size by considering the following question: what additional proportion of the variance of \(Y\) can be explained by including a predictor \(X_j\) in a model. For example, what additional proportion of the variance can be explained by adding the predictor \(X_2\) to MODEL R1? This is easily computed as the difference in \(R^2\) values: \[R^2_{G} - R^2_{R1} = \frac{C}{A+B+C+D}\] Similarly, we can define a measure of the effect of \(X_1\) by considering what additional proportion of the variance is explained by adding the predictor \(X_1\) to MODEL R2, providing \[R^2_{G} - R^2_{R2} = \frac{B}{A+B+C+D}\] These \(R^2\)-change values are also called the coefficient of semi-partial determination. They reflect the unique proportion of variance of \(Y\) that can be attributed to a predictor. Defining the Sum of Squared Reduced (SSR) – the value of this for each predictor is usually given in the “SS” column of a multiple regression ANOVA table – as follows: \[\begin{aligned} \text{SSR}(X_1) &= \text{SSE}(R2) - \text{SSE}(G) = A + B - A = B \\ \text{SSR}(X_2) &= \text{SSE}(R1) - \text{SSE}(G) = A + C - A = C \end{aligned}\] the coefficient of semi-partial determination can be computed as \[\begin{equation} \text{Semi-partial determination of }X = \hat{\eta}^2 = \frac{\text{SSR}(X)}{\text{SSE}(R0)} \tag{5.8} \end{equation}\] The main thing to remember is that the coefficient of semi-partial determination reflects the proportion of the variance of \(Y\) that can be uniquely attributed to a predictor \(X\).

A different measure of effect size is called the coefficient of partial determination. This reflects not the proportion of the total variance of \(Y\) that can be uniquely attributed to a predictor \(X\), but rather the proportion of the variance of \(Y\) that can not be explained by the other predictors in a model, but which can (uniquely) be explained by predictor \(X\). For instance, with reference to Figure 5.6, MODEL R1 leaves \(A + C\) “unexplained”. Of this unexplained variance, predictor \(X_2\) can explain part \(C\) when added to model R1. So the proportion of previously unexplained variance now explained is \(\frac{C}{A+C}\). This proportion can also be computed from \(R^2\) values. As the proportion of unexplained variance for MODEL R1 is simply \(1 - R^2_{R1}\), the proportion of this variance accounted for by \(X_2\) is \(\frac{R^2_{G} - R^2_{R1}}{1 - R^2_{R1}}\). Because the numerator is equal to the coefficient of semi-partial determination, and the denominator is between 0 and 1, the coefficient of partial determination is equal to or larger than the coefficient of semi-partial determination. It can also be computed directly from SSE and SSR terms as \[\begin{equation} \text{Partial determination of }X = \hat{\eta}_p^2 = \frac{\text{SSR}(X)}{\text{SSE}(G) + \text{SSR}(X)} \tag{5.9} \end{equation}\] The coefficient of partial determination is (in the context of ANOVA models, which we will discuss later) also called the partial eta-squared, or \(\eta_p^2\). We have already come across this measure as the proportional reduction in error (Equation (5.6)).

5.7 Assumptions

The assumptions of the multiple regression model, and with that any of the versions of the General Linear Model we discuss, are all really about the errors, which are assumed to be independently and identically distributed: \[\epsilon_i \sim \mathbf{Normal}(0, \sigma_\epsilon)\]

In other words, this implies the following assumptions for a multiple regression and/or GLM:

Normality: the errors \(\epsilon_i\) are Normal-distributed
Unbiasedness: the mean of \(\epsilon_i\) is 0. This implies that the conditional means \(\mu_{Y|X_1,\ldots,X_m}\) are indeed a linear function of the predictors, and that the model predictions are unbiased.
Homoscedasticity: the errors \(\epsilon_i\) have a constant variance \(\sigma_\epsilon^2\) and thus also a constant standard deviation \(\sigma_\epsilon\).
Independence: any error term \(\epsilon_i\) is independent of any other \(\epsilon_j\) (for all \(i \text{ and } j \neq i\)). Independence here means that \(p(\epsilon_j|\epsilon_i) = p(\epsilon_j)\) for all \(i \text{ and } j \neq i\)

It is important to realise that the assumption of Normal-distributed errors does not necessarily translate in assuming that the dependent variable \(Y\) is itself Normal-distributed. This will often not be the case. Nor does the model require the predictors to be Normal-distributed. No assumption is made at all about the distribution of the predictors.

In Section 3.7 we discussed some methods to assess the assumption of Normality (i.e. histograms, QQ-plots and statistical tests such as the Shapiro-Wilk test). The assumption of unbiasedness and homoscedasticity is generally assessed visually with a so-called predicted-by-residual plot. This plot (see Figure 5.7) depicts the residuals as a function of the model predictions (i.e. the estimated conditional means). The unbiasedness assumption implies that for each predicted value \(\hat{Y}\) (i.e. each conditional mean), the residual or error terms are scattered around 0. For instance, it should not be the case that for relatively low and high values of the predictions, the errors are generally above 0, while for medium values of the predictions, the errors are generally below 0. Such a pattern would be indicative of biased predictions and a likely non-linear relation between predictors and the dependent variable. The homoscedasticity assumption implies that the spread of the residuals is equal for each predicted value \(\hat{Y}\). Looking at the predicted vs residual plot in Figure 5.7, there is no clear indication that either assumption is violated. In addition, the QQ-plot indicates that the expected and sample quantiles do not differ radically, at least for the middle quantiles. Note that at the extremes (low and high quantiles), you will generally find more variable results, as quantiles in the tails of a distribution are less reliable.

Figure 5.7: Predicted vs residual plot, and a QQ plot, for MODEL G predicting Trump votes from hate groups and education level.

For comparison, Figure 5.8 shows two examples of predicted vs residual plots where the assumption of unbiasedness and homoscedasticity do not hold.

Predicted vs residual plots for two cases where the assumptions of the GLM do not hold. In the plot of the left, the assumption of unbiasedness is violated. There is clearly a nonlinearity in the data, which result in the underprediction (positive residuals) for low and high predicted values, and an overprediction (negative residuals) for medium predicted values. In the plot on the right, the homoscedasticity assumption is violated. The variability of the residuals clearly increases as a function of the predicted values.

Figure 5.8: Predicted vs residual plots for two cases where the assumptions of the GLM do not hold. In the plot of the left, the assumption of unbiasedness is violated. There is clearly a nonlinearity in the data, which result in the underprediction (positive residuals) for low and high predicted values, and an overprediction (negative residuals) for medium predicted values. In the plot on the right, the homoscedasticity assumption is violated. The variability of the residuals clearly increases as a function of the predicted values.

5.7.1 Transforming variables

In certain cases where the assumptions of unbiasedness and/or homoscedasticity do not hold, it may be possible to transform the dependent variables and/or predictors and obtain a model with these transformed parameters where the assumptions do hold.

For example, the true underlying relation in the left plot of Figure 5.8 was \[Y_i = \beta_0 + \beta_1 \times X_{1,i}^2 + \epsilon_i \quad \quad \epsilon_i \sim \mathbf{Normal}(0,\sigma_\epsilon)\] In this case, a linear model would be obtained by transforming the predictor \(X_1\) with a power-transformation, raising it to the power of 2 (i.e., using \(X_1^2\) rather than \(X_1\) as a predictor). The true underlying relation in the right plot of Figure 5.8 was \[\begin{aligned} Y_i &= \beta_0 + \beta_1 \times X_{1,i} + \epsilon_{i} \quad \quad \epsilon_i \sim \mathbf{Normal}(0,X_{1,i} \times \sigma_\epsilon) \end{aligned}\] In this case, the standard deviation of the errors increases with the value of \(X_1\). We could get a constant standard deviation by modelling \(\frac{Y_i}{X_i}\) as the dependent variable, instead of \(Y_i\).

You might – rightly! – think: but what if I don’t know the true underlying relation between the predictors and the dependent variable? Well, in that case you might have to try a number of different transformations. As there are an infinite number of possible transformations, you would hopefully be lucky enough to arrive at one that works. Common transformations of the dependent variable, which aim to resolve issues with violations of the homoscedasticity assumption, are provided in Table 5.3.

Table 5.3: Some common transformations for the dependent variable in linear models.
Transformation	Name	Usage
\(Y^2\)	square	useful for negatively-skewed data, as it will increase the spread among higher scores compared to lower ones
\(\sqrt{Y}\)	square-root	useful for positively skewed data, as it will decrease the spread among higher scores compared to lower one
\(\log_e{(Y)}\)	natural logarithm	useful for positively skewed data, as it will decrease the spread among higher scores compared to lower one. If there are values equal to or smaller than 1, you can use an alternative form \(\log_e{(Y + c)}\), with e.g. \(c = \min(1,\|\min(Y)\|)\)
\(\log_{10}{(Y)}\)	base 10 logarithm	useful for highly positively skewed data, as it will decrease the spread among higher scores compared to lower one. If there are values equal to or smaller than 1, you can use a similar adjustment as above.

Of course, you don’t have to try different transformations blindly. Inspecting the relations between the predictors and the dependent variable, and between the residuals and predicted values, can sometimes give reasonable clues to potentially useful transformations. The tricky thing is that non-linear transformations of the dependent or predictor variables can change many aspects of the model. For example, if the model \[Y_i = \beta_0 + \beta_1 \times X_{1,i} + \epsilon_i\] provides unbiased predictions, but violates the assumption of homoscedasticity, a transformed model \[\log(Y_i) = \beta_0 + \beta_1 \times X_{1,i} + \epsilon_i\] might resolve some of the issues with the homoscedasticity assumption, but lead to biased predictions. In this case, we might want to transform both the dependent variable and the predictor \[\log(Y_i) = \beta_0 + \beta_1 \times \log(X_{1.i}) + \epsilon_i\] to (hopefully) resolve the issue with homoscedasticity and maintain unbiased estimates. And when we have more than one predictor, the situation becomes even more tricky to resolve. Transforming variables to make a model conform to the assumptions of the linear model is, in some sense, perhaps more of an art than a science. It is certainly not easy, and something that should be done with caution.

A general technique which aims to help you determine the appropriate transformation function is the so-called Box-Cox transformation (see e.g. Sakia, 1992). Describing this procedure in detail is beyond the scope of this book, but knowing of its existence might help you out in the future. And once you become comfortable with linear models, you will become more comfortable with the various ways in which to transform your data as well.

5.7.2 Polynomial regression

If the assumption of unbiasedness is violated because the relation between a predictor and the dependent variable is not linear, you may also try a technique called polynomial regression. For example, if we go back to analysing the relation between hate groups and votes for Trump considered earlier, you might not expect the relation to be exactly linear. It might be more reasonable to assume that the effect of hate groups on votes for Trump diminishes for very large numbers of hate groups. In other words, you might expect the effect of a one-unit increase in hate groups to be larger for moving from one to two hate groups per million citizens, compared to moving from eight to nine hate groups per million citizens. This means that the relation between hate groups and votes for Trump is nonlinear.

There are various techniques to deal with nonlinear relations. Polynomial regression can allow you to estimate nonlinear relations within the context of a linear model. That seems contradictory at first: how can a linear model provide a nonlinear relation? The key to doing so is to effectively transform the predictor in a variety of ways, and then average the functions relating each transformed predictor to the dependent variable. This “averaging” is done automatically, by estimating regression coefficients for each in a multiple regression model.

The transformations considered in a polynomial regression are all power-transformations of a predictor (i.e. \((X_i)^k\), with \(k=1, 2, \ldots\)), up to a certain maximum power, called the degree of the model. For example, a 2nd-degree polynomial model would be

\[Y_i = \beta_0 + \beta_1 \times X_{i} + \beta_2 \times X_i^2 + \epsilon_i\] which is sometimes also called a quadratic model. A 3rd-degree polynomial, also called a cubic model, would be \[Y_i = \beta_0 + \beta_1 \times X_{i} + \beta_2 \times X_i^2 + \beta_3 \times X_i^3 + \epsilon_i\] The power-transformed predictors (e.g., \(X^2\) and \(X^3\)) are treated as any other predictor. But, because when we know the value of the original predictor \(X\), we also exactly know the value of each transformed predictor, the conditional mean \(\mu_{Y|X,X^2,X^3}\) is really just a function of \(X\). Hence, we can unambiguously consider the predictions of the model for each possible value of \(X\).¹³ These are shown, for a quadratic (second-order) and cubic (third-order) polynomial regression model in Figure 5.9. As you can see, both the quadratic and cubic model predict a nonlinear relation between hate groups and Trump votes. The quadratic model indicates that the effect of hate groups diminishes with more hate groups. In the cubic model, the effect even appears to reverse, such that after an initial increase, Trump votes decrease with more hate groups. As such, the quadratic model seems more plausible.

Figure 5.9: Polynomial regression models for the relation between hate groups and Trump votes.

At this point in time, I would like you to just be aware of the possibility of polynomial regression. We will hopefully return to it at some later stage.

5.8 Multicollinearity: Redundancy between predictors

Multicollinearity is a rather fancy word for interdependence between predictors in a linear model. Interdependence means that the values of a predictor are related to the values of other predictors, such that you can predict one from the others. In Figure 5.6, this is represented by the partial overlap of the circles representing \(X_1\) and \(X_2\). A high covariation between predictors indicates redundancy between them, in the sense that variation in one predictor can be accounted for by another. If this shared variation is also shared with the dependent variable (e.g. region \(D\) in Figure 5.6), then this can not be attributed uniquely to any of those covarying predictors. While the model as a whole may be able to account for a large proportion in the variation of the dependent variable, the unique variance “explained” by individual predictors could be very small so that the hypothesis tests are not significant for any predictor, even though the “whole model test” (i.e. the comparison between the full model and an intercept-only model) is highly significant. This makes parameter inference tricky, to say the least.

Figure 5.10 illustrates the difference between a model without redundancy between between predictors and with (substantial) redundancy between predictors. In the case of no redundancy, each predictor uniquely accounts for a substantial proportion (regions labelled as \(B\) and \(C\)) of the variance of the dependent variable. In the case of redundancy between the predictors, while the overlapping region between each predictor and the dependent variable is the same size as before (region \(B + D\) for \(X_1\), and region \(C + D\) for \(X_2\)), because a large part of this (region \(D\)) is shared between \(X_1\) and \(X_2\), the unique parts (\(B\) and \(C\) respectively) are substantially smaller. In the most extreme case, the circles representing \(X_1\) and \(X_2\) would be completely overlapping, meaning that neither accounts for a unique proportion of the variation in \(Y\). Multicollinearity is an extension of this concept, referring to dependency between more than two predictors.

Figure 5.10: Partitioning the variance in a multiple regression model. Each circle represents the variance of a variable. Overlapping regions represent shared variability (e.g. covariance) between variables.

Another way to see the problem of multicollinearity is to consider the confidence interval of a slope in a multiple regression model. The confidence interval can be computed as \[\begin{equation} \hat{\beta}_j \pm \sqrt{\frac{F_{1,n - \text{npar}(G);\alpha} \text{MSE}(G)}{n S^2_{X_j} (1-R^2_{X_j})}} \tag{5.10} \end{equation}\] In this rather complicated looking equation, \(F_{1,n - \text{npar}(G);\alpha}\) is the critical value for the \(F\)-distribution with \(\text{df}_1 = 1\) and \(\text{df}_2 = n - \text{npar}(G)\) degrees of freedom and a significance level \(\alpha\) (e.g. \(\alpha = .05\) for a 95% confidence interval). \(\text{MSE}(G) = \frac{\text{SSE}(G)}{n - \text{npar}(G)}\) is the Mean Squared Error of MODEL G, \(S^2_{X_j}\) is the sample variance of \(X\) (not the unbiased estimator¹⁴), and \(R^2_{X_j}\) is the \(R^2\) of a model predicting \(X_j\) from all other predictors \[X_{j,i} = \beta_0 + \beta_1 X_{1,i} + \ldots + \beta_{j-1} X_{j-1,i} + \beta_{j+1} X_{j+1,i} + \ldots + \beta_m X_{m,i} + \epsilon_i\] The value of \(R^2_{X_j}\) is a measure of the redundancy of the predictor \(X_j\). For \(X_1\), this is represented in the right-hand plot of Figure 5.10 as the proportion \(\frac{D+G}{B+D+E+G}\). For \(X_2\), the corresponding proportion is \(\frac{D+G}{C+D+F+G}\).

If the value of \(X_j\) could be perfectly predicted from the values of all other predictors, then \(X_j\) would not be able to provide any information about the dependent variable \(Y\) over and above the other predictors. Remember that the slope in a multiple regression model represents an increase or decrease in the conditional mean of \(Y\) for a one-unit increase in the predictor. If any increase in \(X_j\) can be perfectly predicted by an increase or decrease in the other predictors, then those in- and decreases of the other predictors can also perfectly account for the associated in- or decrease in the conditional mean of \(Y\). Very high values of \(R^2_{X_j}\) indicate that a predictor \(X_j\) will not be able to uniquely account for much variation in \(Y\). The value of \(1 - R^2_{X_j}\) then conversely represents the uniqueness of predictor \(X_j\), and this value is commonly referred to as the “tolerance”: \[\begin{equation} \text{tolerance}(X_j) = 1 - R^2_{X_j} \tag{5.11} \end{equation}\]

All else being equal, a decrease in the tolerance will increase the width of the confidence interval. In the extreme where \(1 - R^2_{X_j} = 0\), the denominator would be 0, and the confidence interval would be \[\hat{\beta}_j \pm \sqrt{\frac{F_{1,n - \text{npar}(G);\alpha} \text{MSE}(G)}{0}} = \hat{\beta}_j \pm \infty\] An infinite confidence interval means that we would not be able to reject any null-hypothesis regarding the slope of \(X_j\). In other words: anything goes.

A tolerance of 0 and complete redundancy is not common if you are working with real variables. It can certainly happen if you include the same predictor twice in a model, or if you for instance include a number of predictors and also their case-wise mean. For example, in the model \[Y_i = \beta_0 + \beta_1 X_{1,i} + \beta_2 X_{2,i} + \beta_3 \left( \frac{X_{1,i} + X_{2,i}}{2} \right) + \epsilon_i\] the third predictor (the case-wise average of \(X_1\) and \(X_2\)) is completely redundant. We can rewrite this model in an equivalent form as \[\begin{aligned} Y_i &= \beta_0 + (\beta_1 + \tfrac{1}{2} \beta_3) X_{1,i} + (\beta_2 + \tfrac{1}{2} \beta_3) X_{2,i} + \epsilon_i \\ Y_i &= \beta_0 + \beta_1' X_{1,i} + \beta_2' X_{2,i} + \epsilon_i \end{aligned}\] which would give exactly the same predictions of \(Y_i\). Moreover, while we can estimate the slopes \(\beta_1'\) and \(\beta_2'\) as usual, we would not be able to determine uniquely which part of \(\hat{\beta}_1'\) should be assigned to \(\beta_1\), and which part should be assigned to \(\tfrac{1}{2}\beta_3\). Any such assignment would be arbitrary. For instance, if \(\hat{\beta}_1' = 2\), then we could say that \(\beta_1 = 1\) and \(\beta_3 = 2\), but also that \(\beta_1 = 2\) and \(\beta_3 = 0\), without changing anything substantial about the relation between \(Y\) and \(X_1\) and \(X_2\).

5.8.1 Detecting and dealing with multicollinearity

Generally, tolerance values below \(1- R^2_{X_j} = .20\) are considered potentially problematic. This is by no means a precise cut-off value, but it can guide you nonetheless. Some statistics programs also report the variance inflation factor (VIF), which is defined as \[\text{VIF} = \frac{1}{1 - R^2_{X_j}}\] As this is just a transformation of the tolerance, I don’t really see the point. Although it might be more intuitive as a measure of collinearity in the sense that high values indicate stronger collinearity, the scale of this measure is between 1 and \(\infty\), and when possible, I find a bounded scale more easily interpretable. But, as there is a one-to-one relation between the tolerance and VIF measures, both provide exactly the same information, so either is fine.

Multicollinearity is not a major issue if you are mainly concerned with predicting the dependent variable. The model as a whole will provide the best possible predictions, with or without collinearity. Problems arise mainly when your goal is to infer and test the true values of the parameters as reflecting relations in the Data Generating Process. When there is redundancy between predictors (high multicollinearity), the estimates of the slopes of each predictor become unreliable, and hypothesis tests will have low power. A “quick-and-dirty” solution is to simply remove some of the collinear predictors, keeping a set of predictors which are more independent of each other. The choice of which predictors to remove can be rather arbitrary however, and eliminating a predictor from consideration in an analysis does not mean that this predictor has no theoretical importance. Another option, if possible, is to increase the sample size \(n\), which may reduce the width of the confidence intervals to more agreeable magnitudes. Alternatively, you could consider transforming all predictors into a set of orthogonal (uncorrelated) variables using a technique such as Principal Components Analysis, where each new variable is a linear combination of the original predictors. Those new predictors can be difficult to interpret, however. Another option is to use a technique called ridge regression (Hoerl & Kennard, 1970), which allows inclusion of all predictors in the model with a reduced variance of the parameter estimates, but thereby introduces some bias in these estimates.

5.9 Outliers

When we analysed the data regarding Trump votes and hate groups up to now, we excluded the data from the District of Columbia (Washington D.C.), because, while it is not part of any US state, it is itself not a state either. If we had included the data from this district, the results would have been quite different. Figure 5.11 shows the Trump votes and hate groups with the Distric of Columbia included. You can see that the data point for the District of Columbia is far removed from all other data points. The District of Columbia has both a very high number of hate groups (30.83 per million citizens), and a low percentage of votes for Trump (4.1%). This makes the data District of Columbia rather unusual when compared to the 50 states. The plot also shows the estimated regression line of a simple regression model (with hate groups as a single predictor) for all the data, and the data excluding the District of Columbia. After inclusion of the District of Columbia, the estimated relation goes from positive to negative!

Figure 5.11: Percentage of votes for Trump in the 2016 elections for 50 US states and the District of Columbia, as a function of the number of hate groups per 1 million citizens. The solid line is the estimated regression line for all data, and the dotted line the estimated regression line excluding the District of Columbia.

When we estimate a multiple regression model for all the data, we obtain the results given in Table 5.4. Instead of a significant positive relation between hate groups and Trump votes, we now find the slope is estimated close to 0, and not significant. Note that this estimated relation is not negative (as in Figure 5.11), which illustrates the fact that the unique relations determined in multiple regression models can differ substantially from the bivariate relations of simple regression models.

Table 5.4: Null-hypothesis significance tests by model comparisons and the F statistic.
	\(\hat{\beta}\)	\(\text{SS}\)	\(\text{df}\)	\(F\)	\(p(\geq \lvert F \rvert)\)
Intercept	93.517	18525.97	1	367.803	0.000
Hate groups per million	0.054	2.29	1	0.046	0.832
% bachelors degree or higher	-1.484	3912.90	1	77.684	0.000
Error		2417.72	48

The question now is: how reasonable are these new results? When looking at all 50 states (excluding the District of Columbia), there appears to be rather good evidence for a positive relation between hate groups and Trump votes, whether assessing this relation in a simple regression model or in a multiple regression model where we control for the possible confounding effect of education level. That these results change dramatically after inclusion of a single additional data point indicates that this additional data point might be an outlier. An outlier can be rather imprecisely defined as an unusual data point that does not follow the general trend of the rest of the data. Clearly, the District of Columbia does not appear to follow the general trend in terms of Trump votes and hate groups.

Figure 5.12: Percentage of votes for Trump in the 2016 elections for 50 US states and the District of Columbia, as a function of the percentage of citizens with a bachelors degree or higher. The solid line is the estimated regression line for all data, and the dotted line the estimated regression line excluding the District of Columbia.

Figure 5.12 shows the relation between Trump votes and education level for all the data. While the District of Columbia is again far removed from the other data points, it doesn’t seem to “break” the negative relation between education level and Trump votes. Indeed, the estimated regression lines for a simple regression model predicting Trump votes from education level don’t differ all that much whether we include the District of Columbia or not.

So, sometimes “unusual” data points can have a large influence on model estimates, and sometimes not. Outliers which do not affect the models estimates much are of relatively little concern. However, outliers which have an undue influence on the model estimates are better removed from the dataset. You don’t generally want the results of an analysis to depend largely on a single observation (it just wouldn’t be fair to the other data points). For an observation to have a large effect on the model estimates, it generally has to have values on the predictor variables which are far from the average, and a value on the dependent variable which is far removed from the true regression plane (i.e. the true residual \(\epsilon_i\) is large). If an observation has values on the predictor variables which are close to other observations, then those other observations generally reduce the impact of the outlier, because they will have values on the dependent variable which are not unusual (thus “pulling” the model estimates towards them and with that towards the true values). If an observation has a value on the dependent variable which is not far removed from the true regression plane, then it will not bias the solution either.

There are several measures which aim to determine the unusualness and undue influence of observations on the model estimates. Commonly used ones include the leverage or lever (which assess whether the predictor values are far from the average), Studentized residuals (which aim to estimate whether an observation is far from the true regression line), and Studentized deleted residuals and Cook’s distances (which aim to assess both). Some of these (e.g. Studentized deleted residuals) are effectively hypothesis tests applied to each individual observation. That has the benefit of providing them with a formal criterion to identify observations as outliers or not. On the other hand, by performing a total of \(n\) hypothesis tests with a significance level of \(\alpha\), on average \(n \times \alpha\) of these tests will involve a Type 1 error (a false rejection of the null hypothesis that an observation is not an outlier). These measures are therefore likely to classify observations as outliers which are not truly outliers. Other measures such as Cook’s distance use more heuristic cut-off values to determine when there is a needs to carefully consider whether an observation is an outlier or not.

At this point, I don’t want to go into the mathematical details of these outlier detection measures. For now, I just want you to be aware of the potential problems that arise from outliers. Often, a thorough visual exploration of the data will indicate potential outliers. If you suspect an outlier, it is useful to repeat an analysis with and without including the potential outlier. When the results differ substantially, you will then need to carefully consider whether there are good reasons to exclude the outlying observation from the analysis. In the case of the District of Columbia, there are several reasons to exclude the District of Columbia. Not only does this electoral district has an unusual high number of hate groups, and an unusual low number of Trump votes, the district is not a state but a single city (Washington D.C.), with consequently a much higher population density than the 50 states. As the political capitol of the United States, it is likely to be different in many aspects related to politics. As such, treating this as a separate entity from the 50 US states seems reasonable.

5.10 In practice

Linear regression is the “bread and butter” of data analysis. Whilst the name is generally reserved for models with only metric predictors, we we will see later on, the model we have discussed here is the foundation of the General Linear Model.

Explore the data. Check the distribution of the dependent variable and each (potential) predictor. Are there outlying or otherwise “strange” observations? Also explore pairwise relations between the (potential) predictors and the dependent variable, as well as between the predictors themselves. This may point to potential issues with the assumption of linearity (although pairwise linearity or nonlinearity is not necessarily reflective of linearity in a multiple regression model).
Estimate the multiple regression model. Then check for potential issues in the assumptions with e.g. predicted vs residual plots. This may point to potential outliers, but you can also check for overly influential cases with a measure such as Cook’s distance. A value of Cook’s distance larger than 1 is generally considered as a cut-off. If there are clear outliers in the data, remove these, and then re-estimate the model. Also check for issues with multicollinearity.
Report the results. When reporting the results, make sure that you include all relevant statistics. For example, one way to write the results of the analysis reported in this chapter is as follows:

We assessed the relation between votes for Donald Trump and the presence of hate groups in 50 states of the USA with a multiple linear regression model (we excluded data from the District of Columbia, as this was a clear outlier). For each state, we computed the number of hate groups per one million citizen, to correct for differences in population size in the states. As the level of education might be a possible confound, we controlled for education level (measured as the percentage of the population with a Bachelor’s degree or higher) by including this as a second predictor. The model accounted for a large proportion of the variance in Trump votes, \(R^2 = .58\), \(F(2, 47) = 33.06\), \(p < .001\). The analysis showed a significant positive effect of hate groups on votes for Donald Trump, \(b = 1.31\), 95% CI \([0.29, 2.34]\), \(t(47) = 2.58\), \(p = .013\). In addition, we found a significant negative relation between education level and votes for Donald Trump, \(b = -1.22\), 95% CI \([-1.59, -0.85]\), \(t(47) = -6.63\), \(p < .001\).

References

Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.

Sakia, R. M. (1992). The box-cox transformation technique: A review. Journal of the Royal Statistical Society: Series D (The Statistician), 41, 169–178.

A vector with the intercept and slopes can be computed by first constructing a so-called design matrix \(\mathbf{X}\), where the first column contains only 1’s, and the remaining columns contain values for each predictor. The vector with estimates is then computed as \(\hat{\boldsymbol{{\beta}}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}\), where \(\mathbf{Y}\) is the vector with the values of the dependent variable, and \(^\top\) stands for the matrix transpose, and \(\mathbf{X}^{-1}\) for the inverse of matrix \(\mathbf{X}\).↩︎
In some of the literature and output of statistical software, these are referred to as beta-coefficients. As we are using beta (\(\beta\)) to reflect the true value of a linear model parameter, this may be confusing.↩︎
In usual multiple regression models, the values of the other predictors \(X_k\) can take a variety of values for any focal predictor \(X_j\), \(j \neq k\). Here there is a perfect–but crucially_nonlinear relation between the predictors.↩︎
you could also use that, but then you’d have to replace the preceding \(n\) by \(n-1\)↩︎