Chapter 4 Simple linear regression

In this chapter, we continue our journey into the General Linear Model by extending the very simple model of the previous chapter to include a predictor. The resulting model assumes there is a linear relation between the dependent variable and the predictor, and is also known as a simple linear regression model. We will look at the parameters of the model in detail, and discuss their estimation, as well as testing whether their values are equal to a priori ones of interest.

4.1 Trump, votes, and hate groups

Donald Trump was perhaps the most divisive president in American history. The 2016 US elections were mired in controversy. Some reports indicate that the number of white nationalist hate groups has gone up by 55% in the Trump era. One study found a strong relation between Trump support and anti-immigrant sentiment, racism, and sexism (Schaffner, Macwilliams, & Nteta, 2018).

Paul Ricci collated data about the number of hate groups in the different US states and votes for Trump. A scatterplot of these two variables is provided in Figure 4.1. You can see that in states where there are relatively few hate groups, there appear to also be relatively few votes for Trump. In states where there are a relatively large number of hate groups, there appear to be relatively many votes for Trump.

Figure 4.1: Percentage of votes for Trump in the 2016 elections for 50 US states and the number of hate groups per 1 million citizens

To assess the relation between Trump votes and the number of hate groups, we will use a simple linear regression model. We are interested in a model which allows us to predict the percentage of votes for Trump from the number of hate groups. In other words, votes for Trump is the dependent variable, and hate groups a predictor variable. A predictor variable is sometimes also referred to as an independent variable.

As the name suggests, a linear regression model involves a line, a straight one in fact. This straight line represents the average or expected value of the dependent variable \(Y\) (i.e. the percentage of Trump votes) over all cases which have a particular value on the predictor (or independent) variable \(X\) (i.e. the number of hate groups). For example, we may aim to infer the average percentage of Trump votes in all states with 5.55 hate groups per 1 million citizens, or the average percentage of Trump votes in states with 0 hate groups. The model allows the values of the dependent variable for individual cases to vary around the average. So the percentage of Trump votes for two different states, each with 5.55 hate groups, may differ from the average as well as each other. The model assumes that such variability can be described through a Normal distribution. Importantly, the model assumes that the variance of this normal distribution is the same, no matter the value of the predictor.

4.2 The model

In the previous chapter, we used a simple statistical model: \[Y_i \sim \mathbf{Normal}(\mu,\sigma)\] This model assumes that each observation is independently drawn from a Normal distribution, with a mean \(\mu\) and a standard deviation \(\sigma\). We can state this model in an equivalent way as: \[\begin{equation} Y_i = \mu + \epsilon_i \quad \quad \quad \epsilon_i \sim \mathbf{Normal}(0,\sigma) \tag{4.1} \end{equation}\] In this two-part formulation, the first part (the formula on the left side) decomposes each observation \(Y_i\) into a structural part (here the mean, \(\mu\)) and a random part that we usually call the error (\(\epsilon_i\)). We don’t know much about this error. We just assume that it is drawn from a Normal distribution with a mean of 0 and standard deviation \(\sigma\). This assumption about the distribution forms the second part (the formula on the right side of Equation (4.1)).7

A model like this doesn’t allow for very precise predictions. As the error term is assumed to be completely random, it is unpredictable. So if you were asked to predict the value of \(Y\), all you could really do is to use \(\mu\) as your prediction. If, however, the dependent variable (e.g. votes for Trump) is related to a predictor variable (e.g. number of hate groups), then we should be able to use the predictor to make more precise predictions. Linear regression allows us to do this in a straightforward way.

The bivariate regression model is depicted in Figure 4.2. More formally, the model can be defined as follows:

\[\begin{equation} Y_i = \beta_0 + \beta_1 \times X_{i} + \epsilon_i \quad \quad \quad \epsilon_i \sim \mathbf{Normal}(0,\sigma_\epsilon) \tag{4.2} \end{equation}\]

Here, \(Y_i\) is the value of the \(i\)-th observation of the dependent variable (with \(i = 1, \ldots, n\)) and \(X_i\) the value of the \(i\)-th observation of the predictor variable. These are the observable parts of the data. The model contains three important parameters:

  • The intercept \(\beta_0\), which is the mean of \(Y\) when \(X=0\).
  • The slope \(\beta_1\), which reflects the increase or decrease in the mean of \(Y\) for every 1-unit increase in the predictor \(X\). By a 1-unit increase, we simply mean that the value of \(X\) goes up by 1, e.g. from \(X=2\) to \(X=3\), or from \(X=12.63\) to \(X=13.63\).
  • The standard deviation \(\sigma_\epsilon\) of the error or residual terms \(\epsilon_i\). The errors are assumed to be drawn independently from the same Normal distribution, with a mean of 0 and standard deviation \(\sigma_\epsilon\).
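To make these parameters concrete, the following small simulation sketch generates data from the model in Equation (4.2). All values (the parameter values and the range of the predictor) are made up purely for illustration; they are not estimates from the Trump data.

```r
# A minimal simulation sketch of the model in Equation (4.2).
# All parameter values below are made up for illustration only.
set.seed(123)
n     <- 50
beta0 <- 40    # intercept: the mean of Y when X = 0
beta1 <- 2     # slope: change in the mean of Y per 1-unit increase in X
sigma <- 9     # standard deviation of the errors
x <- runif(n, min = 0, max = 8)                          # some predictor values
y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)  # structural part + error
plot(x, y)
abline(a = beta0, b = beta1)   # the true regression line
```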

Figure 4.2: The simple regression model. \(\textbf{A}\): Simple regression aims to capture the relation between two variables, a dependent variable (\(Y\)) and a predictor variable (\(X\)). Each case in a dataset provides a pair of observations on both variables. \(\textbf{B}\): The intercept is the mean of \(Y\) when \(X=0\) and is the point at which the regression line crosses the y-axis. The slope determines the steepness of the regression line and represents the increase (or decrease) in the mean of \(Y\) for every 1-unit increase in the predictor \(X\). \(\textbf{C}\): the error terms or residuals are the vertical distances of each observed Y-value from the regression line. \(\textbf{D}\): The errors in a regression model are assumed to follow a Normal distribution around the regression line.

Remember, unlike variables, parameters are not observable. We can infer their values from the data by estimation and/or performing hypothesis tests. But we can never be completely sure that such inferences are correct. Let’s consider the formula for the simple regression model again: \[Y_i = \beta_0 + \beta_1 \times X_{i} + \epsilon_i\] This formula involves both observable variables (\(Y\) and \(X\)) and unobservable parameters (\(\beta_0\), \(\beta_1\), and \(\epsilon_i\)). Indeed, the errors \(\epsilon_i\) are really parameters as well, because they cannot be directly observed. There are many of these (one for each observation \(i = 1, \ldots, n\)), and they are not of primary concern. The other part, \(\beta_0 + \beta_1 \times X_i\), determines the regression line, representing the average value of \(Y\) for all cases with a particular value on the predictor variable \(X\). These average values are conditional means: they are the mean of \(Y\), conditional upon a particular value of the predictor \(X\). “Conditional upon” here means that we only consider cases with a particular value on the predictor variable. For instance, we may consider all states with 5.55 hate groups. The conditional mean is then just the average of the dependent variable (votes for Trump) in this particular group of states. We can use \(\mu_{Y|X}\) to denote the mean of \(Y\) conditional upon \(X\). If we then use \(\mu_{Y|X_i}\) to denote the mean of \(Y\) conditional upon \(X\) having the value \(X_i\) (i.e. the value of \(X\) for case \(i\) in the dataset), we can define the conditional mean in terms of the regression model as: \[\begin{equation} \mu_{Y|X_{i}} = \beta_0 + \beta_1 \times X_{i} \tag{4.3} \end{equation}\]

Now, plugging the conditional mean into Equation (4.2) we get \[Y_i = \mu_{Y|X_{i}} + \epsilon_i \quad \quad \quad \epsilon_i \sim \mathbf{Normal}(0,\sigma_\epsilon)\] which, in many respects, is very similar to the simple model of Equation (4.1). The key difference is the use of a conditional mean \(\mu_{Y|X_{i}}\) instead of an unconditional (constant) mean \(\mu\). Effectively, within each group of cases with the same value on the predictor variable \(X\) (e.g. all states with 5.55 hate groups), there is just a single conditional mean. Within such a group, the model is equivalent to the simple model of Equation (4.1). What is new here is that we now also take into account differences between different groups. In particular, we assume that these groups differ just in the conditional mean \(\mu_{Y|X}\), whilst the standard deviation of the errors (\(\sigma_\epsilon\)) is the same for all groups. We finally assume that there is a simple, linear relation between the value of \(X\) and the conditional mean \(\mu_{Y|X}\). So the regression model extends the simple model of the previous chapter by using a straight line to represent the relation between the (conditional) mean of the dependent variable and a predictor variable.

4.3 Estimation

It can be shown that for the model specified in Equation (4.2), the maximum likelihood estimates of the model parameters are \[\begin{equation} \hat{\beta}_0 = \overline{Y} - \hat{\beta}_1 \times \overline{X} \tag{4.4} \end{equation}\] for the intercept, and \[\begin{equation} \hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})}{\sum_{i=1}^n (X_i - \overline{X})^2} \tag{4.5} \end{equation}\] for the slope. Note that to estimate the intercept, we first need the estimate of the slope. So let’s focus on this one first. The top part of the division (the numerator) contains a sum of deviations of the predictor values (\(X_i\)) from its average (\(\overline{X}\)) multiplied by deviations of the values \(Y_i\) of the dependent variable from its average (\(\overline{Y}\)). Let’s consider these multiplied deviations in more detail. Each deviation is positive (larger than 0) when the value is higher than the average, and negative (smaller than 0), when the value is lower than the average. So the multiplied deviations are positive whenever both values are larger than their average, and whenever both values are below their average (a negative value multiplied by another negative value is positive). If we were to divide the sum of the multiplied deviations by \(n\) (the number of observations), we’d get the average of these multiplied deviations. This average is also called the covariance between \(X\) and \(Y\): \[\text{Cov}(X,Y) = \frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})}{n}\] Note that, as an estimator, this provides biased estimates of the true covariance. An unbiased estimator of the true covariance is obtained by dividing by \(n-1\) instead of \(n\) (just as for the variance). Going back to our example, the covariance between Trump votes and hate groups would be positive whenever states with higher-than-average Trump votes are generally also states with higher-than-average hate groups, and whenever states with lower-than-average Trump votes are generally also states with lower-than-average hate groups. The covariance would be negative, on the other hand, whenever states with higher-than-average Trump votes are generally states with lower-than-average hate groups, and whenever states with lower-than-average Trump votes are generally states with higher-than-average hate groups. A positive or negative covariance is indicative of a relation between \(X\) and \(Y\). Indeed, the well-known Pearson correlation coefficient is a standardized covariance, where the standardization scales the correlation to always be between -1 and 1 and involves dividing the covariance by the product of the standard deviations of both variables: \[\begin{align} r_{X,Y} &= \frac{\text{Cov}(X,Y)}{S_X \times S_Y} \\ &= \frac{\frac{\sum_{i=1}^n (X_i - \overline{X})(Y_i - \overline{Y})}{n}}{\sqrt{\frac{\sum_{i=1}^n (X_i - \overline{X})^2}{n}} \times \sqrt{\frac{\sum_{i=1}^n (Y_i - \overline{Y})^2}{n}}} \end{align}\]
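As a rough sketch of these computations in R, assuming a data frame `dat` with the predictor in a column `hate_groups_per_million` (the name shown in the R output later in this chapter) and the dependent variable in a column `percent_Trump_votes` (a name assumed here for illustration):

```r
# Covariance and correlation computed "by hand"; `dat` and the column name
# `percent_Trump_votes` are assumptions, not taken from the original data files.
x <- dat$hate_groups_per_million
y <- dat$percent_Trump_votes
n <- length(x)

cov_xy <- sum((x - mean(x)) * (y - mean(y))) / n   # average of multiplied deviations
r_xy   <- cov_xy /
  (sqrt(sum((x - mean(x))^2) / n) * sqrt(sum((y - mean(y))^2) / n))

cov(x, y)   # R's cov() divides by n - 1 (the unbiased estimator)
cor(x, y)   # identical to r_xy, because the n's cancel out
```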

Going back to the estimate of the slope (Equation (4.5)), we can view this as a different way of standardizing the covariance. Looking at the bottom part of the division (the denominator), we can see it consists of the sum of squared deviations of the predictor values from its mean. If we were to divide this sum by the number of observations, we’d get the variance of \(X\). As dividing both the top part (numerator) and bottom part (denominator) in a division by the same value does not affect the outcome of the division (i.e. \(\frac{a}{b} = \frac{a/c}{b/c}\)), we can choose to divide both by \(n\) so the numerator becomes the covariance and the denominator the variance (we could also divide both by \(n-1\) so they become unbiased estimators of the covariance and variance). So an alternative way of computing the estimate of the slope is \[\hat{\beta}_1 = \frac{\text{Cov}(X,Y)}{S^2_X}\] Note that the variance of \(X\) equals the product of the standard deviation of \(X\) and itself, as \(S^2_X = S_X \times S_X\). So the slope estimate looks quite a bit like the correlation coefficient, where instead of the standard deviation of \(Y\), we use the standard deviation of \(X\) twice. With a little algebraic manipulation, we can also state the slope estimate in terms of the correlation as \[\hat{\beta}_1 = \frac{S_Y}{S_X} r_{X,Y}\] The reason for going into these alternative formulations is to show you that the slope tells us something about the relation between \(X\) and \(Y\), just like the covariance and correlation do. If the sample correlation is 0, then so is the estimated slope. It is important to realise that we have been discussing the estimate of the slope, not the true value itself. But the same relations hold for the true values. If we denote the true correlation as \(\rho_{X,Y}\), then the true value of the slope can be defined as \[\beta_1 = \frac{\sigma_Y}{\sigma_X} \rho_{X,Y}\] The true value of the slope is 0 when the true correlation between \(X\) and \(Y\) equals \(\rho_{X,Y} = 0\). It would also be 0 if the true standard deviation of \(Y\) equals \(\sigma_Y = 0\), but this implies that \(Y\) is a constant and that is not a very interesting situation.
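The equivalence of these formulations is easy to check numerically. A short sketch, continuing with the `x` and `y` vectors defined above:

```r
# Three equivalent ways to compute the slope estimate
b1_a <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # Equation (4.5)
b1_b <- cov(x, y) / var(x)                                         # covariance / variance
b1_c <- (sd(y) / sd(x)) * cor(x, y)                                # via the correlation
c(b1_a, b1_b, b1_c)   # all three give the same value
```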

That was perhaps a little tortuous, and we haven’t even discussed the estimate of the intercept! Remember that the intercept represents the mean value of \(Y\) for all cases where \(X=0\). This average is often not so interesting itself, although in our example, we might be interested in what the average percentage of votes for Trump would be in the absence of any hate groups. Equation (4.4) shows that we can estimate this value by adjusting the sample average \(\overline{Y}\) by subtracting \(\hat{\beta}_1 \times \overline{X}\) from it. How come? Well, it can be shown that the regression line always passes through the point \((\overline{X},\overline{Y})\). The derivation for showing that this has to be the case is not that interesting, so you’ll just have to trust me, or look it up elsewhere. But this implies that \[\overline{Y} = \beta_0 + \beta_1 \times \overline{X}\] and then we can simply subtract \(\beta_1 \times \overline{X}\) from both sides to get \(\overline{Y} - \beta_1 \times \overline{X} = \beta_0\).

Finally, we can also estimate \(\sigma_\epsilon\), the standard deviation of the error. As usual for variances and standard deviations, the maximum likelihood estimate is biased, so we’ll focus on an unbiased estimator. Like before, an estimate of the variance is computed from a sum of squared deviations from an estimated mean. In this case, we need to use the estimated conditional means \[\hat{\mu}_{Y|X_i} = \hat{\beta}_0 + \hat{\beta}_1 \times X_i\] to compute the following estimate of the error variance: \[\begin{equation} \hat{\sigma}^2_\epsilon = \frac{\sum_{i=1}^n (Y_i - \hat{\mu}_{Y|X_i})^2}{n-2} \tag{4.6} \end{equation}\] Note that we are dividing by \(n-2\) here, rather than by \(n-1\) as we did when estimating the variance of \(Y\) in a model without predictors. The reason for this is that we are now using two noisy parameter estimates (i.e. \(\hat{\beta}_0\) and \(\hat{\beta}_1\)), rather than just one. As usual, to get the estimate of \(\hat{\sigma}_\epsilon\), we can just take the square root of \(\hat{\sigma}^2_\epsilon\).
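Continuing the sketch from above, the intercept (Equation (4.4)) and the unbiased estimate of the error variance (Equation (4.6)) can be computed as:

```r
# Intercept estimate (Equation (4.4)), using the slope b1_a computed above
b0 <- mean(y) - b1_a * mean(x)

# Estimated conditional means and the unbiased estimate of the error variance
mu_hat     <- b0 + b1_a * x
sigma2_eps <- sum((y - mu_hat)^2) / (n - 2)   # note the division by n - 2
sigma_eps  <- sqrt(sigma2_eps)                # lm() reports this value as the
                                              # "residual standard error"
```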

4.3.1 Estimating the relation between Trump votes and hate groups

So, what are the estimates of the model predicting Trump votes by hate groups? We can calculate the estimates relatively easily by first computing the sample means of the predictor and dependent variable, their sample variances, and the covariance between the predictor and dependent variable:

\(\overline{\mathtt{hate}} = 3.03\), \(\overline{\mathtt{votes}} = 49.9\), \(S^2_\mathtt{hate} = 3.71\), \(S^2_\mathtt{votes} = 99.8\), and \(\text{Cov}(\mathtt{hate},\mathtt{votes}) = 8.53\)

The estimate of the slope is then \[\hat{\beta}_1 = \frac{8.527}{3.707} = 2.3\] and the estimate of the intercept is \[\hat{\beta}_0 = 49.862 - 2.3 \times 3.028 = 42.897\]

Of course, we would not normally bother with calculating these estimates “by hand” in this way. We would rely on statistical software such as R or JASP to calculate the estimates for us. In any case, we can write the estimated model as:

\[\texttt{votes}_i = 42.897 + 2.3 \times \texttt{hate}_i + e_i\] According to this model, the average percentage of Trump votes in states without any hate groups is 42.897. For every one additional hate group (per million citizens), the percentage of Trump votes increases by 2.3. You can view the resulting regression line (the conditional means of votes for Trump as predicted by the model) in Figure 4.3.

Figure 4.3: Percentage of votes for Trump in the 2016 elections for 50 US states and the number of hate groups per 1 million citizens with the estimated regression line.
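In R, fitting the model takes a single call to `lm()`. A sketch, again assuming the data frame `dat` with the column names used earlier (the name of the dependent variable is an assumption):

```r
# Fit the simple regression model; a formula of the form y ~ x includes
# an intercept by default.
fit_g <- lm(percent_Trump_votes ~ hate_groups_per_million, data = dat)
coef(fit_g)      # intercept and slope, matching the hand calculations above
summary(fit_g)   # adds standard errors, t values, and p-values
```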

4.4 Hypothesis testing

The estimated model indicates a positive relation between the number of hate groups and votes for Donald Trump. But the slope is an estimated parameter. It might be that, in reality, there is no relation between hate groups and Trump votes. But because we have limited and noisy data, sometimes the estimated slope might be positive, and other times it might be negative. So the question is whether we have enough evidence to reject the null hypothesis that \(\beta_1 = 0\).

As before, there are two main ways in which to look at testing whether the parameters of the model are equal to a priori values. The first is to consider how variable the parameter estimates are under the assumption that the true parameter is identical to the a priori value. The second way is to compare two models, one in which the parameter is fixed to the a priori value, and one where it is freely estimated. Both of these ways will provide us with the same outcome. The model comparison approach is more flexible, however, as it also allows you to test multiple parameters simultaneously. We’ll start by discussing the first method, but this will be the last time we use it for a while. After that, we’ll focus on model comparison throughout.

4.4.1 Sampling distribution of estimates

Remember, an estimate (whether of the mean or another parameter) is a noisy reflection of the true value of that parameter. The noise comes from having access only to limited data, not all the data the data generating process can produce. Suppose in reality there is no relation between hate groups and Trump votes, so that the true slope is \(\beta_1 = 0\). In that case, the model becomes \[\begin{align} Y_i &= \beta_0 + 0 \times X_i + \epsilon_i \\ &= \beta_0 + \epsilon_i \end{align}\] which is identical to the simple model in Equation (4.1), renaming \(\beta_0 = \mu\) and \(\sigma_\epsilon = \sigma\). Then, the true value of the intercept would be \(\beta_0 = \mu\). If we knew the true value of the standard deviation \(\sigma_\epsilon\), we’d have a fully specified model (a Normal distribution) which we can use to generate as many alternative data sets as we’d like. By generating these data sets, and estimating the slope of our model for each, we can get an overview of the variability of the estimates when the true slope equals \(\beta_1 = 0\). An unbiased estimator ensures that on average, these estimates equal the true value. The main thing of interest is then the variability around this value. A consistent estimator ensures that this variability becomes smaller with larger data sets. But to understand how well we can estimate the parameter for the present data, we would simulate data sets with the same number of observations (so \(n=50\)). Unfortunately, we don’t know the true value of \(\sigma_\epsilon\). Our data provides an estimate of \(\sigma_\epsilon\), but we know this estimate is noisy itself. Thinking in the same way about the sampling distribution of \(\hat{\sigma}_\epsilon\), we could first sample values of \(\sigma_\epsilon\), and then use each of these to generate a data set to estimate \(\beta_1\). Doing this many (many many!) times would give us a good overview of the variability of those estimates. As before, we don’t have to actually simulate the data sets. If the model corresponding to the null hypothesis is true, then we can derive that the standardized estimates (each estimate minus its hypothesized value, divided by its standard error) follow a t-distribution. Thus, we should look at the \(t\) value our data provides, and evaluate this value within the context of the sampling distribution derived under the model where \(\beta_1 = 0\). For both parameters (intercept and slope), the same logic applies.
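If you want to see this logic in action, the following sketch simulates many data sets under the null hypothesis \(\beta_1 = 0\) and inspects the resulting slope estimates. The “true” intercept and error standard deviation used here are stand-ins chosen for illustration, not known values.

```r
# Simulating the sampling distribution of the slope estimate when beta1 = 0.
# beta0 and sigma are assumed values, used only to make the sketch runnable.
set.seed(1234)
beta0 <- 50
sigma <- 10
x <- dat$hate_groups_per_million   # predictor values treated as fixed
n <- length(x)

slope_estimates <- replicate(10000, {
  y_sim <- beta0 + 0 * x + rnorm(n, mean = 0, sd = sigma)
  coef(lm(y_sim ~ x))[2]
})
hist(slope_estimates)   # centred on 0
sd(slope_estimates)     # the variability (standard error) under these assumptions
```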

Actually, the same procedure can be used to test the hypothesis that a parameter takes any given value, i.e. \(H_0: \beta_j = \underline{\beta}_j\). Often, the chosen null-value is \(\underline{\beta}_j = 0\), but that does not need to be the case. In general then, for parameters \(\beta_j\) (where \(j = 0\) or 1), the \(t\)-statistic to test the null-hypothesis \(H_0: \beta_j = \underline{\beta}_j\) is computed as \[\begin{equation} t = \frac{\hat{\beta}_j - \underline{\beta}_j}{\text{SE}(\hat{\beta}_j)} \quad \quad \quad t \sim \mathbf{T}(n-2) \end{equation}\] where \(\mathbf{T}(n-2)\) denotes a standard Student t-distribution with \(n-2\) degrees of freedom, and \(\text{SE}(\hat{\beta}_j)\) is the standard error of the estimate, which you should remember is the standard deviation of the sampling distribution of the estimates. I won’t bore you with how to compute this standard error; the equations aren’t overly insightful, and statistical software does a good job at computing standard errors. One thing to realise though is that the computed standard error is valid for datasets with exactly the same values on the predictor. In other words, it assumes the predictor values are fixed (a part of the Data Generating Process). If you’d collect a different dataset with different values for e.g. the number of hate groups, the standard error would also be different.
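As a sketch of this computation, using the `fit_g` object from the earlier `lm()` example (the null value \(\underline{\beta}_j\) is set through `b_null` and can be changed to any a priori value):

```r
# t-statistics and two-sided p-values for H0: beta_j = b_null
est    <- coef(summary(fit_g))[, "Estimate"]
se     <- coef(summary(fit_g))[, "Std. Error"]
b_null <- 0                                    # the a priori value under H0
t_stat <- (est - b_null) / se
p_val  <- 2 * pt(-abs(t_stat), df = nrow(dat) - 2)
cbind(t_stat, p_val)
```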

R would for instance provide the following results for this regression model:

```
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)                 42.9      2.409   17.80    0.000
hate_groups_per_million      2.3      0.672    3.43    0.001
```

The values listed for (Intercept) concern \(\beta_0\), and the values listed for hate_groups_per_million concern the slope \(\beta_1\). You can see the parameter estimates are identical to those computed earlier (phew, no mistakes there!). Note that the standard errors for the two parameters are quite different in magnitude. This is not so surprising, as they reflect quite different things: the intercept is a particular value of the dependent variable, while the slope represents an increase in the dependent variable for an increase in the predictor. If you’d change the scale of the predictor (e.g., from hate groups per million to hate groups per 10,000 citizens), the slope would change, as well as the corresponding standard error. Changing the scale in this way would not affect the standard error of the intercept. The values of the \(t\)-statistic are those for tests where the null-hypotheses are \(\beta_0 = 0\) and \(\beta_1 = 0\). These test statistics are obtained by simply dividing each estimate by the corresponding standard error (for a different test value \(\underline{\beta}_j \neq 0\), you would first subtract this value from the estimate, and then divide this difference by the standard error).
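The point about rescaling the predictor is easy to verify. A sketch (the new column name is made up for this example):

```r
# Rescale the predictor from hate groups per million to hate groups per 10,000
# citizens; the slope and its standard error are multiplied by 100, while the
# intercept and its standard error stay the same.
dat$hate_groups_per_10k <- dat$hate_groups_per_million / 100
fit_rescaled <- lm(percent_Trump_votes ~ hate_groups_per_10k, data = dat)
coef(summary(fit_g))          # original scale
coef(summary(fit_rescaled))   # rescaled predictor
```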

In a manner analogous to that depicted in Figure 3.6, you then determine critical values for the \(t\)-statistic (based on the degrees of freedom and the significance level \(\alpha\)), and determine whether the value you computed for each parameter lies within the critical range. If it does, the test result is called significant, and you reject the null hypothesis. If not, then the result is called not significant, and you don’t reject the null hypothesis. Now that \(p\)-values are easy to compute with statistical software, it is more common to check these instead. Remember, the \(p\)-value is the probability of the computed \(t\)-statistic or a more extreme value, assuming the null hypothesis is true. If the \(p\)-value is smaller than the chosen significance level (e.g. \(\alpha = .05\)), the test is significant, and the null hypothesis is rejected. In the results above, you can see that both \(p\)-values are below the significance level \(\alpha = .05\). Hence, we can reject the null hypothesis that \(\beta_0 = 0\) and the null hypothesis that \(\beta_1 = 0\). In other words, there is good evidence that, in the absence of hate groups, the percentage of people voting for Trump is not equal to 0.8 Also, there is good evidence that there is a relation between the number of hate groups and Trump votes.

Personally, I find the \(t\) statistic quite intuitive in the context of a one-sample t-test. Generalizing the concept to a standardized estimate (dividing the estimate by its standard error) is also reasonably intuitive. However, in a multi-parameter model such as here, the sampling distribution of a single parameter is dependent on the estimation of all the other parameters. For instance, the test of the null hypothesis \(H_0: \beta_1 = 0\) (i.e., no effect of hate groups on Trump votes) is based on deriving the sampling distribution of \(\hat{\beta}_1\) in a model where \(\beta_1 = \underline{\beta}_1\), but all the other parameters (i.e. \(\beta_0\) and \(\sigma_\epsilon\)) are not assumed known. Hence, for each possible sample, these values would need to be estimated. This uncertainty is dealt with similarly as before, resulting in a \(t\)-distribution, but now there are two sources of uncertainty (two parameters to estimate). Hence, the degrees of freedom are \(n-2\) here.

4.4.2 Model comparison

The fact that, in a multi-parameter model, a test of one parameter is not conducted in isolation, but rather in the context of all the other parameters in the model, is more explicit in the model comparison approach. When we compare two models, we have to make clear what the parameters are in each: what are the unknown quantities which we will have to estimate, and what are quantities which we can assume a precise value for?

As before, we will consider comparing nested models, in which a restricted MODEL R is a special case of a more general MODEL G. Sticking to simple linear regression models, the most general model we have is that of Equation (4.2).

Just like for the simple model of the previous chapter, it turns out we can compute the likelihood of a model as a function of a sum of squared deviations. We will not go through the derivation of this again, as it is rather similar and equally tedious. But, because we are now calling these deviations errors, we will start to refer to them as a Sum of Squared Errors (SSE). The SSE is an overall measure of model error, whilst the likelihood is an overall measure of model fit. The SSE is inversely related to the likelihood: the higher the SSE, the lower the likelihood of the model.

Let’s call the model of Equation (4.2) MODEL G. Before we go on, I want to warn you that what we call MODEL G, and what we call MODEL R, can change from situation to situation. Basically, the identity of MODEL G and MODEL R are “local” to the particular model comparison. You can think of MODEL G as a parent, and MODEL R as a child. While the relation between them is similar, within an extended family, someone can be both a parent to one family member, and be the child of another. This is the form of flexibility that you will need when thinking about nested models. A model can be both more general than one model, and more restricted than another. I will come back to this soon.

First, let’s consider what the Sum of Squared Errors of a model is. The easiest way to define this is in terms of the predictions of each model. Recall that a regression model has a structural and random part. The structural part defines the conditional mean of the dependent variable, while the random part concerns the random variation of the actual observations around the conditional means. As the random part is by definition unpredictable, there is not much we can do with that in terms of forming predictions. So we’re stuck with the structural part. Although not a universal principle, there are many situations in which it makes sense to predict outcomes by the (conditional) mean. If we do this, then we can predict the outcomes with MODEL G as

\[\hat{Y}_{G,i} = \hat{\beta}_{G,0} + \hat{\beta}_{G,1} \times X_i\] where \(\hat{Y}\) stands for a predicted value. We’re using the same “hat” for this as an estimated value, because it is really also a good estimate of what the value of \(Y_i\) might have been if it was another observation with the same value for \(X\). We are also assigning the subscript “G” to all the estimates, to distinguish them from those of a different model. Now let’s consider a MODEL R in which we assume that there is no relation between Trump votes and hate groups, so that \(\beta_1 = 0\). The predictions for this model would be

\[\hat{Y}_{R,i} = \hat{\beta}_{R,0} + 0 \times X_i = \hat{\beta}_{R,0}\] Now that we have two models to make predictions with, we can write the corresponding Sum of Squared Errors of each as: \[\begin{equation} \text{SSE}(M) = \sum_{i=1}^n \left(Y_i - \hat{Y}_{M,i} \right)^2 \tag{4.7} \end{equation}\] where we can replace the general letter \(M\) (for Model) with either \(G\) or \(R\), to get \(\text{SSE}(G)\) or \(\text{SSE}(R)\), respectively. So the Sum of Squares is based on a difference between each observation and the model prediction for that observation. These are thus the prediction errors. If the prediction was equal to the prediction of the true model (e.g. \(\beta_0 + \beta_1 X_{i}\)), then these would be equal to the true error terms \(\epsilon_i\). But because we only have an estimated model, they are effectively estimates of the true errors. You can see these (unsquared) errors for the two models in Figure 4.4.


Figure 4.4: Estimated regression lines for MODEL G (left) and MODEL R (right) with \(\beta_1 = 0\) and the errors.

When you compare the errors between the models, you can see that MODEL G does not provide a better prediction for each observation. Sometimes the distance from an observation to the regression line is larger for MODEL G than for MODEL R. However, MODEL G does appear to provide a better prediction for most observations. This is unsurprising. MODEL G is estimated by maximising the likelihood, and for models with Normally-distributed errors, maximising the likelihood is equivalent to minimising the Sum of Squared Errors. As the likelihood of MODEL G can never be lower than that of MODEL R, the Sum of Squared Errors of MODEL G can never be higher than that of MODEL R. In other words, it is always the case that \[\text{SSE}(R) \geq \text{SSE}(G)\] Because of this, we cannot just select the model with the lowest SSE, as this would mean we’d always select MODEL G. We need to find a way to determine whether \(\text{SSE}(G)\) is sufficiently lower than \(\text{SSE}(R)\) to make us believe that MODEL G is indeed superior to MODEL R.

In the previous chapter, we discussed that the \(t\)-statistic can be used to perform a test which is equivalent to the likelihood ratio test, and that this was useful because the sampling distribution of the \(t\)-statistic is known, while the sampling distribution of the likelihood ratio is difficult to determine. For general linear models, there is a similar argument that leads to a new statistic, which can be viewed as a generalization of the \(t\)-statistic. This is the \(F\)-statistic. It was given the letter in honour of Sir Ronald A. Fisher, a rather brilliant and very influential statistician, who derived the statistic in the 1920s.

For comparing two linear models, where MODEL R is a special case of MODEL G, we can define the \(F\)-statistic as: \[\begin{equation} F = \frac{\frac{\text{SSE}(R) - \text{SSE}(G)}{\text{npar}(G) - \text{npar}(R)}}{\frac{\text{SSE}(G)}{n-\text{npar}(G)}} \tag{4.8} \end{equation}\] Here, \(\text{npar}(G)\) stands for the number of parameters which are estimated in MODEL G, and \(\text{npar}(R)\) for the number of parameters which are estimated in MODEL R. In counting the number of estimated parameters, we are excluding the standard deviation of the errors \(\sigma_\epsilon\).9 In the present example, MODEL G then has two estimated parameters: \(\hat{\beta}_{G,0}\) and \(\hat{\beta}_{G,1}\), while MODEL R has just one: \(\hat{\beta}_{R,0}\). If we compute the SSE for each model (i.e. taking the vertical distances to the regression line in Figure 4.4 and squaring them), we get \(\text{SSE}(G) = 4011.398\) and \(\text{SSE}(R) = 4992.178\). Putting these values into Equation (4.8) gives us the following value for the \(F\)-statistic:

\[\begin{aligned} F &= \frac{\frac{4992.178 - 4011.398}{2 - 1}}{\frac{4011.398}{50 - 2}} \\ &= \frac{980.779}{83.571} \\ &= 11.736 \end{aligned}\]
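In R, this comparison can be sketched by fitting both models and either computing the \(F\)-statistic by hand or letting `anova()` do the work (using the `fit_g` object and assumed column names from before):

```r
# MODEL R: intercept only (beta1 fixed to 0)
fit_r <- lm(percent_Trump_votes ~ 1, data = dat)

sse_g <- sum(residuals(fit_g)^2)   # SSE(G)
sse_r <- sum(residuals(fit_r)^2)   # SSE(R)
n     <- nrow(dat)

F_stat <- ((sse_r - sse_g) / (2 - 1)) / (sse_g / (n - 2))
F_stat
anova(fit_r, fit_g)   # the same F value, with its degrees of freedom and p-value
```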

If MODEL R is true, the sampling distribution of the \(F\)-statistic follows an \(F\)-distribution. This distribution has two parameters, and both are degrees of freedom. We will refer to these as \(\text{df}_1\) and \(\text{df}_2\). The first one reflects the difference in the number of estimated parameters between the models \[\begin{equation} \text{df}_1 = \text{npar}(G) - \text{npar}(R) \tag{4.9} \end{equation}\] and the second one the number of observations minus the number of estimated parameters of the more general model \[\begin{equation} \text{df}_2 = n - \text{npar}(G) \tag{4.10} \end{equation}\]

Figure 4.5: The F distribution with \(\text{df}_1 = 1\) and \(\text{df}_2 = 48\), and the critical value for \(\alpha = .05\)

Note that the value of the \(F\)-statistic can never be negative; the distribution is thus defined only over positive values of \(F\). High values of \(F\) indicate that the MODEL G has substantially less error than MODEL R. Another thing to note is that, whenever \(\text{df}_1 = 1\), there is a direct relation between the \(F\) statistic and the \(t\) statistic: \(F = t^2\), or conversely, \(\sqrt{F} = t\). It is easy to check that this is indeed the case here: \(\sqrt{11.736} = 3.426\), which is the value of the \(t\) statistic we computed earlier.

The critical value of the \(F\) statistic, with \(\alpha = .05\) and \(\text{df}_1 = 2-1 = 1\) and \(\text{df}_2 = 50 - 2 = 48\), is 4.043, and any value of \(F\) above this critical value would result in a rejection of the null hypothesis. So, because the \(F\) value we computed was 11.736, which is larger than the critical value, we reject the null hypothesis \(H_0: \beta_1 = 0\), and conclude that there is evidence of a relation between the number of hate groups and Trump votes. Instead of checking whether the \(F\) value is larger than a critical value, we can also compute the \(p\)-value, which is the probability of an \(F\) value equal to or larger than the sample value 11.736 in the distribution depicted in Figure 4.5. The \(p\)-value of this test can be stated as \(p(F \geq 11.736 | \text{df}_1 = 1, \text{df}_2 = 48) = 0.00127\). As this probability is smaller than the significance level \(\alpha = .05\), this again implies we reject the null hypothesis.
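The critical value and the \(p\)-value can be obtained directly from the \(F\)-distribution, for instance with R's `qf()` and `pf()` functions:

```r
qf(0.95, df1 = 1, df2 = 48)                        # critical value for alpha = .05
pf(11.736, df1 = 1, df2 = 48, lower.tail = FALSE)  # p-value of the observed F
```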

We can also compare MODEL G to a different MODEL R, in which we assume the intercept equals \(\beta_0 = 0\), whilst allowing the slope to take any value. This model forces the regression line to go through the (0,0) point. Estimating this MODEL R gives10 \[\begin{aligned} \hat{Y}_{R,i} &= 0 + \hat{\beta}_{R,1} \times X_i \\ &= 12.389 \times X_i \end{aligned}\] You can see the resulting regression line and error terms in Figure 4.6.

Figure 4.6: Estimated regression lines for MODEL G (left) and MODEL R (right) with \(\beta_0 = 0\) and the errors.

Visually, this alternative MODEL R seems clearly inferior to MODEL G. The SSE of this model is \(\text{SSE}(R) = 30501.73\). Computing the \(F\)-statistic gives \[ \begin{aligned} F &= \frac{\frac{30501.73 - 4011.398}{2 - 1}}{\frac{4011.398}{50 - 2}} \\ &= \frac{26490.33}{83.571} \\ &= 316.981 \end{aligned} \] Because the test involves the same degrees of freedom, the critical value is the same as before. So we reject the null hypothesis again, which here is \(H_0: \beta_0 = 0\). The \(p\)-value now is \(p(F \geq 316.981 | \text{df}_1 = 1, \text{df}_2 = 48) <.0001.\)
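A sketch of this comparison in R: the intercept can be removed from the model formula with `0 +` (or `- 1`), and `anova()` again performs the \(F\)-test (column names as assumed before).

```r
# MODEL R with the intercept fixed to 0
fit_r0 <- lm(percent_Trump_votes ~ 0 + hate_groups_per_million, data = dat)
coef(fit_r0)           # slope estimate of roughly 12.4
anova(fit_r0, fit_g)   # F-test comparing this MODEL R to MODEL G
```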

A nice thing about the \(F\)-statistic is that it is very general, and can be used to compare any two nested linear models. For instance, we could also compare MODEL G to a model where we assume both \(\beta_0 = 0\) and \(\beta_1 = 0\). This model would assume that the dependent variable follows a Normal distribution with a mean of 0. That doesn’t make much sense here, so we won’t compute this test. But the generality of the \(F\)-test to allow testing of multiple parameters simultaneously comes in very handy in the later chapters.

4.4.3 Confidence intervals

The way to compute and interpret confidence intervals for the parameters of a simple linear regression model is analogous to that for the one sample \(t\)-test (see Section 3.5). The formula to compute confidence intervals for the two parameters can be written as: \[\hat{\beta}_j \pm t_{n-2; 1-\tfrac{\alpha}{2}} \times \text{SE}(\hat{\beta}_j)\] where \(t_{n-2; 1-\tfrac{\alpha}{2}}\) is the upper critical value in a \(t\)-distribution with \(n-2\) degrees of freedom and a significance level of \(\alpha\). Using \(\alpha=.05\) gives us the conventional 95%-confidence interval.
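In R, `confint()` computes these intervals directly; the by-hand version below uses the critical \(t\) value with \(n-2\) degrees of freedom (again using the `fit_g` object from earlier):

```r
confint(fit_g, level = 0.95)   # 95% confidence intervals for intercept and slope

# Or by hand:
est   <- coef(summary(fit_g))[, "Estimate"]
se    <- coef(summary(fit_g))[, "Std. Error"]
tcrit <- qt(1 - 0.05 / 2, df = nrow(dat) - 2)
cbind(lower = est - tcrit * se, upper = est + tcrit * se)
```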

4.5 Summary

A simple linear regression is a model of the relation between two variables: the dependent variable \(Y\) and a predictor variable \(X\). The model uses a straight line to associate conditional means, which are expected or average values of the dependent variable, to each possible value of the predictor variable. This straight line has two important parameters. The first one is the intercept (\(\beta_0\)), which is the conditional mean (\(\mu_{Y|X}\)) of the dependent variable for cases where the predictor variable has the value \(X=0\). The second is the slope (\(\beta_1\)), which reflects the increase in the conditional mean for every one-unit increase in the predictor variable; the slope thus determines the steepness of the line. The model assumes that the variability of the dependent variable around these conditional means follows a Normal distribution, with a mean of 0 and a constant standard deviation (\(\sigma_\epsilon\)).

Hypothesis testing for the parameters \(\beta_0\) and \(\beta_1\) can be based on the sampling distribution of the estimates of these parameters under the null-hypothesis, which leads to a t-test. Alternatively, you can perform these hypothesis tests by model comparison, comparing the Sum of Squared Errors of each model with an \(F\)-test. This is equivalent to a likelihood ratio test, but Sums of Squared Errors and the resulting \(F\)-distribution are easier to work with.

A simple (bivariate) regression model is a special case of a multiple regression model, which we will discuss next. As many things that apply to multiple regression models apply to simple regression models as well, we will discuss things like effect sizes and assessing assumptions, in the next chapter.

References

Schaffner, B. F., Macwilliams, M., & Nteta, T. (2018). Understanding white polarization in the 2016 vote for president: The sobering role of racism and sexism. Political Science Quarterly, 133, 9–34.

  7. If you are wondering why this formulation provides the same model as in the previous chapter, you can show this using the properties of the Normal distribution. In the second formulation, you can view \(Y\) as a linear transformation of the error \(\epsilon\): \(Y_i = a + b \times \epsilon_i\), with \(a = \mu\) and \(b=1\). The mean of \(Y\) is then \(a + b \times \mu_\epsilon = \mu + 0 = \mu\), and the standard deviation of \(Y\) equals \(|b| \times \sigma = 1 \times \sigma = \sigma\), which are the same parameters as when directly specifying the Normal distribution for \(Y\).↩︎

  8. It’s probably unwise to mix statistics and politics, but wouldn’t that have been a utopia?↩︎

  9. Although \(\sigma_\epsilon\) is really a parameter in the models, it is generally not of direct interest. It is a so-called “nuisance parameter”, something that we need to take into account, but we’d rather forget about.↩︎

  10. Note that we cannot use Equation (4.5) to estimate the slope now. There is no simple formula for the slope estimate when you fix the intercept to 0, and so you will have to rely on statistical software to do this.↩︎