Chapter 12 Generalized Linear Models

Up to now, we have considered models for which the residuals (model errors) could be assumed to follow a Normal distribution. This is equivalent to modelling the conditional distribution of a dependent variable \(Y\), given the values of predictor variables \(X_j\), as a Normal distribution. Many of the techniques and ideas of such models can be generalised to situations where we can assume that the distribution of a dependent variable, conditional upon the values of predictor variables, follows a different distribution from a class of probability distributions known as the exponential family. Such models are known as generalized linear models (McCullagh & Nelder, 2019). This name makes for easy confusion with the general linear model, so make sure you don’t ignore the presence or absence of “-ized”!

The Normal distribution is part of the exponential family of distributions, as are many other distributions, including the Binomial, Poisson, and Gamma distributions. Some of these distributions concern non-negative integer variables (i.e. variables with values \(0, 1, 2, \ldots\)), others positive continuous variables (i.e. variables with values \(>0\)), or unbounded continuous variables (i.e. variables with values between \(-\infty\) and \(\infty\)). Exponential-family distributions can all be expressed in the same form, but we won’t go into the details of that here.30 What is important is that there is a wide variety of statistical distributions beyond the Normal distribution, and that there are many variables which are not continuous and unbounded, for which assuming a Normal distribution generally does not make sense. Because exponential-family distributions share important characteristics with the Normal distribution, assuming a different exponential-family distribution allows you to choose a distribution which is more natural for the variable in question, whilst still being able to formulate useful models for it. This is what we can do with generalized linear models. A generalized linear model consists of three components:

  1. A random part, which describes the variability in the data by means of an exponential-family probability distribution with a conditional mean \(\mu_{Y|X_1, \ldots, X_m}\).
  2. A structural part, which is a linear model of the form \(\beta_0 + \sum_{j=1}^m \beta_j \times X_j\).
  3. A link function, which connects the structural and random part.

Recall that the general linear model also consists of a random and structural part. What is new for generalized linear models is the link function.

In the remainder of this Chapter, we will first discuss link functions and the common methods of estimation and inference for generalized linear models. We will then focus on widely used instances of generalized linear models: logistic regression, Poisson regression, and loglinear models. We then discuss two extensions (multinomial logistic regression and generalized linear mixed-effects models).

12.2 Estimation

The important parameters of a generalized linear model (Equation (12.1)) are the intercept and slopes \(\beta_j\), \(j=0,\ldots,m\). As before, we will focus on maximum likelihood estimation of the model parameters. Unlike for the general linear model, analytical solutions for the maximum likelihood estimates of the parameters of generalized linear models are usually not available. However, parameter estimates can be obtained by an iterative numerical procedure called Iteratively Reweighted Least Squares (IRLS). You don’t need to understand the details of this. For present purposes, it is enough to know that this method starts with a “guess” of the parameter estimates, and then updates this guess repeatedly to increase the likelihood of the estimated model, until no further improvement is obtained, at which point the algorithm is said to have converged. Whilst modern statistical software generally has robust implementations of this procedure, estimation difficulties may arise in some cases (e.g. when using non-standard link functions). This may be due to the model being a poor representation of the data, but potentially also because of a poor initial guess for the parameter values.
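As an illustration, the sketch below fits a simple generalized linear model by IRLS using R’s `glm()` function. The data frame `dat` and the variables `y` and `x` are hypothetical names, and the data are simulated purely for illustration.

```r
# Minimal sketch: fitting a generalized linear model with glm(), which uses IRLS.
# 'dat', 'y' and 'x' are hypothetical names; the data are simulated for illustration.
set.seed(123)
dat <- data.frame(x = rnorm(100))
dat$y <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.2 * dat$x))

mod <- glm(y ~ x, data = dat, family = binomial())
mod$converged  # TRUE if the IRLS algorithm converged
mod$iter       # number of IRLS iterations that were needed
coef(mod)      # maximum likelihood estimates of the intercept and slope
```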

12.3 Inference in generalized linear models

As usual, we are generally not just interested in the estimates of the model parameters. We know data is noisy and parameter estimates are likely to differ from the true parameter values of the Data Generating Process. As for the general linear model, there are different ways to test whether a parameter \(\beta\) differs from a hypothesised value \(\underline{\beta}\). A first method focusses on the sampling distribution of parameter estimates, assuming the true value is \(\underline{\beta}\). Alternatively, we can compare a MODEL R where we restrict the parameter to take the hypothesised value, to a MODEL G where we assume the value of the parameter is unknown, requiring us to estimate that parameter. These two methods are implemented via the Wald test and likelihood-ratio test, respectively.

12.3.1 Wald test

For large samples, the test statistic \[\begin{equation} z = \frac{\hat{\beta}_j - \underline{\beta}}{\text{SE}(\hat{\beta}_j)} \end{equation}\] approximately follows a standard Normal distribution (i.e. a Normal distribution with mean \(\mu=0\) and standard deviation \(\sigma=1\)). Note that this \(z\) statistic is similar to the \(t\)-statistic for the General Linear Model. What is missing, however, are the degrees of freedom. As the number of observations \(n\) approaches infinity, the \(t\)-distribution converges to the standard Normal distribution. For large-enough \(n\), the difference between the \(t\)-distribution and the Normal distribution is so small that there is little difference when using one or the other to compute things like \(p\)-values (but technically, the use of a Normal distribution is only truly correct when the number of observations is infinite).

Sometimes, the Wald test is presented in an alternative form: \[\begin{equation} z^2 = \left(\frac{\hat{\beta}_j - \underline{\beta}}{\text{SE}(\hat{\beta}_j)}\right)^2 \end{equation}\] which approximately (for large sample sizes) follows a Chi-squared distribution with \(\text{df} = 1\). This is due to the mathematical fact that if a variable \(z\) follows a standard Normal distribution, then \(z^2\) follows a Chi-squared distribution with \(\text{df}=1\) degrees of freedom. Therefore, both statistics give entirely equivalent results. However, the \(z\)-statistic can be used to obtain one-sided tests, whilst the \(z^2\) statistic can only be used to obtain two-sided tests. Given the additional flexibility of the \(z\)-statistic, this is what we will use for our Wald tests. But remember: both statistics are approximate and only exact for an infinite sample size. The Wald test (in either form) therefore requires larger samples to be accurate than the likelihood-ratio-based procedures discussed below.
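A small sketch of how the Wald test can be computed by hand is given below, reusing the hypothetical model `mod` fitted earlier; the hypothesised value `beta_h` is set to 0, which is the default hypothesis reported by most software.

```r
# Wald test by hand for the slope of x, using the hypothetical model 'mod' above.
est    <- coef(summary(mod))["x", "Estimate"]
se     <- coef(summary(mod))["x", "Std. Error"]
beta_h <- 0                                # hypothesised value (the usual default)
z      <- (est - beta_h) / se
2 * pnorm(abs(z), lower.tail = FALSE)      # two-sided p-value from the z form
pchisq(z^2, df = 1, lower.tail = FALSE)    # identical p-value from the z^2 form
```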

12.3.2 Likelihood-ratio test

The Wald test is easy to compute and works well for large-enough samples. A more powerful and reliable test for smaller samples is the likelihood-ratio test. To test whether a parameter differs from a hypothesised value, this involves a comparison between a MODEL R where the parameter is fixed to that hypothesised value, and a MODEL G where the parameter is estimated. For general linear models, the \(F\)-statistic could be used to perform such likelihood-ratio tests. Unfortunately, the wider class of distributions covered by generalized linear models does not allow for these precise tests. Instead, we will need to rely on the approximate Chi-squared test that we introduced in the context of linear mixed-effects models.

The test statistic used for a likelihood-ratio test is, as in Chapter 11, equal to minus two times the log likelihood ratio: \[\begin{aligned} G^2 = -2 \log \left(\text{likelihood-ratio}\right) &= -2 \log \left( \frac{p(\text{DATA}|\text{MODEL R})}{p(\text{DATA}|\text{MODEL G})} \right) \\ &= -2 \log p(\text{DATA}|\text{MODEL R}) - (-2 \log p(\text{DATA}|\text{MODEL G})) \end{aligned}\] Note that we denote the “minus 2 log-likelihood ratio” with the symbol \(G^2\), as this is commonly used in the literature. The letter \(G\) in this notation does not stand for our MODEL G! For large-enough samples, the sampling distribution of this statistic is approximately equal to a Chi-squared distribution: \[G^2 \sim \chi^2(\text{npar}(G) - \text{npar}(R))\] with degrees of freedom equal to the difference in the number of estimated parameters (\(\text{npar}(G) - \text{npar}(R)\)) between the models.
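Below is a sketch of how such a likelihood-ratio test can be carried out in R for the hypothetical model fitted earlier, both by hand and via the built-in `anova()` model comparison.

```r
# Likelihood-ratio (G^2) test comparing MODEL R (slope fixed to 0, i.e. x excluded)
# to MODEL G (slope estimated), for the hypothetical data 'dat' used above.
mod_R <- glm(y ~ 1, data = dat, family = binomial())
mod_G <- glm(y ~ x, data = dat, family = binomial())

G2 <- as.numeric(-2 * (logLik(mod_R) - logLik(mod_G)))
df <- attr(logLik(mod_G), "df") - attr(logLik(mod_R), "df")
pchisq(G2, df = df, lower.tail = FALSE)

anova(mod_R, mod_G, test = "Chisq")   # the same test via model comparison
```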

12.3.3 Confidence intervals

Another option for inference is to compute confidence intervals. A straightforward way to compute (approximate) confidence intervals is to rely on the asymptotically Normal distribution of parameter estimates (as in the \(z\)-statistic for testing parameter values). The confidence interval is computed by adding or subtracting a multiple of the standard error of the estimate from the estimate: \[\begin{equation} \hat{\beta}_j \pm z_{1-\tfrac{1}{2}\alpha} \times \text{SE}(\hat{\beta}_j) \tag{12.3} \end{equation}\] Here, \(z_{1-\tfrac{1}{2}\alpha}\) is the \(1 - \tfrac{1}{2}\alpha\) quantile of the standard Normal distribution (i.e. the value \(z\) such that \(p(Z \leq z) = 1 - \tfrac{1}{2}\alpha\)). For example, for a 95% confidence interval, we use \(\alpha = .05\) and the computation would be \[\hat{\beta}_j \pm z_{.975} \times \text{SE}(\hat{\beta}_j) \approx \hat{\beta}_j \pm 1.96 \times \text{SE}(\hat{\beta}_j)\] More accurate (although still approximate) confidence intervals for small samples can be computed through the profile likelihood (Venzon & Moolgavkar, 1988). Roughly, this method computes a confidence interval for one parameter, e.g. \(\beta_j\), by constructing a profile log-likelihood function \[l_p(\beta_j) = \log p(\text{DATA}|\beta_j, \hat{\theta}_{\neg j|\beta_j})\] where \(\beta_j\) is free to vary, and the remaining parameters of the model, denoted as \(\hat{\theta}_{\neg j|\beta_j}\), are replaced by their maximum likelihood estimates given \(\beta_j\). We can alternatively write this as \[l_p(\beta_j) = \max_{\theta_{\neg j}} \left(\log p(\text{DATA}|\beta_j, \theta_{\neg j})\right)\]

This profile log-likelihood function can then be used to compute likelihood-ratio tests, comparing MODEL G where all parameters are estimated by maximum likelihood, to a model where the parameter of interest \(\beta_j\) is fixed to some value: \[\begin{aligned} -2 \log (\text{profile-likelihood-ratio}) &= -2\, l_p(\beta_j) - (-2\, l(\hat{\theta})) \\ &= 2 \left(l(\hat{\theta}) - l_p(\beta_j) \right) \end{aligned}\] where \(l(\hat{\theta}) = \log p(\text{DATA}|\hat{\theta})\) denotes the maximum of the log likelihood over all parameters (including \(\beta_j\)). As the likelihood-ratio statistic is approximately Chi-squared distributed, we can use the Chi-squared distribution (with \(\text{df} = 1\) degree of freedom) to determine a critical value, e.g. \(\chi^2_{1; 1 - \alpha}\). We can then finally find the left and right bounds of the confidence interval by finding values \(\beta_\text{left}\) and \(\beta_\text{right}\) such that \[2 \left(l(\hat{\theta}) - l_p(\beta_\text{left}) \right) = 2 \left(l(\hat{\theta}) - l_p(\beta_\text{right}) \right) = \chi^2_{1; 1 - \alpha}\] In contrast to the confidence interval based on the Normal approximation (Equation (12.3)), confidence intervals based on the profile likelihood can be non-symmetrical (i.e. the left and right bounds don’t necessarily have the same distance from the estimate). For non-linear models such as generalized linear models, confidence intervals will often not be symmetrical, unless the number of observations is large. As such, confidence intervals based on the profile likelihood are likely to be more accurate. Yet another way to compute confidence intervals is by a parametric bootstrap.33
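The sketch below contrasts the two approaches for the hypothetical model `mod` fitted earlier: a Wald interval computed directly from Equation (12.3), and a profile-likelihood interval via `confint()`.

```r
# Wald (Normal-approximation) and profile-likelihood 95% confidence intervals
# for the slope of x in the hypothetical model 'mod' fitted above.
est <- coef(mod)["x"]
se  <- sqrt(diag(vcov(mod)))["x"]
est + c(-1, 1) * qnorm(0.975) * se   # Wald interval (Equation 12.3)

confint.default(mod, parm = "x")     # the same Wald interval, built-in
confint(mod, parm = "x")             # profile-likelihood interval
                                     # (the profile method comes from the MASS
                                     #  package in older R versions)
```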

12.4 Assessing model fit

For the General Linear Model, the \(R^2\) statistic reflects the “proportion of variance explained” and provides a useful way to assess the fit of the model to the data. For generalized linear models, a straightforward measure of “proportion of variance explained” is not available.34

For generalized linear models, instead of computing something like an \(R^2\) measure, it is more common to test overall model fit. This is done by comparing an estimated MODEL R to a saturated model, which we will denote as MODEL S. A saturated model has as many parameters as unique patterns in the data, and therefore fits the data perfectly. Whilst a model that always fits the data perfectly is not very interesting from a modelling point-of-view, it provides a useful criterion to compare less complex models to. The question asked in this comparison can be stated as: how well does my model perform compared to a perfect model of the Data Generating Process?

The \(-2 \log(\text{likelihood-ratio})\) comparing an estimated MODEL R to the saturated MODEL S is called the residual deviance of MODEL R: \[\begin{equation} D_R = -2 \log \frac{p(\text{DATA}|\text{MODEL R})}{p(\text{DATA}|\text{MODEL S})} \tag{12.4} \end{equation}\] Note that the deviance is effectively a \(G^2\) statistic (the difference is that for the deviance the MODEL G in question is always the saturated MODEL S). As such, under the null-hypothesis that MODEL R is the “true model”, the deviance approximately follows a Chi-squared distribution with degrees of freedom \(\text{df} = \text{npar}(S) - \text{npar}(R)\). As the number of parameters of the saturated model is often equal to the number of observations, this is often identical to \(\text{df} = n-\text{npar}(R)\). This equivalence does not always hold, however: for some models (e.g. loglinear models), the minimum number of parameters needed to achieve perfect fit may be less than the number of observations. A deviance test of overall model fit concerns the null-hypothesis that MODEL R is the true model. Therefore, a significant test result indicates that MODEL R is not the true model, and that a more complex model will fit the data better.
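In R, the residual deviance and its degrees of freedom are stored in the fitted model object, so the test of overall fit is a one-liner; below is a sketch for the hypothetical model `mod` used earlier.

```r
# Deviance test of overall model fit for the hypothetical model 'mod'.
D_R  <- deviance(mod)      # residual deviance
df_R <- df.residual(mod)   # residual degrees of freedom (here n - npar(R))
pchisq(D_R, df = df_R, lower.tail = FALSE)   # p-value of the overall fit test
```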

You can think of the residual deviance similarly to the Sum of Squared Errors (SSE) of a model: the higher the deviance, the lesser the fit of a model to the data. And just like the SSE, the residual deviance can also be written as a sum of the deviance for individual data points.35 Whilst the residual deviance values for individual observations can be used to indicate potential outliers, there are other forms of residuals more useful for this purpose. Among the most widely used residuals for generalized linear models are the so-called standardized Pearson residuals.36 If the model fits the data well, then the standardized Pearson residuals are approximately standard-Normal-distributed. As such, a standardized Pearson residual with an absolute value larger than 2 may indicate a potential outlier or otherwise problematic data point.
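A brief sketch of how these residuals can be inspected for the hypothetical model `mod`:

```r
# Standardized Pearson residuals for the hypothetical model 'mod'.
r_std <- rstandard(mod, type = "pearson")
which(abs(r_std) > 2)   # observations that may deserve a closer look
```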

All in all, the residual deviance is to generalized linear models what the SSE is to general linear models. We can actually rewrite the likelihood-ratio test statistic to compare MODEL R to MODEL G in terms of the residual deviance of both models: \[G^2 = D_R - D_G\] Why? Well, because we can divide both the numerator and denominator in the likelihood ratio by the same term (\(p(\text{DATA}|\text{MODEL S})\)) without changing its value: \[\begin{aligned} G^2 &=-2 \log \left( \frac{p(\text{DATA}|\text{MODEL R})}{p(\text{DATA}|\text{MODEL G})} \right) \\ &= -2 \log \left( \frac{p(\text{DATA}|\text{MODEL R})/p(\text{DATA}|\text{MODEL S})}{p(\text{DATA}|\text{MODEL G})/p(\text{DATA}|\text{MODEL S})} \right) \\ &= -2 \log \left(p(\text{DATA}|\text{MODEL R})/p(\text{DATA}|\text{MODEL S})\right) \\ &\quad - \left(-2 \log \left( p(\text{DATA}|\text{MODEL G})/p(\text{DATA}|\text{MODEL S})\right)\right) \\ &= D_R - D_G \end{aligned}\]

12.5 Logistic regression

Logistic regression is used for dichotomous or binary dependent variables. Such variables can take one of two possible values. An example is a variable indicating whether a question is answered correctly (\(Y = 1\)) or incorrectly (\(Y=0\)). As a binary dependent variable can take only two values, assuming it follows a Normal distribution is not sensible. The same can be said for assuming the model errors (residuals) follow a Normal distribution.

Binary data can be assumed to follow a Bernoulli distribution: \[\begin{equation} p(Y = k) = \theta^k \times (1-\theta)^{1-k} \tag{12.5} \end{equation}\] with \(k=0, 1\). Note that this is a simple case of the Binomial distribution we encountered in Chapter 2.37 Also note that as \(\theta^0 = 1\), and as \(k\) can only be 0 or 1, Equation (12.5) simply states that \[\begin{aligned} p(Y = 1) &= \theta^1 \times (1-\theta)^{0} = \theta \times 1 \\ &= \theta \\ p(Y = 0) &= \theta^0 \times (1-\theta)^{1} = 1 \times (1 - \theta) \\ &= 1 - \theta \end{aligned}\]

The mean of a Bernoulli-distributed variable \(Y\) is \(\mu_Y = p(Y=1) = \theta\). The variance is \(\text{Var}(Y) = \theta \times (1-\theta)\). Note that, in contrast to the Normal distribution, where the mean and variance are determined by separate parameters (\(\mu\) and \(\sigma\), respectively), the Bernoulli distribution is determined by a single parameter (\(\theta\)), and there is a direct relation between the mean (\(\theta\)) and the variance (\(\theta \times (1-\theta)\)). In many generalized linear models, there is a relation between the mean and the variance. That implies there is no homogeneity of variance. However, as long as the variance depends on the same parameter as the mean, this is no problem, as the heteroscedastic variance is determined by the mean, and hence known when the mean is known.

Modelling the mean with predictor variables \(X_j\), the conditional mean of a Bernoulli-distributed variable can be written as: \[\begin{aligned} \mu_{Y|X_1, \ldots, X_m} &= p(Y=1|X_1, \ldots, X_m) = \pi_1 \end{aligned}\] Note that we introduced the shorthand \(\pi_1\) to denote the (conditional) probability that \(Y=1\).

The canonical link function for Bernoulli-distributed data is the so-called logit link function: \[\begin{equation} g(\mu_{Y|X_1, \ldots, X_m}) = \log \left( \frac{\mu_{Y|X_1, \ldots, X_m}}{1-\mu_{Y|X_1, \ldots, X_m}} \right) \tag{12.6} \end{equation}\] This may be a bit easier to read after using our shorthand notation \(\pi_1 = \mu_{Y|X_1, \ldots, X_m}\): \[\begin{equation} g(\pi_1) = \log \left( \frac{\pi_1}{1-\pi_1} \right) \tag{12.7} \end{equation}\] The logistic regression model is then \[\begin{equation} \log \left( \frac{\pi_1}{1-\pi_1} \right) = \beta_0 + \beta_1 \times X_1 + \ldots + \beta_m \times X_m \tag{12.8} \end{equation}\]

The inverse link function is called the inverse-logit or logistic function: \[\begin{equation} h\left(\beta_0 + \sum_{j=1}^m \beta_j X_{j}\right) = \frac{\exp \left( \beta_0 + \sum_{j=1}^m \beta_j X_{j} \right)}{1 + \exp \left( \beta_0 + \sum_{j=1}^m \beta_j X_{j} \right)} \tag{12.9} \end{equation}\] So applying the inverse of the link function in Equation (12.6) to the structural part gives: \[\begin{equation} \mu_{Y|X_1,\ldots,X_m} = \pi_1 = \frac{\exp \left( \beta_0 + \sum_{j=1}^m \beta_j X_{j} \right)}{1 + \exp \left( \beta_0 + \sum_{j=1}^m \beta_j X_{j} \right)} \end{equation}\]
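In R, the logit and logistic functions are available as `qlogis()` and `plogis()`, which makes it easy to move between the two scales; a small illustration with made-up numbers:

```r
# The logit link and its inverse, the logistic function.
qlogis(0.8)           # logit: log(0.8 / 0.2)
plogis(qlogis(0.8))   # inverse logit recovers the probability 0.8

# Predicted probability for hypothetical coefficient and predictor values.
beta_0 <- -1; beta_1 <- 0.5; x <- 2
eta <- beta_0 + beta_1 * x
plogis(eta)           # identical to exp(eta) / (1 + exp(eta))
```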

Some examples of the relation between a single predictor and different values of \(\beta_0\) and \(\beta_1\) are shown in Figure 12.1.


Figure 12.1: Examples of the logistic link function for a logistic regression model with a single predictor, and different values for \(\beta_0\) and \(\beta_1\)

As usual, the intercept \(\beta_0\) reflects the model prediction when all predictors have the value \(X_j = 0\). It is important to keep in mind that generalized linear models make predictions not on the scale of the dependent variable \(Y\), but on the scale of the link function \(g(Y)\). We can use the inverse link function to assess the predicted value on the scale of the dependent variable. In the case of logistic regression, this is done via the logistic function, \(\frac{\exp(\beta_0)}{1+\exp(\beta_0)}\). For a single-predictor model, the value \(-\frac{\beta_0}{\beta_1}\) is equal to the value of the predictor at which \(\pi_1 = 0.5\).38 This is also the point at which the logistic curve is the steepest. The slope \(\beta_1\) reflects the maximum steepness of the curve, as well as its direction. Positive values of \(\beta_1\) indicate that the probability \(\pi_1\) increases with the value of \(X_1\), whilst negative values of \(\beta_1\) indicate that it decreases. The higher the absolute value of \(\beta_1\), the more quickly the probability increases or decreases (i.e. the steeper the logistic curve).

12.5.1 Parameter interpretation

In the general linear model, \(\beta_j\), the slope of a predictor variable \(X_j\), reflects the predicted change in the dependent variable for a one-unit increase in \(X_j\). Because the model is linear, the predicted change is the same no matter what starting value of \(X_j\) we choose for this one-unit increase. The dependent variable \(Y\) is predicted to change by \(\beta_j\), whether we increase \(X_j\) from 0 to 1, or from 100 to 101. This is not the case when we consider the predicted change in \(p(Y=1|X_j)\) for different values of \(X_j\). For example, for a model with \(\beta_0 = 0\) and \(\beta_1 = 1\), Figure 12.1 shows that the probability increases more rapidly when we move from \(X_1 = 0\) to \(X_1 = 1\) than when we move from \(X_1 = 3\) to \(X_1 = 4\).

A slope \(\beta_j\) in a logistic regression model reflects the change in the logit \(\log \left( \frac{\pi_1}{1-\pi_1}\right)\) for a one-unit increase in a predictor value. This change in the logit is the same, no matter the starting value of \(X_j\) we choose. But on the scale of \(\pi_1\), the change in \(\pi_1\) due to a one-unit increase in \(X_j\) is different for different starting values of \(X_j\). As such, the slope \(\beta_j\) does not have a straightforward interpretation on the scale of \(\pi_1\). But an exponential transformation of this slope does, to some extent. More precisely, the slope has a constant multiplicative effect on the odds \(\frac{\pi_1}{1-\pi_1}\). To see how this works, we will make use of the following properties of exponents: \[\begin{aligned} a^{(b + c)} &= a^b \times a^c \\ a^{(b \times c)} &= \left(a^{b}\right)^c = \left(a^{c}\right)^b . \end{aligned}\] We also need to remember that \(\exp (a) = e^a\), where \(e\) denotes the base of the natural logarithm, and \(\log \left(\exp(a)\right) = a\). Using these properties, we can rewrite the model predictions as \[\begin{aligned} \log\left(\frac{\pi_1}{1-\pi_1}\right) &= \beta_0 + \beta_1 \times X \\ \exp \left( \log\left(\frac{\pi_1}{1-\pi_1}\right)\right) &= \exp\left(\beta_0 + \beta_1 \times X\right) \\ \frac{\pi_1}{1-\pi_1} &= \exp\left(\beta_0 + \beta_1 \times X\right) \\ &= \exp(\beta_0) \times \exp\left(\beta_1 \times X\right) \\ &= \exp(\beta_0) \times \left(\exp(\beta_1) \right)^{X} \end{aligned}\] In the final line of the above derivation, you can see that every one-unit increase in \(X\) has the effect of multiplying the odds by \(\exp(\beta_1)\). For example, the odds at \(X = 1\) is \[\exp(\beta_0) \times \left(\exp(\beta_1) \right)^{1} = \exp(\beta_0) \times \exp(\beta_1)\] and at \(X = 2\) it is \[\exp(\beta_0) \times \left(\exp(\beta_1) \right)^{2} = \exp(\beta_0) \times \exp(\beta_1) \times \exp(\beta_1)\] So the odds at \(X = 2\) is the odds at \(X = 1\) multiplied by \(\exp(\beta_1)\). Similarly, the odds at \(X = 1\) is the odds at \(X = 0\) (which is \(\exp(\beta_0)\)) multiplied by \(\exp(\beta_1)\).
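For a fitted logistic regression, these odds multipliers are simply the exponentiated coefficients; a short sketch for the hypothetical model `mod` used earlier:

```r
# Odds-ratio interpretation of the coefficients of the hypothetical model 'mod'.
exp(coef(mod))      # multiplicative change in the odds per one-unit increase
exp(confint(mod))   # profile-likelihood intervals on the odds-ratio scale
```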

On the scale of log-odds or logits, every one-unit increase in \(X\) adds \(\beta_1\) to the log-odds. On the scale of odds, every one-unit increase in \(X\) multiplies the odds by \(\exp(\beta_1)\). In terms of interpreting the effect of \(X\), you can choose whether you find this easier to do on the scale of log-odds, or on the scale of odds. The most natural scale is probably the scale of the (probability of) response \(\pi_1 = p(Y=1|X)\) itself. Unfortunately, the non-linear effect of \(X\) on the response prevents a straightforward interpretation on this scale.

12.5.2 Example: Metacognition in visual perception

Metacognition broadly refers to awareness of our thought processes. One aspect of metacognition is the ability to ascertain how (un)certain we are in our inferences about the world. For example, Rausch & Zehetleitner (2016) asked participants to judge whether visually presented stimuli (so-called grating patterns) had a horizontal or vertical orientation. To make this task difficult for their participants, they presented each stimulus only briefly (200 ms) and at a variety of relatively low contrast levels. They also asked participants to indicate their confidence (on a scale from 0 to 100) that their judgement was correct. Participants each completed a total of 378 trials. If metacognition is accurate, then participants should be more confident when they made more accurate responses.

For two participants in the experiment, Figure 12.2 shows the proportion of correct responses as a function of different levels of stated confidence. For Participant 1, there appears to be quite a good correspondence between confidence and correctness, with accuracy increasing for higher levels of confidence. For Participant 2, the relation does not look so neat, with average correctness higher in the lowest range of confidence, and then decreasing in the next two ranges, before rapidly increasing to almost 100% for the highest two confidence ranges. However, as is evident from the width of the confidence bands, this participant gave relatively few confidence ratings between 20 and 60.


Figure 12.2: Confidence and proportion correct for two participants in the study by Rausch & Zehetleitner (2016) on metacognition.

As the relation between confidence and accuracy likely differs between participants (e.g. participants might interpret the confidence scale differently, or have different levels of metacognition), estimating a single logistic regression model for both participants seems inappropriate. Hence, we will estimate separate logistic regression models for the two participants. In our model, we will allow the effect of confidence to be moderated by contrast (i.e., we allow metacognition to be affected by contrast). For easier interpretation, we first center both variables, and then estimate the following logistic regression model:

\[\log \left( \frac{p(Y_i = \text{correct})}{p(Y_i = \text{incorrect})} \right) = \beta_0 + \beta_{\texttt{conf}} \times \texttt{conf}_i + \beta_{\texttt{contr}} \times \texttt{contr}_i + \beta_{\texttt{cc}} \times (\texttt{conf} \times \texttt{contr})_i\]
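A sketch of how such a model could be fit in R is given below. The data frame `metacog` and the variable names `id`, `correct`, `conf`, and `contr` are assumptions made for illustration; they are not necessarily the names used in the original data.

```r
# Hypothetical sketch: per-participant logistic regression with centred predictors.
# 'metacog', 'id', 'correct', 'conf' and 'contr' are assumed names.
dat1 <- subset(metacog, id == 1)
dat1$conf_c  <- dat1$conf  - mean(dat1$conf)    # centre confidence
dat1$contr_c <- dat1$contr - mean(dat1$contr)   # centre contrast

mod1 <- glm(correct ~ conf_c * contr_c, data = dat1, family = binomial())
summary(mod1)                 # Wald (z) tests of the coefficients
drop1(mod1, test = "Chisq")   # likelihood-ratio (G^2) test of the interaction
# G^2 tests for every term (as in Table 12.1) could be obtained with, e.g.,
# car::Anova(mod1, type = 3, test.statistic = "LR"), if the car package is available.
```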

Table 12.1 shows the results for Participant 1. Each participant completed 378 trials, so the number of observations in this subset of data equals \(n=378\). For didactic purposes, we present results from both the Wald (\(z\)) and likelihood-ratio (\(G^2\)) tests. You would usually just provide one of these.

Table 12.1: Results of a logistic regression model of correct identification by confidence for Participant 1.

|  | \(\hat{\beta}\) | \(\text{SE}(\hat{\beta})\) | \(z\) | \(p(\geq \lvert z \rvert)\) | \(G^2\) | \(\text{df}\) | \(p(\geq \lvert G^2 \rvert)\) |
|---|---|---|---|---|---|---|---|
| Intercept | 1.851 | 0.234 | 7.92 | < .001 | 94.43 | 1 | < .001 |
| Confidence | 0.038 | 0.007 | 5.11 | < .001 | 29.82 | 1 | < .001 |
| Contrast | 0.503 | 0.108 | 4.65 | < .001 | 24.04 | 1 | < .001 |
| Confidence \(\times\) Contrast | 0.008 | 0.003 | 2.52 | .012 | 6.65 | 1 | .010 |

Regardless of whether we rely on the Wald or likelihood-ratio test, the results show evidence for a positive interaction between confidence and contrast. We also find evidence of positive “simple effects” for confidence and contrast. As we centered the predictors, the simple effect of confidence reflects its effect at the average contrast value. Vice versa, the simple effect of contrast reflects its effect at the average confidence value. We can therefore conclude that, at average levels of confidence, higher levels of contrast result in more accurate responses. We can also conclude that, at average levels of contrast, higher levels of confidence result in more accurate responses. This is thus evidence for accurate metacognition. The interaction indicates that metacognition might be more pronounced for higher levels of contrast (i.e., the relation between confidence and accuracy is stronger for higher values of contrast). Finally, the intercept is also significant. This indicates that, for average values of confidence and contrast (such that centered \(\texttt{conf}\) and \(\texttt{contr}\) are both equal to 0), the predicted probability of a correct response is different from .5. The estimated probability is: \[\begin{aligned} p(Y_i = \text{correct}) &= \frac{\exp (1.851)}{1 + \exp (1.851)} \\ &= 0.864\end{aligned}\] As for the general linear model, tests of the intercept are usually not of interest.

The residual deviance of the model is \(D_R = 273.664\). We can use the deviance to test whether the model is equal to the true model, using a Chi-Squared distribution with \(n-\text{npar}(R)\) degrees of freedom (here, the saturated model needs \(n\) parameters to fit the data perfectly). Hence, the test of overall model fit is \(X^2(374) = 273.664\), \(p = 1.00\). As this is not significant, the model can be said to fit the data well.

Table 12.2 shows results from the same analysis applied to the data from Participant 2. Whilst the data seemed less neat than the data of Participant 1, we get rather similar results: significant and positive simple effects of confidence and contrast, as well as a significant positive interaction between them.

Table 12.2: Results of a logistic regression model of correct identification by confidence for Participant 2.

|  | \(\hat{\beta}\) | \(\text{SE}(\hat{\beta})\) | \(z\) | \(p(\geq \lvert z \rvert)\) | \(G^2\) | \(\text{df}\) | \(p(\geq \lvert G^2 \rvert)\) |
|---|---|---|---|---|---|---|---|
| Intercept | 1.871 | 0.253 | 7.38 | < .001 | 74.03 | 1 | < .001 |
| Confidence | 0.021 | 0.005 | 4.03 | < .001 | 19.71 | 1 | < .001 |
| Contrast | 0.384 | 0.121 | 3.16 | .002 | 8.66 | 1 | .003 |
| Confidence \(\times\) Contrast | 0.005 | 0.002 | 2.13 | .033 | 4.24 | 1 | .039 |

The test of the deviance of the model for Participant 2 is \(X^2(374) = 281.013\), \(p = 1.00\). As this is again not significant, the model can be said to fit the data well.

To get a better idea of the relation between confidence and accuracy, Figure 12.3 depicts the observed data and model predictions for the logistic regression models of Tables 12.1 and 12.2. Note that the observed data are binary values indicating correct (1) or incorrect (0) responses. The model predictions, however, are probabilities of a correct (\(Y=1\)) response. Both the data and model predictions indicate more correct responses for higher levels of contrast. For lower levels of contrast, we can see that low confidence generally implies a low probability of a correct response, whilst higher confidence implies a higher probability of a correct response. This pattern seems more pronounced for Participant 1. Participant 2 tends to indicate low confidence for low contrast displays, and high confidence for high contrast displays. For this participant, contrast appears a much stronger predictor of correctness. Nevertheless, Table 12.2 indicates that, for an average level of contrast, confidence has a significant unique effect on the probability of a correct response. As such, this participant appears able to recognize when they made an incorrect response.


Figure 12.3: Observed and model-predicted response accuracy as a function of confidence and contrast, for two participants in the study by Rausch & Zehetleitner (2016) on metacognition.

12.5.4 Welcome back Paul!

As a final example of logistic regression, I can’t resist getting Paul back to the stage. We can also use logistic regression to test the hypothesis from Chapter 2 that Paul was randomly guessing, that is \(H_0: p(\text{correct}) = .5\). In this case, we have a simple model without a predictor. But we can test whether the intercept equals \(\beta_0 = 0\), as this value implies that \(p(\text{correct}) = .5\). For this model comparison, we will compare the following models: \[\begin{aligned} \text{MODEL R:}&& \log\left(\frac{\pi_1}{1-\pi_1}\right) &= 0 \\ \text{MODEL G:} && \log\left(\frac{\pi_1}{1-\pi_1}\right) &= \beta_0 \end{aligned}\] For the data at hand (12 correct and 2 incorrect predictions by Paul), the estimate of the intercept is \(\hat{\beta}_0 = 1.792\) for MODEL G, which corresponds to a predicted probability of \(\hat{\pi}_1 = \frac{\exp(1.792)}{1 + \exp(1.792)} = 0.857\). The test result of the model comparison is \(G^2(1) = 7.925\), \(p = .005\). Hence, in correspondence with the results from Chapter 2, we can reject the null-hypothesis that Paul was randomly guessing.
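This comparison is easy to reproduce by hand; the sketch below recreates Paul’s data and the \(G^2\) test (note that under MODEL R every prediction has probability .5, so its log-likelihood is simply \(14 \log(0.5)\)).

```r
# Paul's 12 correct and 2 incorrect predictions as an intercept-only logistic model.
paul  <- data.frame(correct = rep(c(1, 0), times = c(12, 2)))
mod_G <- glm(correct ~ 1, data = paul, family = binomial())
coef(mod_G)          # intercept estimate: approx. 1.792
plogis(coef(mod_G))  # estimated p(correct): approx. 0.857

# MODEL R fixes the log-odds at 0, so its log-likelihood is 14 * log(0.5).
G2 <- -2 * (14 * log(0.5) - as.numeric(logLik(mod_G)))
c(G2 = G2, p = pchisq(G2, df = 1, lower.tail = FALSE))   # G^2 = 7.925, p = .005
```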

12.6 Poisson regression

Poisson regression is useful for count data where the total number of possible occurrences of an event is not given. For example, suppose you give participants one hour to solve as many algebra problems as they can. Some participants may be able to (on average) solve a problem correctly every minute, whilst others may need more time (e.g. 5 minutes per problem, on average) or less. In this case, the total number of correctly solved algebra problems depends on the (average) rate at which a person can solve algebra problems. If the rate is unknown and can, in principle, take any possible value, then there is no theoretical maximum to the number of algebra problems that can be solved in an hour. Only the minimum is known (which is 0). This is a rather different situation from those to which the Binomial distribution applies, which counts the number of correct responses \(k\) out of a known maximum \(n\).

The Poisson distribution is defined as: \[\begin{equation} p(Y=k) = \frac{\lambda^k \exp (-\lambda)}{k!} \tag{12.10} \end{equation}\] with \(k=0, 1, 2, \ldots\) denoting the number of occurrences (the count), which is a non-negative integer, and ! denoting the factorial function (i.e. \(x! = x \times (x-1) \times \ldots \times 1\)). The parameter \(\lambda\) is equal to the mean of \(Y\), so \(\mu_Y = \lambda\). The variance of this distribution is equal to the mean, so \(\text{Var}(Y) = \lambda\). As for the Bernoulli distribution in logistic regression, the Poisson distribution used in Poisson regression does not include an additional parameter for the spread of the distribution.
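The Poisson probability mass function of Equation (12.10) is available in R as `dpois()`; a quick numerical check (with an arbitrary mean of 2) illustrates that the mean and variance are both equal to \(\lambda\).

```r
# The Poisson distribution in R; lambda = 2 is an arbitrary example value.
dpois(3, lambda = 2)                     # p(Y = 3) when the mean is 2
lambda <- 2
k <- 0:100                               # truncation; probabilities beyond are negligible
sum(k * dpois(k, lambda))                # mean: approx. 2
sum((k - lambda)^2 * dpois(k, lambda))   # variance: also approx. 2
```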

The canonical link function for Poisson-distributed data is the log link function: \[g(\mu_{Y|X_1, \ldots, X_m}) = \log\left(\mu_{Y|X_1, \ldots, X_m}\right)\] The Poisson regression model is thus \[\log\left(\mu_{Y|X_1, \ldots, X_m}\right) = \beta_0 + \sum_{j=1}^m \beta_j \times X_{j}\] The inverse link function is the inverse-log or exponential function: \[h\left(\beta_0 + \sum_{j=1}^m \beta_j \times X_{j}\right) = \exp \left( \beta_0 + \sum_{j=1}^m \beta_j \times X_{j} \right)\]

12.6.1 Example: Gestures in different social contexts

As an example of Poisson regression, we will consider data also analysed by Winter & Bürkner (2021). The data is from a study on the use of hand gestures by 27 participants (Catalan or Korean speakers), each conversing with either a friend (equal-status person) or a professor (higher-status person). The duration of the conversation was free and not experimentally controlled. The variable of interest is the count of the number of gestures used by the participants throughout the conversation. Figure 12.6 shows the numbers of gestures for the different conditions in the experiment. This Figure doesn’t immediately provide a clear idea of differences between the conditions.


Figure 12.6: Counts of the number of hand gestures during conversations with a friend or professor for 14 Catalan and 13 Korean speakers.

To analyse the data with a Poisson regression model, we need to take into account that the conversations had different durations. If, for example, conversations with friends generally last longer than conversations with professors, we could expect fewer gestures in total when participants converse with professors. But this might be entirely due to the length of the conversations, rather than participants’ tendency to use gestures during their conversation. What we are really interested in is the rate of gesture use (e.g. the number of gestures per minute). If we let \(r\) denote the average number of gestures per minute, and \(t_i\) the duration (in minutes) of conversation \(i\), then the expected number of gestures in conversation \(i\) would be \(\mu_i = r \times t_i\). A Poisson regression model predicts \(\log(\mu_i)\), the logarithm of the expected count, which we can write in terms of the rate \(r\) and duration \(t_i\) as: \[\begin{aligned} \log (\mu_i) &= \log\left(r \times t_i\right) \\ &= \log(r) + \log(t_i) \end{aligned}\] In the study, we are interested in how social context and language affect the rate (\(r\)) of gestures. So ideally we would like to model the (log) rate \(\log(r)\), and not the expected number of gestures \(\log(\mu_i)\), as a linear function of the predictors of interest. We can do this by replacing \(\log(r)\) above with the linear model predictions. We then have the following model for the expected counts: \[\log (\mu_i) = \log(t_i) + \beta_0 + \sum_{j=1}^m \beta_j X_{j,i}\] Note that the term \(\log(t_i)\) above has no slope. You can think of \(\log(t_i)\) as a predictor with a fixed slope of 1. Such a term added to a model (without needing to estimate a corresponding parameter) is also called an offset. Here, the offset is a vital component of the model, as it allows us to model counts from conversations with different durations.
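In R’s `glm()`, such an offset can be supplied with `offset()` inside the model formula. The sketch below assumes a data frame `gestures` with a gesture count `count`, a conversation duration `dur_min` (in minutes), and factors `context`, `language`, and `gender`; these names are assumptions for illustration, not necessarily those in the original data.

```r
# Hypothetical sketch: Poisson regression with an offset for conversation duration.
# 'gestures', 'count', 'dur_min', 'context', 'language' and 'gender' are assumed
# names, with the factors effect coded (e.g. via contr.sum) beforehand.
mod_pois <- glm(count ~ context * language * gender + offset(log(dur_min)),
                data = gestures, family = poisson())
summary(mod_pois)   # Wald tests of the coefficients
```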

Apart from the choice of link function and distribution, and the resulting implications for parameter interpretation, Poisson regression is very similar to other generalized linear models. Table 12.4 shows the results of a Poisson regression, using effect coded predictors for Context (friend = 1, professor = \(-1\)), Language (Catalan = 1, Korean = \(-1\)), and Gender (Female = 1, Male = \(-1\)). It should be noted that this analysis does not account for the fact that each participant contributes two data points (one for each Context). Later in this chapter we will provide results from a generalized linear mixed-effects model, which does so.

Table 12.4: Results of a Poisson regression model for the number of gestures in conversations.

|  | \(\hat{\beta}\) | \(\text{SE}(\hat{\beta})\) | \(z\) | \(p(\geq \lvert z \rvert)\) | \(G^2\) | \(\text{df}\) | \(p(\geq \lvert G^2 \rvert)\) |
|---|---|---|---|---|---|---|---|
| Intercept | -0.966 | 0.02 | -48.70 | < .001 |  |  |  |
| Context | 0.046 | 0.02 | 2.32 | .020 | 5.39 | 1 | .020 |
| Language | 0.029 | 0.02 | 1.45 | .147 | 2.11 | 1 | .146 |
| Gender | 0.062 | 0.02 | 3.14 | .002 | 9.88 | 1 | .002 |
| Context \(\times\) Language | -0.036 | 0.02 | -1.82 | .069 | 3.32 | 1 | .068 |
| Context \(\times\) Gender | -0.026 | 0.02 | -1.31 | .190 | 1.72 | 1 | .190 |
| Language \(\times\) Gender | -0.042 | 0.02 | -2.10 | .035 | 4.44 | 1 | .035 |
| Context \(\times\) Language \(\times\) Gender | 0.037 | 0.02 | 1.88 | .060 | 3.53 | 1 | .060 |

As before, we provide results for both the Wald and likelihood-ratio tests, whilst you would normally just report one of these. Given the relatively small size of the dataset, the use of the likelihood-ratio test is recommended over the Wald test. Nevertheless, the results are very similar for both tests: both show a significant and positive effect of Context, indicating more gestures in conversations with friends. There is also a significant and positive effect of Gender, indicating more gestures by females compared to males. Finally, there is a significant and negative interaction between Language and Gender. This indicates that the difference between females and males is reduced for Catalan speakers as compared to Korean speakers. Recall that due to the coding scheme used (Gender: Female = 1, Male = \(-1\); Language: Catalan = 1, Korean = \(-1\)), we can work out the conditional slope of Gender for the two Language conditions as: \[\beta_{\text{Gender}|\text{Language}} = \begin{cases} 0.062 + (-0.042) = 0.020 && \text{Catalan} \\ 0.062 - (-0.042) = 0.104 && \text{Korean} \end{cases}\] Due to the log link function, interpretation of the parameters in this Poisson regression model is reasonably straightforward. The value of \(\exp (\beta_j)\) reflects the multiplicative increase of the dependent variable (count) for a one-unit increase in predictor \(X_j\). For example, a one-unit increase on the effect-coded Gender predictor (i.e. moving from the average over genders to female) would increase the predicted number of gestures by a factor of \(\exp(0.062) \approx 1.064\).

Table 12.5 shows the 95% confidence intervals, computed by the profile likelihood method, for the parameters of the Poisson regression model. As can be seen there, only the confidence intervals of Context, Gender, and the Language by Gender interaction exclude 0. This leads to similar conclusions as the test results discussed above.

Table 12.5: 95% profile likelihood confidence intervals for the parameters of the Poisson regression model applied to the gestures data.

|  | 2.5 % | 97.5 % |
|---|---|---|
| Intercept | -1.005 | -0.927 |
| Context | 0.007 | 0.085 |
| Language | -0.010 | 0.068 |
| Gender | 0.023 | 0.101 |
| Context \(\times\) Language | -0.075 | 0.003 |
| Context \(\times\) Gender | -0.065 | 0.013 |
| Language \(\times\) Gender | -0.081 | -0.003 |
| Context \(\times\) Language \(\times\) Gender | -0.002 | 0.076 |

The deviance of the model is \(D_R = 303.957\), with residual degrees of freedom equal to \(n - \text{npar}(R) = 54 - 8 = 46\). Hence, the test of overall model fit is \(X^2(46) = 303.957\), \(p < .001\). As this test is significant, the model does not fit the data well. We will explore why in the following section.

12.6.2 Overdispersion

The overall model fit test above indicates that we can reject the hypothesis that the model is equal to the “true model” (i.e. the probability distribution provided by the estimated model diverges substantially from the observed distribution in the data). This indicates the model is misspecified. Misspecification can be due to not including all relevant predictors in a model, or other factors. For count data, a common cause of misspecification is overdispersion. In a Poisson distribution, the (conditional) variance is identical to the (conditional) mean. In real data, the (conditional) variance is often larger than the (conditional) mean. When the variance is larger than theoretically expected, this is called overdispersion. Underdispersion (where the variance is smaller than expected) is also possible, but much less common in practice.
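A common rough check for overdispersion is to compare the sum of squared Pearson residuals to the residual degrees of freedom; for a well-specified Poisson model this ratio should be close to 1. A sketch for the (hypothetical) model `mod_pois` fitted earlier:

```r
# Rough overdispersion check for the hypothetical Poisson model 'mod_pois'.
phi_hat <- sum(residuals(mod_pois, type = "pearson")^2) / df.residual(mod_pois)
phi_hat   # values much larger than 1 suggest overdispersion
```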

Overdispersion can lead to large residuals. Figure 12.7 shows a histogram of the standardized Pearson residuals for the Poisson regression model estimated above. Under the null-hypothesis that the model is equal to the true model, these should be standard-Normal-distributed. The overlaid Normal density shows the expected distribution according to the model. Comparing the predicted and observed distribution, it is clear there are more large residuals than expected.

Figure 12.7: Histogram of standardized Pearson residuals for the Poisson regression model

Overdispersion does not bias the parameter estimates \(\hat{\beta}\). So, on average, the parameter estimates are equal to the true parameters. However, overdispersion makes parameter estimates less reliable (i.e. the variance of the sampling distribution of the parameter estimates is larger than without overdispersion). As a result, the standard errors of the parameter estimates reported in Table 12.4 are incorrect, as they assume no overdispersion (or underdispersion for that matter). When there is overdispersion, the true standard errors are higher. As a result, the \(p\)-values of the Wald test are also incorrect. And as data with overdispersion is not truly Poisson-distributed, the same applies to the \(p\)-values of the likelihood ratio tests.

Whilst not exactly straightforward, there are ways to account for overdispersion in generalized linear models. In quasi-Poisson regression, it is assumed that the variance is not equal to the mean \(\mu\), but scaled by a dispersion parameter \(\phi\), such that \(\text{Var}(Y) = \phi \times \mu\). Note that the usual Poisson regression is nested in this more general model. When \(\phi = 1\), there is no overdispersion, and we revert to the standard Poisson regression model. If \(\phi > 1\), there is overdispersion. By estimating the dispersion parameter, quasi-Poisson regression allows one to account for overdispersion and to compute more accurate standard errors. A downside of quasi-Poisson regression is that the distribution of the dependent variable is no longer a member of the exponential family. In fact, quasi-Poisson regression does not specify a clear probability distribution for the dependent variable at all. As a result, the likelihood of the parameters, \(p(\text{DATA}|\theta)\), is not properly defined either. What remains is a quasi-likelihood, where only the conditional means and variances are specified. In the absence of a likelihood function, parameters cannot be estimated by maximum likelihood; rather, they are estimated by maximum quasi-likelihood. Nevertheless, maximum quasi-likelihood estimators share many aspects with maximum likelihood estimators. In particular, the quasi-likelihood estimators are unbiased for the regression parameters \(\beta_j\).

For the gestures data, the estimated dispersion parameter is \(\hat{\phi} = 5.913\), which is clearly much larger than 1. Results of the quasi-Poisson regression are provided in Table 12.6. Note that this Table contains Wald and \(F\)-tests. Wald tests are justified by the asymptotic properties of quasi-likelihood estimation. The use of \(F\)-tests when the dispersion is estimated is recommended by Dunn & Smyth (2018).
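A sketch of how such a quasi-Poisson model could be fit, again using the hypothetical `gestures` data frame and variable names assumed earlier:

```r
# Hypothetical sketch: quasi-Poisson version of the same model. The coefficient
# estimates are identical to the Poisson fit; only the standard errors change.
mod_qpois <- glm(count ~ context * language * gender + offset(log(dur_min)),
                 data = gestures, family = quasipoisson())
summary(mod_qpois)$dispersion   # estimated dispersion parameter phi
drop1(mod_qpois, test = "F")    # F-test(s); drop1 only tests terms whose removal
                                # respects marginality (here the three-way interaction)
```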

Table 12.6: Results of a quasi-Poisson regression model for the number of gestures in conversations.

|  | \(\hat{\beta}\) | \(\text{SE}(\hat{\beta})\) | \(z\) | \(p(\geq \lvert z \rvert)\) | \(F\) | \(\text{df}_1\) | \(\text{df}_2\) | \(p(\geq \lvert F \rvert)\) |
|---|---|---|---|---|---|---|---|---|
| Intercept | -0.966 | 0.048 | -20.028 | < .001 |  |  |  |  |
| Context | 0.046 | 0.048 | 0.953 | .346 | 0.912 | 1 | 46.00 | 0.345 |
| Language | 0.029 | 0.048 | 0.597 | .554 | 0.357 | 1 | 46.00 | 0.553 |
| Gender | 0.062 | 0.048 | 1.291 | .203 | 1.670 | 1 | 46.00 | 0.203 |
| Context \(\times\) Language | -0.036 | 0.048 | -0.748 | .458 | 0.562 | 1 | 46.00 | 0.457 |
| Context \(\times\) Gender | -0.026 | 0.048 | -0.539 | .593 | 0.291 | 1 | 46.00 | 0.592 |
| Language \(\times\) Gender | -0.042 | 0.048 | -0.865 | .392 | 0.751 | 1 | 46.00 | 0.391 |
| Context \(\times\) Language \(\times\) Gender | 0.037 | 0.048 | 0.772 | .444 | 0.597 | 1 | 46.00 | 0.444 |

Table 12.6 indicates that none of the effects included in the model are significant. This is essentially because the standard errors of the parameters, corrected for overdispersion, are much larger than under the assumption that the Poisson distribution (without overdispersion) holds. As we will see later in this Chapter, however, the overdispersion is likely due to not properly accounting for the dependencies in the data: as participants contribute to both the friend and professor conditions, the data has repeated measures. Dealing with this properly requires the use of generalized linear mixed-effects models, which we will introduce in Section 12.9.

12.7 Log-linear models

Log-linear models are used to analyse dependency relations in multivariate categorical data. Such data can be expressed in the form of multi-way contingency tables. These are tables that contain counts for combinations of categorical classifications. For example, we might be interested in whether there is a relation between the newspaper people read and the political party they vote for. To investigate this, we can count how often each combination of newspaper and political party occurs in a random sample of surveyed people. In this case, we would have a two-way contingency table, where the first categorical dimension reflects the newspaper a person reads, and the second the political party they vote for.

To model a two-way contingency table, let \(\mu_{i,j}\) denote the expected frequency of the combination where the first categorical variable has value \(A = i\) and the second the value \(B = j\). We will denote the probability that a single randomly drawn observation has this combination of values as \(\pi_{i,j} = P(A = i, B = j)\). For a total of \(n\) independent observations, the expected frequency is then: \[\mu_{i,j} = n \times \pi_{i,j}\] Now, if the two categorical variables were independent, so that the value of \(A\) does not depend on the value of \(B\), we could rewrite the probability of the combination of \(i\) and \(j\) for a single observation as: \[\pi_{i,j} = P(A = i, B = j) = P(A = i) \times P(B=j) = \pi_{i,\cdot} \times \pi_{\cdot,j}\] where \(\pi_{i,\cdot}\) and \(\pi_{\cdot,j}\) denote marginal probabilities: \[\begin{aligned} \pi_{i,\cdot} &= \sum_{j} \pi_{i,j} \\ \pi_{\cdot,j} &= \sum_{i} \pi_{i,j} \end{aligned}\]

Furthermore, using a log link function and assuming independence, we can write the logarithm of the expected counts as: \[\begin{aligned} \log \mu_{i,j} &= \log(n) + \log(\pi_{i,\cdot}) + \log(\pi_{\cdot,j}) \\ &= \lambda + \lambda_i^{(A)} + \lambda_j^{(B)} \end{aligned}\] where \(\lambda = \log(n)\), \(\lambda^{(A)}_i = \log(\pi_{i,\cdot})\), and \(\lambda^{(B)}_j = \log(\pi_{\cdot,j})\). If independence does not hold, then the above equation is incorrect, and we would need an additional term to correct for this. In that case, we could write \[\begin{aligned} \log \mu_{i,j} &= \log(n) + \log(\pi_{i,j}) \\ &= \log(n) + \log(\pi_{i,\cdot}) + \log(\pi_{\cdot,j}) + \left( \log(\pi_{i,j}) - \log(\pi_{i,\cdot}) - \log(\pi_{\cdot,j})\right) \\ &= \lambda + \lambda_i^{(A)} + \lambda_j^{(B)} + \lambda_{i,j}^{(A,B)}\end{aligned}\] where \(\lambda_{i,j}^{(A,B)} = \log(\pi_{i,j}) - \log(\pi_{i,\cdot}) - \log(\pi_{\cdot,j})\). Log-linear models use such constructions to test for independence. Under independence, the “correction terms” above would all be equal to \(\lambda_{i,j}^{(A,B)} = 0\).

Log-linear models are similar in spirit to a factorial ANOVA, in the sense that the counts are decomposed into main effects, two-way interactions, three-way interactions, etc. For a two-way contingency table, the counts \(\mu_{i,j}\) can always be decomposed exactly as \[\log \mu_{i,j} = \lambda + \lambda^{(A)}_i + \lambda^{(B)}_j + \lambda^{(AB)}_{i,j}\] where \(\lambda\) acts as an intercept, \(\lambda^{(A)}_i\) as a “main effect” for factor \(A\), \(\lambda^{(B)}_j\) as a “main effect” for factor \(B\), and \(\lambda^{(AB)}_{i,j}\) as an “interaction effect” between factors \(A\) and \(B\). Just like a factorial ANOVA model can perfectly fit the sample means, this log-linear model can perfectly fit the sample counts. This is hence an example of a saturated model, where the total number of parameters equals the total number of possible combinations minus 1.39

Although they seem rather different, log-linear models can be formulated in terms of equivalent Poisson regression models (by using contrast-coded predictors to account for the cells or combinations of categories in the multi-way contingency table). As such, they are an instance of generalized linear models, and all the tools available for the latter apply to log-linear models (Dobson & Barnett, 2018).
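To make this concrete, here is a minimal sketch (with made-up counts) of how a log-linear independence model for a two-way table can be fit as a Poisson regression in R, and compared to the saturated model.

```r
# Toy two-way table with made-up counts, analysed as a Poisson regression.
tab <- data.frame(A = gl(2, 3), B = gl(3, 1, 6),
                  count = c(10, 20, 30, 15, 25, 35))
mod_indep <- glm(count ~ A + B, data = tab, family = poisson())   # independence model
mod_sat   <- glm(count ~ A * B, data = tab, family = poisson())   # saturated model
anova(mod_indep, mod_sat, test = "Chisq")   # G^2 test of independence
```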

12.7.1 Example: Newspapers and voting

After the 2017 UK general election, a large survey was conducted where people were asked which party they voted for, and amongst other things, the newspaper they most often read. Table 12.7 shows the results. In the UK, many newspapers have strong links to political parties. We could ask whether such links also exist for their readers.

Table 12.7: Party voted for and most-read newspapers for the 2017 UK general election.

|  | Express | Daily Mail | Mirror | Daily Star | Sun | Daily Telegraph | Financial Times | Guardian | Independent | Times |
|---|---|---|---|---|---|---|---|---|---|---|
| Did not vote | 212 | 1568 | 483 | 83 | 1022 | 542 | 133 | 1772 | 474 | 632 |
| Conservative | 578 | 3675 | 171 | 36 | 557 | 1949 | 108 | 605 | 324 | 1670 |
| Labour | 113 | 845 | 610 | 46 | 283 | 296 | 105 | 5513 | 1424 | 691 |
| Liberal Democrat | 15 | 149 | 27 | 6 | 29 | 148 | 38 | 907 | 259 | 404 |
| SNP | 0 | 50 | 54 | 3 | 29 | 0 | 3 | 227 | 65 | 29 |
| UKIP | 23 | 149 | 18 | 3 | 29 | 25 | 6 | 0 | 22 | 29 |
| Green | 15 | 50 | 9 | 1 | 10 | 25 | 9 | 227 | 44 | 58 |
| Other | 0 | 50 | 9 | 0 | 10 | 25 | 3 | 76 | 22 | 29 |

For each combination of political party \(i = 1, \ldots, 8\) (where we include “Did not vote” as a special party) and newspaper \(j = 1, \ldots, 10\), Table 12.7 shows the corresponding count \(n_{i,j}\). The total number is \(n = \sum_i \sum_j n_{i,j} = 29938\), reflecting the sample size of the survey. The loglinear model concerns the expected counts for each combination, \(\mu_{i,j} = n \times P(\text{party} = i, \text{newspaper} = j)\). These expected counts can be estimated by the observed counts \(\hat{\mu}_{ij} = n_{i,j}\).

The full model \[\log \mu_{i,j} = \lambda + \lambda^{(\text{party})}_i + \lambda^{(\text{newspaper})}_j + \lambda^{(\text{party} \times \text{newspaper})}_{i,j}\] has \(8-1 = 7\) parameters \(\lambda^{(\text{party})}_i\) for the “main effect” of party, \(10-1 = 9\) parameters \(\lambda^{(\text{newspaper})}_j\) for the “main effect” of newspaper, and \(7 \times 9 = 63\) parameters \(\lambda^{(\text{party} \times \text{newspaper})}_{i,j}\) for the “interaction” between party and newspaper. Including the intercept \(\lambda\), the total number of parameters is thus \(1 + 7 + 9 + 63 = 80\), which equals the number of observed counts (i.e. combinations of party and newspaper). This saturated model therefore will fit the observed counts perfectly. Its main purpose is to serve as a comparison for simpler models. For this example, the comparison model of interest is a model in which political party voted-for is independent of the newspaper read:

\[\log \mu_{i,j} = \lambda + \lambda^{(\text{party})}_i + \lambda^{(\text{newspaper})}_j\]

The test for independence consists of a model comparison between the saturated and independence model. The result of the likelihood ratio test is \(G^2 = 12211\), \(\text{df} = 63\), \(p < .001\). This is significant, and hence we reject the null hypothesis that the party voted for is independent of the newspaper most read.

12.7.2 A three-way table example: Rock-Paper-Scissors

Guennouni & Speekenbrink (2022) let participants repeatedly play the well-known game of Rock-Paper-Scissors against artificial (AI) opponents. The AI agents were endowed with different human-like strategies. The “level-1” AI player would expect the human player to repeat their last chosen action, and would pick the action that beats this. For example, if the human player had previously chosen “Rock”, then they would expect the human player to play “Rock” again, therefore choosing “Paper” as this is the action that beats “Rock”. The “level-2” AI player would expect the human player to adopt a “level-1” strategy, choosing actions to beat this. For example, if the AI player had previously chosen “Rock”, the human would be expected to next play “Paper”, as this beats “Rock”. The AI player would therefore play “Scissors”, as this beats “Paper” (the expected action of the human player). Results showed that people could pick up on the strategy of the AI player, using this to their advantage. Moreover, they could transfer this learned strategy to other games.

If we focus on the “level-1” AI player, we can say that their actions depend completely on the previous action of the human player.40 If the human player is sensitive to the AI player’s strategy and able to exploit the weaknesses of the AI player, then the human player’s strategy should also completely depend on their own previous actions. For example, they should choose to play “Scissors” after playing “Rock”, as they should expect the AI player to play “Paper”. If this is the case, then we might expect the current action of the human player to be independent of the previous action of the AI player. We can test this hypothesis with log-linear models.

Table 12.8 shows the actions chosen by the human player (in the last half of the game) as a function of the previous actions of the AI and human player.

Table 12.8: Action chosen by human players in Rock-Paper-Scissors by previous actions chosen by both human and the level-1 AI player.
| Previous human | Previous AI | Current: Rock | Current: Paper | Current: Scissors |
|---|---|---|---|---|
| rock | rock | 10 | 19 | 15 |
| rock | paper | 5 | 9 | 12 |
| rock | scissors | 5 | 40 | 107 |
| paper | rock | 94 | 2 | 10 |
| paper | paper | 45 | 12 | 11 |
| paper | scissors | 17 | 7 | 8 |
| scissors | rock | 5 | 8 | 5 |
| scissors | paper | 24 | 94 | 5 |
| scissors | scissors | 15 | 11 | 5 |

There are 9 possible models that can be used for this three-way contingency table. One possible model includes only main effects for the previous human action (H), the previous AI action (A), and the current human action (C). Such a model can be denoted as \((H, A, C)\), where elements separated by a comma are assumed independent:

\[\begin{aligned}(H,A,C)&& \log(\mu_{i,j,k}) = \lambda + \lambda_i^{(H)} + \lambda_j^{(A)} + \lambda_k^{(C)}\end{aligned}\]

If humans played the optimal strategy, then we would expect a dependency between the previous human action and the current human action, but both of these to be independent of the previous AI action. This corresponds to a model with main effects for \(H\), \(A\), and \(C\), and a two-way interaction between \(H\) and \(C\). This model can be denoted as \((HC,A)\), to indicate dependence between \(H\) and \(C\), and independence of both from \(A\):

\[\begin{aligned}(HC,A) && \log(\mu_{i,j,k}) = \lambda + \lambda_i^{(H)} + \lambda_j^{(A)} + \lambda_k^{(C)} + \lambda_{i,k}^{(H \times C)}\end{aligned}\]

Table 12.9 lists the results of all possible log-linear models for these data.

Table 12.9: Results from all possible log-linear models for the Rock-Paper-Scissors data of Table 12.8
Model               \(\text{df}\)   \(G^2\)   \(p(\geq G^2)\)
\((H, A, C)\)                  20     678.4          < .001
\((HA, C)\)                    16     417.9          < .001
\((HC, A)\)                    16     341.5          < .001
\((AC, H)\)                    16     506.2          < .001
\((HA, AC)\)                   12     245.7          < .001
\((HA, HC)\)                   12      81.0          < .001
\((HC, AC)\)                   12     169.3          < .001
\((HA, HC, AC)\)                8      35.6          < .001
\((HAC)\)                       0       0.0            1.00

All the models (apart from the saturated model) are rejected. It therefore seems that the current actions of human players depend both on their own previous actions and on those of their opponent. This implies that they likely have not found the optimal strategy.
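
For readers who want to reproduce Table 12.9, the following sketch fits the hierarchical log-linear models as Poisson GLMs in Python with statsmodels, using the cell counts of Table 12.8. The residual deviance of each model is its \(G^2\) statistic against the saturated model \((HAC)\); assuming these counts are the data underlying the reported analyses, the printed values should match Table 12.9 up to rounding. This is one possible implementation, not necessarily the software used for the analyses above.

```python
# Sketch: fit the log-linear models of Table 12.9 as Poisson GLMs; each model's
# residual deviance equals its G^2 against the saturated model (HAC).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

actions = ["rock", "paper", "scissors"]
counts = [10, 19, 15,   5, 9, 12,    5, 40, 107,
          94, 2, 10,    45, 12, 11,  17, 7, 8,
          5, 8, 5,      24, 94, 5,   15, 11, 5]
d = pd.DataFrame([(h, a, c) for h in actions for a in actions for c in actions],
                 columns=["H", "A", "C"])
d["count"] = counts

models = {
    "(H, A, C)":     "count ~ H + A + C",
    "(HA, C)":       "count ~ H * A + C",
    "(HC, A)":       "count ~ H * C + A",
    "(AC, H)":       "count ~ A * C + H",
    "(HA, AC)":      "count ~ H * A + A * C",
    "(HA, HC)":      "count ~ H * A + H * C",
    "(HC, AC)":      "count ~ H * C + A * C",
    "(HA, HC, AC)":  "count ~ H * A + H * C + A * C",
}
for name, formula in models.items():
    fit = smf.glm(formula, data=d, family=sm.families.Poisson()).fit()
    G2, df = fit.deviance, int(fit.df_resid)
    print(f"{name:15s} df = {df:2d}  G^2 = {G2:6.1f}  "
          f"p = {stats.chi2.sf(G2, df):.4f}")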

12.7.3 Sparse data and empty cells

Problems can arise when tables have many small counts. This can happen when the total sample size is small, or when the number of cells in the table is large relative to the total number of observations. A particular problem is when cells are empty (have a count of 0). As \(\log(0) = -\infty\), this may lead to problems estimating the parameters (the estimates may become infinite). A structural zero refers to a case where the expected count for a cell is truly zero, because a combination of values cannot occur (i.e. the probability of that combination equals 0). Such cells should be excluded (removed) from the analysis. A sampling zero occurs when the expected count is larger than 0, but a particular combination has not been observed in the data due to chance. For such cells, a correction factor may be applied. When zero-count cells are present, Agresti (2018) recommends adding a value of \(\tfrac{1}{2}\) to all cells of saturated models (including cells with non-zero counts). For non-saturated models, however, this correction is too large, and Agresti (2018) recommends adding a much smaller value to each cell (e.g. .00000001, or .0001).
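
As a small illustration of these corrections (a sketch only, with hypothetical counts), one could add the constant to the observed counts before fitting:

```python
# Sketch of the correction for sampling zeros: add 0.5 to every cell before
# fitting a SATURATED model; for non-saturated models a much smaller constant
# (e.g. 1e-8) would be used instead. The counts are hypothetical.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

tab = pd.DataFrame({
    "X": ["a", "a", "b", "b"],
    "Y": ["p", "q", "p", "q"],
    "count": [12, 0, 7, 9],               # one sampling zero
})
tab["count_adj"] = tab["count"] + 0.5     # correction for the saturated model

saturated = smf.glm("count_adj ~ X * Y", data=tab,
                    family=sm.families.Poisson()).fit()
print(saturated.params)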

12.8 Multinomial logistic regression

Logistic regression concerns modelling a dichotomous or binary dependent variable. There are many situations in which we would like to model a polytomous (i.e. non-binary) categorical dependent variable. For instance, in a marketing application, we might be interested in determining the factors that underlie consumers’ choice between multiple products (e.g. A, B, C, or D).

A simple, but suboptimal solution would be to treat the data as a set of binary variables. For instance, if we consider four products, we might construct 4 binary variables, where the first one encodes whether product \(A\) was chosen (\(Y_{1,i} = 1\)) or not (\(Y_{1,i} = 0\)), the second one whether product \(B\) was chosen (\(Y_{2,i} = 1\)) or not (\(Y_{2,i} = 0\)), etc. For each of these variables, we might construct a logistic regression model with the same set of predictors. A problem with this approach is that it ignores a fundamental aspect: these variables are not independent! If we know that \(Y_{1,i} = 1\), then all the other variables (\(Y_{2,i}, Y_{3,i}, \ldots\)) have to be 0. In multinomial logistic regression, a set of binary logistic regression models is estimated simultaneously, so that the dependencies between the response categories are taken into account appropriately.

12.8.1 Baseline category logit

The most common multinomial logistic regression model uses a so-called baseline category logit link function. This approach takes one of the categories (or levels) of the dependent variable as a baseline, and compares all other levels to that baseline category.

\[\log \left( \frac{\pi_j}{\pi_k} \right) = \beta_{j,0} + \beta_{j,1} \times X_1 + \ldots + \beta_{j,m} \times X_m \quad \quad \text{for all } j \neq k\] Here, \(\pi_j\) denotes the probability of category \(j\), and \(k\) is the baseline category. We could also parametrise the comparison of the baseline category against itself, i.e. \(\log \left( \frac{\pi_k}{\pi_k} \right)\). But as \(\log \left( \frac{\pi_k}{\pi_k} \right) = 0\) always, we can simply treat the parameters of this logit as fixed to 0, i.e. \(\beta_{k,0} = \beta_{k,1} = \ldots = \beta_{k,m} = 0\).

12.8.2 Example: Rock-Paper-Scissors

As an example, we re-analyse the Rock-Paper-Scissors data, focusing on a model of the human player’s current action as a function of the previous actions of the human and AI player. We choose “paper” as the baseline category, and use effect-coding for the two predictors (human and AI previous action). As each predictor has three possible values, we need two contrast codes for each. We will estimate effects for previous actions “rock” and “scissors” (again, treating “paper” as a baseline). Our two logits are then

\[\begin{aligned} \log \left(\frac{p(\text{rock})}{p(\text{paper})} \right) &= \beta_{R, 0} + \beta_{R, HR} \times C_{H: R} + \beta_{R, HS} \times C_{H: S} + \beta_{R, AR} \times C_{A: R} + \beta_{R, AS} \times C_{A: S} \\ \log \left(\frac{p(\text{scissors})}{p(\text{paper})} \right) &= \beta_{S, 0} + \beta_{S, HR} \times C_{H: R} + \beta_{S, HS} \times C_{H: S} + \beta_{S, AR} \times C_{A: R} + \beta_{S, AS} \times C_{A: S} \end{aligned}\] Parameter estimates and Wald tests are provided in Table 12.10.
Table 12.10: Results of a baseline category multinomial logistic regression model for human current actions as a function of human and AI previous actions.
                 Current action: rock (vs. paper)                              Current action: scissors (vs. paper)
                 \(\hat{\beta}\)   \(\text{SE}(\hat{\beta})\)   \(z\)   \(p(\geq \lvert z \rvert)\)   \(\hat{\beta}\)   \(\text{SE}(\hat{\beta})\)   \(z\)   \(p(\geq \lvert z \rvert)\)
Intercept         0.131   0.084    1.549     .121          -0.090   0.089   -1.01     .311
H: rock          -1.013   0.127   -7.991   < .001           0.257   0.114    2.25     .024
H: scissors      -0.474   0.112   -4.232   < .001          -0.922   0.128   -7.22   < .001
A: rock           0.334   0.117    2.863     .004          -0.129   0.123   -1.05     .294
A: scissors       0.032   0.124    0.257     .797           0.734   0.113    6.52   < .001

Instead of testing individual coefficients with Wald tests, as in Table 12.10, we may wonder whether a predictor has any effect at all on the dependent variable. For example, if the AI’s previous action has no effect on the human’s current action, that would mean that the null hypothesis \[H_0: \beta_{R, AR} = \beta_{R, AS} = \beta_{S, AR} = \beta_{S, AS} = 0\] is true. We can test this hypothesis by a model comparison, comparing the full model to one in which we fix these parameters to 0 (effectively removing the AI’s previous action as a predictor from the model). This model comparison provides the following test result: \(-2 \log(\text{likelihood-ratio}) = 252.269\), \(\text{df} = 4\), \(p < .001\). Hence, we can conclude that the human players’ current actions do depend on the AI’s previous actions. This confirms the results of the log-linear models used earlier.
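
A sketch of how this model and the likelihood-ratio test could be fitted in Python with statsmodels is given below. Trial-level data are reconstructed by expanding the cell counts of Table 12.8. The software's default contrast and baseline choices may differ from those behind Table 12.10, so the individual coefficients need not match that table; the likelihood-ratio test, however, does not depend on the coding and should be close to the value reported above.

```python
# Sketch: baseline-category multinomial logit and a likelihood-ratio test for
# the effect of the AI's previous action.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

actions = ["rock", "paper", "scissors"]
counts = [10, 19, 15,   5, 9, 12,    5, 40, 107,
          94, 2, 10,    45, 12, 11,  17, 7, 8,
          5, 8, 5,      24, 94, 5,   15, 11, 5]
cells = pd.DataFrame(
    [(h, a, c) for h in actions for a in actions for c in actions],
    columns=["prevH", "prevA", "curr"])
cells["count"] = counts
trials = cells.loc[cells.index.repeat(cells["count"].to_numpy())].reset_index(drop=True)
trials["y"] = trials["curr"].map({"paper": 0, "rock": 1, "scissors": 2})  # 0 = baseline

# Full model: current action depends on both previous actions (effect-coded)
full = smf.mnlogit("y ~ C(prevH, Sum) + C(prevA, Sum)", data=trials).fit(disp=False)
# Reduced model: drop the AI's previous action
reduced = smf.mnlogit("y ~ C(prevH, Sum)", data=trials).fit(disp=False)

lr = 2 * (full.llf - reduced.llf)
df = int(full.df_model - reduced.df_model)   # 2 contrast codes x 2 non-baseline logits = 4
print(f"-2 log(LR) = {lr:.3f}, df = {df}, p = {stats.chi2.sf(lr, df):.4g}")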

12.8.3 Reconstructing probabilities of responses

After estimating a multinomial logistic regression model, we are often interested in the predicted probabilities of each category. These can be derived through the inverse link function: \[\hat{\pi}_j = \frac{\exp \left( \beta_{j,0} + \beta_{j,1} \times X_1 + \ldots + \beta_{j,m} \times X_m \right)}{\sum_{l=1}^K \exp \left(\beta_{l,0} + \beta_{l,1} \times X_1 + \ldots + \beta_{l,m} \times X_m \right)}\] Note that the sum in the denominator also includes the baseline category \(k\), whose parameters are all fixed to 0. For example, if the human player previously played “Rock” (\(C_{H: R} = 1\), \(C_{H: S} = 0\)) and the AI previously played “Paper” (\(C_{A: R} = -1\), \(C_{A: S} = -1\)), we can work out the probability of each possible current action by computing the following quantities:

action         linear predictor                                                                              value     exp(value)   probability
C: rock        \(0.131 + (-1.013) \times 1 + (-0.474) \times 0 + 0.334 \times -1 + 0.032 \times -1\)         -1.248    0.287        0.149
C: paper       \(0\)                                                                                          0.000    1.000        0.518
C: scissors    \((-0.09) + 0.257 \times 1 + (-0.922) \times 0 + (-0.129) \times -1 + 0.734 \times -1\)       -0.438    0.645        0.334

For each action, we start by computing the value of the linear predictor. These values are then transformed by taking the natural exponent. Finally, we divide each of these exponentiated values by the sum of all exponentiated values to arrive at the predicted probability. For example, the probability that the action is “Rock” is computed as \[\begin{aligned} p(\text{rock}) &= \frac{\exp(-1.248)}{\exp(-1.248) + \exp(0) + \exp(-0.438)} \\ &= 0.149 \end{aligned}\]
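
The same calculation takes only a few lines in Python (NumPy), using the rounded estimates from Table 12.10; small discrepancies with the table above are possible because the coefficients are rounded to three decimals.

```python
# Worked calculation of the predicted probabilities, using the (rounded)
# estimates from Table 12.10 and the stated predictor codes.
import numpy as np

eta = np.array([
    0.131 + (-1.013) * 1 + (-0.474) * 0 + 0.334 * -1 + 0.032 * -1,    # rock
    0.0,                                                              # paper (baseline)
    -0.090 + 0.257 * 1 + (-0.922) * 0 + (-0.129) * -1 + 0.734 * -1,   # scissors
])
probs = np.exp(eta) / np.exp(eta).sum()   # inverse baseline-category logit ("softmax")
print(np.round(eta, 3))    # linear predictors: approximately [-1.248, 0.000, -0.438]
print(np.round(probs, 3))  # probabilities match the table above up to rounding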

12.8.4 Alternative logit models for ordinal categories

When the categories of the dependent variable can be meaningfully ordered, this ordering can be used to construct a different set of logit models. All these formulations work with odds, but the odds are defined differently in each case. Some common examples are listed below (a small numerical illustration follows the list):

  • Cumulative logit. \(\log \left( \frac{\pi_1}{\pi_2 + \pi_3 + \ldots} \right)\), \(\log \left( \frac{\pi_1 + \pi_2}{\pi_3 + \pi_4 + \ldots} \right)\), \(\log \left( \frac{\pi_1 + \pi_2 + \pi_3}{\pi_4 + \ldots} \right)\), etc.
  • Adjacent category logit. \(\log \left( \frac{\pi_1}{\pi_2} \right)\), \(\log \left( \frac{\pi_2}{\pi_3} \right)\), \(\log \left( \frac{\pi_3}{\pi_4} \right)\), etc.
  • Continuation ratio logit. \(\log \left( \frac{\pi_1}{\pi_2} \right)\), \(\log \left( \frac{\pi_1 + \pi_2}{\pi_3} \right)\), \(\log \left( \frac{\pi_1 + \pi_2 + \pi_3}{\pi_4} \right)\), etc.
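
To make these definitions concrete, the following snippet computes each set of logits for a hypothetical probability vector over four ordered categories (pure NumPy):

```python
# Numerical illustration of the three logit types for a hypothetical
# probability vector over 4 ordered categories.
import numpy as np

pi = np.array([0.1, 0.2, 0.3, 0.4])   # hypothetical category probabilities
cum = np.cumsum(pi)                   # cumulative probabilities

cumulative   = [np.log(cum[l] / (1 - cum[l])) for l in range(3)]   # pi_{1..l} vs the rest
adjacent     = [np.log(pi[l] / pi[l + 1]) for l in range(3)]       # pi_l vs pi_{l+1}
continuation = [np.log(cum[l] / pi[l + 1]) for l in range(3)]      # pi_{1..l} vs pi_{l+1}
print(np.round(cumulative, 3), np.round(adjacent, 3), np.round(continuation, 3))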

We can choose any set of logits, as long as it is possible to reconstruct the probability of each outcome from the set. In principle, we could also use one of these logit formulations for unordered categories. The model as a whole would fit the data equally well, no matter what valid set of logits is chosen. Similarly to the approach we took in repeated-measures ANOVA, these logits essentially define contrasts on a dependent variable. Any valid contrast will do, but choosing a good set of contrasts will provide a more informative set of parameters.

Alternative logit formulations for ordered categories are mainly useful in combination with restrictions on the parameters, in particular the assumption that the slope of each predictor is identical for all logits. For example, the Proportional Odds Model uses the cumulative logit formulation and assumes the effect of a predictor to be identical for each of the cumulative logits:

\[\log \left( \frac{\sum_{k=1}^l \pi_k}{\sum_{k=l+1}^K \pi_k} \right) = \beta_{0,l} + \sum_{j=1}^m \beta_j \times X_j \quad \quad l = 1, \ldots, K-1\] Note that each cumulative logit has its own intercept \(\beta_{0,l}\), but the slopes \(\beta_j\) are shared across the logits. This reduces the number of parameters quite drastically compared to an unconstrained multinomial logistic regression model.
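
As an illustration, the sketch below simulates ordinal data from a proportional-odds process and fits the model. It assumes the OrderedModel class available in recent versions of Python's statsmodels; other software offers equivalent routines, and this is only one way to fit such a model.

```python
# Sketch: simulate data from a proportional-odds (cumulative logit) process
# and fit the model with statsmodels' OrderedModel. Parameter values are
# arbitrary illustrations.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
# latent-variable formulation: one shared slope (1.2), cut-points at -1 and 1
latent = 1.2 * x + rng.logistic(size=n)
y = pd.Series(pd.Categorical(np.digitize(latent, [-1.0, 1.0]), ordered=True))

mod = OrderedModel(y, x.reshape(-1, 1), distr="logit")
res = mod.fit(method="bfgs", disp=False)
print(res.params)   # one shared slope plus K - 1 threshold parameters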

12.9 Generalized linear mixed-effects models

As in the linear model, the generalized linear model can be extended by including random effects. Whilst estimation of such generalized linear mixed-effects models is more difficult than for linear mixed-effects models, conceptually, the models are similar. Again, we can write each intercept and slope as consisting of a fixed and a random part, and assume that each random effect is drawn from a (multivariate) Normal distribution, with mean 0.
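
To make this structure concrete, the following sketch simulates data from a logistic mixed-effects model with participant-wise random intercepts and random slopes (pure NumPy); all parameter values are arbitrary illustrations.

```python
# Sketch: simulate a logistic mixed-effects model. Each participant's intercept
# and slope consist of a fixed part plus a Normal random effect with mean 0.
import numpy as np

rng = np.random.default_rng(2024)
n_participants, n_trials = 30, 50
gamma0, gamma1 = 0.5, 1.0            # fixed effects (assumed values)
sd_u0, sd_u1 = 0.8, 0.4              # random-effect standard deviations (assumed)

u0 = rng.normal(0, sd_u0, n_participants)   # random intercepts, mean 0
u1 = rng.normal(0, sd_u1, n_participants)   # random slopes, mean 0

x = rng.normal(size=(n_participants, n_trials))
eta = (gamma0 + u0[:, None]) + (gamma1 + u1[:, None]) * x   # participant-specific linear predictor
p = 1 / (1 + np.exp(-eta))                                  # inverse logit link
y = rng.binomial(1, p)                                      # Bernoulli responses
print(y.shape, y.mean())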

Table 12.11 shows the results for a logistic mixed-effects regression model, using data from all participants of the metacognition experiment (see Section 12.5.2). In addition to the fixed effects of Confidence, Contrast, and their interaction, the model includes participant-wise random intercepts and slopes for Confidence and Contrast. Random effects are assumed independent, and to aid estimation, Confidence and Contrast were scaled by a \(Z\)-transformation. Scaling predictors can be crucial when estimating generalized linear mixed-effects models.

Table 12.11: Estimates and tests for the fixed-effects of a logistic mixed-effects model with random intercepts and slopes for Confidence and Contrast. Note: Confidence and Contrast are scaled by a Z-transformation.
\(\hat{\beta}\) \(\text{df}\) \(G^2\) \(p(\geq G^2)\)
Intercept 1.723 1 45.6 < .001
Confidence 1.177 1 62.2 < .001
Contrast 1.210 1 49.5 < .001
Confidence\(\times\)Contrast 0.814 1 255.3 < .001

The results show significant and positive effects of Confidence and Contrast, as well as their interaction. This confirms the results obtained for models fitted to the data of the individual participants (see Tables 12.1 and 12.2).

Table 12.12 shows the results for a Poisson mixed-effects model for the gestures data (see Section 12.6.1). Accounting for the dependence in the data by including random intercepts for each participant, we now only find a significant effect of Context. The effects of Gender and the Gender by Language interaction are no longer significant for this model.

Table 12.12: Estimates and tests for the fixed-effects of a Poisson mixed-effects model with random intercepts for participants.
\(\hat{\beta}\) \(\text{df}\) \(G^2\) \(p(\geq G^2)\)
Intercept -1.031 1 59.31 < .001
Context 0.054 1 7.28 .007
Language 0.040 1 0.34 .561
Gender 0.060 1 0.74 .388
Context \(\times\) Language -0.037 1 3.43 .064
Context \(\times\) Gender -0.019 1 0.94 .331
Language \(\times\) Gender -0.016 1 0.06 .814
Context \(\times\) Language \(\times\) Gender 0.034 1 2.88 .090

12.10 In practice

Generalized linear models are more general than the General Linear Model (GLM), and are useful for analysing a wide variety of data. A linear model is used for the structural part. For this linear model, similar principles apply as for the GLM: we can use suitable contrast codes to incorporate categorical predictors, and use polynomial regression techniques to allow for non-linear relations. Regarding the latter, it is important to keep in mind that generalized linear models already involve a transformation of the model predictions (via the link function). The assumption of linearity applies to the relation between the predictors and the link-transformed conditional mean of the dependent variable.

Steps involved in analysing data with generalized linear models are:

  1. Choose a sensible distribution (family) for the random part of the model. This can usually be based on the nature of the dependent variable. For binary data, a Bernoulli distribution is generally the right choice. For count (frequency) data, if the maximum possible count is fixed by design, a Binomial distribution makes sense. If the maximum possible count is not fixed, a Poisson distribution makes sense. For polytomous dependent variables, a multinomial distribution is sensible. If you are interested in determining dependence relations between (polytomous) categorical variables, log-linear models are a natural choice.

  2. Choose a useful link function. The link function defines how (a linear function of) the predictors relates to the conditional mean of the dependent variable. Hence, this choice is important. It is often safe to use the canonical link function, but for specific cases you may want to choose an alternative or define your own link function.

  3. If there are repeated measures or other dependencies between data points, random effects can be added to the model to account for this. Random effects can account for the overdispersion that these dependencies give rise to. Overdispersion can also be dealt with by using a quasi-generalized-linear model. But if overdispersion is due to dependencies in the data, it is better to deal with the added complexity of generalized linear mixed-effects models rather than that of quasi-generalized-linear models.

  4. Estimate and assess the fit of the model. Tests of the overall fit are important here, as they may, for instance, signal overdispersion. When a model does not fit the data well, check whether the excessive deviance is due to a few cases. If not, then consider a different link function, a different probability distribution, or even a “quasi” distribution (in this order). A small software sketch of this step and the next follows this list.

  5. Perform inferential tests for the model parameters. In almost all cases, it is advisable to use likelihood-ratio tests for this purpose, or profile or bootstrap confidence intervals, rather than Wald tests or confidence intervals based on the standard Normal approximation. The latter are easier to compute and useful in exploratory model building, but not so much in confirmatory tests.

  6. Report the results. In reporting results, make sure you specify (1) the distribution assumed for the (conditional) dependent variable, (2) the link function used, (3) all the predictors used in the model.
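
The following is a small sketch of steps 4 and 5 for a simple Poisson regression, using Python's statsmodels and simulated placeholder data; the same logic applies to other distributional families and to other software.

```python
# Sketch of steps 4 (assess fit / overdispersion) and 5 (likelihood-ratio test)
# for a Poisson regression; the data and variable names are placeholders.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(0)
d = pd.DataFrame({"x": rng.normal(size=200), "g": rng.choice(["a", "b"], size=200)})
d["y"] = rng.poisson(np.exp(0.2 + 0.5 * d["x"].to_numpy()))

full    = smf.glm("y ~ x + C(g)", data=d, family=sm.families.Poisson()).fit()
reduced = smf.glm("y ~ x",        data=d, family=sm.families.Poisson()).fit()

# Step 4: overall fit check; a deviance much larger than its degrees of
# freedom suggests overdispersion.
print(full.deviance, int(full.df_resid))

# Step 5: likelihood-ratio (G^2) test for the effect of g.
G2 = 2 * (full.llf - reduced.llf)
df = int(reduced.df_resid - full.df_resid)
print(f"G^2 = {G2:.2f}, df = {df}, p = {stats.chi2.sf(G2, df):.3f}")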

12.11 Further reading

This chapter can only provide an introduction to generalized linear models. Whole books have been written on the topic. If you want to read more about generalized linear models, Agresti (2018) provides a good and relatively readable introduction. Dobson & Barnett (2018) provide a more succinct and technical text. Fox (2015) provides an extensive and unified account of the GLM and extensions into generalized linear models, covering much more ground than I am able to do here. There are also sources covering more specialized topics. Hosmer, Lemeshow, & Sturdivant (2013) offer a readable and practical book on logistic regression. Wickens (2014) provides a very extensive book on log-linear models.

References

Agresti, A. (2018). An introduction to categorical data analysis. John Wiley & Sons.
Dobson, A. J., & Barnett, A. G. (2018). An introduction to generalized linear models. Chapman; Hall/CRC.
Dunn, P. K., & Smyth, G. K. (2018). Generalized linear models with examples in R. Springer.
Fox, J. (2015). Applied regression analysis and generalized linear models. Sage Publications.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Guennouni, I., & Speekenbrink, M. (2022). Transfer of learned opponent models in zero sum games. Computational Brain & Behavior, 5, 326--342.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression. John Wiley & Sons.
McCullagh, P., & Nelder, J. A. (2019). Generalized linear models. Routledge.
McFadden, D. (1973). Conditional logit analysis of qualitative choice behavior.
Rausch, M., & Zehetleitner, M. (2016). Visibility is not equivalent to confidence in a low contrast orientation discrimination task. Frontiers in Psychology, 7, 591.
Venzon, D., & Moolgavkar, S. (1988). A method for computing profile-likelihood-based confidence intervals. Journal of the Royal Statistical Society: Series C (Applied Statistics), 37, 87–94.
Wickens, T. D. (2014). Multiway contingency tables analysis for the social sciences. Psychology Press.
Winter, B., & Bürkner, P.-C. (2021). Poisson regression for linguists: A tutorial introduction to modelling count data with brms. Language and Linguistics Compass, 15, e12439.

  1. If you really want to know, the probability density function of exponential family models can be written as \[p(y|\theta, \phi) = \exp \left( \frac{y \theta - b(\theta)}{a(\phi)} + c(y, \phi) \right),\] where \(\theta\) and \(\phi\) are parameters, and \(a\), \(b\), and \(c\) are known functions. The parameter \(\theta\) is called the canonical parameter, and can be written as a function of the mean of \(Y\): \(\theta = g(\mu)\). The function \(g\) is called the canonical link function. The parameter \(\phi > 0\) is the dispersion parameter. For some distributions, such as the Bernoulli and Poisson distribution, the dispersion parameter is fixed to \(\phi = 1\).↩︎

  2. The UK financial market was rather in turmoil on 29 September 2022 after the UK government released its “mini-budget”.↩︎

  3. As mentioned in a previous footnote, the canonical link function is the function that relates the canonical parameter of the exponential family distribution in question to the mean, \(\theta = g(\mu)\).↩︎

  4. This involves repeatedly simulating data from the estimated model (or resampling the observed data) and obtaining parameter estimates for each simulated data set. This leads to a set of parameter estimates, which can be ordered. You can then compute a bootstrap confidence interval as the range of estimates excluding the most extreme estimates. For example, a 95% confidence interval would be computed by excluding the 2.5% lowest and 2.5% highest parameter estimates; the lower confidence bound is then the lowest estimate, and the upper bound the highest estimate, in the set that remains.↩︎

  5. Pseudo-\(R^2\) measures have been suggested for generalized linear models, such as McFadden’s, Cox and Snell’s, and Nagelkerke’s. These measures are based on the log likelihood ratio comparing the fitted model and a null model with only an intercept. Whilst the aim of these measures is to have a statistic with a similar scale as the traditional \(R^2\) for linear models, they don’t generally succeed in this. Hence, I don’t recommend their use. If you are interested, more information on these measures is readily available elsewhere.↩︎

  6. That is because, under the assumption of conditional independence of the observations: \[\begin{aligned} \frac{p(Y_1, \ldots, Y_n|\text{MODEL R})}{p(Y_1, \ldots, Y_n|\text{MODEL S})} &= \frac{\prod_{i=1}^n p(Y_i|\text{MODEL R})}{\prod_{i=1}^n p(Y_i|\text{MODEL S})} \\ &= \prod_{i=1}^n \frac{p(Y_i|\text{MODEL R})}{p(Y_i|\text{MODEL S})}\\ \log\left( \prod_{i=1}^n \frac{p(Y_i|\text{MODEL R})}{p(Y_i|\text{MODEL S})} \right) &= \sum_{i=1}^n \log \left( \frac{p(Y_i|\text{MODEL R})}{p(Y_i|\text{MODEL S})} \right) \\ &= \sum_{i=1}^n D_{R,i} \end{aligned}\]↩︎

  7. Standardized Pearson residuals are defined as \[\begin{equation} R^\text{Pearson}_i = \frac{Y_i - \hat{\mu}_i}{\sqrt{\hat{V}(Y_i|X_{1,i}, \ldots, X_{m,i})(1-h_i)}} \end{equation}\] where \(\hat{V}(Y_i|X_{1,i}, \ldots, X_{m,i})\) denotes the (estimated) variance of \(Y_i\) for cases with predictor values \(X_{1,i}, \ldots, X_{m,i}\), and \(h_i\) is the leverage value.↩︎

  8. As we are only considering a single outcome, \(n = 1\) and we do not need to worry about how many possible outcomes could constitute \(k = 0, 1, 2, \ldots\) correct ones, which is why we don’t need the \(\binom{n}{k}\) term.↩︎

  9. This is because at \(p(y=1) = .5\), \[\log \frac{p(Y=1)}{p(Y=0)} = \log \left(\frac{.5}{.5} \right) = \log(1) = 0\]. And \(\beta_0 + \beta_1 \times X = 0\) when \(X = -\frac{\beta_0}{\beta_1}\). ↩︎

  10. We need one parameter fewer than the number of cells because, when \(n\) is known, the last remaining cell can be filled in as \(n\) minus the frequencies in the other cells.↩︎

  11. In the experiment, some noise was added to the actions of the AI player, who would choose a completely random action on 10% of the rounds.↩︎