# Chapter 1 Introduction

In this chapter we will introduce some fundamental concepts, such as experiments, data, analysis, and modelling. We will also introduce you to Paul the octopus.

## 1.1 Paul the octopus

We know that all octopus have nine brains so we know he has exceptional powers.
Oliver Walenciak, Paul’s keeper
Source: Irish central

Paul the Octopus (26 January 2008 – 26 October 2010) was born in the Sea Life Centre in Weymouth, England, and subsequently moved to the aquarium chain’s centre in Oberhausen, Germany. It was there he found world-wide fame as a cephalopod oracle. Figure 1.1: Paul the octopus predicts Spain to win the final of the 2010 FIFA World Cup. Source: The Guardian

Paul began his career during the UEFA Euro 2008 football tournament. In the lead-up to Germany’s international football matches, Paul was presented with two clear plastic boxes, each containing food (a mussel or an oyster). Each container was marked with the flag of a team, one the flag of Germany, and the other the flag of Germany’s opponent. The box which Paul opened first (and ate its contents) was deemed to be the predicted winner of the match. Paul predicted Germany to win all of their games, a correct prediction in 4 out of 6 cases. He failed to predict their defeats by Croatia in the group stage, and by Spain in the championship’s final.

Table 1.1: Paul’s predictions for the UEFA Euro 2008
Match Prediction Result Outcome
Germany-Poland Germany 2-0 correct
Croatia-Germany Germany 2-1 incorrect
Austria-Germany Germany 0-1 correct
Portugal-Germany Germany 2-3 correct
Germany-Turkey Germany 3-2 correct
Germany-Spain Germany 0-1 incorrect

Two years later, during the 2010 FIFA World Cup football tournament, Paul obtained celebrity status and his divinations were broadcast live on German TV. This time, Paul made correct predictions for all matches in which Germany played, as well as the final between Spain and the Netherlands.

Table 1.2: Paul’s predictions for the 2010 FIFA World Cup
Match Prediction Result Outcome
Germany-Australia Germany 4-0 correct
Germany-Serbia Serbia 0-1 correct
Ghana-Germany Germany 0-1 correct
Germany-England Germany 4-1 correct
Argentina-Germany Germany 0-4 correct
Spain-Germany Spain 1-0 correct
Uruguay-Germany Germany 2-3 correct
Netherlands-Spain Spain 0-1 correct

Paul’s record of 8/8 correct predictions in the 2010 world cup is quite amazing. The common octopus (Octopus vulgaris) is indeed quite an intelligent species. But was Paul really psychic? Could he really foresee the future?

## 1.2 Experiments and observations

We can view Paul’s divinations as the result of an experiment to test his psychic abilities. Paul’s predictions were derived under more or less controlled conditions: The keeper didn’t put Paul’s favourite food in the box representing the team he’d hope to win, and didn’t always place the box representing Germany on one side of the tank. Had he done one or more of these things, he might have biased the results.

A paradigmatic example of an experiment is a randomized-control trial, as common in studies assessing the effectiveness of medication or other therapeutic interventions. In such experiments, a sample of participants is taken from a population of interest, and each participant is randomly assigned to either an experimental condition or control condition. For example, participants in the experimental condition might be given a new drug, while participants in the control condition receive a placebo. Ideally, this is done in a “double-blind” setting, where neither the participant nor the person delivering the treatment or otherwise interacting with the participant knows which condition the participant is assigned to. Randomization is key here: By letting chance determine who gets which treatment, characteristics of the participants (e.g. age, sex, severity of symptoms, etc.) will not be tied to the treatment they get. As such, it will be reasonable to assume that the experimental and control condition differ only in their treatment, such that any differences in outcome can be attributed to the treatment, and not to other differences between the groups of participants.

Data obtained under less stringent conditions is often referred to as observational data. For instance, rather than random assignment, we could let the doctor choose which treatment to give to which patient. The doctor might choose to give the new drug to the more severe cases, and the placebo to the less severe cases. Initial severity of disease is now confounded with the treatment, and any difference in outcome success observed between the experimental and control condition might be due to the treatment, or to the initial differences in severity. Or even to a third variable such as age, if it is itself related to both initial severity of disease as well as the effect of the drug. We wouldn’t be able to tell, and hence the conclusions we can draw about the effects of the treatment are much less straightforward. We are not completely helpless though. Using statistics, we can attempt to control for pre-existing differences in age and initial disease severity. As we will see in due course, we can attempt to build a model which includes both treatment and initial disease severity as predictors of outcome. In doing so, we can attempt to estimate the unique effect of treatment on outcome. By considering only those differences in outcome that cannot be attributed to initial severity or age, we can still say something meaningful about treatment effects. Nevertheless, there could always be other confounding factors that we haven’t though of. As such, conclusions about causality (whether it is truly the difference in medication that caused a difference in outcome), are not as straightforward in observational studies.

## 1.3 Data

Data are a collection of measured or otherwise observed or known values of variables. A variable is, as the name implies, something that can take a variety of values. In the medical example, it might be the severity of the symptoms after 3 weeks of treatment, while in Paul’s case, it might be the correctness of his predictions. In a statistical model, a dependent variable is a variable of interest, a variable which we aim to predict, explain, or describe. An independent variable is a variable which we use to help predict, explain, or describe the dependent variable. For instance, in the medical example above, the dependent variable would be the severity of symptoms, while independent variables would be the treatment (medication or placebo), and potentially other factors such as patients’ age, sex, etc. Whether a variable is a dependent or independent variable may differ from model to model. As such, it is an aspect of the analysis, and not so much of the design of a study.

Paul has provided us with two sets of data. The first contains his predictions for the UEFA Euro 2008 cup, and the second his predictions for the 2010 FIFA World Cup. The dependent variable in both sets is the accuracy of Paul’s predictions. Paul’s “predictions” were of course not directly predictions who was going to win each game. Paul made a choice which of two containers to open first. But we’re not really interested in whether Paul opened the left or right container in his tank, or whether Paul chose the container with the Spanish flag or the container with the Dutch flag. What we are interested in is whether the first container to be opened contained the flag of the team that would win the match played that day. If it did, Paul made a correct prediction, if not, Paul made an incorrect prediction. The data then contains information about what we deem relevant to our purposes. In a sense, looking just at the accuracy of Paul’s predictions, we have already abstracted quite a bit from the experiment. We ignore whether Paul went left or right first, how long it took him to open the first container, … These are all aspects of Paul’s behaviour which we could have focused on, and hence could be variables in our data. These variables have different characteristics. For instance, accuracy (correct or incorrect) and direction (left or right container) both have only two possible values, while the duration until opening the first container has many possible values. Variables with a finite (limited) number of values are called discrete, while variables with an infinite (unlimited) number of values are called continuous.2 In practice, all data is discrete. Even if we were to measure the time it took Paul to open the left or right box first, when we have to store the data, or otherwise make it available for analysis, we simply wouldn’t be able to express such a number in finite time or an otherwise finite manner which can be represented in a computer. It would take an infinite amount of time to read a continuous number aloud, and an infinite amount of memory to store this on a computer.

### 1.3.1 Measurement scales

In addition to whether data is discrete or continuous, it is useful to distinguish between different scales of measurement. The measurement scale of a variable determines how we can assign numbers to its values, and how meaningful computations with those numeric values are. For instance, if you determine whether people are male or female, you can assign numbers to this variable “biological sex”, such as $$\text{female} = 2$$ and $$\text{male} = 1$$, but it doesn’t make that much sense to treat these numbers in the usual way. For instance, that we assigned a higher value to females does not imply that they have more of “biological sex”. And if we calculate an average of these numbers for a group of people, it might be 1.67. But a statement like “the average biological sex in this group was 1.67” is pretty meaningless.

There are four scales of measurement:

• A nominal scale allows identifying values as “identical” or “different”, but nothing more. Examples of variables on a nominal scale are pizza toppings, eye colour, etc. While we can assign numbers to the values (e.g., brown = 1, blue = 2, green = 3), these are arbitrary and can be changed without much consequence (e.g. brown = 21, blue = 7, green = 1).
• An ordinal scale allows values to be identified as “smaller”, “identical”, or “larger”, but nothing more. For instance, first, second, or third to cross the finish line in a race. Variables on an ordinal scale can be meaningfully ordered or ranked. This means there is some constraint to numbers we can assign to values.
• An interval scale allows items to be assigned more meaningful numbers. On an interval scale, the differences between numbers can themselves be meaningfully ordered. Temperature is a common example. The difference between 16 and 14 degrees Celsius is smaller than the difference between 19 and 16 degrees Celsius. But we can not say that 30 degrees Celsius is twice as warm as 15 degrees Celsius. This is because an interval scale doesn’t have a fixed 0 point. For degrees Celsius, 0 degrees is the point at which water freezes. It is not the point at which there is no temperature.
• A ratio scale has all characteristics an interval scale has, with the addition of an absolute 0 point. Weight is a common example. Weight cannot be smaller than 0, and 0 indicates no weight at all. Not only can we say that the difference between 3 and 2 kilograms is smaller than the difference between 10 and 5 kilograms, we can also say that 10 kilograms is twice as heavy as 5 kilograms.

Sometimes it is tricky to determine what the measurement scale is. For instance, are “correct” and “incorrect” ordered (correct is better than incorrect) or nominal? Most of the time, a courser distinction is enough, namely whether data is categorical (nominal or ordinal) or metric (interval or ratio).

### 1.3.2 The Data Generating Process

In addition to considering the type or measurement scale of the data, you will generally also need to consider the process that in the end led to the data at hand. This process can be called the Data Generating Process.

As the section on experimental vs observational data indicated, the way the data were obtained can affect and limit the inferences that can be made. It is also important to consider to which group or population our data speaks. This is generally related to how we have selected the subjects of a study. For instance, in the case of Paul, we have data from a single octopus, and we are mostly interested in assessing whether he was psychic. We are not interested in Paul’s psychic abilities for just those games for which he made predictions though. If Paul indeed had psychic abilities (but let’s restrict that to the context of football games), we would expect him to make accurate predictions for other games as well. So here we would like to generalize beyond the data. We would like to generalize from the games for which Paul actually made predictions to all games in the UEFA Euro 2008 and 2010 FIFA World cups, or even to all international football matches. We then need to consider how representative the data is for the more general group we want to make claims about. As Paul mainly made predictions for games in which one of the teams was Germany, the sample is not a good representation of all the games played in both cups. If it is in one way or the other easier to determine the winner in games involving Germany than it is to determine the winner in other games, then the test of Paul’s abilities that was conducted is different than the test that would be conducted if we included other games. This, then, limits the claims we can make from the test. If we were interested in whether octopuses in general have psychic abilities, then using a single octopus would not be a great choice. Paul may be unique in many ways, so again he might not be representative of all octopuses.

As in experiments, where random assignment to conditions can avoid confounds, random sampling can be used to make it more likely that data collected is representative. If we are interested in Paul’s predictions for all international matches, it is practically undoable to get him to make predictions for all of these. But by randomly sampling a set of international matches, we will not be able to let this set depend on certain characteristics (e.g. Germany playing) which might bias the results.

Determining the representativeness of available data for a larger group of possible observations is usually not something that can be done with absolute certainty. And unless we have access to more data, it is generally also not something that statistical analysis can help with. Considerations of representativeness usually rest on (ideally sound) reasoning.

The Data Generating Process (or DGP for short) is a concept we will encounter repeatedly throughout this book. The DGP contains everything of importance when considering the variability of the data in the context of other data that could have been collected. This includes how subjects were sampled (e.g., if we randomly sample octopuses then other possible data sets would contain data from other octopuses), how subjects are assigned to conditions (if any), but most importantly also how subjects generate responses (or otherwise observable events), and any possible source of randomness in this generation. For instance, during a particular prediction, Paul might initially go for the box on the left, but then a ray of sunlight hits the box on the right, leading Paul to open that box first. Such small events can be considered a source of random noise in the experiment, that lead to variability in Paul’s behaviour, and ultimately to variability in the correctness of his predictions. As a concept, the purpose of the DGP is to describe all such sources of variability as they impact the variability of a particular variable of interest. The DGP then describes not only how the data set that we have access to was actually generated, but also how all possible other data sets would have been generated. This makes the DGP a rather abstract concept. Nevertheless, we see it is something real, something “out there in the world”, in contrast to statistical models, which inhabit the world of mathematics. For statistics, the most important aspect of the DGP is that it provides a distribution over the values of a variable if the DGP was “let loose” and generated all possible data sets (of which the actual data set we have access to is only one). For instance, this would be the frequency of correct and incorrect predictions that Paul makes in every possible version of the experiment (so every possible match, every possible assignment of teams to the left or right box, and perhaps every possible small disturbance in light conditions, etc.) This distribution is sometimes also referred to as the population distribution. If this all sounds a little too abstract at the moment, don’t worry. We will discuss the DGP in more detail later.

## 1.4 Exploring and describing data

Once you have collected data, you may feel the urge to immediately apply some statistical model or test to it. But really, a first step should be to explore the data. Statistical tests are designed to confirm or reject hypotheses, and part of confirmatory data analysis. They are a very important part of scientific research, but they always rely on assumptions about the data generating process. Before conducting these analyses, it is crucial to assess whether these assumptions are reasonable, and this is where exploratory data analysis can help. Tukey (1977), a major proponent of exploratory data analysis, points out that exploratory data analysis can also suggest new hypotheses about the causes of observed phenomena, and with that provide a basis for further experimentation and data collection.

As the distribution of Paul’s predictions is not so interesting to look into exploration of data, we will focus on other data. We will stay in the context of the 2010 FIFA World Cup, and consider team statistics such as the number of goals scored and the number of games played.

Table 1.3: Team statistics in the 2010 FIFA World cup
Team Matches played Goals for Goals against Yellow cards Direct red cards
Algeria 3 0 2 8 0
Argentina 5 10 6 7 0
Australia 3 3 6 7 2
Brazil 5 9 4 9 1
Cameroon 3 2 5 5 0
Chile 4 3 5 15 0
Côte d’Ivoire 3 4 3 5 0
Denmark 3 3 6 6 0
England 4 3 5 6 0
France 3 1 4 6 1
Germany 7 16 5 13 0
Ghana 5 5 4 11 0
Greece 3 2 5 5 0
Honduras 3 0 3 7 0
Italy 3 4 5 5 0
Japan 4 4 2 7 0
Korea DPR 3 1 12 2 0
Korea Republic 4 6 8 6 0
Mexico 4 4 5 9 0
Netherlands 7 12 6 24 0
New Zealand 3 2 2 6 0
Nigeria 3 3 5 5 1
Paraguay 5 3 2 9 0
Portugal 4 7 1 8 1
Serbia 3 2 3 10 0
Slovakia 4 5 7 11 0
Slovenia 3 3 3 9 0
South Africa 3 3 5 4 1
Spain 7 8 2 8 0
Switzerland 3 1 1 8 1
Uruguay 7 11 8 11 1
USA 4 5 5 9 0

### 1.4.1 Summary statistics

Summary statistics are calculated values that aim to summarize key aspects of the distribution of observed values of a variable. Two main aspects of such distributions are (a) where the values are generally located within the range of possible values, and (b) the variation in the values. Measures of the location of the values aim to reflect the central value, in some form or another, of a distribution. Measures of spread aim to reflect the level of variability around the central value.

#### 1.4.1.1 Measures of location

There are three common measures of location: the mean, median, and mode. Let’s consider each in turn.

The mean is the average of a set of values, computed by adding all the values up and then dividing this sum by the total number of values. More formally, the mean of a variable $$Y$$ is usually denoted by drawing a straight line over the variable (i.e. as $$\overline{Y}$$), and the formula for its computation is $\begin{equation} \overline{Y} = \frac{\sum_{i=1}^n Y_i}{n} \tag{1.1} \end{equation}$ Perhaps this is the first time you have seen the “summation operator” $$\Sigma$$, so let’s go over how to work with this. Basically, the summation operator states that you should sum up every value of a variable that is placed to the right of it. Below the summation operator (or on the lower-right side) is an iterator $$i$$, which gives more information about which values to sum. The iterator goes from the value it is given below or on the bottom-right side (here, the iterator is assigned the value 1), to the value on the top (or top-right) of the summation operator (in this case $$n$$). So here, the operator states that we should sum up all values $$Y_i$$, where $$i$$ goes from 1 to $$n$$, i.e. all values $$Y_1, Y_2, \ldots, Y_n$$: $\sum_{i=1}^n Y_i = Y_1 + Y_2 + Y_3 + \ldots + Y_n$

For the FIFA 2010 World Cup, we have $$n=32$$ teams, so to compute the mean number of goals scored, we would sum from $$Y_1$$ (the number of goals scored by the first team) to $$Y_{32}$$ (the number of goals scored by the final team), and get $\overline{Y} = \frac{0 + 10 + 3 + 9 + 2 + \ldots + 11 + 5}{32} = \frac{145}{32} = 4.531$

The median is defined as the value in the set of values such that 50% of the values are lower than it. It is easy to compute if the number of observations ($$n$$) is odd. We simply order the values in increasing or decreasing magnitude. The median is then simply the middle value in this ordered set. If $$n$$ is an even number, the median cannot be uniquely determined. But it is common to take the value that is halfway between.

The goals scored by each team, ordered in increasing magnitude are:

0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 7, 8, 9, 10, 11, 12, 16

There are $$n=32$$ values, so the first 16 values in the ordered sequence above are the lowest 50% of the values, and the second 16 values are the highest 50% values. If we had one value less, i.e. if $$n=31$$, then the 16th value would be the median. In our case, the median is a value between the 16th and 17th value. If you go along the sequence and stop at the 16th value, you will see that it equals 3. The next value is the 17th value, which also equals 3. So in this case, the median must be 3, as this is the only value between 3 and 3. If the 17th value happened to be 4, then we would have determined the median as the value halfway between 3 and 4, i.e. as 3.5.

The mode is the most frequently occurring value. This measure is really only useful for relatively large data sets with a limited number of possible values. For a continuous variable, there is generally no value that occurs more than once in a data set, and hence all values could be considered a mode. If we look at the total goals scored, we see that the mode is 3, as this is the most frequent value (there are 8 teams who scored 3 points, while there are 4 teams who scored 2 points, and 4 teams who scored 4 points).

There are three common measures of spread: the inter-quartile range, the variance, and the standard deviation.

The inter-quartile range is the distance between the 25th and the 75th percentile. A percentile is a value such that a specified proportion of the values are lower than it. So the 25th percentile is the lowest value in the data such that 25% are even lower. If you go back to the definition of the median, you may notice that the median is actually the 50th percentile. The 25th, 50th, 75th and 100th percentiles are also respectively called the 1st, 2nd, 3rd, and 4th quartile, hence the name inter-quartile range. To compute percentiles, it is handy to order the values in increasing magnitude (just like we did for computing the median). A sample percentile can then be computed by determining the appropriate rank (the place of a value in the ordering) that corresponds to the required percentile: $\text{rank}(\text{percentile}) = \text{ceiling}\left(\frac{\text{percentile}}{100} \times n\right)$ Here, $$\text{percentile}$$ is the required percentile (e.g. 25, or 75). The ceiling function stands for rounding up to the nearest integer, so e.g. $$\text{ceiling}(2.1) = 3$$. To compute the inter-quartile range for total goals scored, we compute the needed ordinal ranks as $\text{rank}(25) = \text{ceiling}\left(\frac{25}{100} \times 32 \right) = 8$ which means we can take the 8th value in the ordered list as the 25th percentile, which equals 2, and $\text{rank}(75) = \text{ceiling}\left(\frac{75}{100} \times 32 \right) = 24$ which means we can take the 24th value in the ordered list, which equals 5. The inter-quartile range is then $\text{IQR} = [2, 5]$

The variance is the average squared deviation of values from the mean. One reason for using squared deviations rather than the average deviation itself is that the average deviation $\frac{\sum_{i=1}^n (Y_i - \overline{Y})}{n}$ is always 0. An alternative is of course to then calculate the average of the absolute deviations. While this is a sensible measure of the spread of values, there are technical reasons to prefer the average squared deviation (i.e. the variance), which we won’t go into now. Let’s just plough on with how to calculate the variance. The variance of a collection of observed values, which is also referred to as the sample variance, is usually denoted as $$S^2_Y$$, and defined as: $\begin{equation} S^2_{Y} = \frac{\sum_{i=1}^n (Y_i - \overline{Y})^2}{n} \tag{1.2} \end{equation}$ So you simply subtract from each value the average value (the sample mean), and then raise each deviation to the power of 2. Then you add those squared deviations up, and divide the sum by the total number of observations. For the total goals scored, the variance is calculated as: \begin{aligned} S^2_Y &= \frac{(0 - 4.531)^2 + (10 - 4.531)^2 + (3 - 4.531)^2 + \ldots + (11 - 4.531)^2 + (5 - 4.531)^2}{32} \\ &= \frac{(-4.531)^2 + (5.469)^2 + (-1.531)^2 + \ldots + (6.469)^2 + (0.469)^2}{32} \\ &= \frac{20.532 + 29.907 + 2.345 + \ldots + 41.845 + 0.22}{32} \\ &= \frac{423.969}{32} \\ &= 13.249 \end{aligned}

A downside of the variance is that its value is difficult to interpret. By squaring each deviation, the value of the variance is on a different scale of magnitude than the original values. The standard deviation aims to provide a more interpretable measure of spread, by transforming the variance back to the scale of the original values. The standard deviation of a collection of observed values, usually denoted as $$S_Y$$, is simply the square-root of the variance, so $S_Y = \sqrt{S^2_Y}$ or written out in more detail as $\begin{equation} S_{Y} = \sqrt{\frac{\sum_{i=1}^n (Y_i - \overline{Y})^2}{n}} \tag{1.3} \end{equation}$ For the total goals scored, the standard deviation is $S_Y = \sqrt{13.249} = 3.64$

### 1.4.2 Visual exploration

Exploring a data set is generally done visually, using various plots such as histograms, boxplots, scatterplots, etc. A good plot can immediately provide a lot more information than summary measures. So before proceeding with statistical analysis: plot your data! Visualizing data in the most effective and pleasant way is an art form in itself. But here are some standard plots that are helpful and relatively easy to produce:  Figure 1.2: A histogram and boxplot of the total goals scored by each team

A histogram (left plot in Figure 1.2) counts the number of occurrences of values, or ranges of values in the case of continuous variables with a lot of possible values. For the latter, it divides the range of possible values into “bins” (equally sized ranges of values). The count of observations in each bin is then plotted as the height of a bar, and the width of the bar reflects the range (in the case of discrete values, the range is defined by the midpoints between these values. A histogram is useful to depict the distribution of values of a variable. The size of the bins is important though, as it determines how smooth or rough the histogram looks, and often requires some tweaking. If you make the bins too wide, it may be hard to spot the intricacies in a distribution. If you make the bins too narrow, on the other hand, each bin will contain only a small number of values and you won’t summarize the distribution in a visually meaningful way. In the extreme case, each bin will either contain just one observation, or none.

A boxplot (right plot in Figure 1.2), also called a box and whiskers plot, depicts the inter-quartile range as a box, in which a line is placed to depict the median. Two lines extend from either side of the box. These are the “whiskers” and they depict the range of values in the data excluding any potential outliers. More precisely, the endpoints of the whiskers are the smallest and largest value in the data which are no more than $$1.5 \times \text{IQR}$$ away from the 25% and 75% quartiles, respectively. If there are values in the data which extend beyond these endpoints, they are considered “outliers”, and plotted as points outside of the whiskers. In the example boxplot, you can see that there are three such outliers.

A scatterplot is useful to see the relation between the values of two variables. For instance, it is likely that the total goals scored by each team will depend on the number of games they played. We can visually inspect this by plotting each team as a point with the value of the x-axis representing the value of one variable, and the value on the y-axis as the value of the other variable. This is done in Figure 1.3, from which it seems quite clear that those teams who played more games generally also scored more points. Figure 1.3: A scatter plot of total goals scored by total matches played

As all the plots above can tell us something meaningful about the data, and they often highlight different aspects of the data, it can be useful to combine them. One form of such a combination plot has been called a raincloud plot . A raincloud plot combines a boxplot with a scatterplot of the observed values and a type of continuous histogram called a nonparametric density plot. In my version of such plots (1.4), a boxplot is placed in the middle, with an additional point representing the sample mean. On the left are the observed values, where the points are “jittered” (randomly displaced on the x-axis) to try to avoid overlapping points. On the right of the boxplot is the nonparametric density plot. This part aims to depict the distribution in a smooth, continuous way. Figure 1.4: A raincloud plot of the average goals scored per match

## 1.5 Analysis and modelling

Analysing data can mean lots of things. According to WikiPedia “Data analysis is a process of inspecting, cleansing, transforming and modelling data with the goal of discovering useful information, informing conclusions and supporting decision-making.” That is a reasonably accurate, but also a quite lengthy and boring definition. An alternative to this I quite like is that data analysis is identifying a signal within noise.

Data is noisy for many reasons. For one thing, the data that we have access to is generally limited, in the sense that we could have obtained other and/or more data. Other sources of noise include measurement error. Such sources of noise imply that the data set we have only gives a limited, incomplete, and corrupted view on the process that generated it (the Data Generating Process). We are not interested solely in describing the data set we have, we generally want to infer something about the Data Generating Process. We don’t just want to state that Paul made 8 out of 8 accurate predictions in the 2010 FIFA World Cup, we want to use this data to assess whether Paul had psychic abilities. Whether, had Paul’s keeper allowed Paul to make predictions for other matches, he would have also been correct. Such inferences go beyond the data that we have. Perhaps Paul was just lucky in those 8 predictions, and his 100% hit rate is not a good reflection of his hit rate in general.

Statistical modelling can be viewed as describing the Data Generating Process in the language of mathematics. At its core, a statistical model defines the distribution of possible values of variables. In an ideal world, this model distribution is identical to the true distribution that follows from the Data Generating Process. In general, though, there will be some mismatch between the model distribution and the DGP distribution. Models are simplifications and idealizations. As statistician George Box famously noted “all models are wrong, but some are useful” . What do we mean by this? Consider the famous London tube map (Figure 1.5). This map is “wrong” because as any map, it is a simplification of reality. It does not provide a completely accurate representation of the world. But what is special about the tube map is that, unlike standard maps, it does not even attempt to reflect the geographical location of the different underground (or “tube”) lines and stations. Its initial designer, Harry Beck, realised that because the lines run mostly underground, the exact physical locations of the stations were largely irrelevant to the traveller wanting to know how to get from one station to another. What matters to them is the relation between the stations, and which lines connect them. Hence, he designed the map as a schematic diagram which clearly shows the lines connecting stations. If you used the map to determine the walking distance between stations, you’d be very disappointed. But to determine your underground journey, the map is very useful. Figure 1.5: London tube map. Source: TfL

Statistical models are a bit like the London tube map: their usefulness will depend on your goals. In order to derive properties of various statistical analyses and tests, we have to assume that a model is true (in the sense that it truly reflects the DGP distribution). While this assumption is inherently uncertain, the derived properties may hold, at least approximately, as long is the model distribution is an “accurate enough” representation of the DGP distribution. While in mathematics, statements are either true or false, in applied statistics, we have to deal with statements that are only approximately true.

## 1.6 Summary

The objective of statistical data analysis is to go beyond a given data set, and make claims about aspects of the process that generated the data (the Data Generating Process). To make inferences beyond the limited data we have collected, we need statistical models which aim to describe the distribution of all possible data that we could have collected. Data can be of different types. The measurement scale of a variable determines the way in which numbers assigned to values of the variable can be interpreted and manipulated. Aspects of the data collection, such as whether data was collected in an experiment with random assignment, or whether the data was observed in a more naturalistic setting, are important aspects of the Data Generating Process, and determine to what extent we can rule out confounds that may bias our conclusions. Paul was one amazing octopus.

### References

Allen, M., Poggiali, D., Whitaker, K., Marshall, T. R., & Kievit, R. A. (2019). Raincloud plots: A multi-platform tool for robust data visualization [version 1; peer review: 2 approved]. Wellcome Open Research, 4. https://doi.org/10.12688/wellcomeopenres.15191.1
Box, G. E. P., & Draper, N. R. (1987). Empirical model-building and response surfaces. New York, NY: John Wiley & Sons.
1. Actually, while a variable which can only have a finite number of possible values is necessarily discrete, it is not true that a variable with an infinite number of possible values is necessarily continuous. Both discrete and continuous variables can have an infinite number of values. For instance, a variable which can have as values any of the natural numbers $$0, 1, 2, \ldots$$ is discrete because it cannot take a value between say 1 and 2. But there are countably infinite natural numbers, so this variable will have an infinite number of possible values. The difference with a continuous variable is however that it is possible to count, or read aloud, the number of possible values in order of magnitude without ever skipping a possible value. This is not possible for continuous variables. For instance, if you would count $$0.0001, 0.0002, ...$$, you would miss $$0.00011$$, and $$0.00012$$, and many other possible values.↩︎