# Chapter 1 Introduction

## 1.1 What is R?^{1}

R is a statistical programming language that has rapidly gained popularity in many scientific fields. It was developed by Ross Ihaka and Robert Gentleman as an open source implementation of the “S” programming language. (Next time you need a fun fact, you can say “Did you know that S came before R?”) R is also the name of the software that uses this language for statistical computing. With a huge online support community and dedicated packages that provide extra functionality for virtually any application and field of study, there’s hardly anything you *can’t* do in R.

If you already know your way around statistical software like `JASP` or `SPSS`, the main difference is that `R` has no **graphical user interface**, which means there are no buttons to click and no dropdown menus. `R` can be run entirely by typing commands into a text interface (welcome to the Matrix!). This may seem a little daunting, but it also means a whole lot more flexibility, as you are not relying on a pre-determined toolkit for your analyses.

If you need any more convincing, why are we using R and not one of the many other statistical packages like JASP, SPSS, MATLAB, Minitab, or even Microsoft Excel? Well, R is great because:

- R is free and open source, and always will be! Anybody can use the code and see exactly how it works.
- Because R is a programming language rather than a graphical interface, the user can easily save scripts as small text files for use in the future, or share them with collaborators.
- R has a very active and helpful online community - normally a quick search is all it takes to find that somebody has already solved the problem you’re having.

## 1.2 Getting started

If you want to use R and RStudio, you should *first install R*, and after that install RStudio. R and RStudio are separate programs, and RStudio requires R to be installed.

### 1.2.1 Download R

You can download R from CRAN (The Comprehensive R Archive Network). Select the link appropriate for your operating system and follow the instructions. You will want to download the installer for the latest release (currently version 4.0.2) of the base R software.

As you can see, the CRAN website has a rather distinctive “old-school” look. Don’t let that fool you though. R itself is anything but old school.

### 1.2.2 Download RStudio

R does not come with a graphical interface by default. Most people nowadays interact with R through second-party graphical platforms that provide extra functionality. Probably the most popular graphical front-end to R is **RStudio**. This is actually a full “integrated development environment” (IDE), but mostly, we will use it as a place where we can keep scripts, plots, and R output together in one place.

Like R, RStudio is open source software and free to download for anyone that wants to. You can download RStudio from the RStudio website (select the free open source desktop version).

## 1.3 Working with RStudio

When you open RStudio, you will see something like Figure **??**. You will probably not see exactly the same layout, but once you click on `File` in the top menu, and then `New File > R Script`, you should be pretty close. You can get direct access to the R environment itself in the **console** panel. If you type in commands here, they will be interpreted by R, and any output they produce is shown there as well. Working directly within the R console is handy if you want to do simple calculations, or try out different commands and functions. When you are conducting analyses, you will want to keep the commands that produce useful results somewhere, so you don’t have to type in everything again if you want to rerun the analyses, or if you need to change something. That is where R *scripts* come in handy. Basically, these are text files where you store a collection of commands that can be interpreted by R. Within RStudio, you can select lines of the script, and by clicking on `Run`, those lines will be sent to the R console and executed. R scripts should only contain working R commands. You can comment on your code by preceding a line (or the end of a line) with a hash symbol (“#”). Anything after the hash symbol is not evaluated by the R interpreter.
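As a minimal sketch of what such a script might contain, consider the following. The file name `my_analysis.R` and the object names are made up for illustration only.

```r
# my_analysis.R (hypothetical file name)
# Lines starting with a hash symbol are comments and are not evaluated.

heights <- c(1.62, 1.75, 1.80)  # a comment can also follow a command
mean_height <- mean(heights)    # compute the mean of the three values
print(mean_height)              # show the result in the console
```

Selecting these lines in RStudio and clicking `Run` should have the same effect as typing them into the console one by one.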

Another useful way to store R commands is in a different file format, called `R Markdown`. R Markdown allows you to combine text, plots, and R commands all in a single file. This file can then be “parsed” to produce a variety of document formats, such as HTML, PDF, and even Microsoft Word. If you click on `File > New File > R Markdown` in RStudio, you can see an example of such a file.

As the name suggests, R Markdown is a combination of R and Markdown. Markdown is a lightweight markup language for creating formatted text documents with a plain-text editor. A markup language is, roughly put, a system which defines elements in a document by their role, for instance defining certain elements as titles or headers, and others as quoted text or text that should be emphasised. Common examples of markup languages are HTML and XML. If you know a little HTML, you might know that in modern implementations, it separates content and markup (HTML) from style (CSS). This separation allows you to easily create a variety of visually very different documents from the same HTML source file. When you use a word processor such as Microsoft Word, it creates a single document which specifies both content and style, in a way which is specific to the word processor used. Markdown aims to provide a software-agnostic markup language, separating content from style, which can be used to produce a variety of output formats from the same source file.

R Markdown adds to this an integration with R. Effectively, R Markdown first evaluates all the R code in an R Markdown file to create a “plain” Markdown file, which can then be parsed into a variety of output formats. The great thing about this is that you can create automatically reproducible documents, and you don’t have to copy-paste results of analyses between R and your word processor, avoiding the common mistakes that this brings. And R Markdown is really flexible. For example, using a package like `bookdown`, you can even write whole books in R Markdown (like this one)! We will discuss R Markdown in more detail at a later point. If you want to get started already, a very useful resource is *R Markdown: The Definitive Guide*.

## 1.4 Installing packages

Part of the popularity of R stems from the thousands of packages that extend the basic capabilities of R.

### 1.4.1 Installing the `sdamr` package

The “Statistics: Data analysis and modelling” book has an associated R package which contains the data sets used as examples in the book, as well as some additional functions. It should be available on CRAN soon. Once it is available there, you can install it simply by typing

`install.packages("sdamr")`

If it is not available there, you will see a warning (`package ‘sdamr’ is not available for this version of R`). If that happens, you can install the development version from GitHub (see below).

The source code of the `sdamr` package is hosted on GitHub, and the package can be installed from there as well with the help of the `remotes` package. So you will first need to install that package, and then you can use the `install_github` function to install the `sdamr` package:

```
install.packages("remotes")
remotes::install_github("mspeekenbrink/sdam-r")
```

Note that by typing `remotes::` before the function call, we are telling R that the function is available in the `remotes` package. This avoids you having to load the package (i.e. by `library(remotes)`) first.
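The same `package::function` notation works for any installed package. The sketch below uses the `stats` package, which ships with every R installation, purely as an illustration; the object names are arbitrary.

```r
# Call a function through its package namespace, without library().
x <- c(3, 1, 2)
m1 <- stats::median(x)

# Equivalent: attach the package first, then call the function by name.
library(stats)
m2 <- median(x)

identical(m1, m2)  # both calls use the same function
```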

To check whether the package is installed, type

`library(sdamr)`

If you can’t get this to work on your system, wait until the `sdamr` package becomes available on CRAN.
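If you would rather check for the package without triggering an error when it is missing, one possible approach is the base R function `requireNamespace`. This is only a sketch; the object name `sdamr_installed` is made up for the example.

```r
# requireNamespace() returns TRUE if the package can be loaded and FALSE
# otherwise, instead of stopping with an error as library() would.
sdamr_installed <- requireNamespace("sdamr", quietly = TRUE)

if (sdamr_installed) {
  message("sdamr is installed, version ",
          as.character(utils::packageVersion("sdamr")))
} else {
  message("sdamr is not installed yet")
}
```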

## 1.5 Getting help

R may be tricky to master, especially at the start, but help is never far away:

- From within R
  - If you want more information on a specific function, use a question mark before it (e.g., `?plot`) or the `help` function (e.g., `help(plot)`).
  - If you don’t know the function name, use two question marks (e.g., `??plot`) or the `help.search` function (e.g., `help.search("plot")`).
  - If you know a function is in a package, search the help for the package (e.g., `?ggplot2`).
  - The `RSiteSearch("keyword")` function will search for “keyword” in all functions available in R, associated packages, and the R-Help News groups (if desired).
- Online
  - Stack Overflow is a platform in which you can ask questions about R and its many packages. Many questions will already have been asked, so its archive of questions and answers is particularly useful.
  - The meta-search engine at www.rseek.org may also be handy.
  - R has an active help mailing list as well, but when asking questions there, make sure you read the posting guide, as some people on there sometimes get a little grumpy.

## 1.6 First steps: R as a calculator

R can be used as a console-based calculator. Here are some examples.

You can add numbers using the `+` operator, e.g. \(2 + 11\):

`2 + 11 # addition`

`## [1] 13`

You can multiply numbers using the `*` operator, e.g. \(2 \times 11\):

`2 * 11 # multiplication`

`## [1] 22`

You can divide numbers using the `/` operator, e.g. \(\frac{2}{11}\):

`2 / 11 # division`

`## [1] 0.1818182`

You can raise numbers to a power by using the `^` operator, e.g. \(2^{11}\):

`2^(11) # exponentiation`

`## [1] 2048`

You can take a square root by using the `sqrt` function, e.g. \(\sqrt{2}\):

`sqrt(2) # square root`

`## [1] 1.414214`

which is mathematically equivalent to raising a number to the power of \(\tfrac{1}{2}\), e.g. \(2^{\tfrac{1}{2}}\):

`2^(1/2) # another way to compute the square root`

`## [1] 1.414214`

When performing arithmetic, it is important to realise that `R` obeys the commonly accepted rules of precedence, which are:

- Brackets
- Powers
- Multiplication and division
- Addition and subtraction

This means that anything within brackets is evaluated first (following the order of the remaining operations), then any elements involving powers are calculated, after this those involving multiplication or division, and finally those involving addition or subtraction. So

`2 + 11*3`

`## [1] 35`

is evaluated as \((11 \times 3) + 2\), whilst

`(2 + 11)*3`

`## [1] 39`

is evaluated as \((2+11) \times 3\). Another way to put this is that `R` doesn’t simply evaluate an expression left-to-right. It takes the whole expression, first computes the results within brackets, then computes any powers, then multiplies or divides the results of this, and finally adds or subtracts the results of that. Many common errors result from not using brackets properly, or forgetting about the precedence of arithmetic operations.
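A few more expressions may help to illustrate these rules. The numbers and object names are arbitrary; the last two lines show a common pitfall, namely that a power is computed before a leading minus sign is applied.

```r
a <- 2 + 11 * 3^2     # power first (9), then multiplication (99), then addition
b <- (2 + 11) * 3^2   # brackets first (13), the power (9), then multiplication
c_neg <- -2^2         # the power is computed before the minus sign is applied
d_neg <- (-2)^2       # brackets force the minus sign to be applied first

c(a, b, c_neg, d_neg)
## [1] 101 117  -4   4
```

When in doubt, adding brackets costs nothing and makes the intended order explicit.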

Unrelated to this, but also important: the hash symbol (“#”) is used for comments, such that anything following a “#” is not evaluated.

## 1.7 Data

You can load data sets that come with R packages by using the `data` function, with as argument the name of the dataset you want to load (as a string, so make sure you use quotation marks). For instance, you can load the dataset `fifa2010teams` from the `sdamr` package as follows:

```
library(sdamr)
data("fifa2010teams")
```

A loaded dataset will show up in the Environment panel in RStudio. If you click on the name of the dataset, you can then see the data as a table in the Source panel. You can also view the data in the R console by simply typing the name of the dataset. This will often produce a lot of output. If you just want to view a part of the dataset, you can use the `head` function, which will show the first 6 rows:

`head(fifa2010teams)`

```
##   nr        team matches_played goals_for goals_scored goals_against
## 1  1     Germany              7        16           16             5
## 2  2 Netherlands              7        12           11             6
## 3  3     Uruguay              7        11           11             8
## 4  4   Argentina              5        10            9             6
## 5  5      Brazil              5         9            9             4
## 6  6       Spain              7         8            8             2
##   penalty_goal own_goals_for yellow_cards indirect_red_cards direct_red_cards
## 1            0             0           13                  0                0
## 2            0             0           24                  0                0
## 3            1             0           11                  0                1
## 4            0             0            7                  0                0
## 5            0             0            9                  0                1
## 6            0             0            8                  0                0
```
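The number of rows shown by `head` can be changed with its `n` argument, and `tail` does the same for the last rows. The sketch below uses the `mtcars` data set that ships with R, rather than `fifa2010teams`, so that it runs even when `sdamr` is not installed; the object names are arbitrary.

```r
data("mtcars")                        # a data set that ships with base R
first_three <- head(mtcars, n = 3)    # first 3 rows instead of the default 6
last_two <- tail(mtcars, n = 2)       # last 2 rows
first_three
dim(mtcars)                           # number of rows and number of columns
```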

You can also get a quick summary of the characteristics of the variables in the data through the `summary` function:

`summary(fifa2010teams)`

```
##        nr              team         matches_played     goals_for
##  Min.   : 1.00   Length:32          Min.   :3.00     Min.   : 0.000
##  1st Qu.: 8.75   Class :character   1st Qu.:3.00     1st Qu.: 2.000
##  Median :16.50   Mode  :character   Median :3.50     Median : 3.000
##  Mean   :16.50                      Mean   :4.00     Mean   : 4.531
##  3rd Qu.:24.25                      3rd Qu.:4.25     3rd Qu.: 5.250
##  Max.   :32.00                      Max.   :7.00     Max.   :16.000
##   goals_scored     goals_against    penalty_goal    own_goals_for
##  Min.   : 0.000   Min.   : 1.000   Min.   :0.0000   Min.   :0
##  1st Qu.: 2.000   1st Qu.: 3.000   1st Qu.:0.0000   1st Qu.:0
##  Median : 3.000   Median : 5.000   Median :0.0000   Median :0
##  Mean   : 4.469   Mean   : 4.531   Mean   :0.2812   Mean   :0
##  3rd Qu.: 5.250   3rd Qu.: 5.250   3rd Qu.:0.2500   3rd Qu.:0
##  Max.   :16.000   Max.   :12.000   Max.   :2.0000   Max.   :0
##   yellow_cards    indirect_red_cards   direct_red_cards
##  Min.   : 2.000   Min.   :0            Min.   :0.0000
##  1st Qu.: 6.000   1st Qu.:0            1st Qu.:0.0000
##  Median : 7.500   Median :0            Median :0.0000
##  Mean   : 8.156   Mean   :0            Mean   :0.2812
##  3rd Qu.: 9.000   3rd Qu.:0            3rd Qu.:0.2500
##  Max.   :24.000   Max.   :0            Max.   :2.0000
```

### 1.7.1 Data types

Data in R is generally stored in **vectors**, which are fixed-length collections of values of a particular data type. Common data types are

- **logical**: values which can either be `TRUE` or `FALSE`
- **numeric**: numbers of all kinds, such as `1`, `356`, and `34.5782`
- **character**: characters and strings, such as `"q"` and `"Hello"`

You can combine values of a data type in a vector by using the `c()` function (which stands for “combine”). For instance

`c(TRUE, FALSE, TRUE, TRUE)`

`## [1] TRUE FALSE TRUE TRUE`

`c(3,4,802.376)`

`## [1] 3.000 4.000 802.376`

`c("Coffee","now","please")`

`## [1] "Coffee" "now" "please"`

If you combine elements of different data types, then R will convert them to the most “general” type necessary. Combining a logical value with a numeric one, for instance, will convert the logical value `TRUE` to 1, and `FALSE` to 0. Combining a character element with other elements will convert everything to character elements:

`c(TRUE, FALSE, 12)`

`## [1] 1 0 12`

`c(TRUE, 5.67788, "let's see what happens")`

`## [1] "TRUE" "5.67788" "let's see what happens"`
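To see which type R has chosen, you can ask for the `class` of a vector. The following sketch, with made-up object names, also shows one practical consequence of the conversion of `TRUE` to 1: summing a logical vector counts its `TRUE` values.

```r
class(c(TRUE, FALSE))          # "logical"
class(c(TRUE, FALSE, 12))      # "numeric"
class(c(TRUE, 5.67788, "a"))   # "character"

# Because TRUE counts as 1 and FALSE as 0, sum() counts the TRUE values
n_true <- sum(c(TRUE, FALSE, TRUE, TRUE))
n_true

# Conversion can also be requested explicitly
converted <- as.numeric("3.5") + 1
converted
```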

### 1.7.2 Objects

Objects are named things that are stored in memory and available to functions etc. Objects can be vectors, such as discussed above, but also more general types, such as matrices, factors, and data frames. If you want to create an object, you use the assignment operator `<-`, with on the left side the name you want to give to the object, and on the right side the content of the object. For instance, we can store a numeric vector as the object `my_vector` as follows:

`my_vector <- c(1,2,10:20)`

Note a little trick above, where `10:20` stands for a sequence of integers, i.e. \(10, 11, 12, \ldots, 20\). `my_vector` is now an object in R memory (you should see it show up in the Environment panel), and can be called by name, as in:

`my_vector`

`## [1] 1 2 10 11 12 13 14 15 16 17 18 19 20`
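Once a vector is stored as an object, you can ask for its length, pick out elements with single square brackets, and use it in calculations. The following is a small sketch along those lines; the name `doubled` is arbitrary.

```r
my_vector <- c(1, 2, 10:20)
length(my_vector)            # number of elements: 13
my_vector[3]                 # third element: 10
my_vector[c(1, 13)]          # first and last element: 1 and 20
my_vector[my_vector > 15]    # elements larger than 15: 16 17 18 19 20
doubled <- my_vector * 2     # arithmetic is applied to every element
doubled
```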

A **matrix** is a collection of vectors of the same length, joined as columns or rows.

```
mat <- matrix(1:10,ncol=2)
mat # matrices are filled column-wise
```

```
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
```

`mat[,2] # select the second column (the result is a vector)`

`## [1] 6 7 8 9 10`

`mat[3,1] # select the value in the third row and first column`

`## [1] 3`
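If you would rather fill a matrix row by row, the `matrix` function has a `byrow` argument. The sketch below contrasts the two ways of filling; the object names are chosen for the example only.

```r
mat_by_col <- matrix(1:10, ncol = 2)                # default: fill column by column
mat_by_row <- matrix(1:10, ncol = 2, byrow = TRUE)  # fill row by row instead
mat_by_row
dim(mat_by_row)      # 5 rows and 2 columns
mat_by_row[3, ]      # third row: 5 6
t(mat_by_col)        # transpose: rows become columns
```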

A **factor** is useful for nominal and ordinal data. A factor is a vector with integers, where each integer is provided with a unique label. For instance

```
# construct a factor by giving integer values and specifying the accompanying
# labels
fact <- factor(c(1,2,2,3),labels=c("red","green","blue"))
fact # display it
```

```
## [1] red green green blue
## Levels: red green blue
```

`fact == "green" # determine which elements equal (==) 'green'`

`## [1] FALSE TRUE TRUE FALSE`
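The labels and the underlying integers of a factor can be inspected separately. A short sketch, using the same factor as above (the name `counts` is arbitrary):

```r
fact <- factor(c(1, 2, 2, 3), labels = c("red", "green", "blue"))
levels(fact)           # the labels: "red" "green" "blue"
as.integer(fact)       # the underlying integer codes: 1 2 2 3
counts <- table(fact)  # how often each label occurs
counts
```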

A **list** is a collection of different R objects. This is a very general type of object, and the elements of a list can even be lists themselves. A list allows you to keep different types of information together, but you probably won’t need to use it much for the content discussed here. But let’s quickly look at some aspects of a list:

```
lst <- list(a=mat, b=fact) # construct a named list with a matrix and factor
lst
```

```
## $a
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
##
## $b
## [1] red green green blue
## Levels: red green blue
```

If a list is *named*, meaning the elements have names, like above, you can select elements from the list by using a dollar sign and then the name of the elements:

`lst$a`

```
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
```

`lst$b`

```
## [1] red green green blue
## Levels: red green blue
```

You can also select elements by an index number, which should go in between double square brackets. For instance, if you want to select the first element, you can type

`lst[[1]]`
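A detail that often trips people up is the difference between double and single square brackets on a list. The sketch below builds a small list of its own (the contents are made up) so that it can be run separately.

```r
lst <- list(a = matrix(1:10, ncol = 2), b = factor(c("red", "green")))
names(lst)          # "a" "b"
first <- lst[[1]]   # double brackets return the element itself (a matrix)
sub <- lst[1]       # single brackets return a list containing that element
is.matrix(first)    # TRUE
is.list(sub)        # TRUE
identical(lst[["a"]], lst$a)  # selecting by name in two equivalent ways
```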

A **data.frame** is probably one of the most useful features of R for data analysis. A `data.frame` is like a matrix, in that it is a rectangular collection of data, but the columns are variables which can be of a different type (e.g., numeric, factors, or characters). You can construct data frames through the `data.frame` function, for instance as

```
my_data_frame <- data.frame(var1 = 1:10, var2 = 10:1, var3 = rep(c("a","b"),times=5))
my_data_frame
```

```
##    var1 var2 var3
## 1     1   10    a
## 2     2    9    b
## 3     3    8    a
## 4     4    7    b
## 5     5    6    a
## 6     6    5    b
## 7     7    4    a
## 8     8    3    b
## 9     9    2    a
## 10   10    1    b
```

Constructing a data frame looks similar to constructing a list, but all the (named) arguments should have the same length. If, for instance, you try

`data.frame(var1 = 1:10, var2 = 1:11)`

you will get an error. You won’t always get an error, however. If the length of the longest element is a multiple of the length of each shorter element, R will fill in the missing values by repeating the shorter elements until they are of the same length as the longest element. For instance

`data.frame(var1 = 1:10, var2 = 1:5, var3 = 1)`

```
##    var1 var2 var3
## 1     1    1    1
## 2     2    2    1
## 3     3    3    1
## 4     4    4    1
## 5     5    5    1
## 6     6    1    1
## 7     7    2    1
## 8     8    3    1
## 9     9    4    1
## 10   10    5    1
```

This can be handy, but also risky, as sometimes you might not realise that R is filling in values for you, and your analyses might give rather unexpected results. I would therefore recommend always creating a data frame from elements of the same length.
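One way to guard against silent filling-in is to compare the lengths of the elements yourself before building the data frame. The following is only a sketch of that idea, with made-up object names; it also shows how the error for incompatible lengths can be caught so that a script keeps running.

```r
var1 <- 1:10
var2 <- 1:5

# Hypothetical safeguard: check the lengths before calling data.frame()
same_length <- length(var1) == length(var2)
same_length    # FALSE, so data.frame() would repeat var2 to fill the gap

# Lengths that are not multiples of each other produce an error, which we
# catch here and store as a message
result <- tryCatch(
  data.frame(var1 = 1:10, var2 = 1:11),
  error = function(e) conditionMessage(e)
)
result
```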

Most of the time, you won’t create data frames yourself within R, but you will load in external data as a data frame.

### 1.7.3 Importing data

R can load data in many formats. Personally, I mainly use data stored in “comma separated value” (CSV) format. This is one of the most portable ways of storing data, so that it can be used in a variety of programs like R, SPSS, JASP, Excel, etc. Data in CSV format can be read through the `read.csv` function.

A nice thing about R is that it can read data directly from the World Wide Web. So you don’t need to first download data, and then open it from within R.
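As a rough sketch of how `read.csv` is used, the code below first writes a small, made-up data frame to a temporary CSV file and then reads it back in, so that it runs without an internet connection. To read from the web instead, the file path would be replaced by a string containing the URL of a CSV file.

```r
# Write a small data frame to a temporary CSV file (hypothetical example data)
csv_file <- tempfile(fileext = ".csv")
example_data <- data.frame(id = 1:3, score = c(4.5, 3.2, 5.0))
write.csv(example_data, csv_file, row.names = FALSE)

# Read the file back in as a data frame
imported <- read.csv(csv_file)
imported
str(imported)   # shows the type of each column
```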

At some point, you will probably also come across data stored in Excel or SPSS format. Regarding Excel, it is safest to first save a spreadsheet as a CSV file, and then load this file into R. Alternatively, the `xlsx` package provides the function `read.xlsx` to directly read a spreadsheet into R. To load data in SPSS format, the `foreign` package provides the `read.spss` function.

## 1.8 Exploring data: Descriptive statistics

Measures of location and spread can be computed through specialized functions, namely `mean`, `median`, `IQR` (inter-quartile range), `var` (variance), and `sd` (standard deviation). E.g.

`mean(fifa2010teams$goals_for)`

`## [1] 4.53125`

`median(fifa2010teams$goals_for)`

`## [1] 3`

will give you the mean and median of variable `goals_for` in data.frame `fifa2010teams`.

You can obtain the inter-quartile range as

`IQR(fifa2010teams$goals_for,type=1)`

`## [1] 3`

Note the use of the `type` argument here. There are many ways in which to compute and estimate percentiles and quantiles. Using `type=1` gives you the same result as the way of computing the IQR that I explained in the book. By default, R will use `type = 7`, which can give different results (type `?quantile` for more information).
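To see the difference between the two types on a small scale, consider the numbers 1 to 8 (an arbitrary example, not data from the book). With `type = 1`, the quartiles are always values that occur in the data, whilst the default `type = 7` interpolates between neighbouring values.

```r
x <- 1:8
quantile(x, probs = c(0.25, 0.75), type = 1)  # 2 and 6
quantile(x, probs = c(0.25, 0.75), type = 7)  # 2.75 and 6.25 (the default)

iqr_type1 <- IQR(x, type = 1)  # 6 - 2 = 4
iqr_type7 <- IQR(x)            # 6.25 - 2.75 = 3.5
c(iqr_type1, iqr_type7)
```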

The `var` and `sd` functions from base R do not actually provide the *sample* variance and *sample* standard deviation. Rather, they divide the sum of squared deviations by \(n - 1\) instead of \(n\), which gives an unbiased estimate of the “true” (population) variance, and the corresponding estimate of the standard deviation. To compute the variance and standard deviation of the sample data, you can use the `sample_var` and `sample_sd` functions in the `sdamr` package:

`sample_var(fifa2010teams$goals_for)`

`## [1] 13.24902`

`sample_sd(fifa2010teams$goals_for)`

`## [1] 3.639921`
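The difference between the two versions is easy to verify by hand in base R. The sketch below assumes, as the description above suggests, that the sample variance divides the sum of squared deviations by \(n\); the data and object names are made up.

```r
x <- c(2, 4, 4, 6)
n <- length(x)

var_n <- mean((x - mean(x))^2)                   # divide by n
var_n_minus_1 <- sum((x - mean(x))^2) / (n - 1)  # divide by n - 1

var_n                                      # 2
var_n_minus_1                              # 2.666667
all.equal(var_n_minus_1, var(x))           # base R's var() divides by n - 1
all.equal(var_n, var(x) * (n - 1) / n)     # rescaling recovers the n version
```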

There is no function in base R to compute the mode^{2}, but the `sdamr` package provides the function `sample_mode` to do just that:

`sample_mode(fifa2010teams$goals_for)`

`## [1] 3`
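If `sdamr` is not available, one possible way to find the most frequent value with base R alone is to tabulate the data. This is a sketch with made-up data, and not necessarily how `sample_mode` is implemented; note that when several values are tied for most frequent, `which.max` only returns the first of them.

```r
x <- c(3, 1, 3, 2, 5, 3, 2)
counts <- table(x)                                          # frequency of each value
mode_value <- as.numeric(names(counts)[which.max(counts)])  # most frequent value
mode_value   # 3

mode(x)      # base R's mode(): the storage mode of the object, "numeric"
```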

## 1.9 Exploring data: Creating plots

There are two common ways to plot data with R. Base R has various plotting functions, such as `plot`, `hist`, `boxplot`, which are useful for quick plots to explore your data. The resulting plots are not always the most aesthetically pleasing. The R package `ggplot2` provides means to create a wide range of useful *and* beautiful plots. It is based on the idea of a “grammar of graphics,” which makes it extremely flexible, but also a little difficult to get your head around. In the following, I will show you how to use both base R and ggplot2.

### 1.9.1 Histogram

R has many built-in plotting functions. These tend to be a little basic, and much prettier plots can be made with packages such as `ggplot2` (Wickham et al. 2021, my current favourite!). But for quick data exploration, the built-in plotting functions are faster. A histogram is plotted through the `hist` function. In the following example, I plot a histogram of the `goals_for` variable in the `fifa2010teams` data:

`hist(fifa2010teams$goals_for)`

There are many parameters you can change. In the following, I give the plot a new title and x-axis label, and request about 20 bins (R treats a single number for `breaks` as a suggestion, and may pick a nearby number that gives “pretty” bin boundaries):

`hist(fifa2010teams$goals_for,main="Histogram of points scored by teams in the FIFA 2010 World Cup", xlab="Goals for", breaks=20)`
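To check which bins R actually used, you can ask `hist` to return its computations without drawing anything. The sketch below uses simulated counts as a stand-in for the goals data, so the values themselves are not meaningful.

```r
set.seed(1)
x <- rpois(32, lambda = 4)   # simulated counts, standing in for goals scored

# plot = FALSE returns the computed histogram without drawing it
h <- hist(x, breaks = 20, plot = FALSE)
h$breaks            # the bin boundaries R actually used
length(h$counts)    # the number of bins, which need not equal 20
```

If you need full control, `breaks` also accepts a vector with the exact bin boundaries you want.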

To create a nicer looking plot, you can use `ggplot2`.

```
library(ggplot2)
ggplot(fifa2010teams,aes(x=goals_for)) + geom_histogram()
```

``## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.``

Well, that’s not actually so pretty. We can make it better by changing some of the defaults:

```
library(ggplot2)
ggplot(fifa2010teams,aes(x=goals_for)) + geom_histogram(bins=10, colour="black", fill='#8C8279') + xlab("Goals scored")
```

Note that within the `geom_histogram` function, I’ve specified to use 10 bins, to draw a black line around the bars, and to fill the bars with the colour specified by the hexadecimal colour code `#8C8279`. Finally, I’m using the `xlab` function to generate a better label for the x-axis. `ggplot2` is very powerful and flexible, so there are many such adjustments you can make. A useful resource with practical guidance on creating a variety of plots with `ggplot2` is the *R Graphics Cookbook*. A more thorough treatment of `ggplot2` and the underlying ideas of a Grammar of Graphics is *ggplot2: Elegant Graphics for Data Analysis*.

### 1.9.2 Boxplot

For a quick boxplot, you can use the base R function with the same name:

`boxplot(fifa2010teams$goals_for)`

`ggplot2` also provides a boxplot through the `geom_boxplot` function. Note that in the `aes` specification, I’m now using `goals_for` as the y-axis.

`ggplot(fifa2010teams,aes(y=goals_for)) + geom_boxplot()`

Not very pretty! A somewhat better version can be obtained by:

`ggplot(fifa2010teams,aes(x="",y=goals_for)) + geom_boxplot(width=.2) + xlab("")`

## 1.10 Scatterplot

A quick scatterplot can be created with the `plot` function, in which you specify the variable to show on the x-axis and the variable on the y-axis:

`plot(x=fifa2010teams$matches_played, y=fifa2010teams$goals_for)`

To get a similar plot with `ggplot2`, you can use the `geom_point` function:

`ggplot(fifa2010teams, aes(x=matches_played, y=goals_for)) + geom_point()`

In `ggplot2`, specification of the variables on the x-axis and y-axis is done within the aesthetics specification (`aes`) within the initial call to `ggplot`. The later functions, such as `geom_point`, inherit the specifications provided there.
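The following sketch illustrates this inheritance with the built-in `mtcars` data (chosen so that the example does not depend on `sdamr`). It only runs the plotting code when `ggplot2` is installed; the object names `p` and `p2` are arbitrary.

```r
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # The aes() in the ggplot() call is shared by both layers
  p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x)

  # A layer can add to the inherited aesthetics with its own aes()
  p2 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point(aes(colour = factor(cyl)))
}
```

Typing `p` or `p2` in the console then draws the corresponding plot.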

## 1.11 Raincloud plot

A basic (but reasonably flexible) function to create a raincloud plot is provided in the `sdamr` package through the `plot_raincloud` function. The `data` argument expects a data frame, and the `y` argument expects the name of the variable for which you want to create the plot. Note that as the jitter applied to the plot is random, to get exactly the same plot again, you need to set the random number seed through `set.seed` beforehand. This is only necessary if you want to recreate a plot exactly. We’ll talk more about random number generation later.

```
set.seed(467)
plot_raincloud(data=fifa2010teams, y=goals_for)
```

`## Warning: Removed 1 rows containing missing values (geom_segment).`
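The effect of `set.seed` is easiest to see with plain random numbers. In the sketch below (object names are arbitrary), resetting the seed to the same value reproduces the same draws, whilst drawing again without resetting gives different numbers.

```r
set.seed(467)
first_draw <- runif(3)    # three random numbers between 0 and 1

set.seed(467)             # reset the seed to the same value
second_draw <- runif(3)   # the same three numbers again

third_draw <- runif(3)    # without resetting, the numbers differ

identical(first_draw, second_draw)   # TRUE
identical(first_draw, third_draw)    # FALSE
```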

### References

Wickham et al. 2021. *Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics*. https://CRAN.R-project.org/package=ggplot2.

^{1} This section contains material adapted from https://ourcodingclub.github.io/tutorials/intro-to-r/

^{2} There is a function `mode`, but this does something rather different!