Chapter 17 Being a responsible data analyst

In this chapter, we will consider some general principles which will hopefully allow you to be a good and responsible “data science citizen”. In the preceding chapters, we have covered just a small set of the possible statistical models which you can apply to data, and which allow you to test hypotheses about important characteristics of the Data Generating Process. My aim has been – and is – to provide you with the knowledge and tools to be flexible in your approach to data analysis problems, and to choose the model of the Data Generating Process that is suitable for your goals and the characteristics of the DGP. I have not given you a set of rules to follow blindly, because this would be pointless and restrictive. In the end, it is up to you to decide how to approach a problem. That freedom is perhaps both a gift and a curse. Just like coming up with a useful and meaningful explanatory theory, or designing an informative experiment, coming up with a useful way to analyse the resulting data is often not an easy task. It requires good knowledge of the possibilities and limitations of various statistical analyses. Such knowledge generally comes with practice. Keep an open mind and keep learning. You will undoubtedly make mistakes along the way. We all do!

In the following, I aim to provide some general suggestions and tips which may help to avoid some of these mistakes. And if not avoided, they may help you document and learn from your mistakes. This chapter is really about the ethics of data analysis, and my suggestions can be roughly summarized as: “be honest to yourself and your audience”.

17.1 Consider analysis before data collection

Before you embark on a study, you should give consideration to how you will analyse the data. How will you test the main hypotheses of interest? Will the data you collect be sufficiently informative for your goals? When designing an experiment or observational study, there are many choices to make: How many participants will you test? Will experimental manipulations be within or between participants? What will you manipulate? Will you use a double-blind procedure? What are the outcome measures? These design choices ultimately determine what you can conclude from a study. But this also depends at least partly on the analysis you will use.

Ideally, experimental design and statistical analysis go hand in hand. If you can make reasonably precise predictions regarding participants’ behaviour from your theory, you can simulate possible datasets beforehand, and assess to what extent your proposed analyses will reflect your theoretical predictions. When possible, you can and should use this to conduct an a priori power analysis for your hypothesis tests. Unfortunately, when you’re not conducting an exact replication of an earlier study, this is generally difficult, to say the least. Without prior research or precise process models linking sensory input to observable output (i.e. behavioural responses), making precise numerical predictions about the effects of manipulations on behavioural/cognitive/perceptual tasks or self-report measures is pretty much impossible. Personally, I don’t think predicting the value of standardized effect-size measures (as is generally done in a priori power analysis) is any easier. That said, I do think it is useful to determine which effect size will provide you with e.g. a power of $P(\text{reject } H_0| H_0 \text{ false}) = .8$ with a planned number of participants. If that effect size seems unreasonably large, you may want to reconsider running the study as planned, and recruit more participants or simplify the design. If at all possible, I would advise running a small pilot study, so you can get a sense of the data that you might obtain in a full study. You can then use the pilot data as a basis to simulate larger datasets to assess the possible results of the full study.
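The sensitivity check described above can be sketched in a few lines of Python. This is a minimal sketch rather than a definitive power analysis: it assumes a two-group between-participants comparison with equal group sizes, a two-sided test at $\alpha = .05$, and a Normal approximation to the $t$-test. The function names and the planned 50 participants per group are hypothetical.

```python
import random
import statistics
from statistics import NormalDist


def minimal_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Standardized mean difference (Cohen's d) detectable with the given power.

    Assumption: Normal approximation to a two-sided, two-group t-test
    with equal group sizes.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * (2 / n_per_group) ** 0.5


def simulated_power(d, n_per_group, alpha=0.05, n_sims=2000, seed=1):
    """Estimate power by simulating datasets with a true effect of size d."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treatment = [rng.gauss(d, 1.0) for _ in range(n_per_group)]
        # Standard error of the difference between the two group means
        se = (statistics.variance(control) / n_per_group
              + statistics.variance(treatment) / n_per_group) ** 0.5
        statistic = (statistics.mean(treatment) - statistics.mean(control)) / se
        if abs(statistic) > critical:
            rejections += 1
    return rejections / n_sims


d_needed = minimal_detectable_d(n_per_group=50)
print(f"Effect size needed for 80% power: d = {d_needed:.2f}")
print(f"Simulated power at that effect size: {simulated_power(d_needed, 50):.2f}")
```

If the required effect size looks implausibly large for your manipulation, that is a signal to recruit more participants or simplify the design. With pilot data, the simulated groups could instead be drawn by resampling the pilot observations.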

In the absence of pilot data, there are some general considerations which can increase the power of hypothesis tests for a study. Initially, try to keep analyses as simple as possible. When resources are limited, it may be better to choose reliability over generalizability. Focus on a single main effect, rather than including all possible moderators in a complicated design. Those moderators can be looked at later, after you have assessed that a main effect is actually relevant. Generally, a two-group comparison is more powerful than a $k$-group comparison. And when considering possible measurements of that effect, choose the most precise possible. Generally, (almost) continuous dependent variables are more powerful than discrete ones.

17.2 Explore the data

Before viewing your data through the lens of a statistical model, get a sense of what the “raw data” looks like. Visual exploration, whether through raincloud plots, histograms, or boxplots, is very useful in this respect. Outliers and otherwise unusual data points are of particular concern, as they may have a dramatic impact on the results of further analyses. Removal of outliers should be done carefully and with reason. There is always a danger of “fitting the data to your hypothesis”. I usually run analyses both with all data, and with the “cleaned” data after removal of outliers. If the results are robust against the outliers, I tend to report the results of the full dataset. If the results are substantially affected by outliers, I mainly focus on the results of the analyses after removal of outliers, but report the results for the full data in e.g. an appendix or supporting materials. This will allow your audience to consider the effect of your exclusions, and determine whether these are reasonable to them. Another important consideration is homogeneity of variance within groups, although this is often best assessed within the context of a particular statistical model. Missing data is important too. If some conditions in an experiment lead to substantially more missing data than others, this can indicate that there is a problem in the design, where data is not “missing at random”. For instance, if you were to assess the effects of a new medicine, and all participants for whom the medicine had an adverse effect on their health were to drop out of a study, the results for those people completing the study would not reflect the adverse effects the medicine had on the participants who dropped out. As such, the results of the study would be obviously biased.
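Running an analysis both with and without outliers can be sketched as follows. This is only an illustration: the reaction times are made up, and the exclusion rule (more than three scaled median absolute deviations from the median) is an assumption chosen for the example, not a rule prescribed here.

```python
import statistics


def split_outliers(values, threshold=3.0):
    """Split values into (kept, removed) using a median/MAD rule.

    Assumption: a value counts as an outlier if it lies more than
    `threshold` scaled median absolute deviations from the median.
    """
    median = statistics.median(values)
    # 1.4826 makes the MAD comparable to a standard deviation under Normality
    mad = 1.4826 * statistics.median([abs(v - median) for v in values])
    kept = [v for v in values if abs(v - median) <= threshold * mad]
    removed = [v for v in values if abs(v - median) > threshold * mad]
    return kept, removed


# Hypothetical reaction times in milliseconds
reaction_times = [512, 498, 530, 475, 505, 520, 489, 2950]
kept, removed = split_outliers(reaction_times)

print("Removed:", removed)
print(f"Mean, full data:    {statistics.mean(reaction_times):.1f}")
print(f"Mean, cleaned data: {statistics.mean(kept):.1f}")
```

Reporting both summaries, and the rule that produced the exclusions, lets readers judge how much the conclusions depend on them.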

17.3 Evaluate the assumptions underlying your analyses

In most of the analyses covered here, which were (extensions of) the General Linear Model, we assumed the residuals were independent and Normal-distributed with a mean of 0 and a constant standard deviation $\sigma_\epsilon$. A Normal distribution concerns continuous variables, and in reality, we do not have infinite precision in measurement. Hence, the assumption of continuity is strictly false. That doesn’t necessarily mean that the General Linear Model is invalid. As long as the assumptions are approximately true, we can quite safely use the GLM without worrying that the assumed Type 1 error rate (if staying within a Frequentist viewpoint) is far from the actual Type 1 error rate. “Approximately true” is obviously a vague statement. The general consensus is that ANOVAs are reasonably robust against deviations from Normal-distributed errors, and when sample sizes are equal, also against violations of homoscedasticity (see Lix et al., 1996; Maxwell et al., 2017). As indicated in the earlier chapters of this book, when the largest variance in a group is less than four times the smallest variance of another group, you may not have to worry too much about violating the homoscedasticity assumption. In cases where the variance of a dependent variable increases with the mean (as is often the case when analysing reaction times), it may be fruitful to transform the dependent variable with, for instance, a logarithmic transformation. One issue with transforming the dependent variable is that it will change the relation between predictors and the dependent variable. If a relation is linear between a predictor and the dependent variable, the relation between the predictor and the transformed dependent variable will be nonlinear. This is not a concern in pure ANOVA type models, but may prove a thorny issue in linear regression models. Transformation of variables is therefore something that should be considered carefully. And you should always remember that the results reflect effects on a transformed variable.
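The variance-ratio rule of thumb and the effect of a logarithmic transformation can be illustrated with a small sketch. The two groups of reaction times below are hypothetical and deliberately constructed so that the spread scales with the mean; the function name is likewise an assumption for illustration.

```python
import math
import statistics


def variance_ratio(groups):
    """Ratio of the largest to the smallest sample variance across groups."""
    variances = [statistics.variance(group) for group in groups]
    return max(variances) / min(variances)


# Hypothetical reaction times (ms); the slow group is the fast group times four,
# so its variance is sixteen times larger
fast = [100, 200, 400]
slow = [400, 800, 1600]

raw_ratio = variance_ratio([fast, slow])
log_ratio = variance_ratio([[math.log(x) for x in fast],
                            [math.log(x) for x in slow]])

print(f"Variance ratio, raw scale: {raw_ratio:.1f}")
print(f"Variance ratio, log scale: {log_ratio:.1f}")
```

On the raw scale the ratio is well above four, whereas on the log scale the two groups have the same variance. Real data will rarely behave this neatly, and any effects estimated after the transformation are effects on log reaction time.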

Violation of the assumption of independent errors is much more problematic than a violation of Normality or homoscedasticity. This requires the use of an appropriate model (e.g. a linear mixed-effects model or repeated-measures ANOVA) that properly accounts for the dependence in the data.

While there are many formal statistical tests to assess whether the assumption of Normality (e.g. the Shapiro-Wilk test, or the Kolmogorov-Smirnov test) or homoscedasticity (e.g. the Levene test) holds, I don’t advise the routine use of these tests. A main issue here is that if you have a sufficient sample size, such tests are very powerful, and will often result in a rejection of the null hypothesis of a Normal distribution or homoscedasticity, even if the departure from these assumptions is only slight and hence not anything to really worry about. Relying on graphical procedures to detect substantial departures is generally more sensible. In any case, it is important to remember that the assumptions (at least in the Frequentist framework) ensure that the distribution of a test statistic is identical to an assumed distribution. But as test statistics aggregate many observations into a single value, results such as the Central Limit Theorem will often imply that the statistic will (at least approximately) follow the assumed distribution, even when the original data does not. When in doubt, it will be useful to use a nonparametric bootstrap for the test statistic to evaluate the assumptions. Another option is to use methods with less stringent assumptions. For example, when conducting a repeated-measures ANOVA, you might always apply a correction on the degrees of freedom. This will result in a loss of power, but also a reduction in Type 1 errors. Balancing Type 1 and Type 2 errors is tricky, but less of an issue when you have sufficient data. Where you can, collect more data, rather than making more restrictive assumptions about a smaller dataset.
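A nonparametric bootstrap of a statistic can be sketched as follows. This is a minimal sketch under stated assumptions: two independent groups, the difference in means as the statistic, and a simple percentile interval. The data and function name are hypothetical.

```python
import random
import statistics


def bootstrap_mean_diff(group_a, group_b, n_boot=2000, seed=1):
    """Percentile bootstrap interval (95%) for the difference in group means.

    Each group is resampled with replacement, independently of the other,
    so no distributional form is assumed for the observations.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper


# Hypothetical scores in two conditions
group_a = [12, 15, 14, 10, 13, 17, 16, 11]
group_b = [9, 8, 11, 7, 10, 12, 6, 9]

observed = statistics.mean(group_a) - statistics.mean(group_b)
lower, upper = bootstrap_mean_diff(group_a, group_b)
print(f"Observed difference: {observed:.2f}")
print(f"Bootstrap 95% interval: [{lower:.2f}, {upper:.2f}]")
```

If the bootstrap interval and the interval from the parametric model roughly agree, the distributional assumptions are probably not driving the result. A marked disagreement is a reason to look more closely.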

17.4 Distinguish between confirmatory and exploratory analyses

When embarking on a data analysis journey, you will often take a planned route, but also take some unplanned paths. It is important to distinguish between these, as unplanned routes may have been guided by random noise or other idiosyncrasies in the dataset collected. Andrew Gelman quite nicely describes the various choices made during an analysis as a “garden of forking paths” (Gelman & Loken, 2013). Other authors talk of “researcher degrees of freedom” (Simmons et al., 2011). Consider the following hypothetical example:

A researcher is interested in differences between Democrats and Republicans in how they perform in a short mathematics test when it is expressed in two different contexts, either involving health care or the military. The research hypothesis is that context matters, and one would expect Democrats to do better in the health-care context and Republicans in the military context. Party identification measured on a standard 7-point scale and various demographic information also available. At this point there is a huge number of possible comparisons that can be performed–all consistent with the data. For example, the pattern could be found (with statistical significance) among men and not among women–explicable under the theory that men are more ideological than women. Or the pattern could be found among women but not among men–explicable under the theory that women are more sensitive to context, compared to men. Or the pattern could be statistically significant for neither group, but the difference could be significant (still fitting the theory, as described above). Or the effect might only appear among men who are being asked the questions by female interviewers. We might see a difference between sexes in the health-care context but not the military context; this would make sense given that health care is currently a highly politically salient issue and the military is not. There are degrees of freedom in the classification of respondents into Democrats and Republicans from a 7-point scale. And how are independents and nonpartisans handled? They could be excluded entirely. Or perhaps the key pattern is between partisans and nonpartisans? And so on.
(Gelman & Loken, 2013, p. 3)

As this example illustrates, there are many potential ways to show evidence for a seemingly straightforward hypothesis regarding an interaction between political affiliation and context, which all can be justified by theory. An unscrupulous analyst hunting for a significant result might perform all of these analyses, and choose to report just the comparison with the strongest test result. This is also called “p-hacking” (Simmons et al., 2011), and will result in a proliferation of Type 1 errors in the scientific literature. So don’t be tempted to hack a $p$-value. But even highly conscientious analysts can be led astray when allowing hypothesis tests to be inspired by patterns in the data. Therefore, the results of unplanned analyses inspired by the results of other analyses should be treated with caution. Data is inherently noisy, and patterns which may seem obvious and easy to provide a post-hoc explanation for are noisy too. Whilst Bayesian hypothesis testing is sometimes portrayed as being immune to researcher intentions, this is really only the case for confirmatory analyses where the priors are determined before data collection. Bayesian analyses, like Frequentist ones, can be “hacked” too (Simonsohn, 2014).
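To get a feel for how quickly Type 1 errors accumulate when several analyses are tried and only the strongest is reported, consider the following sketch. It assumes that all null hypotheses are true and that the tests are independent. The latter is rarely the case for forking-path analyses of a single dataset, so the numbers are an illustration rather than an exact rate. The function name is hypothetical.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability of at least one significant result when every null is true.

    Assumption: the tests are independent of each other.
    """
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 10, 20):
    rate = chance_of_false_positive(n_tests)
    print(f"{n_tests:2d} tests: P(at least one p < .05) = {rate:.3f}")
```

Under these assumptions, ten attempted comparisons already give roughly a 40% chance of at least one "significant" result.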

To diminish the likelihood of being led astray, it is important to clearly distinguish between confirmatory and exploratory analyses. Confirmatory analyses are those that are planned before seeing the data. To be truly confirmatory, you should not only decide exactly which particular model to use, and which parameters to test, but also what exclusion criteria (if any) to apply to data points (e.g., what are considered to be outliers). If all is decided before seeing the data, these confirmatory analyses can be preregistered. Such preregistration is useful because by committing to a particular analysis, you avoid the temptation to deviate from your initial plans in seemingly innocuous ways, which nevertheless will increase the chance of Type 1 errors. Preregistration of analyses does not preclude further exploratory analyses. But these can then be clearly identified as such, and their results treated with the additional caution they deserve. When you don’t have pilot data, preregistration is difficult however, as there are often unforeseen issues in the data (e.g., heteroscedasticity which may require a transformation of the data, or unforeseen outliers). In responding to such unforeseen issues, it may be necessary to change the planned analyses, which in principle would render them “exploratory”. I don’t think it is always necessary to be this strict. But it will be up to the analyst to provide a strong justification for the deviation from the analysis plan. Simply providing such a justification is already a substantial advance over not clearly identifying data-driven choices which may have affected the results.

17.5 Aim for openness and reproducibility

In clearly distinguishing between confirmatory and exploratory analyses, you will be honest to yourself and to your audience. You should be open about the choices you have made when analysing the data, and why you made them. This will allow your audience to determine whether they agree with your choices or not, and hence whether they would likely reach the same conclusions as you. Transparency and openness are extremely important principles in the scientific process. Where possible, you should make the data of studies publicly accessible, so that other scientists have the opportunity to perform their own analyses or to combine your data with that of others to perform meta-analyses. When sharing the data, it is also very useful to share your analysis scripts (or e.g. a JASP output file), so that other researchers can exactly replicate your results. By sharing your full analysis scripts, you provide a clear document of the process that transformed the raw data as collected to the results presented in a scientific paper. This benefits not only other scientists, but also yourself. Some months or years after performing your analyses, you will often have forgotten what you did or why. Having a documented analysis script available will help you remember. Moreover, scripts and data tend to get lost if not stored safely and permanently on an external server. Keeping files secure and available as you upgrade computers is not easy. Luckily, there are useful and freely accessible platforms such as the Open Science Framework (https://osf.io/), figshare (https://figshare.com/), and Zenodo (https://zenodo.org/), which are specifically designed to openly share research outputs such as data and analysis scripts. Another popular choice is GitHub (https://github.com/), which is what I tend to use.

17.6 Communicate clearly and concisely

When writing up the results of statistical analyses, you should aim for clarity and conciseness. Foremost, that means providing all the relevant details of statistical analyses. The American Psychological Association (APA) provides extensive guidelines for reporting results of psychological studies in the APA publication manual (American Psychological Association, 2020). Some of the APA rules can be a little tedious, but they are meant to provide standards for effective scientific communication. Useful summaries of the main APA style guide, as well as brief guidelines on reporting numbers and statistics, are available online. Andy Field has also written a useful guide to writing research reports.

Here are some of my recommendations for describing analyses in the results section of a paper. They are mostly consistent with the APA guidelines, although I must admit the last version of the APA publication manual I read was the 5th edition from 2001.

I will start by listing some general guidelines, and then provide two example write-ups with appropriate figures, one for the multiple regression analysis that we focused on in Chapter 5, and another for the factorial ANOVA that we focused on in Chapter 8.

Start by describing the main objectives of an analysis.
Identify the model underlying the analysis (e.g. a multiple regression model, an ANOVA model, or a linear mixed-effects model).
Present the data on which the analysis is based. This is generally best done in a graph.
Report the results of statistical tests. Provide the key determinants of the distribution of the test statistic, such as the degrees of freedom for $t$-tests and $F$ tests, as well as the actual statistic (rounded to two decimals), and the precise $p$-value (rounded to three decimals), unless the $p$-value is smaller than .001. For example, $t(47) = 2.58$, $p = .013$, and $F(3, 396) = 6.31$, $p < .001$. For ANOVA results, it is also common to provide the Mean Squared Error (MSE) value, which is the unbiased estimate of the error variance, i.e. $\text{MSE} = \text{SSE}(\text{MODEL G})/(n - \text{npar}(G))$.
Where possible and relevant, provide estimated effect sizes and/or confidence intervals in addition to test results. For a multiple regression model, you can provide confidence intervals for the slopes, as well as the $R^2$ of the full model. For effects in an ANOVA model, a common effect size measure is the coefficient of partial determination, i.e. partial $\eta^2$ (Equation (5.8)).
Describe what the effects mean in terms of the dependent variable. For instance, in a multiple regression model, you should indicate what a significant slope indicates in terms of increasing or decreasing the (predicted value of the) dependent variable. For main effects in an ANOVA model, you should describe what a significant effect indicates about the (marginal) means of the dependent variable.
When the interpretation of a test result is not immediately clear (e.g. for an omnibus test in an ANOVA), describe the appropriate follow-up tests which provide clarity about the interpretation of that test result.
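The conventions for reporting test statistics and $p$-values listed above can be captured in a small helper. This is a sketch with hypothetical function names. The $p$-value passed for the $F$ test is made up, as the example above only states that it is below .001.

```python
def format_p(p):
    """Format a p-value: three decimals, no leading zero, '< .001' as floor."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")


def format_test(name, dfs, statistic, p):
    """Format a test result such as 't(47) = 2.58, p = .013'."""
    df_text = ", ".join(str(df) for df in dfs)
    return f"{name}({df_text}) = {statistic:.2f}, {format_p(p)}"


print(format_test("t", [47], 2.58, 0.013))
print(format_test("F", [3, 396], 6.31, 0.0003))  # hypothetical exact p-value
```

Generating the reported strings directly from the analysis output, rather than typing them by hand, also reduces transcription errors.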

17.6.1 Example of reporting a multiple regression analysis

An example of how I might communicate the results of the multiple regression analysis conducted to determine the effect of hate groups on votes for Donald Trump is as follows:

To determine the effect of hate groups on votes for Donald Trump in the 2016 US elections, we conducted a multiple regression analysis. The dependent variable was the percentage of votes for Donald Trump in each of the $n=50$ US states, and to account for variations in the size of the population of each state, the number of hate groups was measured per million citizens. As education level is expected to be related to voting behaviour, the regression model also included a measure of education level (as the percentage of citizens with a Bachelor’s degree or higher) as an additional predictor. This allowed us to determine the unique effect of hate groups on voting behaviour whilst controlling for the possibly confounding effect of education. Plots depicting the pairwise relations between percentage Trump votes, number of hate groups, and education level, are shown in Figure 17.1. The model accounted for a significant proportion of the variance of voting behaviour ($R^2 = .58$, $F(2, 47) = 33.06$, $p < .001$). The analysis showed a significant unique effect of hate groups on votes for Donald Trump ($b = 1.31$, 95% CI $[0.29, 2.34]$, $t(47) = 2.58$, $p = .013$). For every additional hate group per million citizens, votes for Donald Trump are predicted to increase by 1.31 percentage points. In addition, the analysis showed a significant effect of education level ($b = -1.22$, 95% CI $[-1.59, -0.85]$, $t(47) = -6.63$, $p < .001$). For every additional one percent of citizens with at least a Bachelor’s degree, votes for Donald Trump are predicted to decrease by 1.22 percentage points.

Figure 17.1: Pairwise scatterplots, nonparametric density plots, and pairwise correlations for the percentage of votes for Donald Trump, hate groups (per million citizens) and education level (% of citizens with a Bachelor’s degree or higher).

17.6.2 Example of reporting a factorial ANOVA

An example of how I might communicate the results of the factorial ANOVA assessing the effects of power prime and experimenter belief on approach advantage scores, is as follows:

To assess the effect of power prime and experimenter belief on participants’ speed in approaching and avoiding virtual targets, a 2 (power prime: low power vs high power) by 2 (experimenter belief: low power vs high power) factorial ANOVA was conducted. The dependent variable was the “approach advantage” score, computed as the difference in the average time (in milliseconds) between making an approach and an avoid response. The approach advantage scores in the four conditions are depicted in Figure 17.2. The analysis showed a significant main effect of experimenter belief ($F(1, 396) = 17.82$, $\mathit{MSE} = 45,974.87$, $p < .001$, $\hat{\eta}^2_p = .043$). The approach advantage score was higher when the experimenter was made to believe participants were provided with a high power prime ($M = 66.72$) compared to when they were made to believe participants were provided with a low power prime ($M = -23.79$). The main effect of power prime was not significant ($F(1, 396) = 0.31$, $\mathit{MSE} = 45,974.87$, $p = .577$, $\hat{\eta}^2_p = .001$). The interaction between experimenter belief and power prime was also not significant ($F(1, 396) = 0.77$, $\mathit{MSE} = 45,974.87$, $p = .381$, $\hat{\eta}^2_p = .002$).

Figure 17.2: Approach advantage scores by Experimenter belief and Power prime. Means and 95% confidence intervals are shown as black dots and bands respectively. Raw data points are shown in grey.
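Before submitting a write-up like the one above, it is worth checking that the reported numbers are mutually consistent. As a sketch, partial $\hat{\eta}^2$ can be recovered from the reported $F$ statistic and its degrees of freedom, using the identity $\hat{\eta}^2_p = F \times \text{df}_1 / (F \times \text{df}_1 + \text{df}_2)$. This follows from the definition of partial $\eta^2$ as the effect sum of squares divided by the sum of the effect and error sums of squares. The function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta-squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values and degrees of freedom as reported in the example write-up
reported = {
    "experimenter belief": 17.82,
    "power prime": 0.31,
    "interaction": 0.77,
}
for effect, f_value in reported.items():
    eta = partial_eta_squared(f_value, df_effect=1, df_error=396)
    print(f"{effect}: partial eta-squared = {eta:.3f}")
```

The recovered values match the effect sizes reported in the example to three decimals.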

The APA guidelines indicate that the results section should detail the results of statistical analyses in an “objective” manner, without directly noting what they imply in terms of support for your theory. While you should avoid statements such as “This clearly shows our theory is true”, it is good practice to provide some guidance on how to interpret the results. In the above, I tried to clearly indicate what a significant effect indicates in terms of the marginal means of the dependent variable. This is good practice, and does not mention anything about a theory. In addition, it is fine to add qualifiers like “As predicted, …” in the results section. More general evaluation of the evidence for your theory is generally left to the Discussion section of manuscripts, however. So in the discussion, we might interpret the results of this analysis as follows:

The results of our analysis clearly show that experimenter beliefs have an effect on participants’ behaviour. When experimenters believed participants were primed to take a high- or low-power role, those participants behaved according to the expectations from social priming theory. However, the actual prime provided to participants appeared to have little effect on their behaviour. This suggests that previous results might be due to experimenter expectations, rather than a direct effect of the power priming manipulations on participants’ behaviour. The precise mechanisms by which experimenter expectations influence participants’ behaviour may be subtle and were not directly addressed in this study. If anything, our results indicate that future investigations of power priming should adopt a double-blind procedure, where neither participants nor experimenters are aware of the condition assigned to participants.

References

American Psychological Association. (2020). Publication manual of the American Psychological Association (7th ed.). Washington, DC: American Psychological Association.

Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University.

Lix, L. M., Keselman, J. C., & Keselman, H. J. (1996). Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance F test. Review of Educational Research, 66, 579–619.

Maxwell, S. E., Delaney, H. D., & Kelley, K. (2017). Designing experiments and analyzing data: A model comparison perspective. Routledge.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.

Simonsohn, U. (2014). Posterior-hacking: Selective reporting invalidates Bayesian results also. Retrieved from https://dx.doi.org/10.2139/ssrn.2374040