# Statistics

We offer a large number of different individual modules that can be flexibly combined. We will be happy to support you in putting together your individual training.

R is a programming language designed specifically for statistical computing. For this reason, we have a whole category of modules dedicated to statistical analyses in R. Module 2.1 gives a general introduction to inferential statistics, whereas the remaining modules cover the application of different statistical methods in R.

**Module 2.1** Introduction to Inferential Statistics

*Prerequisites: None*

Inferential statistics uses data from a sample to draw conclusions about the population from which the sample was taken. In this module we lay the foundation for statistical testing, covering the theory of formulating hypotheses, significance levels, the construction of confidence intervals, and finally the interpretation of test results. By the end of this module, terms like \( \alpha \) error, critical value, p-value, population and sample, and null hypothesis will be familiar to you.
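As a small illustration of how these terms fit together, the following sketch runs a two-sided one-sample t-test in base R. The data are simulated, and the hypothesized mean of 50, the sample size, and the significance level of 0.05 are assumptions chosen only for this example.

```r
# Simulated sample; all numbers here are illustrative assumptions
set.seed(1)
x <- rnorm(30, mean = 52, sd = 10)

alpha <- 0.05                                      # significance level (alpha error)
res <- t.test(x, mu = 50, conf.level = 1 - alpha)  # H0: population mean equals 50

t_stat <- unname(res$statistic)                    # observed test statistic
crit   <- qt(1 - alpha / 2, df = length(x) - 1)    # two-sided critical value

res$p.value    # p-value of the test
res$conf.int   # confidence interval for the population mean

# Reject H0 if the p-value is below alpha
# (equivalently, if |t_stat| exceeds the critical value)
reject_h0 <- res$p.value < alpha
reject_h0
```

Both decision rules, comparing the p-value with \( \alpha \) and comparing the test statistic with the critical value, lead to the same conclusion for this test.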

Duration: approx. 3 hours

**Module 2.2** Common Statistical Tests

*Prerequisites: Inferential Statistics (module 2.1 or equivalent skills)*

The choice of a suitable test mainly depends on the question to be answered, the number of samples, their relationship to each other, and further distributional assumptions. We present the most common tests, their underlying assumptions, and how to run them in R. This includes, for example, the one-sample t-test and its nonparametric counterpart, the two-sample t-test, the chi-square test, and ANOVA for multiple samples.
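The tests named above are all available in base R. The sketch below applies them to the built-in `mtcars` and `iris` data sets; the choice of variables and the reference value of 20 are assumptions made purely for illustration, not part of the course material.

```r
# One-sample t-test: is the mean fuel consumption different from 20 mpg?
one_t <- t.test(mtcars$mpg, mu = 20)

# Nonparametric counterpart: Wilcoxon signed-rank test
# (exact = FALSE requests the normal approximation because the data contain ties)
one_w <- wilcox.test(mtcars$mpg, mu = 20, exact = FALSE)

# Two-sample t-test (Welch version by default): automatic vs. manual transmission
two_t <- t.test(mpg ~ am, data = mtcars)

# Chi-square test of independence for two categorical variables
chi <- chisq.test(table(mtcars$am, mtcars$vs))

# One-way ANOVA for more than two groups
anova_fit <- aov(Sepal.Length ~ Species, data = iris)
anova_p   <- summary(anova_fit)[[1]][["Pr(>F)"]][1]

# Collect the p-values side by side
round(c(t_one = one_t$p.value, wilcoxon = one_w$p.value,
        t_two = two_t$p.value, chi_square = chi$p.value,
        anova = anova_p), 4)
```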

Duration: approx. 3 hours

**Module 2.3** Linear Regression

*Prerequisites: Introduction to R (module 1.1 or equivalent skills), Inferential Statistics (module 2.1 or equivalent skills); experience with ggplot2 (module 4.1) is an advantage*

Linear regression uses information on one or more variables to predict the outcome of another variable of interest. We show you how to compute a linear regression in R, interpret the results, and use it to make predictions. This module also covers potential challenges that can arise in linear regression, and how to deal with them: extrapolation (beyond the data range of your sample data), control of third variables, overfitting, dummy variables, goodness of model fit, and model comparison.
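A minimal sketch of this workflow in base R, using the built-in `mtcars` data: the model formula, the dummy coding of the transmission type, and the values of the hypothetical new car are assumptions chosen for illustration.

```r
# Recode transmission as a factor so that lm() creates a dummy variable for it
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Fit: fuel consumption explained by weight, controlling for transmission type
fit <- lm(mpg ~ wt + am, data = cars)

summary(fit)                       # coefficients, standard errors, p-values
r2 <- summary(fit)$r.squared       # goodness of fit

# Predict for a hypothetical new car; wt = 3 lies inside the observed
# weight range, so this is not an extrapolation
new_car <- data.frame(wt = 3, am = factor("manual", levels = levels(cars$am)))
pred <- predict(fit, newdata = new_car, interval = "prediction")
pred
```

The `interval = "prediction"` argument returns a point prediction together with lower and upper bounds for a single new observation.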

Duration: approx. 3.5 hours

**Module 2.4** Logistic Regression

*Prerequisites: Introduction to R (module 1.1 or equivalent skills), Inferential Statistics (module 2.1 or equivalent skills), Linear Regression (module 2.3 or equivalent skills)*

While linear regression is well suited to continuous outcomes, it is inappropriate for binary outcomes such as "0/1" or "yes/no". For these, you need logistic regression. We show you how to calculate a logistic regression in R and how to interpret the results.
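As a sketch of what this looks like in practice, the following base R example models the binary variable `am` (manual vs. automatic transmission) from the built-in `mtcars` data as a function of car weight. The choice of outcome, predictor, and the two hypothetical weights is an assumption made only for illustration.

```r
# Logistic regression: probability of a manual transmission given car weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)

summary(fit)                   # coefficients are on the log-odds scale

# Exponentiated coefficients can be read as odds ratios
odds_ratios <- exp(coef(fit))
odds_ratios

# Predicted probabilities for two hypothetical cars (weight in 1000 lbs)
new_cars <- data.frame(wt = c(2.5, 3.5))
probs <- predict(fit, newdata = new_cars, type = "response")
probs
```

With `type = "response"` the predictions are probabilities between 0 and 1 rather than log-odds.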

Duration: approx. 2.5 hours

**Module 2.5** Overfitting, Out-of-sample Fit, and Model Comparison

*Prerequisites: Introduction to R (module 1.1 or equivalent skills), Inferential Statistics (module 2.1 or equivalent skills), Linear Regression (module 2.3 or equivalent skills)*

If a statistical model is too flexible, it can sometimes perform much worse on new data than on the data used to calibrate it. This unpleasant surprise is called "overfitting". We present methods to check for overfitting, strategies to avoid it, how to test if the model works well on new data, and ways to compare multiple models.
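One simple way to check for overfitting is to hold back part of the data and compare the error on the training data with the error on the held-back data. The sketch below does this in base R with simulated data; the linear data-generating process, the 40/20 split, and the polynomial of degree 10 are assumptions chosen for illustration, and `rmse` is a hypothetical helper defined here.

```r
set.seed(42)
n <- 60
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)   # assumed true relationship: linear plus noise
dat <- data.frame(x = x, y = y)

# Split into training data (model fitting) and test data (out-of-sample check)
train_idx <- sample(n, size = 40)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

simple   <- lm(y ~ x, data = train)            # matches the true structure
flexible <- lm(y ~ poly(x, 10), data = train)  # far more flexible than needed

# Hypothetical helper: root mean squared error of a model on a data set
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

results <- data.frame(
  model      = c("simple", "flexible"),
  train_rmse = c(rmse(simple, train), rmse(flexible, train)),
  test_rmse  = c(rmse(simple, test),  rmse(flexible, test))
)
print(results)

# Information criteria are another way to compare models fitted to the same data
AIC(simple, flexible)
```

The flexible model always fits the training data at least as well as the simple one, because the simple model is a special case of it. A clearly larger gap between its training and test error is the typical sign of overfitting.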

Duration: approx. 2 hours

**Module 2.6** Cluster Analysis

*Prerequisites: Introduction to R (module 1.1 or equivalent skills)*

Cluster analysis structures data (such as customers) into groups. There are several decisions to be made in a cluster analysis: for example, determining the appropriate "distance measure" or clustering algorithm. We cover hierarchical, partitioning, and model-based cluster analyses, and explain how to find the appropriate method depending on the data and the question at hand.
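To make these decisions concrete, the following base R sketch runs a hierarchical and a partitioning cluster analysis on the numeric columns of the built-in `iris` data. The Euclidean distance, Ward linkage, and the choice of three clusters are assumptions made for illustration; model-based clustering is left out here because it requires an add-on package.

```r
# Standardize the variables so that no single one dominates the distances
x <- scale(iris[, 1:4])

# Hierarchical clustering: choose a distance measure and a linkage method
d  <- dist(x, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
hc_groups <- cutree(hc, k = 3)          # cut the tree into 3 groups

# Partitioning clustering: k-means with several random starts
set.seed(1)
km <- kmeans(x, centers = 3, nstart = 25)

# Cross-tabulate the two solutions to see how far they agree
table(hierarchical = hc_groups, kmeans = km$cluster)
```

Calling `plot(hc)` draws the dendrogram, which can help in deciding how many groups to keep.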

Duration: approx. 3 hours

**Module 2.7** Dimension Reduction: Factor Analysis and Principal Component Analysis

*Prerequisites: Introduction to R (module 1.1 or equivalent skills)*

Factor analysis and principal component analysis are two very common methods for dimension reduction (limiting the number of variables to a more manageable quantity). We explain the underlying theory, and apply both methods to example data. Since the two methods are frequently confused, we explain the differences between them, and discuss which circumstances call for each method.
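Both methods are available in base R. The sketch below applies them to the built-in `ability.cov` data, a covariance matrix of six ability tests; extracting two factors is an assumption made for illustration.

```r
# Principal component analysis on the correlation matrix
pca <- princomp(covmat = ability.cov, cor = TRUE)
explained <- pca$sdev^2 / sum(pca$sdev^2)   # share of variance per component
round(explained, 3)

# Factor analysis (maximum likelihood) with two factors
fa <- factanal(factors = 2, covmat = ability.cov)
print(fa$loadings, cutoff = 0.3)            # hide small loadings for readability
round(fa$uniquenesses, 3)                   # variance not explained by the factors
```

The output reflects the conceptual difference: PCA re-expresses the total variance of all variables in components, whereas factor analysis separates the variance shared through common factors from each variable's unique variance (the uniquenesses).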

Duration: approx. 2 hours

**Module 2.8** Questionnaire Design and Goodness of Scales

*Prerequisites: Introduction to R (module 1.1 or equivalent skills)*

Creating a questionnaire seems easy, but there are a lot of pitfalls that can render the collected data useless. This module is about collecting complete, valid, and unbiased data. This includes the three important concepts of objectivity, reliability, and validity. In this context we cover Cronbach’s \( \alpha \), a measure of a questionnaire’s internal consistency.
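Cronbach’s \( \alpha \) can be computed directly from the item variances and the variance of the total score. The following base R sketch uses the standard formula; `cronbach_alpha` is a hypothetical helper written for this example, and the questionnaire answers are simulated under the assumption that five items measure one common trait.

```r
# Hypothetical helper: Cronbach's alpha for a matrix with one column per item
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                       # number of items
  item_vars <- apply(items, 2, var)      # variance of each item
  total_var <- var(rowSums(items))       # variance of the sum score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Simulated answers: each item is the common trait plus item-specific noise
set.seed(7)
trait   <- rnorm(200)
answers <- sapply(1:5, function(i) trait + rnorm(200, sd = 0.8))

alpha_hat <- cronbach_alpha(answers)
alpha_hat
```

Values close to 1 indicate that the items vary together, which is what one expects when they measure the same underlying construct.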

Duration: approx. 1.5 hours