• INWT
  • Blog: Data Science

Continuous Integration: Introduction to Jenkins

Mon 15 Jun 2020·by Michelle Golchert

Jenkins is currently the leading open source automation server. It is written in Java and distributed under the MIT license. Jenkins is free to use and very flexible: it works with a wide range of version control systems and offers more than 1,500 plugins. In this blog article we introduce the CI tool Jenkins and the essential aspects of its user interface.

Continuous Integration: What it is, Why it Matters, and Tools to Get Started

Mon 15 Jun 2020·by Sarah Wagner

This article provides a theoretical introduction to Continuous Integration and an overview of the pros and cons of using CI tools. It also presents a selection of tools for getting started.

Understanding and Handling Missing Data

Tue 31 Mar 2020·by Marina Wyss

Missing or incomplete data can have a huge negative impact on any data science project. In this blog we explore what kinds of missing data exist, and how we can go about overcoming the challenges they present. 

shinyMatrix - Matrix Input for Shiny Apps

Thu 13 Feb 2020·by Andreas Neudecker

In this post we’d like to introduce you to our new R package shinyMatrix. It provides you with an editable matrix input field for shiny apps.

Building a Strong Data Science Team from the Ground Up

Fri 27 Dec 2019·by Marina Wyss

Business is changing as a result of the increasing quantity and variety of data available. Significant new opportunities can be realized by harnessing the knowledge contained in these data - if you know where to look. A data science team can help to bring raw data through the analysis process and derive insights that are critical in today’s technologically-competitive environment.

Data Visualization in R vs. Python

Mon 23 Dec 2019·by Michelle Golchert

Visualization tools in R and Python support projects in different ways. If you are still unsure which language is right for you, this article can help you decide. It presents common packages from both languages and creates sample graphics with each.

Debugging in R: How to Easily and Efficiently Conquer Errors in Your Code

Tue 19 Nov 2019·by Marina Wyss

When you write code, you’re sure to run into problems from time to time. Here are some advanced tips and tricks for handling these errors, explained accessibly.

Marketing Mix Modeling - How Does Advertising Really Work?

Mon 21 Oct 2019·by Sebastian Cattes

One of the biggest challenges that companies face is to use their advertising budgets efficiently, and to advertise purposefully such that advertising meets the customer when it has the most leverage - without being overwhelming, repetitive, or irrelevant. With Marketing Mix Modeling, we can help to overcome this challenge.


Multi-Armed Bandits as an A/B Testing Solution

Thu 26 Sep 2019·by Marina Wyss

Multi-Armed Bandit algorithms are a modern alternative to traditional A/B testing. Similar to Reinforcement Learning, these algorithms can optimize what is shown to the client to maximize rewards while simultaneously determining the most successful option for your business. 

Data Quality and the Importance of Data Stewardship

Tue 17 Sep 2019·by Marina Wyss

Having understandable, clean, and compliant data is a necessity for business success. Specific care is needed to ensure that analyses based on these data are reliable and offer value to an organization. In this context, the role of a Data Steward is becoming ever more valuable. This article discusses the roles and tasks of Data Stewardship.

Best Practice: Development of Robust Shiny Dashboards as R Packages

Mon 09 Sep 2019·by David Berscheid

This article describes best practice approaches for developing shiny dashboards. Building the dashboard as a package and covering it with unit tests should lead to robust, high-quality solutions.

What's the Best Statistical Software? A Comparison of R, Python, SAS, SPSS and STATA

Thu 25 Jul 2019·by Amit Ghosh

An introduction to and comparison of the market leaders in statistics programs - R, Python, SAS, SPSS, and STATA - to help pick the best one for your needs.

Shiny Modules

Tue 16 Jul 2019·by Andreas Neudecker

In this article we look at how to build a shiny app with clear code and reusable, automatically tested modules. We first look at the package structure and at testing a shiny app before we focus on the actual modules.

Best Practice in TV Tracking: Why a Simple Baseline Correction Falls Short!

Wed 19 Jun 2019·by Steffen Wagner

In current online marketing practice, short-term TV-induced web page traffic is usually quantified by a simple baseline correction. In our blog article, we show which measurement errors this approach introduces, how they can be avoided, and how the identified TV impact is correctly accounted for in the attribution.

rsync as R package

Tue 21 May 2019·by David Berscheid

In this article we present our R package rsync, which serves as an interface between R and the popular Linux command line tool rsync. Rsync allows users of Unix systems to synchronize local and remote files between two locations.

Using Modules in R

Tue 07 May 2019·by Sebastian Warnholz

When a code base grows, we may first think of splitting it into several files and sourcing them. Functions, of course, are rightfully advocated to new R users and are the essential building block. Packages are then already the next level of abstraction we have to offer. With the modules package I want to provide something in between.

ggCorpIdent: Stylize ggplot2 Graphics in Your Corporate Design

Mon 25 Mar 2019·by Steffen Wagner

ggCorpIdent is a package for customizing ggplot2 graphics in R easily and without touching the plot code itself. It lets you use custom colors in the plot, which are interpolated if you have not specified as many colors as needed. You can add custom fonts for the text elements within the plot and embed your corporate logo.

R Markdown Template for Business Reports

Wed 30 Jan 2019·by Sarah Wagner

In this post I'd like to introduce the R Markdown template for business reports by INWTlab. It's a nice and clean template for use in a corporate environment that is easy to customize in colors, cover and logo.

Cluster Analysis - Part 2: Hands On

Wed 21 Nov 2018·by Sarah Wagner

In the first part of this blog series, we examined the theoretical foundations of cluster analysis. Now we put the theory into practice using R and find a cluster solution for the mtcars data set. Then the cluster solution is evaluated and interpreted.

Cluster Analysis - Part 1: Introduction

Tue 06 Nov 2018·by Sarah Wagner

This article focuses on introducing the theoretical concepts of cluster analysis. You'll get a basic understanding of the underlying measures and the different methods that can be used for clustering. An evaluation method for group structures and cluster solutions is introduced towards the end of the article.

Optimize your R Code using Memoization

Thu 11 Oct 2018·by Sebastian Warnholz

This article describes how you can apply a programming technique, called Memoization, to speed up your R code and solve performance bottlenecks.

Introducing the Kernelheaping Package III

Tue 25 Sep 2018·by Marcus Groß

The Kernelheaping package also supports boundary-corrected kernel density estimation, which allows us to exclude areas where we know that the density must be zero. One example is estimating population density, where we would like to exclude uninhabited areas such as lakes, forests, and parks. The Kernelheaping package employs a boundary correction method in which each single kernel is restricted to the area of interest.

Do GPU-based Basic Linear Algebra Subprograms (BLAS) improve the performance of standard modeling techniques in R?

Mon 06 Aug 2018·by Matthäus Deutsch

The speed or run-time of models in R can be a critical factor, especially considering the size and complexity of modern datasets. The number of data points as well as the number of features can easily be in the millions. Even relatively trivial modeling procedures can consume a lot of time, which is critical both for optimizing and for updating models. An easy way to speed up computations is to use an optimized BLAS (Basic Linear Algebra Subprograms). This is promising because R’s default BLAS is well regarded for its stability and portability, but not necessarily for its speed.

Introducing the Kernelheaping Package II

Fri 13 Jul 2018·by Marcus Groß

Interval censoring can be generalised to rectangles or alternatively even arbitrary shapes. That may include counties, zip codes, electoral districts or administrative districts. Standard area-level mapping methods such as choropleth maps suffer from very different area sizes or odd area shapes which can greatly distort the visual impression. The Kernelheaping package provides a way to convert these area-level data to a smooth point estimate.

Prediction: Who will win the 2018 World Cup?

Mon 28 May 2018·by Jonathan Bob

All over the world, at the newsstand, on public transport and above all in countless betting communities, football fans are currently asking themselves the question: Who will win the 2018 Football World Cup? Using statistical data science models, we simulated the 2018 FIFA World Cup 10,000 times to determine each team's probability of winning and thus the World Cup favourites. Throughout the tournament, you will find the current top favourites here in our blog, updated daily and based on a lot of data and up-to-date statistical analyses.

Design Patterns in R

Wed 04 Apr 2018·by Sebastian Warnholz

This article is a reflection on how I use different strategies to solve things in R. Design Pattern seems to be a big word, especially because of its use in object-oriented programming. But in the end I think it is simply the correct label for recurring strategies to design software.

smoothScatter with ggplot2

Mon 05 Mar 2018·by Sebastian Warnholz

The motivation for this plot is the function graphics::smoothScatter, basically a plot of a two-dimensional density estimator. In the following I want to reproduce its features with ggplot2.

Introducing the Kernelheaping Package

Tue 06 Feb 2018·by Marcus Groß

In this blog article I'd like to introduce the univariate kernel density estimation for heaped (i.e. rounded or interval censored) data with the Kernelheaping package.

INWT's guidelines for R code

Thu 25 Jan 2018·by Mira Céline Klein

Sticking to a style guide helps you write cleaner code and makes working in a team more comfortable. In this article, we present the style guide we use at INWT – and how you can check your code for deviations from certain style rules.

A Not So Simple Bar Plot Example Using ggplot2

Tue 12 Dec 2017·by Sebastian Warnholz

This is a reproduction of the (simple) bar plot of chapter 6.1.1 in Datendesign mit R with ggplot2.

Tips for A/B Testing with R

Wed 22 Nov 2017·by Mira Céline Klein

Which layout of an advertisement leads to more clicks? Would a different color or position of the purchase button lead to a higher conversion rate? Does a special offer really attract more customers – and which of two phrasings would be better? For a long time, people have trusted their gut feeling to answer these questions. Today all these questions can be answered by conducting an A/B test.

Promises and Closures in R

Wed 01 Nov 2017·by Sebastian Warnholz

Beginning to (re)discover the usefulness of closures, I remember some behaviour that seemed very strange at first sight. It is actually consistent with the scoping rules of R, but it took a while until I felt I was on the same level of consistency.

Plane Crash Data - Part 3: Visualisation

Wed 16 Aug 2017·by Sarah Wagner

This last part is about visualizing the crash location and the flight route with the help of the R package leaflet.

Plane Crash Data - Part 2: Google Maps Geocoding API Request

Wed 16 Aug 2017·by Sarah Wagner

In this part I'll request the geocoordinates for the crash location, the point of departure, and the intended arrival location from the Google Maps Geocoding API.

A meaningful file structure for R projects

Tue 08 Aug 2017·by Mira Céline Klein

Have you ever tried to find your way around in the file structure of an already existing project? To separate relevant from obsolete files in a historically grown directory? To find out in which order existing scripts should be executed? To make all this easier, it helps to have a consistent file and folder structure across your projects. In this article we present our file structure for R projects to help you get started. 

Plane Crash Data - Part 1: Web Scraping

Tue 01 Aug 2017·by Sarah Wagner

This first part is about how to scrape information on aviation accidents from a website. On this site you can find multiple tables inside tables with lots of information on aviation accidents of the last century.

Who will win the 2017 Bundestag election?

Mon 20 Mar 2017·by Marcus Groß

This article presents INWT's election forecast for the 2017 elections to the Bundestag. It introduces a statistical forecasting model based on the survey results of leading German survey institutes. Unlike the survey institutes, we can also use our election forecast to predict the probability of possible coalitions after the election.

MariaDB monitor

Tue 07 Mar 2017·by Martin Badicke

MariaDB is currently the fastest growing open source database solution. It is mainly developed by the MariaDB Corporation and is a fork of MySQL. This article describes our own solution for monitoring and optimizing our internal database infrastructure, implemented with R and Shiny: the MariaDB monitor. It is an open source alternative to existing fee-based or inflexible monitoring tools.

100 grams of Lego, please.

Wed 15 Feb 2017·by Sarah Wagner

A statistical analysis of more than 150 Lego building kits shows that the price of individual Lego components is determined not only by their size, but also by the Lego theme, such as Star Wars.