Blog

Cluster Analysis - Part 2: Hands On

Wed 21 Nov 2018·by Sarah Wagner

In the first part of this blog series, we examined the theoretical foundations of cluster analysis. Now we put the theory into practice using R and find a cluster solution for the mtcars data set. Then the cluster solution is evaluated and interpreted.

Cluster Analysis - Part 1: Introduction

Tue 06 Nov 2018·by Sarah Wagner

This article focuses on introducing the theoretical concepts of cluster analysis. You'll get a basic understanding of the underlying measures and the different methods that can be used for clustering. An evaluation method for group structures and cluster solutions is introduced towards the end of the article.

Optimize your R Code using Memoization

Thu 11 Oct 2018·by Sebastian Warnholz

This article describes how you can apply a programming technique called memoization to speed up your R code and resolve performance bottlenecks.

Introducing the Kernelheaping Package III

Tue 25 Sep 2018·by Marcus Groß

The Kernelheaping package also supports boundary-corrected kernel density estimation, which allows us to exclude areas where we know that the density must be zero. One example is estimating population density, where we would like to exclude uninhabited areas such as lakes, forests, or parks. The Kernelheaping package employs a boundary correction method in which each individual kernel is restricted to the area of interest.

Do GPU-based Basic Linear Algebra Subprograms (BLAS) improve the performance of standard modeling techniques in R?

Mon 06 Aug 2018·by Matthäus Deutsch

The speed or run-time of models in R can be a critical factor, especially considering the size and complexity of modern datasets. The number of data points as well as the number of features can easily be in the millions. Even relatively trivial modeling procedures can consume a lot of time, which is critical for both optimizing and updating models. An easy way to speed up computations is to use an optimized BLAS (Basic Linear Algebra Subprograms). This has potential, since R’s default BLAS is well regarded for its stability and portability, but not necessarily for its speed.

Introducing the Kernelheaping Package II

Fri 13 Jul 2018·by Marcus Groß

Interval censoring can be generalised to rectangles or alternatively even arbitrary shapes. That may include counties, zip codes, electoral districts or administrative districts. Standard area-level mapping methods such as choropleth maps suffer from very different area sizes or odd area shapes which can greatly distort the visual impression. The Kernelheaping package provides a way to convert these area-level data to a smooth point estimate.

Prediction: Who will win the 2018 World Cup?

Mon 28 May 2018·by Jonathan Bob

All over the world, at the newsstand, on public transport and above all in countless betting communities, football fans are currently asking themselves: Who will win the 2018 Football World Cup? Using statistical data science models, we simulated the 2018 FIFA World Cup 10,000 times to determine each team's probability of winning and thus the World Cup favourites. Throughout the tournament, you will find the answer to the question of who the top favourites are here in our blog, updated daily and based on a lot of data and up-to-date statistical analyses.

Design Patterns in R

Wed 04 Apr 2018·by Sebastian Warnholz

This article is a reflection on how I use different strategies to solve problems in R. Design Pattern seems to be a big word, especially because of its use in object-oriented programming. But in the end I think it is simply the correct label for recurring strategies to design software.

smoothScatter with ggplot2

Mon 05 Mar 2018·by Sebastian Warnholz

The motivation for this plot is the function graphics::smoothScatter, which is basically a plot of a two-dimensional density estimator. In the following I want to reproduce its features with ggplot2.

Introducing the Kernelheaping Package

Tue 06 Feb 2018·by Marcus Groß

In this blog article I'd like to introduce univariate kernel density estimation for heaped (i.e. rounded or interval-censored) data with the Kernelheaping package.

INWT's guidelines for R code

Thu 25 Jan 2018·by Mira Céline Klein

Sticking to a style guide helps you write cleaner code and makes working in a team more comfortable. In this article, we present the style guide we use at INWT – and how you can check your code for deviations from certain style rules.

A Not So Simple Bar Plot Example Using ggplot2

Tue 12 Dec 2017·by Sebastian Warnholz

This is a reproduction of the (simple) bar plot of chapter 6.1.1 in Datendesign mit R with ggplot2.

Tips for A/B Testing with R

Wed 22 Nov 2017·by Mira Céline Klein

Which layout of an advertisement leads to more clicks? Would a different color or position of the purchase button lead to a higher conversion rate? Does a special offer really attract more customers – and which of two phrasings would be better? For a long time, people have trusted their gut feeling to answer these questions. Today all these questions can be answered by conducting an A/B test.

Promises and Closures in R

Wed 01 Nov 2017·by Sebastian Warnholz

As I began to (re)discover the usefulness of closures, I remembered some behaviour that seemed very strange at first sight. It is actually consistent with the scoping rules of R, but it took a while until I felt I was on the same level of consistency.

Plane Crash Data - Part 3: Visualisation

Wed 16 Aug 2017·by Sarah Wagner

This last part is about visualizing the crash location and the flight route with the help of the R package leaflet.

Plane Crash Data - Part 2: Google Maps Geocoding API Request

Wed 16 Aug 2017·by Sarah Wagner

In this part I'll request the geocoordinates for the crash location, the point of departure, and the intended arrival location from the Google Maps Geocoding API.

A meaningful file structure for R projects

Tue 08 Aug 2017·by Mira Céline Klein

Have you ever tried to find your way around in the file structure of an already existing project? To separate relevant from obsolete files in a historically grown directory? To find out in which order existing scripts should be executed? To make all this easier, it helps to have a consistent file and folder structure across your projects. In this article we present our file structure for R projects to help you get started. 

Plane Crash Data - Part 1: Web Scraping

Tue 01 Aug 2017·by Sarah Wagner

This first part is about how to scrape information on aviation accidents from planecrashinfo.com. On this site you can find multiple tables inside tables with lots of information on aviation accidents of the last century.

Who will win the 2017 Bundestag election?

Mon 20 Mar 2017·by Marcus Groß

This article presents INWT's election forecast for the 2017 elections to the Bundestag. A statistical forecasting model based on the survey results of leading German survey institutes is presented. Unlike the survey institutes, we can also use our election forecast to predict the probability of possible coalitions after the election.

MariaDB monitor

Tue 07 Mar 2017·by Martin Badicke

MariaDB is currently the fastest growing open source database solution. It is mainly developed by the MariaDB Corporation and is a fork of MySQL. This article describes our own solution for monitoring and optimizing our internal database infrastructure, implemented with R and Shiny: the MariaDB monitor. It is an open source alternative to existing fee-based or inflexible monitoring tools.

100 grams of Lego, please.

Wed 15 Feb 2017·by Sarah Wagner

A statistical analysis of more than 150 Lego building kits shows that the price of individual Lego components is determined not only by their size, but also by the Lego theme, such as Star Wars.