Blog: Data Science

Thu 27 Oct 2022

The purpose of our PAC is to build an infrastructure of data, models, forecasts, and reports that enables rapid comparison of models in terms of their predictive power based on a standardized assessment.

Wed 26 Jan 2022

This article describes how to write data from Python into an Excel file and format it. Thanks to automation, the content and appearance of the report only need to be programmed once; after that, the report can be recreated with minimal effort, for example for partial data sets or daily updated data.

Wed 15 Dec 2021

Software projects nowadays require much more than just writing code. Different programming languages, frameworks and architectures increase the complexity of projects. Docker provides applications with all their dependencies as packages in so-called "images" and thus simplifies workflows. This article serves as an introduction to the topic and gives you an overview of the basic concepts of Docker.

Thu 18 Nov 2021

In this second part we show an interesting way to combine existing Python packages and concepts to tackle some problems of the Python programming language. You will learn how to create a simple, yet flexible and powerful way to do complex DataFrame validation with Pydantic. With this approach, unit tests for functions that return DataFrames can be reduced, and the data quality of production pipelines can be ensured.

Tue 09 Nov 2021

In this article we discuss the downsides of Python's dynamic typing with regard to data quality and code maintainability. We give an introduction to the Pydantic package for input validation and show how decorators work.

Mon 04 Oct 2021

In our introductory article, we explained how discrete choice models can generate insights into customers' decision-making behaviour. In this article we show how to estimate an MNL model using RStan, the R interface of the statistical software Stan.

Thu 05 Aug 2021

This is the fourth part of our series about code performance in R. In the first part, I introduced methods to measure which part of a given code is slow. The second part lists general techniques to make R code faster. The third part deals with parallelization. In this part we are going to have a look at the challenges that come with large datasets.
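
The article's specific techniques aren't reproduced here, but as one common illustration of the problem space, the data.table package handles large datasets in R by reading files quickly and aggregating by reference instead of copying:

```r
# data.table avoids copies: grouped aggregation happens on the
# in-memory table itself
library(data.table)

dt <- data.table(group = sample(letters[1:3], 1e6, replace = TRUE),
                 value = rnorm(1e6))

# mean per group, computed without copying the full table
dt[, .(meanValue = mean(value)), by = group]
```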

Wed 30 Jun 2021

This is the third part of our series about code performance in R. In the first part, I introduced methods to measure which part of a given code is slow. The second part lists general techniques to make R code faster. In this part you will see how to take advantage of parallelization in R.
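
As a minimal sketch of the idea (the article covers more approaches), an lapply() call can be distributed over several worker processes with the base parallel package; a socket cluster like the one below also works on Windows:

```r
library(parallel)

slowSquare <- function(x) {
  Sys.sleep(0.1)  # stand-in for an expensive computation
  x^2
}

cl <- makeCluster(detectCores() - 1)    # start worker processes
res <- parLapply(cl, 1:20, slowSquare)  # distribute the work
stopCluster(cl)                         # always release the workers
```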

Thu 03 Jun 2021

Working in social media can be stressful, monotonous, and repetitive. In this blog post we show you how to take a screenshot of any website and post it via the Twitter API to boost your social media presence automatically.
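
The post's exact toolchain isn't shown here; as one possible combination in R (an assumption, not necessarily what the post uses), the webshot package can render a website to a PNG and the rtweet package can post it, given authenticated Twitter API access:

```r
library(webshot)  # needs PhantomJS, see webshot::install_phantomjs()
library(rtweet)   # needs Twitter API credentials

webshot("https://www.inwt-statistics.com", file = "screenshot.png")

# newer rtweet versions may additionally require media_alt_text
post_tweet(status = "Fresh from the blog!", media = "screenshot.png")
```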

Wed 05 May 2021

This is the second part of our series about code performance in R. It presents a number of approaches to reduce the time your code needs to run. It's useful to know these ideas before starting to write new code, but they also help to optimize existing code.
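
One idea from that toolbox, shown here as a small sketch: vectorized operations are usually far faster than an explicit loop over elements:

```r
x <- rnorm(1e6)

# loop version: one R-level iteration per element
system.time({
  res <- numeric(length(x))
  for (i in seq_along(x)) res[i] <- x[i]^2
})

# vectorized version: same result, a fraction of the time
system.time(res <- x^2)
```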

Mon 26 Apr 2021

Let's assume you have written some code, it's working, and it computes the results you need, but it is really slow. If you don't want to get slowed down in your work, you have no choice but to improve the code's performance. But how to start? The best approach is to first find out where optimization pays off. How to do that is the subject of this article.
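
As a minimal sketch of such a measurement, base R ships a sampling profiler: Rprof() records where the time is spent and summaryRprof() aggregates the result. The matrix inversion below is just a stand-in for a candidate bottleneck:

```r
f <- function(n) {
  m <- matrix(rnorm(n * n), n)
  solve(m)  # candidate bottleneck
}

Rprof("profile.out")  # start sampling
invisible(f(1000))
Rprof(NULL)           # stop sampling
summaryRprof("profile.out")$by.self
```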

Mon 29 Mar 2021

In this article, we would like to discuss discrete choice models used in marketing analysis and modeling. Discrete choice models allow us to develop a better understanding of the decision-making behavior of customers. This understanding can be used, among other things, to make precise predictions about purchase decisions, to evaluate how customers perceive promotional offers, product messages or brand strategies, and to assess new products or improved product features.

Mon 15 Mar 2021

There are numerous reports in the media on the intensity and spatial distribution of new corona infections, all of which are based on a single incidence number per district. If one instead assumes a smooth distribution of the risk of infection across the whole of Germany, one obtains a district-independent smooth function that describes the number of new infections occurring at any given place. This representation allows the identification of local hotspots and their development over time, and thus provides valuable information on the risk of spread of the corona pandemic.

Sat 20 Feb 2021

The map shows the local 7-day incidence rate of officially reported Covid-19 infections in Germany over time. The project originated from a master's thesis in the joint master's programme in Statistics, in cooperation with the Freie Universität Berlin and INWT Statistics. An advanced algorithm is used to map nationwide infection cases, revealing more visible patterns and providing higher accuracy compared to district-level incidence data.

Wed 13 Jan 2021

The 2020 US presidential election last November was the prime political event of the past year. Political scientists and data scientists had the opportunity to develop forecasting tools to understand and predict American voter behavior. Reflecting on the different forecasting methodologies, there are 10 takeaways for every data scientist to keep in mind.

Thu 10 Dec 2020

Data-driven approaches to maximizing customer relationships are more important than ever in today’s highly saturated and competitive markets. There are many steps a proactive business can take to positively position themselves to keep high-value customers and prevent customer churn.

Mon 30 Nov 2020

Now that Halloween is over and Advent is just around the corner, it is time for some Christmas decorations. And what better way to get into the holiday spirit than with a Python 🐍 project?

Tue 20 Oct 2020

How to protect your data from one of the most common (and potentially damaging) web security risks. 

Wed 16 Sep 2020

Traditionally, marketing decisions have been made by executives on the basis of instinct, experience, and what data are available. But what if this could be automated, with an artificial agent making use of huge amounts of data to automatically determine the optimal marketing strategy for every customer individually at a particular moment in time? This is precisely the promise of reinforcement learning.  

Mon 15 Jun 2020

Jenkins is currently the leading open source automation server. It is programmed in Java and distributed under the MIT license. Jenkins is free and very flexible, as it supports a wide range of version control systems and offers more than 1,500 plugins. In this blog article we introduce the CI tool Jenkins and the essential aspects of its user interface.

Mon 15 Jun 2020

This article provides a theoretical introduction to Continuous Integration and an overview of the pros and cons of using CI Tools. A selection of different tools for getting started will be presented.

Tue 31 Mar 2020

Missing or incomplete data can have a huge negative impact on any data science project. In this blog we explore what kinds of missing data exist, and how we can go about overcoming the challenges they present. 

Thu 13 Feb 2020

In this post we’d like to introduce you to our new R package shinyMatrix. It provides you with an editable matrix input field for shiny apps.
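
A minimal app sketch, assuming the matrixInput() interface from the package documentation (arguments beyond inputId and value may differ between versions):

```r
library(shiny)
library(shinyMatrix)

ui <- fluidPage(
  matrixInput("mat", value = diag(3), class = "numeric"),
  verbatimTextOutput("out")
)

server <- function(input, output, session) {
  output$out <- renderPrint(input$mat)  # reacts to edits in the matrix
}

shinyApp(ui, server)
```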

Fri 27 Dec 2019

Business is changing as a result of the increasing quantity and variety of data available. Significant new opportunities can be realized by harnessing the knowledge contained in these data - if you know where to look. A data science team can help to bring raw data through the analysis process and derive insights that are critical in today’s technologically-competitive environment.

Mon 23 Dec 2019

Visualization tools in R and Python support projects in different ways. If you are still unsure which language is right for you, this article may be of interest and help you decide. Popular packages from both languages are presented and sample graphics are created.

Tue 19 Nov 2019

When you write code, you’re sure to run into problems from time to time. Here are some advanced tips and tricks for handling these errors, explained accessibly.

Mon 21 Oct 2019

One of the biggest challenges that companies face is to use their advertising budgets efficiently, and to advertise purposefully such that advertising meets the customer when it has the most leverage - without being overwhelming, repetitive, or irrelevant. With Marketing Mix Modeling, we can help to overcome this challenge.

Thu 26 Sep 2019

Multi-Armed Bandit algorithms are a modern alternative to traditional A/B testing. Similar to Reinforcement Learning, these algorithms can optimize what is shown to the client to maximize rewards while simultaneously determining the most successful option for your business. 
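
As an illustration of the principle (not the article's specific algorithm), a minimal epsilon-greedy bandit explores a random variant with small probability and otherwise exploits the best observed click rate:

```r
set.seed(1)
trueRates <- c(A = 0.05, B = 0.08)  # unknown in practice
eps <- 0.1
shown <- clicks <- c(A = 0, B = 0)

for (i in 1:10000) {
  arm <- if (runif(1) < eps || any(shown == 0)) {
    sample(names(trueRates), 1)       # explore
  } else {
    names(which.max(clicks / shown))  # exploit
  }
  shown[arm]  <- shown[arm] + 1
  clicks[arm] <- clicks[arm] + rbinom(1, 1, trueRates[arm])
}

clicks / shown  # estimates converge while most traffic goes to B
```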

Tue 17 Sep 2019

Having understandable, clean, and compliant data is a necessity for business success. Particular care is needed to ensure that analyses made on the basis of data are reliable and offer value to an organization. In this context, the role of a Data Steward is becoming ever more valuable. This article discusses the roles and tasks of Data Stewardship.

Mon 09 Sep 2019

This article describes best-practice approaches for developing shiny dashboards. Creating the dashboard in package form and covering it with unit tests enable the development of robust solutions and guarantee high quality.

Thu 25 Jul 2019

An introduction to and comparison of the market leaders in statistics programs - R, Python, SAS, SPSS, and STATA - to help pick the best one for your needs.

Tue 16 Jul 2019

In this article we look at how to build a shiny app with clear code and reusable, automatically tested modules. To that end, we first go into the package structure and testing of a shiny app before we focus on the actual modules.
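
A minimal module sketch, written with the current moduleServer() API (the article predates it and may use the older callModule(); the counter example itself is illustrative):

```r
library(shiny)

counterUI <- function(id) {
  ns <- NS(id)  # namespaces the inputs so the module can be reused
  tagList(actionButton(ns("plus"), "+1"), textOutput(ns("value")))
}

counterServer <- function(id) {
  moduleServer(id, function(input, output, session) {
    output$value <- renderText(input$plus)  # button click count
  })
}

ui <- fluidPage(counterUI("a"), counterUI("b"))
server <- function(input, output, session) {
  counterServer("a")
  counterServer("b")
}
shinyApp(ui, server)
```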

Wed 19 Jun 2019

In current online marketing practice, short-term TV-induced web page traffic is usually quantified by a simple baseline correction. In our blog article, we show which measurement errors this entails, how they can be avoided, and how the identified TV impact is correctly accounted for in attribution.

Tue 21 May 2019

In this article we present our R package rsync, which serves as an interface between R and the popular Linux command line tool rsync. Rsync allows users of Unix systems to synchronize local and remote files between two locations.

Tue 07 May 2019

When a code base grows, we may first think of splitting it into several files and sourcing them. Functions, of course, are rightfully advocated to new R users and are the essential building block. Packages are, in turn, the next level of abstraction R has to offer. With the modules package I want to provide something in between.
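
A minimal sketch of the idea, following the examples in the package documentation: a module bundles functions with explicit imports, without requiring a full package:

```r
library(modules)

m <- module({
  import("stats", "median")  # explicit dependency instead of relying
                             # on the global search path
  center <- function(x) median(x, na.rm = TRUE)
})

m$center(c(1, 2, 3, NA))  # functions are accessed via the module object
```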

Mon 25 Mar 2019

ggCorpIdent is a package for customizing ggplot2 graphics in R easily and without touching the plot code itself. It lets you use custom colors in the plot, which are interpolated if you have not specified as many colors as needed. You can add custom fonts for the text elements within the plot and embed your corporate logo.

Wed 30 Jan 2019

In this post I'd like to introduce the R Markdown template for business reports by INWTlab. It's a nice and clean template for use in a corporate environment that is easy to customize in terms of colors, cover, and logo.

Wed 21 Nov 2018

In the first part of this blog series, we examined the theoretical foundations of cluster analysis. Now we put the theory into practice using R and find a cluster solution for the mtcars data set. Then the cluster solution is evaluated and interpreted.
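
A minimal version of such a cluster solution with base R's kmeans (the article's actual method and number of clusters may differ):

```r
scaled <- scale(mtcars)  # scale first so no variable dominates
set.seed(42)
km <- kmeans(scaled, centers = 3, nstart = 25)

km$size                              # cluster sizes
split(rownames(mtcars), km$cluster)  # car models per cluster
```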

Tue 06 Nov 2018

This article focuses on introducing the theoretical concepts of cluster analysis. You'll get a basic understanding of the underlying measures and the different methods that can be used for clustering. An evaluation method for group structures and cluster solutions is introduced towards the end of the article.

Thu 11 Oct 2018

This article describes how you can apply a programming technique called memoization to speed up your R code and solve performance bottlenecks.
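
The idea in a nutshell, as a small closure-based sketch (the memoise package offers the same out of the box):

```r
memoize <- function(f) {
  cache <- new.env()
  function(x) {
    key <- as.character(x)
    if (!exists(key, envir = cache, inherits = FALSE)) {
      assign(key, f(x), envir = cache)  # compute once, then cache
    }
    get(key, envir = cache)
  }
}

slowSqrt <- function(x) { Sys.sleep(1); sqrt(x) }
fastSqrt <- memoize(slowSqrt)

system.time(fastSqrt(9))  # ~1 second: computed
system.time(fastSqrt(9))  # ~0 seconds: served from the cache
```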

Tue 25 Sep 2018

The Kernelheaping package also supports boundary-corrected kernel density estimation, which allows us to exclude certain areas where we know that the density must be zero. One example is estimating a population density while excluding uninhabited areas such as lakes, forests, or parks. The Kernelheaping package employs a boundary correction method in which each single kernel is restricted to the area of interest.

Mon 06 Aug 2018

The speed or run-time of models in R can be a critical factor, especially considering the size and complexity of modern datasets. The number of data points as well as the number of features can easily be in the millions. Even relatively trivial modeling procedures can consume a lot of time, which is critical both for the optimization and the updating of models. An easy way to speed up computations is to use an optimized BLAS (Basic Linear Algebra Subprograms): R's default BLAS is well regarded for its stability and portability, not necessarily its speed, so there is real potential here.
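
A quick way to see the effect on your own machine: check which BLAS R is linked against, then time a typical linear-algebra workload before and after switching to, say, OpenBLAS:

```r
sessionInfo()  # shows the BLAS/LAPACK libraries in use (R >= 3.4)

n <- 2000
A <- matrix(rnorm(n * n), n)
B <- matrix(rnorm(n * n), n)

system.time(A %*% B)  # rerun after switching BLAS and compare
```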

Fri 13 Jul 2018

Interval censoring can be generalised to rectangles or alternatively even arbitrary shapes. That may include counties, zip codes, electoral districts or administrative districts. Standard area-level mapping methods such as choropleth maps suffer from very different area sizes or odd area shapes which can greatly distort the visual impression. The Kernelheaping package provides a way to convert these area-level data to a smooth point estimate.

Mon 28 May 2018

All over the world, at the newsstand, in public transport and above all in countless betting communities, football fans are currently asking themselves: Who will become World Champion at the 2018 Football World Cup? Using statistical data science models, we simulated the 2018 FIFA World Cup 10,000 times to determine the probabilities for the next World Cup winner and thus the World Cup favourites. Throughout the tournament, you will find the answer to the question of who the top favourites are here in our blog, updated daily and based on a lot of data and up-to-date statistical analyses.

Wed 04 Apr 2018

This article is a reflection on how I use different strategies to solve things in R. Design pattern may seem like a big term, especially because of its use in object-oriented programming. But in the end I think it is simply the correct label for recurring strategies in designing software.

Mon 05 Mar 2018

The motivation for this plot is the function graphics::smoothScatter, which is basically a plot of a two-dimensional density estimate. In the following I want to reproduce its features with ggplot2.
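
A minimal ggplot2 analogue in current ggplot2 syntax (not necessarily the code from the post): a raster of the 2D kernel density estimate with a smoothScatter-like palette:

```r
library(ggplot2)

d <- data.frame(x = rnorm(1e4), y = rnorm(1e4))

ggplot(d, aes(x, y)) +
  stat_density_2d(geom = "raster",
                  aes(fill = after_stat(density)),
                  contour = FALSE) +
  # blues9 is the palette smoothScatter uses by default
  scale_fill_gradientn(colours = colorRampPalette(blues9)(256))
```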

Tue 06 Feb 2018

In this blog article I'd like to introduce the univariate kernel density estimation for heaped (i.e. rounded or interval censored) data with the Kernelheaping package.
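
A minimal sketch, assuming the dheaping() interface from the CRAN documentation (argument names may differ between package versions):

```r
library(Kernelheaping)

x <- rnorm(500, mean = 100, sd = 15)
xheaped <- round(x, -1)  # values heaped at multiples of 10

est <- dheaping(xheaped, rounds = 10)  # estimate the unheaped density
plot(est)
```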

Thu 25 Jan 2018

Sticking to a styleguide helps you write cleaner code and makes working in a team more comfortable. In this article, we present the styleguide we use at INWT, and how you can check your code for deviations from certain style rules.
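
Checking for deviations can be automated, for example with the lintr package (our styleguide's specific rules are configured separately; this sketch uses the default linters):

```r
library(lintr)

writeLines("add=function(x,y){x+y}", "example.R")
lint("example.R")  # reports e.g. missing spacing around '=' and ','
```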

Tue 12 Dec 2017

This is a reproduction of the (simple) bar plot of chapter 6.1.1 in Datendesign mit R with ggplot2.

Wed 22 Nov 2017

Which layout of an advertisement leads to more clicks? Would a different color or position of the purchase button lead to a higher conversion rate? Does a special offer really attract more customers – and which of two phrasings would be better? For a long time, people have trusted their gut feeling to answer these questions. Today all these questions could be answered by conducting an A/B test.
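
Once such a test has run, a classic evaluation is a two-sample proportion test; a small example with made-up numbers:

```r
conversions <- c(A = 120, B = 150)   # e.g. clicks on the button
visitors    <- c(A = 1000, B = 1000)

# tests whether the conversion rates of A and B differ significantly
prop.test(conversions, visitors)
```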

Wed 01 Nov 2017

As I began to (re)discover the usefulness of closures, I ran into some behaviour that seemed very strange at first sight. It is in fact consistent with the scoping rules of R, but it took a while until I had internalized those rules to the point where it felt consistent to me, too.
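
The classic example of this behaviour: a closure looks up free variables when it is called, not when it is created:

```r
adders <- list()
for (n in 1:3) {
  adders[[n]] <- function(x) x + n
}
adders[[1]](10)  # 13, not 11: all three closures see the final n

# a function factory gives each closure its own n
makeAdder <- function(n) {
  force(n)  # evaluate n now instead of lazily
  function(x) x + n
}
adders <- lapply(1:3, makeAdder)
adders[[1]](10)  # 11, as expected
```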

Wed 16 Aug 2017

This last part is about visualizing the crash location and the flight route with the help of the R package leaflet.
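
The core pattern looks like this (the coordinates are illustrative, not taken from the crash data):

```r
library(leaflet)

leaflet() %>%
  addTiles() %>%  # add a base-map layer
  addMarkers(lng = 13.4050, lat = 52.5200,
             popup = "Example location")
```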

Wed 16 Aug 2017

In this part I'll request the geocoordinates for the crash location, the point of departure, and the intended arrival location from the Google Maps Geocoding API.

Tue 08 Aug 2017

Have you ever tried to find your way around in the file structure of an already existing project? To separate relevant from obsolete files in a historically grown directory? To find out in which order existing scripts should be executed? To make all this easier, it helps to have a consistent file and folder structure across your projects. In this article we present our file structure for R projects to help you get started. 

Tue 01 Aug 2017

This first part is about how to scrape information on aviation accidents from planecrashinfo.com. On this site you can find tables nested inside tables with lots of information on aviation accidents of the last century.
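
The basic pattern, sketched with rvest (the URL follows the site's apparent year-page scheme and may need adjusting):

```r
library(rvest)

page   <- read_html("http://www.planecrashinfo.com/2017/2017.htm")
tables <- page %>%
  html_nodes("table") %>%      # all tables, including nested ones
  html_table(fill = TRUE)      # as a list of data frames
str(tables[[1]])
```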

Mon 20 Mar 2017

This article presents the election forecast of INWT for the 2017 elections to the Bundestag. A statistical forecasting model based on the survey results of leading German survey institutes is presented. Unlike the survey institutes we can also use our election forecast to predict the probability of possible coalitions after the election.

Tue 07 Mar 2017

MariaDB is currently the fastest growing open source database solution. It is mainly developed by the MariaDB Corporation and is a fork of MySQL. This article describes our own solution for monitoring and optimizing our internal database infrastructure, implemented with R and Shiny: the MariaDB monitor. It is an open source alternative to existing fee-based or inflexible monitoring tools.

Wed 15 Feb 2017

A statistical analysis of more than 150 Lego building kits shows that the price of individual Lego components is determined not only by their size, but also by the Lego theme, such as Star Wars.