Code performance in R: Working with large datasets
This is the fourth part of our series about code performance in R. In the first part, I introduced methods to measure which part of a given piece of code is slow. The second part lists general techniques to make R code faster. The third part deals with parallelization. In this part, we are going to have a look at the challenges that come with large datasets.
Whether your dataset is "large" depends not only on the number of rows, but also on the method you are going to use. It's easy to compute the mean or sum of as many as 10,000 numbers, but a nonlinear regression with many variables can already take some time with a sample size of 1,000.
Sometimes it may help to parallelize (see part 3 of the series). But with large datasets, you can use parallelization only up to the point where working memory becomes the limiting factor. In addition, there may be tasks that cannot be parallelized at all. In these cases, the strategies from part 2 of this series may be helpful, and there are some further options:
Some computations not only become very slow on large datasets, but outright impossible, for example because working memory runs out. The good news is that it is often entirely sufficient to work on a sample, for instance to compute summary statistics or to estimate a regression model. At least during code development, this can be very useful. Another option is to divide your data into multiple parts, run your computations on each part separately, and recombine the results (e.g., by averaging regression coefficients). Sometimes you can even execute those computations in parallel, although working memory was not sufficient to run them on the whole dataset. This sounds counterintuitive, but the reason is the following: many methods (e.g., regression analysis) work with matrices. These matrices often grow quadratically with the number of observations, and so do the computational costs and the required working memory. Doing the computation with half the sample size therefore requires only about a quarter of the resources, not half.
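The sampling and split-and-recombine ideas can be sketched in a few lines of base R. This is only a minimal illustration on simulated data: the variable names, the sample size, and the number of chunks are assumptions made for the example, and averaging coefficients gives an approximation of the full-data fit, not an exact replacement.

```r
# Simulated data; names, sizes and coefficients are purely illustrative
set.seed(1)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)

# Option 1: fit the model on a random sample of the rows
sampleRows <- sample(nrow(dat), size = 1000)
fitSample <- lm(y ~ x1 + x2, data = dat[sampleRows, ])

# Option 2: assign each row to one of several equally sized chunks,
# fit the model on each chunk separately, and average the coefficients
nChunks <- 4
chunkId <- sample(rep(seq_len(nChunks), length.out = nrow(dat)))
chunkCoefs <- sapply(split(dat, chunkId), function(chunk) {
  coef(lm(y ~ x1 + x2, data = chunk))
})
avgCoefs <- rowMeans(chunkCoefs)

# Compare both shortcuts with the fit on the full dataset
fitFull <- lm(y ~ x1 + x2, data = dat)
round(cbind(full = coef(fitFull),
            sample = coef(fitSample),
            averaged = avgCoefs), 3)
```

The chunk-wise fits are independent of each other, so the sapply() call is also a natural candidate for the parallelization techniques from part 3.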
If you run into working memory problems, it helps to check whether there are large objects in your workspace that you don't need anymore. Just remove them with rm(), followed by a so-called "garbage collection" to return the memory to the operating system (gc()). Garbage collections take place automatically on a regular basis, but calling gc() explicitly ensures that one happens right away.
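As a small sketch of this clean-up step, the following base R snippet creates a large temporary object, lists the objects in the global workspace by size, and then removes the large one. The object name bigMatrix and its size are made up for the illustration.

```r
# A large object that is only needed temporarily (name and size are illustrative)
bigMatrix <- matrix(rnorm(1e6), nrow = 1000)
print(object.size(bigMatrix), units = "Mb")

# List the objects in the global workspace, largest first
objNames <- ls(envir = globalenv())
objSizes <- sapply(objNames, function(nm) {
  as.numeric(object.size(get(nm, envir = globalenv())))
})
sort(objSizes, decreasing = TRUE)

# Remove the object and trigger a garbage collection right away
rm(bigMatrix)
invisible(gc())

# The object is gone from the workspace
exists("bigMatrix")
```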
Especially for data handling, dplyr is much more elegant than base R, and often faster. But there is an even faster alternative: the data.table package. The difference is already visible for very small operations such as selecting columns or computing the mean for subgroups:
    library(data.table)
    library(dplyr)
    library(microbenchmark)  # provides microbenchmark()

    cols <- c("Sepal.Length", "Sepal.Width")
    irisDt <- as.data.table(iris)

    # Select columns
    microbenchmark("dplyr" = iris %>% select(cols),
                   "data.table" = irisDt[, cols, with = FALSE])
    ## Unit: microseconds
    ##        expr      min        lq      mean    median        uq      max neval
    ##       dplyr 1890.744 2099.4445 2770.3403 2401.4760 3132.7005 9259.750   100
    ##  data.table   62.763   76.5215  179.3211  110.4575  147.2455 5923.169   100

    # Compute grouped mean
    microbenchmark("dplyr" = iris %>% group_by(Species) %>% summarise(mean(Sepal.Length)),
                   "data.table" = irisDt[, .(meanSL = mean(Sepal.Length)), by = Species])
    ## Unit: microseconds
    ##        expr      min       lq      mean   median        uq       max neval
    ##       dplyr 3758.252 4686.548 5769.8606 5533.120 6430.0995 14503.304   100
    ##  data.table  415.039  512.455  665.5811  613.622  718.2905  1646.667   100
The differences for more time-consuming operations are equally impressive.
Instead of loading your complete data into R before each analysis, you can store it in a database, e.g., an SQL database. This has several advantages:
- When you retrieve data from the database, you can specify which rows and columns you need in your database query. You don't need to load the whole dataset into the working memory.
- You can even do some data handling steps in the query (e.g., sorting, grouped computations).
- You can also create and store preprocessed datasets (e.g., aggregated or combined datasets) in the database and access them from R. Databases are in general quite good and fast for these kinds of computations.
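The three points above can be illustrated with a minimal sketch. It assumes the DBI and RSQLite packages are installed and uses an in-memory SQLite database as a stand-in for a real database server; the table names (iris, iris_means) and the renamed columns are hypothetical choices made for this example only.

```r
library(DBI)

# In-memory SQLite database as a stand-in for a real database server
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Store the data in the database; column names are simplified for use in SQL
irisDb <- setNames(iris, c("sepal_length", "sepal_width",
                           "petal_length", "petal_width", "species"))
irisDb$species <- as.character(irisDb$species)
dbWriteTable(con, "iris", irisDb)

# Load only the rows and columns that are needed
subsetDf <- dbGetQuery(con, "
  SELECT sepal_length, species
  FROM iris
  WHERE species = 'setosa'")

# Let the database do the grouped computation and the sorting
groupMeans <- dbGetQuery(con, "
  SELECT species, AVG(sepal_length) AS mean_sl
  FROM iris
  GROUP BY species
  ORDER BY species")

# Store a preprocessed (aggregated) table in the database and read it back
dbExecute(con, "
  CREATE TABLE iris_means AS
  SELECT species, AVG(sepal_length) AS mean_sl
  FROM iris
  GROUP BY species")
storedMeans <- dbReadTable(con, "iris_means")

dbDisconnect(con)
```

With a real database, only the dbConnect() call would change; the queries themselves would stay largely the same, apart from differences between SQL dialects.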