Introducing the Kernelheaping Package
In this blog article I'd like to introduce the univariate kernel density estimation for heaped (i.e. rounded or interval censored) data with the Kernelheaping package.
It is not unusual to have interval censored data such as in income surveys due to anonymisation or simplification issues. However, a simple task like plotting a density may fail completely for rounded or interval censored data. When using class centers as input, the density estimate might get bumpy or spiky around the center points. This problem gets even worse for larger sample sizes with decreasing bandwidth. One might try to manually increase the bandwidth until one obtains a sufficiently smooth estimate. However, this will result in oversmoothing as we will see in our example. The Kernelheaping package, available on CRAN, delivers an algorithm to obtain nonparametric density estimates for interval censored data.
In the following chunk, I'll simulate 5000 observations from a log-normal distribution. Then I plot the histogram together with a conventional kernel density estimate.
set.seed(0) dat <- data.frame(x = rlnorm(5000, meanlog = 7.5, sdlog = 0.5)) ggplot(dat, aes(x, y = ..density..)) + geom_histogram(position = "identity", color = "black", fill = "white") + geom_density(fill = "blue", alpha = 0.7, linetype = 0) + ylim(c(0, 0.00055)) + xlim(c(0, 10000))
Now let's split our data into 14 classes imitating a typical income survey.
classes <- c(0, 500, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 6000, 8000, 10000, 15000, Inf) dat$xclass <- cut(dat$x, breaks = classes)
If I now use the class centers as input, the result is a very unrealistic, spiky density.
classCenters <- (classes[-1] - classes[-length(classes)]) / 2 + classes[-length(classes)] classCenters[length(classCenters)] <- 2 * classCenters[length(classCenters) - 1] levels(dat$xclass) <- classCenters dat$xclassCenter <- as.numeric(as.character(dat$xclass)) ggplot(dat, aes(xclassCenter, y = ..density..)) + geom_histogram(breaks = classes, color = "black", fill = "white") + geom_density(fill = "blue", alpha = 0.7, linetype = 0)
I've got the option to adjust the density by manipulating the adjusting parameter. But getting a nice and smooth density shape requires manipulation to such an extent that we end up with serious oversmoothing.
ggplot(dat, aes(xclassCenter, y = ..density..)) + geom_histogram(breaks = classes, color = "black", fill = "white") + geom_density(adjust = 2.5, fill = "blue", alpha = 0.7, linetype = 0) + ylim(c(0, 0.00055)) + xlim(c(0, 10000))
Now, we install and load the Kernelheaping pack
dclass function implements an iterative SEM (Stochastic Expectation Maximization) algorithm.
It can be called with any bandwidth selector available to the
density function in the stats package and also provides optional boundary correction for positive only data.
densityEst <- dclass(xclass = dat$xclass, classes = classes, burnin = 50, samples = 100, evalpoints = 1000)
classes <- data.frame(xclass = densityEst$xclass) estimates <- data.frame(Mestimates = densityEst$Mestimates, gridx = densityEst$gridx) ggplot() + geom_histogram(data = classes, aes(xclass, y = ..density..), breaks = densityEst$classes, color = "black", fill = "white") + geom_area(data = estimates, aes(gridx, Mestimates), fill = "blue", alpha = 0.7, size = 1) + ylim(c(0, 0.00055)) + xlim(c(0, 10000))
The resulting density estimate is smooth and very close to the estimate on the original unclassed data. The package is also able to handle the case of survey participants rounding values on their own, each with a different level of coarseness. Please check out the
dheaping function with given examples.
Soon there will be a second part to this article, where I'll show how to compute bivariate densities given area-level data with the Kernelheaping package.