rggplot2kernel-densitystat-density2d

How can I extract data from a kernel density function in R for many samples at one time


I have a very large data file (>300k rows) and each row is part of a unique sample (>3000 samples). I want to generate a kernel density estimator for each individual sample and extract relevant information (minimum value, maximum value, maximum probability of density estimator, median of density estimator, mean of density estimator) into a separate table along with the sample name.

I have tried to extract information from the ggplot function stat_density_ridges() using the approaches laid out here Adding a mean to geom_density_ridges and here draw line on geom_density_ridges which extract data from stat_density_ridges and ggplot_build with purrr::pluck but it doesn't provide all the information that I want.

The following generates some synthetic data similar to what I want:

set.seed(1)
x = runif( 50, max = 40, min = 20 )
set.seed(2)
y = runif( 50, max = 300, min = 100 )
sample.number = c( rep( 1, 20 ), rep( 2, 15 ), rep( 3, 5 ), rep( 4, 10 ) )
d <- data.frame( x, y , sample.number ) 

And the plot in ggplot that shows the distribution:

ggplot( data = d, aes( x = x, y = as.factor( samples ) ) ) +
  labs( x = expression( paste( "x" ) ), 
    y = expression( paste( "sample number" ) ) ) +
  stat_density_ridges() 

I would like to end up with a data table with the following information: sample.name, max(x), min(x), maximum height of the kernel density estimator and it's x location, median height of the kernel density estimator and it's x location, etc.

The only thing I can think to do is to create a long and arduous loop

sample.numbers <- rep( NA, times = max( d$sample.number ) )
max.x <- rep( NA, times = max( d$sample.number ) )
min.x <- rep( NA, times = max( d$sample.number ) )

for( i in 1:max( d$sample.number ) ) {
  temp.d = d[ d$sample.number == i, ]
  sample.numbers[ i ] = i
  max.x[ i ] = max( temp.d$x )
  min.x[ i ] = min( temp.d$x )
}

and then somehow adding in a bit that creates a density estimator and extracts the information from that. I'm guessing the indexing in R presents an easier way to get through this for the many thousands of samples I have while using group_by, but I cannot figure it out. Note, I'm still having trouble getting my head around piping in R, so some simple explaining might be required if solutions have that in them.


Solution

  • There are different ways to do this. Using dplyr and the pipe operator is, in my opinion, the easiest way. I tried adding comments in the code to make it easier to understand. Take a look at this dplyr cheat sheet.

    Basically, you use group_by to divide your data frame in groups according to sample.number. Then you use summarise to compute summary metrics of the column x inside each group.

    To compute the density you can use density() from base R inside the summarise. This will return a list with a sample of (x,y) values of the density function. To extract quantiles from this density function, you can use the package spatstat.

    One observation: density() computes a value for the bandwidth that depends on the data set. Since we are separating the different groups, each group can end up having a different bandwidth value. I used the function bw.nrd to estimate a single value of the bandwidth using the complete dataset. Then I use this single bandwidth value for all computations.

    # needed to extract quantile from a pdf computed with density()
    library(spatstat)
    # packages for data wrangling
    library(plyr)
    library(dplyr)
    # ploting
    library(ggplot2)
    library(ggridges)
    
    # creata data set
    set.seed(1)
    x = runif( 50, max = 40, min = 20 )
    set.seed(2)
    y = runif( 50, max = 300, min = 100 )
    sample.number = c( rep( 1, 20 ), rep( 2, 15 ), rep( 3, 5 ), rep( 4, 10 ) )
    d <- data.frame( x, y , sample.number )
    
    # first compute bandwidth over all samples
    # if you don't do this, each pdf in the table will have a different bandwidth
    # bw.nrd is a function that computes bandwidth for a kernel density using a "rule of thumb" formula
    # there are other functions that you can use to estimate bw
    bw <- bw.nrd(d$x)
    
    # create the table using the pipe operator and dplyr
    # the pipe operator '%>%' takes what is on the left side and puts inside the function
    # on the right side as an argument
    d %>% 
      # group rows of 'd' by sample number (this is equivalent to your for loop)
      group_by(sample.number) %>%
      # before computing the summaries for each group, create a new column with the 
      # number of elements in each sample (the resulting DF still has 50 rows)
      mutate(n=n()) %>%
      # now remove rows that belong to groups with less than 5 elements (you can change the threshold value here)
      filter(n > 5) %>%
      # for each group in 'd' compute these summary metrics
      summarise(max.x=max(x),
                min.x=min(x), 
                max.density=max(density(x, bw = bw)$y),
                x.mode=density(x, bw = bw)$x[which(density(x, bw = bw)$y == max.density)],
                x.median=quantile(density(x, bw = bw), 0.5),
                median.density=density(x, bw = bw)$y[which(density(x, bw = bw)$x == x.median)])
    
    # OUTPUT (note that sample.number == 3 was removed from the table)
    #># A tibble: 3 x 7
    #>  sample.number max.x min.x max.density x.mode x.median median.density
    #>          <dbl> <dbl> <dbl>       <dbl>  <dbl>    <dbl>          <dbl>
    #>1             1  39.8  21.2      0.0568   34.3     31.4         0.0503
    #>2             2  38.7  20.3      0.0653   26.9     28.4         0.0628
    #>3             4  36.4  20.5      0.0965   33.9     33.0         0.0939
    #
    
    # see the pdfs using stat_density_ridges
    # (note that i am fixing the bandwidth)
    ggplot( data = d, aes( x = x, y = as.factor( sample.number ) ) ) +
      labs( x = expression( paste( "x" ) ), 
            y = expression( paste( "sample number" ) ) ) +
      stat_density_ridges(bandwidth = bw)