I have a very large data file (>300k rows) in which each row belongs to one of many unique samples (>3000 samples). I want to generate a kernel density estimator for each individual sample and extract relevant information (minimum value, maximum value, maximum height of the density estimator, median of the density estimator, mean of the density estimator) into a separate table along with the sample name.
I have tried to extract information from the ggplot function stat_density_ridges() using the approaches laid out here (Adding a mean to geom_density_ridges) and here (draw line on geom_density_ridges which extract data from stat_density_ridges), as well as ggplot_build with purrr::pluck, but none of these provide all the information that I want.
The following generates some synthetic data similar to what I want:
set.seed(1)
x = runif( 50, max = 40, min = 20 )
set.seed(2)
y = runif( 50, max = 300, min = 100 )
sample.number = c( rep( 1, 20 ), rep( 2, 15 ), rep( 3, 5 ), rep( 4, 10 ) )
d <- data.frame( x, y , sample.number )
And the plot in ggplot that shows the distribution:
library(ggplot2)
library(ggridges)  # provides stat_density_ridges()

ggplot( data = d, aes( x = x, y = as.factor( sample.number ) ) ) +
  labs( x = expression( paste( "x" ) ),
        y = expression( paste( "sample number" ) ) ) +
  stat_density_ridges()
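For reference, the ggplot_build / purrr::pluck approach from those links looks roughly like this (a sketch, assuming the plot above is stored in an object p; it returns the computed curves but not the per-sample summaries I need):

library(purrr)
p <- ggplot( data = d, aes( x = x, y = as.factor( sample.number ) ) ) +
  stat_density_ridges()
# data frame of the computed (x, density) points for each ridge
ridge.data <- pluck(ggplot_build(p), "data", 1)
head(ridge.data)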
I would like to end up with a data table with the following information: sample.name, max(x), min(x), the maximum height of the kernel density estimator and its x location, the median height of the kernel density estimator and its x location, etc.
The only thing I can think to do is to create a long and arduous loop:
sample.numbers <- rep( NA, times = max( d$sample.number ) )
max.x <- rep( NA, times = max( d$sample.number ) )
min.x <- rep( NA, times = max( d$sample.number ) )
for( i in 1:max( d$sample.number ) ) {
  temp.d = d[ d$sample.number == i, ]
  sample.numbers[ i ] = i
  max.x[ i ] = max( temp.d$x )
  min.x[ i ] = min( temp.d$x )
}
and then somehow adding a bit that creates a density estimator and extracts the information from that. I'm guessing R's grouping tools, such as group_by, offer an easier way to work through the many thousands of samples I have, but I cannot figure it out. Note, I'm still having trouble getting my head around piping in R, so some simple explanation might be required if solutions use it.
There are different ways to do this. Using dplyr and the pipe operator is, in my opinion, the easiest way. I tried adding comments in the code to make it easier to understand. Take a look at this dplyr cheat sheet.
Basically, you use group_by to divide your data frame into groups according to sample.number. Then you use summarise to compute summary metrics of the column x inside each group.
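For example, here is a minimal sketch of that pattern using the synthetic data d from the question, computing just the per-sample mean of x (the column names mean.x and n.obs are my own illustrative choices):

library(dplyr)
d %>%                           # take the data frame d ...
  group_by(sample.number) %>%   # ... split its rows by sample ...
  summarise(mean.x = mean(x),   # ... and compute one summary row
            n.obs  = n())       #     per sample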
To compute the density you can use density() from base R inside the summarise. This returns a list with a sample of (x, y) values of the density function. To extract quantiles from this density function, you can use the spatstat package.
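For instance, a quick sketch on sample 1 of the synthetic data (here quantile() dispatches to spatstat's quantile.density method):

library(spatstat)
dens <- density(d$x[d$sample.number == 1])
str(dens[c("x", "y")])  # grid of x values and the estimated density heights
quantile(dens, 0.5)     # median of the estimated pdf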
One observation: density() computes a value for the bandwidth that depends on the data set. Since we are separating the different groups, each group can end up having a different bandwidth value. I used the function bw.nrd to estimate a single bandwidth value from the complete dataset, and then used this single value for all computations.
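To see the difference, you can compare the per-group rule-of-thumb bandwidths with the pooled one (a quick sketch):

# bandwidth estimated within each sample: these differ from group to group
sapply(split(d$x, d$sample.number), bw.nrd)
# a single bandwidth estimated from all 50 observations
bw.nrd(d$x)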
# needed to extract quantiles from a pdf computed with density()
library(spatstat)
# data wrangling
library(dplyr)
# plotting
library(ggplot2)
library(ggridges)
# create data set
set.seed(1)
x = runif( 50, max = 40, min = 20 )
set.seed(2)
y = runif( 50, max = 300, min = 100 )
sample.number = c( rep( 1, 20 ), rep( 2, 15 ), rep( 3, 5 ), rep( 4, 10 ) )
d <- data.frame( x, y , sample.number )
# first compute bandwidth over all samples
# if you don't do this, each pdf in the table will have a different bandwidth
# bw.nrd is a function that computes bandwidth for a kernel density using a "rule of thumb" formula
# there are other functions that you can use to estimate bw
bw <- bw.nrd(d$x)
# create the table using the pipe operator and dplyr
# the pipe operator '%>%' takes what is on the left side and passes it
# to the function on the right side as its first argument
d %>%
# group rows of 'd' by sample number (this is equivalent to your for loop)
group_by(sample.number) %>%
# before computing the summaries for each group, create a new column with the
# number of elements in each sample (the resulting DF still has 50 rows)
mutate(n=n()) %>%
# now remove rows that belong to groups with 5 or fewer elements (you can change the threshold value here)
filter(n > 5) %>%
# for each group in 'd' compute these summary metrics
summarise(max.x = max(x),
          min.x = min(x),
          max.density = max(density(x, bw = bw)$y),
          x.mode = density(x, bw = bw)$x[which.max(density(x, bw = bw)$y)],
          x.median = quantile(density(x, bw = bw), 0.5),
          # match the nearest grid point rather than testing exact equality
          median.density = density(x, bw = bw)$y[which.min(abs(density(x, bw = bw)$x - x.median))])
# OUTPUT (note that sample.number == 3 was removed from the table)
#># A tibble: 3 x 7
#> sample.number max.x min.x max.density x.mode x.median median.density
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>1 1 39.8 21.2 0.0568 34.3 31.4 0.0503
#>2 2 38.7 20.3 0.0653 26.9 28.4 0.0628
#>3 4 36.4 20.5 0.0965 33.9 33.0 0.0939
#
# see the pdfs using stat_density_ridges
# (note that I am fixing the bandwidth)
ggplot( data = d, aes( x = x, y = as.factor( sample.number ) ) ) +
labs( x = expression( paste( "x" ) ),
y = expression( paste( "sample number" ) ) ) +
stat_density_ridges(bandwidth = bw)
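The question also asks for the mean of the density estimator, which the table above does not include. Because density() returns its estimate on an evenly spaced grid, a weighted mean of the grid locations with the density heights as weights approximates it. A sketch following the same grouping pattern (the column name mean.x.density is my own illustrative choice):

# approximate mean of each sample's estimated pdf:
# on an evenly spaced grid, sum(x * y) / sum(y) approximates the integral of x * f(x)
d %>%
  group_by(sample.number) %>%
  mutate(n = n()) %>%
  filter(n > 5) %>%
  summarise(mean.x.density = weighted.mean(density(x, bw = bw)$x,
                                           density(x, bw = bw)$y))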