rdistributionweibullgamma-distribution

Is weibull the right distribution for this data? How do i find the best parameters using R?


I have some time occurrence data for multiple(1000s) event groups. I need to cluster the event groups showing similar distribution and find the parameters for each cluster. each event group has between 5-15 data points. I took a random sample of 50 event groups and plotted them, frequency against time.

50 samples graph

To me, the distribution seems to be Weibull and now I am looking to find the parameters, but I have been unable to find stable parameters. I have used nls package to find stable parameters for an event group.

dat <- data.frame(x=single_event$time, y=single_event$freq_density)
pars <- expand.grid(a=seq(0.01, 10, len=20),
                b=seq(1, 50, len=20))
res <- nls2(y ~ ((a/b) * ((x/b)^(a-1)) * exp(- (x/b)^a)), data=dat,
        start=pars, algorithm='brute-force')
res1 <- nls(y ~ ((a/b) * ((x/b)^(a-1)) * exp(- (x/b)^a)), data=dat,
        start=as.list(coef(res)))

But am unable to get an output which makes sense. For most event groups, I get the error Error in nls(y ~ ((a/b) * ((x/b)^(a - 1)) * exp(-(x/b)^a)), data = dat, : singular gradient

Now, I am wondering if I have chosen the right distribution.

How do I get the right distribution for this? And how do I find the parameters?

Here is some sample data:

event_group <- c('group_A', 'group_B', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_A', 'group_C', 'group_B', 'group_D', 'group_E', 'group_B', 'group_D', 'group_E', 'group_B', 'group_E', 'group_B', 'group_D', 'group_E', 'group_E')

freq_density <- c(0.005747126, 0.015151515, 0.057471264, 0.089552239, 0.015151515, 0.104477612, 0.033057851, 0.103448276, 0.28358209, 0.106060606, 0.044776119, 0.140495868, 0.25862069, 0.298507463, 0.181818182, 0.164179104, 0.090909091, 0.206896552, 0.164179104, 0.212121212, 0.268656716, 0.347107438, 0.247126437, 0.059701493, 0.151515152, 0.179104478, 0.190082645, 0.114942529, 0.074626866, 0.121212121, 0.074626866, 0.05785124, 0.005747126, 0.029850746, 0.075757576, 0.119402985, 0.033057851, 0.045454545, 0.029850746, 0.033057851, 0.060606061, 0.049586777, 0.015151515, 0.014925373, 0.008264463, 0.016528926)

time_min <- c(10, 30, 40, 45, 45, 45, 55, 55, 60, 60, 60, 70, 70, 75, 75, 75, 85, 85, 90, 90, 90, 100, 100, 105, 105, 105, 115, 115, 120, 120, 120, 130, 130, 135, 135, 135, 145, 150, 150, 160, 165, 175, 180, 195, 235, 250)

sample_data <- data.frame(event_group, time_min, freq_density, stringsAsFactors=FALSE)

Solution

  • fitdistrplus::fitdist() can be used to determine the parameters:

    fitdistrplus::fitdist(sample_data$freq_density, distr = "gamma")
    #> Fitting of the distribution ' gamma ' by maximum likelihood 
    #> Parameters:
    #>       estimate Std. Error
    #> shape  1.25139  0.2341895
    #> rate  11.51292  2.6352952
    
    fitdistrplus::fitdist(sample_data$freq_density, distr = "weibull")
    #> Fitting of the distribution ' weibull ' by maximum likelihood 
    #> Parameters:
    #>        estimate Std. Error
    #> shape 1.1657556 0.13768844
    #> scale 0.1145993 0.01526602
    
    # Use a Cullen and Frey graph to choose the 'best' fitting distribution
    fitdistrplus::descdist(sample_data$freq_density)
    

    #> summary statistics
    #> ------
    #> min:  0.005747126   max:  0.3471074 
    #> median:  0.08265491 
    #> mean:  0.1086957 
    #> estimated sd:  0.09034791 
    #> estimated skewness:  0.9060949 
    #> estimated kurtosis:  2.942441
    

    Created on 2021-12-02 by the reprex package (v2.0.1)

    Based on the Cullen and Frey graph a gamma distribution seems a good option for the given data.

    And if you would like to apply fitdistrplus::fitdist() to multiple groups you can for example use purrr::map():

        library(dplyr)   
        sample_data %>%
          split(.$event_group) %>%
          purrr::map(~fitdistrplus::fitdist(.$freq_density, distr = "gamma"))
        #> $group_A
        #> Fitting of the distribution ' gamma ' by maximum likelihood 
        #> Parameters:
        #>        estimate Std. Error
        #> shape 0.8847797  0.3852533
        #> rate  7.0784485  4.0716225
        #> 
        #> $group_B
        #> Fitting of the distribution ' gamma ' by maximum likelihood 
        #> Parameters:
        #>        estimate Std. Error
        #> shape  1.465481  0.5678731
        #> rate  16.121401  7.4261676
        #> 
        #> $group_C
        #> Fitting of the distribution ' gamma ' by maximum likelihood 
        #> Parameters:
        #>        estimate Std. Error
        #> shape  1.906359  0.9434099
        #> rate  13.344416  7.5468387
        #> 
        #> $group_D
        #> Fitting of the distribution ' gamma ' by maximum likelihood 
        #> Parameters:
        #>       estimate Std. Error
        #> shape  1.71704  0.7441117
        #> rate  15.45395  7.7658146
        #> 
        #> $group_E
        #> Fitting of the distribution ' gamma ' by maximum likelihood 
        #> Parameters:
        #>        estimate Std. Error
        #> shape  1.104798  0.4184115
        #> rate  12.152399  5.7735560