rggplot2dplyrapplybeta-distribution

Apply MASS::fitdistr to multiple data by a factor


My question is at the end in bold.

I know how to fit the beta distribution to some data. For instance:

library(Lahman)
library(dplyr)

# clean up the data and calculate batting averages by playerID
batting_by_decade <- Batting %>%
  filter(AB > 0) %>%
  group_by(playerID, Decade = round(yearID - 5, -1)) %>%
  summarize(H = sum(H), AB = sum(AB)) %>%
  ungroup() %>%
  filter(AB > 500) %>%
  mutate(average = H / AB)

# fit the beta distribution
library(MASS)
m <- MASS::fitdistr(batting_by_decade$average, dbeta,
                    start = list(shape1 = 1, shape2 = 10))

alpha0 <- m$estimate[1]
beta0 <- m$estimate[2]

# plot the histogram of data and the beta distribution
ggplot(career_filtered) +
  geom_histogram(aes(average, y = ..density..), binwidth = .005) +
  stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red",
                size = 1) +
  xlab("Batting average")

Which yields:

enter image description here

Now I want to calculate different beta parameters alpha0 and beta0 for each batting_by_decade$Decade column of the data so I end up with 15 parameter sets, and 15 beta distributions that I can fit to this ggplot of batting averages faceted by Decade:

batting_by_decade %>% 
  ggplot() +
  geom_histogram(aes(x=average)) +
  facet_wrap(~ Decade)

enter image description here

I can hard code this by filtering for each decade, and passing that decade's worth of data into the fidistr function, repeating this for all decades, but is there a way of calculating all beta parameters per decade quickly and reproducibly, perhaps with one of the apply functions?


Solution

  • You can leverage summarise together with two custom functions for this:

    getAlphaEstimate = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate[1]}
    
    getBetaEstimate = function(x) {MASS::fitdistr(x, dbeta,start = list(shape1 = 1, shape2 = 10))$estimate[2]}
    
    batting_by_decade %>%
      group_by(Decade) %>%
      summarise(alpha = getAlphaEstimate(average),
             beta = getBetaEstimate(average)) -> decadeParameters
    

    However, you will not be able to plot it with stat_summary according to Hadley's post here: https://stackoverflow.com/a/1379074/3124909