rdplyrgroup-byiqr

Count number of outliers by group in r and store count in new dataframe


I have a dataset that has 2 columns; column A is State_Name and has 5 different options of state, and column B is Total_Spend which has the average total spend of that state per day. There are 365 observations for each state.

What I want to do is count the number of outliers PER STATE using the 1.5 IQR rule and save the count of outliers per state to a new df or table.

So I would expect an output something like:

State Outlier Count
ATL 5
GA 20
MI 11
NY 50
TX 23

I have managed to get it to work by doing it one state at a time but I can't figure out what to do to achieve this in a single go.

Here is my code at the moment (to return the result for a single state):

  daily_agg %>% 
  select(State_Name, Total_Spend) %>%
  filter(State_Name == "NY")

outlier_NY <- length(boxplot.stats(outlier_df$Total_Spend)$out)

Any help would be appreciated.

Thanks!

EDIT WITH TEST DATASET


outlier_mtcars <- 
  df %>%  
  select(cyl, disp) %>%
  filter(cyl == "6")
  
outliers <- length(boxplot.stats(outlier_mtcars$disp)$out)

The above shows me 1 outlier for 6 cyl cars but I want a table that shows how many outliers for 4, 6, 8 cyl cars


Solution

  • Since I'm not very familiar with the function boxplot.stats, I didn't use this in my solution and instead manually calculates 1.5 * IQR + upper quantile.

    Here mtcars was used as an example. For the records that are outliers, they are "flagged" as TRUE, where we can sum them up in summarize.

    library(dplyr)
    
    mtcars %>% 
      group_by(cyl) %>% 
      mutate(flag = disp >= (IQR(disp) * 1.5 + quantile(disp, probs = 0.75)), .keep = "used") %>% 
      summarize(Outlier = sum(flag))
    
    # A tibble: 3 × 2
        cyl Outlier
      <dbl>   <int>
    1     4       0
    2     6       1
    3     8       0