rsortingggplot2greatest-n-per-groupgeom-bar

Two Categorical Variables, Sorting and displaying top_15 in ggplot


Here is my following problem and my solution. The purpose of this question is to find out if there is an easier way to do the following process because I feel like I over complicated the code; however, I could not find anywhere on Stack Overflow, other websites, or youtube of an easier solution.

The dataset I am using is Los Angeles Crime Statistics filtered by the year 2020 and condensed down to 10 features and 198,908 observations. Here is the the head() of the dataset:

LA2020 %>% 
  select(crime_description, victim_sex) %>% 
  head()
                                        crime_description victim_sex
1                                BATTERY - SIMPLE ASSAULT          F
2                                BATTERY - SIMPLE ASSAULT          M
3               SEX OFFENDER REGISTRANT OUT OF COMPLIANCE          X
4                VANDALISM - MISDEAMEANOR ($399 OR UNDER)          F
5 VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS)          X
6                                          RAPE, FORCIBLE          F

All I want to do is group the "crime_description" in ggplot with geom_bar and use the "fill" parameter on the "victime_sex" column. There are 129 different categories within the "crime_description" column so therefore in the geom_bar I just want to display the top 15 crimes with the highest to lowest sorted from left to right.

I have read a lot of solutions to this problem which most responded with preparing the top 15 crimes first before piping into ggplot. I was able to graph it; however, sorting and displaying the top 15 was the biggest obstacle for me. What I had to do was create a dataframe with a series of "LEFTJOINS". See below.

LA2020 %>% 
  count(crime_description, sort = TRUE) %>%
  rename(cnt = n) %>% 
  top_n(15, cnt) %>%
  left_join(
    (LA2020 %>% 
       filter(victim_sex == "M") %>% 
       count(crime_description)),
    by = "crime_description") %>% 
  rename(M = n) %>% 
  left_join(
    (LA2020 %>% 
       filter(victim_sex == "F") %>% 
       count(crime_description)),
    by = "crime_description") %>% 
  rename(F = n) %>% 
  left_join(
    (LA2020 %>% 
       filter(victim_sex == "X") %>% 
       count(crime_description)),
    by = "crime_description") %>% 
  rename(X = n) %>% 
  left_join(
    (LA2020 %>% 
       filter(victim_sex == "H") %>% 
       count(crime_description)),
    by = "crime_description") %>% 
  rename(H = n) %>% 
  left_join(
    (LA2020 %>% 
       filter(victim_sex == "O") %>% 
       count(crime_description)),
    by = "crime_description") %>% 
  rename(O = n)
  {. ->> crime} ### saved as an object "crime"
                                          crime_description   cnt    M    F    X  H     O
1                                          VEHICLE - STOLEN 20702   63   15   18 NA 20606
2                                  BATTERY - SIMPLE ASSAULT 16293 8407 7808   75  1     2
3   VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS) 12885 6056 4251 2571  1     6
4                                                  BURGLARY 12793 6478 3361 2943 NA    11
5                                     BURGLARY FROM VEHICLE 12675 7084 5317  265  4     5
6            ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT 11498 8194 3073  223  2     6
7                        THEFT PLAIN - PETTY ($950 & UNDER) 10816 5193 4505 1113  2     3
8                         INTIMATE PARTNER - SIMPLE ASSAULT 10814 2705 8088   20  1    NA
9           THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER)  9704 2992 2189  148 NA  4375
10                                        THEFT OF IDENTITY  8786 4387 4312   82  2     3
11                 VANDALISM - MISDEAMEANOR ($399 OR UNDER)  6951 3119 2738 1092 NA     2
12                                                  ROBBERY  6882 4210 1775  889  1     7
13 THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LIVESTK,PROD  5492 2903 1986  597 NA     6
14      THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER)  4767 2856 1673  237 NA     1
15                   CRIMINAL THREATS - NO WEAPON DISPLAYED  4189 2008 2123   58 NA    NA

As you can see, this dataframe list the top 15 crimes according to the "cnt" column and the subsequent columns is the victim_sex distribution of the total "cnt". I named this dataframe "crime" and I used this top 15 dataframe in the filter before piping into ggplot. I essentially grouped, counted and sorted top 15 crimes and saved as a dataframe to be used as my filter criteria for the original dataset "LA2020"

LA2020 %>% 
  filter(crime_description %in% crime$crime_description) %>% 
  ggplot(aes(x = fct_infreq(crime_description), fill = victim_sex)) +
  geom_bar(alpha = 0.5) +
  scale_x_discrete(label = abbreviate) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "white")) +
  labs(title = "Crime Distribution by Gender",
       x = "Crime Description",
       y = "Count",
       fill = "Gender")

Crime Distribution by Gender, LA2020

QUESTION

Is there a simpler way of doing what I just did? I know I did not need to create all those "LEFTJOIN's" and could of just created a dataframe with the top 15, but I wanted to show in a table what ggplot would display. Could I do this without creating a separate dataframe object and just pipe it into the ggplot?

All was explained in the original question


Solution

  • Here is one option to simplify your code, which first computes the counts by crime and sex, then filters for the top 15 crimes for which I use a semi_join. This dataset can then be used for plotting. And if you need the data in a more table like format you could use e.g. tidyr::pivot_wider to reshape to wide.

    Using some fake random example data:

    library(ggplot2)
    library(dplyr, warn = FALSE)
    
    set.seed(123)
    
    LA2020 <- data.frame(
      crime_description = sample(letters, 1000, replace = TRUE),
      victim_sex = sample(c("M", "F"), 1000, replace = TRUE)
    )
    
    top15 <- LA2020 |>
      count(crime_description, victim_sex) |>
      semi_join(
        # Top 15 crimes
        LA2020 |>
          count(crime_description, sort = TRUE) |>
          head(15),
        by = "crime_description"
      ) |>
      mutate(
        crime_description = reorder(crime_description, -n, FUN = sum)
      ) 
    
    top15 |>
      ggplot(
        aes(crime_description, n, fill = victim_sex)
      ) +
      geom_col() +
      # scale_x_discrete(label = abbreviate) +
      theme(
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "white")
      ) +
      labs(
        title = "Crime Distribution by Gender",
        x = "Crime Description",
        y = "Count",
        fill = "Gender"
      )
    

    
    top15 |> 
      add_count(crime_description, wt = n, name = "cnt") |> 
      tidyr::pivot_wider(names_from = victim_sex, values_from = n)
    #> # A tibble: 15 × 4
    #>    crime_description   cnt     F     M
    #>    <fct>             <int> <int> <int>
    #>  1 c                    43    25    18
    #>  2 f                    36    15    21
    #>  3 g                    45    23    22
    #>  4 h                    51    24    27
    #>  5 j                    52    26    26
    #>  6 k                    39    16    23
    #>  7 n                    37    10    27
    #>  8 q                    37    19    18
    #>  9 t                    41    18    23
    #> 10 u                    37    16    21
    #> 11 v                    37    17    20
    #> 12 w                    48    23    25
    #> 13 x                    37    19    18
    #> 14 y                    49    24    25
    #> 15 z                    38    21    17