Here is my following problem and my solution. The purpose of this question is to find out if there is an easier way to do the following process because I feel like I over complicated the code; however, I could not find anywhere on Stack Overflow, other websites, or youtube of an easier solution.
The dataset I am using is Los Angeles Crime Statistics filtered by the year 2020 and condensed down to 10 features and 198,908 observations. Here is the the head() of the dataset:
LA2020 %>%
select(crime_description, victim_sex) %>%
head()
crime_description victim_sex
1 BATTERY - SIMPLE ASSAULT F
2 BATTERY - SIMPLE ASSAULT M
3 SEX OFFENDER REGISTRANT OUT OF COMPLIANCE X
4 VANDALISM - MISDEAMEANOR ($399 OR UNDER) F
5 VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS) X
6 RAPE, FORCIBLE F
All I want to do is group the "crime_description" in ggplot with geom_bar and use the "fill" parameter on the "victime_sex" column. There are 129 different categories within the "crime_description" column so therefore in the geom_bar I just want to display the top 15 crimes with the highest to lowest sorted from left to right.
I have read a lot of solutions to this problem which most responded with preparing the top 15 crimes first before piping into ggplot. I was able to graph it; however, sorting and displaying the top 15 was the biggest obstacle for me. What I had to do was create a dataframe with a series of "LEFTJOINS". See below.
LA2020 %>%
count(crime_description, sort = TRUE) %>%
rename(cnt = n) %>%
top_n(15, cnt) %>%
left_join(
(LA2020 %>%
filter(victim_sex == "M") %>%
count(crime_description)),
by = "crime_description") %>%
rename(M = n) %>%
left_join(
(LA2020 %>%
filter(victim_sex == "F") %>%
count(crime_description)),
by = "crime_description") %>%
rename(F = n) %>%
left_join(
(LA2020 %>%
filter(victim_sex == "X") %>%
count(crime_description)),
by = "crime_description") %>%
rename(X = n) %>%
left_join(
(LA2020 %>%
filter(victim_sex == "H") %>%
count(crime_description)),
by = "crime_description") %>%
rename(H = n) %>%
left_join(
(LA2020 %>%
filter(victim_sex == "O") %>%
count(crime_description)),
by = "crime_description") %>%
rename(O = n)
{. ->> crime} ### saved as an object "crime"
crime_description cnt M F X H O
1 VEHICLE - STOLEN 20702 63 15 18 NA 20606
2 BATTERY - SIMPLE ASSAULT 16293 8407 7808 75 1 2
3 VANDALISM - FELONY ($400 & OVER, ALL CHURCH VANDALISMS) 12885 6056 4251 2571 1 6
4 BURGLARY 12793 6478 3361 2943 NA 11
5 BURGLARY FROM VEHICLE 12675 7084 5317 265 4 5
6 ASSAULT WITH DEADLY WEAPON, AGGRAVATED ASSAULT 11498 8194 3073 223 2 6
7 THEFT PLAIN - PETTY ($950 & UNDER) 10816 5193 4505 1113 2 3
8 INTIMATE PARTNER - SIMPLE ASSAULT 10814 2705 8088 20 1 NA
9 THEFT FROM MOTOR VEHICLE - PETTY ($950 & UNDER) 9704 2992 2189 148 NA 4375
10 THEFT OF IDENTITY 8786 4387 4312 82 2 3
11 VANDALISM - MISDEAMEANOR ($399 OR UNDER) 6951 3119 2738 1092 NA 2
12 ROBBERY 6882 4210 1775 889 1 7
13 THEFT-GRAND ($950.01 & OVER)EXCPT,GUNS,FOWL,LIVESTK,PROD 5492 2903 1986 597 NA 6
14 THEFT FROM MOTOR VEHICLE - GRAND ($950.01 AND OVER) 4767 2856 1673 237 NA 1
15 CRIMINAL THREATS - NO WEAPON DISPLAYED 4189 2008 2123 58 NA NA
As you can see, this dataframe list the top 15 crimes according to the "cnt" column and the subsequent columns is the victim_sex distribution of the total "cnt". I named this dataframe "crime" and I used this top 15 dataframe in the filter before piping into ggplot. I essentially grouped, counted and sorted top 15 crimes and saved as a dataframe to be used as my filter criteria for the original dataset "LA2020"
LA2020 %>%
filter(crime_description %in% crime$crime_description) %>%
ggplot(aes(x = fct_infreq(crime_description), fill = victim_sex)) +
geom_bar(alpha = 0.5) +
scale_x_discrete(label = abbreviate) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white")) +
labs(title = "Crime Distribution by Gender",
x = "Crime Description",
y = "Count",
fill = "Gender")
Crime Distribution by Gender, LA2020
QUESTION
Is there a simpler way of doing what I just did? I know I did not need to create all those "LEFTJOIN's" and could of just created a dataframe with the top 15, but I wanted to show in a table what ggplot would display. Could I do this without creating a separate dataframe object and just pipe it into the ggplot?
All was explained in the original question
Here is one option to simplify your code, which first computes the counts by crime and sex, then filters for the top 15 crimes for which I use a semi_join
. This dataset can then be used for plotting. And if you need the data in a more table like format you could use e.g. tidyr::pivot_wider
to reshape to wide.
Using some fake random example data:
library(ggplot2)
library(dplyr, warn = FALSE)
set.seed(123)
LA2020 <- data.frame(
crime_description = sample(letters, 1000, replace = TRUE),
victim_sex = sample(c("M", "F"), 1000, replace = TRUE)
)
top15 <- LA2020 |>
count(crime_description, victim_sex) |>
semi_join(
# Top 15 crimes
LA2020 |>
count(crime_description, sort = TRUE) |>
head(15),
by = "crime_description"
) |>
mutate(
crime_description = reorder(crime_description, -n, FUN = sum)
)
top15 |>
ggplot(
aes(crime_description, n, fill = victim_sex)
) +
geom_col() +
# scale_x_discrete(label = abbreviate) +
theme(
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "white")
) +
labs(
title = "Crime Distribution by Gender",
x = "Crime Description",
y = "Count",
fill = "Gender"
)
top15 |>
add_count(crime_description, wt = n, name = "cnt") |>
tidyr::pivot_wider(names_from = victim_sex, values_from = n)
#> # A tibble: 15 × 4
#> crime_description cnt F M
#> <fct> <int> <int> <int>
#> 1 c 43 25 18
#> 2 f 36 15 21
#> 3 g 45 23 22
#> 4 h 51 24 27
#> 5 j 52 26 26
#> 6 k 39 16 23
#> 7 n 37 10 27
#> 8 q 37 19 18
#> 9 t 41 18 23
#> 10 u 37 16 21
#> 11 v 37 17 20
#> 12 w 48 23 25
#> 13 x 37 19 18
#> 14 y 49 24 25
#> 15 z 38 21 17