I want to plot the percentage of people in each socio-economic status by AgeGroup. Below is an example of the data frame.
S.NO | SocioEcnomicStatus | Age | AgeGroup |
---|---|---|---|
P1 | 2 | 43 | 36-45 AgeGroup |
P2 | 5 | 27 | 26-35 AgeGroup |
P3 | 1 | 34 | 26-35 AgeGroup |
P4 | 2 | 43 | 36-45 AgeGroup |
P5 | 3 | 78 | 76-85 AgeGroup |
P6 | 4 | 89 | 86+ AgeGroup |
The socio-economic status ranges from 1 to 5, and the age group depends on the age. On the x-axis I want AgeGroup, and on the y-axis I want the percentage of people in each AgeGroup by socio-economic status.
Below is an example graph I made in Excel. I am looking for a similar graph in R.
I came up with something along those lines using the tidyverse package:
library(tidyverse)
# create dummy dataset - weight SocioEcononomicStatus sampling for
# a real-world-esque distribution
# (the weights need not sum to 1; sample() rescales them)
set.seed(1234)
n_total <- 1001 # number of people
data <- data.frame(S.NO=1:n_total,
                   SocioEcononomicStatus=factor(sample(1:5, n_total, replace=TRUE,
                                                       prob=c(0.1, 0.3, 0.3, 0.15, 0.05))),
                   AgeGroup=sample(c('18-25', '26-35', '36-45', '46-55',
                                     '56-65', '66-75', '76-85', '86+'),
                                   n_total, replace=TRUE))
head(data) # preview the first rows rather than printing all 1001
# count the number of people in each combination of AgeGroup and SocioEcononomicStatus
data_by_age <- group_by(data, SocioEcononomicStatus, AgeGroup) %>%
  summarize(n=dplyr::n(), .groups='drop')
arrange(data_by_age, AgeGroup) # check the raw numbers
# convert to percentages within each AgeGroup
data_by_age <- group_by(data_by_age, AgeGroup) %>%
  mutate(pct=(n/sum(n))*100) %>%
  ungroup() # drop the grouping so later steps are not applied per AgeGroup
arrange(data_by_age, AgeGroup) # check the percentages
ggplot(data_by_age, aes(x=AgeGroup, y=pct, col=SocioEcononomicStatus,
                        group=SocioEcononomicStatus)) +
  geom_point() +
  geom_line() +
  theme_classic()
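The counting and percentage steps above can probably be collapsed into a single pipeline, and it is worth checking that the percentages add up to 100 within each AgeGroup. This is only a minimal sketch: the `toy` data frame and the `toy_pct` and `pct_totals` names are made up for illustration, and `count()` is used here as shorthand for the `group_by`/`summarize` pair.

```r
library(dplyr)

# Small hypothetical dataset, only for illustration
set.seed(1234)
toy <- data.frame(
  AgeGroup = sample(c('26-35', '36-45', '76-85'), 200, replace = TRUE),
  SocioEcononomicStatus = factor(sample(1:5, 200, replace = TRUE))
)

# count() is shorthand for group_by() followed by summarize(n = n())
toy_pct <- count(toy, AgeGroup, SocioEcononomicStatus) %>%
  group_by(AgeGroup) %>%
  mutate(pct = 100 * n / sum(n)) %>%
  ungroup()

# Sanity check: the percentages within each AgeGroup should add up to 100
pct_totals <- summarize(group_by(toy_pct, AgeGroup), total = sum(pct))
print(pct_totals)
```

If any total is not 100, the percentages were most likely computed over the wrong grouping (for example over the whole data frame rather than per AgeGroup).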
There's probably a way to do this in base R, but the combination of the group_by/summarize/mutate functions makes this very intuitive for me.
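For comparison, here is one way the base R version might look, using `table()` and `prop.table()` for the percentages and `matplot()` for the lines. Treat it as a rough sketch rather than a drop-in replacement: the `toy` data frame is a made-up stand-in, and the plot styling is arbitrary.

```r
# Small hypothetical dataset, only for illustration
set.seed(1234)
toy <- data.frame(
  AgeGroup = sample(c('26-35', '36-45', '76-85'), 200, replace = TRUE),
  SocioEcononomicStatus = sample(1:5, 200, replace = TRUE)
)

# Cross-tabulate: rows are age groups, columns are status levels
counts <- table(toy$AgeGroup, toy$SocioEcononomicStatus)

# margin = 1 converts each row (age group) to proportions that sum to 1
pct <- prop.table(counts, margin = 1) * 100

# Plain matrix for plotting; one line per status level across age groups
pct_mat <- unclass(pct)
matplot(pct_mat, type = 'b', pch = 1, lty = 1, col = seq_len(ncol(pct_mat)),
        xaxt = 'n', xlab = 'AgeGroup', ylab = 'Percentage of age group')
axis(1, at = seq_len(nrow(pct_mat)), labels = rownames(pct_mat))
legend('topright', legend = colnames(pct_mat), col = seq_len(ncol(pct_mat)),
       lty = 1, title = 'Status')
```

The `margin = 1` argument is what makes the percentages relative to each AgeGroup; `margin = 2` would instead give the share of each status level that falls in each age group.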