r ggplot2 ggplotly linegraph

Plot a line graph (geom_line) based on three columns and a group-by


I want to plot the percentage of people in each socio-economic status by AgeGroup. Below is an example of the data frame.

S.NO  SocioEconomicStatus  Age  AgeGroup
P1    2                    43   36-45 AgeGroup
P2    5                    27   26-35 AgeGroup
P3    1                    34   26-35 AgeGroup
P4    2                    43   36-45 AgeGroup
P5    3                    78   76-85 AgeGroup
P6    4                    89   86+ AgeGroup

Socio-economic status ranges from 1 to 5, and the age group is derived from the age. I want AgeGroup on the x-axis and, on the y-axis, the percentage of people in each AgeGroup by socio-economic status.

Below is an example graph I made in Excel. I am looking for a similar graph in R.

[Image: example line graph made in Excel]


Solution

  • I came up with something along those lines using the tidyverse packages:

    library(tidyverse)
    
    # create a dummy dataset; SocioEconomicStatus is sampled with unequal
    #   weights for a real-world-esque distribution
    #   (sample() rescales prob, so the weights need not sum to 1)
    set.seed(1234)
    n_total <- 1001 # number of people
    data <- data.frame(S.NO=1:n_total, 
                       SocioEconomicStatus=factor(sample(1:5, n_total, replace=TRUE,
                                                         prob=c(0.1, 0.3, 0.3, 0.15, 0.05))),
                       AgeGroup=sample(c('18-25', '26-35', '36-45', '46-55', 
                                         '56-65', '66-75', '76-85', '86+'),
                                       n_total, replace=TRUE))
    head(data)  # preview the first few rows
    
    # count the number of people in each combination of AgeGroup and SocioEconomicStatus
    data_by_age <- group_by(data, SocioEconomicStatus, AgeGroup) %>% 
      summarize(n=dplyr::n(), .groups='drop')
    arrange(data_by_age, AgeGroup)  # check the raw numbers
    # convert to percentages within each AgeGroup (each AgeGroup sums to 100)
    data_by_age <- group_by(data_by_age, AgeGroup) %>% 
      mutate(pct=(n/sum(n))*100) %>% 
      ungroup()
    arrange(data_by_age, AgeGroup)  # check the percentages
    
    # one line per status level; group= tells geom_line which points to
    #   connect across the discrete x-axis
    ggplot(data_by_age, aes(x=AgeGroup, y=pct, col=SocioEconomicStatus, 
                            group=SocioEconomicStatus)) +
      geom_point() +
      geom_line() +
      theme_classic()
    

    There's probably a way to do this in base R, but the combination of the group_by/summarize/mutate functions makes this very intuitive for me.

    [Image: line graph of age/SES distributions]
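On the base R remark: below is a minimal sketch of one way it could be done, not a tested replacement for the tidyverse version. It assumes the same column names as in the question. The dummy data, the `age_levels` vector and the plotting choices (`matplot()`, colours, legend position) are illustrative assumptions of mine. The idea is that `table()` does the counting and `prop.table(..., margin = 1)` converts each age group's row to proportions.

```r
# Base R sketch: percentage of each socio-economic status within each age group.
# The data is made up for illustration; column names follow the question.
set.seed(1234)
n_total <- 1000
age_levels <- c("18-25", "26-35", "36-45", "46-55",
                "56-65", "66-75", "76-85", "86+")
data <- data.frame(
  SocioEconomicStatus = factor(sample(1:5, n_total, replace = TRUE),
                               levels = 1:5),
  AgeGroup = factor(sample(age_levels, n_total, replace = TRUE),
                    levels = age_levels)
)

# Cross-tabulate: rows are age groups, columns are status levels
counts <- table(data$AgeGroup, data$SocioEconomicStatus)

# margin = 1 normalises each row, so every age group sums to 100;
# unclass() turns the table into a plain numeric matrix for plotting
pct <- unclass(prop.table(counts, margin = 1) * 100)

# matplot() draws one line per matrix column, i.e. one per status level
matplot(pct, type = "b", pch = 19, lty = 1, col = 1:5,
        xaxt = "n", xlab = "AgeGroup", ylab = "Percentage of people")
axis(1, at = seq_along(age_levels), labels = age_levels)
legend("topright", legend = colnames(pct), col = 1:5, lty = 1, pch = 19,
       title = "SocioEconomicStatus")
```

Passing `margin = 2` instead would make each status level sum to 100 across the age groups, which is the other plausible reading of "percentage by socio-economic status".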