rdatecategorical-datadata-processingtopmost

Finding most common value seperated by day


I want to see which category occurs most often each day per participant. There are multiple categories which occur each day and I want a new column which states the category which occured mostly that specific day for a specific participant.

I have a column 'user_id', 'date' and a column 'category' (characters). Which code should I use to add a new column which only states the category which has most occurences for a specific user at a specific day?

dput:

structure(list(user_id = c("10257", "10580", "10280", "10202", "10275","10281"),
date = structure(c(1552521600, 1552003200, 1551139200,1551484800, 1552867200, 1552521600), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
better_category = c("Email", "Internet_Browser", "Instant_Messaging","News","Background_Process","Instant_Messaging")),
row.nams = c(176300L, 184332L, 469288L, 119462L, 112507L, 399236L), 
class = "data.frame")

Solution

  • Let's create some data:

    require(dplyr)
    set.seed(100)
    data<-data.frame(user_id=rep(c(1,2,3),10),date=rep(c("tuesday","wednesday","thursday"),each=10),category=(sample(c(1:3),30,replace=TRUE)))
    

    If we arrange it for convenient viewing, we can get this:

        data<-data %>% arrange(user_id,date)
        data
           user_id      date category
        1        1  thursday        3
        2        1  thursday        2
        3        1  thursday        3
        4        1   tuesday        1
        5        1   tuesday        1
        6        1   tuesday        3
        7        1   tuesday        1
        8        1 wednesday        1
        9        1 wednesday        3
        10       1 wednesday        2
        11       2  thursday        2
        12       2  thursday        1
        13       2  thursday        2
        14       2   tuesday        1
        15       2   tuesday        2
        16       2   tuesday        2
        17       2 wednesday        2
        18       2 wednesday        2
        19       2 wednesday        1
        20       2 wednesday        3
        21       3  thursday        2
        22       3  thursday        3
        23       3  thursday        3
        24       3  thursday        1
        25       3   tuesday        2
        26       3   tuesday        2
        27       3   tuesday        2
        28       3 wednesday        3
        29       3 wednesday        3
        30       3 wednesday        2
    

    Now we'll group it by user_id and date, and create a new column called max that takes the most frequent category from each group. We do this using table over `category, which creates a crosstabs of the column for each grouping:

        data %>% group_by(user_id,date) %>% 
          dplyr::mutate(max=names(sort(table(category),decreasing=TRUE))[1])
    
    # A tibble: 30 x 4
    # Groups:   user_id, date [9]
       user_id date      category max  
         <dbl> <fct>        <int> <chr>
     1       1 thursday         3 3    
     2       1 thursday         2 3    
     3       1 thursday         3 3    
     4       1 tuesday          1 1    
     5       1 tuesday          1 1    
     6       1 tuesday          3 1    
     7       1 tuesday          1 1    
     8       1 wednesday        1 1    
     9       1 wednesday        3 1    
    10       1 wednesday        2 1    
    # ... with 20 more rows
    

    As you can see, each user-day grouping gets its own max. In the last example shown her (1-wednesday), there is one of each of the three categories, so the first is selected, i.e. 1.

    Here is the result using your dput data (in which every line has a unique user/date pairing):

    # A tibble: 6 x 4
    # Groups:   user_id, date [6]
      user_id date                better_category    max               
      <fct>   <dttm>              <fct>              <chr>             
    1 10257   2019-03-14 00:00:00 Email              Email             
    2 10580   2019-03-08 00:00:00 Internet_Browser   Internet_Browser  
    3 10280   2019-02-26 00:00:00 Instant_Messaging  Instant_Messaging 
    4 10202   2019-03-02 00:00:00 News               News              
    5 10275   2019-03-18 00:00:00 Background_Process Background_Process
    6 10281   2019-03-14 00:00:00 Instant_Messaging  Instant_Messaging 
    

    So I created an identical table but duplicated the last row twice and then changed one of the categories there to "News", and ran the same code:

    # A tibble: 8 x 4
    # Groups:   user_id, date [6]
      user_id date                better_category    max               
      <chr>   <dttm>              <chr>              <chr>             
    1 10257   2019-03-14 00:00:00 Email              Email             
    2 10580   2019-03-08 00:00:00 Internet_Browser   Internet_Browser  
    3 10280   2019-02-26 00:00:00 Instant_Messaging  Instant_Messaging 
    4 10202   2019-03-02 00:00:00 News               News              
    5 10275   2019-03-18 00:00:00 Background_Process Background_Process
    6 10281   2019-03-14 00:00:00 News               Instant_Messaging 
    7 10281   2019-03-14 00:00:00 Instant_Messaging  Instant_Messaging 
    8 10281   2019-03-14 00:00:00 Instant_Messaging  Instant_Messaging 
    

    Note the last three rows.