rsortingtabulate

Create cross-tabulation of most frequent value of string variable and sort by frequency


I have a sample dataset:

df <- data.frame(category = c("A", "A", "B", "C", "C", "D", "E", "C", "E", "A", "B", "C", "B", "B", "B", "D", "D", "D", "D", "B"), year = c(1, 2, 1, 2, 3, 2, 3, 1, 3, 2, 1, 1, 2, 1, 2, 3, 1, 2, 3, 1))

and would like to create a cross-tabulation of year and category such that only the 3 most frequent categories are in the table and also sorted by total number of occurences:

    1 2 3
  B 4 2 0
  D 1 2 2
  C 2 1 1

Using something like

df %>% 
  add_count(category) %>% 
  filter(n %in% tail(sort(unique(n)),3)) %>% 
  arrange(desc(n)) %>% {table(.$category, .$year)}

will filter for the three most occurring categories but leave the table unsorted

    1 2 3
  B 4 2 0
  C 2 1 1
  D 1 2 2

Solution

  • This should give you what you want.

    # Make a table
    df.t <- table(df)
    # Order by top occurrences (sum over margin 1)
    df.t <- df.t[order(apply(df.t, 1, sum), decreasing=TRUE),]
    # Keep top 3 results
    df.t <- df.t[1:3,]
    

    Output:

            year
    category 1 2 3
           B 4 2 0
           D 1 2 2
           C 2 1 1