r tidyverse forcats

Group low counts in tidyverse R


I'm currently working with a dataset in tibble format with 714 rows (each row corresponds to a sequence that is specific to a given virus, but multiple sequences can come from the same virus, if that makes sense).

So if you look at the data, there are, for example, 21 B19 sequences.

I want to add a new column to my tibble that groups all virus strains occurring only a few times (fewer than 50 counts) into one group ("Others"), while every strain with a high count keeps its own group, so that CMV stays CMV. In other words, every time a low-count strain occurs, 'newID' will be "Others" (see fig 1). Until now, I have used 'mutate(newID = case_when(Origin == "CMV" ~ "CMV"))' and grouped the strains manually based on their counts (see Data figure), but there should be an easier and less hard-coded option, right?

Data:

 1 B19         21
 2 BKPyV        8
 3 CMV        161
 4 Covid-19    68
 5 EBV        204
 6 FLU-A       22
 7 HAdV-C      10
 8 hCoV        84
 9 HHV-1       27
10 HHV-2        3
11 HHV-6B       1
12 HIV-1       18
13 HMPV         3
14 HPV         37
15 JCPyV        4
16 NWV         12
17 unknown      9
18 VACV         9
19 VZV         13

I hope you can help!


Solution

  • You can use fct_lump() from the forcats package (tidyverse).

    Here I keep the 4 viruses with the highest counts (weighted by your count column) and lump all the others into "Other":

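    library(dplyr)    # provides %>% and mutate()
    library(forcats)

    # Keep the 4 levels with the largest total count; everything else becomes "Other"
    data %>% 
      mutate(virus = as.factor(virus)) %>% 
      mutate(newID = fct_lump(virus, n = 4, w = count))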
    

    Output is:

    # A tibble: 19 × 4
          id virus    count newID
       <dbl> <fct>    <dbl> <fct>
     1     1 B19         21 Other
     2     2 BKPyV        8 Other
     3     3 CMV        161 CMV  
     4     4 Covid-19    68 Covid-19
     5     5 EBV        204 EBV  
     6     6 FLU-A       22 Other
     7     7 HAdV-C      10 Other
     8     8 hCoV        84 hCoV 
     9     9 HHV-1       27 Other
    10    10 HHV-2        3 Other
    11    11 HHV-6B       1 Other
    12    12 HIV-1       18 Other
    13    13 HMPV         3 Other
    14    14 HPV         37 Other
    15    15 JCPyV        4 Other
    16    16 NWV         12 Other
    17    17 unknown      9 Other
    18    18 VACV         9 Other
    19    19 VZV         13 Other
    
    

    I used the following data:

    library(dplyr)
    
    data <- tribble(
      ~id, ~virus,     ~count,
        1, "B19",          21,
        2, "BKPyV",         8,
        3, "CMV",         161,
        4, "Covid-19",     68,
        5, "EBV",         204,
        6, "FLU-A",        22,
        7, "HAdV-C",       10,
        8, "hCoV",         84,
        9, "HHV-1",        27,
       10, "HHV-2",         3,
       11, "HHV-6B",        1,
       12, "HIV-1",        18,
       13, "HMPV",          3,
       14, "HPV",          37,
       15, "JCPyV",         4,
       16, "NWV",          12,
       17, "unknown",       9,
       18, "VACV",          9,
       19, "VZV",          13
    )
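
The question asks for a count threshold (fewer than 50) and the label "Others", not a fixed number of groups. A minimal sketch of that variant follows, assuming forcats 0.5.0 or later, where fct_lump_min() is available. The `counts` and `sequences` objects are illustrative names; the `Origin` column name is taken from the question, and the one-row-per-sequence tibble is rebuilt here from the summary counts only so the example runs on its own.

```r
library(dplyr)
library(forcats)

# Summary table from the question: one row per virus with its sequence count
counts <- tribble(
  ~virus,     ~count,
  "B19",          21,
  "BKPyV",         8,
  "CMV",         161,
  "Covid-19",     68,
  "EBV",         204,
  "FLU-A",        22,
  "HAdV-C",       10,
  "hCoV",         84,
  "HHV-1",        27,
  "HHV-2",         3,
  "HHV-6B",        1,
  "HIV-1",        18,
  "HMPV",          3,
  "HPV",          37,
  "JCPyV",         4,
  "NWV",          12,
  "unknown",       9,
  "VACV",          9,
  "VZV",          13
)

# Case 1: already-aggregated data. `w = count` weights each level by its count,
# so levels with a total below 50 are lumped into "Others".
lumped <- counts %>%
  mutate(newID = fct_lump_min(virus, min = 50, w = count, other_level = "Others"))

# Case 2: one row per sequence (assumed layout of the original 714-row tibble).
# Rebuilt here by repeating each virus name `count` times.
sequences <- tibble(Origin = rep(counts$virus, times = counts$count))

# No weights needed: fct_lump_min() counts the rows per level itself.
sequences <- sequences %>%
  mutate(newID = fct_lump_min(Origin, min = 50, other_level = "Others"))

# Tally the new groups
count(sequences, newID)
```

With this data the threshold keeps CMV, Covid-19, EBV and hCoV, the same four levels as the top-4 approach, but the result no longer depends on knowing in advance how many strains clear the cutoff.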