rreplacerowcount

Make a new column based on the count of rows, in an existing column, that meet a condition


My dataframe is called detections. One column is OBSERVER. That column contains the initials of 32 different people. Some of those observers are responsible for 1000+ observations (i.e. 1000+ lines of the database), some as few as 3. I would like to lump the infrequent observers (with fewer than 200 observations), so I would like to create a new column OBSERVER_LUMPED which replaces initials of small time observers with "OTHERS" and leaves the initials of the major observers as is. What I have tried (N.B. this does not create a new column, just modifies the existing one):

detections$OBSERVER[with(detections, OBSERVER %in% unique(OBSERVER)[table(OBSERVER) < 200])] <- "OTHERS"

and it replaces some observer initials with "OTHERS' but not the right ones! If after running that line of code I then run

table(detections$OBSERVER) 

I get back the below. DBC with three lines and WAR with 49 were not replaced with "OTHERS" but many other observers, including one with 2000 lines in the database, did get replaced.

DBC DCP KRM OTHERS WAR

3 93 942 4605 49

I'm not good at creating reproduceable examples but something that worked on something as simple as

df <- iris[44:51,4:5]

will create a workable facsimile, I think. In case it's not clear, the goal for this minimal df would be to add a column SPECIES_LUMPED that replaced the rarer species name ("versicolor") that appears in fewer than 4 rows with "Others" and leave the more common species name ("setosa") intact.


Solution

  • Base R

    I'll make a quick lookup table where the name is the "mapping from" and the value is the "mapping to" strings.

    The fact that df$Species is a factor presents "virginica" because table(.) is counting all of its levels, even if not present in the subset of data. This is just a curiosity here, removing missing levels does not affect the results. This factor necessitates as.character() in the ave call; if your $OBSERVER is not a factor than you should not need as.character.

    tb <- table(df$Species)
    tb
    #     setosa versicolor  virginica 
    #          7          1          0 
    tb <- ifelse(tb < 4, "Others", names(tb))
    tb
    #     setosa versicolor  virginica 
    #   "setosa"   "Others"   "Others" 
    
    df$SPECIES_LUMPED <- ave(as.character(df$Species), df$Species, FUN = \(z) tb[z])
    df
    #    Petal.Width    Species SPECIES_LUMPED
    # 44         0.6     setosa         setosa
    # 45         0.4     setosa         setosa
    # 46         0.3     setosa         setosa
    # 47         0.2     setosa         setosa
    # 48         0.2     setosa         setosa
    # 49         0.2     setosa         setosa
    # 50         0.2     setosa         setosa
    # 51         1.4 versicolor         Others
    

    dplyr

    This does not require predefining tb from above:

    library(dplyr)
    df |>
      mutate(.by = Species, SPECIES_LUMPED = if (n() < 4) "Others" else Species)
    #    Petal.Width    Species SPECIES_LUMPED
    # 44         0.6     setosa         setosa
    # 45         0.4     setosa         setosa
    # 46         0.3     setosa         setosa
    # 47         0.2     setosa         setosa
    # 48         0.2     setosa         setosa
    # 49         0.2     setosa         setosa
    # 50         0.2     setosa         setosa
    # 51         1.4 versicolor         Others
    

    data.table

    library(data.table)
    as.data.table(df)[, SPECIES_LUMPED := if (.N < 4) "Others" else Species]
    #    Petal.Width    Species SPECIES_LUMPED
    #          <num>     <fctr>         <fctr>
    # 1:         0.6     setosa         setosa
    # 2:         0.4     setosa         setosa
    # 3:         0.3     setosa         setosa
    # 4:         0.2     setosa         setosa
    # 5:         0.2     setosa         setosa
    # 6:         0.2     setosa         setosa
    # 7:         0.2     setosa         setosa
    # 8:         1.4 versicolor     versicolor
    

    Data

    df <- iris[44:51,4:5]