My dataframe is called detections. One column is OBSERVER. That column contains the initials of 32 different people. Some of those observers are responsible for 1000+ observations (i.e. 1000+ lines of the database), some as few as 3. I would like to lump the infrequent observers (with fewer than 200 observations), so I would like to create a new column OBSERVER_LUMPED which replaces initials of small time observers with "OTHERS" and leaves the initials of the major observers as is. What I have tried (N.B. this does not create a new column, just modifies the existing one):
detections$OBSERVER[with(detections, OBSERVER %in% unique(OBSERVER)[table(OBSERVER) < 200])] <- "OTHERS"
and it replaces some observer initials with "OTHERS' but not the right ones! If after running that line of code I then run
table(detections$OBSERVER)
I get back the below. DBC with three lines and WAR with 49 were not replaced with "OTHERS" but many other observers, including one with 2000 lines in the database, did get replaced.
DBC DCP KRM OTHERS WAR
3 93 942 4605 49
I'm not good at creating reproduceable examples but something that worked on something as simple as
df <- iris[44:51,4:5]
will create a workable facsimile, I think. In case it's not clear, the goal for this minimal df would be to add a column SPECIES_LUMPED that replaced the rarer species name ("versicolor") that appears in fewer than 4 rows with "Others" and leave the more common species name ("setosa") intact.
I'll make a quick lookup table where the name is the "mapping from" and the value is the "mapping to" strings.
The fact that df$Species is a factor presents "virginica" because table(.) is counting all of its levels, even if not present in the subset of data. This is just a curiosity here, removing missing levels does not affect the results. This factor necessitates as.character() in the ave call; if your $OBSERVER is not a factor than you should not need as.character.
tb <- table(df$Species)
tb
# setosa versicolor virginica
# 7 1 0
tb <- ifelse(tb < 4, "Others", names(tb))
tb
# setosa versicolor virginica
# "setosa" "Others" "Others"
df$SPECIES_LUMPED <- ave(as.character(df$Species), df$Species, FUN = \(z) tb[z])
df
# Petal.Width Species SPECIES_LUMPED
# 44 0.6 setosa setosa
# 45 0.4 setosa setosa
# 46 0.3 setosa setosa
# 47 0.2 setosa setosa
# 48 0.2 setosa setosa
# 49 0.2 setosa setosa
# 50 0.2 setosa setosa
# 51 1.4 versicolor Others
dplyrThis does not require predefining tb from above:
library(dplyr)
df |>
mutate(.by = Species, SPECIES_LUMPED = if (n() < 4) "Others" else Species)
# Petal.Width Species SPECIES_LUMPED
# 44 0.6 setosa setosa
# 45 0.4 setosa setosa
# 46 0.3 setosa setosa
# 47 0.2 setosa setosa
# 48 0.2 setosa setosa
# 49 0.2 setosa setosa
# 50 0.2 setosa setosa
# 51 1.4 versicolor Others
data.tablelibrary(data.table)
as.data.table(df)[, SPECIES_LUMPED := if (.N < 4) "Others" else Species]
# Petal.Width Species SPECIES_LUMPED
# <num> <fctr> <fctr>
# 1: 0.6 setosa setosa
# 2: 0.4 setosa setosa
# 3: 0.3 setosa setosa
# 4: 0.2 setosa setosa
# 5: 0.2 setosa setosa
# 6: 0.2 setosa setosa
# 7: 0.2 setosa setosa
# 8: 1.4 versicolor versicolor
Data
df <- iris[44:51,4:5]