
How can I count unique two-word phrases that are separated by commas within a cell in R?

I have a data frame of different locations (Location) along with the species of animals (Spp) found at each location. Each species is coded by its unique Genus species name, and a single cell can hold several of these names separated by commas. I would like to know how often each unique Genus species name occurs in the data frame.

Example Data

df1 <- data.frame(matrix(ncol = 2, nrow = 3))
x <- c("Location","Spp")
colnames(df1) <- x
df1$Location <- seq(1,3,1)
df1[1,2] <- c("Genus1 species1")
df1[2,2] <- c("Genus1 species1, Genus1 species2")
df1[3,2] <- c("Genus1 species1, Genus1 species2, Genus2 species1")

Output should look something like this

            Spp Freq
Genus1 species1    3
Genus1 species2    2
Genus2 species1    1

I have tried solving this with a text-mining corpus (the Corpus() and TermDocumentMatrix() functions used here come from the tm package), but I can only get it to count the unique individual words rather than each unique Genus species phrase.


library(tm)  # provides Corpus(), VectorSource(), and TermDocumentMatrix()

text <- df1[,2]
docs <- Corpus(VectorSource(text))
dtm <- TermDocumentMatrix(docs)
mat <- as.matrix(dtm)  # renamed from `matrix` to avoid masking base::matrix()
words <- sort(rowSums(mat), decreasing = TRUE)
words # only counts the individual Genus and species words; I want something similar that keeps each Genus species pair together


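  • A quick base R solution: split each cell on the ", " separator with strsplit(), flatten the resulting list with unlist(), and count the phrases with table():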

    table(unlist(strsplit(df1$Spp,', ')))
    #> Genus1 species1 Genus1 species2 Genus2 species1 
    #>               3               2               1
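
If the result is needed as a data frame with Spp and Freq columns, as in the desired output, the same idea can be extended a little. The following is only a sketch in base R: it rebuilds the example data from the question, and the names `phrases` and `freq` are illustrative and not part of the original answer. It splits on the comma alone and then trims whitespace, on the assumption that spacing after the commas may not always be consistent.

```r
# Example data from the question
df1 <- data.frame(
  Location = 1:3,
  Spp = c("Genus1 species1",
          "Genus1 species1, Genus1 species2",
          "Genus1 species1, Genus1 species2, Genus2 species1"),
  stringsAsFactors = FALSE
)

# Split every cell on commas, flatten to one vector, and strip stray whitespace
phrases <- trimws(unlist(strsplit(df1$Spp, ",", fixed = TRUE)))

# Count each phrase and convert the table to a two-column data frame
freq <- as.data.frame(table(Spp = phrases), stringsAsFactors = FALSE)

# Sort so the most frequent species comes first
freq <- freq[order(-freq$Freq), ]
rownames(freq) <- NULL
freq
```

Naming the argument in `table(Spp = phrases)` is what gives the first column the name Spp; `as.data.frame()` supplies the Freq column by default.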

