r string grepl

Understanding why grepl doesn't appear to be correctly identifying words


I'm trying to count occurrences of a word in a document (as part of some research I'm doing into how politicians use language). I don't understand why the count R gives me differs from the count I get when I search the document by hand.

# Counting the occurrences of the word 'migrant' in a political debate
fileContent <- readLines("https://www.theyworkforyou.com/pwdata/scrapedxml/debates/debates2024-01-17c.xml")
wordToCount <- "Migrant"
wordCount <- sum(grepl(wordToCount, fileContent, ignore.case = TRUE))
wordCount # returns 20

This returns 20. However, if I open the document and Ctrl+F for 'Migrant', I get 22 hits. (I understand that the above code should match the word inside longer strings as well as on its own.)

I've also tried parsing the XML but, even more confusingly, this returns only 18, even though searching the parsed output by hand still gives 22 hits:

# Same as above but parsing the XML
library(xml2) # provides read_xml(), xml_find_all() and xml_text()
fileContent <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/debates/debates2024-01-17c.xml")
fileContent <- xml_find_all(fileContent, ".//speech")
fileContent <- xml_text(fileContent)
wordToCount <- "Migrant"
wordCount <- sum(grepl(wordToCount, fileContent, ignore.case = TRUE))
wordCount # returns 18

#Outputting the data to double-check values
output <- file("output.txt")
writeLines(fileContent, output)
close(output)

Can anyone help me to understand why these two pieces of code are not returning 22?


Solution

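  • grepl returns one TRUE or FALSE per element of its input: TRUE if that element contains at least one occurrence of migrant. A string that contains the word twice is therefore still counted only once, so sum(grepl(...)) gives the number of lines (with readLines) or speeches (with xml_text) that mention the word, not the number of occurrences. See this example: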

    sum(grepl("migrant",
              c("Something about migrants. Something else about migrants ")))
    # returns 1, although "migrants" appears twice
    

    To count every occurrence instead, you can use str_count from the stringr package:

    library(xml2) # provides read_xml(), xml_find_all() and xml_text()

    fileContent <- read_xml("https://www.theyworkforyou.com/pwdata/scrapedxml/debates/debates2024-01-17c.xml")
    fileContent <- xml_find_all(fileContent, ".//speech")
    fileContent <- xml_text(fileContent)
    # str_count() returns the number of matches within each speech
    migrant_count <- stringr::str_count(tolower(fileContent), "migrant")
    total_migrant_count <- sum(migrant_count)
    print(total_migrant_count) # -> 22
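
If you would rather not add a dependency, the same idea can be sketched in base R with gregexpr and regmatches. This is only an illustration on made-up input: the `lines` vector and the variable names are assumptions of mine, not taken from the debate file, so the counts in the comments apply to this toy data only.

```r
# Toy input (hypothetical): three elements, five occurrences of "migrant",
# counting matches inside longer words such as "immigrants"
lines <- c(
  "Migrants arrived. The migrant workers stayed.",
  "No mention here.",
  "migrant, MIGRANT, and immigrants."
)

# grepl() yields one logical per element, so this counts matching elements
elements_with_match <- sum(grepl("migrant", lines, ignore.case = TRUE))

# gregexpr() finds every match in each element; regmatches() extracts them,
# and lengths() gives the number of matches per element (0 if none)
matches <- regmatches(lines, gregexpr("migrant", lines, ignore.case = TRUE))
occurrences <- sum(lengths(matches))

# Optional whole-word variant: \\b excludes "immigrants" but keeps "Migrants"
whole <- regmatches(
  lines,
  gregexpr("\\bmigrants?\\b", lines, ignore.case = TRUE, perl = TRUE)
)
whole_word_occurrences <- sum(lengths(whole))

print(elements_with_match)     # 2
print(occurrences)             # 5
print(whole_word_occurrences)  # 4
```

The gap between the first two numbers is the same effect as in the question: grepl counts elements that contain the word, while gregexpr counts every match. Ctrl+F in a browser or editor typically does a substring search, so the non-whole-word count is the one to compare against it.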