rcrosstab

searching a string of text in R producing crossable


I have a data frame where columns have strings of text. I want to search within the text of multiple columns and count the occurrence of specific words, and then make a table that reports the co-occurence (by multiple columns).

Here's a small data frame to make this reproducible:

#create dataframe


TI <- c('Studies of pancreatic cancer',
        'colon, breast, and pancreatic cancer',
        'cancer and CVD paper', 
        'CVD paper'
        )
AB <- c('Autoimmune pancreatitis (AIP) is now considered a disease to be treated by diet.',
        'dietary patterns, positive associations were found between pancreatic cancer risk ',
        'Cardiovascular diseases (CVD) is linked to Diet',
        'making this match with pancreas'
        )

df <- data.frame(TI, AB)

#TI = title (of a paper)
#AB = abstract of the same paper

I then want to see how many titles have either the word 'cancer' or 'CVD' in the title (TI column), and the co-occurrence with the words 'pancreatic (or some variation)' or 'diet' in the abstract (AB column)

I can count the single occurence of the words I want with grep

pancreatic.ab <- length(grep("pancreat*", df$AB, ignore.case = TRUE, perl = FALSE))
pancreatic.ab

diet.ab <- length(grep("diet*", df$AB, ignore.case = TRUE, perl = FALSE))
diet.ab

cancer.ti <- length(grep("cancer*", df$TI, ignore.case = TRUE, perl = FALSE))
cancer.ti

CVD.ti <- length(grep("CVD", df$TI, ignore.case = TRUE, perl = FALSE))
CVD.ti

but not sure how to do this for a complicated cross tabs table.

any suggestions?

An example of desired output would be something like this example table


Solution

  • You can bind the rows that have your targets in the TI columns, providing a label for the columns belonging to each (in my example ti). Then, grouping by that label (i.e. group_by(ti)), you can create the columns with the counts, using summarize()

    bind_rows(
      df %>% filter(grepl("cancer*", TI, ignore.case=T)) %>% mutate(ti="CANCER"),
      df %>% filter(grepl("CVD*", TI, ignore.case=T)) %>% mutate(ti="CVD")
    ) %>% 
      group_by(ti) %>% 
      summarize(
        TotalAB = n(),
        Pancreatic = sum(grepl("pancreat*", AB, ignore.case=T)),
        Diet = sum(grepl("diet*", AB, ignore.case=T))
        )
    

    Output:

      ti     TotalAB Pancreatic  Diet
      <chr>    <int>      <int> <int>
    1 CANCER       3          2     3
    2 CVD          2          1     1