[SOLVED] R text analysis: Counting occurrences of any combination of words from two different keyword lists within a given distance of each other

Thanks for reading. For a research project, I'm doing some text analysis. We are analyzing large texts (company reports), and I'm looking to count keyword frequencies within those texts.

However, I have two lists of keywords, and I don't want to count the number of occurrences of these words. Instead, I want the number of times any two words, one from each list, appear within a certain distance of each other in the main text.

text <- c("The house is blue. The car is very big and red.")
words1 <- c("car", "house") 
words2 <- c("blue", "red")

The desired function should, for example, return 1 for a distance of 3 (the number of word pairs, one from each list, that occur within that distance of each other).

I know about the text_count function from the stringb package and kwic from quanteda. However, that's not really what I'm looking for.

Thanks, any help is appreciated.

Solution

The quanteda package has the function fcm(), which builds a feature co-occurrence matrix: each cell counts how often two tokens co-occur.

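library(quanteda)
txt <- c("The house is blue. The car is very big and red.")
# Tokenize, then lowercase so that "The" and "the" are counted together.
# Nested calls avoid relying on %>% being available.
toks <- tokens_tolower(tokens(txt))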
fcm(toks, window = 3, tri = FALSE)
#> Feature co-occurrence matrix of: 10 by 10 features.
#>         features
#> features the house is blue . car very big and red
#>    the     1     2  4    2 4   2    2   2   2   2
#>    house   2     0  2    1 2   1    1   1   1   1
#>    is      4     2  1    2 4   2    2   2   2   2
#>    blue    2     1  2    0 2   1    1   1   1   1
#>    .       4     2  4    2 1   2    2   2   2   2
#>    car     2     1  2    1 2   0    1   1   1   1
#>    very    2     1  2    1 2   1    0   1   1   1
#>    big     2     1  2    1 2   1    1   0   1   1
#>    and     2     1  2    1 2   1    1   1   0   1
#>    red     2     1  2    1 2   1    1   1   1   0
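
The printed matrix appears to hold document-level counts (house and red get 1 although they are far apart). As far as I know, fcm() only applies the window argument when context = "window" is also set, so that is worth checking before relying on the numbers. To make the counting rule explicit, here is a minimal base R sketch of the question's task. The helper count_near() and its by_sentence argument are hypothetical names, not part of quanteda or any other package. It assumes that sentences end at ".", "!" or "?", that tokens are runs of ASCII letters, digits and apostrophes, and that distance is the difference in token positions.

```r
# Hypothetical helper: counts pairs (w1, w2), with w1 in words1 and w2 in
# words2, whose token positions differ by at most `distance`.
count_near <- function(text, words1, words2, distance, by_sentence = TRUE) {
  # Assumption: sentences end at ".", "!" or "?"; split there if requested.
  units <- if (by_sentence) unlist(strsplit(text, "[.!?]+")) else text
  total <- 0
  for (u in units) {
    # Lowercase, then split on anything that is not a letter, digit or apostrophe.
    toks <- unlist(strsplit(tolower(u), "[^a-z0-9']+"))
    toks <- toks[nzchar(toks)]
    pos1 <- which(toks %in% words1)
    pos2 <- which(toks %in% words2)
    if (length(pos1) == 0 || length(pos2) == 0) next
    # Absolute position difference for every (words1, words2) pair.
    gaps <- abs(outer(pos1, pos2, "-"))
    # gaps > 0 skips a token that happens to be in both lists.
    total <- total + sum(gaps > 0 & gaps <= distance)
  }
  total
}

text <- c("The house is blue. The car is very big and red.")
words1 <- c("car", "house")
words2 <- c("blue", "red")

count_near(text, words1, words2, distance = 3)                       # 1
count_near(text, words1, words2, distance = 3, by_sentence = FALSE)  # 2
```

With sentence boundaries respected, only house/blue qualifies, which matches the 1 asked for in the question. Ignoring sentence boundaries also picks up blue/car across the full stop, so the choice of unit matters for the result.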