r nlp corpus quanteda

R text analysis: Counting occurrences of any combination of words from two different keyword lists within a given distance of each other


Thanks for reading. For a research project, I'm doing some text analysis. We are analyzing large texts (company reports), and I'm looking to count keyword frequencies within those texts.

However, I have two lists of keywords, and I don't want to count the number of occurrences of these words, but the number of times any two words from these lists (one from each) appear within a certain distance of each other in the main text.

text <- c("The house is blue. The car is very big and red.")
words1 <- c("car", "house") 
words2 <- c("blue", "red") 

The desired function should, for example, return 1 for a distance of 3 (the number of cross-list word pairs within the given distance).

I know about the text_count function from the stringb package and kwic from quanteda. However, that's not really what I'm looking for.

Thanks, any help is appreciated.


Solution

  • The quanteda package has the function fcm(), which counts how often features (tokens) co-occur.

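    library(quanteda)
    txt <- c("The house is blue. The car is very big and red.")
    # Tokenise and lower-case so that "The" and "the" count as one feature
    toks <- tokens_tolower(tokens(txt))
    # tri = FALSE returns the full symmetric matrix, not just the upper triangle.
    # Note: `window` is only used when context = "window"; with the default
    # context = "document", the counts below are per-document co-occurrences.
    fcm(toks, window = 3, tri = FALSE)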
    #> Feature co-occurrence matrix of: 10 by 10 features.
    #>         features
    #> features the house is blue . car very big and red
    #>    the     1     2  4    2 4   2    2   2   2   2
    #>    house   2     0  2    1 2   1    1   1   1   1
    #>    is      4     2  1    2 4   2    2   2   2   2
    #>    blue    2     1  2    0 2   1    1   1   1   1
    #>    .       4     2  4    2 1   2    2   2   2   2
    #>    car     2     1  2    1 2   0    1   1   1   1
    #>    very    2     1  2    1 2   1    0   1   1   1
    #>    big     2     1  2    1 2   1    1   0   1   1
    #>    and     2     1  2    1 2   1    1   1   0   1
    #>    red     2     1  2    1 2   1    1   1   1   0
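
To get the single number the question asks for, the co-occurrence counts have to be restricted to pairs with one word from each list. Below is a minimal base-R sketch of that idea, not part of quanteda. The helper name `count_pairs_within` is hypothetical, and the sketch assumes that pairs are only counted within a sentence, that sentences end at `.`, `!` or `?`, and that distance is measured in word positions. The sentence assumption is what yields 1 for the example: across the sentence boundary, "blue" and "car" are only two words apart.

```r
# Hypothetical helper (not from quanteda or any package): counts pairs made of
# one word from `words1` and one from `words2` whose positions within the same
# sentence differ by at most `distance`.
count_pairs_within <- function(text, words1, words2, distance) {
  # Crude sentence split on ., ! or ? (an assumption made for illustration).
  sentences <- unlist(strsplit(tolower(text), "[.!?]+"))
  total <- 0
  for (s in sentences) {
    # Tokenise on anything that is not a letter, digit, or apostrophe.
    toks <- unlist(strsplit(s, "[^[:alnum:]']+"))
    toks <- toks[nzchar(toks)]
    pos1 <- which(toks %in% words1)
    pos2 <- which(toks %in% words2)
    if (length(pos1) == 0 || length(pos2) == 0) next
    # Absolute position differences for every cross-list pair.
    gaps <- abs(outer(pos1, pos2, "-"))
    total <- total + sum(gaps <= distance)
  }
  total
}

text <- "The house is blue. The car is very big and red."
words1 <- c("car", "house")
words2 <- c("blue", "red")

count_pairs_within(text, words1, words2, distance = 3)
#> [1] 1
count_pairs_within(text, words1, words2, distance = 5)
#> [1] 2
```

Keywords are matched in lower case, so the lists should be lower case too. A word that appears in both lists would be paired with itself at distance 0, so the lists are assumed to be disjoint.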