I have the following data set:
df <- data.frame(text = c("House Sky Blue",
                          "House Sky Green",
                          "House Sky Red",
                          "House Sky Yellow",
                          "House Sky Green",
                          "House Sky Glue",
                          "House Sky Green"))
I'd like to find the percentage of co-occurrence for certain tokens. For example, out of all the documents that contain the token "House", how many also include the token "Green"?
In our data, 7 documents contain the term "House", and 3 of those 7 (p = 100 * 3/7) also include the term "Green". It would also be nice to see which terms or tokens appear alongside the token "House" above some threshold p.
I have tried these two functions. First, textstat_collocations():
> textstat_collocations(tokens)
collocation count count_nested length lambda z
1 house sky 7 0 2 5.416100 2.622058
2 sky green 3 0 2 2.456736 1.511653
And textstat_simil():
textstat_simil(dfm(tokens),margin="features")
textstat_simil object; method = "correlation"
house sky blue green red yellow glue
house NaN NaN NaN NaN NaN NaN NaN
sky NaN NaN NaN NaN NaN NaN NaN
blue NaN NaN 1.000 -0.354 -0.167 -0.167 -0.167
green NaN NaN -0.354 1.000 -0.354 -0.354 -0.354
red NaN NaN -0.167 -0.354 1.000 -0.167 -0.167
yellow NaN NaN -0.167 -0.354 -0.167 1.000 -0.167
glue NaN NaN -0.167 -0.354 -0.167 -0.167 1.000
but they do not seem to give my desired output. I also wonder why the correlation between "green" and "house" is NaN in the textstat_simil output.
My desired output would show the following info:
feature="House"
percentage of co-occurrence
Green = 3/7
Blue = 1/7
Red = 1/7
Yellow = 1/7
Glue = 1/7
In the quanteda docs I can't seem to find a function that gives my desired output, although I suspect there must be a way around this, since I find this library to be so fast and complete.
One way to do this is to use fcm() to get document-level co-occurrences for a target feature. Below, I wrap this in a small function that prints the desired output.
library("quanteda")
#> Package version: 3.2.4
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
df <- data.frame(text = c("House Sky Blue",
                          "House Sky Green",
                          "House Sky Red",
                          "House Sky Yellow",
                          "House Sky Green",
                          "House Sky Glue",
                          "House Sky Green"))
corp <- corpus(df)
coocc_fract <- function(corp, feature) {
  # create a document-level co-occurrence matrix; tri = FALSE keeps the
  # full symmetric matrix so any feature's row can be indexed directly
  fcmat <- fcm(dfm(tokens(corp), tolower = FALSE),
               context = "document", tri = FALSE)
  mat <- as.matrix(fcmat)
  cat("feature=\"", feature, "\"\n", sep = "")
  cat("  percentage of co-occurrence\n\n")
  # loop over the other features, skipping zero co-occurrences
  for (f in setdiff(featnames(fcmat), feature)) {
    if (mat[feature, f] > 0) {
      cat(f, " = ", mat[feature, f], "/", ndoc(corp), "\n", sep = "")
    }
  }
}
This produces the following output:
coocc_fract(corp, feature = "House")
#> feature="House"
#>   percentage of co-occurrence
#>
#> Sky = 7/7
#> Blue = 1/7
#> Green = 3/7
#> Red = 1/7
#> Yellow = 1/7
#> Glue = 1/7
Created on 2023-01-02 with reprex v2.0.2
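As an alternative, here is a more compact sketch using only the dfm and base R. It normalises by the number of documents containing "House", per the question (here that count equals ndoc(corp), since every document contains "House"):

```r
library("quanteda")

corp <- corpus(data.frame(text = c("House Sky Blue", "House Sky Green",
                                   "House Sky Red", "House Sky Yellow",
                                   "House Sky Green", "House Sky Glue",
                                   "House Sky Green")))
# logical matrix: TRUE where a document contains a feature
m <- as.matrix(dfm(tokens(corp), tolower = FALSE)) > 0
# documents that contain the target feature
target <- m[, "House"]
# fraction of those documents that also contain each other feature
colSums(m[target, , drop = FALSE]) / sum(target)
# House = Sky = 7/7; Green = 3/7; Blue, Red, Yellow, Glue = 1/7 each
```

This counts each document once, regardless of how often a token repeats within it. As for the NaN values from textstat_simil(): "house" and "sky" occur with the same frequency in every document, so their dfm columns have zero variance, and the Pearson correlation against a constant column is 0/0, i.e. NaN.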