I'm working on a corpus of documents (clinical narratives from hospital stays), mainly using the Quanteda package. The objective is to be able to classify documents based on the presence/absence of a feature, let's say "spastic cough".
I would like to be able to reproduce the behaviour of an Apache Lucene "proximity search" (https://lucene.apache.org/core/8_11_2/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Proximity_Searches) using R.
Let's take an example: "spastic and productive cough in a 91-year-old patient following femoral neck surgery"
I would start by tokenizing the phrase as follows:
library(quanteda)

toks <- tokens(
  c(text1 = "spastic and productive cough in a 91-year-old patient following femoral neck surgery"),
  remove_punct = TRUE, remove_symbols = TRUE, remove_numbers = TRUE, padding = TRUE
) %>%
  tokens_remove(pattern = stopwords("en", source = "nltk"))
which yields the following output:
Tokens consisting of 1 document.
text1 :
[1] "spastic" "productive" "cough" "91-year-old" "patient" "following" "femoral"
[8] "neck" "surgery"
I can then proceed to generate n-grams and skip-grams:
toks_ng <- tokens_ngrams(toks, n = 4, skip = 0:3)
toks_ng
[1] "spastic_productive_cough_91-year-old" "spastic_productive_cough_patient"
[3] "spastic_productive_cough_following" "spastic_productive_cough_femoral"
[5] "spastic_productive_91-year-old_patient" "spastic_productive_91-year-old_following"
[7] "spastic_productive_91-year-old_femoral" "spastic_productive_91-year-old_neck"
[9] "spastic_productive_patient_following" "spastic_productive_patient_femoral"
[11] "spastic_productive_patient_neck" "spastic_productive_patient_surgery"
[13] "spastic_productive_following_femoral" "spastic_productive_following_neck"
[15] "spastic_productive_following_surgery" "spastic_cough_91-year-old_patient"
[17] "spastic_cough_91-year-old_following" "spastic_cough_91-year-old_femoral"
[19] "spastic_cough_91-year-old_neck" "spastic_cough_patient_following"
[21] "spastic_cough_patient_femoral" "spastic_cough_patient_neck"
[23] "spastic_cough_patient_surgery" "spastic_cough_following_femoral"
[25] "spastic_cough_following_neck" "spastic_cough_following_surgery"
[27] "spastic_cough_femoral_neck" "spastic_cough_femoral_surgery"
[29] "spastic_91-year-old_patient_following" "spastic_91-year-old_patient_femoral"
[31] "spastic_91-year-old_patient_neck" "spastic_91-year-old_patient_surgery"
.........
At this point I guess I could simply do:
library(stringr)
any(str_detect(as.character(toks_ng), "spastic_cough"))
[1] TRUE
but I'm not sure I'm using the correct approach, as it feels clunky compared to how a Lucene query would work. If I were trying to identify patients with "spastic cough" by querying the corpus with Apache Lucene, I could use something like "spastic cough"~3, where "~3" means the two terms may be up to three positions apart (which is what the skip = 0:3 skip-grams are meant to emulate).
Any input on how and where I could improve this approach?
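To make the intended behaviour more explicit, this is roughly the kind of wrapper I have in mind. It is only a sketch: the function name proximity_hit() and its arguments are mine, and it simply packages the skip-gram idea above into a per-document presence/absence check:

# Illustrative helper (name and signature are mine): returns TRUE/FALSE per
# document if the two terms occur in this order with at most `distance`
# intervening tokens, emulating a Lucene-style "term1 term2"~distance query.
proximity_hit <- function(toks, terms, distance = 3) {
  # bigrams allowing 0..distance skipped tokens between the two positions
  sg <- tokens_ngrams(toks, n = 2, skip = 0:distance)
  target <- paste(terms, collapse = "_")           # e.g. "spastic_cough"
  hits <- tokens_select(sg, pattern = target, valuetype = "fixed")
  ntoken(hits) > 0                                 # named logical, one per document
}

proximity_hit(toks, c("spastic", "cough"), distance = 3)
# text1
#  TRUE

Note that, unlike Lucene's slop, this only matches the terms in the given order; an unordered version would also have to test the reversed pair.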
This may do the trick: https://search.r-project.org/CRAN/refmans/corpustools/html/search_features.html
but, at the moment, I can't figure out how to include it in the workflow.
It seems like I can query the corpus with subset_query using a Lucene-like syntax. The big problem I'm facing now is that corpustools doesn't accept a quanteda tokens object as input, and the function tokens_to_tcorpus() isn't working for me, which prevents me from controlling the tokenization process.
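In case it's useful, one possible bridge is a rough sketch like the following; I'm assuming here that the default column names tokens_to_tcorpus() expects are doc_id, token_id and token, and the idea is simply to flatten the quanteda tokens object into a long data frame first:

library(corpustools)

# Rough sketch: one row per token, with the doc_id / token_id / token columns
# that corpustools::tokens_to_tcorpus() appears to expect by default.
tok_list <- as.list(toks)
tok_df <- data.frame(
  doc_id   = rep(names(tok_list), lengths(tok_list)),
  token_id = unlist(lapply(lengths(tok_list), seq_len)),
  token    = unname(unlist(tok_list))
)

tc <- tokens_to_tcorpus(tok_df)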
Actually, after delving deeper into the documentation, I found that the corpustools package offers everything I need for an Apache Lucene-like experience in R =)
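For anyone landing on the same problem, this is the kind of end-to-end corpustools workflow I mean (a minimal sketch based on my reading of the docs; note that corpustools does its own tokenization here, which is the trade-off versus the quanteda pipeline above):

library(corpustools)

# Build a tCorpus straight from the raw text
tc <- create_tcorpus(
  "spastic and productive cough in a 91-year-old patient following femoral neck surgery"
)

# Lucene-style proximity query: "spastic" and "cough" within 3 words of each other
hits <- search_features(tc, query = '"spastic cough"~3')
hits          # inspect which tokens/documents matched

# Or keep only the documents that match the query
tc_matched <- subset_query(tc, query = '"spastic cough"~3')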