I am using DocumentTermMatrix to find a list of keywords in a long text. Most of the words in my list are correctly found, but there are a couple that are missing. Now, I would love to post here a minimal working example, but the problem is: there is one of the words ("insolvency", so not a short word as in the problem here) in a document of 32 pages which is missed. Now, this word is actually in page 7 of the text. But if I reduce my text with text <- text[7]
, then DocumentTermMatrix actually finds it! So I am not able to reproduce this with a minimal working example...
Do you have any ideas?
Below a sketch of my script:
library(fastpipe)
library(openxlsx)
library(tm)
`%>>%` <- fastpipe::`%>>%`
source("cleanText.R") # Custom function to clean up the text from reports
keywords_xlsx <- read.xlsx(paste0(getwd(),"/Keywords.xlsx"),
sheet = "all",
startRow = 1,
colNames = FALSE,
skipEmptyRows = TRUE,
skipEmptyCols = TRUE)
keywords <- keywords_xlsx[1] %>>%
tolower(as.character(.[,1]))
# Custom function to read pdfs
read <- readPDF(control = list(text = "-layout"))
# Extract text from pdf
report <- "my_report.pdf"
document <- Corpus(URISource(paste0("./Annual reports/", report)), readerControl = list(reader = read))
text <- content(document[[1]])
text <- cleanText(report, text) # This is a custom function to clean up the texts
# text <- text[7] # If I do this, my word is found! Otherwise it is missed
# Create a corpus
text_corpus <- Corpus(VectorSource(text))
matrix <- t(as.matrix(inspect(DocumentTermMatrix(text_corpus,
list(dictionary = keywords,
list(wordLengths=c(1, Inf))
)
))))
words <- sort(rowSums(matrix),decreasing=TRUE)
df <- data.frame(word = names(words),freq=words)
The problem lies in your use of inspect
. Only use inspect to check if your code is working and to see if a dtm has any values. Never use inspect
inside functions / transformations, because inspect
by default only shows the firs 10 rows and 10 columns of a document term matrix.
Also if you want to transpose the outcome of a dtm, use TermDocumentMatrix
.
Your last line should be:
mat <- as.matrix(TermDocumentMatrix(text_corpus,
list(dictionary = keywords,
list(wordLengths=c(1, Inf)))))
Note that turning a dtm / tdm into a matrix will use a lot more memory than having the data inside a sparse matrix.