I have a number of pdf files and I need to search each of them for particular key words/phrases. For each pdf, I want to know how many of these key words/phrases appear (but not how many times they appear). For each key word/phrase that appears, I want to assign one point (no matter how many times it appears). For each that doesn't, zero points. I would like a script that can scan pdfs and count key words/phrases in the above way. I've looked at the readPDF function and the PDF Data Extractor (PDE), and have created code with advice I received from a previous question. Here is the code:
# set the current script's location as working directory
setwd('/Users/steve/OneDrive/Desktop')
#install.packages(c("pdftools", "stringr", "dplyr"))
library(pdftools)
library(stringr)
library(dplyr)
## Define the list of keywords/phrases here #########
keywords <- c("keyword1", "keywordn")
# Function to count keywords in a PDF
count_keywords <- function(file = "C:\\Users\\steve\\OneDrive\\Desktop\\CL\\156.pdf", keywords) {
  return(data.frame(
    PDF = basename("C:\\Users\\steve\\OneDrive\\Desktop\\CL\\156.pdf"),
    Keyword = keywords,
    One_if_found = sapply(keywords, function(keyword) {
      as.numeric(str_detect(paste(pdf_text("C:\\Users\\steve\\OneDrive\\Desktop\\CL\\156.pdf"), collapse = " "),
                            fixed(keywords, ignore_case = TRUE))) })
  ))
}
# Directory containing PDF files # create a folder on the same
#level as this script called "pdfs" and store your pdfs there
pdf_dir <- "CL"
# Combine results into a single data frame
final_results <- bind_rows(lapply(list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE), count_keywords, keywords = keywords))
print(final_results)# Print the results
The result I get is a very large dataframe with a row for each instance a keyword appears (not whether or not it appears), and columns for each keyword, whose values are all the same across the row, like this:
PDF Keyword One_if_found.keyword1 One_if_found.keywordn ...
156.pdf keyword1 1 1
156.pdf keywordn 0 0
156.pdf keyword1 1 1
156.pdf keywordn 0 0
...
I'd like to have a dataframe with just three columns: the PDF name, the keyword, and whether the keyword appears in the PDF (1) or not (2). How can I change my code to do this?
You can use the pdftools package to extract text from PDFs and then search for keywords/phrases. Set pdf_dir <- "pdfs", place your PDFs in that folder, and list the words and phrases you want to search for in the keywords vector.
Adding to my previous response.
setwd(dirname(rstudioapi::getSourceEditorContext()$path)) # set the current script's location as working directory
#install.packages(c("pdftools", "stringr", "dplyr"))
library(pdftools)
library(stringr)
library(dplyr)
## Define the list of keywords/phrases here
keywords <- c("information", "hairsser", "example phrase")
# Function to count keywords in a PDF
count_keywords <- function(pdf_path, keywords) {
  # Checks whether each entry of "keywords" is present in the PDF file at "pdf_path".
  # Returns a data frame:
  #   the name of the pdf   | the keyword    | 1 if found, 2 if not
  #   PDF_name                Keyword          Found
  # 1 somatosensory - 2.pdf   information      1
  # 2 somatosensory - 2.pdf   hairsser         2
  # 3 somatosensory - 2.pdf   example phrase   2
  # Args:
  #   pdf_path: the path to the PDF file that should be checked for keywords
  #   keywords: the vector containing the keywords, e.g. c("information", "hairsser", "example phrase")
  # Read the PDF once and merge all pages into a single string
  pdf_string <- paste(pdf_text(pdf_path), collapse = " ")
  return(
    data.frame(
      PDF_name = basename(pdf_path),
      Keyword = keywords,
      Found = sapply(keywords, function(keyword) {
        if_else(
          str_detect(pdf_string, fixed(keyword, ignore_case = TRUE)),
          1, # one if true -> string detected
          2  # two if false -> string NOT detected
        )
      }),
      row.names = NULL
    )
  )
}
# Check all pdf files stored in one directory "pdf_dir"
# Directory containing PDF files # create a folder on the same level as this script called "pdfs" and store your pdfs there
pdf_dir <- "pdfs"
pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)
# This iterates over all pdfs files in "pdf_files" and adds the results together using bind_rows
final_results <- bind_rows(lapply(pdf_files, count_keywords, keywords = keywords))
print(final_results)# Print the results
Which results in:
> print(final_results)# Print the results
PDF_name Keyword Found
1 somatosensory - 2.pdf information 1
2 somatosensory - 2.pdf hairsser 2
3 somatosensory - 2.pdf example phrase 2
4 somatosensory.pdf information 1
5 somatosensory.pdf hairsser 2
6 somatosensory.pdf example phrase 2
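As far as I can tell, the wide data frame in the question comes from one small slip: inside sapply() the original code passes the whole vector keywords to fixed() instead of the single keyword. Each iteration then returns one value per keyword, sapply() builds a matrix, and data.frame() spreads it into one column per keyword. The hard-coded path in the function body also means every file in the folder was scored against 156.pdf. Here is a small sketch that reproduces the effect without any PDF; the string text is a made-up stand-in for the extracted PDF text, not real data.

```r
library(stringr)

# Made-up stand-in for the text extracted from a PDF (assumption, not real data)
text <- "Some information extracted from a PDF."
keywords <- c("information", "hairsser")

# As in the question: the whole vector `keywords` is passed on every iteration,
# so each call returns one value per keyword and sapply() returns a matrix.
wide <- sapply(keywords, function(keyword) {
  as.numeric(str_detect(text, fixed(keywords, ignore_case = TRUE)))
})

# As in the answer: only the current `keyword` is passed,
# so each call returns a single value and sapply() returns a plain vector.
long <- sapply(keywords, function(keyword) {
  as.numeric(str_detect(text, fixed(keyword, ignore_case = TRUE)))
})

print(dim(wide))    # matrix: one row and one column per keyword
print(length(long)) # vector: one entry per keyword
```

With the single keyword, data.frame() gets an ordinary column, which is why the result above has exactly three columns.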
Or use the function count_keywords on only one PDF:
# Or use "count_keywords" to check only one pdf
# - "pdfs/somatosensory - 2.pdf" is the path to the pdf file, RELATIVE to the location of this script
# this script
# pdfs
# |_somatosensory - 2.pdf
count_keywords("pdfs/somatosensory - 2.pdf", keywords = keywords)
which results in:
PDF_name Keyword Found
1 somatosensory - 2.pdf information 1
2 somatosensory - 2.pdf hairsser 2
3 somatosensory - 2.pdf example phrase 2
or for your example:
one_pdf_result <- count_keywords("C:\\Users\\steve\\OneDrive\\Desktop\\CL\\156.pdf", keywords = c("keyword1", "keywordn"))
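If you also want the score per PDF described in the question (one point for each keyword that appears, zero otherwise), you could sum the results per file. This is only a sketch under some assumptions: the data frame below is a hand-typed stand-in for final_results, copied from the output shown earlier, and the column names Point and Score are ones I made up for illustration.

```r
library(dplyr)

# Hand-typed stand-in for `final_results`, copied from the printed output above
final_results <- data.frame(
  PDF_name = rep(c("somatosensory - 2.pdf", "somatosensory.pdf"), each = 3),
  Keyword  = rep(c("information", "hairsser", "example phrase"), times = 2),
  Found    = c(1, 2, 2, 1, 2, 2)
)

scores <- final_results %>%
  mutate(Point = as.integer(Found == 1)) %>% # 1 point if found, 0 if not
  group_by(PDF_name) %>%
  summarise(Score = sum(Point))              # number of distinct keywords found per PDF

print(scores)
```

Each keyword contributes at most one point however often it occurs, because Found only records whether it was detected at all.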