
R script to count keywords in a PDF?


I have a number of pdf files and I need to search each of them for particular key words/phrases. For each pdf, I want to know how many of these key words/phrases appear (but not how many times they appear). For each key word/phrase that appears, I want to assign one point (no matter how many times it appears). For each that doesn't, zero points. I would like a script that can scan pdfs and count key words/phrases in the above way. I've looked at the readPDF function and the PDF Data Extractor (PDE), and have created code with advice I received from a previous question. Here is the code:

# set the current script's location as working directory
setwd('/Users/steve/OneDrive/Desktop')

#install.packages(c("pdftools", "stringr", "dplyr"))
library(pdftools)
library(stringr)
library(dplyr)

## Define the list of keywords/phrases here #########
keywords <- c("keyword1", "keywordn")

# Function to count keywords in a PDF
count_keywords <- function(file = "C:\\Users\\steve\\OneDrive\\Desktop\\CL\\156.pdf", keywords) {
  return(data.frame(
    PDF = basename("C:\\Users\\steve\\OneDrive\\Desktop\\CL\\156.pdf"),
    Keyword = keywords,
    One_if_found = sapply(keywords, function(keyword) {
      as.numeric(str_detect(paste(pdf_text("C:\\Users\\steve\\OneDrive\\Desktop\\CL\\156.pdf"), collapse = " "),
                            fixed(keywords, ignore_case = TRUE)))
    })
  ))
}

# Directory containing PDF files # create a folder on the same   
#level as this script called "pdfs" and store your pdfs there
pdf_dir <- "CL"

# Combine results into a single data frame
final_results <- bind_rows(lapply(list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE), count_keywords, keywords = keywords))

print(final_results)# Print the results

The result I get is a very large dataframe with a row for each instance a keyword appears (not whether or not it appears), and columns for each keyword, whose values are all the same across the row, like this:

PDF     Keyword     One_if_found.keyword1    One_if_found.keywordn    ...
156.pdf keyword1    1                        1
156.pdf keywordn    0                        0
156.pdf keyword1    1                        1
156.pdf keywordn    0                        0
...

I'd like to have a dataframe with just three columns: the PDF name, the keyword, and whether the keyword appears in the PDF (1) or not (2). How can I change my code to do this?


Solution

  • You can use the pdftools package to extract text from PDFs and then search for keywords/phrases.

    1. Create a folder named "pdfs" next to the R script (matching pdf_dir <- "pdfs") and place your PDFs in it.
    2. Specify your keywords in the keywords vector.
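
    The extra One_if_found.* columns in the question appear to come from two details of the original function. First, the inner call passes the whole vector keywords to fixed() instead of the single keyword, so every sapply() iteration returns a full-length vector and the results are simplified to a matrix. Second, the hard-coded path means the file argument is never used, so every file in the folder is reported as 156.pdf. A minimal sketch of the first effect, using a made-up text string in place of a PDF (text, wrong and right are illustrative names, not from the original code):

    ```r
    library(stringr)

    keywords <- c("keyword1", "keywordn")
    # Hypothetical stand-in for the text extracted from a PDF
    text <- "some text mentioning keyword1 only"

    # As in the question: the whole vector 'keywords' is used inside the loop,
    # so each iteration returns one value per keyword and sapply() builds a matrix
    wrong <- sapply(keywords, function(keyword) {
      as.numeric(str_detect(text, fixed(keywords, ignore_case = TRUE)))
    })

    # Using the loop variable 'keyword' gives one value per keyword
    right <- sapply(keywords, function(keyword) {
      as.numeric(str_detect(text, fixed(keyword, ignore_case = TRUE)))
    })

    print(dim(wrong))
    print(right)
    ```

    Putting a matrix into data.frame() creates one column per matrix column, which would explain the repeated One_if_found.keyword1, One_if_found.keywordn columns.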

    Adding to my previous response.

    setwd(dirname(rstudioapi::getSourceEditorContext()$path)) # set the current script's location as working directory (works only inside RStudio)
    #install.packages(c("pdftools", "stringr", "dplyr"))
    library(pdftools)
    library(stringr)
    library(dplyr)
    
    ## Define the list of keywords/phrases here
    keywords <- c("information", "hairsser", "example phrase")
    
    # Function to count keywords in a PDF
    count_keywords <- function(pdf_path, keywords) {
      # Checks whether each entry of "keywords" occurs in the PDF file at "pdf_path".
      # Args:
      #   pdf_path: path to the PDF file that should be checked for keywords
      #   keywords: character vector of keywords, e.g. c("information", "hairsser", "example phrase")
      # Returns a data frame with one row per keyword (Found: 1 if found, 2 if not):
      #                PDF_name        Keyword Found
      # 1 somatosensory - 2.pdf    information     1
      # 2 somatosensory - 2.pdf       hairsser     2
      # 3 somatosensory - 2.pdf example phrase     2

      # Read the PDF once and collapse all pages into a single string
      pdf_content <- paste(pdf_text(pdf_path), collapse = " ")
      data.frame(
        PDF_name = basename(pdf_path),
        Keyword = keywords,
        # str_detect() is vectorised over the pattern: one TRUE/FALSE per keyword
        Found = if_else(
          str_detect(pdf_content, fixed(keywords, ignore_case = TRUE)),
          1, # one if true  -> string detected
          2  # two if false -> string NOT detected
        ),
        row.names = NULL
      )
    }
    
    # Check all PDF files stored in the directory "pdf_dir":
    # create a folder called "pdfs" on the same level as this script and store your PDFs there
    pdf_dir <- "pdfs"
    pdf_files <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)
    # Iterate over all PDF files in "pdf_files" and combine the per-file results with bind_rows
    final_results <- bind_rows(lapply(pdf_files, count_keywords, keywords = keywords))
    print(final_results)# Print the results
    

    Which results in

    > print(final_results)# Print the results
                   PDF_name        Keyword Found
    1 somatosensory - 2.pdf    information     1
    2 somatosensory - 2.pdf       hairsser     2
    3 somatosensory - 2.pdf example phrase     2
    4     somatosensory.pdf    information     1
    5     somatosensory.pdf       hairsser     2
    6     somatosensory.pdf example phrase     2
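
    The question also asks for one point per keyword that appears. Assuming the Found coding used here (1 = found, 2 = not found), a per-PDF score could be sketched in base R as follows. The Point and Score column names are made up for this illustration, and the data frame is rebuilt by hand from the printed output so that the snippet runs on its own:

    ```r
    # Hand-built copy of the printed final_results (illustration only)
    final_results <- data.frame(
      PDF_name = rep(c("somatosensory - 2.pdf", "somatosensory.pdf"), each = 3),
      Keyword  = rep(c("information", "hairsser", "example phrase"), times = 2),
      Found    = c(1, 2, 2, 1, 2, 2)
    )

    # One point for every keyword that was found, zero otherwise
    final_results$Point <- as.integer(final_results$Found == 1)

    # Sum the points per PDF
    scores <- aggregate(Point ~ PDF_name, data = final_results, FUN = sum)
    names(scores)[2] <- "Score"
    print(scores)
    ```

    Because each keyword contributes at most one row per PDF, the sum counts distinct keywords rather than occurrences.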
    

    Or apply the function count_keywords to a single PDF:

    # Or use "count_keywords" to check only one pdf
    # "pdfs/somatosensory - 2.pdf" is the path to the PDF file RELATIVE to the folder of this script:
    # this script
    # pdfs
    #   |_somatosensory - 2.pdf
    count_keywords("pdfs/somatosensory - 2.pdf", keywords = keywords)
    

    which results in:

                   PDF_name        Keyword Found
    1 somatosensory - 2.pdf    information     1
    2 somatosensory - 2.pdf       hairsser     2
    3 somatosensory - 2.pdf example phrase     2
    

    or for your example:

    one_pdf_result <- count_keywords("C:\\Users\\steve\\OneDrive\\Desktop\\CL\\156.pdf", keywords = c("keyword1", "keywordn"))
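
    A note on Windows paths: inside an R string literal every backslash has to be doubled, and a single backslash before a letter such as D is rejected as an unrecognized escape. Forward slashes, or raw strings (assuming R 4.0.0 or later), avoid the doubling. A small sketch with the path from the question (p1, p2 and p3 are illustrative names):

    ```r
    # Doubled backslashes
    p1 <- "C:\\Users\\steve\\OneDrive\\Desktop\\CL\\156.pdf"
    # Forward slashes are also accepted by R on Windows
    p2 <- "C:/Users/steve/OneDrive/Desktop/CL/156.pdf"
    # Raw string: backslashes are taken literally (needs R >= 4.0.0)
    p3 <- r"(C:\Users\steve\OneDrive\Desktop\CL\156.pdf)"

    # The doubled-backslash and raw-string forms are the same string
    print(identical(p1, p3))
    # Converting backslashes to forward slashes gives the second form
    print(identical(gsub("\\", "/", p1, fixed = TRUE), p2))
    ```

    Any of the three forms can be passed as pdf_path to count_keywords.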