
Extract a subpart of PDF text in R


I have a folder of .pdf files. For each file, I want to extract the first two paragraphs of text and store them in a .csv file. I can convert the PDF to text, but I can't extract the first two paragraphs.

This is what I have tried:

setwd("D/All_PDF_Files")
install.packages("pdftools")
install.packages("qdapRegex")
library(pdftools)
library(qdapRegex)
All_files=Sys.glob("*.pdf")
txt <- pdf_text("first.pdf")
cat(txt[1])
rm_between(txt, 'This ', '1. ', extract=TRUE)[[1]]

But this returns NA.

The output of cat(txt[1]) is:

"Maharashtra Real Estate Regulatory Authority
                                         REGISTRATION CERTIFICATE OF PROJECT
                                                             FORM 'C'
                                                           [See rule 6(a)]
This registration is granted under section 5 of the Act to the following project under project registration number :
P52100000255
Project: Ganga Legend A3 And B3.., Plot Bearing / CTS / Survey / Final Plot No.: Sr No 305 P , 306 P and 339 P ,
Village Bavdhan Budruk, Taluka Mulashi,District Pune at Pune (M Corp.), Pune City, Pune, 411001;
   1. Goel Ganga Developers (I) Pvt Ltd having its registered office / principal place of business at Tehsil: Pune City,
      District: Pune, Pin: 411001.
   2. This registration is granted subject to the following conditions, namely:­"

What I want to extract is this text:

This registration is granted under section 5 of the Act to the following project under project registration number :
P52100000255
Project: Ganga Legend A3 And B3.., Plot Bearing / CTS / Survey / Final Plot No.: Sr No 305 P , 306 P and 339 P ,
Village Bavdhan Budruk, Taluka Mulashi,District Pune at Pune (M Corp.), Pune City, Pune, 411001;

Is there a better approach to go with?


Solution

    library(pdftools)

    setwd("D/All_PDF_Files")
    All_files <- Sys.glob("*.pdf")

    df <- data.frame()
    for (i in seq_along(All_files))
    {
      file_name <- All_files[i]
      txt <- pdf_text(file_name)

      # Split the first page into lines. A vector `split` is recycled along x,
      # so for a single string only its first element would be used; one regex
      # covering all three line endings avoids that.
      page_lines <- unlist(strsplit(txt[1], split = "\r\n|\r|\n"))

      # Skip the first 4 header rows (you may need to adjust this count according to your use case)
      FirstPara <- page_lines[1 + 4]
      SecondPara <- page_lines[2 + 4]

      df <- rbind(df, data.frame(file_name, FirstPara, SecondPara, stringsAsFactors = FALSE))
    }
    df
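
The loop above picks lines by position, so it returns only two of the four lines the question asks for and depends on the header always being four rows long. A possible alternative, sketched below, is to match from the opening phrase up to the first numbered item with a regex that is allowed to span line breaks. This is only a sketch under assumptions: `extract_between()`, its marker defaults, and `out_file` are hypothetical names that are not part of pdftools or qdapRegex, and `page` is a hand-typed stand-in for `txt[1]`, abbreviated from the output shown in the question.

```r
# Stand-in for txt[1] as returned by pdf_text(); abbreviated from the question.
page <- paste(
  "Maharashtra Real Estate Regulatory Authority",
  "      REGISTRATION CERTIFICATE OF PROJECT",
  "               FORM 'C'",
  "            [See rule 6(a)]",
  "This registration is granted under section 5 of the Act to the following project under project registration number :",
  "P52100000255",
  "Project: Ganga Legend A3 And B3.., Plot Bearing / CTS / Survey / Final Plot No.: Sr No 305 P , 306 P and 339 P ,",
  "Village Bavdhan Budruk, Taluka Mulashi,District Pune at Pune (M Corp.), Pune City, Pune, 411001;",
  "   1. Goel Ganga Developers (I) Pvt Ltd having its registered office / principal place of business at Tehsil: Pune City,",
  "      District: Pune, Pin: 411001.",
  sep = "\n"
)

# Hypothetical helper: returns the text from `start` up to (not including)
# the line that begins with the numbered item matched by `end`.
extract_between <- function(text, start = "This registration", end = "\\s*1\\. ") {
  # (?s) lets "." match newlines; the lazy ".*?" stops at the first
  # position where the lookahead finds a newline followed by `end`.
  pattern <- paste0("(?s)", start, ".*?(?=\\n", end, ")")
  m <- regmatches(text, regexpr(pattern, text, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

intro <- extract_between(page)
cat(intro)

# Store the result as CSV; out_file is a hypothetical location.
result <- data.frame(file_name = "first.pdf", intro = intro,
                     stringsAsFactors = FALSE)
out_file <- file.path(tempdir(), "intro_text.csv")
write.csv(result, out_file, row.names = FALSE)
```

Inside the loop, the same helper could be called as `extract_between(txt[1])` in place of the two positional lookups. The marker patterns are guesses based on the single sample page, so they may need adjusting for other certificates.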