r tidyverse pdftools

Load only the names of many PDFs and make a data frame


I need to get the names of a large set of PDF files (36,000 files), but only the names, without loading the contents of each file. The end result should be a data frame of those names.

Link to 21 example files: https://drive.google.com/drive/folders/1zUKyVJFICq4Q69zs48wqFNq1UPDvCgbf?usp=sharing

I am using this code:

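# Set the working directory to the folder containing the PDFs first
library(pdftools)
library(tm)

# File names of all PDFs in the working directory
files <- list.files(pattern = "pdf$")
files

# Reads the full text of every PDF (slow for 36,000 files)
all <- lapply(files, pdf_text)
lapply(all, length)  # number of pages per file

# Builds a corpus, which also reads every PDF
x <- Corpus(URISource(files), readerControl = list(reader = readPDF))
x

class(x)

DAT_FINAL <- data.frame(text = sapply(x, as.character), stringsAsFactors = TRUE)
DAT_FINAL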

The idea is to have a data frame because I need to compare the numeric names with an Excel file to find the numbers that are missing among the documents.



Solution

  • A possible solution (instead of /tmp/PDFS/, use the path to the directory where your PDFs are stored):

    library(tidyverse)
    
    data.frame(pdfs = list.files("/tmp/PDFS/")) %>% 
      mutate(number = str_extract(pdfs, "^\\d+"), .before = pdfs)
    
    #>    number   pdfs
    #> 1       1  1.pdf
    #> 2      10 10.pdf
    #> 3      12 12.pdf
    #> 4      13 13.pdf
    #> 5      14 14.pdf
    #> 6      15 15.pdf
    #> 7      16 16.pdf
    #> 8      17 17.pdf
    #> 9      18 18.pdf
    #> 10     19 19.pdf
    #> 11      2  2.pdf
    #> 12     20 20.pdf
    #> 13     21 21.pdf
    #> 14     22 22.pdf
    #> 15     23 23.pdf
    #> 16      3  3.pdf
    #> 17      4  4.pdf
    #> 18      5  5.pdf
    #> 19      6  6.pdf
    #> 20      8  8.pdf
    #> 21      9  9.pdf
    
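Note that `str_extract` returns character values, which is why the rows above are in alphabetical order (1, 10, 12, ..., 2). If numeric order matters, one option is to convert the extracted part to integer and sort on it. The following is only a minimal base R sketch; the `pdfs` vector is a hypothetical stand-in for the result of `list.files()`, and it assumes every name starts with digits.

```r
# Hypothetical file names standing in for list.files("/tmp/PDFS/")
pdfs <- c("1.pdf", "10.pdf", "12.pdf", "2.pdf", "3.pdf")

# Keep only the leading digits and convert them to integers
# (a name without leading digits would become NA with a warning)
number <- as.integer(sub("^([0-9]+).*$", "\\1", pdfs))

df <- data.frame(number = number, pdfs = pdfs, stringsAsFactors = FALSE)

# Order the rows numerically rather than alphabetically
df <- df[order(df$number), ]
rownames(df) <- NULL
df
```

With an integer column, the later comparison against the numbers from the Excel file does not depend on how the names happen to sort as text.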

    Or using tidyr::extract:

    data.frame(pdfs = list.files("/tmp/PDFS/")) %>% 
      extract(pdfs, into = "number", "(\\d+)\\.pdf", remove = FALSE, convert = TRUE) %>% 
      select(number, pdfs)
    
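Once the numbers are integers, finding the ones that are missing can be done with a set difference. This is a rough sketch rather than a complete solution: both `pdfs` and `expected` are made-up example values, and in practice `expected` would be the relevant column of the Excel file (read, for instance, with `readxl::read_excel()`).

```r
# Hypothetical file names standing in for list.files("/tmp/PDFS/")
pdfs <- c("1.pdf", "2.pdf", "3.pdf", "5.pdf", "6.pdf", "8.pdf")

# Numbers actually present in the folder
found <- as.integer(sub("\\.pdf$", "", pdfs))

# Hypothetical expected numbers; in practice these would come from
# the Excel file
expected <- 1:8

# Numbers listed in the Excel file that have no matching PDF
missing_numbers <- setdiff(expected, found)
missing_numbers
#> [1] 4 7
```

Swapping the arguments, `setdiff(found, expected)`, would instead list PDFs that have no entry in the Excel file.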

    EDIT

    To answer a further question of the OP (see comments below):

    library(tidyverse)
    
    data.frame(pdfs = list.files("/tmp/PDFS/")) %>% 
      mutate(number = str_extract(pdfs, ".*(?=\\.pdf)"), .before = pdfs)
    
    #>    number      pdfs
    #> 1       1     1.pdf
    #> 2      10    10.pdf
    #> 3     10A   10A.pdf
    #> 4      12    12.pdf
    #> 5      13    13.pdf
    #> 6      14    14.pdf
    #> 7      15    15.pdf
    #> 8      16    16.pdf
    #> 9      17    17.pdf
    #> 10    17A   17A.pdf
    #> 11     18    18.pdf
    #> 12     19    19.pdf
    #> 13      2     2.pdf
    #> 14     20    20.pdf
    #> 15     21    21.pdf
    #> 16  21ABV 21ABV.pdf
    #> 17     22    22.pdf
    #> 18     23    23.pdf
    #> 19      3     3.pdf
    #> 20      4     4.pdf
    #> 21      5     5.pdf
    #> 22      6     6.pdf
    #> 23      8     8.pdf
    #> 24      9     9.pdf
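
If the names with letters (10A, 21ABV) still need to be matched against a numeric list, it may help to keep the numeric part and the suffix in separate columns. The sketch below is one possible way to do that in base R; the `pdfs` vector and the column names `id`, `number` and `suffix` are illustrative assumptions, not part of the original answer.

```r
# Hypothetical file names standing in for list.files("/tmp/PDFS/")
pdfs <- c("1.pdf", "10.pdf", "10A.pdf", "21ABV.pdf", "2.pdf")

# Full identifier: the file name without the .pdf extension
id <- sub("\\.pdf$", "", pdfs)

# Numeric part (leading digits) and whatever follows it
number <- as.integer(sub("^([0-9]+).*$", "\\1", id))
suffix <- sub("^[0-9]+", "", id)

df <- data.frame(id = id, number = number, suffix = suffix, pdfs = pdfs,
                 stringsAsFactors = FALSE)

# Sort by number first, then by suffix, so 10 comes before 10A
df <- df[order(df$number, df$suffix), ]
rownames(df) <- NULL
df
```

The `number` column can then be compared with the Excel values, while `id` keeps the original name for reference.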