I need to obtain the names of a large set of PDF files (36,000 files), but only the names, without loading the contents of each file. The end result should be a data frame of those names.
Link to 21 example files: https://drive.google.com/drive/folders/1zUKyVJFICq4Q69zs48wqFNq1UPDvCgbf?usp=sharing
I am using this code:
# Set the working directory to the folder that holds the PDFs first
library(pdftools)
library(tm)
files <- list.files(pattern = "pdf$")
files
all <- lapply(files, pdf_text)  # reads the text of every PDF
lapply(all, length)
x <- Corpus(URISource(files), readerControl = list(reader = readPDF))
x
class(x)  # a tm corpus object, not a plain character vector
DAT_FINAL <- data.frame(text = sapply(x, as.character), stringsAsFactors = TRUE)
DAT_FINAL
The idea is to have a data frame because I need to compare the numeric names against an Excel file to find the numbers that are missing between the documents.
Update:
A possible solution (instead of /tmp/PDFS/, use the path to the directory where your PDFs are placed):
library(tidyverse)
data.frame(pdfs = list.files("/tmp/PDFS/")) %>%
  mutate(number = str_extract(pdfs, "^\\d+"), .before = pdfs)
#>    number   pdfs
#> 1       1  1.pdf
#> 2      10 10.pdf
#> 3      12 12.pdf
#> 4      13 13.pdf
#> 5      14 14.pdf
#> 6      15 15.pdf
#> 7      16 16.pdf
#> 8      17 17.pdf
#> 9      18 18.pdf
#> 10     19 19.pdf
#> 11      2  2.pdf
#> 12     20 20.pdf
#> 13     21 21.pdf
#> 14     22 22.pdf
#> 15     23 23.pdf
#> 16      3  3.pdf
#> 17      4  4.pdf
#> 18      5  5.pdf
#> 19      6  6.pdf
#> 20      8  8.pdf
#> 21      9  9.pdf
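Note that the listing above is in alphabetical order (1, 10, 12, ..., 2, 20, ...), because file names are sorted as text. If a numeric order is preferred, a minimal base R sketch could look like the following. The file names are a made-up subset standing in for the result of list.files(), and the variable names are only illustrative:

```r
# Hypothetical file names standing in for list.files("/tmp/PDFS/")
pdfs <- c("1.pdf", "10.pdf", "12.pdf", "2.pdf", "20.pdf", "3.pdf")

# Keep the leading digits and convert them to integers
number <- as.integer(sub("^([0-9]+).*$", "\\1", pdfs))

df <- data.frame(number = number, pdfs = pdfs, stringsAsFactors = FALSE)

# Sort by the numeric value instead of by the text of the file name
df <- df[order(df$number), ]
rownames(df) <- NULL
df
```

Only the file names are touched here, so nothing is read from the PDFs themselves.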
Or using tidyr::extract:
data.frame(pdfs = list.files("/tmp/PDFS/")) %>%
  extract(pdfs, into = "number", "(\\d+)\\.pdf", remove = FALSE, convert = TRUE) %>%
  select(number, pdfs)
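Once the numbers are integers (as with convert = TRUE), finding the ones missing relative to the Excel list is a set difference. The sketch below is only an illustration under assumptions: `expected` is a stand-in for the Excel sheet with a hypothetical column `number` running from 1 to 23, and `found` mimics the 21 example files. In practice `expected` could be read with readxl::read_excel() from your own file.

```r
# Numbers found in the folder, mimicking the 21 example files
found <- data.frame(number = c(1:6, 8:10, 12:23))

# Stand-in for the Excel sheet (hypothetical column `number`);
# with a real file this could be readxl::read_excel("your_file.xlsx")
expected <- data.frame(number = 1:23)

# Numbers listed in the sheet that have no PDF
missing <- setdiff(expected$number, found$number)

# Numbers that have a PDF but are not listed in the sheet
extra <- setdiff(found$number, expected$number)

missing
extra
```

With these stand-in values, `missing` contains 7 and 11 and `extra` is empty.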
EDIT
To answer a further question from the OP (see comments below), about file names that also contain letters, such as 10A.pdf:
library(tidyverse)
data.frame(pdfs = list.files("/tmp/PDFS/")) %>%
  mutate(number = str_extract(pdfs, ".*(?=\\.pdf)"), .before = pdfs)
#>    number      pdfs
#> 1       1     1.pdf
#> 2      10    10.pdf
#> 3     10A   10A.pdf
#> 4      12    12.pdf
#> 5      13    13.pdf
#> 6      14    14.pdf
#> 7      15    15.pdf
#> 8      16    16.pdf
#> 9      17    17.pdf
#> 10    17A   17A.pdf
#> 11     18    18.pdf
#> 12     19    19.pdf
#> 13      2     2.pdf
#> 14     20    20.pdf
#> 15     21    21.pdf
#> 16  21ABV 21ABV.pdf
#> 17     22    22.pdf
#> 18     23    23.pdf
#> 19      3     3.pdf
#> 20      4     4.pdf
#> 21      5     5.pdf
#> 22      6     6.pdf
#> 23      8     8.pdf
#> 24      9     9.pdf
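If the numeric part and the letter suffix are needed separately (for example, to still compare the numbers against the Excel file), a possible base R sketch is shown below. The file names are a made-up subset, and the column names `id`, `number` and `suffix` are only illustrative:

```r
# Hypothetical file names standing in for list.files("/tmp/PDFS/")
pdfs <- c("10.pdf", "10A.pdf", "21ABV.pdf", "9.pdf")

# File name without the extension, e.g. "10A"
id <- tools::file_path_sans_ext(pdfs)

# Leading digits as an integer, and whatever follows them as a suffix
number <- as.integer(sub("^([0-9]+).*$", "\\1", id))
suffix <- sub("^[0-9]+", "", id)

df <- data.frame(id, number, suffix, pdfs, stringsAsFactors = FALSE)

# Sort numerically first, then by suffix
df <- df[order(df$number, df$suffix), ]
rownames(df) <- NULL
df
```

This keeps 10 and 10A next to each other while still sorting 9 before 10.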