I am trying to expand a text corpus that was made available to me. The corpus is stored as an .RDS file, and I need to add the text from 20 different PDF documents, with each PDF becoming its own document entry in the corpus.
These are all the packages I am using in the project:
These are the paths of all the PDFs I am trying to convert to text and add to the corpus:
pdf_paths <- c("NGODocuments/1234567_EPIC_NGO.pdf",
               "NGODocuments/F2662175_Allied-Startups_NGO.pdf",
               "NGODocuments/F2662292_Civil-Liberties_NGO.pdf",
               "NGODocuments/F2662654_PGEU_NGO.pdf",
               "NGODocuments/F2663061_Not-for-profit-law_NGO.pdf",
               "NGODocuments/F2663127_Eurocities_NGO.pdf",
               "NGODocuments/F2663268_European-Disability_NGO.pdf",
               "NGODocuments/F2663380_Information-Accountability_NGO.pdf",
               "NGODocuments/F2665208_Hospital-Pharmacy_NGO.pdf",
               "NGODocuments/F2665222_European-Radiology_NGO.pdf",
               "BusinessDocs/123_DeepMind_Business.pdf",
               "BusinessDocs/1234_LinedIn_Business.pdf",
               "BusinessDocs/12345_AVAAZ_Business.pdf",
               "BusinessDocs/F2488672_SAZKA_Business.pdf",
               "BusinessDocs/F2662492_Google_Business.pdf",
               "BusinessDocs/F2662771_SICK_Business.pdf",
               "BusinessDocs/F2662846_sanofi_Business.pdf",
               "BusinessDocs/F2662935_EnBV_Business.pdf",
               "BusinessDocs/F2662941_Siemens_Business.pdf",
               "BusinessDocs/F2662944_BlackBerry_Business.pdf")
This is the code I use to try to extract the text and then expand the corpus:
pdf_text <- lapply(pdf_paths, read_file)
corpus <- tm::Corpus(VectorSource(pdf_text))
prev_corpus <- readRDS("data_corpus_aiact.RDS")
new_corpus <- c(prev_corpus, corpus)
writeCorpus(new_corpus, filenames = pdf_paths)
However, when I run this code, I get an error at the new_corpus assignment:
Error: as.corpus() only works on corpus objects.
I have searched all over the web for a solution, but nothing I have found seems to work. I did try once with a package called pdftools, but I got an error when converting the PDFs to text, saying that the document had an illegal font weight, which is why I switched to readr.
The goal is to generate a new corpus that contains the content of the old corpus plus the new documents, and to save it as a new .RDS file.
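The error most likely occurs because prev_corpus is a quanteda corpus, while tm::Corpus() creates a tm corpus; c() then tries to convert the tm object with quanteda's as.corpus(), which only accepts quanteda corpus objects. Note also that readr::read_file() reads the raw bytes of a PDF rather than extracting its text. Here's how I would do it, with only quanteda and readtext.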
library("quanteda")
#> Package version: 3.3.0.9001
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
prev_corpus <- readRDS("~/Downloads/pdf documents/data_corpus_aiact.rds")
pdfpath <- "~/Downloads/pdf documents/PDF documents/NGODocuments/*.pdf"
new_corpus <- readtext::readtext(pdfpath,
                                 docvarsfrom = "filenames",
                                 docvarnames = c("id", "actor", "type_actor")) |>
    corpus()
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
#> PDF error: Invalid Font Weight
You have some oddities in the pdf files, but this is not uncommon. You should consider inspecting the texts to see if readtext::readtext() converted them correctly.
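One rough way to do that inspection is to flag texts that are empty, very short, or mostly non-alphabetic. The sketch below uses only base R; flag_suspect_texts() and its thresholds are hypothetical choices for illustration, not part of quanteda or readtext, and the toy vector stands in for the real texts.

```r
# Hypothetical helper (not part of quanteda or readtext): flags texts that are
# empty, very short, or that contain a low share of alphabetic characters.
flag_suspect_texts <- function(texts, min_chars = 200, min_alpha_share = 0.5) {
  n_chars <- unname(nchar(texts, type = "chars"))
  # Count letters by deleting every non-alphabetic character
  n_alpha <- unname(nchar(gsub("[^[:alpha:]]", "", texts), type = "chars"))
  alpha_share <- ifelse(n_chars > 0, n_alpha / n_chars, 0)
  data.frame(
    doc = if (is.null(names(texts))) seq_along(texts) else names(texts),
    n_chars = n_chars,
    alpha_share = round(alpha_share, 2),
    suspect = n_chars < min_chars | alpha_share < min_alpha_share,
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for the extracted PDF texts
toy_texts <- c(
  good = paste(rep("This feedback concerns the proposed regulation.", 10),
               collapse = " "),
  empty = "",
  garbled = paste(rep("#### 0101 //// 2222", 20), collapse = " ")
)
report <- flag_suspect_texts(toy_texts)
print(report)
```

With the real data, the input would be as.character(new_corpus), which returns the texts as a character vector named by docname. Anything flagged is only a candidate for a manual look, not proof of a failed conversion.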
Now we can change the document names to match what was in your RDS file:
docnames(new_corpus) <- with(docvars(new_corpus),
                             paste0(actor, " (", type_actor, ")"))
print(new_corpus, 2)
#> Corpus consisting of 40 documents and 3 docvars.
#> EPIC (NGO) :
#> " FEEDBACK OF THE ELECTRONIC PRIVACY INFORMATION CEN..."
#>
#> Allied-Startups (NGO) :
#> "Feedback reference F2662175 Submitted on 13 July 2021 Submit..."
#>
#> [ reached max_ndoc ... 38 more documents ]
head(docvars(new_corpus))
#>         id              actor type_actor
#> 1  1234567               EPIC        NGO
#> 2 F2662175    Allied-Startups        NGO
#> 3 F2662292    Civil-Liberties        NGO
#> 4 F2662654               PGEU        NGO
#> 5 F2663061 Not-for-profit-law        NGO
#> 6 F2663127         Eurocities        NGO
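To make it clearer where these three columns come from, here is a base-R sketch of the kind of splitting that docvarsfrom = "filenames" performs. It is an illustration only: parse_filename_docvars() is a hypothetical helper, not a readtext function, and it assumes the pieces are separated by "_" (which, as far as I know, is also the default of readtext's dvsep argument).

```r
# Hypothetical helper illustrating how file names can be split into docvars.
# It is not how readtext is implemented, only a sketch of the same idea.
parse_filename_docvars <- function(paths, varnames, sep = "_") {
  # Drop the directory and the extension, keeping e.g. "F2662654_PGEU_NGO"
  stems <- tools::file_path_sans_ext(basename(paths))
  parts <- strsplit(stems, sep, fixed = TRUE)
  # Every file name must yield exactly one piece per variable name
  stopifnot(all(lengths(parts) == length(varnames)))
  out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
  names(out) <- varnames
  out
}

paths <- c("NGODocuments/1234567_EPIC_NGO.pdf",
           "NGODocuments/F2662654_PGEU_NGO.pdf",
           "BusinessDocs/F2662492_Google_Business.pdf")
dv <- parse_filename_docvars(paths, c("id", "actor", "type_actor"))
print(dv)
```

This is also why multi-word actors in the file names use hyphens (Allied-Startups) rather than underscores: an extra underscore would produce a fourth piece. The docnames assigned above are built from the actor and type_actor fields obtained this way.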
Some of the new docnames will collide with docnames already in the old corpus, and in quanteda docnames must be unique. So:
# to avoid duplicated docids
duplicated_index <- which(docnames(new_corpus) %in% docnames(prev_corpus))
docnames(new_corpus)[duplicated_index] <-
    paste(docnames(new_corpus)[duplicated_index], "new")
Now we can simply combine them, and the + operator will automatically match up the docvar columns.
# combine the two
new_corpus <- prev_corpus + new_corpus
print(new_corpus, 0, 0)
#> Corpus consisting of 60 documents and 3 docvars.
head(docvars(new_corpus))
#>                                 actor type_actor   id
#> 1                          Access Now        NGO <NA>
#> 2                                 ACM        NGO <NA>
#> 3                      AlgorithmWatch        NGO <NA>
#> 4                               AVAAZ        NGO <NA>
#> 5                     Bits of Freedom        NGO <NA>
#> 6 Centre for Democracy and Technology        NGO <NA>
tail(docvars(new_corpus))
#>                  actor type_actor       id
#> 55           Impact-AI        NGO F2665589
#> 56         Croation-AI        NGO F2665590
#> 57               GLEIF        NGO F2665591
#> 58 Fraud-Corruption-AI        NGO F2665605
#> 59      Future-Society        NGO F2665611
#> 60   Climate-Change-AI        NGO F2665623
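Since the stated goal was a new .RDS file, the last step would be to save the combined corpus. The sketch below shows the whole combine-and-save round trip on two toy corpora, so that it runs without the PDF files. The toy texts, the object names old_corp and new_corp, and the output file name are assumptions for illustration; in the real project you would call saveRDS() on new_corpus with a path of your choosing.

```r
library("quanteda")

# Toy corpora standing in for prev_corpus (no id docvar) and the PDF corpus
old_corp <- corpus(c("Access Now (NGO)" = "Old text one.",
                     "ACM (NGO)" = "Old text two."),
                   docvars = data.frame(actor = c("Access Now", "ACM"),
                                        type_actor = "NGO"))
new_corp <- corpus(c("EPIC (NGO)" = "New text."),
                   docvars = data.frame(id = "1234567",
                                        actor = "EPIC",
                                        type_actor = "NGO"))

# Docnames are unique here, so the corpora can be combined directly;
# docvars missing from one side are filled with NA
combined <- old_corp + new_corp

# Hypothetical output location; replace with a path in your project
out_file <- file.path(tempdir(), "data_corpus_aiact_expanded.rds")
saveRDS(combined, out_file)

# Read the file back to confirm the round trip
reloaded <- readRDS(out_file)
print(reloaded, 0, 0)
docvars(reloaded)
```

Writing to a new file name rather than overwriting data_corpus_aiact.rds keeps the original corpus intact in case the PDF conversion needs to be redone.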
Created on 2023-05-15 with reprex v2.0.2