r corpus quanteda readr

How to expand an .RDS text corpus using R and the readr package


I am attempting to expand a text corpus that was made available to me. The corpus is stored as an .RDS file, and I need to add the text from 20 different PDF documents to it, with each PDF file becoming its own document entry in the corpus.

These are all the packages that I am using in the project:

These are the paths of all the PDFs I am trying to convert to text and add to the corpus:

pdf_paths <- c("NGODocuments/1234567_EPIC_NGO.pdf",
           "NGODocuments/F2662175_Allied-Startups_NGO.pdf",
           "NGODocuments/F2662292_Civil-Liberties_NGO.pdf",
           "NGODocuments/F2662654_PGEU_NGO.pdf",
           "NGODocuments/F2663061_Not-for-profit-law_NGO.pdf",
           "NGODocuments/F2663127_Eurocities_NGO.pdf",
           "NGODocuments/F2663268_European-Disability_NGO.pdf",
           "NGODocuments/F2663380_Information-Accountability_NGO.pdf",
           "NGODocuments/F2665208_Hospital-Pharmacy_NGO.pdf",
           "NGODocuments/F2665222_European-Radiology_NGO.pdf",
           "BusinessDocs/123_DeepMind_Business.pdf",
           "BusinessDocs/1234_LinedIn_Business.pdf",
           "BusinessDocs/12345_AVAAZ_Business.pdf",
           "BusinessDocs/F2488672_SAZKA_Business.pdf",
           "BusinessDocs/F2662492_Google_Business.pdf",
           "BusinessDocs/F2662771_SICK_Business.pdf",
           "BusinessDocs/F2662846_sanofi_Business.pdf",
           "BusinessDocs/F2662935_EnBV_Business.pdf", 
           "BusinessDocs/F2662941_Siemens_Business.pdf",
           "BusinessDocs/F2662944_BlackBerry_Business.pdf")

This is the code I use to try to extract the text and then expand the corpus:

pdf_text <- lapply(pdf_paths, read_file)
corpus <- tm::Corpus(VectorSource(pdf_text))

prev_corpus <- readRDS("data_corpus_aiact.RDS")
new_corpus <- c(prev_corpus, corpus)
writeCorpus(new_corpus, filenames = pdf_paths)

However, when I run this code, I run into an error at the new_corpus assignment, saying:

Error: as.corpus() only works on corpus objects.

I have searched all over the web for a solution, but nothing I have found seems to work. I did try a package called pdftools once, but I got an error when converting the PDFs to text, saying that the document had an illegal font weight, which is why I switched to readr.

The goal is to generate a new corpus that contains the content of the old corpus plus the new documents, and to save it as a new .RDS file.


Solution

  • Here's how I would do it, with only quanteda and readtext.

    library("quanteda")
    #> Package version: 3.3.0.9001
    #> Unicode version: 14.0
    #> ICU version: 71.1
    #> Parallel computing: 10 of 10 threads used.
    #> See https://quanteda.io for tutorials and examples.
    
    prev_corpus <- readRDS("~/Downloads/pdf documents/data_corpus_aiact.rds")
    pdfpath <- "~/Downloads/pdf documents/PDF documents/NGODocuments/*.pdf"
    
    new_corpus <- readtext::readtext(pdfpath, 
                                     docvarsfrom = "filenames",
                                     docvarnames = c("id", "actor", "type_actor")) |>
        corpus()
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    #> PDF error: Invalid Font Weight
    

    You have some oddities in the PDF files (hence the "Invalid Font Weight" messages), but this is not uncommon. You should consider inspecting the texts to see if readtext::readtext() converted them correctly.
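
    One way to do that inspection, as a minimal sketch. The corpus below is a hypothetical stand-in built from made-up strings (one of them deliberately empty) so that the snippet runs without the PDF files. With the real data you would apply the same calls to new_corpus.

    ```r
    library("quanteda")

    # Hypothetical stand-in for the corpus returned by readtext() |> corpus()
    toy_corpus <- corpus(c(doc_ok    = "Feedback reference submitted on 13 July 2021.",
                           doc_empty = "",
                           doc_short = "Page 1"))

    # Token counts per document; zero or very small counts suggest a failed conversion
    n_tokens <- ntoken(tokens(toy_corpus))
    print(n_tokens)

    # Names of the documents that came back with no tokens at all
    suspect <- names(n_tokens)[n_tokens == 0]
    print(suspect)

    # Look at the first characters of each text for obvious garbage
    print(substr(as.character(toy_corpus), 1, 40))
    ```

    Documents flagged this way are usually worth opening by hand, since a scanned or damaged PDF can yield an empty or garbled text without stopping the import.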

    Now we can change the document names to match what was in your RDS file:

    docnames(new_corpus) <- with(docvars(new_corpus),
                                 paste0(actor, " (", type_actor, ")"))
    print(new_corpus, 2)
    #> Corpus consisting of 40 documents and 3 docvars.
    #> EPIC (NGO) :
    #> "          FEEDBACK OF THE ELECTRONIC PRIVACY INFORMATION CEN..."
    #> 
    #> Allied-Startups (NGO) :
    #> "Feedback reference F2662175 Submitted on 13 July 2021 Submit..."
    #> 
    #> [ reached max_ndoc ... 38 more documents ]
    head(docvars(new_corpus))
    #>         id              actor type_actor
    #> 1  1234567               EPIC        NGO
    #> 2 F2662175    Allied-Startups        NGO
    #> 3 F2662292    Civil-Liberties        NGO
    #> 4 F2662654               PGEU        NGO
    #> 5 F2663061 Not-for-profit-law        NGO
    #> 6 F2663127         Eurocities        NGO
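
    For readers wondering where the id, actor and type_actor columns come from: with docvarsfrom = "filenames", the file name (without directory and extension) is split on a separator, which to my knowledge defaults to "_" in readtext() via its dvsep argument. The base R sketch below only mimics that splitting on two paths taken from the question. It is an illustration, not readtext's actual implementation, and the meta data frame is a name made up for this example.

    ```r
    # Two file paths copied from the question
    files <- c("NGODocuments/F2662175_Allied-Startups_NGO.pdf",
               "BusinessDocs/F2662492_Google_Business.pdf")

    # Drop the directory and the extension, then split each stem on "_"
    stems <- tools::file_path_sans_ext(basename(files))
    parts <- do.call(rbind, strsplit(stems, "_", fixed = TRUE))

    # Hypothetical data frame holding the three pieces per file
    meta <- data.frame(id = parts[, 1], actor = parts[, 2], type_actor = parts[, 3])

    # Same docname pattern as in the answer: "actor (type_actor)"
    meta$docname <- with(meta, paste0(actor, " (", type_actor, ")"))
    print(meta)
    ```

    This only works cleanly when every file name has exactly three underscore-separated parts, which is the case for the paths listed in the question.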
    

    Some of those will collide with old docnames, and in quanteda, these should be unique. So:

    # to avoid duplicated docnames
    duplicated_index <- which(docnames(new_corpus) %in% docnames(prev_corpus))
    docnames(new_corpus)[duplicated_index] <- 
        paste(docnames(new_corpus)[duplicated_index], "new")
    

    Now we can simply combine them, and the + operator will automatically match up the docvar columns.

    
    # combine the two
    new_corpus <- prev_corpus + new_corpus
    print(new_corpus, 0, 0)
    #> Corpus consisting of 60 documents and 3 docvars.
    head(docvars(new_corpus))
    #>                                 actor type_actor   id
    #> 1                          Access Now        NGO <NA>
    #> 2                                 ACM        NGO <NA>
    #> 3                      AlgorithmWatch        NGO <NA>
    #> 4                               AVAAZ        NGO <NA>
    #> 5                     Bits of Freedom        NGO <NA>
    #> 6 Centre for Democracy and Technology        NGO <NA>
    tail(docvars(new_corpus))
    #>                  actor type_actor       id
    #> 55           Impact-AI        NGO F2665589
    #> 56         Croation-AI        NGO F2665590
    #> 57               GLEIF        NGO F2665591
    #> 58 Fraud-Corruption-AI        NGO F2665605
    #> 59      Future-Society        NGO F2665611
    #> 60   Climate-Change-AI        NGO F2665623
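
    The question also asks for the result to be saved as a new .RDS file, which base R's saveRDS() can do. The sketch below is self-contained and uses two tiny made-up corpora (old_toy and new_toy, hypothetical names and texts) in place of prev_corpus and the PDF corpus, and a temporary file as an assumed output location. With the real data you would call saveRDS() on the combined corpus with a file name of your choice.

    ```r
    library("quanteda")

    # Hypothetical stand-ins for prev_corpus and the PDF-derived corpus
    old_toy <- corpus(c("EPIC (NGO)" = "Old feedback text."),
                      docvars = data.frame(actor = "EPIC", type_actor = "NGO"))
    new_toy <- corpus(c("EPIC (NGO)" = "New feedback text.",
                        "Google (Business)" = "Another new text."),
                      docvars = data.frame(id = c("1234567", "F2662492"),
                                           actor = c("EPIC", "Google"),
                                           type_actor = c("NGO", "Business")))

    # Rename colliding docnames, following the same approach as above
    dup <- which(docnames(new_toy) %in% docnames(old_toy))
    docnames(new_toy)[dup] <- paste(docnames(new_toy)[dup], "new")

    # Combine, then write to a temporary .rds file (assumed location)
    combined <- old_toy + new_toy
    rds_file <- tempfile(fileext = ".rds")
    saveRDS(combined, rds_file)

    # Read it back to confirm the round trip
    restored <- readRDS(rds_file)
    print(docnames(restored))
    ```

    Reading the file back with readRDS() is a cheap check that the saved object is still a quanteda corpus with unique document names.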
    

    Created on 2023-05-15 with reprex v2.0.2