r, pdf, pdftools

Read PDF table into R where rows have varying numbers of lines


I'm hoping to read the following PDF into a tidy data frame in R: PDF Table. The table stretches across 70+ pages.

I'm comfortable reading in tables where each cell occupies a single line, but I'm not sure how to extend that approach to tables whose rows span a varying number of lines.

Any help would be much appreciated!


Solution

  • I suggest using tabulizer, which is better suited to extracting tables from PDF files. Here is the code for the file you shared:

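    library(tabulizer)

    # Extract every table in the PDF; by default this returns a list of character matrices
    lst <- extract_tables(file = '8-31-2020 Paragraph IV Update_0.pdf')

    # Promote the first row of a matrix to column names, then drop that row
    renames <- function(x)
    {
      colnames(x) <- x[1, ]
      x <- x[-1, , drop = FALSE]
      return(as.data.frame(x))
    }

    # Apply to every extracted table
    lst21 <- lapply(lst, renames)

    # Bind all tables into a single data frame
    df <- do.call(rbind, lst21)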
    

    Output (some rows):

    head(df)
    
                                           DRUG NAME   DOSAGE FORM              STRENGTH
    1                               Abacavir Sulfate       Tablets                300 mg
    2                                       Abacavir Oral Solution              20 mg/mL
    3 Abacavir Sulfate, Dolutegravir\rand Lamivudine       Tablets  600 mg/50 mg/300\rmg
    4               Abacavir Sulfate and\rLamivudine       Tablets         600 mg/300 mg
    5   Abacavir Sulfate, Lamivudine\rand Zidovudine       Tablets 300 mg/150 mg/300\rmg
    6                            Abiraterone Acetate       Tablets                125 mg
              RLD/NDA DATE OF\rSUBMISSION NUMBER OF\rANDAs\rSUBMITTED 180-DAY\rSTATUS
    1   Ziagen\r20977           1/28/2009                           1        Eligible
    2   Ziagen\r20978          12/27/2012                           1        Eligible
    3 Triumeq\r205551           8/14/2017                           5                
    4  Epzicom\r21652           9/27/2007                           1        Eligible
    5 Trizivir\r21205           3/22/2011                           1        Eligible
    6   Yonsa\r210308           7/23/2018                           1                
      180-DAY\rDECISION\rPOSTING\rDATE DATE OF\rFIRST\rAPPLICANT\rAPPROVAL
    1                        2/11/2020                           6/18/2012
    2                        2/11/2020                           9/26/2016
    3                                                                     
    4                        2/11/2020                           9/29/2016
    5                        2/11/2020                           12/5/2013
    6                                                                     
      DATE OF FIRST\rCOMMERCIAL\rMARKETING BY\rFTF EXPIRATION\rDATE OF LAST\rQUALIFYING\rPATENT
    1                                    6/19/2012                                    5/14/2018
    2                                    9/15/2017                                    5/14/2018
    3                                                                                 12/8/2029
    4                                    9/29/2016                                    5/14/2018
    5                                   12/17/2013                                    5/14/2018
    6                                                                                 3/17/2034
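In the output above, cells and column names that wrapped onto several lines in the PDF carry an embedded `\r` where each line break was. One possible cleanup, sketched here on a small made-up data frame that only mimics the shape of that output, is to replace every run of line-break characters with a single space. The helper name `squash_breaks` is hypothetical and not part of tabulizer; this is a minimal base-R sketch, not a definitive solution for the real file.

```r
# Toy stand-in for the extracted data frame; "\r" marks a line break inside a cell.
df <- data.frame(
  c("Abacavir Sulfate", "Abacavir Sulfate and\rLamivudine"),
  c("300 mg", "600 mg/300 mg"),
  c("1/28/2009", "9/27/2007"),
  stringsAsFactors = FALSE
)
names(df) <- c("DRUG NAME", "STRENGTH", "DATE OF\rSUBMISSION")

# Hypothetical helper: collapse each run of carriage returns/newlines
# into one space and trim leading/trailing whitespace.
squash_breaks <- function(x) trimws(gsub("[\r\n]+", " ", x))

# Clean the column names, then every character column.
names(df) <- squash_breaks(names(df))
df[] <- lapply(df, function(col) {
  if (is.character(col)) squash_breaks(col) else col
})
```

If the columns of the real data frame come back as factors rather than character vectors, they would need converting with `as.character()` first, since the sketch only touches character columns.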