[SOLVED] Read PDF table into R where rows have varying numbers of lines

I'm hoping to read the following PDF into a tidy data frame within R: PDF Table. The table even stretches across 70+ pages.

I am comfortable reading in tables where each cell has one line, but I'm not sure how to extend that to cases where rows have a varying number of lines.

Any help would be much appreciated!

Solution

I would suggest using tabulizer, which is better suited to extracting tables from PDF files. Here is the code for the file you shared:

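library(tabulizer)

# Extract the tables from the PDF; the result is a list of character matrices
lst <- extract_tables(file = '8-31-2020 Paragraph IV Update_0.pdf')

# Format: promote the first row to column names, then drop it from the data
renames <- function(x)
{
  colnames(x) <- x[1, ]
  # x[-1, ] also behaves correctly when a table holds only the header row
  x <- x[-1, , drop = FALSE]
  return(as.data.frame(x))
}

# Apply to every extracted table
lst21 <- lapply(lst, renames)

# Bind all tables into a single data frame
df <- do.call(rbind, lst21)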

Output (some rows):

head(df)

                                       DRUG NAME   DOSAGE FORM              STRENGTH
1                               Abacavir Sulfate       Tablets                300 mg
2                                       Abacavir Oral Solution              20 mg/mL
3 Abacavir Sulfate, Dolutegravir\rand Lamivudine       Tablets  600 mg/50 mg/300\rmg
4               Abacavir Sulfate and\rLamivudine       Tablets         600 mg/300 mg
5   Abacavir Sulfate, Lamivudine\rand Zidovudine       Tablets 300 mg/150 mg/300\rmg
6                            Abiraterone Acetate       Tablets                125 mg
          RLD/NDA DATE OF\rSUBMISSION NUMBER OF\rANDAs\rSUBMITTED 180-DAY\rSTATUS
1   Ziagen\r20977           1/28/2009                           1        Eligible
2   Ziagen\r20978          12/27/2012                           1        Eligible
3 Triumeq\r205551           8/14/2017                           5                
4  Epzicom\r21652           9/27/2007                           1        Eligible
5 Trizivir\r21205           3/22/2011                           1        Eligible
6   Yonsa\r210308           7/23/2018                           1                
  180-DAY\rDECISION\rPOSTING\rDATE DATE OF\rFIRST\rAPPLICANT\rAPPROVAL
1                        2/11/2020                           6/18/2012
2                        2/11/2020                           9/26/2016
3                                                                     
4                        2/11/2020                           9/29/2016
5                        2/11/2020                           12/5/2013
6                                                                     
  DATE OF FIRST\rCOMMERCIAL\rMARKETING BY\rFTF EXPIRATION\rDATE OF LAST\rQUALIFYING\rPATENT
1                                    6/19/2012                                    5/14/2018
2                                    9/15/2017                                    5/14/2018
3                                                                                 12/8/2029
4                                    9/29/2016                                    5/14/2018
5                                   12/17/2013                                    5/14/2018
6                                                                                 3/17/2034
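
The \r characters in the output above mark the places where a cell or header wrapped onto several lines in the PDF. As a possible follow-up step (not part of the original answer), they can be replaced with spaces using base R. The sketch below uses a small stand-in data frame built from a few of the values shown above so that it runs on its own; clean_breaks is a hypothetical helper name, and in practice you would apply the last two lines to the real df.

```r
# Stand-in for df: a few values copied from the output above (not the full table)
df <- data.frame(
  a = c("Abacavir Sulfate", "Abacavir Sulfate and\rLamivudine"),
  b = c("Ziagen\r20977", "Epzicom\r21652"),
  c = c("1/28/2009", "9/27/2007"),
  stringsAsFactors = FALSE
)
colnames(df) <- c("DRUG NAME", "RLD/NDA", "DATE OF\rSUBMISSION")

# Hypothetical helper: turn embedded line breaks into single spaces
clean_breaks <- function(x) {
  x <- gsub("[\r\n]+", " ", x)  # carriage returns / newlines become a space
  trimws(gsub(" +", " ", x))    # collapse repeated spaces and trim the ends
}

# Clean the column names, then every cell (columns are coerced to character)
names(df) <- clean_breaks(names(df))
df[] <- lapply(df, function(col) clean_breaks(as.character(col)))
```

Whether a space is the right replacement depends on the column: for RLD/NDA the break separates the brand name from the application number, so splitting that column in two may be preferable to joining it.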