I'm hoping to read the following PDF into a tidy data frame in R: PDF Table. The table stretches across 70+ pages.
I'm comfortable reading in tables where each cell has a single line, but I'm not sure how to handle tables where the cells in a row span a varying number of lines.
Any help would be much appreciated!
I would suggest using tabulizer, which is well suited to extracting tables from PDF files. Here is the code for the file you shared:
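library(tabulizer)

# Extract all tables; by default this returns a list of character matrices
lst <- extract_tables(file = '8-31-2020 Paragraph IV Update_0.pdf')

# Format: promote the first row of each matrix to column names
renames <- function(x) {
  colnames(x) <- x[1, ]
  # Negative indexing also works when a matrix holds only the header row
  x <- x[-1, , drop = FALSE]
  as.data.frame(x)
}

# Apply to every extracted table
lst21 <- lapply(lst, renames)

# Bind all tables into a single data frame
df <- do.call(rbind, lst21)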
Output (some rows):
head(df)
                                       DRUG NAME   DOSAGE FORM              STRENGTH
1                               Abacavir Sulfate       Tablets                300 mg
2                                       Abacavir Oral Solution              20 mg/mL
3 Abacavir Sulfate, Dolutegravir\rand Lamivudine       Tablets  600 mg/50 mg/300\rmg
4               Abacavir Sulfate and\rLamivudine       Tablets         600 mg/300 mg
5   Abacavir Sulfate, Lamivudine\rand Zidovudine       Tablets 300 mg/150 mg/300\rmg
6                            Abiraterone Acetate       Tablets                125 mg
          RLD/NDA DATE OF\rSUBMISSION NUMBER OF\rANDAs\rSUBMITTED 180-DAY\rSTATUS
1   Ziagen\r20977           1/28/2009                           1        Eligible
2   Ziagen\r20978          12/27/2012                           1        Eligible
3 Triumeq\r205551           8/14/2017                           5
4  Epzicom\r21652           9/27/2007                           1        Eligible
5 Trizivir\r21205           3/22/2011                           1        Eligible
6   Yonsa\r210308           7/23/2018                           1
  180-DAY\rDECISION\rPOSTING\rDATE DATE OF\rFIRST\rAPPLICANT\rAPPROVAL
1                        2/11/2020                           6/18/2012
2                        2/11/2020                           9/26/2016
3
4                        2/11/2020                           9/29/2016
5                        2/11/2020                           12/5/2013
6
  DATE OF FIRST\rCOMMERCIAL\rMARKETING BY\rFTF EXPIRATION\rDATE OF LAST\rQUALIFYING\rPATENT
1                                    6/19/2012                                    5/14/2018
2                                    9/15/2017                                    5/14/2018
3                                                                                 12/8/2029
4                                    9/29/2016                                    5/14/2018
5                                   12/17/2013                                    5/14/2018
6                                                                                 3/17/2034
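The \r sequences in the column names and cells above are the line breaks from the multi-line cells in the PDF. If you want them gone, one possible cleanup is sketched below. It runs on a small stand-in data frame whose values are copied from the output above, so it works without the PDF. The names clean_breaks and df_demo are made up for illustration and are not part of tabulizer. Replacing each break with a single space is an assumption about what you want; you may prefer to split a cell such as the RLD/NDA one into two columns instead.

```r
# Stand-in for the extracted data: a few cells copied from the output above
df_demo <- data.frame(
  c("Abacavir Sulfate and\rLamivudine", "Abiraterone Acetate"),
  c("Epzicom\r21652", "Yonsa\r210308"),
  c("9/27/2007", "7/23/2018"),
  stringsAsFactors = FALSE
)
names(df_demo) <- c("DRUG NAME", "RLD/NDA", "DATE OF\rSUBMISSION")

# Hypothetical helper: replace each run of carriage returns or newlines
# with a single space
clean_breaks <- function(x) {
  gsub("[\r\n]+", " ", x)
}

# Clean the column names, then every column
names(df_demo) <- clean_breaks(names(df_demo))
df_demo[] <- lapply(df_demo, clean_breaks)

# Optional: parse the month/day/year strings into Date values
df_demo[["DATE OF SUBMISSION"]] <- as.Date(df_demo[["DATE OF SUBMISSION"]],
                                           format = "%m/%d/%Y")

str(df_demo)
```

The same two cleanup lines should apply to the full df from the answer, since as.data.frame() on a character matrix yields character (or factor) columns that gsub() can handle.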