I'm trying to import this PDF, https://www.mountwashington.org/uploads/forms/2018/01.pdf, into R and get it formatted as a data frame. Is there an efficient way to deal with the odd multi-row headers and keep just the main column headers (not the larger grouping headers like location and station)?
I was able to get what I wanted by converting the PDF to an Excel file with an online converter, manually editing the columns/rows in Excel, and then importing it into R, but this was very inefficient and I would like to do it entirely in R. I tried the tabulizer package, but it gave me the data as unorganized character vectors.
This is what I'd like it to look like:
> a
DAY MAX MIN AVG NORM DEPART HEAT COOL TOTAL..EQUIV. SNOW...ICE AVG.WIND.SPEED..MPH. FASTEST.SPEED DIR
1 1 -14 -25 -19 6 -25 84 0 0.00 0.0 55.3 79 310 (NW)
2 2 -7 -23 -15 6 -21 80 0 0.01 0.7 53.8 84 280 (W)
3 3 7 -7 0 6 -6 65 0 T T 39.2 64 280 (W)
And this is what I was able to get with tabulizer:
[,1]
[1,] "WS FORM F-6"
[2,] ""
[3,] "PRELIMINARY LOCAL CLIMATOLOGICAL DATA"
[4,] ""
[5,] "LATITUDE LONGITUDE"
[6,] "44 DEGREES16 MINUTESNORTH 71 DEGREES 18 MINUTES WEST"
[7,] "TEMPERATURE (°F) PRECIPITATION (IN.)"
[8,] "DEGREE DAYS TOTAL SNOW & SNOW/ICE ON AVG"
[9,] "DAY MAX MIN AVG NORM DEPART HEAT COOL (EQUIV) ICE GROUND-7AM SPEED"
[10,] "1 -14 -25 -19 6 -25 84 0 0.00 0.0 23 55.3"
and then many more lines of similarly unorganized data that seemed to be pulled at random from the page.
Any help would be great, thanks!
You can use tabulizer to extract the table: use locate_areas() to find the coordinates of the area you want to extract.
library(tabulizer)

# I used locate_areas("https://www.mountwashington.org/uploads/forms/2018/01.pdf")
# to find the area of the table to extract.
# The area coordinates are given as c(top, left, bottom, right).
mw_table <- extract_tables(
  "https://www.mountwashington.org/uploads/forms/2018/01.pdf",
  output = "data.frame",
  area = list(c(103.49321, 15.79171, 402.56716, 586.74627)),
  guess = FALSE
)

mw_table[[1]]
Then you just need to fix the column names of the data frame.
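For example, here's a minimal sketch of that cleanup, assuming the extraction returns the columns in the order shown in your desired output (the names below are my own choice; adjust them, and their count, to match what you actually get back):

df <- mw_table[[1]]

# Hypothetical names matching the desired output above;
# adjust so the count equals ncol(df).
names(df) <- c("DAY", "MAX", "MIN", "AVG", "NORM", "DEPART", "HEAT", "COOL",
               "TOTAL_EQUIV", "SNOW_ICE", "AVG_WIND_SPEED_MPH",
               "FASTEST_SPEED", "DIR")

# The values come back as character, so convert columns to numeric where
# possible; columns containing trace values ("T") will stay character
# unless you recode them first.
df[] <- lapply(df, function(x) type.convert(x, as.is = TRUE))
head(df)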