r tabulizer

Importing PDF tables into R with weird headers


I'm trying to import this PDF, https://www.mountwashington.org/uploads/forms/2018/01.pdf , into R and format it as a data frame. Is there an efficient way to deal with the weird headers and keep just the main column headers (not the larger group headers such as location and station)?

I was able to get what I wanted by converting the PDF to an Excel file with a converter website, manually editing the columns and rows in Excel, and then importing the result into R. That was very inefficient, and I would like to do the whole thing in R. I tried the tabulizer package, but it returned the data as character strings and completely unorganized.

This is what I'd like it to look like:

> a
   DAY MAX MIN AVG NORM DEPART HEAT COOL TOTAL..EQUIV. SNOW...ICE AVG.WIND.SPEED..MPH. FASTEST.SPEED      DIR
1    1 -14 -25 -19    6    -25   84    0          0.00        0.0                 55.3            79 310 (NW)
2    2  -7 -23 -15    6    -21   80    0          0.01        0.7                 53.8            84  280 (W)
3    3   7  -7   0    6     -6   65    0             T          T                 39.2            64  280 (W)

And this is what I was able to get with tabulizer:

 [,1]                                                                                                                                       
 [1,] "WS FORM F-6"                                                                                                                              
 [2,] ""                                                                                                                                         
 [3,] "PRELIMINARY LOCAL CLIMATOLOGICAL DATA"                                                                                                    
 [4,] ""                                                                                                                                         
 [5,] "LATITUDE LONGITUDE"                                                                                                                       
 [6,] "44 DEGREES16 MINUTESNORTH 71 DEGREES  18 MINUTES  WEST"                                                                                   
 [7,] "TEMPERATURE (°F) PRECIPITATION (IN.)"                                                                                                    
 [8,] "DEGREE DAYS TOTAL SNOW & SNOW/ICE ON AVG"                                                                                                 
 [9,] "DAY MAX MIN AVG NORM DEPART HEAT COOL (EQUIV) ICE GROUND-7AM SPEED"                                                                       
[10,] "1 -14 -25 -19 6 -25 84 0 0.00 0.0 23 55.3"

followed by many more lines of unorganized data that seem to have been pulled randomly from the page.

Any help would be great, thanks!


Solution

  • You can use tabulizer to extract the table. Use locate_areas to find the coordinates of the area to extract.

    library(tabulizer)
    
    # I used locate_areas("https://www.mountwashington.org/uploads/forms/2018/01.pdf") 
    # to find the area of the table to extract
    
    mw_table <- extract_tables(
      "https://www.mountwashington.org/uploads/forms/2018/01.pdf",
      output = "data.frame",
      area =  list(c(103.49321,  15.79171, 402.56716, 586.74627)),
      guess = FALSE
      )
    
    mw_table[[1]]
    

    Then you just need to rename the columns of the data frame.
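
The renaming step could look roughly like the sketch below. It is only an illustration: `raw` is a small hand-made stand-in for `mw_table[[1]]`, and its column names (`V1` to `V4`), the chosen subset of headers, and the name `TOTAL_EQUIV` are assumptions, not what tabulizer actually returns for this PDF. Check `names(mw_table[[1]])` and the number of columns before assigning your own names.

```r
# Hypothetical stand-in for mw_table[[1]]; the real extraction will have
# more columns and its own auto-generated names.
raw <- data.frame(
  V1 = c("1", "2", "3"),
  V2 = c("-14", "-7", "7"),
  V3 = c("-25", "-23", "-7"),
  V4 = c("0.00", "0.01", "T"),
  stringsAsFactors = FALSE
)

# Assign the main headers by position; the vector length must match ncol(raw).
names(raw) <- c("DAY", "MAX", "MIN", "TOTAL_EQUIV")

# Convert the purely numeric columns. Columns that contain "T" are left as
# character, because as.numeric() would turn those entries into NA.
numeric_cols <- c("DAY", "MAX", "MIN")
raw[numeric_cols] <- lapply(raw[numeric_cols], as.numeric)

str(raw)
```

Assigning names by position is simple but fragile: if the extraction area changes and a column is dropped or split, the names will shift, so it is worth comparing a few rows against the PDF afterwards.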