htmlrdataframeparsingtext

Converting text file from a webite in a table


I am trying to transform an htm file containing text tables into a dataframe. I went through previous questions here and here, but didn't solve my issue. The table gets messed up.

Following a reproducible example.

# URL of the website 
  url <- paste0("https://www.uspto.gov/web/offices/ac/ido/oeip/taf/st_co_93.htm") 
  
# Read the HTML code of the page regarding the interested table
  html_code <- paste(readLines(url))[44:99]
  
# Transform text into a table
  table_df <- read.table(text = html_code, skip = 3, fill = NA, 
                         col.names = c("CODE", "STATE/COUNTRY", "UTILITY", "DESIGN", "PLANT","REISSUE","TOTALS","SIRS"))
  

Solution

  • It's a fixed width format that you can parse with read.fwf() or readr::read_fwf(), latter tries to guess column widths for you and generally works quite well.
    If you are OK with hardcoded indices, this should do:

    library(readr)
    l <- read_lines("https://www.uspto.gov/web/offices/ac/ido/oeip/taf/st_co_93.htm")
    
    # section where first table starts
    stringr::str_view(l)[40:50]
    #> [40] │ 
    #> [41] │ 
    #> [42] │ STATE-COUNTRY COUNTS FROM CALENDAR YEAR 1993 PATENT FILE
    #> [43] │ 
    #> [44] │ MAIL
    #> [45] │ CODE   STATE/COUNTRY    UTILITY   DESIGN   PLANT   REISSUE   SIRS    TOTALS
    #> [46] │ 
    #> [47] │ AL     ALABAMA          271       51       1       1         0       324     
    #> [48] │ AK     ALASKA           50        14       0       0         0       64      
    #> [49] │ AZ     ARIZONA          848       75       0       5         1       929     
    #> [50] │ AR     ARKANSAS         114       36       4       0         0       154
    
    # a fixed-width-format, readr::read_fwf() provides convenient column width guessing
    l[47:101] |> 
      I() |> 
      read_fwf() |> 
      setNames(strsplit(paste(l[44:45], collapse = "_"), "\\s+")[[1]])
    #> Rows: 55 Columns: 8
    #> ── Column specification ────────────────────────────────────────────────────────
    #> 
    #> chr (2): X1, X2
    #> dbl (6): X3, X4, X5, X6, X7, X8
    #> 
    #> ℹ Use `spec()` to retrieve the full column specification for this data.
    #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    

    Result:

    #> # A tibble: 55 × 8
    #>    MAIL_CODE `STATE/COUNTRY` UTILITY DESIGN PLANT REISSUE  SIRS TOTALS
    #>    <chr>     <chr>             <dbl>  <dbl> <dbl>   <dbl> <dbl>  <dbl>
    #>  1 AL        ALABAMA             271     51     1       1     0    324
    #>  2 AK        ALASKA               50     14     0       0     0     64
    #>  3 AZ        ARIZONA             848     75     0       5     1    929
    #>  4 AR        ARKANSAS            114     36     4       0     0    154
    #>  5 CA        CALIFORNIA         8170   1211   159      28    11   9579
    #>  6 CO        COLORADO            910    140     0       4     3   1057
    #>  7 CT        CONNECTICUT        1544    194     0       8     3   1749
    #>  8 DE        DELAWARE            507     13     1       0     4    525
    #>  9 FL        FLORIDA            1777    316    12       9     3   2117
    #> 10 GA        GEORGIA             705    154     0       2     1    862
    #> # ℹ 45 more rows
    

    Created on 2024-11-14 with reprex v2.1.1