r tabulizer

Extract the "column name" of a data frame scraped by tabulapdf in R


I am using tabulizer/tabulapdf to scrape a table from a PDF. My scripts worked a few months ago, but now I'm getting a data frame in a form I'm unfamiliar with, and it's throwing errors. The issue seems to be that the original table does not have a header row, but tabulapdf treats the first row as one, and I can't extract the data from the "header" of the data frame.

Here is the data frame from dput():

scraped_data <- list(
  structure(
    list(
      `NORTH AMERICA. PREMIER LEAGUE (W) (25.10.2022)` = "Team A 2 – 0 Team B"
    ),
    row.names = c(NA, -1L),
    spec = structure(
      list(
        cols = list(
          `NORTH AMERICA. PREMIER LEAGUE (W) (25.10.2024)` = structure(
            list(),
            class = c("collector_character", "collector")
          )
        ),
        default = structure(
          list(),
          class = c("collector_guess", "collector")
        ),
        delim = "\t"
      ),
      class = "col_spec"
    ),
    # problems = <pointer: 0x1347fb760>, # This throws an error when trying to assign
                                         # to `new_scraped_data`. Commenting out.
    class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame")
  )
)

Previously, I would extract the elements using scraped_data[[1]][1,1] and scraped_data[[1]][2,1]. Is there a way to preserve this functionality so I don't have to rewrite all the code (there are more tables like this)?

So I'm looking for something along the lines of:

scraped_data |> turn_header_into_a_row()

Solution

  • Extract the column names from the first element of the list and prepend them as a new first row.

    header_row <- colnames(scraped_data[[1]])
    scraped_data[[1]] <- rbind(header_row, scraped_data[[1]])
    

    Note that the column name stays the same, so the header text now appears both as the column name and as the first row.
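Since the question mentions more tables like this one, the same idea can be wrapped in a helper and applied to every element of the list. What follows is a minimal sketch in base R, not something tabulapdf provides: the name `turn_header_into_a_row()` is taken from the question, the replacement column names `V1`, `V2`, ... are an arbitrary choice, and every column is converted to character so the header text can share a column with the data.

```r
# Hypothetical helper (not part of tabulapdf): moves each table's column
# names into a new first row and replaces them with generic names.
turn_header_into_a_row <- function(tables) {
  lapply(tables, function(tbl) {
    # The column names hold what was really the first data row.
    header <- as.data.frame(as.list(colnames(tbl)), stringsAsFactors = FALSE)

    # Convert every column to character so it can be stacked under the header.
    body <- as.data.frame(lapply(tbl, as.character), stringsAsFactors = FALSE)

    # Give both pieces the same placeholder names so rbind() can match them.
    names(body) <- names(header) <- paste0("V", seq_along(tbl))

    rbind(header, body)
  })
}

# Example with a stand-in for the scraped list (made-up header text).
example_data <- list(
  data.frame(
    `LEAGUE HEADER` = "Team A 2 - 0 Team B",
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
)

fixed <- turn_header_into_a_row(example_data)
fixed[[1]][1, 1]  # the former header text
fixed[[1]][2, 1]  # the original first row
```

The helper returns plain data frames rather than tibbles, so `[1, 1]` yields a bare string instead of a 1 x 1 tibble. If the existing code relies on tibble behaviour, the result would need to be converted back.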