I am using tabulizer/tabulapdf to scrape a table from a PDF. My scripts worked a few months ago, but now I'm getting a data frame in an unfamiliar form, and it's throwing errors. The issue seems to be that the original table does not have a header row, but tabulapdf is treating the first row as one, so I can't extract the data from the "header" of the data frame.
Here is the data frame from dput():
scraped_data <- list(
  structure(
    list(
      `NORTH AMERICA. PREMIER LEAGUE (W) (25.10.2022)` = "Team A 2 – 0 Team B"
    ),
    row.names = c(NA, -1L),
    spec = structure(
      list(
        cols = list(
          `NORTH AMERICA. PREMIER LEAGUE (W) (25.10.2024)` = structure(
            list(),
            class = c("collector_character", "collector")
          )
        ),
        default = structure(
          list(),
          class = c("collector_guess", "collector")
        ),
        delim = "\t"
      ),
      class = "col_spec"
    ),
    # problems = <pointer: 0x1347fb760>, # This throws an error when trying to assign
    # to `new_scraped_data`. Commenting out.
    class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame")
  )
)
Previously, I would extract the elements using scraped_data[[1]][1,1] and scraped_data[[1]][2,1]. Is there a way to preserve this functionality so I don't have to rewrite all the code (there are more tables like this)?
So I'm looking for something along the lines of:
scraped_data |> turn_header_into_a_row()
Extract the header (the column names) from the first list element and add it back as a new first row:
header_row <- colnames(scraped_data[[1]])
scraped_data[[1]] <- rbind(header_row, scraped_data[[1]])
Note that the column name remains the same, so the header text now appears both as the column name and as the first row.
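Since the question mentions several tables and asks for something like turn_header_into_a_row(), the same idea can be wrapped in a helper and applied to every list element. This is only a sketch under a few assumptions: the function name is hypothetical (taken from the question, not from tabulapdf), every list element is a data frame or tibble, and it is acceptable to convert all columns to character and to return plain data frames so that [i, j] gives back a single value. The example input is made up for illustration.

```r
# Hypothetical helper; the name comes from the question, not from tabulapdf.
turn_header_into_a_row <- function(tables) {
  lapply(tables, function(tbl) {
    # Assumption: dropping the tibble classes is fine, so [i, j] returns a value.
    df <- as.data.frame(tbl, stringsAsFactors = FALSE)
    # Make every column character so the header text can sit in the same column.
    df[] <- lapply(df, as.character)
    # Build a one-row data frame that holds the column names as data.
    header <- as.data.frame(as.list(names(df)), stringsAsFactors = FALSE)
    names(header) <- names(df)
    # Put the former header on top and reset the row names.
    out <- rbind(header, df)
    rownames(out) <- NULL
    out
  })
}

# Made-up input standing in for the list returned by the extraction step.
example_tables <- list(
  data.frame(`LEAGUE X (01.01.2000)` = "Team A 2 - 0 Team B",
             check.names = FALSE, stringsAsFactors = FALSE)
)

fixed <- turn_header_into_a_row(example_tables)
fixed[[1]][1, 1]  # the former header text
fixed[[1]][2, 1]  # the first real data row
```

With the real data this would be called as scraped_data <- turn_header_into_a_row(scraped_data), after which the existing scraped_data[[1]][1,1] and scraped_data[[1]][2,1] lookups should index the former header and the first data row respectively.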