rdataframemergedata-cleaningtabulizer

Merge multiple rows of dataframe together if followed by an empty row in R


I have the following dataframe:

location <- "https://www.mofa.go.jp/announce/info/conferment/pdfs/2013_sp.pdf"
out <- tabulizer::extract_tables(location)
final <- do.call(rbind, out)
final <- as.data.frame(final) %>%
  janitor::row_to_names(row_number = 2) %>%
  janitor::clean_names()

Unfortunately, due to extraction issues with tabulizer::extract_table (see this thread), the dataframe is unclean.

A data point spans over multiple rows, followed by empty rows (rows 20 and 26 in the screenshot):

enter image description here

Is it possible to automatically merge mutliple rows together into a single row if they have an empty row afterwards (or if there is no row afterwards, as in the last row of the dataframe)?

In other words, rows 13-19 should form a single row, and rows 21-25 should also serve as a single row. The columns are correct.

I would be grateful for your help!


Solution

  • Data is messy because you can have empty rows between same group (rows 126 and 127). I've defined starting of a group when decoration != "". It would be easier to define groups with nationality because it has ( in it (problem are people from Taiwan).

    library(tidyverse)
    library(data.table)
    
    tidyPage <- function(dt){
      
      setDT(dt)
      dt <- dt[, map(.SD, as.character)]
      dt[, flag := !decoration == ""]
      dt <- dt[which.max(flag):.N]
      dt[, group := rleid(flag)]
      dt[flag == TRUE, flag := c(TRUE, rep(FALSE, .N - 1)), by = group]
      dt[, group := cumsum(flag)]
      
      split(dt, dt$group) %>%
        map_dfr(~map_chr(select(.x, -flag, -group), str_c, collapse = " ")) %>%
        mutate(across(where(is.character), str_squish))
    }
    
    location <- "https://www.mofa.go.jp/announce/info/conferment/pdfs/2013_sp.pdf"
    out <- tabulizer::extract_tables(location) %>%
      map(~
        as.data.frame(.x) %>%
          janitor::row_to_names(row_number = 2) %>%
          janitor::clean_names()
      ) %>%
      map_dfr(tidyPage)