rpdftools

Cleaning downloaded pdf dataset in R


I have downloaded the pdf file from this site (from the Table tab) and want to clean the dataset in R and convert it to a csv or excel file.

I am using pdftools package and have downloaded the other required packages. I want to focus on the data for the age groups. So far I have the dataset narrowed down by using these codes.

#Load the dataset 
PDF1 <- pdf_text("agegr_1-4-21.pdf") %>%
  readr::read_lines() #open the PDF inside your project folder
PDF1
PDF1.grass <-PDF1[-c(1:10,17:19)] # remove lines
PDF1.grass
write.table(PDF1.grass, file="docd_pdf.csv", sep=",", row.names=FALSE)

all_stat_lines <- PDF1.grass 

pdf_transpose = t(all_stat_lines)
write.table(pdf_transpose, file="docd_pdf.csv", sep=",", row.names=FALSE)

df <- plyr::ldply(pdf_transpose) #create a data frame
head(df)

However the data frame that I am getting includes everything on one variable. Is there is a way to efficiently break up the datasets and have different columns for the age groups? I downloaded the pdf file from the site and named it agegr_1-4-21.pdf.

The output I am getting is

enter image description here


Solution

  • One approach to achieve this is via tidyr::extract. I first extract the header from the first row and second extract the data from the other rows.

    library(dplyr)
    
    
    regex_header <- paste0(
      "^(\\w+)\\s+",
      paste(rep("(\\d+\\-\\d+ years)", 7), collapse = "\\s+"), "\\s+",
      "(\\d+\\+ years)\\s+",
      "(\\w+)"
    )
    
    header <- tidyr::extract(data = slice(df, 1), col = V1, into = paste0("var", 1:10), regex = regex_header) %>%
      t() %>%
      .[, 1]
    
    regex_body <- paste0("^([\\w\\*]+)\\s+", paste(rep("([\\d,\\.]+)", 9), collapse = "\\s+"))
    
    tidyr::extract(data = slice(df, 2:nrow(df)), col = V1, into = header, regex = regex_body)
    #>        Outcome 0-17 years 18-29 years 30-39 years 40-49 years 50-59 years
    #> 1         Case      2.090       3.435       2.706       2.190       1.887
    #> 2 Hospitalized         20          81         133         188         264
    #> 3         Died          0           4           4          11          36
    #> 4    Missing**        612       1.740       1.369       1.076       1.013
    #> 5  Gesamtsumme      2.722       5.260       4.212       3.465       3.200
    #>   60-69 years 70-79 years 80+ years Gesamtsumme
    #> 1       1.218         504       224      14.254
    #> 2         299         219       151       1.355
    #> 3          58          83       110         306
    #> 4         674         295       208       6.987
    #> 5       2.249       1.101       693      22.902
    

    DATA For the data I downloaded one of the tables and cleaned it using your code.

    df <- structure(list(V1 = c(
      "Outcome                 0-17 years       18-29 years     30-39 years      40-49 years   50-59 years     60-69 years 70-79 years     80+ years Gesamtsumme",
      "Case                      2.090            3.435            2.706            2.190        1.887           1.218        504            224        14.254",
      "Hospitalized                20               81              133              188          264             299         219            151         1.355",
      "Died                         0                4                4               11           36              58          83            110          306",
      "Missing**                  612             1.740            1.369            1.076        1.013            674         295            208         6.987",
      "Gesamtsumme               2.722            5.260            4.212            3.465        3.200           2.249       1.101           693        22.902"
    )), class = "data.frame", row.names = c(NA, -6L))