I have downloaded the pdf file from this site (from the Table tab) and want to clean the dataset in R and convert it to a csv or excel file.
I am using pdftools package and have downloaded the other required packages. I want to focus on the data for the age groups. So far I have the dataset narrowed down by using these codes.
#Load the dataset
PDF1 <- pdf_text("agegr_1-4-21.pdf") %>%
readr::read_lines() #open the PDF inside your project folder
PDF1
PDF1.grass <-PDF1[-c(1:10,17:19)] # remove lines
PDF1.grass
write.table(PDF1.grass, file="docd_pdf.csv", sep=",", row.names=FALSE)
all_stat_lines <- PDF1.grass
pdf_transpose = t(all_stat_lines)
write.table(pdf_transpose, file="docd_pdf.csv", sep=",", row.names=FALSE)
df <- plyr::ldply(pdf_transpose) #create a data frame
head(df)
However the data frame that I am getting includes everything on one variable. Is there is a way to efficiently break up the datasets and have different columns for the age groups? I downloaded the pdf file from the site and named it agegr_1-4-21.pdf.
The output I am getting is
One approach to achieve this is via tidyr::extract
. I first extract the header from the first row and second extract the data from the other rows.
library(dplyr)
regex_header <- paste0(
"^(\\w+)\\s+",
paste(rep("(\\d+\\-\\d+ years)", 7), collapse = "\\s+"), "\\s+",
"(\\d+\\+ years)\\s+",
"(\\w+)"
)
header <- tidyr::extract(data = slice(df, 1), col = V1, into = paste0("var", 1:10), regex = regex_header) %>%
t() %>%
.[, 1]
regex_body <- paste0("^([\\w\\*]+)\\s+", paste(rep("([\\d,\\.]+)", 9), collapse = "\\s+"))
tidyr::extract(data = slice(df, 2:nrow(df)), col = V1, into = header, regex = regex_body)
#> Outcome 0-17 years 18-29 years 30-39 years 40-49 years 50-59 years
#> 1 Case 2.090 3.435 2.706 2.190 1.887
#> 2 Hospitalized 20 81 133 188 264
#> 3 Died 0 4 4 11 36
#> 4 Missing** 612 1.740 1.369 1.076 1.013
#> 5 Gesamtsumme 2.722 5.260 4.212 3.465 3.200
#> 60-69 years 70-79 years 80+ years Gesamtsumme
#> 1 1.218 504 224 14.254
#> 2 299 219 151 1.355
#> 3 58 83 110 306
#> 4 674 295 208 6.987
#> 5 2.249 1.101 693 22.902
DATA For the data I downloaded one of the tables and cleaned it using your code.
df <- structure(list(V1 = c(
"Outcome 0-17 years 18-29 years 30-39 years 40-49 years 50-59 years 60-69 years 70-79 years 80+ years Gesamtsumme",
"Case 2.090 3.435 2.706 2.190 1.887 1.218 504 224 14.254",
"Hospitalized 20 81 133 188 264 299 219 151 1.355",
"Died 0 4 4 11 36 58 83 110 306",
"Missing** 612 1.740 1.369 1.076 1.013 674 295 208 6.987",
"Gesamtsumme 2.722 5.260 4.212 3.465 3.200 2.249 1.101 693 22.902"
)), class = "data.frame", row.names = c(NA, -6L))