I figured out how to scrape this PDF, but I have a lot of these files to go through. My intention is to turn this into a function, import the data from all of the PDFs (one PDF per month for several years), and then rbind() them into one data table that I can write out as a CSV.
This works.
library(tidyverse)
library(tabulizer)
#import the data
jan16s_raw <- extract_tables("https://www.nvsos.gov/sos/home/showdocument?id=4062")
#create data frame
cleanNvsen <- do.call(rbind, jan16s_raw)
cleanNvsen2 <- as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])
#rename all of the columns
names(cleanNvsen2)[1] <- "District"
names(cleanNvsen2)[2] <- "Democrat"
names(cleanNvsen2)[3] <- "Independent American"
names(cleanNvsen2)[4] <- "Libertarian"
names(cleanNvsen2)[5] <- "Nonpartisan"
names(cleanNvsen2)[6] <- "Other"
names(cleanNvsen2)[7] <- "Republican"
names(cleanNvsen2)[8] <- "Total"
#check to see if it worked
head(cleanNvsen2)
But when I wrap the same steps in a function, the result is a 1 x 1 data frame.
library(tidyverse)
library(tabulizer)
#load data
jan16s_raw <- extract_tables("https://www.nvsos.gov/sos/home/showdocument?id=4062")
#create function to create data frame and then rename
clean <- function(x) {
  cleanNvsen <- do.call(rbind, x)
  cleanNvsen2 <- as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])
  names(cleanNvsen2)[1] <- "District"
  names(cleanNvsen2)[2] <- "Democrat"
  names(cleanNvsen2)[3] <- "Independent American"
  names(cleanNvsen2)[4] <- "Libertarian"
  names(cleanNvsen2)[5] <- "Nonpartisan"
  names(cleanNvsen2)[6] <- "Other"
  names(cleanNvsen2)[7] <- "Republican"
  names(cleanNvsen2)[8] <- "Total"
}
x2 <- clean(jan16s_raw)
head(x2)
I'd really like to get this to work so that I can just feed R the URLs and then run the clean function I've created. I have dozens of files to go through.
You can write the clean function to take a URL, extract the data, and rename the columns. You can also rename all the columns at once rather than renaming them individually.
clean <- function(url) {
  #extract the tables from the PDF at the given url
  raw <- extract_tables(url)
  #create data frame
  cleanNvsen <- do.call(rbind, raw)
  cleanNvsen2 <- as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])
  #rename all of the columns at once
  names(cleanNvsen2) <- c("District", "Democrat", "Independent American",
                          "Libertarian", "Nonpartisan", "Other", "Republican", "Total")
  return(cleanNvsen2)
}
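You can test it on a single file first; for example, using the January 2016 URL from the question (jan16 is just an illustrative name):
jan16 <- clean('https://www.nvsos.gov/sos/home/showdocument?id=4062')
head(jan16)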
Create a vector of all the URLs from which you want to extract the data.
list_of_urls <- c('https://www.nvsos.gov/sos/home/showdocument?id=4062',
'https://www.nvsos.gov/sos/home/showdocument?id=4064')
Then call the clean function on each URL and combine the results into one data frame.
all_data <- purrr::map_df(list_of_urls, clean)
#OR
#all_data <- do.call(rbind, lapply(list_of_urls, clean))
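Since the end goal is a single CSV, you can then write the combined table to disk with readr's write_csv (readr is loaded as part of the tidyverse; the filename here is just a placeholder):
#write the combined data to a csv (example filename)
write_csv(all_data, 'nv_voter_registration.csv')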