html, r, pdf, pagedown

Using the R pagedown package to extract webpages as PDFs without pop-ups and cookie warnings


So a friend of mine has written over 800 articles for a food blog, and I want to extract all of them as PDFs so that I can bind them nicely and gift them to him. There are far too many articles to use Chrome's "Save as PDF" manually, so I am looking for the cleanest way to loop through the links and save each page in that format. I have a working solution; however, the final PDFs have ugly ads and cookie-warning banners on every single page. I don't see these when I manually print to PDF from Chrome. Is there a way to pass settings to Chromium through pagedown so that it prints without these elements? My code is below, with the website in question.

library(rvest)
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
library(downloader)

# Specifying the URL of the first author page to be scraped

url1 <- paste0('https://www.foodrepublic.com/author/george-embiricos/page/', '1', '/')

# Reading the HTML code from the website
webpage1 <- read_html(url1)

# Pull the links for all articles on George's initial author page

dat <- html_attr(html_nodes(webpage1, 'a'), "href") %>%
  as_tibble() %>%
  filter(str_detect(value, "([0-9]{4})")) %>%
  unique() %>%
  rename(link=value)

# Pull the links for all articles on George's 2nd-89th author pages

for (i in 2:89) {

  url <- paste0('https://www.foodrepublic.com/author/george-embiricos/page/', i, '/')

  # Reading the HTML code from the website
  webpage <- read_html(url)

  links <- html_attr(html_nodes(webpage, 'a'), "href") %>%
    as_tibble() %>%
    filter(str_detect(value, "([0-9]{4})")) %>%
    unique() %>%
    rename(link=value)

  dat <- bind_rows(dat, links) %>%
    unique()

}

dat <- dat %>%
  arrange(link)

# Form a 1-link vector to test with

tocollect <- dat$link[1]

pagedown::chrome_print(input = tocollect,
                       wait = 20,
                       format = "pdf",
                       verbose = 0,
                       timeout = 300)
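
For what it's worth, chrome_print() does take print-level settings: its options argument is a list passed to Chrome's Page.printToPDF command, and extra_args adds command-line flags for the Chromium binary. As far as I can tell these only control how the page is rendered to PDF, not which elements appear on it, so they don't get rid of the ads or banners on their own. A sketch with illustrative values:

pagedown::chrome_print(input = tocollect,
                       wait = 20,
                       format = "pdf",
                       # Page.printToPDF parameters, e.g. print backgrounds and A4 paper size
                       options = list(printBackground = TRUE,
                                      paperWidth = 8.27,
                                      paperHeight = 11.69),
                       # command-line flags handed to Chromium
                       extra_args = c("--disable-gpu"),
                       timeout = 300)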

Solution

  • I would rather strip the page of all the elements you do not need (especially the scripts, while keeping the stylesheets), save the result as a temporary HTML file and then print that. The written HTML file looks fine in the browser; I could not test the printing itself, though (a sketch for running this over all of the collected links follows the code):

    library(xml2)  # for xml_find_all(), xml_remove() and write_html()

    # articleUrls: character vector of article links, e.g. dat$link from above
    for (l in articleUrls) {
      a <- read_html(l)

      # drop headers, footers, ads, sign-up boxes and scripts; the stylesheets stay
      xml_remove(a %>% xml_find_all("//aside"))
      xml_remove(a %>% xml_find_all("//footer"))
      xml_remove(a %>% xml_find_all("//*[contains(@class, 'article-related mb20')]"))
      xml_remove(a %>% xml_find_all("//*[contains(@class, 'tags')]"))
      xml_remove(a %>% xml_find_all("//script"))
      xml_remove(a %>% xml_find_all("//*[contains(@class, 'ad box')]"))
      xml_remove(a %>% xml_find_all("//*[contains(@class, 'newsletter-signup')]"))
      xml_remove(a %>% xml_find_all("//*[contains(@class, 'article-footer')]"))
      xml_remove(a %>% xml_find_all("//*[contains(@class, 'article-footer-sidebar')]"))
      xml_remove(a %>% xml_find_all("//*[contains(@class, 'site-footer')]"))
      xml_remove(a %>% xml_find_all("//*[contains(@class, 'sticky-newsletter')]"))
      xml_remove(a %>% xml_find_all("//*[contains(@class, 'site-header')]"))

      write_html(a, file = "currentArticle.html")

      pagedown::chrome_print(input = "currentArticle.html")
    }
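
    Once a single article looks right, the same clean-up can be wrapped in a loop over every link collected in the question. A rough sketch, assuming dat$link holds the article URLs; the slug-based file name and the tryCatch() wrapper are just one way to get unique output files and to keep one failing page from stopping the run (as written, the loop above would overwrite currentArticle.pdf on every pass):

      articleUrls <- dat$link

      for (l in articleUrls) {
        # build a unique PDF name from the URL, e.g. "2017-03-01-some-article.pdf"
        slug <- gsub("[^[:alnum:]]+", "-", sub("^https?://www\\.foodrepublic\\.com/", "", l))
        out  <- paste0(sub("-+$", "", slug), ".pdf")

        tryCatch({
          a <- read_html(l)
          xml_remove(a %>% xml_find_all("//script"))
          xml_remove(a %>% xml_find_all("//*[contains(@class, 'ad box')]"))
          # ... the rest of the clean-up from above ...
          write_html(a, file = "currentArticle.html")
          pagedown::chrome_print(input = "currentArticle.html",
                                 output = out,
                                 timeout = 300)
        }, error = function(e) message("Skipping ", l, ": ", conditionMessage(e)))
      }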