pdf, url, automation, download, iteration

How to download multiple pages as PDFs if I have a list of URLs?


I'd like to download a lot of web pages (these in particular consist of lines of text with occasional images) as PDFs, but there are too many to do manually. The URLs are easy to iterate over, as they have the form "https://www.(site).com/(stuff)/(number)", where (site) and (stuff) are static and only the number changes. Is there a way to download all the pages from number n to m, using Chrome's standard print to PDF or any other method? I searched the internet a bit, but didn't find much that could help. I can code a bit in Python, C, CSS and HTML, but if I need another language I'm ready to learn it. P.S.: I'm sorry if the post is a bit dry, but it's my first and I'm not sure what to write. Thanks in advance!


Solution

  • The answer follows from the URL pattern you specified:

    https://www.(site).com/(stuff)/(number), where (site) and (stuff) are fixed and only the number changes.


    So it is as simple as 1, 2, 3: create a loop in your shell and call your browser on each iteration.

    I am using Windows, so my %chrome% is an alias for MS Edge, but both browsers are built on the same code base. I have left the page header enabled; how it is switched off varies, so you would need to check the command-line options your browser supports. (Search this site for headless no-header print-to-pdf: https://stackoverflow.com/search?q=headless+no-header+print-to-pdf )

    for /l %i in (1,1,3) do @%chrome% --headless --print-to-pdf="%cd%\%i.pdf" https://www.example.com/stuff/%i
    

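    In the command above, (n,1,m) is the range (start, step, end) of the integer %i, while %cd% is the current working directory. The save location should be fully qualified or you may get a blank output, and on Windows a path that may contain spaces should be written with quotes, as in --print-to-pdf="%cd%\%i.pdf". The line is written for typing at the command prompt; inside a batch file, use %%i instead of %i.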
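
Since the question mentions Python, the same loop can be sketched there as well. This is only a minimal sketch under assumptions: the browser path in BROWSER and the URL pattern in URL_TEMPLATE are placeholders to adjust, and build_command and download_range are hypothetical helper names, not part of any library. It uses the same --headless and --print-to-pdf switches as the command above and resolves the output path to an absolute one, for the reason given about fully qualified save locations.

```python
import subprocess
from pathlib import Path

# Assumption: location of a Chromium-based browser; adjust for your machine.
BROWSER = r"C:\Program Files\Google\Chrome\Application\chrome.exe"
# Assumption: placeholder URL pattern; {n} is the page number that changes.
URL_TEMPLATE = "https://www.example.com/stuff/{n}"


def build_command(browser, n, out_dir):
    """Build the argument list for printing page n to <out_dir>/<n>.pdf."""
    # Resolve to an absolute path so the browser gets a fully qualified location.
    out_file = Path(out_dir).resolve() / f"{n}.pdf"
    return [
        browser,
        "--headless",
        f"--print-to-pdf={out_file}",
        URL_TEMPLATE.format(n=n),
    ]


def download_range(browser, start, stop, out_dir="."):
    """Print every page from start to stop (inclusive) to its own PDF."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for n in range(start, stop + 1):
        # Passing a list avoids shell quoting problems with spaces in paths.
        subprocess.run(build_command(browser, n, out_dir), check=True)
```

Calling download_range(BROWSER, 1, 3) would then correspond to the (1,1,3) loop above. Because the arguments are passed as a list rather than a single shell string, the output path needs no extra quoting even if it contains spaces.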