curl · web-scraping · browser · elinks

How to get the HTML source code of a web page


I was using curl to scrape HTML from a certain website. Then they changed their server settings and curl could no longer get the page content, failing with error code 1020, so I changed my script to use elinks.

But they are now using Cloudflare, and elinks no longer works either (only on this particular website); it gives the same error code 1020.

Is there any command-line tool or option to drive other browsers (firefox, chromium, google-chrome, ...) and get the page HTML in a terminal?


Solution

  • If you can write scripts for Node.js, here is a small example using the puppeteer library. It prints the page's source code after the page has loaded in a headless (invisible) Chrome, including any dynamic content generated by page scripts:

    import puppeteer from 'puppeteer';
    
    // Launch an invisible (headless) Chrome instance.
    const browser = await puppeteer.launch({ headless: true, defaultViewport: null });
    
    try {
      // Reuse the tab that Chrome opens on startup.
      const [page] = await browser.pages();
    
      // Wait until the network is idle so that page scripts have run.
      await page.goto('https://example.org/', { waitUntil: 'networkidle0' });
    
      // Print the fully rendered HTML.
      console.log(await page.content());
    } catch (err) {
      console.error(err);
    } finally {
      await browser.close();
    }
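
    To run it, install the library and execute the script with Node.js; the file name scrape.mjs below is just an example:

        npm install puppeteer
        node scrape.mjs

  • Alternatively, headless Chromium/Chrome can print the rendered DOM straight from the command line; depending on the distribution, the binary may be called chromium, chromium-browser, or google-chrome:

        chromium --headless --dump-dom https://example.org/

    Note that Cloudflare's bot protection may still detect and block headless browsers, so neither approach is guaranteed to get past error 1020.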