web-scraping wget mirror

Wget Mirror HTML only


I have a small website that I want to mirror to my local machine, keeping only the HTML files: no images, attachments, PDFs, etc.

I have never mirrored a website before, so I thought it would be a good idea to ask before doing anything catastrophic.

This is the command I want to run. Should anything else be added?

wget --mirror <url> 

Thanks!


Solution

  • The -R (--reject) and -A (--accept) options take a comma-separated list of file name suffixes or patterns to reject or accept.

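    Also consider the bandwidth used to download a whole website. You may want to add --wait together with --random-wait: --random-wait only varies the delay set by --wait (between 0.5 and 1.5 times that value), so it has no effect on its own.

    If you want to skip all images and PDFs, your command will look something like:

    wget --mirror --wait=1 --random-wait -R gif,jpg,jpeg,png,pdf <url>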
    

    Note: mirroring a website may go against the site's policy, so I suggest you check its terms of use first.

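If you would rather review the command before letting it loose on a site, here is a minimal sketch of a dry-run wrapper. The function name mirror_html, the REJECT and DRY_RUN variables, the 1-second wait, the 200k rate limit, and the example.com URL are all assumptions for illustration, not something from the question; adjust them to your site.

```shell
#!/bin/sh
# Sketch only: mirror_html, REJECT and DRY_RUN are hypothetical names.

# File name suffixes to skip (illustrative list; extend as needed).
REJECT="gif,jpg,jpeg,png,pdf"

mirror_html() {
    url=$1
    # Build the wget invocation as the positional parameters.
    # --wait/--random-wait space out requests; --limit-rate caps bandwidth.
    set -- wget --mirror --wait=1 --random-wait --limit-rate=200k \
        --reject="$REJECT" "$url"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        # Dry run (the default): print the command instead of running it.
        echo "$*"
    else
        "$@"
    fi
}

# Prints the command; nothing is downloaded while DRY_RUN is unset or 1.
mirror_html "https://example.com/"
```

Once the printed command looks right, running the script with DRY_RUN=0 in the environment performs the actual mirror.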