wget warc

wget --warc-file --recursive, prevent writing individual files


I run wget to create a warc archive as follows:

$ wget --warc-file=/tmp/epfl --recursive --level=1 http://www.epfl.ch/

$ ls -lh /tmp/epfl.warc.gz
-rw-r--r--  1 david  wheel   657K Sep  2 15:18 /tmp/epfl.warc.gz

$ find .
./www.epfl.ch/index.html
./www.epfl.ch/public/hp2013/css/homepage.70a623197f74.css
[...]

I only need the epfl.warc.gz file. How do I prevent wget from creating all the individual files?

I tried as follows:

$ wget --warc-file=/tmp/epfl --recursive --level=1 --output-document=/dev/null http://www.epfl.ch/
ERROR: -k or -r can be used together with -O only if outputting to a regular file.

Solution

  • tl;dr Add the options --delete-after and --no-directories.

    Option --delete-after instructs wget to delete each downloaded file immediately after its download is complete. As a consequence, the maximum disk usage during execution will be the size of the WARC file plus the size of the single largest downloaded file.

    Option --no-directories prevents wget from leaving behind a useless tree of empty directories. By default wget creates a directory tree that mirrors the one on the host, and downloads each file into the appropriate directory of the mirrored tree. wget does this even when the downloaded file is temporary due to --delete-after. To prevent that, use option --no-directories.

    The following demonstrates the result, using your example (slightly altered).

    $ cd $(mktemp -d)
    $ wget --delete-after --no-directories \
      --warc-file=epfl --recursive --level=1 http://www.epfl.ch/
    ...
    Total wall clock time: 12s
    Downloaded: 22 files, 1.4M in 5.9s (239 KB/s)
    $ ls -lhA
    -rw-rw-r--. 1 chadv chadv 1.5M Aug 31 07:55 epfl.warc
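
To confirm that the pages really ended up in the archive even though no individual files remain, you can list the URLs recorded in the WARC headers. The following is only a sketch: list_warc_urls is a hypothetical helper name (not part of wget), and it assumes each record carries a WARC-Target-URI header line, as the WARC format specifies for request and response records.

```shell
# list_warc_urls FILE
# Hypothetical helper (not a wget feature): print the distinct
# WARC-Target-URI header lines found in a WARC file.
list_warc_urls() {
    # Decompress only when the file name ends in .gz; otherwise read as-is.
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac |
        # -a makes grep treat the archive as text even if it holds binary payloads.
        grep -a '^WARC-Target-URI:' |
        # WARC header lines end in CRLF; drop the carriage returns.
        tr -d '\r' |
        sort -u
}
```

For the run above you would call it as list_warc_urls epfl.warc; for a compressed archive, pass the .warc.gz file instead.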
    

    If you forget to use --no-directories, you can clean up the leftover tree of empty directories with find . -type d -empty -delete.
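
If you want to try the cleanup before running it next to a real archive, here is a small sketch on a throwaway tree. The directory names only imitate the mirrored layout from the question, and the -empty and -mindepth 1 tests are additions of this sketch: they restrict find to empty directories below the starting point, so the WARC file and the working directory itself are never touched.

```shell
# Build a throwaway tree resembling what wget leaves behind without
# --no-directories: empty mirrored directories next to the WARC file.
workdir=$(mktemp -d)
mkdir -p "$workdir/www.epfl.ch/public/hp2013/css"
touch "$workdir/epfl.warc.gz"

# -empty matches only empty directories. -delete implies depth-first
# traversal, so a parent emptied by an earlier deletion is removed too.
find "$workdir" -mindepth 1 -type d -empty -delete

# Only the archive should be left.
ls -A "$workdir"
```

Against a real download, run the same find from the directory that holds the WARC file, with . in place of "$workdir".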