http download wget

How to (re)download file with wget only when the file is newer or the size changed?


I am downloading an archive with wget. How can I make wget redownload that file only when the file is newer on the server or its size has changed?

I'm aware of the -N flag but it doesn't work.


Solution

  • TL;DR There is a critical bug introduced in or around wget 1.17 that broke this feature.

    The server must support HEAD requests and provide both a timestamp (Last-Modified) and a size (Content-Length).

    Debugging

    Use the -d flag to display the request and response headers for debugging.

    wget --version
    wget -N -d https://example.com/file.zip
    truncate --size 1 file.zip
    wget -N -d https://example.com/file.zip
    

    In older versions, where this used to work, wget sends a HEAD request to obtain the last-modified time and the file size; if either has changed, wget then sends a GET request (without If-Modified-Since) to download the file.

    In newer versions, wget sends a single GET request with an If-Modified-Since header, so that the file is downloaded only if it has changed since that date. Unfortunately, that doesn't work in practice.

    The new behavior is broken by design: it cannot detect changes in file size and, as a side effect, it prevents wget from recovering after a partial or interrupted download.

    When it receives an HTTP GET request with a timestamp, the server can respond with a 304 Not Modified status, with no content and no Content-Length header. This leaves wget no way to learn the file size or to redownload the file.

    # wget 1.21 on Ubuntu 22, broken
    wget -N https://example.com/file.zip -d
    truncate --size 1 file.zip
    wget -N https://example.com/file.zip -d
    
    ---request begin---
    GET /file.zip HTTP/1.1
    Host: example.com
    If-Modified-Since: Thu, 31 Aug 2023 18:22:20 GMT
    User-Agent: Wget/1.21.2
    Accept: */*
    Accept-Encoding: identity
    Connection: Keep-Alive
    
    ---request end---
    HTTP request sent, awaiting response...
    ---response begin---
    HTTP/1.1 304 Not Modified
    Date: Wed, 06 Sep 2023 09:10:16 GMT
    Connection: keep-alive
    Last-Modified: Thu, 31 Aug 2023 18:22:20 GMT
    ETag: f37ffefc58f99f0b996a38154d87820344d86d41
    Accept-Ranges: bytes
    Content-Disposition: attachment; filename="file.zip"; filename*=UTF-8''file.zip
    
    ---response end---
    304 Not Modified
    Registered socket 3 for persistent reuse.
    File ‘file.zip’ not modified on server. Omitting download.
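
    The size comparison that the older HEAD-first behavior performed can be approximated by hand. The following is only a minimal sketch, not wget's own logic: the helper names remote_size and size_differs are made up for illustration, the URL and file name are placeholders, and it assumes the server answers HEAD requests with a Content-Length header. It checks the size only, not the timestamp.

    ```shell
    #!/bin/sh
    # remote_size (hypothetical helper): read HTTP response headers on stdin
    # and print the last Content-Length value seen (the last one matters
    # when redirects produce several header blocks).
    remote_size() {
        tr -d '\r' | awk 'tolower($1) == "content-length:" { size = $2 }
                          END { if (size != "") print size }'
    }

    # size_differs FILE SIZE (hypothetical helper): succeed (exit status 0)
    # when FILE is missing or its size in bytes is not equal to SIZE.
    # An empty SIZE (no Content-Length received) always counts as different.
    size_differs() {
        if [ ! -f "$1" ]; then
            return 0
        fi
        local_size=$(wc -c < "$1" | tr -d ' ')
        [ "$local_size" != "$2" ]
    }

    # Example use (needs network access; URL and file name are placeholders):
    #   url=https://example.com/file.zip
    #   size=$(wget --spider -S "$url" 2>&1 | remote_size)
    #   if size_differs file.zip "$size"; then wget -O file.zip "$url"; fi
    ```

    The wget manual also documents a --no-if-modified-since option for -N mode, described as sending a preliminary HEAD request instead of the If-Modified-Since header. If your build has it, it may be worth trying wget -N --no-if-modified-since before scripting around the problem; check the result with -d as above.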
    

    Web browsers do not suffer from this caching issue because they store the ETag header from the initial response, an identifier for one specific version of the file. Apache and nginx generate the ETag automatically when serving static files, based on the last modification time and the file size.
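
    The ETag comparison a browser relies on can be imitated with a small sidecar file. This is a rough sketch under stated assumptions, not something wget does itself: the helper names etag_of and etag_changed and the file name file.zip.etag are invented for illustration, and it assumes the server sends a stable ETag in response to a HEAD request.

    ```shell
    #!/bin/sh
    # etag_of (hypothetical helper): read HTTP response headers on stdin
    # and print the last ETag value seen, quotes included.
    etag_of() {
        tr -d '\r' | awk 'tolower($1) == "etag:" { tag = $2 }
                          END { if (tag != "") print tag }'
    }

    # etag_changed SAVED_FILE NEW_ETAG (hypothetical helper): succeed
    # (exit status 0) when no ETag has been saved yet or the saved one
    # differs from NEW_ETAG.
    etag_changed() {
        if [ ! -f "$1" ]; then
            return 0
        fi
        [ "$(cat "$1")" != "$2" ]
    }

    # Example use (needs network access; URL and file names are placeholders):
    #   url=https://example.com/file.zip
    #   tag=$(wget --spider -S "$url" 2>&1 | etag_of)
    #   if etag_changed file.zip.etag "$tag"; then
    #       wget -O file.zip "$url" && printf '%s\n' "$tag" > file.zip.etag
    #   fi
    ```

    The ETag is saved only after the download succeeds, so an interrupted transfer is retried on the next run instead of being treated as up to date.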