tcp protocols integrity

Is the integrity of file uploads/downloads guaranteed by TCP/HTTPS?


Assume I want to upload a file to a web server. Maybe even a rather big file (e.g. 30 MB). It's done with a typical file upload form (see minimal example below).

Now networks are not perfect. I can see these types of errors being possible:

  1. Bit flips can happen
  2. Packets can get lost
  3. The order in which packets arrive might not be the order in which they were sent
  4. A packet could be received twice

The Wikipedia article on TCP says:

At the lower levels of the protocol stack, due to network congestion, traffic load balancing, or unpredictable network behaviour, IP packets may be lost, duplicated, or delivered out of order. TCP detects these problems, requests re-transmission of lost data, rearranges out-of-order data and even helps minimize network congestion to reduce the occurrence of the other problems. If the data still remains undelivered, the source is notified of this failure. Once the TCP receiver has reassembled the sequence of octets originally transmitted, it passes them to the receiving application. Thus, TCP abstracts the application's communication from the underlying networking details.

Reading that, the only reason I can see why a downloaded file might be broken is (1) something went wrong after it was downloaded or (2) the connection was interrupted.

Am I missing something? Why do sites that offer Linux images often also provide an MD5 hash? Is the integrity of a file upload/download over HTTPS (and thus also over TCP) guaranteed or not?

Minimal File Upload Example

HTML:

<!DOCTYPE html>
<html>
<head><title>Upload a file</title></head>
<body>
<form method="post" enctype="multipart/form-data">
    <input name="file" type="file" />
    <input type="submit"/>
</form>
</body>
</html>

Python/Flask:

"""
Prerequisites:

  $ pip install flask
  $ mkdir uploads
"""

import os
from flask import Flask, flash, request, redirect, url_for
from werkzeug.utils import secure_filename


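app = Flask(__name__)
app.config["UPLOAD_FOLDER"] = "uploads"
# flash() stores its messages in the session, which requires a secret key.
# Placeholder value; replace it with a random secret in real use.
app.secret_key = "change-me"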


@app.route("/", methods=["GET", "POST"])
def upload_file():
    if request.method == "POST":
        # check if the post request has the file part
        if "file" not in request.files:
            flash("No file part")
            return redirect(request.url)
        file = request.files["file"]
        # if the user does not select a file, the browser
        # submits an empty part without a filename
        if file.filename == "":
            flash("No selected file")
            return redirect(request.url)
        filename = secure_filename(file.filename)
        file.save(os.path.join(app.config["UPLOAD_FOLDER"], filename))
        return redirect(url_for("upload_file", filename=filename))
    else:
        return """<!DOCTYPE html>
<html>
<head><title>Upload a file</title></head>
<body>
<form method="post" enctype="multipart/form-data">
    <input name="file" type="file" />
    <input type="submit"/>
</form>
</body>
</html>"""


if __name__ == "__main__":
    app.run()


Solution

  • Is the integrity of file uploads/downloads guaranteed by TCP/HTTPS?

    In short: no, although HTTPS does much better than plain TCP.

    TCP has only weak error detection (a 16-bit checksum), so it will likely detect simple bit flips and discard the corrupted segment, which is then retransmitted, but it will not detect more complex errors. HTTPS, through the TLS layer, has solid integrity protection, so undetected data corruption in transit is essentially impossible.
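
    To make "weak" concrete, here is a minimal sketch of the 16-bit ones' complement sum that the TCP checksum is built on (in the style of RFC 1071). It is a simplification: real TCP also covers the header and a pseudo-header, and the sample byte strings are made up for illustration.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Hypothetical payloads, chosen only to illustrate the point.
original = b"\x12\x34\x56\x78\x9a\xbc"
bit_flip = b"\x12\x35\x56\x78\x9a\xbc"    # a single bit changed
swapped = b"\x56\x78\x12\x34\x9a\xbc"     # two 16-bit words exchanged
offsetting = b"\x12\x35\x56\x77\x9a\xbc"  # +1 in one word, -1 in another

detects_bit_flip = internet_checksum(original) != internet_checksum(bit_flip)
misses_swap = internet_checksum(original) == internet_checksum(swapped)
misses_offsetting = internet_checksum(original) == internet_checksum(offsetting)
print(detects_bit_flip, misses_swap, misses_offsetting)
```

    The single bit flip changes the checksum, but the swapped words and the two offsetting changes produce the same checksum as the original, because addition does not care about the order of the words or about which word an amount came from.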

    TCP also has robust detection and prevention of duplicates and reordering. TLS (in HTTPS) detects this kind of data corruption even more robustly.

    But it gets murky when the TCP connection simply closes early, for example if a server crashes. TCP itself has no notion of a message boundary, so closing the connection is often used as the end-of-message indicator. This is true for FTP data connections, for example, but it can also be true for HTTP (and thus HTTPS). While HTTP usually has a length indicator (a Content-Length header, or explicit chunk sizes with Transfer-Encoding: chunked), it also allows the end of the TCP connection to mark the end of a message. Clients vary in how they behave if the connection ends before the declared end of the message: some will treat the data as corrupted, others will assume a broken server (i.e. a wrong length declaration) and treat the connection close as a valid end of message.
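
    Since clients differ here, an application that cares can do the length comparison itself. The following is only a sketch: body_is_complete is a hypothetical helper, not part of any HTTP library, and it assumes you have the response headers as a mapping and the body bytes exactly as they were transferred (before any decompression).

```python
from typing import Mapping, Optional


def body_is_complete(headers: Mapping[str, str], body: bytes) -> Optional[bool]:
    """Compare the received body against the declared Content-Length.

    Returns True or False when a usable Content-Length is present, and
    None when there is nothing to compare against (for example a chunked
    or close-delimited response).
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    declared = lowered.get("content-length")
    if declared is None or not declared.strip().isdigit():
        return None
    return len(body) == int(declared.strip())


print(body_is_complete({"Content-Length": "5"}, b"hello"))  # full body
print(body_is_complete({"Content-Length": "5"}, b"hel"))    # cut off early
print(body_is_complete({}, b"hello"))                       # no declared length
```

    A None result is the murky case described above: without a declared length, an early close cannot be told apart from a regular end of message at this level.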

    In theory, TLS (in HTTPS) has an explicit end-of-connection message (the TLS shutdown, sent as a close_notify alert), which might help in detecting an early connection close. In practice, though, implementations might simply close the underlying socket without this explicit TLS shutdown, so one unfortunately cannot fully rely on it.

    Why do sites that offer Linux images often also provide an MD5 hash?

    There is another point of failure: the file might already have been corrupted before it is downloaded. Download sites often have several mirrors, and the corruption might happen when the file is sent to a download mirror, or even when it is sent to the download master. Publishing a strong checksum alongside the download helps to detect such errors, as long as the checksum was created at the origin of the download and thus before the data corruption.
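
    Checking a download against a published checksum could look roughly like the sketch below. It uses SHA-256 rather than the MD5 mentioned in the question (hashlib.md5 would work the same way); the helper names and the throwaway demo file are assumptions for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path: str, published_hex: str) -> bool:
    """Compare the file's hash with the hex string published by the site."""
    return sha256_of_file(path) == published_hex.strip().lower()


# Demo with a throwaway file standing in for a downloaded image.
content = b"pretend this is an ISO image"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
    demo_path = tmp.name

published = hashlib.sha256(content).hexdigest()  # what the origin would publish
ok = matches_published_checksum(demo_path, published)
mismatch = matches_published_checksum(demo_path, "0" * 64)
os.remove(demo_path)
print(ok, mismatch)
```

    This only helps if the published value is obtained from the origin (ideally over HTTPS), not from the same mirror that may hold the corrupted file.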