Tags: c, sockets, gzip, httpserver, chunked

How to fix weird distortion in an image served by a C socket HTTP server with gzip compression and chunked transfer encoding


I am currently writing a simple C socket HTTP server that supports gzip compression and chunked transfer encoding.

The snippet that gzips the file and writes it to the socket in chunks is as follows:

    // MAXLINE is the buffer size for out and in, which MAXLINE = 1000
    fd = open(filePath, O_RDONLY, 0);

    s.zalloc = s.zfree = s.opaque = NULL;
    deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 | 16, 8, Z_DEFAULT_STRATEGY);
    while ((s.avail_in = read(fd, in, MAXLINE)) > 0) {
      s.avail_out = MAXLINE;
      s.next_out = out;
      s.next_in = in;
      deflate(&s, Z_SYNC_FLUSH);
      sprintf(header, "%X\r\n", MAXLINE - s.avail_out);
      write(new_socket, header, strlen(header));
      write(new_socket, out, MAXLINE - s.avail_out);
      write(new_socket, "\r\n", 2);
    }

The above code works fine when the requested file is a pdf, html, or pptx file. The browser downloads those without any problems or corruption.

However, when I request an image, the displayed/downloaded image is distorted as follows:

Original image: [image]

Downloaded image: [image]

I suspect that there is a problem in the code that writes to the socket with gzip and chunked transfer, but I can't figure out what it is.

Any idea why this happens? Why does it cause problems for images but not for other file types such as pdfs? And how can I fix it? Thank you.

Update:

I tested this with a large text file, as suggested by user253751 in the comments, and the downloaded text file has identical content.

So no distortion happens when sending a text file with gzip and chunking.

Also, before I added the gzip compression (i.e. with chunking only), the image was not distorted at all.

So the gzip compression is most probably what causes this problem, but I'm not sure why or how to fix it.

By comparing both original image and downloaded image using a hex editor, I found out that:

  1. A lot of bytes are missing at the end, as shown in the screenshot below (left is downloaded, right is original):

[screenshot: hex comparison of the downloaded and original files]

  2. Some rows are identical, while others are not.

For example, the line at offset 0551980 (first line, 01 44 87 ... DA E0 B4) is identical in both files, but the next line at offset 0552000 (7C 92 77 ... 34 2E 4B; 0C C5 8F ... 1F CD 08) is different.

I'm not sure how to interpret the result of this comparison, since this is my first time using a hex editor and the comparison highlighting confused me.

wxHexEditor did not highlight the difference above, yet in another line, at offset 0552380, it highlighted only a C7 byte that is the same in both files. So does the editor highlight bytes that are the same? But then why didn't it highlight the first line?

[screenshot: wxHexEditor comparison highlighting]
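
One way to cross-check what the hex editor shows is to compute the first differing offset programmatically. The following is only a minimal sketch: `first_difference` is a hypothetical helper written for illustration, not part of wxHexEditor or any other tool mentioned here, and it assumes both files are already open for reading.

```c
#include <stdio.h>

/* Hypothetical helper: compare two open files byte by byte.
 * Returns the offset of the first byte that differs, or -1 if both
 * files have the same length and content. If one file is a prefix of
 * the other, the length of the shorter file is returned. */
long first_difference(FILE *a, FILE *b) {
    long offset = 0;
    for (;;) {
        int ca = fgetc(a);
        int cb = fgetc(b);
        if (ca == EOF && cb == EOF)
            return -1;          /* both ended together: identical */
        if (ca != cb)
            return offset;      /* mismatch, or one file ended early */
        offset++;
    }
}
```

Calling it on the original and the downloaded image (both opened with `fopen(path, "rb")`) gives the offset where the two files first diverge, which can then be looked up in the hex editor.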

Moreover, experimenting with different settings shows that changing the buffer size changes the width of the distortion, as shown below with MAXLINE = 2000:

[image: downloaded image with MAXLINE = 2000]

With MAXLINE = 7000, the distortion disappears, but there is a white line at the bottom:

[image: downloaded image with MAXLINE = 7000]

So the issue is probably in the read buffer loop, which seems to cause some bytes to be swapped or omitted.

Solution:

Thank you, user253751, for figuring out the issue. It turns out that:

What if deflate doesn't read all the input bytes (s.avail_in > 0)? The code just ignores the bytes deflate didn't read and overwrites them with the next bytes from the file, so those bytes never get compressed and sent!

To fix this, deflate() needs a loop of its own. If s.avail_out == 0 after deflate() returns, compression used up all the space in the out buffer, and deflate() must be called again to handle the bytes it has not yet read or compressed.

Alternatively, the inner loop can run while s.avail_in != 0.

The working code is as follows:

    // MAXLINE is the size of both the in and out buffers (MAXLINE = 1000)
    fd = open(filePath, O_RDONLY, 0);

    s.zalloc = s.zfree = s.opaque = NULL;
    // windowBits 15 | 16 selects a gzip wrapper instead of a zlib one
    deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 | 16, 8, Z_DEFAULT_STRATEGY);
    while ((s.avail_in = read(fd, in, MAXLINE)) > 0) {
      s.next_in = in;
      // Keep calling deflate() until it has consumed everything in the in buffer
      do {
        s.avail_out = MAXLINE;
        s.next_out = out;
        deflate(&s, Z_SYNC_FLUSH);
        // Send whatever deflate() produced as one HTTP chunk
        sprintf(header, "%X\r\n", MAXLINE - s.avail_out);
        write(new_socket, header, strlen(header));
        write(new_socket, out, MAXLINE - s.avail_out);
        write(new_socket, "\r\n", 2);
      //} while (s.avail_out == 0);
      } while (s.avail_in != 0);
    }


Solution

  • deflate reads some uncompressed bytes from the in buffer and writes some compressed bytes to the out buffer. Your code sends all the compressed bytes in the out buffer down the socket. But your code is not careful with the uncompressed bytes!

    If deflate fills up the output buffer first, then there are still input bytes left over when it returns. Your code ignores those leftover input bytes and instead of trying to compress them again, it overwrites them with the next bytes from the file.

    The reason you see this with JPEG files but not with text files is that JPEG files are already compressed, so they can't be compressed any further. That means the gzipped JPEG output is bigger than the original JPEG, so the output buffer fills up before the input buffer is empty. The text file compresses well, so there is plenty of room in the output buffer.