c compression zlib

zlib uncompress() data error: unable to inflate a PDF deflate stream


Please guide me on what I am doing wrong. I have not been able to come up with a working solution.

I am trying to inflate a deflate-compressed stream from a PDF file with zlib. This is the code I wrote:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(){

  char* filename = "zlib.3.pdf";

  FILE* filePtr = fopen(filename,"rb");

  if(!filePtr){
    printf("Unable to read file %s\n",filename);
    exit(1);
  }

  // file size 

  int seek_end = fseek(filePtr,0,SEEK_END);
  long fileSize = ftell(filePtr);
  int seek_reset = fseek(filePtr,0,SEEK_SET);

  char* fileBuffer = (char*) malloc(fileSize * sizeof(char));

  for(long i=0; i<fileSize; i++){
    fread(fileBuffer+i,sizeof(char),1,filePtr);
  }

  //starting and ending point

  long start_index, end_index;

  for(unsigned k = 0; k<fileSize; k++){
    if(strncmp("stream",fileBuffer+k,6) == 0){
      start_index = k+6;
      printf("startindex %ld\n",start_index);
      break;
    }
  }
  
  for(unsigned j=start_index; j<fileSize; j++){
    if(strncmp("endstream",fileBuffer+j,9) == 0){
      end_index = j;
      printf("endindex %ld\n",end_index);
      break;
    }
  }

  printf("Printing compressed stream\n");

  for(unsigned k=start_index; k<end_index; k++){
    printf("%c",*(fileBuffer+k));
  }
  printf("\nPrinting finished\n");

  Bytef *source = (Bytef*)(fileBuffer+start_index);

  uLong sourceLen = (uLong)(end_index - start_index);
  uLongf destLen = sourceLen * 8;
  Bytef *dest = calloc(sizeof(Bytef), destLen);

  int uncompressResult = uncompress(dest, &destLen, source, sourceLen);

  if(uncompressResult != Z_OK){
    printf("Failed to uncompress %d\n",uncompressResult);
  }

  char* outPut = (char*)dest;

  printf("Output %s %d\n",outPut,(int)destLen);

  return 0;
}

The input file (summarized):

%PDF-1.7
%µµµµ
...
4 0 obj
<</Filter/FlateDecode/Length 3012>>
stream
// Stream Data
endstream
endobj
%%EOF

I am getting a deflate error: uncompress() returns -3 (Z_DATA_ERROR, "data error").


Solution

  • The "stream" is followed by one or two end-of-line characters, either \r\n or just \n, according to the PDF specification. The first byte of the compressed data is right after the \n. Either way, all you need to do is search for the \n and start after that.

    If you add this after setting start_index:

    while (fileBuffer[start_index++] != '\n')
        ;
    

    then the decompression succeeds.

    However there are still other issues with your code. Searching for "endstream" to figure out where the stream ends will not work in general. There is nothing keeping "endstream" from appearing in the compressed data. What you have to do, and what you are expected to do, is to decode the dictionary that precedes "stream". You will note that it says:

    <</Filter/FlateDecode/Length 3012>>
    

    The filter "FlateDecode" tells you that, in fact, zlib decompression is what you want to do with the data (it could have been a different filter or filters), and the "Length" entry tells you the length of the binary data that follows the \n, i.e. 3012 bytes. That number is how you determine where the compressed data ends. Not "endstream". Feel free to check that there is indeed an end-of-line indication, then "endstream", then another end-of-line after the compressed data. But you already knew before then where it ended.
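
    As a rough illustration of reading that number, here is a minimal sketch that pulls a direct integer /Length out of the dictionary bytes. The name parse_direct_length and its helpers are made up for this example. It assumes the value is written inline as in your file; PDF also allows /Length to be an indirect reference such as "5 0 R", which this sketch only detects and rejects.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* PDF white-space characters: NUL, HT, LF, FF, CR, SP. */
static int is_pdf_space(unsigned char c)
{
    return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

static int is_digit(unsigned char c)
{
    return c >= '0' && c <= '9';
}

/* Hypothetical helper: look for "/Length <integer>" in dict[0..len).
   Returns the integer, or -1 if there is no direct integer value
   (key missing, value is an indirect reference, or overflow). */
long parse_direct_length(const unsigned char *dict, size_t len)
{
    static const char key[] = "/Length";
    const size_t klen = sizeof key - 1;

    for (size_t i = 0; i + klen <= len; i++) {
        if (memcmp(dict + i, key, klen) != 0)
            continue;
        size_t p = i + klen;
        size_t ws = p;
        while (p < len && is_pdf_space(dict[p]))
            p++;
        /* Require white space after the key so that /Length1 is skipped. */
        if (p == ws || p >= len || !is_digit(dict[p]))
            continue;
        long value = 0;
        while (p < len && is_digit(dict[p])) {
            if (value > (LONG_MAX - 9) / 10)
                return -1;              /* would overflow a long */
            value = value * 10 + (dict[p] - '0');
            p++;
        }
        while (p < len && is_pdf_space(dict[p]))
            p++;
        if (p < len && is_digit(dict[p]))
            return -1;                  /* looks like "N G R", an indirect reference */
        return value;
    }
    return -1;
}
```

    The compressed data then runs for that many bytes from the first byte after the end-of-line that follows "stream". Check that the range fits inside the file before using it.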

    Note that above I said "according to the PDF specification". You actually need to be a little more liberal to accommodate PDF files that don't exactly meet the specification. They're out there. First, there can be spaces after the "stream" and before the end-of-line. The while loop above takes care of that. Second, the end-of-line can also be a single \r. The while loop above does not handle that. You should modify the logic to look for any of \n, \r\n, or \r followed by a byte that isn't \n. Then the compressed data stream is after one of those.
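
    The more liberal end-of-line handling could look something like the sketch below. skip_stream_eol is a hypothetical helper, not part of zlib or any PDF library. It takes the offset just past the "stream" keyword and returns the offset of the first data byte. Unlike the bare while loop, it also stops at the end of the buffer.

```c
#include <stddef.h>

/* Hypothetical helper: pos is the offset just past "stream" in buf[0..len).
   Returns the offset of the first byte of stream data, or -1 if no
   end-of-line marker follows the keyword. */
long skip_stream_eol(const unsigned char *buf, long len, long pos)
{
    /* Tolerate stray spaces or tabs between "stream" and the end-of-line. */
    while (pos < len && (buf[pos] == ' ' || buf[pos] == '\t'))
        pos++;
    if (pos >= len)
        return -1;
    if (buf[pos] == '\n')
        return pos + 1;                 /* plain \n */
    if (buf[pos] == '\r') {
        if (pos + 1 < len && buf[pos + 1] == '\n')
            return pos + 2;             /* \r\n */
        return pos + 1;                 /* lone \r, out of spec but seen in the wild */
    }
    return -1;                          /* something else follows "stream" */
}
```

    In your program that would replace the while loop, e.g. start_index = skip_stream_eol((const unsigned char *)fileBuffer, fileSize, start_index), followed by a check for -1.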

    sourceLen * 8 won't always cover the size of the uncompressed data. You could tell that it wasn't enough by uncompress() returning Z_BUF_ERROR, and then trying it all over again after allocating double the space, repeating until it succeeds. However I would instead recommend using the inflate*() functions, which will allow you to process any amount of data a chunk at a time. You will also then be able to check that the deflate stream ended where the PDF dictionary claimed it would end. You can't tell that when using the uncompress() function, since it doesn't tell you where it stopped decompressing.

    Other nits:

    If you have the length of the input file, and you've allocated space for the whole thing, then just read it all in with a single fread() call.

    Don't print binary data. That's just an unreadable mess. If you really want to look at the compressed data for some reason, then convert it to hexadecimal for display.
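
    A hex dump takes only a few lines. This is just one possible layout (16 bytes per line, two lowercase hex digits per byte), and hex_dump is a hypothetical helper name:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write buf[0..len) to fp as hex, 16 bytes per line. */
void hex_dump(FILE *fp, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        /* End the line after every 16th byte and after the last byte. */
        char sep = (i % 16 == 15 || i + 1 == len) ? '\n' : ' ';
        fprintf(fp, "%02x%c", buf[i], sep);
    }
}
```

    For example, hex_dump(stdout, source, sourceLen) in place of the printf("%c") loop.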

    You need to check for the case where "stream" never occurs in the input.

    You need to check return codes from the f functions. Always check return codes. fseek(), ftell(), fread() could all fail and you'd never know.

    From kindergarten, put away things when you're done playing with them. Close the file that you opened.
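
    Putting the file-handling nits together, a sketch along these lines reads the whole file with one fread(), checks every return code, and closes the file on every path. read_file is a hypothetical helper; it assumes a regular, seekable file small enough to hold in memory.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read all of path into a malloc'd buffer.
   Returns the buffer and sets *size, or returns NULL on any failure. */
unsigned char *read_file(const char *path, size_t *size)
{
    FILE *fp = fopen(path, "rb");
    unsigned char *buf = NULL;
    long len = -1;

    if (fp == NULL)
        return NULL;
    if (fseek(fp, 0, SEEK_END) == 0)
        len = ftell(fp);                /* -1 if ftell() fails */
    if (len >= 0 && fseek(fp, 0, SEEK_SET) == 0)
        buf = malloc(len > 0 ? (size_t)len : 1);
    /* One fread() for the whole file; a short read counts as failure. */
    if (buf != NULL && fread(buf, 1, (size_t)len, fp) != (size_t)len) {
        free(buf);
        buf = NULL;
    }
    fclose(fp);                         /* closed on success and on failure */
    if (buf != NULL)
        *size = (size_t)len;
    return buf;
}
```

    The caller then only has one thing to check (a NULL return) and one thing to release (free() on the buffer).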

    Use memcmp() instead of strncmp(). Your search is not for zero-terminated strings. Just matching bytes.
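
    A byte search with memcmp() might be sketched as follows. find_bytes is a hypothetical helper; it returns -1 when the pattern is absent, which also gives you the "never occurs" check, and its loop bound keeps the comparison from reading past the end of the buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: offset of the first occurrence of pat[0..patLen)
   in buf[0..len) at or after offset from, or -1 if there is none. */
long find_bytes(const unsigned char *buf, size_t len, size_t from,
                const char *pat, size_t patLen)
{
    if (patLen == 0 || patLen > len)
        return -1;
    /* i + patLen <= len keeps every comparison inside the buffer. */
    for (size_t i = from; i + patLen <= len; i++)
        if (memcmp(buf + i, pat, patLen) == 0)
            return (long)i;
    return -1;
}
```

    For example, long k = find_bytes(buf, size, 0, "stream", 6); if k is -1, report that the file has no stream and stop before touching start_index.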

    Use malloc() instead of calloc(). There's no point in zeroing out memory you're about to completely write over.